<feed xmlns='http://www.w3.org/2005/Atom'>
<title>glusterfs.git/tests/bugs/replicate/bug-1539358-split-brain-detection.t, branch v3.12.14</title>
<subtitle></subtitle>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/'/>
<entry>
<title>afr: don't treat all cases all bricks being blamed as split-brain</title>
<updated>2018-07-04T04:03:45+00:00</updated>
<author>
<name>Ravishankar N</name>
<email>ravishankar@redhat.com</email>
</author>
<published>2018-01-28T08:20:47+00:00</published>
<link rel='alternate' type='text/html' href='http://dev.gluster.org/cgit/glusterfs.git/commit/?id=b715dfb39460d053db726c4150c5f209a1503bff'/>
<id>b715dfb39460d053db726c4150c5f209a1503bff</id>
<content type='text'>
Problem:
We currently don't have a roll-back/undoing of post-ops if quorum is not
met. Though the FOP is still unwound with failure, the xattrs remain on
the disk.  Due to these partial post-ops and partial heals (healing only when
2 bricks are up), we can end up in split-brain purely from the afr
xattrs point of view i.e each brick is blamed by atleast one of the
others. These scenarios are hit when there is frequent
connect/disconnect of the client/shd to the bricks while I/O or heal
are in progress.

Fix:
Instead of undoing the post-op, pick a source based on the xattr values.
If 2 bricks blame one, the blamed one must be treated as sink.
If there is no majority, all are sources. Once we pick a source,
self-heal will then do the heal instead of erroring out due to
split-brain.

Change-Id: I3d0224b883eb0945785ade0e9697a1c828aec0ae
BUG: 1597123
Signed-off-by: Ravishankar N &lt;ravishankar@redhat.com&gt;
(cherry picked from commit 0e6e8216823c2d9dafb81aae0f6ee3497c23d140)
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Problem:
We currently don't have a roll-back/undoing of post-ops if quorum is not
met. Though the FOP is still unwound with failure, the xattrs remain on
the disk.  Due to these partial post-ops and partial heals (healing only when
2 bricks are up), we can end up in split-brain purely from the afr
xattrs point of view i.e each brick is blamed by atleast one of the
others. These scenarios are hit when there is frequent
connect/disconnect of the client/shd to the bricks while I/O or heal
are in progress.

Fix:
Instead of undoing the post-op, pick a source based on the xattr values.
If 2 bricks blame one, the blamed one must be treated as sink.
If there is no majority, all are sources. Once we pick a source,
self-heal will then do the heal instead of erroring out due to
split-brain.

Change-Id: I3d0224b883eb0945785ade0e9697a1c828aec0ae
BUG: 1597123
Signed-off-by: Ravishankar N &lt;ravishankar@redhat.com&gt;
(cherry picked from commit 0e6e8216823c2d9dafb81aae0f6ee3497c23d140)
</pre>
</div>
</content>
</entry>
</feed>
