Diffstat (limited to 'Feature Planning/GlusterFS 3.5/Brick Failure Detection.md')
-rw-r--r-- Feature Planning/GlusterFS 3.5/Brick Failure Detection.md | 151
1 files changed, 0 insertions, 151 deletions
diff --git a/Feature Planning/GlusterFS 3.5/Brick Failure Detection.md b/Feature Planning/GlusterFS 3.5/Brick Failure Detection.md
deleted file mode 100644
index 9952698..0000000
--- a/Feature Planning/GlusterFS 3.5/Brick Failure Detection.md
+++ /dev/null
@@ -1,151 +0,0 @@
-Feature
--------
-
-Brick Failure Detection
-
-Summary
--------
-
-This feature attempts to identify storage/file system failures and
-disable the failed brick without disrupting the remainder of the node's
-operation.
-
-Owners
-------
-
-Vijay Bellur with help from Niels de Vos (or the other way around)
-
-Current status
---------------
-
-Currently, if the underlying storage or file system fails, a brick
-process continues to run. In some cases, a brick can hang due to
-failures in the underlying system, and such hangs in brick processes
-cause applications running on glusterfs clients to hang as well.
-
-Detailed Description
---------------------
-
-Detecting failures on the filesystem that a brick uses makes it
-possible to handle errors that are caused outside of the Gluster
-environment.
-
-Brick processes have been known to hang when the underlying storage of
-a brick became unavailable. A hanging brick process can still use the
-network and respond to clients, but actual I/O to the storage is
-impossible and can cause noticeable delays on the client side.
-
-Benefit to GlusterFS
---------------------
-
-Provide better detection of storage subsystem failures and prevent
-bricks from hanging.
-
-Scope
------
-
-### Nature of proposed change
-
-Add a health-checker to the posix xlator that periodically checks the
-status of the filesystem (implies checking of functional
-storage-hardware).
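
The actual health-checker is implemented in C inside the posix xlator; purely as a hedged illustration of the idea, the periodic check can be sketched in shell (the `brick_health_ok` helper and the `BRICK_DIR` path are hypothetical, not part of GlusterFS):

```shell
#!/bin/sh
# Illustrative sketch only: the real health-checker is a thread inside
# the posix xlator. The idea: periodically perform a trivial I/O
# operation on the brick and treat failure as a sign of dead storage.
BRICK_DIR=${BRICK_DIR:-/tmp}      # hypothetical brick directory

brick_health_ok() {
    # stat() the brick directory; on a failed filesystem this errors out
    stat "$1" >/dev/null 2>&1
}

if brick_health_ok "$BRICK_DIR"; then
    echo "health-check: OK"
else
    echo "health-check: FAILED, brick process would exit"
fi
```

In the real implementation the check repeats every `storage.health-check-interval` seconds and a failure makes the brick process exit rather than keep accepting requests it cannot serve.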
-
-### Implications on manageability
-
-When a brick process detects that the underlying storage is no longer
-responding, the process will exit. The brick process is not restarted
-automatically; the sysadmin will need to fix the problem with the
-storage first.
-
-After correcting the storage (hardware or filesystem) issue, the
-following command will start the brick process again:
-
- # gluster volume start <VOLNAME> force
-
-### Implications on presentation layer
-
-None
-
-### Implications on persistence layer
-
-None
-
-### Implications on 'GlusterFS' backend
-
-None
-
-### Modification to GlusterFS metadata
-
-None
-
-### Implications on 'glusterd'
-
-'glusterd' can detect that the brick process has exited;
-`gluster volume status` will show that the brick process is no longer
-running. System administrators checking the logs should be able to
-triage the cause.
-
-How To Test
------------
-
-The health-checker thread that is part of each brick process is
-started automatically when a volume is started. Its functionality can
-be verified in different ways.
-
-On virtual hardware:
-
-- disconnect the disk from the VM that holds the brick
-
-On real hardware:
-
-- simulate a RAID-card failure by unplugging the card or cables
-
-On a system that uses LVM for the bricks:
-
-- use device-mapper to load an error-table for the disk, see [this
- description](http://review.gluster.org/5176).
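
The device-mapper approach can be sketched as follows; the device name `vg0-brick1` and the sector count are hypothetical, and the actual `dmsetup` calls are left commented out because they must run as root against a disposable test device:

```shell
#!/bin/sh
# Build a device-mapper "error" table covering the whole device so that
# every I/O to it fails; this simulates dead storage under a live brick.
DEV_NAME=vg0-brick1   # hypothetical dm device backing the brick
SECTORS=204800        # device size in 512-byte sectors; in reality:
                      #   blockdev --getsz /dev/mapper/$DEV_NAME
TABLE="0 $SECTORS error"
echo "$TABLE"
# Replace the live table and activate it (root only, cuts off all I/O):
# dmsetup load "$DEV_NAME" --table "$TABLE"
# dmsetup resume "$DEV_NAME"
# To restore access, load the original linear table again and resume.
```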
-
-On any system (by writing to random offsets of the block device; this
-is more difficult to trigger):
-
-1. cause corruption on the filesystem that holds the brick
-2. read contents from the brick, hoping to hit the corrupted area
-3. the filesystem should abort after hitting a bad spot, and the
-   health-checker should notice this shortly afterwards
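
A hedged sketch of step 1 using `dd` (the device path is hypothetical; the destructive command is left commented out and should only ever be run against a scratch test device):

```shell
#!/bin/sh
# DESTRUCTIVE if uncommented: overwrites one sector of $DEV with random
# bytes to corrupt the on-disk filesystem. Scratch/test devices only.
DEV=${DEV:-/dev/vg_test/scratch}    # hypothetical brick block device
SEEK=$(( $(date +%s) % 100000 ))    # pseudo-random sector offset
CMD="dd if=/dev/urandom of=$DEV bs=512 count=1 seek=$SEEK oflag=direct"
echo "$CMD"
# eval "$CMD"   # uncomment on a disposable test device, run as root
```

After corrupting the device, reading back file contents from the brick (step 2) will eventually hit the damaged area and trigger the abort described in step 3.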
-
-User Experience
----------------
-
-No more hanging brick processes when storage-hardware or the filesystem
-fails.
-
-Dependencies
-------------
-
-The posix translator; the health-checker is not available for the
-BD-xlator.
-
-Documentation
--------------
-
-The health-checker is enabled by default and runs a check every 30
-seconds. This interval can be changed per volume with:
-
- # gluster volume set <VOLNAME> storage.health-check-interval <SECONDS>
-
-If `SECONDS` is set to 0, the health-checker will be disabled.
-
-For further details, refer to:
-<https://forge.gluster.org/glusterfs-core/glusterfs/blobs/release-3.5/doc/features/brick-failure-detection.md>
-
-Status
-------
-
-glusterfs-3.4 and newer include a health-checker for the posix xlator,
-which was introduced with [bug
-971774](https://bugzilla.redhat.com/971774):
-
-- [posix: add a simple
-  health-checker](http://review.gluster.org/5176)
-
-Comments and Discussion
------------------------