diff --git a/Feature Planning/GlusterFS 3.7/Improve Rebalance Performance.md b/Feature Planning/GlusterFS 3.7/Improve Rebalance Performance.md
new file mode 100644
index 0000000..32a2eec
--- /dev/null
+++ b/Feature Planning/GlusterFS 3.7/Improve Rebalance Performance.md
@@ -0,0 +1,277 @@
+Feature
+-------
+
+Improve GlusterFS rebalance performance
+
+Summary
+-------
+
+Improve the current rebalance mechanism in GlusterFS by utilizing
+resources better, to speed up overall rebalance and, in particular, the
+brick removal and addition cases, where data needs to be spread across
+the cluster faster than the current mechanism allows.
+
+Owners
+------
+
+Raghavendra Gowdappa <rgowdapp@redhat.com>
+Shyamsundar Ranganathan <srangana@redhat.com>
+Susant Palai <spalai@redhat.com>
+Venkatesh Somyajulu <vsomyaju@redhat.com>
+
+Current status
+--------------
+
+This section is split into 2 parts, explaining how the current rebalance
+works and what its limitations are.
+
+### Current rebalance mechanism
+
+Currently, rebalance works as follows:
+
+A) Each node in the Gluster cluster kicks off a rebalance process to
+perform one of the following actions:
+
+- layout fixing
+- rebalance data, keeping space constraints in check
+    - Will rebalance data under file size and free disk space
+      constraints, moving only files that will not cause an imbalance
+      in the amount of data stored across bricks (see the sketch after
+      this list)
+- rebalance data, based on layout precision
+    - Will rebalance data so that the layout is adhered to, and hence
+      optimize lookups in the future (a file is found where the layout
+      claims it is)
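+
+To make the space-constraint check above concrete, here is a minimal,
+hypothetical sketch (plain Python, not actual DHT code) of the kind of
+decision involved before migrating a file; the helper logic, names and
+thresholds are assumptions for illustration only.
+
+```python
+import os
+import shutil
+
+def should_migrate(file_path, dst_brick_path):
+    """Hypothetical check: migrate only if the move does not leave the
+    destination brick worse off than the source."""
+    file_size = os.stat(file_path).st_size
+    src_free = shutil.disk_usage(os.path.dirname(file_path)).free
+    dst_free = shutil.disk_usage(dst_brick_path).free
+
+    # The destination must have room for the file at all.
+    if dst_free <= file_size:
+        return False
+
+    # Do not migrate if the move would merely shift the imbalance,
+    # i.e. leave the destination with less free space than the source
+    # would have after the move.
+    if (dst_free - file_size) < (src_free + file_size):
+        return False
+
+    return True
+```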
+
+B) Each node's process then uses the following algorithm to proceed (a
+sketch follows the list):
+
+- 1: Open root of the volume
+- 2: Fix the layout of root
+- 3: Start a crawl on the current directory
+- 4: For each file in the current directory,
+    - 4.1: Determine if the file belongs to the current node (this
+      optimizes on network reads of file data)
+    - 4.2: If it does, migrate the file based on the type of rebalance
+      sought
+    - 4.3: End the file crawl once the crawl returns no more entries
+- 5: For each directory in the current directory,
+    - 5.1: Fix the layout, and iterate by starting the crawl on this
+      directory (goto step 3)
+- 6: End the directory crawl once crawl returns no more entries
+- 7: Cleanup and exit
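+
+The crawl above can be summarized as a short structural sketch. This is
+plain Python over a local directory tree, with placeholder functions
+standing in for the actual DHT operations; it only illustrates the
+shape of the crawl.
+
+```python
+import os
+
+def fix_layout(path):
+    # Placeholder: the real step would set the DHT layout on `path`.
+    pass
+
+def is_local(path):
+    # Placeholder: the real step checks whether the file's data lives
+    # on a brick that belongs to this node.
+    return True
+
+def migrate(path):
+    # Placeholder: the real step triggers migration of the file.
+    pass
+
+def rebalance_crawl(volume_root):
+    """Structural sketch of the per-node crawl described above."""
+    fix_layout(volume_root)        # steps 1-2: open root, fix its layout
+    pending = [volume_root]        # directories still to be crawled
+    while pending:
+        directory = pending.pop()
+        for entry in os.scandir(directory):          # steps 3-4: crawl
+            if entry.is_dir(follow_symlinks=False):
+                fix_layout(entry.path)               # step 5.1: fix layout,
+                pending.append(entry.path)           # then crawl it (goto 3)
+            elif is_local(entry.path):               # step 4.1: local file?
+                migrate(entry.path)                  # step 4.2: migrate it
+    # steps 6-7: the crawl ends when no entries remain; cleanup and exit
+```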
+
+### Limitations and issues in the current mechanism
+
+The current mechanism spreads the work of rebalance across all nodes in
+the cluster, with each rebalance process considering only files that
+belong to the node on which it runs. This distributes the load of
+rebalance well and also optimizes network reads of data, since each
+process migrates only files local to its node.
+
+Where this becomes slow is in the following cases:
+
+1) It rebalances only one file at a time, as it uses the syncop
+infrastructure to start the rebalance of a file by issuing a setxattr
+with the special attribute "distribute.migrate-data", which in turn
+returns only after its synctask of migrating the file completes
+(synctask: rebalance\_task)
+
+- This under-utilizes several resources, like disk, CPU and network,
+  as we read and write only a single file at a time
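+
+Conceptually, the current behaviour is the loop below; a rough Python
+sketch of the pattern, where `trigger_migration` is a placeholder for
+the blocking setxattr-based trigger, not an actual API.
+
+```python
+def trigger_migration(path):
+    # Placeholder: the real code issues a setxattr on `path` with the
+    # special "distribute.migrate-data" attribute, and that call
+    # returns only after the file's migration synctask has finished.
+    pass
+
+def migrate_files_serially(files_on_this_node):
+    """Current pattern (sketch): one blocking migration at a time, so
+    disk, network and CPU are rarely busy at the same time."""
+    for path in files_on_this_node:
+        trigger_migration(path)   # blocks for the whole file
+```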
+
+2) Rebalance of data is serial between reads and writes, i.e. for a
+file, a chunk of data is read from disk and written to the network,
+and only after the response to that write arrives from the remote node
+does the next read proceed
+
+- This makes reads and writes dependent on each other, each waiting
+  for the other to complete, so the network is idle while reads from
+  disk are in progress, and vice-versa
+
+- This further makes serial use of resources like the disk and
+  network, reading or writing one block at a time
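+
+In pattern form, the data path looks like the following; a plain
+Python sketch with an assumed chunk size, not the actual migration
+code.
+
+```python
+CHUNK_SIZE = 128 * 1024  # assumed chunk size, for illustration only
+
+def copy_serially(src_file, dst_socket):
+    """Current pattern (sketch): the disk read and the network write
+    never overlap, so one of the two resources is always idle."""
+    while True:
+        chunk = src_file.read(CHUNK_SIZE)   # disk busy, network idle
+        if not chunk:
+            break
+        dst_socket.sendall(chunk)           # network busy, disk idle
+        # The real code additionally waits for the remote node's
+        # response to the write before issuing the next read.
+```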
+
+​3) Each rebalance process crawls the entire volume for files to
+migrate, and chooses only files that are local to it
+
+- This crawl can be expensive, and since a node only deals with files
+  that are local to it, a sizeable proportion of the crawled entries
+  (growing with the cluster size and number of nodes) is simply
+  dropped
+
+4) On a remove-brick, the current rebalance ends up rebalancing the
+entire cluster. If the interest is only in removing or replacing the
+brick(s), rebalancing the entire cluster can be costly.
+
+5) On addition of bricks, again the entire cluster is rebalanced. If
+the bricks were added because of space constraints, rebalancing the
+entire cluster is sub-optimal.
+
+6) In cases where AFR is below DHT, all the nodes backing an AFR
+subvolume participate in the rebalance, and end up rebalancing (or
+attempting to rebalance) the same set of files. This is racy, and
+could possibly be made better.
+
+Detailed Description
+--------------------
+
+The above limitations can be broken down into separate features, to
+improve rebalance performance and also to provide options in rebalance
+when specific use cases, like quicker brick removal, are sought. The
+following sections detail these improvements.
+
+### Rebalance multiple files in parallel
+
+Instead of rebalancing file by file, as the syncop-based setxattr
+trigger of a file's data migration forces us to, use the wind
+infrastructure to migrate multiple files at a time. This would use the
+disk and the network better and hence enable faster rebalance of data.
+It would also mean that when one file is blocked on a disk read,
+another parallel stream could be writing data to the network, so the
+starvation inherent in the read-write-read model between disk and
+network could also be alleviated to a point. A sketch of the idea
+follows.
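+
+A rough sketch of the idea, with Python threads standing in for the
+parallel synctasks/wound calls; the parallelism knob and the helper
+name are assumptions for illustration only.
+
+```python
+from concurrent.futures import ThreadPoolExecutor
+
+MAX_PARALLEL_MIGRATIONS = 8   # assumed tunable, not an existing option
+
+def trigger_migration(path):
+    # Placeholder for the per-file migration trigger described above.
+    pass
+
+def migrate_files_in_parallel(files_on_this_node):
+    """Sketch: keep several file migrations in flight at once, so that
+    disk, network and CPU can be busy simultaneously."""
+    with ThreadPoolExecutor(max_workers=MAX_PARALLEL_MIGRATIONS) as pool:
+        futures = [pool.submit(trigger_migration, path)
+                   for path in files_on_this_node]
+        for future in futures:
+            future.result()   # surface any migration failure
+```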
+
+### Split reads and writes of files into separate tasks when rebalancing the data
+
+This is to reduce the wait between a disk read and a network write, and
+to ensure both these resources can be kept busy. With more files being
+rebalanced in parallel, this improvement may not be needed, as the
+parallel streams would keep one or the other resource busy with better
+probability. Noting this enhancement down anyway, to see if it needs
+consideration once the parallelism of rebalance is increased as above.
+A pipelined sketch of the idea follows.
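+
+A minimal sketch of the split, assuming a bounded queue between a
+reader and a writer; the chunk size and queue depth are arbitrary
+illustrative values.
+
+```python
+import queue
+import threading
+
+CHUNK_SIZE = 128 * 1024   # assumed chunk size, for illustration only
+QUEUE_DEPTH = 4           # assumed number of chunks kept in flight
+
+def copy_pipelined(src_file, dst_socket):
+    """Sketch: a reader and a writer connected by a bounded queue, so
+    the disk read of chunk N+1 overlaps the network write of chunk N."""
+    chunks = queue.Queue(maxsize=QUEUE_DEPTH)
+
+    def reader():
+        while True:
+            chunk = src_file.read(CHUNK_SIZE)
+            chunks.put(chunk)            # an empty chunk marks EOF
+            if not chunk:
+                return
+
+    def writer():
+        while True:
+            chunk = chunks.get()
+            if not chunk:
+                return
+            dst_socket.sendall(chunk)
+
+    threads = [threading.Thread(target=reader),
+               threading.Thread(target=writer)]
+    for t in threads:
+        t.start()
+    for t in threads:
+        t.join()
+```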
+
+### Crawl only bricks that belong to the current node
+
+As explained, the current rebalance takes into account only those files
+that belong to the current node. As this is a DHT level operation, we
+can choose not to send opendir/readdir calls to subvolumes that do not
+belong to the current node. This would reduce the crawl work performed
+by rebalance, for files at least, and help in speeding up the entire
+process. A sketch of this follows the note below.
+
+NOTE: We would still need to measure the cost of this crawl vis-a-vis
+the overall rebalance process, to evaluate its benefits
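+
+A sketch of the intent follows; `readdir_on` and the subvolume lists
+are hypothetical stand-ins for the DHT-level calls and state, not
+existing interfaces.
+
+```python
+def readdir_on(subvol, directory):
+    # Placeholder for the per-subvolume opendir/readdir sequence.
+    return []
+
+def crawl_for_migration(directory, subvolumes, local_subvolumes):
+    """Sketch: issue opendir/readdir only on subvolumes whose bricks
+    live on this node, instead of on every subvolume of the volume."""
+    for subvol in subvolumes:
+        if subvol not in local_subvolumes:
+            continue                     # skip remote bricks entirely
+        for entry in readdir_on(subvol, directory):
+            yield entry                  # every entry returned is local
+```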
+
+### Rebalance on access
+
+When removing bricks, one of the intentions is to drain the bricks of
+all their data and hence enable removing them as soon as possible.
+
+When adding bricks, the requirement could be that the cluster is
+reaching its capacity and we hence want to increase it.
+
+In both these cases rebalancing the entire cluster could take time.
+Instead, an alternate approach is being proposed, where we essentially
+do 3 things (a sketch follows the list):
+
+- Kick off rebalance to fix layout, and drain a brick of its data,
+ or rebalance files onto a newly added brick
+- On further access of data, if the access leads to a double lookup
+  or a redirection based on the layout (due to the older bricks' data
+  not yet having been rebalanced), start a rebalance of this file in
+  tandem with the IO access (call this rebalance on access)
+- Start a slower, or later, rebalance of the cluster once the intended
+  use case is met, i.e. the brick is drained of its data, or space is
+  created in the other bricks and the newly added brick is filled with
+  relevant data. This is to get the cluster balanced again, without
+  requiring data to be accessed.
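+
+A sketch of the rebalance-on-access idea, using a background queue;
+the class, the hook name and the `migrate` helper are hypothetical,
+not existing GlusterFS interfaces.
+
+```python
+import queue
+import threading
+
+def migrate(path):
+    # Placeholder for the actual migration of a single file.
+    pass
+
+class RebalanceOnAccess:
+    """Sketch: when an access needs a redirected second lookup (the
+    file is not where the layout says it should be), queue that file
+    for migration in the background instead of waiting for a full
+    cluster-wide rebalance to reach it."""
+
+    def __init__(self):
+        self.pending = queue.Queue()
+        threading.Thread(target=self._worker, daemon=True).start()
+
+    def on_lookup(self, path, needed_redirect):
+        if needed_redirect:          # file not at its hashed location
+            self.pending.put(path)   # migrate later; the IO continues now
+
+    def _worker(self):
+        while True:
+            migrate(self.pending.get())
+```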
+
+### Make rebalance aware of IO path requirements
+
+One of the problems with the rebalance process consuming more of a
+node's resources is that we could starve the IO path. So, further to
+some of the above enhancements, take into account IO path resource
+utilization (i.e. disk/network/CPU) and slow down or speed up the
+rebalance process appropriately (say, by decreasing or increasing the
+number of files that are rebalanced in parallel). A throttling sketch
+follows the note below.
+
+NOTE: This requirement is being noted down just to ensure that we do
+not slow down IO access to the cluster as rebalance is made faster; the
+resources to monitor, and how to tune rebalance against them, may
+differ once tested and experimented upon
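+
+A toy sketch of such a throttle follows; the utilization probe, the
+thresholds and the limits are all assumptions for illustration.
+
+```python
+def io_path_busy_fraction():
+    # Placeholder: would sample disk/network/CPU utilization caused by
+    # client IO on this node (e.g. from /proc or brick-level counters).
+    return 0.0
+
+def adjust_parallelism(current, minimum=1, maximum=16,
+                       high_water=0.7, low_water=0.3):
+    """Sketch of the throttle: shrink the number of parallel file
+    migrations when client IO is busy, grow it when the node is idle.
+    All thresholds and limits are assumptions for illustration."""
+    busy = io_path_busy_fraction()
+    if busy > high_water and current > minimum:
+        return current - 1   # back off, the IO path needs the resources
+    if busy < low_water and current < maximum:
+        return current + 1   # node is fairly idle, rebalance can speed up
+    return current
+```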
+
+### Further considerations
+
+- We could consider some further layout optimization to reduce the
+ amount of data that is being rebalanced
+- Addition of scheduled rebalance, or the ability to stop and later
+  continue rebalance from a point, could be useful for preventing IO
+  path slowness, letting an admin choose to run rebalance during
+  non-critical hours (do these even exist today?)
+- There are no performance xlators in the rebalance graph. We should
+  experiment with loading them.
+
+Benefit to GlusterFS
+--------------------
+
+Gluster is a grow-as-you-need distributed file system. With this in
+the picture, rebalance is key to growing the cluster in a relatively
+sane amount of time. This enhancement attempts to speed up rebalance,
+in order to better serve this use case.
+
+Scope
+-----
+
+### Nature of proposed change
+
+This is intended as a modification to existing code only; there are no
+new xlators being introduced. BUT, as things evolve and we consider,
+say, layout optimization based on live data or some such notion, we
+would need to extend this section to capture the proposed changes.
+
+### Implications on manageability
+
+The gluster command would need some extensions as we introduce these
+changes, for example an option for the number of files to process in
+parallel. As this is currently in the prototype phase, this and the
+sections below are kept as TBDs.
+
+**Document TBD from here on...**
+
+### Implications on presentation layer
+
+*NFS/SAMBA/UFO/FUSE/libglusterfsclient Integration*
+
+### Implications on persistence layer
+
+*LVM, XFS, RHEL ...*
+
+### Implications on 'GlusterFS' backend
+
+*brick's data format, layout changes*
+
+### Modification to GlusterFS metadata
+
+*extended attributes used, internal hidden files to keep the metadata...*
+
+### Implications on 'glusterd'
+
+*persistent store, configuration changes, brick-op...*
+
+How To Test
+-----------
+
+*Description on Testing the feature*
+
+User Experience
+---------------
+
+*Changes in CLI, effect on User experience...*
+
+Dependencies
+------------
+
+*Dependencies, if any*
+
+Documentation
+-------------
+
+*Documentation for the feature*
+
+Status
+------
+
+Design/Prototype in progress
+
+Comments and Discussion
+-----------------------
+
+*Follow here*