Feature
-------

Improve GlusterFS rebalance performance

Summary
-------

Improve the current rebalance mechanism in GlusterFS by utilizing resources better, to speed up overall rebalance and also to speed up the brick removal and addition cases, where data needs to be spread faster than the current rebalance mechanism allows.

Owners
------

-   Raghavendra Gowdappa
-   Shyamsundar Ranganathan
-   Susant Palai
-   Venkatesh Somyajulu

Current status
--------------

This section is split into 2 parts, explaining how the current rebalance works and what its limitations are.

### Current rebalance mechanism

Currently rebalance works as follows,

A) Each node in the Gluster cluster kicks off a rebalance process for one of the following actions,

-   layout fixing
-   rebalance data, with space constraints in check
    -   Will rebalance data with file size and disk free availability constraints, and move files that will not cause a brick imbalance in terms of the amount of data stored across bricks
-   rebalance data, based on layout precision
    -   Will rebalance data so that the layout is adhered to, and hence optimize lookups in the future (find the file where the layout claims it is)

B) Each node's process then uses the following algorithm to proceed,

-   1: Open the root of the volume
-   2: Fix the layout of the root
-   3: Start a crawl on the current directory
-   4: For each file in the current directory,
    -   4.1: Determine whether the file belongs to the current node (this optimizes network reads of file data)
    -   4.2: If it does, migrate the file based on the type of rebalance sought
    -   4.3: End the file crawl once the crawl returns no more entries
-   5: For each directory in the current directory,
    -   5.1: Fix the layout, and iterate on starting the crawl for this directory (goto step 3)
-   6: End the directory crawl once the crawl returns no more entries
-   7: Clean up and exit
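For reference, the crawl above can be summarized in a minimal, self-contained C sketch. This is purely illustrative and not GlusterFS code: `fix_layout()`, `file_is_local()` and `migrate_file()` are hypothetical placeholders for the real DHT-level operations, and files and directories are handled in a single pass here, whereas the actual process crawls files first and then directories.

```c
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical stand-ins for the DHT-level operations. */
static void fix_layout(const char *dir)     { (void)dir;  /* rewrite the directory layout */ }
static int  file_is_local(const char *path) { (void)path; return 1; /* locality/hash check */ }
static void migrate_file(const char *path)  { printf("migrating %s\n", path); }

/* Steps 3-6: crawl a directory, migrate local files, recurse into subdirs. */
static void crawl(const char *dir)
{
    DIR *dp = opendir(dir);
    struct dirent *entry;
    char path[PATH_MAX];
    struct stat st;

    if (!dp)
        return;

    while ((entry = readdir(dp)) != NULL) {
        if (!strcmp(entry->d_name, ".") || !strcmp(entry->d_name, ".."))
            continue;
        snprintf(path, sizeof(path), "%s/%s", dir, entry->d_name);
        if (lstat(path, &st) != 0)
            continue;
        if (S_ISREG(st.st_mode)) {
            if (file_is_local(path))          /* step 4.1 */
                migrate_file(path);           /* step 4.2 */
        } else if (S_ISDIR(st.st_mode)) {
            fix_layout(path);                 /* step 5.1 */
            crawl(path);                      /* back to step 3 */
        }
    }
    closedir(dp);
}

int main(int argc, char **argv)
{
    const char *root = (argc > 1) ? argv[1] : ".";
    fix_layout(root);                         /* steps 1-2 */
    crawl(root);                              /* steps 3-6 */
    return 0;                                 /* step 7: cleanup and exit */
}
```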
### Limitations and issues in the current mechanism

The current mechanism spreads the work of rebalance across all nodes in the cluster, and takes into account only files that belong to the node on which the rebalance process is running. This spreads the load of rebalance well and also optimizes network reads of data, by considering only files local to the current node.

Where this becomes slow is in the following cases,

1)  It rebalances only one file at a time, as it uses the syncop infrastructure to start the rebalance of a file by issuing a setxattr with the special attribute "distribute.migrate-data", which in turn returns only after its synctask of migrating the file completes (synctask: rebalance\_task)
    -   This limits the utilization of several resources, like disk, CPU and network, as we read and write a single file at a time

2)  Rebalance of data is serial between reads and writes, i.e. for a file, a chunk of data is read from disk, written to the network, a response to the write is awaited from the remote node, and only then does the next read proceed
    -   This makes reads and writes dependent on each other, each waiting for the other to complete, so either the network is idle while reads from disk are in progress, or vice versa
    -   This further makes serial use of resources like the disk or network, reading or writing one block at a time

3)  Each rebalance process crawls the entire volume for files to migrate, and chooses only files that are local to it
    -   This crawl can be expensive, and as a node deals only with files that are local to it, depending on the cluster size and number of nodes a large proportion of the crawled entries end up being dropped

4)  On a remove-brick, the current rebalance ends up rebalancing the entire cluster. If the interest is in removing or replacing the brick(s), rebalancing the entire cluster can be costly.

5)  On addition of bricks, again the entire cluster is rebalanced. If space constraints are the reason the bricks were added, it is sub-optimal to rebalance the entire cluster.

6)  In cases where AFR is below DHT, all the nodes in the AFR subvolume participate in the rebalance, and end up rebalancing (or attempting to rebalance) the same set of files. This is racy, and could (maybe) be made better.

Detailed Description
--------------------

The above limitations can be broken down into separate features to improve rebalance performance, and to also provide options in rebalance when specific use cases, like quicker brick removal, are sought. The following sections detail these improvements.

### Rebalance multiple files in parallel

Instead of rebalancing file by file, as happens today because syncops are used to trigger the rebalance of a file's data via setxattr, use the wind infrastructure to migrate multiple files at a time. This would use the disk and network better and can hence enable faster rebalance of data. It would also mean that when one file is blocked on a disk read, another parallel stream could be writing data to the network, so the starvation in the read-write-read model between disk and network is alleviated to a point (a rough sketch follows the next section).

### Split reads and writes of files into separate tasks when rebalancing the data

This is to reduce the wait between a disk read and a network write, and to ensure both these resources can be kept busy. By rebalancing more files in parallel, this improvement may not be needed, as the parallel streams would keep one or the other resource busy with better probability. Noting this enhancement down anyway, to see if it needs consideration after increasing the parallelism of rebalance as above.
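As a rough illustration of the parallelism proposed above, the following self-contained C sketch uses a plain POSIX thread pool and a bounded queue in place of the syncop/wind based machinery; `NWORKERS`, the queue and `migrate_file()` are illustrative placeholders, not GlusterFS APIs. With several migrations in flight, one file's disk read naturally overlaps another file's network write.

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NWORKERS 4      /* hypothetical "files to rebalance in parallel" tunable */
#define QSIZE    64

static char *queue[QSIZE];
static int qhead, qtail, qcount, done;
static pthread_mutex_t lock     = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  notempty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  notfull  = PTHREAD_COND_INITIALIZER;

/* Placeholder: read the file from its source brick and write it to the
 * destination brick; this is where disk and network time is spent, so
 * several of these running at once overlap that work across files. */
static void migrate_file(const char *path)
{
    printf("migrating %s\n", path);
}

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (qcount == 0 && !done)
            pthread_cond_wait(&notempty, &lock);
        if (qcount == 0 && done) {
            pthread_mutex_unlock(&lock);
            return NULL;
        }
        char *path = queue[qhead];
        qhead = (qhead + 1) % QSIZE;
        qcount--;
        pthread_cond_signal(&notfull);
        pthread_mutex_unlock(&lock);

        migrate_file(path);
        free(path);
    }
}

/* The crawler would call this for every candidate file it finds. */
static void enqueue(const char *path)
{
    pthread_mutex_lock(&lock);
    while (qcount == QSIZE)
        pthread_cond_wait(&notfull, &lock);
    queue[qtail] = strdup(path);
    qtail = (qtail + 1) % QSIZE;
    qcount++;
    pthread_cond_signal(&notempty);
    pthread_mutex_unlock(&lock);
}

int main(void)
{
    pthread_t tids[NWORKERS];
    const char *files[] = { "/brick/dir/a", "/brick/dir/b", "/brick/dir/c" };

    for (int i = 0; i < NWORKERS; i++)
        pthread_create(&tids[i], NULL, worker, NULL);

    for (size_t i = 0; i < sizeof(files) / sizeof(files[0]); i++)
        enqueue(files[i]);

    pthread_mutex_lock(&lock);
    done = 1;                           /* no more files: let idle workers exit */
    pthread_cond_broadcast(&notempty);
    pthread_mutex_unlock(&lock);

    for (int i = 0; i < NWORKERS; i++)
        pthread_join(tids[i], NULL);
    return 0;
}
```

A bounded queue also gives a natural throttling point: reducing the worker count or the queue depth trades rebalance speed for IO-path headroom, which ties into the IO-path awareness discussed further below.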
### Crawl only bricks that belong to the current node

As explained, the current rebalance takes into account only those files that belong to the current node. As this is a DHT-level operation, we can hence choose not to send opendir/readdir calls to subvolumes that do not belong to the current node. This would reduce the crawling performed in rebalance, at least for files, and help speed up the entire process.

NOTE: We would still need to weigh the cost of this crawl against the overall rebalance process, to evaluate its benefits

### Rebalance on access

When removing bricks, one of the intentions is to drain the brick of all its data, so that the brick can be removed as soon as possible. When adding bricks, one of the requirements could be that the cluster is reaching its capacity and hence we want to increase it. In both cases, rebalancing the entire cluster could/would take time. Instead, an alternate approach is being proposed, where we essentially do 3 things, as follows,

-   Kick off a rebalance to fix the layout, and drain a brick of its data, or rebalance files onto a newly added brick
-   On further access of data, if the access leads to a double lookup or a redirection based on the layout (because the older bricks' data has not yet been rebalanced), start a rebalance of this file in tandem with the IO access (call this rebalance on access)
-   Start a slower, or later, rebalance of the cluster once the intended use case is met, i.e. draining a brick of its data, or creating space in other bricks and filling the newly added brick with relevant data. This is to get the cluster balanced again, without requiring data to be accessed.

### Make rebalance aware of IO path requirements

One of the problems with the rebalance process consuming more resources on a node is that we could starve the IO path. So, further to some of the above enhancements, take into account IO path resource utilization (i.e. disk/network/CPU) and slow down or speed up the rebalance process appropriately (say, by decreasing or increasing the number of files that are rebalanced in parallel).

NOTE: This requirement is being noted down just to ensure we do not make IO access to the cluster slow as rebalance is made faster; the resources to monitor, and how rebalance is tuned, may differ once tested and experimented upon

### Further considerations

-   We could consider some further layout optimization to reduce the amount of data that is being rebalanced
-   Addition of scheduled rebalance, or the ability to stop and continue rebalance from a point, could be useful for preventing IO path slowness in cases where an admin chooses to run rebalance during non-critical hours (do these even exist today?)
-   There are no performance xlators in the rebalance graph. We should try experiments loading them.

Benefit to GlusterFS
--------------------

Gluster is a grow-as-you-need distributed file system. With this in the picture, rebalance is key to growing the cluster in a relatively sane amount of time. This enhancement attempts to speed up rebalance, in order to better serve this use case.

Scope
-----

### Nature of proposed change

This is intended as a modification to existing code only; there are no new xlators being introduced. BUT, as things evolve and we consider, say, layout optimization based on live data or some such notion, we would need to extend this section to capture the proposed changes.

### Implications on manageability

The gluster command would need some extensions as we introduce these changes, for example an option to set the number of files to process in parallel.
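Purely as an illustration of the kind of extension meant here, assuming a hypothetical volume option name (neither the option nor its name exists in the CLI today):

```sh
# hypothetical tunable, for illustration only -- not an existing option
gluster volume set <VOLNAME> cluster.rebalance-parallel-files 4
gluster volume rebalance <VOLNAME> start
```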
As this feature is currently in the prototype phase, this and the sections below are kept as TBDs.

**Document TBD from here on...**

### Implications on presentation layer

*NFS/SAMBA/UFO/FUSE/libglusterfsclient Integration*

### Implications on persistence layer

*LVM, XFS, RHEL ...*

### Implications on 'GlusterFS' backend

*brick's data format, layout changes*

### Modification to GlusterFS metadata

*extended attributes used, internal hidden files to keep the metadata...*

### Implications on 'glusterd'

*persistent store, configuration changes, brick-op...*

How To Test
-----------

*Description on Testing the feature*

User Experience
---------------

*Changes in CLI, effect on User experience...*

Dependencies
------------

*Dependencies, if any*

Documentation
-------------

*Documentation for the feature*

Status
------

Design/Prototype in progress

Comments and Discussion
-----------------------

*Follow here*