Diffstat (limited to 'Feature Planning/GlusterFS 3.7/Improve Rebalance Performance.md')

 -rw-r--r--  Feature Planning/GlusterFS 3.7/Improve Rebalance Performance.md | 277
 1 file changed, 277 insertions, 0 deletions
diff --git a/Feature Planning/GlusterFS 3.7/Improve Rebalance Performance.md b/Feature Planning/GlusterFS 3.7/Improve Rebalance Performance.md
new file mode 100644
index 0000000..32a2eec
--- /dev/null
+++ b/Feature Planning/GlusterFS 3.7/Improve Rebalance Performance.md
@@ -0,0 +1,277 @@

Feature
-------

Improve GlusterFS rebalance performance

Summary
-------

Improve the current rebalance mechanism in GlusterFS by utilizing
resources better, to speed up overall rebalance, and also to speed up
the brick removal and addition cases, where data needs to be spread
faster than the current rebalance mechanism allows.

Owners
------

Raghavendra Gowdappa <rgowdapp@redhat.com>
Shyamsundar Ranganathan <srangana@redhat.com>
Susant Palai <spalai@redhat.com>
Venkatesh Somyajulu <vsomyaju@redhat.com>

Current status
--------------

This section is split into 2 parts, explaining how the current
rebalance works and what its limitations are.

### Current rebalance mechanism

Currently, rebalance works as follows.

A) Each node in the Gluster cluster kicks off a rebalance process for
one of the following actions:

- layout fixing
- rebalance data, with space constraints in check
    - Will rebalance data with file size and disk free availability
      constraints, and move files that will not cause a brick
      imbalance in terms of the amount of data stored across bricks
- rebalance data, based on layout precision
    - Will rebalance data so that the layout is adhered to, and hence
      optimize lookups in the future (find the file where the layout
      claims it is)

B) Each node's process then uses the following algorithm to proceed
(see the sketch after this list):

- 1: Open the root of the volume
- 2: Fix the layout of the root
- 3: Start a crawl on the current directory
- 4: For each file in the current directory,
    - 4.1: Determine if the file belongs to the current node (to
      optimize on network reads of file data)
    - 4.2: If it does, migrate the file based on the type of rebalance
      sought
    - 4.3: End the file crawl once the crawl returns no more entries
- 5: For each directory in the current directory,
    - 5.1: Fix the layout, and iterate on starting the crawl for this
      directory (go to step 3)
- 6: End the directory crawl once the crawl returns no more entries
- 7: Clean up and exit
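As an illustration of the flow above, here is a minimal, self-contained
C sketch of steps 1-7. The helpers (`fix_layout`, `file_is_local`,
`migrate_file`) and the mount point `/mnt/volume` are hypothetical
stand-ins for the actual DHT/syncop calls, not the real rebalance code.

```c
/* Minimal sketch of the crawl in steps 1-7 above; helper functions are
 * hypothetical stubs for the actual DHT/syncop calls. */
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

static void fix_layout(const char *dir)
{
    /* real code: a setxattr asking DHT to fix this directory's layout */
    printf("fix-layout: %s\n", dir);
}

static int file_is_local(const char *path)
{
    (void)path;
    return 1; /* real code: does the file's data live on a local brick? */
}

static void migrate_file(const char *path)
{
    /* real code: setxattr("distribute.migrate-data") -> rebalance_task */
    printf("migrate: %s\n", path);
}

static void crawl(const char *dir) /* steps 3-6 */
{
    DIR *dp = opendir(dir);
    if (!dp)
        return;

    struct dirent *entry;
    while ((entry = readdir(dp)) != NULL) {
        if (!strcmp(entry->d_name, ".") || !strcmp(entry->d_name, ".."))
            continue;

        char path[4096];
        snprintf(path, sizeof(path), "%s/%s", dir, entry->d_name);

        struct stat st;
        if (lstat(path, &st) != 0)
            continue;

        if (S_ISREG(st.st_mode)) {
            if (file_is_local(path))   /* step 4.1 */
                migrate_file(path);    /* step 4.2 */
        } else if (S_ISDIR(st.st_mode)) {
            fix_layout(path);          /* step 5.1 */
            crawl(path);               /* back to step 3 */
        }
    }
    closedir(dp);                      /* steps 6 and 7 */
}

int main(void)
{
    fix_layout("/mnt/volume");         /* steps 1 and 2: root first */
    crawl("/mnt/volume");
    return 0;
}
```

Note how the whole crawl is sequential: each `migrate_file` call stands
for a synctask that must complete before the next entry is examined,
which is the core of the limitations described next.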
### Limitations and issues in the current mechanism

The current mechanism spreads the work of rebalance across all nodes in
the cluster, and each rebalance process considers only the files that
belong to the node it runs on. This distributes the load of rebalance
well and also optimizes network reads of data, by taking into account
only files local to the current node.

Where this becomes slow is in the following cases:

1) It rebalances only one file at a time, as it uses the syncop
infrastructure to start the rebalance of a file by issuing a setxattr
with the special attribute "distribute.migrate-data", which in turn
returns only after its synctask of migrating the file completes
(synctask: rebalance\_task).

- This under-utilizes several resources, like disk, CPU and network,
  as we read and write a single file at a time

2) Rebalance of data is serial between reads and writes, i.e., for a
file, a chunk of data is read from disk, written to the network, a
response to the write is awaited from the remote node, and only then
does the next read proceed.

- This makes reads and writes dependent on each other, each waiting
  for the other to complete, so either the network is idle while reads
  from disk are in progress, or vice versa

- This further makes serial use of resources like the disk or the
  network, reading or writing one block at a time

3) Each rebalance process crawls the entire volume for files to
migrate, and chooses only files that are local to it.

- This crawl can be expensive, and as a node deals only with files
  that are local to it, depending on the cluster size and the number
  of nodes, quite a proportion of the entries crawled is dropped

4) On a remove-brick, the current rebalance ends up rebalancing the
entire cluster. If the interest is in removing or replacing the
brick(s), rebalancing the entire cluster can be costly.

5) On addition of bricks, again the entire cluster is rebalanced. If
bricks were added due to space constraints, it is sub-optimal to
rebalance the entire cluster.

6) In cases where AFR is below DHT, all the nodes in AFR participate in
the rebalance, and end up rebalancing (or attempting to rebalance) the
same set of files. This is racy, and could (maybe) be made better.

Detailed Description
--------------------

The above limitations can be broken down into separate features to
improve rebalance performance and also to provide options in rebalance
when specific use cases, like quicker brick removal, are sought. The
following sections detail these improvements.

### Rebalance multiple files in parallel

Instead of rebalancing file by file, as the syncop-triggered setxattr
does today, use the wind infrastructure to migrate multiple files at a
time. This would use the disk and network infrastructure better and
hence enable faster rebalance of data. It would also mean that when one
file is blocked on a disk read, another parallel stream could be
writing data to the network, so the starvation of the
read-write-read model between disk and network could be alleviated to a
point. A sketch of such a worker pool follows.
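Below is a minimal sketch of this idea using ordinary pthreads rather
than the actual synctask/wind machinery: a fixed pool of workers drains
a queue that the crawl fills, so up to `MAX_PARALLEL` migrations are in
flight at once. All names here (`migrate_file`, `enqueue`,
`MAX_PARALLEL`) are illustrative assumptions, and queue overflow and
backpressure handling are elided for brevity.

```c
/* Sketch: a worker pool migrating several files concurrently, standing
 * in for the proposed parallel rebalance (hypothetical names). */
#include <pthread.h>
#include <stdio.h>

#define MAX_PARALLEL 4     /* files migrated concurrently */
#define QUEUE_DEPTH  128   /* overflow handling elided for brevity */

static const char *queue[QUEUE_DEPTH];
static int head, tail, done;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

static void migrate_file(const char *path)
{
    /* real code: the data copy behind "distribute.migrate-data" */
    printf("migrating %s\n", path);
}

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (head == tail && !done)
            pthread_cond_wait(&cond, &lock);
        if (head == tail) {            /* queue drained and crawl done */
            pthread_mutex_unlock(&lock);
            return NULL;
        }
        const char *path = queue[head++ % QUEUE_DEPTH];
        pthread_mutex_unlock(&lock);
        migrate_file(path);            /* several of these run at once */
    }
}

static void enqueue(const char *path)  /* called from the crawl */
{
    pthread_mutex_lock(&lock);
    queue[tail++ % QUEUE_DEPTH] = path;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
}

int main(void)
{
    pthread_t workers[MAX_PARALLEL];
    for (int i = 0; i < MAX_PARALLEL; i++)
        pthread_create(&workers[i], NULL, worker, NULL);

    enqueue("/mnt/volume/a");          /* the crawl would feed these */
    enqueue("/mnt/volume/b");
    enqueue("/mnt/volume/c");

    pthread_mutex_lock(&lock);
    done = 1;                          /* crawl finished: let workers exit */
    pthread_cond_broadcast(&cond);
    pthread_mutex_unlock(&lock);

    for (int i = 0; i < MAX_PARALLEL; i++)
        pthread_join(workers[i], NULL);
    return 0;
}
```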
### Split reads and writes of files into separate tasks when rebalancing the data

This is to reduce the wait between a disk read and a network write, and
to ensure both these resources can be kept busy. With more files being
rebalanced in parallel, this improvement may not be needed, as the
parallel streams would keep one or the other resource busy with better
probability. Noting this enhancement down anyway, to see if it needs
consideration after increasing the parallelism of rebalance as above.

### Crawl only bricks that belong to the current node

As explained, the current rebalance takes into account only those files
that belong to the current node. As this is a DHT-level operation, we
can hence choose not to send opendir/readdir calls to subvolumes that
do not belong to the current node. This would reduce the crawls that
are performed in rebalance, at least for files, and help in speeding up
the entire process.

NOTE: We would still need to evaluate the cost of this crawl vis-a-vis
the overall rebalance process, to evaluate its benefits.

### Rebalance on access

When removing bricks, one of the intentions is to drain the brick of
all its data, and hence to enable removing the brick as soon as
possible.

When adding bricks, one of the requirements could be that the cluster
is reaching its capacity, and hence we want to increase it.

In both cases, rebalancing the entire cluster could/would take time.
Instead, an alternate approach is being proposed, where we essentially
do 3 things, as follows:

- Kick off rebalance to fix the layout, and drain a brick of its
  data, or rebalance files onto a newly added brick
- On further access of data, if the access leads to a double lookup
  or a redirection based on the layout (due to the older bricks' data
  not yet having been rebalanced), start a rebalance of this file in
  tandem with the IO access (call this rebalance on access)
- Start a slower, or later, rebalance of the cluster once the
  intended use case is met, i.e., draining a brick of its data, or
  creating space in other bricks and filling the newly added brick
  with relevant data. This is to get the cluster balanced again,
  without requiring data to be accessed.

### Make rebalance aware of IO path requirements

One of the problems with improving resource consumption by the
rebalance process on a node is that we could starve the IO path. So,
further to some of the above enhancements, take into account IO path
resource utilization (i.e., disk/network/CPU) and slow down or speed up
the rebalance process appropriately (say, by decreasing or increasing
the number of files that are rebalanced in parallel), as sketched
below.

NOTE: This requirement is being noted down just to ensure we do not
make IO access to the cluster slow as rebalance is made faster; the
resources to monitor, and how to tune rebalance, may differ when tested
and experimented upon.
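A minimal sketch of such a feedback loop, under stated assumptions:
`io_path_busy()` is a hypothetical probe (a real one might sample
/proc or brick-level counters), and `max_parallel` is the knob the
worker pool above would consult. The thresholds are arbitrary
placeholders, not tuned values.

```c
/* Sketch: throttle rebalance parallelism based on IO path load.
 * io_path_busy() and the thresholds are hypothetical. */
#include <stdio.h>
#include <unistd.h>

static int max_parallel = 4;  /* current number of concurrent migrations */

static double io_path_busy(void)
{
    /* hypothetical: fraction (0.0-1.0) of disk/network/CPU consumed by
     * client IO; a real probe would sample utilization counters */
    return 0.5;
}

static void throttle_loop(void)
{
    for (;;) {
        double busy = io_path_busy();

        if (busy > 0.8 && max_parallel > 1)
            max_parallel--;   /* client IO is starving: back off */
        else if (busy < 0.5 && max_parallel < 16)
            max_parallel++;   /* headroom available: rebalance faster */

        printf("parallelism now %d (IO busy %.2f)\n", max_parallel, busy);
        sleep(5);             /* re-sample periodically */
    }
}

int main(void)
{
    throttle_loop();          /* would run as its own thread in practice */
    return 0;
}
```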
### Further considerations

- We could consider some further layout optimization to reduce the
  amount of data that is rebalanced
- The addition of scheduled rebalance, or the ability to stop and
  later continue rebalance from a point, could be useful to prevent
  IO path slowness, in cases where an admin chooses to run rebalance
  during non-critical hours (do these even exist today?)
- There are no performance xlators in the rebalance graph. We should
  try experiments loading them.

Benefit to GlusterFS
--------------------

Gluster is a grow-as-you-need distributed file system. With this in the
picture, rebalance is key to growing the cluster in a relatively sane
amount of time. This enhancement attempts to speed up rebalance, in
order to better serve this use case.

Scope
-----

### Nature of proposed change

This is intended as a modification to existing code only; there are no
new xlators being introduced. BUT, as things evolve and we consider,
say, layout optimization based on live data or some such notion, we
would need to extend this section to capture the proposed changes.

### Implications on manageability

The gluster command would need some extensions, for example the number
of files to process in parallel, as we introduce these changes. As this
is currently in the prototype phase, this and the sections below are
kept as TBDs.

**Document TBD from here on...**

### Implications on presentation layer

*NFS/SAMBA/UFO/FUSE/libglusterfsclient Integration*

### Implications on persistence layer

*LVM, XFS, RHEL ...*

### Implications on 'GlusterFS' backend

*brick's data format, layout changes*

### Modification to GlusterFS metadata

*extended attributes used, internal hidden files to keep the metadata...*

### Implications on 'glusterd'

*persistent store, configuration changes, brick-op...*

How To Test
-----------

*Description on Testing the feature*

User Experience
---------------

*Changes in CLI, effect on User experience...*

Dependencies
------------

*Dependencies, if any*

Documentation
-------------

*Documentation for the feature*

Status
------

Design/Prototype in progress

Comments and Discussion
-----------------------

*Follow here*