From c6cce6a4c06f7f7992ac04d924a4d85a979ced2d Mon Sep 17 00:00:00 2001
From: Poornima G
Date: Fri, 9 Dec 2016 12:37:17 +0530
Subject: Add parallel readdirp feature

Change-Id: Iae0ef7181c0d416359dd87412bfa4c31c489559e
Signed-off-by: Poornima G
Reviewed-on: http://review.gluster.org/16090
Reviewed-by: Raghavendra G
Tested-by: Raghavendra G
---
 under_review/readdir-ahead.md | 167 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 167 insertions(+)
 create mode 100644 under_review/readdir-ahead.md

diff --git a/under_review/readdir-ahead.md b/under_review/readdir-ahead.md
new file mode 100644
index 0000000..71e5b62
--- /dev/null
+++ b/under_review/readdir-ahead.md
@@ -0,0 +1,167 @@

Feature
-------
Improve directory enumeration performance

Summary
-------
Improve directory enumeration performance by implementing parallel readdirp
at the dht layer.

Owners
------

Raghavendra G
Poornima G
Rajesh Joseph

Current status
--------------

In development.

Related Feature Requests and Bugs
---------------------------------
https://bugzilla.redhat.com/show_bug.cgi?id=1401812

Detailed Description
--------------------

Currently readdirp is sequential at the dht layer. This makes find and
recursive listing of small directories very slow (a small directory being one
whose contents fit in a single readdirp call, e.g. ~600 entries if the buffer
size is 128K).

The number of readdirp fops required for ls -l -R on nested directories is:

no. of fops = (x + 1) * m * n

where:
n = number of bricks
m = number of directories
x = number of readdirp calls required to fetch all the dentries of a
    directory (this depends on the size of the directory and the readdirp
    buffer size)
+1 = the extra readdirp fop sent just to detect the end of the directory.

E.g. to list 800 directories of ~300 files each, with a 128K readdirp buffer,
on a distribute volume of 6 bricks:

(1 + 1) * 800 * 6 = 9600 fops

All of these readdirp fops are sent sequentially to the bricks. With parallel
readdirp the number of fops may not decrease drastically, but since they are
issued in parallel, throughput increases.

Why it is not a straightforward problem to solve: one needs to briefly
understand how the directory offset is handled in dht; [1], [2] and [3] below
provide the background.
- The d_off follows the order of the bricks as identified by dht. Hence the
  dentries must always be returned in brick order, i.e. brick2's entries must
  not be returned before brick1 reaches EOD.
- We cannot store any state, such as the offset read so far, in inode_ctx or
  fd_ctx.
- For a very large directory, where the readdirp buffer is too small to hold
  all the dentries of any one brick, parallel readdirp is an overhead;
  sequential readdirp suits large directories best. This demands that dht be
  aware of, or able to speculate about, the directory size.

Two solutions were evaluated:
1. Change dht_readdirp itself to wind readdirp in parallel
   http://review.gluster.org/15160
   http://review.gluster.org/15159
   http://review.gluster.org/15169
2. Load readdir-ahead as a child of dht
   http://review.gluster.org/#/q/status:open+project:glusterfs+branch:master+topic:bug-1401812

For the reasons below, we go with the second approach, suggested by
Raghavendra G:
- It requires few or no changes in dht.
- Besides empty/small directories, it also benefits large directories.
The only slightly complicated part is tuning the readdir-ahead buffer size
for each instance.
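To visualize approach 2, below is a minimal sketch of the client-side graph
fragment that glusterd would need to generate: one readdir-ahead instance per
dht subvolume, sitting between dht and the protocol/client xlators. The volume
name testvol, the instance names and the option values are hypothetical, and
the rda-request-size/rda-cache-limit tunables are assumed from the
readdir-ahead xlator; the exact graph and option names are decided by the
implementation.

```
# Hypothetical client volfile fragment (names and values illustrative only):
# one readdir-ahead instance is loaded below dht for each of its subvolumes.

volume testvol-readdir-ahead-0
    type performance/readdir-ahead
    option rda-request-size 131072   # per-request readdirp buffer (128K)
    option rda-cache-limit 10MB      # cap on pre-fetched dentries per instance
    subvolumes testvol-client-0
end-volume

volume testvol-readdir-ahead-1
    type performance/readdir-ahead
    option rda-request-size 131072
    option rda-cache-limit 10MB
    subvolumes testvol-client-1
end-volume

volume testvol-dht
    type cluster/distribute
    subvolumes testvol-readdir-ahead-0 testvol-readdir-ahead-1
end-volume
```

With such a graph, dht's sequential winding is unchanged, but each
readdir-ahead instance pre-fetches dentries from its brick in the background,
so the bricks are effectively read in parallel. A per-volume toggle (e.g. a
`gluster volume set testvol performance.parallel-readdir on` style option)
would let glusterd enable or disable this graph change.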

The perf gain observed is directly proportional to:
- the number of nodes in the cluster/volume
- the latency between the client and each node in the volume.

Some references:
[1] http://review.gluster.org/#/c/4711
[2] https://www.mail-archive.com/gluster-devel@gluster.org/msg02834.html
[3] http://www.gluster.org/pipermail/gluster-devel/2015-January/043592.html

Benefit to GlusterFS
--------------------

Improves directory enumeration performance in large clusters.

Scope
-----

#### Nature of proposed change

- Changes in the readdir-ahead and dht xlators.
- Change glusterd to load readdir-ahead as a child of dht, without breaking
  upgrade and downgrade scenarios.

#### Implications on manageability

N/A

#### Implications on presentation layer

N/A

#### Implications on persistence layer

N/A

#### Implications on 'GlusterFS' backend

N/A

#### Modification to GlusterFS metadata

N/A

#### Implications on 'glusterd'

GlusterD changes are integral to this feature and are described above.

How To Test
-----------

For the most part, testing is of the "do no harm" sort; the most thorough test
of this feature is to run our current regression suite.
Some specific test cases include readdirp on all kinds of volumes:
- distribute
- replicate
- shard
- disperse
- tier
Also, readdirp while:
- rebalance is in progress
- tiering migration is in progress
- self heal is in progress

All of these test cases should be run while monitoring the memory consumption
of the process.

User Experience
---------------

Faster directory enumeration.

Dependencies
------------

N/A

Documentation
-------------

TBD (very little)

Status
------

Development in progress

Comments and Discussion
-----------------------

N/A
-- cgit