From 9e9e3c5620882d2f769694996ff4d7e0cf36cc2b Mon Sep 17 00:00:00 2001 From: raghavendra talur Date: Thu, 20 Aug 2015 15:09:31 +0530 Subject: Create basic directory structure All new features specs go into in_progress directory. Once signed off, it should be moved to done directory. For now, This change moves all the Gluster 4.0 feature specs to in_progress. All other specs are under done/release-version. More cleanup required will be done incrementally. Change-Id: Id272d301ba8c434cbf7a9a966ceba05fe63b230d BUG: 1206539 Signed-off-by: Raghavendra Talur Reviewed-on: http://review.gluster.org/11969 Reviewed-by: Humble Devassy Chirammal Reviewed-by: Prashanth Pai Tested-by: Humble Devassy Chirammal --- Feature Planning/Feature Template.md | 93 --- .../GlusterFS 3.5/AFR CLI enhancements.md | 204 ------ .../GlusterFS 3.5/Brick Failure Detection.md | 151 ----- Feature Planning/GlusterFS 3.5/Disk Encryption.md | 443 ------------ .../GlusterFS 3.5/Exposing Volume Capabilities.md | 161 ----- Feature Planning/GlusterFS 3.5/File Snapshot.md | 101 --- .../Onwire Compression-Decompression.md | 96 --- .../GlusterFS 3.5/Quota Scalability.md | 99 --- .../GlusterFS 3.5/Virt store usecase.md | 140 ---- Feature Planning/GlusterFS 3.5/Zerofill.md | 192 ------ Feature Planning/GlusterFS 3.5/gfid access.md | 89 --- Feature Planning/GlusterFS 3.5/index.md | 32 - .../GlusterFS 3.5/libgfapi with qemu libvirt.md | 222 ------ Feature Planning/GlusterFS 3.5/readdir ahead.md | 117 ---- Feature Planning/GlusterFS 3.6/Better Logging.md | 348 ---------- .../GlusterFS 3.6/Better Peer Identification.md | 172 ----- .../Gluster User Serviceable Snapshots.md | 39 -- .../GlusterFS 3.6/Gluster Volume Snapshot.md | 354 ---------- .../GlusterFS 3.6/New Style Replication.md | 230 ------- .../Persistent AFR Changelog xattributes.md | 178 ----- .../GlusterFS 3.6/RDMA Improvements.md | 101 --- .../GlusterFS 3.6/Server-side Barrier feature.md | 213 ------ .../GlusterFS 3.6/Thousand Node Gluster.md | 150 ----- Feature Planning/GlusterFS 3.6/afrv2.md | 244 ------- Feature Planning/GlusterFS 3.6/better-ssl.md | 137 ---- Feature Planning/GlusterFS 3.6/disperse.md | 142 ---- .../GlusterFS 3.6/glusterd volume locks.md | 48 -- .../GlusterFS 3.6/heterogeneous-bricks.md | 136 ---- Feature Planning/GlusterFS 3.6/index.md | 96 --- .../GlusterFS 3.7/Archipelago Integration.md | 93 --- Feature Planning/GlusterFS 3.7/BitRot.md | 211 ------ .../GlusterFS 3.7/Clone of Snapshot.md | 100 --- .../GlusterFS 3.7/Data Classification.md | 279 -------- .../Easy addition of Custom Translators.md | 129 ---- .../Exports and Netgroups Authentication.md | 134 ---- .../GlusterFS 3.7/Gluster CLI for NFS Ganesha.md | 120 ---- Feature Planning/GlusterFS 3.7/Gnotify.md | 168 ----- Feature Planning/GlusterFS 3.7/HA for Ganesha.md | 156 ----- .../GlusterFS 3.7/Improve Rebalance Performance.md | 277 -------- Feature Planning/GlusterFS 3.7/Object Count.md | 113 ---- .../Policy based Split-brain Resolution.md | 128 ---- .../GlusterFS 3.7/SE Linux Integration.md | 4 - .../GlusterFS 3.7/Scheduling of Snapshot.md | 229 ------- Feature Planning/GlusterFS 3.7/Sharding xlator.md | 129 ---- .../GlusterFS 3.7/Small File Performance.md | 433 ------------ Feature Planning/GlusterFS 3.7/Trash.md | 182 ----- .../GlusterFS 3.7/Upcall Infrastructure.md | 747 --------------------- Feature Planning/GlusterFS 3.7/arbiter.md | 100 --- Feature Planning/GlusterFS 3.7/index.md | 90 --- Feature Planning/GlusterFS 3.7/rest-api.md | 152 ----- .../GlusterFS 4.0/Better Brick Mgmt.md | 180 ----- 
.../GlusterFS 4.0/Compression Dedup.md | 128 ---- Feature Planning/GlusterFS 4.0/Split Network.md | 138 ---- Feature Planning/GlusterFS 4.0/caching.md | 143 ---- Feature Planning/GlusterFS 4.0/code-generation.md | 143 ---- .../GlusterFS 4.0/composite-operations.md | 438 ------------ Feature Planning/GlusterFS 4.0/dht-scalability.md | 171 ----- Feature Planning/GlusterFS 4.0/index.md | 82 --- Feature Planning/GlusterFS 4.0/lockdep.md | 101 --- Feature Planning/GlusterFS 4.0/stat-xattr-cache.md | 197 ------ Feature Planning/GlusterFS 4.0/volgen-rewrite.md | 128 ---- Feature Planning/index.md | 15 - Features/README.md | 42 -- Features/afr-arbiter-volumes.md | 56 -- Features/afr-statistics.md | 142 ---- Features/afr-v1.md | 340 ---------- Features/bitrot-docs.md | 7 - Features/brick-failure-detection.md | 67 -- Features/dht.md | 223 ------ Features/distributed-geo-rep.md | 71 -- Features/file-snapshot.md | 91 --- Features/gfid-access.md | 73 -- Features/glusterfs_nfs-ganesha_integration.md | 123 ---- Features/heal-info-and-split-brain-resolution.md | 448 ------------ Features/leases.md | 11 - Features/libgfapi.md | 382 ----------- Features/libgfchangelog.md | 119 ---- Features/memory-usage.md | 49 -- Features/meta.md | 206 ------ Features/mount_gluster_volume_using_pnfs.md | 68 -- Features/nufa.md | 20 - Features/object-versioning.md | 230 ------- Features/ovirt-integration.md | 106 --- Features/qemu-integration.md | 230 ------- Features/quota-object-count.md | 47 -- Features/quota-scalability.md | 52 -- Features/rdmacm.md | 26 - Features/readdir-ahead.md | 14 - Features/rebalance.md | 74 -- Features/server-quorum.md | 44 -- Features/shard.md | 68 -- Features/tier.md | 170 ----- Features/trash_xlator.md | 80 --- Features/upcall.md | 38 -- Features/worm.md | 75 --- Features/zerofill.md | 26 - README.md | 57 +- done/Features/README.md | 42 ++ done/Features/afr-arbiter-volumes.md | 56 ++ done/Features/afr-statistics.md | 142 ++++ done/Features/afr-v1.md | 340 ++++++++++ done/Features/bitrot-docs.md | 7 + done/Features/brick-failure-detection.md | 67 ++ done/Features/dht.md | 223 ++++++ done/Features/distributed-geo-rep.md | 71 ++ done/Features/file-snapshot.md | 91 +++ done/Features/gfid-access.md | 73 ++ done/Features/glusterfs_nfs-ganesha_integration.md | 123 ++++ .../heal-info-and-split-brain-resolution.md | 448 ++++++++++++ done/Features/leases.md | 11 + done/Features/libgfapi.md | 382 +++++++++++ done/Features/libgfchangelog.md | 119 ++++ done/Features/memory-usage.md | 49 ++ done/Features/meta.md | 206 ++++++ done/Features/mount_gluster_volume_using_pnfs.md | 68 ++ done/Features/nufa.md | 20 + done/Features/object-versioning.md | 230 +++++++ done/Features/ovirt-integration.md | 106 +++ done/Features/qemu-integration.md | 230 +++++++ done/Features/quota-object-count.md | 47 ++ done/Features/quota-scalability.md | 52 ++ done/Features/rdmacm.md | 26 + done/Features/readdir-ahead.md | 14 + done/Features/rebalance.md | 74 ++ done/Features/server-quorum.md | 44 ++ done/Features/shard.md | 68 ++ done/Features/tier.md | 170 +++++ done/Features/trash_xlator.md | 80 +++ done/Features/upcall.md | 38 ++ done/Features/worm.md | 75 +++ done/Features/zerofill.md | 26 + done/GlusterFS 3.5/AFR CLI enhancements.md | 204 ++++++ done/GlusterFS 3.5/Brick Failure Detection.md | 151 +++++ done/GlusterFS 3.5/Disk Encryption.md | 443 ++++++++++++ done/GlusterFS 3.5/Exposing Volume Capabilities.md | 161 +++++ done/GlusterFS 3.5/File Snapshot.md | 101 +++ .../Onwire Compression-Decompression.md | 96 +++ 
done/GlusterFS 3.5/Quota Scalability.md | 99 +++ done/GlusterFS 3.5/Virt store usecase.md | 140 ++++ done/GlusterFS 3.5/Zerofill.md | 192 ++++++ done/GlusterFS 3.5/gfid access.md | 89 +++ done/GlusterFS 3.5/index.md | 32 + done/GlusterFS 3.5/libgfapi with qemu libvirt.md | 222 ++++++ done/GlusterFS 3.5/readdir ahead.md | 117 ++++ done/GlusterFS 3.6/Better Logging.md | 348 ++++++++++ done/GlusterFS 3.6/Better Peer Identification.md | 172 +++++ .../Gluster User Serviceable Snapshots.md | 39 ++ done/GlusterFS 3.6/Gluster Volume Snapshot.md | 354 ++++++++++ done/GlusterFS 3.6/New Style Replication.md | 230 +++++++ .../Persistent AFR Changelog xattributes.md | 178 +++++ done/GlusterFS 3.6/RDMA Improvements.md | 101 +++ done/GlusterFS 3.6/Server-side Barrier feature.md | 213 ++++++ done/GlusterFS 3.6/Thousand Node Gluster.md | 150 +++++ done/GlusterFS 3.6/afrv2.md | 244 +++++++ done/GlusterFS 3.6/better-ssl.md | 137 ++++ done/GlusterFS 3.6/disperse.md | 142 ++++ done/GlusterFS 3.6/glusterd volume locks.md | 48 ++ done/GlusterFS 3.6/heterogeneous-bricks.md | 136 ++++ done/GlusterFS 3.6/index.md | 96 +++ done/GlusterFS 3.7/Archipelago Integration.md | 93 +++ done/GlusterFS 3.7/BitRot.md | 211 ++++++ done/GlusterFS 3.7/Clone of Snapshot.md | 100 +++ done/GlusterFS 3.7/Data Classification.md | 279 ++++++++ .../Easy addition of Custom Translators.md | 129 ++++ .../Exports and Netgroups Authentication.md | 134 ++++ done/GlusterFS 3.7/Gluster CLI for NFS Ganesha.md | 120 ++++ done/GlusterFS 3.7/Gnotify.md | 168 +++++ done/GlusterFS 3.7/HA for Ganesha.md | 156 +++++ .../GlusterFS 3.7/Improve Rebalance Performance.md | 277 ++++++++ done/GlusterFS 3.7/Object Count.md | 113 ++++ .../Policy based Split-brain Resolution.md | 128 ++++ done/GlusterFS 3.7/SE Linux Integration.md | 4 + done/GlusterFS 3.7/Scheduling of Snapshot.md | 229 +++++++ done/GlusterFS 3.7/Sharding xlator.md | 129 ++++ done/GlusterFS 3.7/Small File Performance.md | 433 ++++++++++++ done/GlusterFS 3.7/Trash.md | 182 +++++ done/GlusterFS 3.7/Upcall Infrastructure.md | 747 +++++++++++++++++++++ done/GlusterFS 3.7/arbiter.md | 100 +++ done/GlusterFS 3.7/index.md | 90 +++ done/GlusterFS 3.7/rest-api.md | 152 +++++ in_progress/Better Brick Mgmt.md | 180 +++++ in_progress/Compression Dedup.md | 128 ++++ in_progress/Split Network.md | 138 ++++ in_progress/caching.md | 143 ++++ in_progress/code-generation.md | 143 ++++ in_progress/composite-operations.md | 438 ++++++++++++ in_progress/dht-scalability.md | 171 +++++ in_progress/index.md | 82 +++ in_progress/lockdep.md | 101 +++ in_progress/stat-xattr-cache.md | 197 ++++++ in_progress/template.md | 93 +++ in_progress/volgen-rewrite.md | 128 ++++ 192 files changed, 14422 insertions(+), 14388 deletions(-) delete mode 100644 Feature Planning/Feature Template.md delete mode 100644 Feature Planning/GlusterFS 3.5/AFR CLI enhancements.md delete mode 100644 Feature Planning/GlusterFS 3.5/Brick Failure Detection.md delete mode 100644 Feature Planning/GlusterFS 3.5/Disk Encryption.md delete mode 100644 Feature Planning/GlusterFS 3.5/Exposing Volume Capabilities.md delete mode 100644 Feature Planning/GlusterFS 3.5/File Snapshot.md delete mode 100644 Feature Planning/GlusterFS 3.5/Onwire Compression-Decompression.md delete mode 100644 Feature Planning/GlusterFS 3.5/Quota Scalability.md delete mode 100644 Feature Planning/GlusterFS 3.5/Virt store usecase.md delete mode 100644 Feature Planning/GlusterFS 3.5/Zerofill.md delete mode 100644 Feature Planning/GlusterFS 3.5/gfid access.md delete mode 100644 Feature 
Planning/GlusterFS 3.5/index.md delete mode 100644 Feature Planning/GlusterFS 3.5/libgfapi with qemu libvirt.md delete mode 100644 Feature Planning/GlusterFS 3.5/readdir ahead.md delete mode 100644 Feature Planning/GlusterFS 3.6/Better Logging.md delete mode 100644 Feature Planning/GlusterFS 3.6/Better Peer Identification.md delete mode 100644 Feature Planning/GlusterFS 3.6/Gluster User Serviceable Snapshots.md delete mode 100644 Feature Planning/GlusterFS 3.6/Gluster Volume Snapshot.md delete mode 100644 Feature Planning/GlusterFS 3.6/New Style Replication.md delete mode 100644 Feature Planning/GlusterFS 3.6/Persistent AFR Changelog xattributes.md delete mode 100644 Feature Planning/GlusterFS 3.6/RDMA Improvements.md delete mode 100644 Feature Planning/GlusterFS 3.6/Server-side Barrier feature.md delete mode 100644 Feature Planning/GlusterFS 3.6/Thousand Node Gluster.md delete mode 100644 Feature Planning/GlusterFS 3.6/afrv2.md delete mode 100644 Feature Planning/GlusterFS 3.6/better-ssl.md delete mode 100644 Feature Planning/GlusterFS 3.6/disperse.md delete mode 100644 Feature Planning/GlusterFS 3.6/glusterd volume locks.md delete mode 100644 Feature Planning/GlusterFS 3.6/heterogeneous-bricks.md delete mode 100644 Feature Planning/GlusterFS 3.6/index.md delete mode 100644 Feature Planning/GlusterFS 3.7/Archipelago Integration.md delete mode 100644 Feature Planning/GlusterFS 3.7/BitRot.md delete mode 100644 Feature Planning/GlusterFS 3.7/Clone of Snapshot.md delete mode 100644 Feature Planning/GlusterFS 3.7/Data Classification.md delete mode 100644 Feature Planning/GlusterFS 3.7/Easy addition of Custom Translators.md delete mode 100644 Feature Planning/GlusterFS 3.7/Exports and Netgroups Authentication.md delete mode 100644 Feature Planning/GlusterFS 3.7/Gluster CLI for NFS Ganesha.md delete mode 100644 Feature Planning/GlusterFS 3.7/Gnotify.md delete mode 100644 Feature Planning/GlusterFS 3.7/HA for Ganesha.md delete mode 100644 Feature Planning/GlusterFS 3.7/Improve Rebalance Performance.md delete mode 100644 Feature Planning/GlusterFS 3.7/Object Count.md delete mode 100644 Feature Planning/GlusterFS 3.7/Policy based Split-brain Resolution.md delete mode 100644 Feature Planning/GlusterFS 3.7/SE Linux Integration.md delete mode 100644 Feature Planning/GlusterFS 3.7/Scheduling of Snapshot.md delete mode 100644 Feature Planning/GlusterFS 3.7/Sharding xlator.md delete mode 100644 Feature Planning/GlusterFS 3.7/Small File Performance.md delete mode 100644 Feature Planning/GlusterFS 3.7/Trash.md delete mode 100644 Feature Planning/GlusterFS 3.7/Upcall Infrastructure.md delete mode 100644 Feature Planning/GlusterFS 3.7/arbiter.md delete mode 100644 Feature Planning/GlusterFS 3.7/index.md delete mode 100644 Feature Planning/GlusterFS 3.7/rest-api.md delete mode 100644 Feature Planning/GlusterFS 4.0/Better Brick Mgmt.md delete mode 100644 Feature Planning/GlusterFS 4.0/Compression Dedup.md delete mode 100644 Feature Planning/GlusterFS 4.0/Split Network.md delete mode 100644 Feature Planning/GlusterFS 4.0/caching.md delete mode 100644 Feature Planning/GlusterFS 4.0/code-generation.md delete mode 100644 Feature Planning/GlusterFS 4.0/composite-operations.md delete mode 100644 Feature Planning/GlusterFS 4.0/dht-scalability.md delete mode 100644 Feature Planning/GlusterFS 4.0/index.md delete mode 100644 Feature Planning/GlusterFS 4.0/lockdep.md delete mode 100644 Feature Planning/GlusterFS 4.0/stat-xattr-cache.md delete mode 100644 Feature Planning/GlusterFS 4.0/volgen-rewrite.md delete mode 100644 
Feature Planning/index.md delete mode 100644 Features/README.md delete mode 100644 Features/afr-arbiter-volumes.md delete mode 100644 Features/afr-statistics.md delete mode 100644 Features/afr-v1.md delete mode 100644 Features/bitrot-docs.md delete mode 100644 Features/brick-failure-detection.md delete mode 100644 Features/dht.md delete mode 100644 Features/distributed-geo-rep.md delete mode 100644 Features/file-snapshot.md delete mode 100644 Features/gfid-access.md delete mode 100644 Features/glusterfs_nfs-ganesha_integration.md delete mode 100644 Features/heal-info-and-split-brain-resolution.md delete mode 100644 Features/leases.md delete mode 100644 Features/libgfapi.md delete mode 100644 Features/libgfchangelog.md delete mode 100644 Features/memory-usage.md delete mode 100644 Features/meta.md delete mode 100644 Features/mount_gluster_volume_using_pnfs.md delete mode 100644 Features/nufa.md delete mode 100644 Features/object-versioning.md delete mode 100644 Features/ovirt-integration.md delete mode 100644 Features/qemu-integration.md delete mode 100644 Features/quota-object-count.md delete mode 100644 Features/quota-scalability.md delete mode 100644 Features/rdmacm.md delete mode 100644 Features/readdir-ahead.md delete mode 100644 Features/rebalance.md delete mode 100644 Features/server-quorum.md delete mode 100644 Features/shard.md delete mode 100644 Features/tier.md delete mode 100644 Features/trash_xlator.md delete mode 100644 Features/upcall.md delete mode 100644 Features/worm.md delete mode 100644 Features/zerofill.md create mode 100644 done/Features/README.md create mode 100644 done/Features/afr-arbiter-volumes.md create mode 100644 done/Features/afr-statistics.md create mode 100644 done/Features/afr-v1.md create mode 100644 done/Features/bitrot-docs.md create mode 100644 done/Features/brick-failure-detection.md create mode 100644 done/Features/dht.md create mode 100644 done/Features/distributed-geo-rep.md create mode 100644 done/Features/file-snapshot.md create mode 100644 done/Features/gfid-access.md create mode 100644 done/Features/glusterfs_nfs-ganesha_integration.md create mode 100644 done/Features/heal-info-and-split-brain-resolution.md create mode 100644 done/Features/leases.md create mode 100644 done/Features/libgfapi.md create mode 100644 done/Features/libgfchangelog.md create mode 100644 done/Features/memory-usage.md create mode 100644 done/Features/meta.md create mode 100644 done/Features/mount_gluster_volume_using_pnfs.md create mode 100644 done/Features/nufa.md create mode 100644 done/Features/object-versioning.md create mode 100644 done/Features/ovirt-integration.md create mode 100644 done/Features/qemu-integration.md create mode 100644 done/Features/quota-object-count.md create mode 100644 done/Features/quota-scalability.md create mode 100644 done/Features/rdmacm.md create mode 100644 done/Features/readdir-ahead.md create mode 100644 done/Features/rebalance.md create mode 100644 done/Features/server-quorum.md create mode 100644 done/Features/shard.md create mode 100644 done/Features/tier.md create mode 100644 done/Features/trash_xlator.md create mode 100644 done/Features/upcall.md create mode 100644 done/Features/worm.md create mode 100644 done/Features/zerofill.md create mode 100644 done/GlusterFS 3.5/AFR CLI enhancements.md create mode 100644 done/GlusterFS 3.5/Brick Failure Detection.md create mode 100644 done/GlusterFS 3.5/Disk Encryption.md create mode 100644 done/GlusterFS 3.5/Exposing Volume Capabilities.md create mode 100644 done/GlusterFS 3.5/File 
Snapshot.md create mode 100644 done/GlusterFS 3.5/Onwire Compression-Decompression.md create mode 100644 done/GlusterFS 3.5/Quota Scalability.md create mode 100644 done/GlusterFS 3.5/Virt store usecase.md create mode 100644 done/GlusterFS 3.5/Zerofill.md create mode 100644 done/GlusterFS 3.5/gfid access.md create mode 100644 done/GlusterFS 3.5/index.md create mode 100644 done/GlusterFS 3.5/libgfapi with qemu libvirt.md create mode 100644 done/GlusterFS 3.5/readdir ahead.md create mode 100644 done/GlusterFS 3.6/Better Logging.md create mode 100644 done/GlusterFS 3.6/Better Peer Identification.md create mode 100644 done/GlusterFS 3.6/Gluster User Serviceable Snapshots.md create mode 100644 done/GlusterFS 3.6/Gluster Volume Snapshot.md create mode 100644 done/GlusterFS 3.6/New Style Replication.md create mode 100644 done/GlusterFS 3.6/Persistent AFR Changelog xattributes.md create mode 100644 done/GlusterFS 3.6/RDMA Improvements.md create mode 100644 done/GlusterFS 3.6/Server-side Barrier feature.md create mode 100644 done/GlusterFS 3.6/Thousand Node Gluster.md create mode 100644 done/GlusterFS 3.6/afrv2.md create mode 100644 done/GlusterFS 3.6/better-ssl.md create mode 100644 done/GlusterFS 3.6/disperse.md create mode 100644 done/GlusterFS 3.6/glusterd volume locks.md create mode 100644 done/GlusterFS 3.6/heterogeneous-bricks.md create mode 100644 done/GlusterFS 3.6/index.md create mode 100644 done/GlusterFS 3.7/Archipelago Integration.md create mode 100644 done/GlusterFS 3.7/BitRot.md create mode 100644 done/GlusterFS 3.7/Clone of Snapshot.md create mode 100644 done/GlusterFS 3.7/Data Classification.md create mode 100644 done/GlusterFS 3.7/Easy addition of Custom Translators.md create mode 100644 done/GlusterFS 3.7/Exports and Netgroups Authentication.md create mode 100644 done/GlusterFS 3.7/Gluster CLI for NFS Ganesha.md create mode 100644 done/GlusterFS 3.7/Gnotify.md create mode 100644 done/GlusterFS 3.7/HA for Ganesha.md create mode 100644 done/GlusterFS 3.7/Improve Rebalance Performance.md create mode 100644 done/GlusterFS 3.7/Object Count.md create mode 100644 done/GlusterFS 3.7/Policy based Split-brain Resolution.md create mode 100644 done/GlusterFS 3.7/SE Linux Integration.md create mode 100644 done/GlusterFS 3.7/Scheduling of Snapshot.md create mode 100644 done/GlusterFS 3.7/Sharding xlator.md create mode 100644 done/GlusterFS 3.7/Small File Performance.md create mode 100644 done/GlusterFS 3.7/Trash.md create mode 100644 done/GlusterFS 3.7/Upcall Infrastructure.md create mode 100644 done/GlusterFS 3.7/arbiter.md create mode 100644 done/GlusterFS 3.7/index.md create mode 100644 done/GlusterFS 3.7/rest-api.md create mode 100644 in_progress/Better Brick Mgmt.md create mode 100644 in_progress/Compression Dedup.md create mode 100644 in_progress/Split Network.md create mode 100644 in_progress/caching.md create mode 100644 in_progress/code-generation.md create mode 100644 in_progress/composite-operations.md create mode 100644 in_progress/dht-scalability.md create mode 100644 in_progress/index.md create mode 100644 in_progress/lockdep.md create mode 100644 in_progress/stat-xattr-cache.md create mode 100644 in_progress/template.md create mode 100644 in_progress/volgen-rewrite.md diff --git a/Feature Planning/Feature Template.md b/Feature Planning/Feature Template.md deleted file mode 100644 index b648a86..0000000 --- a/Feature Planning/Feature Template.md +++ /dev/null @@ -1,93 +0,0 @@ -Feature -------- - -Summary -------- - -*Brief Description of the Feature * - -Owners ------- - 
-**Feature Owners** - *Ideally includes you * :-) - -Current status --------------- - -*Provide details on related existing features, if any and why this new feature is needed* - -Related Feature Requests and Bugs ---------------------------------- - -*Link all the related feature requests and bugs in [https://bugzilla.redhat.com bugzilla] here. If there is no bug filed for this feature, please do so now. Add a comment and a link to this page in each related bug.* - -Detailed Description --------------------- - -*Detailed Feature Description* - -Benefit to GlusterFS --------------------- - -*Describe Value additions to GlusterFS* - -Scope ------ - -#### Nature of proposed change - -*modification to existing code, new translators ...* - -#### Implications on manageability - -*Glusterd, GlusterCLI, Web Console, REST API* - -#### Implications on presentation layer - -*NFS/SAMBA/UFO/FUSE/libglusterfsclient Integration* - -#### Implications on persistence layer - -*LVM, XFS, RHEL ...* - -#### Implications on 'GlusterFS' backend - -*brick's data format, layout changes* - -#### Modification to GlusterFS metadata - -*extended attributes used, internal hidden files to keep the metadata...* - -#### Implications on 'glusterd' - -*persistent store, configuration changes, brick-op...* - -How To Test ------------ - -*Description on Testing the feature* - -User Experience ---------------- - -*Changes in CLI, effect on User experience...* - -Dependencies ------------- - -*Dependencies, if any* - -Documentation -------------- - -*Documentation for the feature* - -Status ------- - -*Status of development - Design Ready, In development, Completed* - -Comments and Discussion ------------------------ - -*Follow here* \ No newline at end of file diff --git a/Feature Planning/GlusterFS 3.5/AFR CLI enhancements.md b/Feature Planning/GlusterFS 3.5/AFR CLI enhancements.md deleted file mode 100644 index 88f4980..0000000 --- a/Feature Planning/GlusterFS 3.5/AFR CLI enhancements.md +++ /dev/null @@ -1,204 +0,0 @@ -Feature -------- - -AFR CLI enhancements - -SUMMARY -------- - -Presently the AFR reporting via CLI has lots of problems in the -representation of logs because of which they may not be able to use the -data effectively. This feature is to correct these problems and provide -a coherent mechanism to present heal status,information and the logs -associated. - -Owners ------- - -Venkatesh Somayajulu -Raghavan - -Current status --------------- - -There are many bugs related to this which indicates the current status -and why these requirements are required. - -​1) 924062 - gluster volume heal info shows only gfids in some cases and -sometimes names. This is very confusing for the end user. - -​2) 852294 - gluster volume heal info hangs/crashes when there is a -large number of entries to be healed. - -​3) 883698 - when self heal daemon is turned off, heal info does not -show any output. But healing can happen because of lookups from IO path. -Hence list of entries to be healed still needs to be shown. - -​4) 921025 - directories are not reported when list of split brain -entries needs to be displayed. - -​5) 981185 - when self heal daemon process is offline, volume heal info -gives error as "staging failure" - -​6) 952084 - We need a command to resolve files in split brain state. - -​7) 986309 - We need to report source information for files which got -healed during a self heal session. 
- -​8) 986317 - Sometimes list of files to get healed also includes files -to which IO s being done since the entries for these files could be in -the xattrop directory. This could be confusing for the user. - -There is a master bug 926044 that sums up most of the above problems. It -does give the QA perspective of the current representation out of the -present reporting infrastructure. - -Detailed Description --------------------- - -​1) One common thread among all the above complaints is that the -information presented to the user is FUD because of the following -reasons: - -(a) Split brain itself is a scary scenario especially with VMs. -(b) The data that we present to the users cannot be used in a stable - manner for them to get to the list of these files. For ex: we - need to give mechanisms by which he can automate the resolution out - of split brain. -(c) The logs that are generated are all the more scarier since we - see repetition of some error lines running into hundreds of lines. - Our mailing lists are filled with such emails from end users. - -Any data is useless unless it is associated with an event. For self -heal, the event that leads to self heal is the loss of connectivity to a -brick from a client. So all healing info and especially split brain -should be associated with such events. - -The following is hence the proposed mechanism: - -(a) Every loss of a brick from client's perspective is logged and - available via some ID. The information provides the time from when - the brick went down to when it came up. Also it should also report - the number of IO transactions(modifies) that hapenned during this - event. -(b) The list of these events are available via some CLI command. The - actual command needs to be detailed as part of this feature. -(c) All volume info commands regarding list of files to be healed, - files healed and split brain files should be associated with this - event(s). - -​2) Provide a mechanism to show statistics at a volume and replica group -level. It should show the number of files to be healed and number of -split brain files at both the volume and replica group level. - -​3) Provide a mechanism to show per volume list of files to be -healed/files healed/split brain in the following info: - -This should have the following information: - -(a) File name -(b) Bricks location -(c) Event association (brick going down) -(d) Source -(v) Sink - -​4) Self heal crawl statistics - Introduce new CLI commands for showing -more information on self heal crawl per volume. - -(a) Display why a self heal crawl ran (timeouts, brick coming up) -(b) Start time and end time -(c) Number of files it attempted to heal -(d) Location of the self heal daemon - -​5) Scale the logging infrastructure to handle huge number of file list -that needs to be displayed as part of the logging. - -(a) Right now the system crashes or hangs in case of a high number - of files. -(b) It causes CLI timeouts arbitrarily. The latencies involved in - the logging have to be studied (profiled) and mechanisms to - circumvent them have to be introduced. -(c) All files are displayed on the output. Have a better way of - representing them. - -Options are: - -(a) Maybe write to a glusterd log file or have a seperate directory - for afr heal logs. -(b) Have a status kind of command. This will display the current - status of the log building and maybe have batched way of - representing when there is a huge list. 
- -​6) We should provide mechanism where the user can heal split brain by -some pre-established policies: - -(a) Let the system figure out the latest files (assuming all nodes - are in time sync) and choose the copies that have the latest time. -(b) Choose one particular brick as the source for split brain and - heal all split brains from this brick. -(c) Just remove the split brain information from changelog. We leave - the exercise to the user to repair split brain where in he would - rewrite to the split brained files. (right now the user is forced to - remove xattrs manually for this step). - -Benefits to GlusterFS --------------------- - -Makes the end user more aware of healing status and provides statistics. - -Scope ------ - -6.1. Nature of proposed change - -Modification to AFR and CLI and glusterd code - -6.2. Implications on manageability - -New CLI commands to be added. Existing commands to be improved. - -6.3. Implications on presentation layer - -N/A - -6.4. Implications on persistence layer - -N/A - -6.5. Implications on 'GlusterFS' backend - -N/A - -6.6. Modification to GlusterFS metadata - -N/A - -6.7. Implications on 'glusterd' - -Changes for healing specific commands will be introduced. - -How To Test ------------ - -See documentation session - -User Experience ---------------- - -*Changes in CLI, effect on User experience...* - -Documentation -------------- - - - -Status ------- - -Patches : - - - -Status: - -Merged \ No newline at end of file diff --git a/Feature Planning/GlusterFS 3.5/Brick Failure Detection.md b/Feature Planning/GlusterFS 3.5/Brick Failure Detection.md deleted file mode 100644 index 9952698..0000000 --- a/Feature Planning/GlusterFS 3.5/Brick Failure Detection.md +++ /dev/null @@ -1,151 +0,0 @@ -Feature -------- - -Brick Failure Detection - -Summary -------- - -This feature attempts to identify storage/file system failures and -disable the failed brick without disrupting the remainder of the node's -operation. - -Owners ------- - -Vijay Bellur with help from Niels de Vos (or the other way around) - -Current status --------------- - -Currently, if the underlying storage or file system failure happens, a -brick process will continue to function. In some cases, a brick can hang -due to failures in the underlying system. Due to such hangs in brick -processes, applications running on glusterfs clients can hang. - -Detailed Description --------------------- - -Detecting failures on the filesystem that a brick uses makes it possible -to handle errors that are caused from outside of the Gluster -environment. - -There have been hanging brick processes when the underlying storage of a -brick went unavailable. A hanging brick process can still use the -network and repond to clients, but actual I/O to the storage is -impossible and can cause noticible delays on the client side. - -Benefit to GlusterFS --------------------- - -Provide better detection of storage subsytem failures and prevent bricks -from hanging. - -Scope ------ - -### Nature of proposed change - -Add a health-checker to the posix xlator that periodically checks the -status of the filesystem (implies checking of functional -storage-hardware). - -### Implications on manageability - -When a brick process detects that the underlaying storage is not -responding anymore, the process will exit. There is no automated way -that the brick process gets restarted, the sysadmin will need to fix the -problem with the storage first. 
- -After correcting the storage (hardware or filesystem) issue, the -following command will start the brick process again: - - # gluster volume start force - -### Implications on presentation layer - -None - -### Implications on persistence layer - -None - -### Implications on 'GlusterFS' backend - -None - -### Modification to GlusterFS metadata - -None - -### Implications on 'glusterd' - -'glusterd' can detect that the brick process has exited, -`gluster volume status` will show that the brick process is not running -anymore. System administrators checking the logs should be able to -triage the cause. - -How To Test ------------ - -The health-checker thread that is part of each brick process will get -started automatically when a volume has been started. Verifying its -functionality can be done in different ways. - -On virtual hardware: - -- disconnect the disk from the VM that holds the brick - -On real hardware: - -- simulate a RAID-card failure by unplugging the card or cables - -On a system that uses LVM for the bricks: - -- use device-mapper to load an error-table for the disk, see [this - description](http://review.gluster.org/5176). - -On any system (writing to random offsets of the block device, more -difficult to trigger): - -1. cause corruption on the filesystem that holds the brick -2. read contents from the brick, hoping to hit the corrupted area -3. the filsystem should abort after hitting a bad spot, the - health-checker should notice that shortly afterwards - -User Experience ---------------- - -No more hanging brick processes when storage-hardware or the filesystem -fails. - -Dependencies ------------- - -Posix translator, not available for the BD-xlator. - -Documentation -------------- - -The health-checker is enabled by default and runs a check every 30 -seconds. This interval can be changed per volume with: - - # gluster volume set storage.health-check-interval - -If `SECONDS` is set to 0, the health-checker will be disabled. - -For further details refer: - - -Status ------- - -glusterfs-3.4 and newer include a health-checker for the posix xlator, -which was introduced with [bug -971774](https://bugzilla.redhat.com/971774): - -- [posix: add a simple - health-checker](http://review.gluster.org/5176)? - -Comments and Discussion ------------------------ diff --git a/Feature Planning/GlusterFS 3.5/Disk Encryption.md b/Feature Planning/GlusterFS 3.5/Disk Encryption.md deleted file mode 100644 index 4c6ab89..0000000 --- a/Feature Planning/GlusterFS 3.5/Disk Encryption.md +++ /dev/null @@ -1,443 +0,0 @@ -Feature -======= - -Transparent encryption. Allows a volume to be encrypted "at rest" on the -server using keys only available on the client. - -1 Summary -========= - -Distributed systems impose tighter requirements to at-rest encryption. -This is because your encrypted data will be stored on servers, which are -de facto untrusted. In particular, your private encrypted data can be -subjected to analysis and tampering, which eventually will lead to its -revealing, if it is not properly protected. Specifically, usually it is -not enough to just encrypt data. In distributed systems serious -protection of your personal data is possible only in conjunction with a -special process, which is called authentication. GlusterFS provides such -enhanced service: In GlusterFS encryption is enhanced with -authentication. Currently we provide protection from "silent tampering". -This is a kind of tampering, which is hard to detect, because it doesn't -break POSIX compliance. 
Specifically, we protect encryption-specific -file's metadata. Such metadata includes unique file's object id (GFID), -cipher algorithm id, cipher block size and other attributes used by the -encryption process. - -1.1 Restrictions ----------------- - -​1. We encrypt only file content. The feature of transparent encryption -doesn't protect file names: they are neither encrypted, nor verified. -Protection of file names is not so critical as protection of -encryption-specific file's metadata: any attacks based on tampering file -names will break POSIX compliance and result in massive corruption, -which is easy to detect. - -​2. The feature of transparent encryption doesn't work in NFS-mounts of -GlusterFS volumes: NFS's file handles introduce security issues, which -are hard to resolve. NFS mounts of encrypted GlusterFS volumes will -result in failed file operations (see section "Encryption in different -types of mount sessions" for more details). - -​3. The feature of transparent encryption is incompatible with GlusterFS -performance translators quick-read, write-behind and open-behind. - -2 Owners -======== - -Jeff Darcy -Edward Shishkin - -3 Current status -================ - -Merged to the upstream. - -4 Detailed Description -====================== - -See Summary. - -5 Benefit to GlusterFS -====================== - -Besides the justifications that have applied to on-disk encryption just -about forever, recent events have raised awareness significantly. -Encryption using keys that are physically present at the server leaves -data vulnerable to physical seizure of the server. Encryption using keys -that are kept by the same organization entity leaves data vulnerable to -"insider threat" plus coercion or capture at the organization level. For -many, especially various kinds of service providers, only pure -client-side encryption provides the necessary levels of privacy and -deniability. - -Competitively, other projects - most notably -[Tahoe-LAFS](https://leastauthority.com/) - are already using recently -heightened awareness of these issues to attract users who would be -better served by our performance/scalability, usability, and diversity -of interfaces. Only the lack of proper encryption holds us back in these -cases. - -6 Scope -======= - -6.1. Nature of proposed change ------------------------------- - -This is a new client-side translator, using user-provided key -information plus information stored in xattrs to encrypt data -transparently as it's written and decrypt when it's read. - -6.2. Implications on manageability ----------------------------------- - -User needs to manage a per-volume master key (MK). That is: - -​1) Generate an independent MK for every volume which is to be -encrypted. Note, that one MK is created for the whole life of the -volume. - -​2) Provide MK on the client side at every mount in accordance with the -location, which has been specified at volume create time, or overridden -via respective mount option (see section How To Test). - -​3) Keep MK between mount sessions. Note that after successful mount MK -may be removed from the specified location. In this case user should -retain MK safely till next mount session. - -MK is a 256-bit secret string, which is known only to user. Generating -and retention of MK is in user's competence. - -WARNING!!! Losing MK will make content of all regular files of your -volume inaccessible. 
It is possible to mount a volume with improper MK, -however such mount sessions will allow to access only file names as they -are not encrypted. - -Recommendations on MK generation - -MK has to be a high-entropy key, appropriately generated by a key -derivation algorithm. One of the possible ways is using rand(1) provided -by the OpenSSL package. You need to specify the option "-hex" for proper -output format. For example, the next command prints a generated key to -the standard output: - - $ openssl rand -hex 32 - -6.3. Implications on presentation layer ---------------------------------------- - -N/A - -6.4. Implications on persistence layer --------------------------------------- - -N/A - -6.5. Implications on 'GlusterFS' backend ----------------------------------------- - -All encrypted files on the servers contains padding at the end of file. -That is, size of all enDefines location of the master volume key on the -trusted client machine.crypted files on the servers is multiple to -cipher block size. Real file size is stored as file's xattr with the key -"trusted.glusterfs.crypt.att.size". The translation padded-file-size -\> -real-file-size (and backward) is performed by the crypt translator. - -6.6. Modification to GlusterFS metadata ---------------------------------------- - -Encryption-specific metadata in specified format is stored as file's -xattr with the key "trusted.glusterfs.crypt.att.cfmt". Current format of -metadata string is described in the slide \#27 of the following [ design -document](http://www.gluster.org/community/documentation/index.php/File:GlusterFS_transparent_encryption.pdf) - -6.7. Options of the crypt translator ------------------------------------- - -- data-cipher-alg - -Specifies cipher algorithm for file data encryption. Currently only one -option is available: AES\_XTS. This is hidden option. - -- block-size - -Specifies size (in bytes) of logical chunk which is encrypted as a whole -unit in the file body. If cipher modes with initial vectors are used for -encryption, then the initial vector gets reset for every such chunk. -Available values are: "512", "1024", "2048" and "4096". Default value is -"4096". - -- data-key-size - -Specifies size (in bits) of data cipher key. For AES\_XTS available -values are: "256" and "512". Default value is "256". The larger key size -("512") is for stronger security. - -- master-key - -Specifies pathname of the regular file, or symlink. Defines location of -the master volume key on the trusted client machine. - -7 Getting Started With Crypt Translator -======================================= - -​1. Create a volume . - -​2. Turn on crypt xlator: - - # gluster volume set `` encryption on - -​3. Turn off performance xlators that currently encryption is -incompatible with: - - # gluster volume set  performance.quick-read off - # gluster volume set  performance.write-behind off - # gluster volume set  performance.open-behind off - -​4. (optional) Set location of the volume master key: - - # gluster volume set  encryption.master-key  - -where is an absolute pathname of the file, which -will contain the volume master key (see section implications on -manageability). - -​5. (optional) Override default options of crypt xlator: - - # gluster volume set  encryption.data-key-size  - -where should have one of the following values: -"256"(default), "512". - - # gluster volume set  encryption.block-size  - -where should have one of the following values: "512", -"1024", "2048", "4096"(default). - -​6. 
Define location of the master key on your client machine, if it -wasn't specified at section 4 above, or you want it to be different from -the , specified at section 4. - -​7. On the client side make sure that the file with name - (or defined at section -6) exists and contains respective per-volume master key (see section -implications on manageability). This key has to be in hex form, i.e. -should be represented by 64 symbols from the set {'0', ..., '9', 'a', -..., 'f'}. The key should start at the beginning of the file. All -symbols at offsets \>= 64 are ignored. - -NOTE: (or defined at -step 6) can be a symlink. In this case make sure that the target file of -this symlink exists and contains respective per-volume master key. - -​8. Mount the volume on the client side as usual. If you -specified a location of the master key at section 6, then use the mount -option - ---xlator-option=.master-key= - -where is location of master key specified at -section 6, is suffixed with "-crypt". For -example, if you created a volume "myvol" in the step 1, then -suffixed\_vol\_name is "myvol-crypt". - -​9. During mount your client machine receives configuration info from -the untrusted server, so this step is extremely important! Check, that -your volume is really encrypted, and that it is encrypted with the -proper master key (see FAQ \#1,\#2). - -​10. (optional) After successful mount the file which contains master -key may be removed. NOTE: Next mount session will require the master-key -again. Keeping the master key between mount sessions is in user's -competence (see section implications on manageability). - -8 How to test -============= - -From a correctness standpoint, it's sufficient to run normal tests with -encryption enabled. From a security standpoint, there's a whole -discipline devoted to analysing the stored data for weaknesses, and -engagement with practitioners of that discipline will be necessary to -develop the right tests. - -9 Dependencies -============== - -Crypt translator requires OpenSSL of version \>= 1.0.1 - -10 Documentation -================ - -10.1 Basic design concepts --------------------------- - -The basic design concepts are described in the following [pdf -slides](http://www.gluster.org/community/documentation/index.php/File:GlusterFS_transparent_encryption.pdf) - -10.2 Procedure of security open -------------------------------- - -So, in accordance with the basic design concepts above, before every -access to a file's body (by read(2), write(2), truncate(2), etc) we need -to make sure that the file's metadata is trusted. Otherwise, we risk to -deal with untrusted file's data. - -To make sure that file's metadata is trusted, file is subjected to a -special procedure of security open. The procedure of security open is -performed by crypt translator at FOP-\>open() (crypt\_open) time by the -function open\_format(). Currently this is a hardcoded composition of 2 -checks: - -1. verification of file's GFID by the file name; -2. verification of file's metadata by the verified GFID; - -If the security open succeeds, then the cache of trusted client machine -is replenished with file descriptor and file's inode, and user can -access the file's content by read(2), write(2), ftruncate(2), etc. -system calls, which accept file descriptor as argument. - -However, file API also allows to accept file body without opening the -file. For example, truncate(2), which accepts pathname instead of file -descriptor. 
To make sure that file's metadata is trusted, we create a -temporal file descriptor and mandatory call crypt\_open() before -truncating the file's body. - -10.3 Encryption in different types of mount sessions ----------------------------------------------------- - -Everything described in the section above is valid only for FUSE-mounts. -Besides, GlusterFS also supports so-called NFS-mounts. From the -standpoint of security the key difference between the mentioned types of -mount sessions is that in NFS-mount sessions file operations instead of -file name accept a so-called file handle (which is actually GFID). It -creates problems, since the file name is a basic point for verification. -As it follows from the section above, using the step 1, we can replenish -the cache of trusted machine with trusted file handles (GFIDs), and -perform a security open only by trusted GFID (by the step 2). However, -in this case we need to make sure that there is no leaks of non-trusted -GFIDs (and, moreover, such leaks won't be introduced by the development -process in future). This is possible only with changed GFID format: -everywhere in GlusterFS GFID should appear as a pair (uuid, -is\_verified), where is\_verified is a boolean variable, which is true, -if this GFID passed off the procedure of verification (step 1 in the -section above). - -The next problem is that current NFS protocol doesn't encrypt the -channel between NFS client and NFS server. It means that in NFS-mounts -of GlusterFS volumes NFS client and GlusterFS client should be the same -(trusted) machine. - -Taking into account the described problems, encryption in GlusterFS is -not supported in NFS-mount sessions. - -10.4 Class of cipher algorithms for file data encryption that can be supported by the crypt translator ------------------------------------------------------------------------------------------------------- - -We'll assume that any symmetric block cipher algorithm is completely -determined by a pair (alg\_id, mode\_id), where alg\_id is an algorithm -defined on elementary cipher blocks (e.g. AES), and mode\_id is a mode -of operation (e.g. ECB, XTS, etc). - -Technically, the crypt translator is able to support any symmetric block -cipher algorithms via additional options of the crypt translator. -However, in practice the set of supported algorithms is narrowed because -of various security and organization issues. Currently we support only -one algotithm. This is AES\_XTS. - -10.5 Bibliography ------------------ - -1. Recommendations for for Block Cipher Modes of Operation (NIST - Special Publication 800-38A). -2. Recommendation for Block Cipher Modes of Operation: The XTS-AES Mode - for Confidentiality on Storage Devices (NIST Special Publication - 800-38E). -3. Recommendation for Key Derivation Using Pseudorandom Functions, - (NIST Special Publication 800-108). -4. Recommendation for Block Cipher Modes of Operation: The CMAC Mode - for Authentication, (NIST Special Publication 800-38B). -5. Recommendation for Block Cipher Modes of Operation: Methods for Key - Wrapping, (NIST Special Publication 800-38F). -6. FIPS PUB 198-1 The Keyed-Hash Message Authentication Code (HMAC). -7. David A. McGrew, John Viega "The Galois/Counter Mode of Operation - (GCM)". - -11 FAQ -====== - -**1. How to make sure that my volume is really encrypted?** - -Check the respective graph of translators on your trusted client -machine. 
This graph is created at mount time and is stored by default in -the file /usr/local/var/log/glusterfs/mountpoint.log - -Here "mountpoint" is the absolute name of the mountpoint, where "/" are -replaced with "-". For example, if your volume is mounted to -/mnt/testfs, then you'll need to check the file -/usr/local/var/log/glusterfs/mnt-testfs.log - -Make sure that this graph contains the crypt translator, which looks -like the following: - - 13: volume xvol-crypt - 14:     type encryption/crypt - 15:     option master-key /home/edward/mykey - 16:     subvolumes xvol-dht - 17: end-volume - -**2. How to make sure that my volume is encrypted with a proper master -key?** - -Check the graph of translators on your trusted client machine (see the -FAQ\#1). Make sure that the option "master-key" of the crypt translator -specifies correct location of the master key on your trusted client -machine. - -**3. Can I change the encryption status of a volume?** - -You can change encryption status (enable/disable encryption) only for -empty volumes. Otherwise it will be incorrect (you'll end with IO -errors, data corruption and security problems). We strongly recommend to -decide once and forever at volume creation time, whether your volume has -to be encrypted, or not. - -**4. I am able to mount my encrypted volume with improper master keys -and get list of file names for every directory. Is it normal?** - -Yes, it is normal. It doesn't contradict the announced functionality: we -encrypt only file's content. File names are not encrypted, so it doesn't -make sense to hide them on the trusted client machine. - -**5. What is the reason for only supporting AES-XTS? This mode is not -using Intel's AES-NI instruction thus not utilizing hardware feature..** - -Distributed file systems impose tighter requirements to at-rest -encryption. We offer more than "at-rest-encryption". We offer "at-rest -encryption and authentication in distributed systems with non-trusted -servers". Data and metadata on the server can be easily subjected to -tampering and analysis with the purpose to reveal secret user's data. -And we have to resist to this tampering by performing data and metadata -authentication. - -Unfortunately, it is technically hard to implement full-fledged data -authentication via a stackable file system (GlusterFS translator), so we -have decided to perform a "light" authentication by using a special -cipher mode, which is resistant to tampering. Currently OpenSSL supports -only one such mode: this is XTS. Tampering of ciphertext created in XTS -mode will lead to unpredictable changes in the plain text. That said, -user will see "unpredictable gibberish" on the client side. Of course, -this is not an "official way" to detect tampering, but this is much -better than nothing. The "official way" (creating/checking MACs) we use -for metadata authentication. - -Other modes like CBC, CFB, OFB, etc supported by OpenSSL are strongly -not recommended for use in distributed systems with non-trusted servers. -For example, CBC mode doesn't "survive" overwrite of a logical block in -a file. It means that with every such overwrite (standard file system -operation) we'll need to re-encrypt the whole(!) file with different -key. CFB and OFB modes are sensitive to tampering: there is a way to -perform \*predictable\* changes in plaintext, which is unacceptable. - -Yes, XTS is slow (at least its current implementation in OpenSSL), but -we don't promise, that CFB, OFB with full-fledged authentication will be -faster. So.. 
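Pulling the pieces of sections 6.2 and 7 together, a minimal end-to-end setup on a fresh volume might look like the sketch below. The volume name `myvol`, server `server1`, mount point `/mnt/myvol` and key path `/root/myvol.key` are illustrative only; the commands themselves are the ones documented in the steps above.

    # Trusted client: generate and keep the 256-bit master key (hex form)
    openssl rand -hex 32 > /root/myvol.key

    # Server: enable encryption and turn off the incompatible performance xlators
    gluster volume set myvol encryption on
    gluster volume set myvol performance.quick-read off
    gluster volume set myvol performance.write-behind off
    gluster volume set myvol performance.open-behind off
    gluster volume set myvol encryption.master-key /root/myvol.key

    # Trusted client: mount, pointing the crypt xlator ("myvol-crypt") at the key
    glusterfs --volfile-server=server1 --volfile-id=myvol \
        --xlator-option=myvol-crypt.master-key=/root/myvol.key /mnt/myvol

After checking the mounted graph as described in FAQ #1, the key file may be removed until the next mount session.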
diff --git a/Feature Planning/GlusterFS 3.5/Exposing Volume Capabilities.md b/Feature Planning/GlusterFS 3.5/Exposing Volume Capabilities.md deleted file mode 100644 index 0f72fbc..0000000 --- a/Feature Planning/GlusterFS 3.5/Exposing Volume Capabilities.md +++ /dev/null @@ -1,161 +0,0 @@ -Feature -------- - -Provide a capability to: - -- Probe the type (posix or bd) of volume. -- Provide list of capabilities of a xlator/volume. For example posix - xlator could support zerofill, BD xlator could support offloaded - copy, thin provisioning etc - -Summary -------- - -With multiple storage translators (posix and bd) being supported in -GlusterFS, it becomes necessary to know the volume type so that user can -issue appropriate calls that are relevant only to the a given volume -type. Hence there needs to be a way to expose the type of the storage -translator of the volume to the user. - -BD xlator is capable of providing server offloaded file copy, -server/storage offloaded zeroing of a file etc. This capabilities should -be visible to the client/user, so that these features can be exploited. - -Owners ------- - -M. Mohan Kumar -Bharata B Rao. - -Current status --------------- - -BD xlator exports capability information through gluster volume info -(and --xml) output. For eg: - -*snip of gluster volume info output for a BD based volume* - - Xlator 1: BD - Capability 1: thin - -*snip of gluster volume info --xml output for a BD based volume* - - -    -     BD -      -       thin -      -    - - -But this capability information should also exposed through some other -means so that a host which is not part of Gluster peer could also avail -this capabilities. - -Exposing about type of volume (ie posix or BD) is still in conceptual -state currently and needs discussion. - -Detailed Description --------------------- - -1. Type -- BD translator supports both regular files and block device, -i,e., one can create files on GlusterFS volume backed by BD -translator and this file could end up as regular posix file or a -logical volume (block device) based on the user's choice. User -can do a setxattr on the created file to convert it to a logical -volume. -- Users of BD backed volume like QEMU would like to know that it -is working with BD type of volume so that it can issue an -additional setxattr call after creating a VM image on GlusterFS -backend. This is necessary to ensure that the created VM image -is backed by LV instead of file. -- There are different ways to expose this information (BD type of -volume) to user. One way is to export it via a getxattr call. - -2. Capabilities -- BD xlator supports new features such as server offloaded file -copy, thin provisioned VM images etc (there is a patch posted to -Gerrit to add server offloaded file zeroing in posix xlator). -There is no standard way of exploiting these features from -client side (such as syscall to exploit server offloaded copy). -So these features need to be exported to the client so that they -can be used. BD xlator V2 patch exports these capabilities -information through gluster volume info (and --xml) output. But -if a client is not part of GlusterFS peer it can't run volume -info command to get the list of capabilities of a given -GlusterFS volume. Also GlusterFS block driver in qemu need to -get the capability list so that these features are used. - -Benefit to GlusterFS --------------------- - -Enables proper consumption of BD xlator and client exploits new features -added in both posix and BD xlator. 
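Since the proposal is to export both pieces of information through getxattr on the volume root, a client-side probe could be as simple as the sketch below. Note that this interface is still conceptual: the attribute names `volume_type` and `caps`, the mount point and the interpretation of a missing attribute are taken from the proposal in the Scope section that follows and may change.

    # Probe the storage translator type; BD answers, posix returns ENODATA
    if getfattr --only-values -n volume_type /mnt/myvol >/dev/null 2>&1; then
        echo "BD-backed volume"
    else
        echo "posix-backed volume (no volume_type xattr)"
    fi

    # List the capabilities (e.g. thin, offloaded copy) the volume advertises
    getfattr --only-values -n caps /mnt/myvol 2>/dev/null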
- -### Scope - -Nature of proposed change -------------------------- - -- Quickest way to expose volume type to a client can be achieved by - using getxattr fop. When a client issues getxattr("volume\_type") on - a root gfid, bd xlator will return 1 implying its BD xlator. But - posix xlator will return ENODATA and client code can interpret this - as posix xlator. - -- Also capability list can be returned via getxattr("caps") for root - gfid. - -Implications on manageability ------------------------------ - -None. - -Implications on presentation layer ----------------------------------- - -N/A - -Implications on persistence layer ---------------------------------- - -N/A - -Implications on 'GlusterFS' backend ------------------------------------ - -N/A - -Modification to GlusterFS metadata ----------------------------------- - -N/A - -Implications on 'glusterd' --------------------------- - -N/A - -How To Test ------------ - -User Experience ---------------- - -Dependencies ------------- - -Documentation -------------- - -Status ------- - -Patch : - -Status : Merged - -Comments and Discussion ------------------------ diff --git a/Feature Planning/GlusterFS 3.5/File Snapshot.md b/Feature Planning/GlusterFS 3.5/File Snapshot.md deleted file mode 100644 index b2d6c69..0000000 --- a/Feature Planning/GlusterFS 3.5/File Snapshot.md +++ /dev/null @@ -1,101 +0,0 @@ -Feature -------- - -File Snapshots in GlusterFS - -### Summary - -Ability to take snapshots of files in GlusterFS - -### Owners - -Anand Avati - -### Source code - -Patch for this feature - - -### Detailed Description - -The feature adds file snapshotting support to GlusterFS. '' To use this -feature the file format should be QCOW2 (from QEMU)'' . The patch takes -the block layer code from Qemu and converts it into a translator in -gluster. - -### Benefit to GlusterFS - -Better integration with Openstack Cinder, and in general ability to take -snapshots of files (typically VM images) - -### Usage - -*To take snapshot of a file, the file format should be QCOW2. To set -file type as qcow2 check step \#2 below* - -​1. Turning on snapshot feature : - - gluster volume set `` features.file-snapshot on - -​2. To set qcow2 file format: - - setfattr -n trusted.glusterfs.block-format -v qcow2:10GB  - -​3. To create a snapshot: - - setfattr -n trusted.glusterfs.block-snapshot-create -v  - -​4. To apply/revert back to a snapshot: - - setfattr -n trusted.glusterfs.block-snapshot-goto -v   - -### Scope - -#### Nature of proposed change - -The work is going to be a new translator. Very minimal changes to -existing code (minor change in syncops) - -#### Implications on manageability - -Will need ability to load/unload the translator in the stack. - -#### Implications on presentation layer - -Feature must be presentation layer independent. - -#### Implications on persistence layer - -No implications - -#### Implications on 'GlusterFS' backend - -Internal snapshots - No implications. External snapshots - there will be -hidden directories added. - -#### Modification to GlusterFS metadata - -New xattr will be added to identify files which are 'snapshot managed' -vs raw files. - -#### Implications on 'glusterd' - -Yet another turn on/off feature for glusterd. Volgen will have to add a -new translator in the generated graph. - -### How To Test - -Snapshots can be tested by taking snapshots along with checksum of the -state of the file, making further changes and going back to old snapshot -and verify the checksum again. 
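A minimal manual run of that test, using the xattr interface from the Usage section above, might look like the following sketch; the volume, mount point, image file name and snapshot name are illustrative.

    # Prepare: enable the feature and mark the image file as qcow2
    gluster volume set myvol features.file-snapshot on
    setfattr -n trusted.glusterfs.block-format -v qcow2:1GB /mnt/myvol/vm.img

    # Checksum the current state, then snapshot it
    md5sum /mnt/myvol/vm.img > before.md5
    setfattr -n trusted.glusterfs.block-snapshot-create -v snap1 /mnt/myvol/vm.img

    # Modify the image, revert to the snapshot, and verify the checksum matches
    dd if=/dev/urandom of=/mnt/myvol/vm.img bs=1M count=4 conv=notrunc
    setfattr -n trusted.glusterfs.block-snapshot-goto -v snap1 /mnt/myvol/vm.img
    md5sum -c before.md5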
- -### Dependencies - -Dependent QEMU code is imported into the codebase. - -### Documentation - - - -### Status - -Merged in master and available in Gluster3.5 \ No newline at end of file diff --git a/Feature Planning/GlusterFS 3.5/Onwire Compression-Decompression.md b/Feature Planning/GlusterFS 3.5/Onwire Compression-Decompression.md deleted file mode 100644 index a26aa7a..0000000 --- a/Feature Planning/GlusterFS 3.5/Onwire Compression-Decompression.md +++ /dev/null @@ -1,96 +0,0 @@ -Feature -======= - -On-Wire Compression/Decompression - -1. Summary -========== - -Translator to compress/decompress data in flight between client and -server. - -2. Owners -========= - -- Venky Shankar -- Prashanth Pai - -3. Current Status -================= - -Code has already been merged. Needs more testing. - -The [initial submission](http://review.gluster.org/3251) contained a -`compress` option, which introduced [some -confusion](https://bugzilla.redhat.com/1053670). [A correction has been -sent](http://review.gluster.org/6765) to rename the user visible options -to start with `network.compression`. - -TODO - -- Make xlator pluggable to add support for other compression methods -- Add support for lz4 compression: - -4. Detailed Description -======================= - -- When a writev call occurs, the client compresses the data before - sending it to server. On the server, compressed data is - decompressed. Similarly, when a readv call occurs, the server - compresses the data before sending it to client. On the client, the - compressed data is decompressed. Thus the amount of data sent over - the wire is minimized. - -- Compression/Decompression is done using Zlib library. - -- During normal operation, this is the format of data sent over wire: - + trailer(8 bytes). The trailer contains the CRC32 - checksum and length of original uncompressed data. This is used for - validation. - -5. Usage -======== - -Turning on compression xlator: - - # gluster volume set  network.compression on - -Configurable options: - - # gluster volume set  network.compression.compression-level 8 - # gluster volume set  network.compression.min-size 50 - -6. Benefits to GlusterFS -======================== - -Fewer bytes transferred over the network. - -7. Issues -========= - -- Issues with striped volumes. Compression xlator cannot work with - striped volumes - -- Issues with write-behind: Mount point hangs when writing a file with - write-behind xlator turned on. To overcome this, turn off - write-behind entirely OR set "performance.strict-write-ordering" to - on. - -- Issues with AFR: AFR v1 currently does not propagate xdata. - This issue has - been resolved in AFR v2. - -8. Dependencies -=============== - -Zlib library - -9. Documentation -================ - - - -10. Status -========== - -Code merged upstream. \ No newline at end of file diff --git a/Feature Planning/GlusterFS 3.5/Quota Scalability.md b/Feature Planning/GlusterFS 3.5/Quota Scalability.md deleted file mode 100644 index f3b0a0d..0000000 --- a/Feature Planning/GlusterFS 3.5/Quota Scalability.md +++ /dev/null @@ -1,99 +0,0 @@ -Feature -------- - -Quota Scalability - -Summary -------- - -Support upto 65536 quota configurations per volume. - -Owners ------- - -Krishnan Parthasarathi -Vijay Bellur - -Current status --------------- - -Current implementation of Directory Quota cannot scale beyond a few -hundred configured limits per volume. The aim of this feature is to -support upto 65536 quota configurations per volume. 
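For context, each such configuration is a per-directory hard limit managed
through the existing quota CLI, so reaching 65536 limits means commands like
the following being issued tens of thousands of times on a single volume
(volume and directory names below are examples):

    # Enable quota on the volume and configure per-directory hard limits.
    gluster volume quota homevol enable
    gluster volume quota homevol limit-usage /users/alice 5GB
    gluster volume quota homevol limit-usage /users/bob 5GB

    # List all configured limits on the volume.
    gluster volume quota homevol list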
- -Detailed Description --------------------- - -TBD - -Benefit to GlusterFS --------------------- - -More quotas can be configured in a single volume thereby leading to -support GlusterFS for use cases like home directory. - -Scope ------ - -### Nature of proposed change - -- Move quota enforcement translator to the server -- Introduce a new quota daemon which helps in aggregating directory - consumption on the server -- Enhance marker's accounting to be modular -- Revamp configuration persistence and CLI listing for better scale -- Allow configuration of soft limits in addition to hard limits. - -### Implications on manageability - -Mostly the CLI will be backward compatible. New CLI to be introduced -needs to be enumerated here. - -### Implications on presentation layer - -None - -### Implications on persistence layer - -None - -### Implications on 'GlusterFS' backend - -None - -### Modification to GlusterFS metadata - -- Addition of a new extended attribute for storing configured hard and -soft limits on directories. - -### Implications on 'glusterd' - -- New file based configuration persistence - -How To Test ------------ - -TBD - -User Experience ---------------- - -TBD - -Dependencies ------------- - -None - -Documentation -------------- - -TBD - -Status ------- - -In development - -Comments and Discussion ------------------------ diff --git a/Feature Planning/GlusterFS 3.5/Virt store usecase.md b/Feature Planning/GlusterFS 3.5/Virt store usecase.md deleted file mode 100644 index 3e649b2..0000000 --- a/Feature Planning/GlusterFS 3.5/Virt store usecase.md +++ /dev/null @@ -1,140 +0,0 @@ - Work In Progress - Author - Satheesaran Sundaramoorthi - - -**Introduction** ----------------- - -Gluster volumes are used to host Virtual Machines Images. (i.e) Virtual -machines Images are stored on gluster volumes. This usecase is popularly -known as *virt-store* usecase. - -This document explains more about, - -1. Enabling gluster volumes for virt-store usecase -2. Common Pitfalls -3. FAQs -4. References - -**Enabling gluster volumes for virt-store** -------------------------------------------- - -This section describes how to enable gluster volumes for virt store -usecase - -#### Volume Types - -Ideally gluster volumes serving virt-store, should provide -high-availability for the VMs running on it. If the volume is not -avilable, the VMs may move in to unusable state. So, its best -recommended to use **replica** or **distribute-replicate** volume for -this usecase - -*If you are new to GlusterFS, you can take a look at -[QuickStart](http://gluster.readthedocs.org/en/latest/Quick-Start-Guide/Quickstart/) guide or the [admin -guide](http://gluster.readthedocs.org/en/latest/Administrator%20Guide/README/)* - -#### Tunables - -The set of volume options are recommended for virt-store usecase, which -adds performance boost. Following are those options, - - quick-read=off - read-ahead=off - io-cache=off - stat-prefetch=off - eager-lock=enable - remote-dio=enable - quorum-type=auto - server-quorum-type=server - -- quick-read is meant for improving small-file read performance,which - is no longer reasonable for VM Image files -- read-ahead is turned off. VMs have their own way of doing that. This - is pretty usual to leave it to VM to determine the read-ahead -- io-cache is turned off -- stat-prefetch is turned off. stat-prefetch, caches the metadata - related to files and this is no longer a concern for VM Images (why - ?) -- eager-lock is turned on (why?) 
-- remote-dio is turned on,so in open() and creat() calls, O\_DIRECT - flag will be filtered at the client protocol level so server will - still continue to cache the file. -- quorum-type is set to auto. This basically enables client side - quorum. When client side quorum is enabled, there exists the rule - such that atleast half of the bricks in the replica group should be - UP and running. If not, the replica group would become read-only -- server-quorum-type is set to server. This basically enables - server-side quorum. This lays a condition that in a cluster, atleast - half the number of nodes, should be UP. If not the bricks ( read as - brick processes) will be killed, and thereby volume goes offline - -#### Applying the Tunables on the volume - -There are number of ways to do it. - -1. Make use of group-virt.example file -2. Copy & Paste - -##### Make use of group-virt.example file - -This is the method best suited and recommended. -*/etc/glusterfs/group-virt.example* has all options recommended for -virt-store as explained earlier. Copy this file, -*/etc/glusterfs/group-virt.example* to */var/lib/glusterd/groups/virt* - - cp /etc/glusterfs/group-virt.example /var/lib/glusterd/groups/virt - -Optimize the volume with all the options available in this *virt* file -in a single go - - gluster volume set group virt - -NOTE: No restart of the volume is required Verify the same with the -command, - - gluster volume info - -In forthcoming releases, this file will be automatically put in -*/var/lib/glusterd/groups/* and you can directly apply it on the volume - -##### Copy & Paste - -Copy all options from the above -section,[Virt-store-usecase\#Tunables](Virt-store-usecase#Tunables "wikilink") -and put in a file named *virt* in */var/lib/glusterd/groups/virt* Apply -all the options on the volume, - - gluster volume set group virt - -NOTE: This is not recommended, as the recommended volume options may/may -not change in future.Always stick to *virt* file available with the rpms - -#### Adding Ownership to Volume - -You can add uid:gid to the volume, - - gluster volume set storage.owner-uid - gluster volume set storage.owner-gid - -For example, when the volume would be accessed by qemu/kvm, you need to -add ownership as 107:107, - - gluster volume set storage.owner-uid 107 - gluster volume set storage.owner-gid 107 - -It would be 36:36 in the case of oVirt/RHEV, 165:165 in the case of -OpenStack Block Service (cinder),161:161 in case of OpenStack Image -Service (glance) is accessing this volume - -NOTE: Not setting the correct ownership may lead to "Permission Denied" -errors when accessing the image files residing on the volume - -**Common Pitfalls** -------------------- - -**FAQs** --------- - -**References** --------------- \ No newline at end of file diff --git a/Feature Planning/GlusterFS 3.5/Zerofill.md b/Feature Planning/GlusterFS 3.5/Zerofill.md deleted file mode 100644 index 43b279d..0000000 --- a/Feature Planning/GlusterFS 3.5/Zerofill.md +++ /dev/null @@ -1,192 +0,0 @@ -Feature -------- - -zerofill API for GlusterFS - -Summary -------- - -zerofill() API would allow creation of pre-allocated and zeroed-out -files on GlusterFS volumes by offloading the zeroing part to server -and/or storage (storage offloads use SCSI WRITESAME). - -Owners ------- - -Bharata B Rao -M. Mohankumar - -Current status --------------- - -Patch on gerrit: - -Detailed Description --------------------- - -Add support for a new ZEROFILL fop. Zerofill writes zeroes to a file in -the specified range. 
This fop will be useful when a whole file needs to -be initialized with zero (could be useful for zero filled VM disk image -provisioning or during scrubbing of VM disk images). - -Client/application can issue this FOP for zeroing out. Gluster server -will zero out required range of bytes ie server offloaded zeroing. In -the absence of this fop, client/application has to repetitively issue -write (zero) fop to the server, which is very inefficient method because -of the overheads involved in RPC calls and acknowledgements. - -WRITESAME is a SCSI T10 command that takes a block of data as input and -writes the same data to other blocks and this write is handled -completely within the storage and hence is known as offload . Linux ,now -has support for SCSI WRITESAME command which is exposed to the user in -the form of BLKZEROOUT ioctl. BD Xlator can exploit BLKZEROOUT ioctl to -implement this fop. Thus zeroing out operations can be completely -offloaded to the storage device , making it highly efficient. - -The fop takes two arguments offset and size. It zeroes out 'size' number -of bytes in an opened file starting from 'offset' position. - -Benefit to GlusterFS --------------------- - -Benefits GlusterFS in virtualization by providing the ability to quickly -create pre-allocated and zeroed-out VM disk image by using -server/storage off-loads. - -### Scope - -Nature of proposed change -------------------------- - -An FOP supported in libgfapi and FUSE. - -Implications on manageability ------------------------------ - -None. - -Implications on presentation layer ----------------------------------- - -N/A - -Implications on persistence layer ---------------------------------- - -N/A - -Implications on 'GlusterFS' backend ------------------------------------ - -N/A - -Modification to GlusterFS metadata ----------------------------------- - -N/A - -Implications on 'glusterd' --------------------------- - -N/A - -How To Test ------------ - -Test server offload by measuring the time taken for creating a fully -allocated and zeroed file on Posix backend. - -Test storage offload by measuring the time taken for creating a fully -allocated and zeroed file on BD backend. - -User Experience ---------------- - -Fast provisioning of VM images when GlusterFS is used as a file system -backend for KVM virtualization. - -Dependencies ------------- - -zerofill() support in BD backend depends on the new BD translator - - - -Documentation -------------- - -This feature add support for a new ZEROFILL fop. Zerofill writes zeroes -to a file in the specified range. This fop will be useful when a whole -file needs to be initialized with zero (could be useful for zero filled -VM disk image provisioning or during scrubbing of VM disk images). - -Client/application can issue this FOP for zeroing out. Gluster server -will zero out required range of bytes ie server offloaded zeroing. In -the absence of this fop, client/application has to repetitively issue -write (zero) fop to the server, which is very inefficient method because -of the overheads involved in RPC calls and acknowledgements. - -WRITESAME is a SCSI T10 command that takes a block of data as input and -writes the same data to other blocks and this write is handled -completely within the storage and hence is known as offload . Linux ,now -has support for SCSI WRITESAME command which is exposed to the user in -the form of BLKZEROOUT ioctl. BD Xlator can exploit BLKZEROOUT ioctl to -implement this fop. 
Thus zeroing out operations can be completely -offloaded to the storage device , making it highly efficient. - -The fop takes two arguments offset and size. It zeroes out 'size' number -of bytes in an opened file starting from 'offset' position. - -This feature adds zerofill support to the following areas: - --  libglusterfs --  io-stats --  performance/md-cache,open-behind --  quota --  cluster/afr,dht,stripe --  rpc/xdr --  protocol/client,server --  io-threads --  marker --  storage/posix --  libgfapi - -Client applications can exploit this fop by using glfs\_zerofill -introduced in libgfapi.FUSE support to this fop has not been added as -there is no system call for this fop. - -Here is a performance comparison of server offloaded zeofill vs zeroing -out using repeated writes. - - [root@llmvm02 remote]# time ./offloaded aakash-test log 20 - - real    3m34.155s - user    0m0.018s - sys 0m0.040s - - -  [root@llmvm02 remote]# time ./manually aakash-test log 20 - - real    4m23.043s - user    0m2.197s - sys 0m14.457s -  [root@llmvm02 remote]# time ./offloaded aakash-test log 25; - - real    4m28.363s - user    0m0.021s - sys 0m0.025s - [root@llmvm02 remote]# time ./manually aakash-test log 25 - - real    5m34.278s - user    0m2.957s - sys 0m18.808s - -The argument log is a file which we want to set for logging purpose and -the third argument is size in GB . - -As we can see there is a performance improvement of around 20% with this -fop. - -Status ------- - -Patch : Status : Merged \ No newline at end of file diff --git a/Feature Planning/GlusterFS 3.5/gfid access.md b/Feature Planning/GlusterFS 3.5/gfid access.md deleted file mode 100644 index db64076..0000000 --- a/Feature Planning/GlusterFS 3.5/gfid access.md +++ /dev/null @@ -1,89 +0,0 @@ -### Instructions - -**Feature** - -'gfid-access' translator to provide access to data in glusterfs using a virtual path. - -**1 Summary** - -This particular Translator is designed to provide direct access to files in glusterfs using its gfid.'GFID' is glusterfs's inode numbers for a file to identify it uniquely. - -**2 Owners** - -Amar Tumballi  -Raghavendra G  -Anand Avati  - -**3 Current status** - -With glusterfs-3.4.0, glusterfs provides only path based access.A feature is added in 'fuse' layer in the current master branch, -but its desirable to have it as a separate translator for long time -maintenance. - -**4 Detailed Description** - -With this method, we can consume the data in changelog translator -(which is logging 'gfid' internally) very efficiently. - -**5 Benefit to GlusterFS** - -Provides a way to access files quickly with direct gfid. - -​**6. Scope** - -6.1. Nature of proposed change - -* A new translator. -* Fixes in 'glusterfsd.c' to add this translator automatically based -on mount time option. -* change to mount.glusterfs to parse this new option  -(single digit number or lines changed) - -6.2. Implications on manageability - -* No CLI required. -* mount.glusterfs script gets a new option. - -6.3. Implications on presentation layer - -* A new virtual access path is made available. But all access protocols work seemlessly, as the complexities are handled internally. - -6.4. Implications on persistence layer - -* None - -6.5. Implications on 'GlusterFS' backend - -* None - -6.6. Modification to GlusterFS metadata - -* None - -6.7. Implications on 'glusterd' - -* None - -7 How To Test - -* Mount glusterfs client with '-o aux-gfid-mount' and access files using '/mount/point/.gfid/ '. 
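A short sketch of that test (the server, volume, and gfid value are examples;
the gfid of any existing file can be read through the virtual
glusterfs.gfid.string xattr, assuming that interface is available on the
mount):

    # Mount the volume with the aux-gfid-mount option.
    mount -t glusterfs -o aux-gfid-mount server1:/testvol /mnt/testvol

    # Read the gfid of an existing file as a string.
    getfattr -n glusterfs.gfid.string /mnt/testvol/somefile

    # Access the same file directly through the virtual .gfid namespace,
    # substituting the gfid printed above.
    stat /mnt/testvol/.gfid/11111111-2222-3333-4444-555555555555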
- -8 User Experience - -* A new virtual path available for users. - -9 Dependencies - -* None - -10 Documentation - -This wiki. - -11 Status - -Patch sent upstream. More review comments required. (http://review.gluster.org/5497) - -12 Comments and Discussion - -Please do give comments :-) \ No newline at end of file diff --git a/Feature Planning/GlusterFS 3.5/index.md b/Feature Planning/GlusterFS 3.5/index.md deleted file mode 100644 index e8c2c88..0000000 --- a/Feature Planning/GlusterFS 3.5/index.md +++ /dev/null @@ -1,32 +0,0 @@ -GlusterFS 3.5 Release ---------------------- - -Tentative Dates: - -Latest: 13-Nov, 2014 GlusterFS 3.5.3 - -17th Apr, 2014 - 3.5.0 GA - -GlusterFS 3.5 -------------- - -### Features in 3.5.0 - -- [Features/AFR CLI enhancements](./AFR CLI enhancements.md) -- [Features/exposing volume capabilities](./Exposing Volume Capabilities.md) -- [Features/File Snapshot](./File Snapshot.md) -- [Features/gfid-access](./gfid access.md) -- [Features/On-Wire Compression + Decompression](./Onwire Compression-Decompression.md) -- [Features/Quota Scalability](./Quota Scalability.md) -- [Features/readdir ahead](./readdir ahead.md) -- [Features/zerofill](./Zerofill.md) -- [Features/Brick Failure Detection](./Brick Failure Detection.md) -- [Features/disk-encryption](./Disk Encryption.md) -- Changelog based parallel geo-replication -- Improved block device translator - -Proposing New Features ----------------------- - -New feature proposals should be built using the New Feature Template in -the GlusterFS 3.7 planning page diff --git a/Feature Planning/GlusterFS 3.5/libgfapi with qemu libvirt.md b/Feature Planning/GlusterFS 3.5/libgfapi with qemu libvirt.md deleted file mode 100644 index 2309016..0000000 --- a/Feature Planning/GlusterFS 3.5/libgfapi with qemu libvirt.md +++ /dev/null @@ -1,222 +0,0 @@ - Work In Progress - Author - Satheesaran Sundaramoorthi - - -**Purpose** ------------ - -Gluster volume can be used to store VM Disk images. This usecase is -popularly known as 'Virt-Store' usecase. Earlier, gluster volume had to -be fuse mounted and images are created/accessed over the fuse mount. - -With the introduction of GlusterFS libgfapi, QEMU supports glusterfs -through libgfapi directly. This we call as *QEMU driver for glusterfs*. -These document explains about the way to make use of QEMU driver for -glusterfs - -Steps for the entire procedure could be split in to 2 views viz,the -document from - -1. Steps to be done on gluster volume side -2. Steps to be done on Hypervisor side - -**Steps to be done on gluster side** ------------------------------------- - -These are the steps that needs to be done on the gluster side Precisely -this involves - -1. Creating "Trusted Storage Pool" -2. Creating a volume -3. Tuning the volume for virt-store -4. Tuning glusterd to accept requests from QEMU -5. Tuning glusterfsd to accept requests from QEMU -6. Setting ownership on the volume -7. Starting the volume - -##### Creating "Trusted Storage Pool" - -Install glusterfs rpms on the NODE. You can create a volume with a -single node. You can also scale up the cluster, as we call as *Trusted -Storage Pool*, by adding more nodes to the cluster - - gluster peer probe  - -##### Creating a volume - -It is highly recommended to have replicate volume or -distribute-replicate volume for virt-store usecase, as it would add high -availability and fault-tolerance. Remember the plain distribute works -equally well - - gluster volume create replica 2 .. 
- -where, is :/ Note: It is recommended to -create sub-directories inside brick and that could be used to create a -volume.For example, say, */home/brick1* is the mountpoint of XFS, then -you can create a sub-directory inside it */home/brick1/b1* and use it -while creating a volume.You can also use space available in root -filesystem for bricks. Gluster cli, by default, throws warning in that -case. You can override by using *force* option - - gluster volume create replica 2 .. force - -*If you are new to GlusterFS, you can take a look at -[QuickStart](http://gluster.readthedocs.org/en/latest/Quick-Start-Guide/Quickstart/) guide.* - -##### Tuning the volume for virt-store - -There are recommended settings available for virt-store. This provide -good performance characteristics when enabled on the volume that was -used for *virt-store* - -Refer to -[Virt-store-usecase\#Tunables](Virt-store-usecase#Tunables "wikilink") -for recommended tunables and for applying them on the volume, -[Virt-store-usecase\#Applying\_the\_Tunables\_on\_the\_volume](Virt-store-usecase#Applying_the_Tunables_on_the_volume "wikilink") - -##### Tuning glusterd to accept requests from QEMU - -glusterd receives the request only from the applications that run with -port number less than 1024 and it blocks otherwise. QEMU uses port -number greater than 1024 and to make glusterd accept requests from QEMU, -edit the glusterd vol file, */etc/glusterfs/glusterd.vol* and add the -following, - - option rpc-auth-allow-insecure on - -Note: If you have installed glusterfs from source, you can find glusterd -vol file at */usr/local/etc/glusterfs/glusterd.vol* - -Restart glusterd after adding that option to glusterd vol file - - service glusterd restart - -##### Tuning glusterfsd to accept requests from QEMU - -Enable the option *allow-insecure* on the particular volume - - gluster volume set  server.allow-insecure on - -**IMPORTANT :** As of now(april 2,2014)there is a bug, as -*allow-insecure* is not dynamically set on a volume.You need to restart -the volume for the change to take effect - -##### Setting ownership on the volume - -Set the ownership of qemu:qemu on to the volume - - gluster volume set  storage.owner-uid 107 - gluster volume set  storage.owner-gid 107 - -**IMPORTANT :** The UID and GID can differ per Linux distribution, or -even installation. The UID/GID should be the one fomr the *qemu* or -'kvm'' user, you can get the IDs with these commands: - - id qemu - getent group kvm - -##### Starting the volume - -Start the volume - - gluster volume start  - -**Steps to be done on Hypervisor Side** ---------------------------------------- - -Hypervisor is just the machine which spawns the Virtual Machines. This -machines should be necessarily the baremetal with more memory and -computing power. The following steps needs to be done on hypervisor, - -1. Install qemu-kvm -2. Install libvirt -3. Create a VM Image -4. Add ownership to the Image file -5. Create libvirt XML to define Virtual Machine -6. Define the VM -7. Start the VM -8. 
Verification - -##### Install qemu-kvm - -##### Install libvirt - -##### Create a VM Image - -Images can be created using *qemu-img* utility - - qemu-img create -f gluster://// - -- format - This can be raw or qcow2 -- server - One of the gluster Node's IP or FQDN -- vol-name - gluster volume name -- image - Image File name -- size - Size of the image - -Here is sample, - - qemu-img create -f qcow2 gluster://host.sample.com/vol1/vm1.img 10G - -##### Add ownership to the Image file - -NFS or FUSE mount the glusterfs volume and change the ownership of the -image file to qemu:qemu - - mount -t nfs -o vers=3 :/  - -Change the ownership of the image file that was earlier created using -*qemu-img* utility - - chown qemu:qemu / - -##### Create libvirt XML to define Virtual Machine - -*virt-install* is python wrapper which is mostly used to create VM using -set of params. *virt-install* doesn't support any network filesystem [ - ] - -Create a libvirt xml - See to -that the disk section is formatted in such a way, qemu driver for -glusterfs is being used. This can be seen in the following example xml -description - - - - - - - -
- - -##### Define the VM from XML - -Define the VM from the XML file that was created earlier - - virsh define  - -Verify that the VM is created successfully - - virsh list --all - -##### Start the VM - -Start the VM - - virsh start  - -##### Verification - -You can verify the disk image file that is being used by VM - - virsh domblklist  - -The above should show the volume name and image name. Here is the -example, - - [root@test ~]# virsh domblklist vm-test2 - Target Source - ------------------------------------------------ - vda distrepvol/test.img - hdc - \ No newline at end of file diff --git a/Feature Planning/GlusterFS 3.5/readdir ahead.md b/Feature Planning/GlusterFS 3.5/readdir ahead.md deleted file mode 100644 index fe34a97..0000000 --- a/Feature Planning/GlusterFS 3.5/readdir ahead.md +++ /dev/null @@ -1,117 +0,0 @@ -Feature -------- - -readdir-ahead - -Summary -------- - -Provide read-ahead support for directories to improve sequential -directory read performance. - -Owners ------- - -Brian Foster - -Current status --------------- - -Gluster currently does not attempt to improve directory read -performance. As a result, simple operations (i.e., ls) on large -directories are slow. - -Detailed Description --------------------- - -The read-ahead feature for directories is analogous to read-ahead for -files. The objective is to detect sequential directory read operations -and establish a pipeline for directory content. When a readdir request -is received and fulfilled, preemptively issue subsequent readdir -requests to the server in anticipation of those requests from the user. -If sequential readdir requests are received, the directory content is -already immediately available in the client. If subsequent requests are -not sequential or not received, said data is simply dropped and the -optimization is bypassed. - -Benefit to GlusterFS --------------------- - -Improved read performance of large directories. - -### Scope - -Nature of proposed change -------------------------- - -readdir-ahead support is enabled through a new client-side translator. - -Implications on manageability ------------------------------ - -None beyond the ability to enable and disable the translator. - -Implications on presentation layer ----------------------------------- - -N/A - -Implications on persistence layer ---------------------------------- - -N/A - -Implications on 'GlusterFS' backend ------------------------------------ - -N/A - -Modification to GlusterFS metadata ----------------------------------- - -N/A - -Implications on 'glusterd' --------------------------- - -N/A - -How To Test ------------ - -Performance testing. Verify that sequential reads of large directories -complete faster (i.e., ls, xfs\_io -c readdir). - -User Experience ---------------- - -Improved performance on sequential read workloads. The translator should -otherwise be invisible and not detract performance or disrupt behavior -in any way. - -Dependencies ------------- - -N/A - -Documentation -------------- - -Set the associated config option to enable or disable directory -read-ahead on a volume: - - gluster volume set  readdir-ahead [enable|disable] - -readdir-ahead is disabled by default. - -Status ------- - -Development complete for the initial version. Minor changes and bug -fixes likely. - -Future versions might expand to provide generic caching and more -flexible behavior. 
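A concrete form of the performance check described under How To Test above
might look like the following sketch; the volume name, mount point, and file
count are examples, and the option name follows the form given in the
Documentation section:

    # Populate a large directory on the mounted volume.
    mkdir -p /mnt/testvol/bigdir
    for i in $(seq 1 100000); do touch /mnt/testvol/bigdir/f$i; done

    # Time a sequential directory read with read-ahead disabled, then enabled.
    # Remount or drop caches between runs for a fair comparison.
    gluster volume set testvol readdir-ahead disable
    time ls -f /mnt/testvol/bigdir > /dev/null

    gluster volume set testvol readdir-ahead enable
    time ls -f /mnt/testvol/bigdir > /dev/null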
- -Comments and Discussion ------------------------ \ No newline at end of file diff --git a/Feature Planning/GlusterFS 3.6/Better Logging.md b/Feature Planning/GlusterFS 3.6/Better Logging.md deleted file mode 100644 index 6aad602..0000000 --- a/Feature Planning/GlusterFS 3.6/Better Logging.md +++ /dev/null @@ -1,348 +0,0 @@ -Feature -------- - -Gluster logging enhancements to support message IDs per message - -Summary -------- - -Enhance gluster logging to provide the following features, SubFeature ---\> SF - -- SF1: Add message IDs to message - -- SF2: Standardize error num reporting across messages - -- SF3: Enable repetitive message suppression in logs - -- SF4: Log location and hierarchy standardization (in case anything is -further required here, analysis pending) - -- SF5: Enable per sub-module logging level configuration - -- SF6: Enable logging to other frameworks, than just the current gluster -logs - -- SF7: Generate a catalogue of these message, with message ID, message, -reason for occurrence, recovery/troubleshooting steps. - -Owners ------- - -Balamurugan Arumugam -Krishnan Parthasarathi -Krutika Dhananjay -Shyamsundar Ranganathan - -Current status --------------- - -### Existing infrastructure: - -Currently gf\_logXXX exists as an infrastructure API for all logging -related needs. This (typically) takes the form, - -gf\_log(dom, levl, fmt...) - -where, - -    dom: Open format string usually the xlator name, or "cli" or volume name etc. -    levl: One of, GF_LOG_EMERG, GF_LOG_ALERT, GF_LOG_CRITICAL, GF_LOG_ERROR, GF_LOG_WARNING, GF_LOG_NOTICE, GF_LOG_INFO, GF_LOG_DEBUG, GF_LOG_TRACE -    fmt: the actual message string, followed by the required arguments in the string - -The log initialization happens through, - -gf\_log\_init (void \*data, const char \*filename, const char \*ident) - -where, - -    data: glusterfs_ctx_t, largely unused in logging other than the required FILE and mutex fields -    filename: file name to log to -    ident: Like syslog ident parameter, largely unused - -The above infrastructure leads to logs of type, (sample extraction from -nfs.log) - -     [2013-12-08 14:17:17.603879] I [socket.c:3485:socket_init] 0-socket.ACL: SSL support is NOT enabled -     [2013-12-08 14:17:17.603937] I [socket.c:3500:socket_init] 0-socket.ACL: using system polling thread -     [2013-12-08 14:17:17.612128] I [nfs.c:934:init] 0-nfs: NFS service started -     [2013-12-08 14:17:17.612383] I [dht-shared.c:311:dht_init_regex] 0-testvol-dht: using regex rsync-hash-regex = ^\.(.+)\.[^.]+$ - -### Limitations/Issues in the infrastructure - -​1) Auto analysis of logs needs to be done based on the final message -string. Automated tools that can help with log message and related -troubleshooting options need to use the final string, which needs to be -intelligently parsed and also may change between releases. It would be -desirable to have message IDs so that such tools and trouble shooting -options can leverage the same in a much easier fashion. - -​2) The log message itself currently does not use the \_ident\_ which -can help as we move to more common logging frameworks like journald, -rsyslog (or syslog as the case maybe) - -​3) errno is the primary identifier of errors across gluster, i.e we do -not have error codes in gluster and use errno values everywhere. 
The log -messages currently do not lend themselves to standardization like -printing the string equivalent of errno rather than the actual errno -value, which \_could\_ be cryptic to administrators - -​4) Typical logging infrastructures provide suppression (on a -configurable basis) for repetitive messages to prevent log flooding, -this is currently missing in the current infrastructure - -​5) The current infrastructure cannot be used to control log levels at a -per xlator or sub module, as the \_dom\_ passed is a string that change -based on volume name, translator name etc. It would be desirable to have -a better module identification mechanism that can help with this -feature. - -​6) Currently the entire logging infrastructure resides within gluster. -It would be desirable in scaled situations to have centralized logging -and monitoring solutions in place, to be able to better analyse and -monitor the cluster health and take actions. - -This requires some form of pluggable logging frameworks that can be used -within gluster to enable this possibility. Currently the existing -framework is used throughout gluster and hence we need only to change -configuration and logging.c to enable logging to other frameworks (as an -example the current syslog plug that was provided). - -It would be desirable to enhance this to provide a more robust framework -for future extensions to other frameworks. This is not a limitation of -the current framework, so much as a re-factor to be able to switch -logging frameworks with more ease. - -​7) For centralized logging in the future, it would need better -identification strings from various gluster processes and hosts, which -is currently missing or suppressed in the logging infrastructure. - -Due to the nature of enhancements proposed, it is required that we -better the current infrastructure for the stated needs and do some -future proofing in terms of newer messages that would be added. - -Detailed Description --------------------- - -NOTE: Covering details for SF1, SF2, and partially SF3, SF5, SF6. SF4/7 -will be covered in later revisions/phases. - -### Logging API changes: - -​1) Change the logging API as follows, - -From: gf\_log(dom, levl, fmt...) - -To: gf\_msg(dom, levl, errnum, msgid, fmt...) - -Where: - -    dom: Open string as used in the current logging infrastructure (helps in backward compat) -    levl: As in current logging infrastructure (current levels seem sufficient enough to not add more levels for better debuggability etc.) -     -    msgid: A message identifier, unique to this message FMT string and possibly this invocation. (SF1, lending to SF3) -    errnum: The errno that this message is generated for (with an implicit 0 meaning no error number per se with this message) (SF2) - -NOTE: Internally the gf\_msg would still be a macro that would add the -\_\_FILE\_\_ \_\_LINE\_\_ \_\_FUNCTION\_\_ arguments - -​2) Enforce \_ident\_ in the logging initialization API, gf\_log\_init -(void \*data, const char \*filename, const char \*ident) - -Where: - - ident would be the identifier string like, nfs, , brick-, cli, glusterd, as is the case with the log file name that is generated today (lending to SF6) - -#### What this achieves: - -With the above changes, we now have a message ID per message -(\_msgid\_), location of the message in terms of which component -(\_dom\_) and which process (\_ident\_). 
The further identification of -the message location in terms of host (ip/name) can be done in the -framework, when centralized logging infrastructure is introduced. - -#### Log message changes: - -With the above changes to the API the log message can now appear in a -compatibility mode to adhere to current logging format, or be presented -as follows, - -log invoked as: gf\_msg(dom, levl, ENOTSUP, msgidX) - -Example: gf\_msg ("logchecks", GF\_LOG\_CRITICAL, 22, logchecks\_msg\_4, -42, "Forty-Two", 42); - -Where: logchecks\_msg\_4 (GLFS\_COMP\_BASE + 4), "Critical: Format -testing: %d:%s:%x" - -​1) Gluster logging framework (logged as) - - [2014-02-17 08:52:28.038267] I [MSGID: 1002] [logchecks.c:44:go_log] 0-logchecks: Informational: Format testing: 42:Forty-Two:2a [Invalid argument] - -​2) syslog (passed as) - - Feb 17 14:17:42 somari logchecks[26205]: [MSGID: 1002] [logchecks.c:44:go_log] 0-logchecks: Informational: Format testing: 42:Forty-Two:2a [Invalid argument] - -​3) journald (passed as) - -    sd_journal_send("MESSAGE=", -                        "MESSAGE_ID=msgid", -                        "PRIORITY=levl", -                        "CODE_FILE=``", "CODE_LINE=`", "CODE_FUNC=", -                        "ERRNO=errnum", -                        "SYSLOG_IDENTIFIER=" -                        NULL); - -​4) CEE (Common Event Expression) format string passed to any CEE -consumer (say lumberjack) - -Based on generating @CEE JSON string as per specifications and passing -it the infrastructure in question. - -#### Message ID generation: - -​1) Some rules for message IDs - -- Every message, even if it is the same message FMT, will have a unique -message ID - Changes to a specific message string, hence will not change -its ID and also not impact other locations in the code that use the same -message FMT - -​2) A glfs-message-id.h file would contain ranges per component for -individual component based messages to be created without overlapping on -the ranges. - -​3) -message.h would contain something as follows, - -     #define GLFS_COMP_BASE         GLFS_MSGID_COMP_ -     #define GLFS_NUM_MESSAGES       1 -     #define GLFS_MSGID_END          (GLFS_COMP_BASE + GLFS_NUM_MESSAGES + 1) -     /* Messaged with message IDs */ -     #define glfs_msg_start_x GLFS_COMP_BASE, "Invalid: Start of messages" -     /*------------*/ -     #define _msg_1 (GLFS_COMP_BASE + 1), "Test message, replace with"\ -                        " original when using the template" -     /*------------*/ -     #define glfs_msg_end_x GLFS_MSGID_END, "Invalid: End of messages" - -​5) Each call to gf\_msg hence would be, - -    gf_msg(dom, levl, errnum, glfs_msg_x, ...) - -#### Setting per xlator logging levels (SF5): - -short description to be elaborated later - -Leverage this-\>loglevel to override the global loglevel. This can be -also configured from gluster CLI at runtime to change the log levels at -a per xlator level for targeted debugging. 
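For comparison, the CLI today only offers coarse per-volume log-level
controls such as the ones below; the proposal above would allow the same kind
of tuning per xlator or sub-module (the volume name is an example):

    # Existing per-volume knobs for client-side and brick-side log levels.
    gluster volume set testvol diagnostics.client-log-level DEBUG
    gluster volume set testvol diagnostics.brick-log-level WARNING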
- -#### Multiple log suppression(SF3): - -short description to be elaborated later - -​1) Save the message string as follows, Msg\_Object(msgid, -msgstring(vasprintf(dom, fmt)), timestamp, repetitions) - -​2) On each message received by the logging infrastructure check the -list of saved last few Msg\_Objects as follows, - -2.1) compare msgid and on success compare msgstring for a match, compare -repetition tolerance time with current TS and saved TS in the -Msg\_Object - -2.1.1) if tolerance is within limits, increment repetitions and do not -print message - -2.1.2) if tolerance is outside limits, print repetition count for saved -message (if any) and print the new message - -2.2) If none of the messages match the current message, knock off the -oldest message in the list printing any repetition count message for the -same, and stash new message into the list - -The key things to remember and act on here would be to, minimize the -string duplication on each message, and also to keep the comparison -quick (hence base it off message IDs and errno to start with) - -#### Message catalogue (SF7): - - - -The idea is to use Doxygen comments in the -message.h per -component, to list information in various sections per message of -consequence and later use Doxygen to publish this catalogue on a per -release basis. - -Benefit to GlusterFS --------------------- - -The mentioned limitations and auto log analysis benefits would accrue -for GlusterFS - -Scope ------ - -### Nature of proposed change - -All gf\_logXXX function invocations would change to gf\_msgXXX -invocations. - -### Implications on manageability - -None - -### Implications on presentation layer - -None - -### Implications on persistence layer - -None - -### Implications on 'GlusterFS' backend - -None - -### Modification to GlusterFS metadata - -None - -### Implications on 'glusterd' - -None - -How To Test ------------ - -A separate test utility that tests various logs and formats would be -provided to ensure that functionality can be tested independent of -GlusterFS - -User Experience ---------------- - -Users would notice changed logging formats as mentioned above, the -additional field of importance would be the MSGID: - -Dependencies ------------- - -None - -Documentation -------------- - -Intending to add a logging.md (or modify the same) to elaborate on how a -new component should now use the new framework and generate messages -with IDs in the same. - -Status ------- - -In development (see, ) - -Comments and Discussion ------------------------ - - \ No newline at end of file diff --git a/Feature Planning/GlusterFS 3.6/Better Peer Identification.md b/Feature Planning/GlusterFS 3.6/Better Peer Identification.md deleted file mode 100644 index a8c6996..0000000 --- a/Feature Planning/GlusterFS 3.6/Better Peer Identification.md +++ /dev/null @@ -1,172 +0,0 @@ -Feature -------- - -**Better peer identification** - -Summary -------- - -This proposal is regarding better identification of peers. - -Owners ------- - -Kaushal Madappa - -Current status --------------- - -Glusterd currently is inconsistent in the way it identifies peers. This -causes problems when the same peer is referenced with different names in -different gluster commands. - -Detailed Description --------------------- - -Currently, the way we identify peers is not consistent all through the -gluster code. We use uuids internally and hostnames externally. 
- -This setup works pretty well when all the peers are on a single network, -have one address, and are referred to in all the gluster commands with -same address. - -But once we start mixing up addresses in the commands (ip, shortnames, -fqdn) and bring in multiple networks we have problems. - -The problems were discussed in the following mailing list threads and -some solutions were proposed. - -- How do we identify peers? [^1] -- RFC - "Connection Groups" concept [^2] - -The solution to the multi-network problem is dependent on the solution -to the peer identification problem. So it'll be good to target fixing -the peer identification problem asap, ie. in 3.6, and take up the -networks problem later. - -Benefit to GlusterFS --------------------- - -Sanity. It will be great to have all internal identifiers for peers -happening through a UUID, and being translated into a host/IP at the -most superficial layer. - -Scope ------ - -### Nature of proposed change - -The following changes will be done in Glusterd to improve peer -identification. - -1. Peerinfo struct will be extended to have a list of associated - hostnames/addresses, instead of a single hostname as it is - currently. The import/export and store/restore functions will be - changed to handle this. CLI will be updated to show this list of - addresses in peer status and pool list commands. -2. Peer probe will be changed to append an address to the peerinfo - address list, when we observe that the given address belongs to an - existing peer. -3. Have a new API for translation between hostname/addresses into - UUIDs. This new API will be used in all places where - hostnames/addresses were being validated, including peer probe, peer - detach, volume create, add-brick, remove-brick etc. -4. A new command - 'gluster peer add-address ' - - which appends to the address list will be implemented if time - permits. -5. A new command - 'gluster peer rename ' - which will - rename all occurrences of a peer with the newly given name will be - implemented if time permits. - -Changes 1-3 are the base for the other changes and will the primary -deliverables for this feature. - -### Implications on manageability - -The primary changes will bring about some changes to the CLI output of -'peer status' and 'pool list' commands. The normal and XML outputs for -these commands will contain a list of addresses for each peer, instead -of a single hostname. - -Tools depending on the output of these commands will need to be updated. - -**TODO**: *Add sample outputs* - -The new commands 'peer add-address' and 'peer rename' will improve -manageability of peers. - -### Implications on presentation layer - -None - -### Implications on persistence layer - -None - -### Implications on 'GlusterFS' backend - -None - -### Modification to GlusterFS metadata - -None - -### Implications on 'glusterd' - - - -How To Test ------------ - -**TODO:** *Add test cases* - -User Experience ---------------- - -User experience will improve for commands which used peer identifiers -(volume create/add-brick/remove-brick, peer probe, peer detach), as the -the user will no longer face errors caused by mixed usage of -identifiers. - -Dependencies ------------- - -None. - -Documentation -------------- - -The new behaviour of the peer probe command will need to be documented. -The new commands will need to be documented as well. - -**TODO:** *Add more documentations* - -Status ------- - -The feature is under development on forge [^3] and github [^4]. 
This -github merge request [^5] can be used for performing preliminary -reviews. Once we are satisfied with the changes, it will be posted for -review on gerrit. - -Comments and Discussion ------------------------ - -There are open issues around node crash + re-install with same IP (but -new UUID) which need to be addressed in this effort. - -Links ------ - - - - -[^1]: - -[^2]: - -[^3]: - -[^4]: - -[^5]: diff --git a/Feature Planning/GlusterFS 3.6/Gluster User Serviceable Snapshots.md b/Feature Planning/GlusterFS 3.6/Gluster User Serviceable Snapshots.md deleted file mode 100644 index 9af7062..0000000 --- a/Feature Planning/GlusterFS 3.6/Gluster User Serviceable Snapshots.md +++ /dev/null @@ -1,39 +0,0 @@ -Feature -------- - -Enable user-serviceable snapshots for GlusterFS Volumes based on -GlusterFS-Snapshot feature - -Owners ------- - -Anand Avati -Anand Subramanian -Raghavendra Bhat -Varun Shastry - -Summary -------- - -Each snapshot capable GlusterFS Volume will contain a .snaps directory -through which a user will be able to access previously point-in-time -snapshot copies of his data. This will be enabled through a hidden -.snaps folder in each directory or sub-directory within the volume. -These user-serviceable snapshot copies will be read-only. - -Tests ------ - -​1) Enable uss (gluster volume set features.uss enable) A -snap daemon should get started for the volume. It should be visible in -gluster volume status command. 2) entering the snapshot world ls on -.snaps from any directory within the filesystem should be successful and -should show the list of snapshots as directories. 3) accessing the -snapshots One of the snapshots can be entered and it should show the -contents of the directory from which .snaps was entered, when the -snapshot was taken. NOTE: If the directory was not present when a -snapshot was taken (say snap1) and created later, then entering snap1 -directory (or any access) will fail with stale file handle. 4) Reading -from snapshots Any kind of read operations from the snapshots should be -successful. But any modifications to snapshot data is not allowed. -Snapshots are read-only \ No newline at end of file diff --git a/Feature Planning/GlusterFS 3.6/Gluster Volume Snapshot.md b/Feature Planning/GlusterFS 3.6/Gluster Volume Snapshot.md deleted file mode 100644 index 468992a..0000000 --- a/Feature Planning/GlusterFS 3.6/Gluster Volume Snapshot.md +++ /dev/null @@ -1,354 +0,0 @@ -Feature -------- - -Snapshot of Gluster Volume - -Summary -------- - -Gluster volume snapshot will provide point-in-time copy of a GlusterFS -volume. This snapshot is an online-snapshot therefore file-system and -its associated data continue to be available for the clients, while the -snapshot is being taken. - -Snapshot of a GlusterFS volume will create another read-only volume -which will be a point-in-time copy of the original volume. Users can use -this read-only volume to recover any file(s) they want. Snapshot will -also provide restore feature which will help the user to recover an -entire volume. The restore operation will replace the original volume -with the snapshot volume. - -Owner(s) --------- - -Rajesh Joseph - -Copyright ---------- - -Copyright (c) 2013-2014 Red Hat, Inc. - -This feature is licensed under your choice of the GNU Lesser General -Public License, version 3 or any later version (LGPLv3 or later), or the -GNU General Public License, version 2 (GPLv2), in all cases as published -by the Free Software Foundation. 
- -Current status --------------- - -Gluster volume snapshot support is provided in GlusterFS 3.6 - -Detailed Description --------------------- - -GlusterFS snapshot feature will provide a crash consistent point-in-time -copy of Gluster volume(s). This snapshot is an online-snapshot therefore -file-system and its associated data continue to be available for the -clients, while the snapshot is being taken. As of now we are not -planning to provide application level crash consistency. That means if a -snapshot is restored then applications need to rely on journals or other -technique to recover or cleanup some of the operations performed on -GlusterFS volume. - -A GlusterFS volume is made up of multiple bricks spread across multiple -nodes. Each brick translates to a directory path on a given file-system. -The current snapshot design is based on thinly provisioned LVM2 snapshot -feature. Therefore as a prerequisite the Gluster bricks should be on -thinly provisioned LVM. For a single lvm, taking a snapshot would be -straight forward for the admin, but this is compounded in a GlusterFS -volume which has bricks spread across multiple LVM’s across multiple -nodes. Gluster volume snapshot feature aims to provide a set of -interfaces from which the admin can snap and manage the snapshots for -Gluster volumes. - -Gluster volume snapshot is nothing but snapshots of all the bricks in -the volume. So ideally all the bricks should be snapped at the same -time. But with real-life latencies (processor and network) this may not -hold true all the time. Therefore we need to make sure that during -snapshot the file-system is in consistent state. Therefore we barrier -few operation so that the file-system remains in a healthy state during -snapshot. - -For details about barrier [Server Side -Barrier](http://www.gluster.org/community/documentation/index.php/Features/Server-side_Barrier_feature) - -Benefit to GlusterFS --------------------- - -Snapshot of glusterfs volume allows users to - -- A point in time checkpoint from which to recover/failback -- Allow read-only snaps to be the source of backups. - -Scope ------ - -### Nature of proposed change - -Gluster cli will be modified to provide new commands for snapshot -management. The entire snapshot core implementation will be done in -glusterd. - -Apart from this Snapshot will also make use of quiescing xlator for -doing quiescing. This will be a server side translator which will -quiesce will fops which can modify disk state. The quescing will be done -till the snapshot operation is complete. - -### Implications on manageability - -Snapshot will provide new set of cli commands to manage snapshots. REST -APIs are not planned for this release. - -### Implications on persistence layer - -Snapshot will create new volume per snapshot. These volumes are stored -in /var/lib/glusterd/snaps folder. Apart from this each volume will have -additional snapshot related information stored in snap\_list.info file -in its respective vol folder. - -### Implications on 'glusterd' - -Snapshot information and snapshot volume details are stored in -persistent stores. - -How To Test ------------ - -For testing this feature one needs to have mulitple thinly provisioned -volumes or else need to create LVM using loop back devices. - -Details of how to create thin volume can be found at the following link - - -Each brick needs to be in a independent LVM. And these LVMs should be -thinly provisioned. From these bricks create Gluster volume. 
This volume -can then be used for snapshot testing. - -See the User Experience section for various commands of snapshot. - -User Experience ---------------- - -##### Snapshot creation - - snapshot create   [description ] [force] - -This command will create a sapshot of the volume identified by volname. -snapname is a mandatory field and the name should be unique in the -entire cluster. Users can also provide an optional description to be -saved along with the snap (max 1024 characters). force keyword is used -if some bricks of orginal volume is down and still you want to take the -snapshot. - -##### Listing of available snaps - - gluster snapshot list [snap-name] [vol ] - -This command is used to list all snapshots taken, or for a specified -volume. If snap-name is provided then it will list the details of that -snap. - -##### Configuring the snapshot behavior - - gluster snapshot config [vol-name] - -This command will display existing config values for a volume. If volume -name is not provided then config values of all the volume is displayed. - - gluster snapshot config [vol-name] [ ] [ ] [force] - -The above command can be used to change the existing config values. If -vol-name is provided then config value of that volume is changed, else -it will set/change the system limit. - -The system limit is the default value of the config for all the volume. -Volume specific limit cannot cross the system limit. If a volume -specific limit is not provided then system limit will be considered. - -If any of this limit is decreased and the current snap count of the -system/volume is more than the limit then the command will fail. If user -still want to decrease the limit then force option should be used. - -**snap-max-limit**: Maximum snapshot limit for a volume. Snapshots -creation will fail if snap count reach this limit. - -**snap-max-soft-limit**: Maximum snapshot limit for a volume. Snapshots -can still be created if snap count reaches this limit. An auto-deletion -will be triggered if this limit is reached. The oldest snaps will be -deleted if snap count reaches this limit. This is represented as -percentage value. - -##### Status of snapshots - - gluster snapshot status ([snap-name] | [volume ]) - -Shows the status of all the snapshots or the specified snapshot. The -status will include the brick details, LVM details, process details, -etc. - -##### Activating a snap volume - -By default the snapshot created will be in an inactive state. Use the -following commands to activate snapshot. - - gluster snapshot activate  - -##### Deactivating a snap volume - - gluster snapshot deactivate  - -The above command will deactivate an active snapshot - -##### Deleting snaps - - gluster snapshot delete  - -This command will delete the specified snapshot. - -##### Restoring snaps - - gluster snapshot restore  - -This command restores an already taken snapshot of single or multiple -volumes. Snapshot restore is an offline activity therefore if any volume -which is part of the given snap is online then the restore operation -will fail. - -Once the snapshot is restored it will be deleted from the list of -snapshot. - -Dependencies ------------- - -To provide support for a crash-consistent snapshot feature Gluster core -com- ponents itself should be crash-consistent. As of now Gluster as a -whole is not crash-consistent. In this section we will identify those -Gluster components which are not crash-consistent. - -**Geo-Replication**: Geo-replication provides master-slave -synchronization option to Gluster. 
Geo-replication maintains state -information for completing the sync operation. Therefore ideally when a -snapshot is taken then both the master and slave snapshot should be -taken. And both master and slave snapshot should be in mutually -consistent state. - -Geo-replication make use of change-log to do the sync. By default the -change-log is stored .glusterfs folder in every brick. But the -change-log path is configurable. If change-log is part of the brick then -snapshot will contain the change-log changes as well. But if it is not -then it needs to be saved separately during a snapshot. - -Following things should be considered for making change-log -crash-consistent: - -- Change-log is part of the brick of the same volume. -- Change-log is outside the brick. As of now there is no size limit on - the - -change-log files. We need to answer following questions here - -- - Time taken to make a copy of the entire change-log. Will affect - the - -overall time of snapshot operation. - -- - The location where it can be copied. Will impact the disk usage - of - -the target disk or file-system. - -- Some part of change-log is present in the brick and some are outside - -the brick. This situation will arrive when change-log path is changed -in-between. - -- Change-log is saved in another volume and this volume forms a CG - with - -the volume about to be snapped. - -**Note**: Considering the above points we have decided not to support -change-log stored outside the bricks. - -For this release automatic snapshot of both master and slave session is -not supported. If required user need to explicitly take snapshot of both -master and slave. Following steps need to be followed while taking -snapshot of a master and slave setup - -- Stop geo-replication manually. -- Snapshot all the slaves first. -- When the slave snapshot is done then initiate master snapshot. -- When both the snapshot is complete geo-syncronization can be started - again. - -**Gluster Quota**: Quota enables an admin to specify per directory -quota. Quota makes use of marker translator to enforce quota. As of now -the marker framework is not completely crash-consistent. As part of -snapshot feature we need to address following issues. - -- If a snapshot is taken while the contribution size of a file is - being updated then you might end up with a snapshot where there is a - mismatch between the actual size of the file and the contribution of - the file. These in-consistencies can only be rectified when a - look-up is issued on the snapshot volume for the same file. As a - workaround admin needs to issue an explicit file-system crawl to - rectify the problem. -- For NFS, quota makes use of pgfid to build a path from gfid and - enforce quota. As of now pgfid update is not crash-consistent. -- Quota saves its configuration in file-system under /var/lib/glusterd - folder. As part of snapshot feature we need to save this file. - -**NFS**: NFS uses a single graph to represent all the volumes in the -system. And to make all the snapshot volume accessible these snapshot -volumes should be added to this graph. This brings in another -restriction, i.e. all the snapshot names should be unique and -additionally snap name should not clash with any other volume name as -well. - -To handle this situation we have decided to use an internal uuid as snap -name. And keep a mapping of this uuid and user given snap name in an -internal structure. 
- -Another restriction with NFS is that when a newly created volume -(snapshot volume) is started it will restart NFS server. Therefore we -decided when snapshot is taken it will be in stopped state. Later when -snapshot volume is needed it can be started explicitly. - -**DHT**: DHT xlator decides which node to look for a file/directory. -Some of the DHT fop are not atomic in nature, e.g rename (both file and -directory). Also these operations are not transactional in nature. That -means if a crash happens the data in server might be in an inconsistent -state. Depending upon the time of snapshot and which DHT operation is in -what state there can be an inconsistent snapshot. - -**AFR**: AFR is the high-availability module in Gluster. AFR keeps track -of fresh and correct copy of data using extended attributes. Therefore -it is important that before taking snapshot these extended attributes -are written into the disk. To make sure these attributes are written to -disk snapshot module will issue explicit sync after the -barrier/quiescing. - -The other issue with the current AFR is that it writes the volume name -to the extended attribute of all the files. AFR uses this for -self-healing. When snapshot is taken of such a volume the snapshotted -volume will also have the same volume name. Therefore AFR needs to -create a mapping of the real volume name and the extended entry name in -the volfile. So that correct name can be referred during self-heal. - -Another dependency on AFR is that currently there is no direct API or -call back function which will tell that AFR self-healing is completed on -a volume. This feature is required to heal a snapshot volume before -restore. - -Documentation -------------- - -Status ------- - -In development - -Comments and Discussion ------------------------ - - diff --git a/Feature Planning/GlusterFS 3.6/New Style Replication.md b/Feature Planning/GlusterFS 3.6/New Style Replication.md deleted file mode 100644 index ffd8167..0000000 --- a/Feature Planning/GlusterFS 3.6/New Style Replication.md +++ /dev/null @@ -1,230 +0,0 @@ -Goal ----- - -More partition-tolerant replication, with higher performance for most -use cases. - -Summary -------- - -NSR is a new synchronous replication translator, complementing or -perhaps some day replacing AFR. - -Owners ------- - -Jeff Darcy -Venky Shankar - -Current status --------------- - -Design and prototype (nearly) complete, implementation beginning. - -Related Feature Requests and Bugs ---------------------------------- - -[AFR bugs related to "split -brain"](https://bugzilla.redhat.com/buglist.cgi?classification=Community&component=replicate&list_id=3040567&product=GlusterFS&query_format=advanced&short_desc=split&short_desc_type=allwordssubstr) - -[AFR bugs related to -"perf"](https://bugzilla.redhat.com/buglist.cgi?classification=Community&component=replicate&list_id=3040572&product=GlusterFS&query_format=advanced&short_desc=perf&short_desc_type=allwordssubstr) - -(Both lists are undoubtedly partial because not all bugs in these areas -using these specific words. In particular, "GFID mismatch" bugs are -really a kind of split brain, but aren't represented.) - -Detailed Description --------------------- - -NSR is designed to have the following features. - -- Server based - "chain" replication can use bandwidth of both client - and server instead of splitting client bandwidth N ways. - -- Journal based - for reduced network traffic in normal operation, - plus faster recovery and greater resistance to "split brain" errors. 
- -- Variable consistency model - based on - [Dynamo](http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf) - to provide options trading some consistency for greater availability - and/or performance. - -- Newer, smaller codebase - reduces technical debt, enables higher - replica counts, more informative status reporting and logging, and - other future features (e.g. ordered asynchronous replication). - -Benefit to GlusterFS -==================== - -Faster, more robust, more manageable/maintainable replication. - -Scope -===== - -Nature of proposed change -------------------------- - -At least two new translators will be necessary. - -- A simple client-side translator to route requests to the current - leader among the bricks in a replica set. - -- A server-side translator to handle the "heavy lifting" of - replication, recovery, etc. - -Implications on manageability ------------------------------ - -At a high level, commands to enable, configure, and manage NSR will be -very similar to those already used for AFR. At a lower level, the -options affecting things things like quorum, consistency, and placement -of journals will all be completely different. - -Implications on presentation layer ----------------------------------- - -Minimal. Most changes will be to simplify or remove special handling for -AFR's unique behavior (especially around lookup vs. self-heal). - -Implications on persistence layer ---------------------------------- - -N/A - -Implications on 'GlusterFS' backend ------------------------------------ - -The journal for each brick in an NSR volume might (for performance -reasons) be placed on one or more local volumes other than the one -containing the brick's data. Special requirements around AIO, fsync, -etc. will be less than with AFR. - -Modification to GlusterFS metadata ----------------------------------- - -NSR will not use the same xattrs as AFR, reducing the need for larger -inodes. - -Implications on 'glusterd' --------------------------- - -Volgen must be able to configure the client-side and server-side parts -of NSR, instead of AFR on the client side and index (which will no -longer be necessary) on the server side. Other interactions with -glusterd should remain mostly the same. - -How To Test -=========== - -Most basic AFR tests - e.g. reading/writing data, killing nodes, -starting/stopping self-heal - would apply to NSR as well. Tests that -embed assumptions about AFR xattrs or other internal artifacts will need -to be re-written. - -User Experience -=============== - -Minimal change, mostly related to new options. - -Dependencies -============ - -NSR depends on a cluster-management framework that can provide -membership tracking, leader election, and robust consistent key/value -data storage. This is expected to be developed in parallel as part of -the glusterd-scalability feature, but can be implemented (in simplified -form) within NSR itself if necessary. - -Documentation -============= - -TBD. - -Status -====== - -Some parts of earlier implementation updated to current tree, others in -the middle of replacement. 
- -- [New design](http://review.gluster.org/#/c/8915/) - -- [Basic translator code](http://review.gluster.org/#/c/8913/) (needs - update to new code-generation infractructure) - -- [GF\_FOP\_IPC](http://review.gluster.org/#/c/8812/) - -- [etcd support](http://review.gluster.org/#/c/8887/) - -- [New code-generation - infrastructure](http://review.gluster.org/#/c/9411/) - -- [New data-logging - translator](https://forge.gluster.org/~jdarcy/glusterfs-core/jdarcys-glusterfs-data-logging) - -Comments and Discussion -======================= - -My biggest concern with journal-based replication comes from my previous -use of DRBD. They do an "activity log"[^1] which sounds like the same -basic concept. Once that log filled, I experienced cascading failure. -When the journal can be filled faster than it's emptied this could cause -the problem I experienced. - -So what I'm looking to be convinced is how journalled replication -maintains full redundancy and how it will prevent the journal input from -exceeding the capacity of the journal output or at least how it won't -fail if this should happen. - -[jjulian](User:Jjulian "wikilink") -([talk](User talk:Jjulian "wikilink")) 17:21, 13 August 2013 (UTC) - -
-This is akin to a CAP Theorem[^2][^3] problem. If your nodes can't -communicate, what do you do with writes? Our replication approach has -traditionally been CP - enforce quorum, allow writes only among the -majority - and for the sake of satisfying user expectations (or POSIX) -pretty much has to remain CP at least by default. I personally think we -need to allow an AP choice as well, which is why the quorum levels in -NSR are tunable to get that result. - -So, what do we do if a node runs out of journal space? Well, it's unable -to function normally, i.e. it's failed, so it can't count toward quorum. -This would immediately lead to loss of write availability in a two-node -replica set, and could happen easily enough in a three-node replica set -if two similarly configured nodes ran out of journal space -simultaneously. A significant part of the complexity in our design is -around pruning no-longer-needed journal segments, precisely because this -is an icky problem, but even with all the pruning in the world it could -still happen eventually. Therefore the design also includes the notion -of arbiters, which can be quorum-only or can also have their own -journals (with no or partial data). Therefore, your quorum for -admission/journaling purposes can be significantly higher than your -actual replica count. So what options do we have to avoid or deal with -journal exhaustion? - -- Add more journal space (it's just files, so this can be done - reactively during an extended outage). - -- Add arbiters. - -- Decrease the quorum levels. - -- Manually kick a node out of the replica set. - -- Add admission control, artificially delaying new requests as the - journal becomes full. (This one requires more code.) - -If you do \*none\* of these things then yeah, you're scrod. That said, -do you think these options seem sufficient? - -[Jdarcy](User:Jdarcy "wikilink") ([talk](User talk:Jdarcy "wikilink")) -15:27, 29 August 2013 (UTC) - - - -[^1]: - -[^2]: - -[^3]: diff --git a/Feature Planning/GlusterFS 3.6/Persistent AFR Changelog xattributes.md b/Feature Planning/GlusterFS 3.6/Persistent AFR Changelog xattributes.md deleted file mode 100644 index e21b788..0000000 --- a/Feature Planning/GlusterFS 3.6/Persistent AFR Changelog xattributes.md +++ /dev/null @@ -1,178 +0,0 @@ -Feature -------- - -Provide a unique and consistent name for AFR changelog extended -attributes/ client translator names in the volume graph. - -Summary -------- - -Make AFR changelog extended attribute names independent of brick -position in the graph, which ensures that there will be no potential -misdirected self-heals during remove-brick operation. - -Owners ------- - -Ravishankar N -Pranith Kumar K - -Current status --------------- - -Patches merged in master. - - - - - -Detailed Description --------------------- - -BACKGROUND ON THE PROBLEM: =========================== AFR makes use of -changelog extended attributes on a per file basis which records pending -operations on that file and is used to determine the sources and sinks -when healing needs to be done. As of today, AFR uses the client -translator names (from the volume graph) as the names of the changelog -attributes. For eg. 
for a replica 3 volume, each file on every brick has -the following extended attributes: - - trusted.afr.-client-0-->maps to Brick0 - trusted.afr.-client-1-->maps to Brick1 - trusted.afr.-client-2-->maps to Brick2 - -​1) Now when any brick is removed (say Brick1), the graph is regenerated -and AFR maps the xattrs to the bricks so: - - trusted.afr.-client-0-->maps to Brick0 - trusted.afr.-client-1-->maps to Brick2  - -Thus the xattr 'trusted.afr.testvol-client-1' which earlier referred to -Brick1's attributes now refer to Brick-2's. If there are pending -self-heals prior to the remove-brick happened, healing could possibly -happen in the wrong direction thereby causing data loss. - -​2) The second problem is a dependency with Snapshot feature. Snapshot -volumes have new names (UUID based) and thus the (client)xlator names -are different. Eg: \<-client-0\> will now be -\<-client-0\>. When AFR uses these names to query for its -changelog xattrs but the files on the bricks have the old changelog -xattrs. Hence the heal information is completely lost. - -WHAT IS THE EXACT ISSUE WE ARE SOLVING OR OBJECTIVE OF THE -FEATURE/DESIGN? -========================================================================== -In a nutshell, the solution is to generate unique and persistent names -for the client translators so that even if any of the bricks are -removed, the translator names always map to the same bricks. In turn, -AFR, which uses these names for the changelog xattr names also refer to -the correct bricks. - -SOLUTION: - -The solution is explained as a sequence of steps: - -- The client translator names will still use the existing - nomenclature, except that now they are monotonically increasing - (-client-0,1,2...) and are not dependent on the brick - position.Let us call these names as brick-IDs. These brick IDs are - also written to the brickinfo files (in - /var/lib/glusterd/vols//bricks/\*) by glusterd during - volume creation. When the volfile is generated, these brick - brick-IDs form the client xlator names. - -- Whenever a brick operation is performed, the names are retained for - existing bricks irrespective of their position in the graph. New - bricks get the monotonically increasing brick-ID while names for - existing bricks are obtained from the brickinfo file. - -- Note that this approach does not affect client versions (old/new) in - anyway because the clients just use the volume config provided by - the volfile server. - -- For retaining backward compatibility, We need to check two items: - (a)Under what condition is remove brick allowed; (b)When is brick-ID - written to brickinfo file. - -For the above 2 items, the implementation rules will be thus: - -​i) This feature is implemented in 3.6. Lets say its op-version is 5. - -​ii) We need to implement a check to allow remove-brick only if cluster -opversion is \>=5 - -​iii) The brick-ID is written to brickinfo when the nodes are upgraded -(during glusterd restore) and when a peer is probed (i.e. during volfile -import). - -Benefit to GlusterFS --------------------- - -Even if there are pending self-heals, remove-brick operations can be -carried out safely without fear of incorrect heals which may cause data -loss. - -Scope ------ - -### Nature of proposed change - -Modifications will be made in restore, volfile import and volgen -portions of glusterd. 
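As an illustration of the intended behaviour (example volume name 'testvol', brick-IDs assigned as per the scheme described above), the changelog attribute names stay pinned to their bricks across a remove-brick, and a later add-brick simply receives the next unused ID:

    Before remove-brick:
        trusted.afr.testvol-client-0 --> Brick0
        trusted.afr.testvol-client-1 --> Brick1
        trusted.afr.testvol-client-2 --> Brick2

    After removing Brick1:
        trusted.afr.testvol-client-0 --> Brick0
        trusted.afr.testvol-client-2 --> Brick2     (ID retained, no remapping)

    After a subsequent add-brick:
        trusted.afr.testvol-client-0 --> Brick0
        trusted.afr.testvol-client-2 --> Brick2
        trusted.afr.testvol-client-3 --> Brick3     (next monotonically increasing ID)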
### Implications on manageability

N/A

### Implications on presentation layer

N/A

### Implications on persistence layer

N/A

### Implications on 'GlusterFS' backend

N/A

### Modification to GlusterFS metadata

N/A

### Implications on 'glusterd'

As described earlier.

How To Test
-----------

A remove-brick operation needs to be carried out on replicate/distributed-replicate volumes having pending self-heals, and it must be verified that no data is lost. Snapshots of the volumes must also be able to access files without any issues.

User Experience
---------------

N/A

Dependencies
------------

None.

Documentation
-------------

TBD

Status
------

See 'Current status' section.

Comments and Discussion
-----------------------

\ No newline at end of file
diff --git a/Feature Planning/GlusterFS 3.6/RDMA Improvements.md b/Feature Planning/GlusterFS 3.6/RDMA Improvements.md
deleted file mode 100644
index 1e71729..0000000
--- a/Feature Planning/GlusterFS 3.6/RDMA Improvements.md
+++ /dev/null
@@ -1,101 +0,0 @@
Feature
-------

**RDMA Improvements**

Summary
-------

This proposal is about getting RDMA volumes out of tech preview.

Owners
------

Raghavendra Gowdappa
Vijay Bellur

Current status
--------------

Work in progress

Detailed Description
--------------------

Fix known and unknown issues in volumes with transport type rdma so that RDMA can be used as the interconnect between clients and servers and between servers.

- Performance issues - performance was found to be poor when plain ib-verbs send/recv was compared with RDMA reads and writes.
- Co-existence with tcp - there seemed to be some memory corruptions when both tcp and rdma transports were in use.
- librdmacm for connection management - this requires the brick to listen on an IPoIB address, which affects the current flexibility where a peer can connect to either an ethernet or an infiniband address. The related Better Peer Identification feature will help resolve this issue.
- More testing required

Benefit to GlusterFS
--------------------

Scope
-----

### Nature of proposed change

Bug-fixes to transport/rdma

### Implications on manageability

Remove the warning about creation of rdma volumes in the CLI.
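For illustration, once the warning is removed the usual create/start flow would apply to rdma volumes as well (the volume name, host names, and brick paths below are placeholders):

    gluster volume create rdmavol replica 2 transport rdma \
            server1:/bricks/b1 server2:/bricks/b2
    gluster volume start rdmavol

Clients on the InfiniBand network can then mount the volume over the rdma transport as usual.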
- -### Implications on presentation layer - -TBD - -### Implications on persistence layer - -No impact - -### Implications on 'GlusterFS' backend - -No impact - -### Modification to GlusterFS metadata - -No impact - -### Implications on 'glusterd' - -No impact - -How To Test ------------ - -TBD - -User Experience ---------------- - -TBD - -Dependencies ------------- - -Better Peer identification - -Documentation -------------- - -TBD - -Status ------- - -In development - -Comments and Discussion ------------------------ diff --git a/Feature Planning/GlusterFS 3.6/Server-side Barrier feature.md b/Feature Planning/GlusterFS 3.6/Server-side Barrier feature.md deleted file mode 100644 index c13e25a..0000000 --- a/Feature Planning/GlusterFS 3.6/Server-side Barrier feature.md +++ /dev/null @@ -1,213 +0,0 @@ -Server-side barrier feature -=========================== - -- Author(s): Varun Shastry, Krishnan Parthasarathi -- Date: Jan 28 2014 -- Bugzilla: -- Document ID: BZ1060002 -- Document Version: 1 -- Obsoletes: NA - -Abstract --------- - -Snapshot feature needs a mechanism in GlusterFS, where acknowledgements -to file operations (FOPs) are held back until the snapshot of all the -bricks of the volume are taken. - -The barrier feature would stop holding back FOPs after a configurable -'barrier-timeout' seconds. This is to prevent an accidental lockdown of -the volume. - -This mechanism should have the following properties: - -- Should keep 'barriering' transparent to the applications. -- Should not acknowledge FOPs that fall into the barrier class. A FOP - that when acknowledged to the application, could lead to the - snapshot of the volume become inconsistent, is a barrier class FOP. - -With the below example of 'unlink' how a FOP is classified as barrier -class is explained. - -For the following sequence of events, assuming unlink FOP was not -barriered. Assume a replicate volume with two bricks, namely b1 and b2. - - b1 b2 - time ---------------------------------- - | t1 snapshot - | t2 unlink /a unlink /a - \/ t3 mkdir /a mkdir /a - t4 snapshot - -The result of the sequence of events will store /a as a file in snapshot -b1 while /a is stored as directory in snapshot b2. This leads to split -brain problem of the AFR and in other way inconsistency of the volume. - -Copyright ---------- - -Copyright (c) 2014 Red Hat, Inc. - -This feature is licensed under your choice of the GNU Lesser General -Public License, version 3 or any later version (LGPLv3 or later), or the -GNU General Public License, version 2 (GPLv2), in all cases as published -by the Free Software Foundation. - -Introduction ------------- - -The volume snapshot feature snapshots a volume by snapshotting -individual bricks, that are available, using the lvm-snapshot -technology. As part of using lvm-snapshot, the design requires bricks to -be free from few set of modifications (fops in Barrier Class) to avoid -the inconsistency. This is where the server-side barriering of FOPs -comes into picture. - -Terminology ------------ - -- barrier(ing) - To make barrier fops temporarily inactive or - disabled. -- available - A brick is said to be available when the corresponding - glusterfsd process is running and serving file operations. -- FOP - File Operation - -High Level Design ------------------ - -### Architecture/Design Overview - -- Server-side barriering, for Snapshot, must be enabled/disabled on - the bricks of a volume in a synchronous manner. ie, any command - using this would be blocked until barriering is enabled/disabled. 
- The brick process would provide this mechanism via an RPC. -- Barrier translator would be placed immediately above io-threads - translator in the server/brick stack. -- Barrier translator would queue FOPs when enabled. On disable, the - translator dequeues all the FOPs, while serving new FOPs from - application. By default, barriering is disabled. -- The barrier feature would stop blocking the acknowledgements of FOPs - after a configurable 'barrier-timeout' seconds. This is to prevent - an accidental lockdown of the volume. -- Operations those fall into barrier class are listed below. Any other - fop not listed below does not fall into this category and hence are - not barriered. - - rmdir - - unlink - - rename - - [f]truncate - - fsync - - write with O\_SYNC flag - - [f]removexattr - -### Design Feature - -Following timeline diagram depicts message exchanges between glusterd -and brick during enable and disable of barriering. This diagram assumes -that enable operation is synchronous and disable is asynchronous. See -below for alternatives. - - glusterd (snapshot) barrier @ brick - ------------------ --------------- - t1 | | - t2 | continue to pass through - | all the fops - t3 send 'enable' | - t4 | * starts barriering the fops - | * send back the ack - t5 receive the ack | - | | - t6 | <take snap> | - | . | - | . | - | . | - | </take snap> | - | | - t7 send disable | - (does not wait for the ack) | - t8 | release all the holded fops - | and no more barriering - | | - t9 | continue in PASS_THROUGH mode - -Glusterd would send an RPC (described in API section), to enable -barriering on a brick, by setting option feature.barrier to 'ON' in -barrier translator. This would be performed on all the bricks present in -that node, belonging to the set of volumes that are being snapshotted. - -Disable of barriering can happen in synchronous or asynchronous mode. -The choice is left to the consumer of this feature. - -On disable, all FOPs queued up will be dequeued. Simultaneously the -subsequent barrier request(s) will be served. - -Barrier option enable/disable is persisted into the volfile. This is to -make the feature available for consumers in asynchronous mode, like any -other (configurable) feature. - -Barrier feature also has timeout option based on which dequeuing would -get triggered if the consumer fails to send the disable request. - -Low-level details of Barrier translator working ------------------------------------------------ - -The translator operates in one of two states, namely QUEUEING and -PASS\_THROUGH. - -When barriering is enabled, the translator moves to QUEUEING state. It -queues outgoing FOPs thereafter in the call back path. - -When barriering is disabled, the translator moves to PASS\_THROUGH state -and does not queue when it is in PASS\_THROUGH state. Additionally, the -queued FOPs are 'released', when the translator moves from QUEUEING to -PASS\_THROUGH state. - -It has a translator global queue (doubly linked lists, see -libglusterfs/src/list.h) where the FOPs are queued in the form of a call -stub (see libglusterfs/src/call-stub.[ch]) - -When the FOP has succeeded, but barrier translator failed to queue in -the call back, the barrier translator would disable barriering and -release any queued FOPs, barrier would inform the consumer about this -failure on succesive disable request. 
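The queueing logic described above can be summarised with a small self-contained model. It is illustrative only: the in-tree translator queues gluster call stubs (libglusterfs/src/call-stub.[ch]) on a list from libglusterfs/src/list.h, not the toy structures used here.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    enum barrier_state { PASS_THROUGH, QUEUEING };

    struct fop_ack {                     /* stands in for a queued call stub */
            int             id;
            struct fop_ack *next;
    };

    struct barrier {
            enum barrier_state state;
            struct fop_ack    *head;     /* FIFO of held-back acknowledgements */
            struct fop_ack    *tail;
            pthread_mutex_t    lock;
    };

    static void ack_now (struct fop_ack *a)
    {
            printf ("acknowledge fop %d\n", a->id);
            free (a);
    }

    /* Callback (ack) path of a barrier-class FOP: hold the ack while QUEUEING. */
    static void barrier_ack (struct barrier *b, int fop_id)
    {
            struct fop_ack *a = calloc (1, sizeof (*a));
            a->id = fop_id;

            pthread_mutex_lock (&b->lock);
            if (b->state == QUEUEING) {
                    if (b->tail)
                            b->tail->next = a;
                    else
                            b->head = a;
                    b->tail = a;
                    pthread_mutex_unlock (&b->lock);
                    return;                    /* ack held back */
            }
            pthread_mutex_unlock (&b->lock);
            ack_now (a);                       /* PASS_THROUGH: ack immediately */
    }

    /* Disable path (explicit request or barrier-timeout expiry): release all. */
    static void barrier_disable (struct barrier *b)
    {
            pthread_mutex_lock (&b->lock);
            b->state = PASS_THROUGH;
            struct fop_ack *a = b->head;
            b->head = b->tail = NULL;
            pthread_mutex_unlock (&b->lock);

            while (a) {                        /* dequeue in arrival order */
                    struct fop_ack *next = a->next;
                    ack_now (a);
                    a = next;
            }
    }

    int main (void)
    {
            struct barrier b = { QUEUEING, NULL, NULL, PTHREAD_MUTEX_INITIALIZER };

            barrier_ack (&b, 1);               /* held                         */
            barrier_ack (&b, 2);               /* held                         */
            barrier_disable (&b);              /* both acks released, in order */
            barrier_ack (&b, 3);               /* acked immediately            */
            return 0;
    }

Whether the ack of a given FOP is held depends only on the translator's state at the time its callback runs, which is why enabling and disabling must be coordinated with the snapshot steps shown in the timeline above.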
- -Interfaces ----------- - -### Application Programming Interface - -- An RPC procedure is added at the brick side, which allows any client - [sic] to set the feature.barrier option of the barrier translator - with a given value. -- Glusterd would be using this to set server-side-barriering on, on a - brick. - -Performance Considerations --------------------------- - -- The barriering of FOPs may be perceived as a performance degrade by - the applications. Since this is a hard requirement for snapshot, the - onus is on the snapshot feature to reduce the window for which - barriering is enabled. - -### Scalability - -- In glusterd, each brick operation is executed in a serial manner. - So, the latency of enabling barriering is a function of the no. of - bricks present on the node of the set of volumes being snapshotted. - This is not a scalability limitation of the mechanism of enabling - barriering but a limitation in the brick operations mechanism in - glusterd. - -Migration Considerations ------------------------- - -The barrier translator is introduced with op-version 4. It is a -server-side translator and does not impact older clients even when this -feature is enabled. - -Installation and deployment ---------------------------- - -- Barrier xlator is not packaged with glusterfs-server rpm. With this - changes, this has to be added to the rpm. diff --git a/Feature Planning/GlusterFS 3.6/Thousand Node Gluster.md b/Feature Planning/GlusterFS 3.6/Thousand Node Gluster.md deleted file mode 100644 index 54c3e13..0000000 --- a/Feature Planning/GlusterFS 3.6/Thousand Node Gluster.md +++ /dev/null @@ -1,150 +0,0 @@ -Goal ----- - -Thousand-node scalability for glusterd - -Summary -======= - -This "feature" is really a set of infrastructure changes that will -enable glusterd to manage a thousand servers gracefully. - -Owners -====== - -Krishnan Parthasarathi -Jeff Darcy - -Current status -============== - -Proposed, awaiting summit for approval. - -Related Feature Requests and Bugs -================================= - -N/A - -Detailed Description -==================== - -There are three major areas of change included in this proposal. - -- Replace the current order-n-squared heartbeat/membership protocol - with a much smaller "monitor cluster" based on Paxos or - [Raft](https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf), - to which I/O servers check in. - -- Use the monitor cluster to designate specific functions or roles - - e.g. self-heal, rebalance, leadership in an NSR subvolume - to I/O - servers in a coordinated and globally optimal fashion. - -- Replace the current system of replicating configuration data on all - servers (providing practically no guarantee of consistency if one is - absent during a configuration change) with storage of configuration - data in the monitor cluster. - -Benefit to GlusterFS -==================== - -Scaling of our management plane to 1000+ nodes, enabling competition -with other projects such as HDFS or Ceph which already have or claim -such scalability. - -Scope -===== - -Nature of proposed change -------------------------- - -Functionality very similar to what we need in the monitor cluster -already exists in some of the Raft implementations, notably -[etcd](https://github.com/coreos/etcd). Such a component could provide -the services described above to a modified glusterd running on each -server. 
The changes to glusterd would mostly consist of removing the -current heartbeat and config-storage code, replacing it with calls into -(and callbacks from) the monitor cluster. - -Implications on manageability ------------------------------ - -Enabling/starting monitor daemons on those few nodes that have them must -be done separately from starting glusterd. Since the changes mostly are -to how each glusterd interacts with others and with its own local -storage back end, interactions with the CLI or with glusterfsd need not -change. - -Implications on presentation layer ----------------------------------- - -N/A - -Implications on persistence layer ---------------------------------- - -N/A - -Implications on 'GlusterFS' backend ------------------------------------ - -N/A - -Modification to GlusterFS metadata ----------------------------------- - -The monitor daemons need space for their data, much like that currently -maintained in /var/lib/glusterd currently. - -Implications on 'glusterd' --------------------------- - -Drastic. See sections above. - -How To Test -=========== - -A new set of tests for the monitor-cluster functionality will need to be -developed, perhaps derived from those for the external project if we -adopt one. Most tests related to our multi-node testing facilities -(cluster.rc) will also need to change. Tests which merely invoke the CLI -should require little if any change. - -User Experience -=============== - -Minimal change. - -Dependencies -============ - -A mature/stable enough implementation of Raft or a similar protocol. -Failing that, we'd need to develop our own service along similar lines. - -Documentation -============= - -TBD. - -Status -====== - -In design. - -The choice of technology and approaches are being discussed on the --devel ML. - -- "Proposal for Glusterd-2.0" - - [1](http://www.gluster.org/pipermail/gluster-users/2014-September/018639.html) - -: Though the discussion has become passive, the question is whether we - choose to implement consensus algorithm inside our project or depend - on external projects that provide similar service. - -- "Management volume proposal" - - [2](http://www.gluster.org/pipermail/gluster-devel/2014-November/042944.html) - -: This has limitations due to the circular dependency making it - infeasible. - -Comments and Discussion -======================= diff --git a/Feature Planning/GlusterFS 3.6/afrv2.md b/Feature Planning/GlusterFS 3.6/afrv2.md deleted file mode 100644 index a1767c7..0000000 --- a/Feature Planning/GlusterFS 3.6/afrv2.md +++ /dev/null @@ -1,244 +0,0 @@ -Feature -------- - -This feature is major code re-factor of current afr along with a key -design change in the way changelog extended attributes are stored in -afr. - -Summary -------- - -This feature introduces design change in afr which separates ongoing -transaction, pending operation count for files/directories. - -Owners ------- - -Anand Avati -Pranith Kumar Karampuri - -Current status --------------- - -The feature is in final stages of review at - - -Detailed Description --------------------- - -How AFR works: - -In order to keep track of what copies of the file are modified and up to -date, and what copies require to be healed, AFR keeps state information -in the extended attributes of the file called changelog extended -attributes. These extended attributes stores that copy's view of how up -to date the other copies are. The extended attributes are modified in a -transaction which consists of 5 phases - LOCK, PRE-OP, OP, POST-OP and -UNLOCK. 
In the PRE-OP phase the extended attributes are updated to store -the intent of modification (in the OP phase.) - -In the POST-OP phase, depending on how many servers crashed mid way and -on how many servers the OP was applied successfully, a corresponding -change is made in the extended attributes (of the surviving copies) to -represent the staleness of the copies which missed the OP phase. - -Further, when those lagging servers become available, healing decisions -are taken based on these extended attribute values. - -Today, a PRE-OP increments the pending counters of all elements in the -array (where each element represents a server, and therefore one of the -members of the array represents that server itself.) The POST-OP -decrements those counters which represent servers where the operation -was successful. The update is performed on all the servers which have -made it till the POST-OP phase. The decision of whether a server crashed -in the middle of a transaction or whether the server lived through the -transaction and witnessed the other server crash, is inferred by -inspecting the extended attributes of all servers together. Because -there is no distinction between these counters as to how many of those -increments represent "in transit" operations and how many of those are -retained without decrement to represent "pending counters", there is -value in adding clarity to the system by separating the two. - -The change is to now have only one dirty flag on each server per file. -We also make the PRE-OP increment only that dirty flag rather than all -the elements of the pending array. The dirty flag must be set before -performing the operation, and based on which of the servers the -operation failed, we will set the pending counters representing these -failed servers on the remaining ones in the POST-OP phase. The dirty -counter is also cleared at the end of the POST-OP. This means, in -successful operations only the dirty flag (one integer) is incremented -and decremented per server per file. However if a pending counter is set -because of an operation failure, then the flag is an unambiguous "finger -pointing" at the other server. Meaning, if a file has a pending counter -AND a dirty flag, it will not undermine the "strength" of the pending -counter. This change completely removes today's ambiguity of whether a -pending counter represents a still ongoing operation (or crashed in -transit) vs a surely missed operation. - -Benefit to GlusterFS --------------------- - -It increases the clarity of whether a file has any ongoing transactions -and any pending self-heals. Code is more maintainable now. 
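The bookkeeping change described in the Detailed Description can be condensed into the following illustrative sketch; the names are hypothetical and stand in for the actual changelog xattr updates.

    #define REPLICA_COUNT 3

    /* Per-file, per-server view of the new changelog accounting (sketch). */
    struct afr_changelog {
            int dirty;                      /* ongoing transactions on this file */
            int pending[REPLICA_COUNT];     /* definitely missed ops, per server */
    };

    /* PRE-OP: only the dirty flag is raised, on every participating server. */
    static void pre_op (struct afr_changelog *c)
    {
            c->dirty++;
    }

    /* POST-OP: point the finger at servers where the OP failed, then clear dirty. */
    static void post_op (struct afr_changelog *c, const int op_failed[REPLICA_COUNT])
    {
            for (int i = 0; i < REPLICA_COUNT; i++)
                    if (op_failed[i])
                            c->pending[i]++;    /* unambiguous "missed operation"  */
            c->dirty--;                         /* transaction no longer in flight */
    }

A non-zero dirty flag left behind after a crash therefore indicates an interrupted transaction, while non-zero pending counters always identify copies that definitely missed operations, which removes the ambiguity of the single pending array.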
- -Scope ------ - -### Nature of proposed change - -- Remove client side self-healing completely (opendir, openfd, lookup) - -Re-work readdir-failover to work reliably in case of NFS - Remove -unused/dead lock recovery code - Consistently use xdata in both calls -and callbacks in all FOPs - Per-inode event generation, used to force -inode ctx refresh - Implement dirty flag support (in place of pending -counts) - Eliminate inode ctx structure, use read subvol bits + -event\_generation - Implement inode ctx refreshing based on event -generation - Provide backward compatibility in transactions - remove -unused variables and functions - make code more consistent in style and -pattern - regularize and clean up inode-write transaction code - -regularize and clean up dir-write transaction code - regularize and -clean up common FOPs - reorganize transaction framework code - skip -setting xattrs in pending dict if nothing is pending - re-write -self-healing code using syncops - re-write simpler self-heal-daemon - -### Implications on manageability - -None - -### Implications on presentation layer - -None - -### Implications on persistence layer - -None - -### Implications on 'GlusterFS' backend - -None - -### Modification to GlusterFS metadata - -This changes the way pending counts vs Ongoing transactions are -represented in changelog extended attributes. - -### Implications on 'glusterd' - -None - -How To Test ------------ - -Same test cases of afrv1 hold. - -User Experience ---------------- - -None - -Dependencies ------------- - -None - -Documentation -------------- - ---- - -Status ------- - -The feature is in final stages of review at - - -Comments and Discussion ------------------------ - ---- - -Summary -------- - - - -Owners ------- - - - -Current status --------------- - - - -Detailed Description --------------------- - - - -Benefit to GlusterFS --------------------- - - - -Scope ------ - -### Nature of proposed change - - - -### Implications on manageability - - - -### Implications on presentation layer - - - -### Implications on persistence layer - - - -### Implications on 'GlusterFS' backend - - - -### Modification to GlusterFS metadata - -