Diffstat (limited to 'Feature Planning/GlusterFS 3.6')
15 files changed, 2588 insertions, 0 deletions
diff --git a/Feature Planning/GlusterFS 3.6/Better Logging.md b/Feature Planning/GlusterFS 3.6/Better Logging.md new file mode 100644 index 0000000..6aad602 --- /dev/null +++ b/Feature Planning/GlusterFS 3.6/Better Logging.md @@ -0,0 +1,348 @@ +Feature +------- + +Gluster logging enhancements to support message IDs per message + +Summary +------- + +Enhance gluster logging to provide the following features, SubFeature +--\> SF + +- SF1: Add message IDs to message + +- SF2: Standardize error num reporting across messages + +- SF3: Enable repetitive message suppression in logs + +- SF4: Log location and hierarchy standardization (in case anything is +further required here, analysis pending) + +- SF5: Enable per sub-module logging level configuration + +- SF6: Enable logging to other frameworks, than just the current gluster +logs + +- SF7: Generate a catalogue of these message, with message ID, message, +reason for occurrence, recovery/troubleshooting steps. + +Owners +------ + +Balamurugan Arumugam <barumuga@redhat.com> +Krishnan Parthasarathi <kparthas@redhat.com> +Krutika Dhananjay <kdhananj@redhat.com> +Shyamsundar Ranganathan <srangana@redhat.com> + +Current status +-------------- + +### Existing infrastructure: + +Currently gf\_logXXX exists as an infrastructure API for all logging +related needs. This (typically) takes the form, + +gf\_log(dom, levl, fmt...) + +where, + + dom: Open format string usually the xlator name, or "cli" or volume name etc. + levl: One of, GF_LOG_EMERG, GF_LOG_ALERT, GF_LOG_CRITICAL, GF_LOG_ERROR, GF_LOG_WARNING, GF_LOG_NOTICE, GF_LOG_INFO, GF_LOG_DEBUG, GF_LOG_TRACE + fmt: the actual message string, followed by the required arguments in the string + +The log initialization happens through, + +gf\_log\_init (void \*data, const char \*filename, const char \*ident) + +where, + + data: glusterfs_ctx_t, largely unused in logging other than the required FILE and mutex fields + filename: file name to log to + ident: Like syslog ident parameter, largely unused + +The above infrastructure leads to logs of type, (sample extraction from +nfs.log) + + [2013-12-08 14:17:17.603879] I [socket.c:3485:socket_init] 0-socket.ACL: SSL support is NOT enabled + [2013-12-08 14:17:17.603937] I [socket.c:3500:socket_init] 0-socket.ACL: using system polling thread + [2013-12-08 14:17:17.612128] I [nfs.c:934:init] 0-nfs: NFS service started + [2013-12-08 14:17:17.612383] I [dht-shared.c:311:dht_init_regex] 0-testvol-dht: using regex rsync-hash-regex = ^\.(.+)\.[^.]+$ + +### Limitations/Issues in the infrastructure + +1) Auto analysis of logs needs to be done based on the final message +string. Automated tools that can help with log message and related +troubleshooting options need to use the final string, which needs to be +intelligently parsed and also may change between releases. It would be +desirable to have message IDs so that such tools and trouble shooting +options can leverage the same in a much easier fashion. + +2) The log message itself currently does not use the \_ident\_ which +can help as we move to more common logging frameworks like journald, +rsyslog (or syslog as the case maybe) + +3) errno is the primary identifier of errors across gluster, i.e we do +not have error codes in gluster and use errno values everywhere. 
The log +messages currently do not lend themselves to standardization like +printing the string equivalent of errno rather than the actual errno +value, which \_could\_ be cryptic to administrators + +4) Typical logging infrastructures provide suppression (on a +configurable basis) for repetitive messages to prevent log flooding, +this is currently missing in the current infrastructure + +5) The current infrastructure cannot be used to control log levels at a +per xlator or sub module, as the \_dom\_ passed is a string that change +based on volume name, translator name etc. It would be desirable to have +a better module identification mechanism that can help with this +feature. + +6) Currently the entire logging infrastructure resides within gluster. +It would be desirable in scaled situations to have centralized logging +and monitoring solutions in place, to be able to better analyse and +monitor the cluster health and take actions. + +This requires some form of pluggable logging frameworks that can be used +within gluster to enable this possibility. Currently the existing +framework is used throughout gluster and hence we need only to change +configuration and logging.c to enable logging to other frameworks (as an +example the current syslog plug that was provided). + +It would be desirable to enhance this to provide a more robust framework +for future extensions to other frameworks. This is not a limitation of +the current framework, so much as a re-factor to be able to switch +logging frameworks with more ease. + +7) For centralized logging in the future, it would need better +identification strings from various gluster processes and hosts, which +is currently missing or suppressed in the logging infrastructure. + +Due to the nature of enhancements proposed, it is required that we +better the current infrastructure for the stated needs and do some +future proofing in terms of newer messages that would be added. + +Detailed Description +-------------------- + +NOTE: Covering details for SF1, SF2, and partially SF3, SF5, SF6. SF4/7 +will be covered in later revisions/phases. + +### Logging API changes: + +1) Change the logging API as follows, + +From: gf\_log(dom, levl, fmt...) + +To: gf\_msg(dom, levl, errnum, msgid, fmt...) + +Where: + + dom: Open string as used in the current logging infrastructure (helps in backward compat) + levl: As in current logging infrastructure (current levels seem sufficient enough to not add more levels for better debuggability etc.) + <new fields> + msgid: A message identifier, unique to this message FMT string and possibly this invocation. (SF1, lending to SF3) + errnum: The errno that this message is generated for (with an implicit 0 meaning no error number per se with this message) (SF2) + +NOTE: Internally the gf\_msg would still be a macro that would add the +\_\_FILE\_\_ \_\_LINE\_\_ \_\_FUNCTION\_\_ arguments + +2) Enforce \_ident\_ in the logging initialization API, gf\_log\_init +(void \*data, const char \*filename, const char \*ident) + +Where: + + ident would be the identifier string like, nfs, <mountpoint>, brick-<brick-name>, cli, glusterd, as is the case with the log file name that is generated today (lending to SF6) + +#### What this achieves: + +With the above changes, we now have a message ID per message +(\_msgid\_), location of the message in terms of which component +(\_dom\_) and which process (\_ident\_). 
The further identification of +the message location in terms of host (ip/name) can be done in the +framework, when centralized logging infrastructure is introduced. + +#### Log message changes: + +With the above changes to the API the log message can now appear in a +compatibility mode to adhere to current logging format, or be presented +as follows, + +log invoked as: gf\_msg(dom, levl, ENOTSUP, msgidX) + +Example: gf\_msg ("logchecks", GF\_LOG\_CRITICAL, 22, logchecks\_msg\_4, +42, "Forty-Two", 42); + +Where: logchecks\_msg\_4 (GLFS\_COMP\_BASE + 4), "Critical: Format +testing: %d:%s:%x" + +1) Gluster logging framework (logged as) + + [2014-02-17 08:52:28.038267] I [MSGID: 1002] [logchecks.c:44:go_log] 0-logchecks: Informational: Format testing: 42:Forty-Two:2a [Invalid argument] + +2) syslog (passed as) + + Feb 17 14:17:42 somari logchecks[26205]: [MSGID: 1002] [logchecks.c:44:go_log] 0-logchecks: Informational: Format testing: 42:Forty-Two:2a [Invalid argument] + +3) journald (passed as) + + sd_journal_send("MESSAGE=<vasprintf(dom, msgid(fmt))>", + "MESSAGE_ID=msgid", + "PRIORITY=levl", + "CODE_FILE=`<fname>`", "CODE_LINE=`<lnum>", "CODE_FUNC=<fnnam>", + "ERRNO=errnum", + "SYSLOG_IDENTIFIER=<ident>" + NULL); + +4) CEE (Common Event Expression) format string passed to any CEE +consumer (say lumberjack) + +Based on generating @CEE JSON string as per specifications and passing +it the infrastructure in question. + +#### Message ID generation: + +1) Some rules for message IDs + +- Every message, even if it is the same message FMT, will have a unique +message ID - Changes to a specific message string, hence will not change +its ID and also not impact other locations in the code that use the same +message FMT + +2) A glfs-message-id.h file would contain ranges per component for +individual component based messages to be created without overlapping on +the ranges. + +3) <component>-message.h would contain something as follows, + + #define GLFS_COMP_BASE GLFS_MSGID_COMP_<component> + #define GLFS_NUM_MESSAGES 1 + #define GLFS_MSGID_END (GLFS_COMP_BASE + GLFS_NUM_MESSAGES + 1) + /* Messaged with message IDs */ + #define glfs_msg_start_x GLFS_COMP_BASE, "Invalid: Start of messages" + /*------------*/ + #define <component>_msg_1 (GLFS_COMP_BASE + 1), "Test message, replace with"\ + " original when using the template" + /*------------*/ + #define glfs_msg_end_x GLFS_MSGID_END, "Invalid: End of messages" + +5) Each call to gf\_msg hence would be, + + gf_msg(dom, levl, errnum, glfs_msg_x, ...) + +#### Setting per xlator logging levels (SF5): + +short description to be elaborated later + +Leverage this-\>loglevel to override the global loglevel. This can be +also configured from gluster CLI at runtime to change the log levels at +a per xlator level for targeted debugging. 
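The sketch below shows one way such an override could behave (a minimal, self-contained toy program; it only borrows the this->loglevel idea and the log level names from above, and everything else, including the xlator stand-in, the should_log() helper and the use of GF_LOG_NONE as a "not set" sentinel, is an assumption made for illustration and is not actual GlusterFS code):

    #include <stdio.h>

    typedef enum {
        GF_LOG_NONE = 0, GF_LOG_EMERG, GF_LOG_ALERT, GF_LOG_CRITICAL,
        GF_LOG_ERROR, GF_LOG_WARNING, GF_LOG_NOTICE, GF_LOG_INFO,
        GF_LOG_DEBUG, GF_LOG_TRACE
    } gf_loglevel_t;

    struct xlator { gf_loglevel_t loglevel; };  /* stand-in for xlator_t */

    static gf_loglevel_t global_loglevel = GF_LOG_INFO;

    /* Emit a message only if it passes the per-xlator level when one is set,
     * otherwise fall back to the global level (NONE here means "not set"). */
    static int should_log(const struct xlator *this, gf_loglevel_t level)
    {
        gf_loglevel_t effective = (this && this->loglevel != GF_LOG_NONE)
                                      ? this->loglevel
                                      : global_loglevel;
        return level <= effective;
    }

    int main(void)
    {
        struct xlator dht = { .loglevel = GF_LOG_DEBUG }; /* raised for debugging */
        struct xlator afr = { .loglevel = GF_LOG_NONE };  /* uses global level    */

        printf("dht DEBUG logged: %d\n", should_log(&dht, GF_LOG_DEBUG)); /* 1 */
        printf("afr DEBUG logged: %d\n", should_log(&afr, GF_LOG_DEBUG)); /* 0 */
        return 0;
    }

With a check of this kind inside the logging path, raising the level on a single xlator instance would surface its DEBUG/TRACE messages without flooding the log with output from every other translator.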
+ +#### Multiple log suppression(SF3): + +short description to be elaborated later + +1) Save the message string as follows, Msg\_Object(msgid, +msgstring(vasprintf(dom, fmt)), timestamp, repetitions) + +2) On each message received by the logging infrastructure check the +list of saved last few Msg\_Objects as follows, + +2.1) compare msgid and on success compare msgstring for a match, compare +repetition tolerance time with current TS and saved TS in the +Msg\_Object + +2.1.1) if tolerance is within limits, increment repetitions and do not +print message + +2.1.2) if tolerance is outside limits, print repetition count for saved +message (if any) and print the new message + +2.2) If none of the messages match the current message, knock off the +oldest message in the list printing any repetition count message for the +same, and stash new message into the list + +The key things to remember and act on here would be to, minimize the +string duplication on each message, and also to keep the comparison +quick (hence base it off message IDs and errno to start with) + +#### Message catalogue (SF7): + +<short description to be elaborated later> + +The idea is to use Doxygen comments in the <component>-message.h per +component, to list information in various sections per message of +consequence and later use Doxygen to publish this catalogue on a per +release basis. + +Benefit to GlusterFS +-------------------- + +The mentioned limitations and auto log analysis benefits would accrue +for GlusterFS + +Scope +----- + +### Nature of proposed change + +All gf\_logXXX function invocations would change to gf\_msgXXX +invocations. + +### Implications on manageability + +None + +### Implications on presentation layer + +None + +### Implications on persistence layer + +None + +### Implications on 'GlusterFS' backend + +None + +### Modification to GlusterFS metadata + +None + +### Implications on 'glusterd' + +None + +How To Test +----------- + +A separate test utility that tests various logs and formats would be +provided to ensure that functionality can be tested independent of +GlusterFS + +User Experience +--------------- + +Users would notice changed logging formats as mentioned above, the +additional field of importance would be the MSGID: + +Dependencies +------------ + +None + +Documentation +------------- + +Intending to add a logging.md (or modify the same) to elaborate on how a +new component should now use the new framework and generate messages +with IDs in the same. + +Status +------ + +In development (see, <http://review.gluster.org/#/c/6547/> ) + +Comments and Discussion +----------------------- + +<Follow here>
\ No newline at end of file diff --git a/Feature Planning/GlusterFS 3.6/Better Peer Identification.md b/Feature Planning/GlusterFS 3.6/Better Peer Identification.md new file mode 100644 index 0000000..a8c6996 --- /dev/null +++ b/Feature Planning/GlusterFS 3.6/Better Peer Identification.md @@ -0,0 +1,172 @@ +Feature +------- + +**Better peer identification** + +Summary +------- + +This proposal is regarding better identification of peers. + +Owners +------ + +Kaushal Madappa <kmadappa@redhat.com> + +Current status +-------------- + +Glusterd currently is inconsistent in the way it identifies peers. This +causes problems when the same peer is referenced with different names in +different gluster commands. + +Detailed Description +-------------------- + +Currently, the way we identify peers is not consistent all through the +gluster code. We use uuids internally and hostnames externally. + +This setup works pretty well when all the peers are on a single network, +have one address, and are referred to in all the gluster commands with +same address. + +But once we start mixing up addresses in the commands (ip, shortnames, +fqdn) and bring in multiple networks we have problems. + +The problems were discussed in the following mailing list threads and +some solutions were proposed. + +- How do we identify peers? [^1] +- RFC - "Connection Groups" concept [^2] + +The solution to the multi-network problem is dependent on the solution +to the peer identification problem. So it'll be good to target fixing +the peer identification problem asap, ie. in 3.6, and take up the +networks problem later. + +Benefit to GlusterFS +-------------------- + +Sanity. It will be great to have all internal identifiers for peers +happening through a UUID, and being translated into a host/IP at the +most superficial layer. + +Scope +----- + +### Nature of proposed change + +The following changes will be done in Glusterd to improve peer +identification. + +1. Peerinfo struct will be extended to have a list of associated + hostnames/addresses, instead of a single hostname as it is + currently. The import/export and store/restore functions will be + changed to handle this. CLI will be updated to show this list of + addresses in peer status and pool list commands. +2. Peer probe will be changed to append an address to the peerinfo + address list, when we observe that the given address belongs to an + existing peer. +3. Have a new API for translation between hostname/addresses into + UUIDs. This new API will be used in all places where + hostnames/addresses were being validated, including peer probe, peer + detach, volume create, add-brick, remove-brick etc. +4. A new command - 'gluster peer add-address <existing> <new-address>' + - which appends to the address list will be implemented if time + permits. +5. A new command - 'gluster peer rename <existing> <new>' - which will + rename all occurrences of a peer with the newly given name will be + implemented if time permits. + +Changes 1-3 are the base for the other changes and will the primary +deliverables for this feature. + +### Implications on manageability + +The primary changes will bring about some changes to the CLI output of +'peer status' and 'pool list' commands. The normal and XML outputs for +these commands will contain a list of addresses for each peer, instead +of a single hostname. + +Tools depending on the output of these commands will need to be updated. 
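Purely as an illustration of the direction (the hostname, addresses and UUID below are dummy values, and the exact layout is yet to be finalized), a peer entry in the 'gluster peer status' output could then list every known address:

    Hostname: server2.example.com
    Uuid: 3e7b2a6d-0d4c-4f3a-9b1e-6f2a8c1d9b42
    State: Peer in Cluster (Connected)
    Other names:
    192.168.1.12
    server2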
+ +**TODO**: *Add sample outputs* + +The new commands 'peer add-address' and 'peer rename' will improve +manageability of peers. + +### Implications on presentation layer + +None + +### Implications on persistence layer + +None + +### Implications on 'GlusterFS' backend + +None + +### Modification to GlusterFS metadata + +None + +### Implications on 'glusterd' + +<persistent store, configuration changes, brick-op...> + +How To Test +----------- + +**TODO:** *Add test cases* + +User Experience +--------------- + +User experience will improve for commands which used peer identifiers +(volume create/add-brick/remove-brick, peer probe, peer detach), as the +the user will no longer face errors caused by mixed usage of +identifiers. + +Dependencies +------------ + +None. + +Documentation +------------- + +The new behaviour of the peer probe command will need to be documented. +The new commands will need to be documented as well. + +**TODO:** *Add more documentations* + +Status +------ + +The feature is under development on forge [^3] and github [^4]. This +github merge request [^5] can be used for performing preliminary +reviews. Once we are satisfied with the changes, it will be posted for +review on gerrit. + +Comments and Discussion +----------------------- + +There are open issues around node crash + re-install with same IP (but +new UUID) which need to be addressed in this effort. + +Links +----- + +<references> +</references> + +[^1]: <http://lists.gnu.org/archive/html/gluster-devel/2013-06/msg00067.html> + +[^2]: <http://lists.gnu.org/archive/html/gluster-devel/2013-06/msg00069.html> + +[^3]: <https://forge.gluster.org/~kshlm/glusterfs-core/kshlms-glusterfs/commits/better-peer-identification> + +[^4]: <https://github.com/kshlm/glusterfs/tree/better-peer-identification> + +[^5]: <https://github.com/kshlm/glusterfs/pull/2> diff --git a/Feature Planning/GlusterFS 3.6/Gluster User Serviceable Snapshots.md b/Feature Planning/GlusterFS 3.6/Gluster User Serviceable Snapshots.md new file mode 100644 index 0000000..9af7062 --- /dev/null +++ b/Feature Planning/GlusterFS 3.6/Gluster User Serviceable Snapshots.md @@ -0,0 +1,39 @@ +Feature +------- + +Enable user-serviceable snapshots for GlusterFS Volumes based on +GlusterFS-Snapshot feature + +Owners +------ + +Anand Avati +Anand Subramanian <anands@redhat.com> +Raghavendra Bhat +Varun Shastry + +Summary +------- + +Each snapshot capable GlusterFS Volume will contain a .snaps directory +through which a user will be able to access previously point-in-time +snapshot copies of his data. This will be enabled through a hidden +.snaps folder in each directory or sub-directory within the volume. +These user-serviceable snapshot copies will be read-only. + +Tests +----- + +1) Enable uss (gluster volume set <volume name> features.uss enable) A +snap daemon should get started for the volume. It should be visible in +gluster volume status command. 2) entering the snapshot world ls on +.snaps from any directory within the filesystem should be successful and +should show the list of snapshots as directories. 3) accessing the +snapshots One of the snapshots can be entered and it should show the +contents of the directory from which .snaps was entered, when the +snapshot was taken. NOTE: If the directory was not present when a +snapshot was taken (say snap1) and created later, then entering snap1 +directory (or any access) will fail with stale file handle. 4) Reading +from snapshots Any kind of read operations from the snapshots should be +successful. 
But any modifications to snapshot data are not allowed. +Snapshots are read-only.
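A condensed walk-through of the steps above (the volume name testvol, mount point /mnt/testvol and snapshot name snap1 are illustrative):

    gluster volume set testvol features.uss enable
    gluster volume status testvol            # the snap daemon should now be listed
    ls /mnt/testvol/dir1/.snaps              # snapshots appear as directories
    ls /mnt/testvol/dir1/.snaps/snap1        # dir1 as it was when snap1 was taken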
\ No newline at end of file diff --git a/Feature Planning/GlusterFS 3.6/Gluster Volume Snapshot.md b/Feature Planning/GlusterFS 3.6/Gluster Volume Snapshot.md new file mode 100644 index 0000000..468992a --- /dev/null +++ b/Feature Planning/GlusterFS 3.6/Gluster Volume Snapshot.md @@ -0,0 +1,354 @@ +Feature +------- + +Snapshot of Gluster Volume + +Summary +------- + +Gluster volume snapshot will provide point-in-time copy of a GlusterFS +volume. This snapshot is an online-snapshot therefore file-system and +its associated data continue to be available for the clients, while the +snapshot is being taken. + +Snapshot of a GlusterFS volume will create another read-only volume +which will be a point-in-time copy of the original volume. Users can use +this read-only volume to recover any file(s) they want. Snapshot will +also provide restore feature which will help the user to recover an +entire volume. The restore operation will replace the original volume +with the snapshot volume. + +Owner(s) +-------- + +Rajesh Joseph <rjoseph@redhat.com> + +Copyright +--------- + +Copyright (c) 2013-2014 Red Hat, Inc. <http://www.redhat.com> + +This feature is licensed under your choice of the GNU Lesser General +Public License, version 3 or any later version (LGPLv3 or later), or the +GNU General Public License, version 2 (GPLv2), in all cases as published +by the Free Software Foundation. + +Current status +-------------- + +Gluster volume snapshot support is provided in GlusterFS 3.6 + +Detailed Description +-------------------- + +GlusterFS snapshot feature will provide a crash consistent point-in-time +copy of Gluster volume(s). This snapshot is an online-snapshot therefore +file-system and its associated data continue to be available for the +clients, while the snapshot is being taken. As of now we are not +planning to provide application level crash consistency. That means if a +snapshot is restored then applications need to rely on journals or other +technique to recover or cleanup some of the operations performed on +GlusterFS volume. + +A GlusterFS volume is made up of multiple bricks spread across multiple +nodes. Each brick translates to a directory path on a given file-system. +The current snapshot design is based on thinly provisioned LVM2 snapshot +feature. Therefore as a prerequisite the Gluster bricks should be on +thinly provisioned LVM. For a single lvm, taking a snapshot would be +straight forward for the admin, but this is compounded in a GlusterFS +volume which has bricks spread across multiple LVM’s across multiple +nodes. Gluster volume snapshot feature aims to provide a set of +interfaces from which the admin can snap and manage the snapshots for +Gluster volumes. + +Gluster volume snapshot is nothing but snapshots of all the bricks in +the volume. So ideally all the bricks should be snapped at the same +time. But with real-life latencies (processor and network) this may not +hold true all the time. Therefore we need to make sure that during +snapshot the file-system is in consistent state. Therefore we barrier +few operation so that the file-system remains in a healthy state during +snapshot. + +For details about barrier [Server Side +Barrier](http://www.gluster.org/community/documentation/index.php/Features/Server-side_Barrier_feature) + +Benefit to GlusterFS +-------------------- + +Snapshot of glusterfs volume allows users to + +- A point in time checkpoint from which to recover/failback +- Allow read-only snaps to be the source of backups. 
+ +Scope +----- + +### Nature of proposed change + +Gluster cli will be modified to provide new commands for snapshot +management. The entire snapshot core implementation will be done in +glusterd. + +Apart from this Snapshot will also make use of quiescing xlator for +doing quiescing. This will be a server side translator which will +quiesce will fops which can modify disk state. The quescing will be done +till the snapshot operation is complete. + +### Implications on manageability + +Snapshot will provide new set of cli commands to manage snapshots. REST +APIs are not planned for this release. + +### Implications on persistence layer + +Snapshot will create new volume per snapshot. These volumes are stored +in /var/lib/glusterd/snaps folder. Apart from this each volume will have +additional snapshot related information stored in snap\_list.info file +in its respective vol folder. + +### Implications on 'glusterd' + +Snapshot information and snapshot volume details are stored in +persistent stores. + +How To Test +----------- + +For testing this feature one needs to have mulitple thinly provisioned +volumes or else need to create LVM using loop back devices. + +Details of how to create thin volume can be found at the following link +<https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Logical_Volume_Manager_Administration/thinly_provisioned_volume_creation.html> + +Each brick needs to be in a independent LVM. And these LVMs should be +thinly provisioned. From these bricks create Gluster volume. This volume +can then be used for snapshot testing. + +See the User Experience section for various commands of snapshot. + +User Experience +--------------- + +##### Snapshot creation + + snapshot create <snapname> <volname(s)> [description <description>] [force] + +This command will create a sapshot of the volume identified by volname. +snapname is a mandatory field and the name should be unique in the +entire cluster. Users can also provide an optional description to be +saved along with the snap (max 1024 characters). force keyword is used +if some bricks of orginal volume is down and still you want to take the +snapshot. + +##### Listing of available snaps + + gluster snapshot list [snap-name] [vol <volname>] + +This command is used to list all snapshots taken, or for a specified +volume. If snap-name is provided then it will list the details of that +snap. + +##### Configuring the snapshot behavior + + gluster snapshot config [vol-name] + +This command will display existing config values for a volume. If volume +name is not provided then config values of all the volume is displayed. + + gluster snapshot config [vol-name] [<snap-max-limit> <count>] [<snap-max-soft-limit> <percentage>] [force] + +The above command can be used to change the existing config values. If +vol-name is provided then config value of that volume is changed, else +it will set/change the system limit. + +The system limit is the default value of the config for all the volume. +Volume specific limit cannot cross the system limit. If a volume +specific limit is not provided then system limit will be considered. + +If any of this limit is decreased and the current snap count of the +system/volume is more than the limit then the command will fail. If user +still want to decrease the limit then force option should be used. + +**snap-max-limit**: Maximum snapshot limit for a volume. Snapshots +creation will fail if snap count reach this limit. 
+ +**snap-max-soft-limit**: Maximum snapshot limit for a volume. Snapshots +can still be created if snap count reaches this limit. An auto-deletion +will be triggered if this limit is reached. The oldest snaps will be +deleted if snap count reaches this limit. This is represented as +percentage value. + +##### Status of snapshots + + gluster snapshot status ([snap-name] | [volume <vol-name>]) + +Shows the status of all the snapshots or the specified snapshot. The +status will include the brick details, LVM details, process details, +etc. + +##### Activating a snap volume + +By default the snapshot created will be in an inactive state. Use the +following commands to activate snapshot. + + gluster snapshot activate <snap-name> + +##### Deactivating a snap volume + + gluster snapshot deactivate <snap-name> + +The above command will deactivate an active snapshot + +##### Deleting snaps + + gluster snapshot delete <snap-name> + +This command will delete the specified snapshot. + +##### Restoring snaps + + gluster snapshot restore <snap-name> + +This command restores an already taken snapshot of single or multiple +volumes. Snapshot restore is an offline activity therefore if any volume +which is part of the given snap is online then the restore operation +will fail. + +Once the snapshot is restored it will be deleted from the list of +snapshot. + +Dependencies +------------ + +To provide support for a crash-consistent snapshot feature Gluster core +com- ponents itself should be crash-consistent. As of now Gluster as a +whole is not crash-consistent. In this section we will identify those +Gluster components which are not crash-consistent. + +**Geo-Replication**: Geo-replication provides master-slave +synchronization option to Gluster. Geo-replication maintains state +information for completing the sync operation. Therefore ideally when a +snapshot is taken then both the master and slave snapshot should be +taken. And both master and slave snapshot should be in mutually +consistent state. + +Geo-replication make use of change-log to do the sync. By default the +change-log is stored .glusterfs folder in every brick. But the +change-log path is configurable. If change-log is part of the brick then +snapshot will contain the change-log changes as well. But if it is not +then it needs to be saved separately during a snapshot. + +Following things should be considered for making change-log +crash-consistent: + +- Change-log is part of the brick of the same volume. +- Change-log is outside the brick. As of now there is no size limit on + the + +change-log files. We need to answer following questions here + +- - Time taken to make a copy of the entire change-log. Will affect + the + +overall time of snapshot operation. + +- - The location where it can be copied. Will impact the disk usage + of + +the target disk or file-system. + +- Some part of change-log is present in the brick and some are outside + +the brick. This situation will arrive when change-log path is changed +in-between. + +- Change-log is saved in another volume and this volume forms a CG + with + +the volume about to be snapped. + +**Note**: Considering the above points we have decided not to support +change-log stored outside the bricks. + +For this release automatic snapshot of both master and slave session is +not supported. If required user need to explicitly take snapshot of both +master and slave. Following steps need to be followed while taking +snapshot of a master and slave setup + +- Stop geo-replication manually. 
+- Snapshot all the slaves first. +- When the slave snapshot is done then initiate master snapshot. +- When both the snapshot is complete geo-syncronization can be started + again. + +**Gluster Quota**: Quota enables an admin to specify per directory +quota. Quota makes use of marker translator to enforce quota. As of now +the marker framework is not completely crash-consistent. As part of +snapshot feature we need to address following issues. + +- If a snapshot is taken while the contribution size of a file is + being updated then you might end up with a snapshot where there is a + mismatch between the actual size of the file and the contribution of + the file. These in-consistencies can only be rectified when a + look-up is issued on the snapshot volume for the same file. As a + workaround admin needs to issue an explicit file-system crawl to + rectify the problem. +- For NFS, quota makes use of pgfid to build a path from gfid and + enforce quota. As of now pgfid update is not crash-consistent. +- Quota saves its configuration in file-system under /var/lib/glusterd + folder. As part of snapshot feature we need to save this file. + +**NFS**: NFS uses a single graph to represent all the volumes in the +system. And to make all the snapshot volume accessible these snapshot +volumes should be added to this graph. This brings in another +restriction, i.e. all the snapshot names should be unique and +additionally snap name should not clash with any other volume name as +well. + +To handle this situation we have decided to use an internal uuid as snap +name. And keep a mapping of this uuid and user given snap name in an +internal structure. + +Another restriction with NFS is that when a newly created volume +(snapshot volume) is started it will restart NFS server. Therefore we +decided when snapshot is taken it will be in stopped state. Later when +snapshot volume is needed it can be started explicitly. + +**DHT**: DHT xlator decides which node to look for a file/directory. +Some of the DHT fop are not atomic in nature, e.g rename (both file and +directory). Also these operations are not transactional in nature. That +means if a crash happens the data in server might be in an inconsistent +state. Depending upon the time of snapshot and which DHT operation is in +what state there can be an inconsistent snapshot. + +**AFR**: AFR is the high-availability module in Gluster. AFR keeps track +of fresh and correct copy of data using extended attributes. Therefore +it is important that before taking snapshot these extended attributes +are written into the disk. To make sure these attributes are written to +disk snapshot module will issue explicit sync after the +barrier/quiescing. + +The other issue with the current AFR is that it writes the volume name +to the extended attribute of all the files. AFR uses this for +self-healing. When snapshot is taken of such a volume the snapshotted +volume will also have the same volume name. Therefore AFR needs to +create a mapping of the real volume name and the extended entry name in +the volfile. So that correct name can be referred during self-heal. + +Another dependency on AFR is that currently there is no direct API or +call back function which will tell that AFR self-healing is completed on +a volume. This feature is required to heal a snapshot volume before +restore. 
+ +Documentation +------------- + +Status +------ + +In development + +Comments and Discussion +----------------------- + +<Follow here> diff --git a/Feature Planning/GlusterFS 3.6/New Style Replication.md b/Feature Planning/GlusterFS 3.6/New Style Replication.md new file mode 100644 index 0000000..ffd8167 --- /dev/null +++ b/Feature Planning/GlusterFS 3.6/New Style Replication.md @@ -0,0 +1,230 @@ +Goal +---- + +More partition-tolerant replication, with higher performance for most +use cases. + +Summary +------- + +NSR is a new synchronous replication translator, complementing or +perhaps some day replacing AFR. + +Owners +------ + +Jeff Darcy <jdarcy@redhat.com> +Venky Shankar <vshankar@redhat.com> + +Current status +-------------- + +Design and prototype (nearly) complete, implementation beginning. + +Related Feature Requests and Bugs +--------------------------------- + +[AFR bugs related to "split +brain"](https://bugzilla.redhat.com/buglist.cgi?classification=Community&component=replicate&list_id=3040567&product=GlusterFS&query_format=advanced&short_desc=split&short_desc_type=allwordssubstr) + +[AFR bugs related to +"perf"](https://bugzilla.redhat.com/buglist.cgi?classification=Community&component=replicate&list_id=3040572&product=GlusterFS&query_format=advanced&short_desc=perf&short_desc_type=allwordssubstr) + +(Both lists are undoubtedly partial because not all bugs in these areas +using these specific words. In particular, "GFID mismatch" bugs are +really a kind of split brain, but aren't represented.) + +Detailed Description +-------------------- + +NSR is designed to have the following features. + +- Server based - "chain" replication can use bandwidth of both client + and server instead of splitting client bandwidth N ways. + +- Journal based - for reduced network traffic in normal operation, + plus faster recovery and greater resistance to "split brain" errors. + +- Variable consistency model - based on + [Dynamo](http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf) + to provide options trading some consistency for greater availability + and/or performance. + +- Newer, smaller codebase - reduces technical debt, enables higher + replica counts, more informative status reporting and logging, and + other future features (e.g. ordered asynchronous replication). + +Benefit to GlusterFS +==================== + +Faster, more robust, more manageable/maintainable replication. + +Scope +===== + +Nature of proposed change +------------------------- + +At least two new translators will be necessary. + +- A simple client-side translator to route requests to the current + leader among the bricks in a replica set. + +- A server-side translator to handle the "heavy lifting" of + replication, recovery, etc. + +Implications on manageability +----------------------------- + +At a high level, commands to enable, configure, and manage NSR will be +very similar to those already used for AFR. At a lower level, the +options affecting things things like quorum, consistency, and placement +of journals will all be completely different. + +Implications on presentation layer +---------------------------------- + +Minimal. Most changes will be to simplify or remove special handling for +AFR's unique behavior (especially around lookup vs. self-heal). 
+ +Implications on persistence layer +--------------------------------- + +N/A + +Implications on 'GlusterFS' backend +----------------------------------- + +The journal for each brick in an NSR volume might (for performance +reasons) be placed on one or more local volumes other than the one +containing the brick's data. Special requirements around AIO, fsync, +etc. will be less than with AFR. + +Modification to GlusterFS metadata +---------------------------------- + +NSR will not use the same xattrs as AFR, reducing the need for larger +inodes. + +Implications on 'glusterd' +-------------------------- + +Volgen must be able to configure the client-side and server-side parts +of NSR, instead of AFR on the client side and index (which will no +longer be necessary) on the server side. Other interactions with +glusterd should remain mostly the same. + +How To Test +=========== + +Most basic AFR tests - e.g. reading/writing data, killing nodes, +starting/stopping self-heal - would apply to NSR as well. Tests that +embed assumptions about AFR xattrs or other internal artifacts will need +to be re-written. + +User Experience +=============== + +Minimal change, mostly related to new options. + +Dependencies +============ + +NSR depends on a cluster-management framework that can provide +membership tracking, leader election, and robust consistent key/value +data storage. This is expected to be developed in parallel as part of +the glusterd-scalability feature, but can be implemented (in simplified +form) within NSR itself if necessary. + +Documentation +============= + +TBD. + +Status +====== + +Some parts of earlier implementation updated to current tree, others in +the middle of replacement. + +- [New design](http://review.gluster.org/#/c/8915/) + +- [Basic translator code](http://review.gluster.org/#/c/8913/) (needs + update to new code-generation infractructure) + +- [GF\_FOP\_IPC](http://review.gluster.org/#/c/8812/) + +- [etcd support](http://review.gluster.org/#/c/8887/) + +- [New code-generation + infrastructure](http://review.gluster.org/#/c/9411/) + +- [New data-logging + translator](https://forge.gluster.org/~jdarcy/glusterfs-core/jdarcys-glusterfs-data-logging) + +Comments and Discussion +======================= + +My biggest concern with journal-based replication comes from my previous +use of DRBD. They do an "activity log"[^1] which sounds like the same +basic concept. Once that log filled, I experienced cascading failure. +When the journal can be filled faster than it's emptied this could cause +the problem I experienced. + +So what I'm looking to be convinced is how journalled replication +maintains full redundancy and how it will prevent the journal input from +exceeding the capacity of the journal output or at least how it won't +fail if this should happen. + +[jjulian](User:Jjulian "wikilink") +([talk](User talk:Jjulian "wikilink")) 17:21, 13 August 2013 (UTC) + +<hr/> +This is akin to a CAP Theorem[^2][^3] problem. If your nodes can't +communicate, what do you do with writes? Our replication approach has +traditionally been CP - enforce quorum, allow writes only among the +majority - and for the sake of satisfying user expectations (or POSIX) +pretty much has to remain CP at least by default. I personally think we +need to allow an AP choice as well, which is why the quorum levels in +NSR are tunable to get that result. + +So, what do we do if a node runs out of journal space? Well, it's unable +to function normally, i.e. it's failed, so it can't count toward quorum. 
+This would immediately lead to loss of write availability in a two-node +replica set, and could happen easily enough in a three-node replica set +if two similarly configured nodes ran out of journal space +simultaneously. A significant part of the complexity in our design is +around pruning no-longer-needed journal segments, precisely because this +is an icky problem, but even with all the pruning in the world it could +still happen eventually. Therefore the design also includes the notion +of arbiters, which can be quorum-only or can also have their own +journals (with no or partial data). Therefore, your quorum for +admission/journaling purposes can be significantly higher than your +actual replica count. So what options do we have to avoid or deal with +journal exhaustion? + +- Add more journal space (it's just files, so this can be done + reactively during an extended outage). + +- Add arbiters. + +- Decrease the quorum levels. + +- Manually kick a node out of the replica set. + +- Add admission control, artificially delaying new requests as the + journal becomes full. (This one requires more code.) + +If you do \*none\* of these things then yeah, you're scrod. That said, +do you think these options seem sufficient? + +[Jdarcy](User:Jdarcy "wikilink") ([talk](User talk:Jdarcy "wikilink")) +15:27, 29 August 2013 (UTC) + +<references/> + +[^1]: <http://www.drbd.org/users-guide-emb/s-activity-log.html> + +[^2]: <http://www.julianbrowne.com/article/viewer/brewers-cap-theorem> + +[^3]: <http://henryr.github.io/cap-faq/> diff --git a/Feature Planning/GlusterFS 3.6/Persistent AFR Changelog xattributes.md b/Feature Planning/GlusterFS 3.6/Persistent AFR Changelog xattributes.md new file mode 100644 index 0000000..e21b788 --- /dev/null +++ b/Feature Planning/GlusterFS 3.6/Persistent AFR Changelog xattributes.md @@ -0,0 +1,178 @@ +Feature +------- + +Provide a unique and consistent name for AFR changelog extended +attributes/ client translator names in the volume graph. + +Summary +------- + +Make AFR changelog extended attribute names independent of brick +position in the graph, which ensures that there will be no potential +misdirected self-heals during remove-brick operation. + +Owners +------ + +Ravishankar N <ravishankar@redhat.com> +Pranith Kumar K <pkarampu@redhat.com> + +Current status +-------------- + +Patches merged in master. + +<http://review.gluster.org/#/c/7122/> + +<http://review.gluster.org/#/c/7155/> + +Detailed Description +-------------------- + +BACKGROUND ON THE PROBLEM: =========================== AFR makes use of +changelog extended attributes on a per file basis which records pending +operations on that file and is used to determine the sources and sinks +when healing needs to be done. As of today, AFR uses the client +translator names (from the volume graph) as the names of the changelog +attributes. For eg. for a replica 3 volume, each file on every brick has +the following extended attributes: + + trusted.afr.<volname>-client-0-->maps to Brick0 + trusted.afr.<volname>-client-1-->maps to Brick1 + trusted.afr.<volname>-client-2-->maps to Brick2 + +1) Now when any brick is removed (say Brick1), the graph is regenerated +and AFR maps the xattrs to the bricks so: + + trusted.afr.<volname>-client-0-->maps to Brick0 + trusted.afr.<volname>-client-1-->maps to Brick2 + +Thus the xattr 'trusted.afr.testvol-client-1' which earlier referred to +Brick1's attributes now refer to Brick-2's. 
If there are pending +self-heals prior to the remove-brick happened, healing could possibly +happen in the wrong direction thereby causing data loss. + +2) The second problem is a dependency with Snapshot feature. Snapshot +volumes have new names (UUID based) and thus the (client)xlator names +are different. Eg: \<<volname>-client-0\> will now be +\<<snapvolname>-client-0\>. When AFR uses these names to query for its +changelog xattrs but the files on the bricks have the old changelog +xattrs. Hence the heal information is completely lost. + +WHAT IS THE EXACT ISSUE WE ARE SOLVING OR OBJECTIVE OF THE +FEATURE/DESIGN? +========================================================================== +In a nutshell, the solution is to generate unique and persistent names +for the client translators so that even if any of the bricks are +removed, the translator names always map to the same bricks. In turn, +AFR, which uses these names for the changelog xattr names also refer to +the correct bricks. + +SOLUTION: + +The solution is explained as a sequence of steps: + +- The client translator names will still use the existing + nomenclature, except that now they are monotonically increasing + (<volname>-client-0,1,2...) and are not dependent on the brick + position.Let us call these names as brick-IDs. These brick IDs are + also written to the brickinfo files (in + /var/lib/glusterd/vols/<volname>/bricks/\*) by glusterd during + volume creation. When the volfile is generated, these brick + brick-IDs form the client xlator names. + +- Whenever a brick operation is performed, the names are retained for + existing bricks irrespective of their position in the graph. New + bricks get the monotonically increasing brick-ID while names for + existing bricks are obtained from the brickinfo file. + +- Note that this approach does not affect client versions (old/new) in + anyway because the clients just use the volume config provided by + the volfile server. + +- For retaining backward compatibility, We need to check two items: + (a)Under what condition is remove brick allowed; (b)When is brick-ID + written to brickinfo file. + +For the above 2 items, the implementation rules will be thus: + +i) This feature is implemented in 3.6. Lets say its op-version is 5. + +ii) We need to implement a check to allow remove-brick only if cluster +opversion is \>=5 + +iii) The brick-ID is written to brickinfo when the nodes are upgraded +(during glusterd restore) and when a peer is probed (i.e. during volfile +import). + +Benefit to GlusterFS +-------------------- + +Even if there are pending self-heals, remove-brick operations can be +carried out safely without fear of incorrect heals which may cause data +loss. + +Scope +----- + +### Nature of proposed change + +Modifications will be made in restore, volfile import and volgen +portions of glusterd. + +### Implications on manageability + +N/A + +### Implications on presentation layer + +N/A + +### Implications on persistence layer + +N/A + +### Implications on 'GlusterFS' backend + +N/A + +### Modification to GlusterFS metadata + +N/A + +### Implications on 'glusterd' + +As described earlier. + +How To Test +----------- + +remove-brick operation needs to be carried out on rep/dist-rep volumes +having pending self-heals and it must be verified that no data is lost. +snapshots of the volumes must also be able to access files without any +issues. + +User Experience +--------------- + +N/A + +Dependencies +------------ + +None. 
+ +Documentation +------------- + +TBD + +Status +------ + +See 'Current status' section. + +Comments and Discussion +----------------------- + +<Follow here>
\ No newline at end of file diff --git a/Feature Planning/GlusterFS 3.6/RDMA Improvements.md b/Feature Planning/GlusterFS 3.6/RDMA Improvements.md new file mode 100644 index 0000000..1e71729 --- /dev/null +++ b/Feature Planning/GlusterFS 3.6/RDMA Improvements.md @@ -0,0 +1,101 @@ +Feature +------- + +**RDMA Improvements** + +Summary +------- + +This proposal is regarding getting RDMA volumes out of tech preview. + +Owners +------ + +Raghavendra Gowdappa <rgowdapp@redhat.com> +Vijay Bellur <vbellur@redhat.com> + +Current status +-------------- + +Work in progress + +Detailed Description +-------------------- + +Fix known & unknown issues in volumes with transport type rdma so that +RDMA can be used as the interconnect between client - servers & between +servers. + +- Performance Issues - Had found that performance was bad when + compared with plain ib-verbs send/recv v/s RDMA reads and writes. +- Co-existence with tcp - There seemed to be some memory corruptions + when we had both tcp and rdma transports. +- librdmacm for connection management - with this there is a + requirement that the brick has to listen on an IPoIB address and + this affects our current ability where a peer has the flexibility to + connect to either ethernet or infiniband address. Another related + feature Better peer identification will help us to resolve this + issue. +- More testing required + +Benefit to GlusterFS +-------------------- + +Scope +----- + +### Nature of proposed change + +Bug-fixes to transport/rdma + +### Implications on manageability + +Remove the warning about creation of rdma volumes in CLI. + +### Implications on presentation layer + +TBD + +### Implications on persistence layer + +No impact + +### Implications on 'GlusterFS' backend + +No impact + +### Modification to GlusterFS metadata + +No impact + +### Implications on 'glusterd' + +No impact + +How To Test +----------- + +TBD + +User Experience +--------------- + +TBD + +Dependencies +------------ + +Better Peer identification + +Documentation +------------- + +TBD + +Status +------ + +In development + +Comments and Discussion +----------------------- diff --git a/Feature Planning/GlusterFS 3.6/Server-side Barrier feature.md b/Feature Planning/GlusterFS 3.6/Server-side Barrier feature.md new file mode 100644 index 0000000..c13e25a --- /dev/null +++ b/Feature Planning/GlusterFS 3.6/Server-side Barrier feature.md @@ -0,0 +1,213 @@ +Server-side barrier feature +=========================== + +- Author(s): Varun Shastry, Krishnan Parthasarathi +- Date: Jan 28 2014 +- Bugzilla: <https://bugzilla.redhat.com/1060002> +- Document ID: BZ1060002 +- Document Version: 1 +- Obsoletes: NA + +Abstract +-------- + +Snapshot feature needs a mechanism in GlusterFS, where acknowledgements +to file operations (FOPs) are held back until the snapshot of all the +bricks of the volume are taken. + +The barrier feature would stop holding back FOPs after a configurable +'barrier-timeout' seconds. This is to prevent an accidental lockdown of +the volume. + +This mechanism should have the following properties: + +- Should keep 'barriering' transparent to the applications. +- Should not acknowledge FOPs that fall into the barrier class. A FOP + that when acknowledged to the application, could lead to the + snapshot of the volume become inconsistent, is a barrier class FOP. + +With the below example of 'unlink' how a FOP is classified as barrier +class is explained. + +For the following sequence of events, assuming unlink FOP was not +barriered. 
Assume a replicate volume with two bricks, namely b1 and b2. + + b1 b2 + time ---------------------------------- + | t1 snapshot + | t2 unlink /a unlink /a + \/ t3 mkdir /a mkdir /a + t4 snapshot + +The result of the sequence of events will store /a as a file in snapshot +b1 while /a is stored as directory in snapshot b2. This leads to split +brain problem of the AFR and in other way inconsistency of the volume. + +Copyright +--------- + +Copyright (c) 2014 Red Hat, Inc. <http://www.redhat.com> + +This feature is licensed under your choice of the GNU Lesser General +Public License, version 3 or any later version (LGPLv3 or later), or the +GNU General Public License, version 2 (GPLv2), in all cases as published +by the Free Software Foundation. + +Introduction +------------ + +The volume snapshot feature snapshots a volume by snapshotting +individual bricks, that are available, using the lvm-snapshot +technology. As part of using lvm-snapshot, the design requires bricks to +be free from few set of modifications (fops in Barrier Class) to avoid +the inconsistency. This is where the server-side barriering of FOPs +comes into picture. + +Terminology +----------- + +- barrier(ing) - To make barrier fops temporarily inactive or + disabled. +- available - A brick is said to be available when the corresponding + glusterfsd process is running and serving file operations. +- FOP - File Operation + +High Level Design +----------------- + +### Architecture/Design Overview + +- Server-side barriering, for Snapshot, must be enabled/disabled on + the bricks of a volume in a synchronous manner. ie, any command + using this would be blocked until barriering is enabled/disabled. + The brick process would provide this mechanism via an RPC. +- Barrier translator would be placed immediately above io-threads + translator in the server/brick stack. +- Barrier translator would queue FOPs when enabled. On disable, the + translator dequeues all the FOPs, while serving new FOPs from + application. By default, barriering is disabled. +- The barrier feature would stop blocking the acknowledgements of FOPs + after a configurable 'barrier-timeout' seconds. This is to prevent + an accidental lockdown of the volume. +- Operations those fall into barrier class are listed below. Any other + fop not listed below does not fall into this category and hence are + not barriered. + - rmdir + - unlink + - rename + - [f]truncate + - fsync + - write with O\_SYNC flag + - [f]removexattr + +### Design Feature + +Following timeline diagram depicts message exchanges between glusterd +and brick during enable and disable of barriering. This diagram assumes +that enable operation is synchronous and disable is asynchronous. See +below for alternatives. + + glusterd (snapshot) barrier @ brick + ------------------ --------------- + t1 | | + t2 | continue to pass through + | all the fops + t3 send 'enable' | + t4 | * starts barriering the fops + | * send back the ack + t5 receive the ack | + | | + t6 | <take snap> | + | . | + | . | + | . | + | </take snap> | + | | + t7 send disable | + (does not wait for the ack) | + t8 | release all the holded fops + | and no more barriering + | | + t9 | continue in PASS_THROUGH mode + +Glusterd would send an RPC (described in API section), to enable +barriering on a brick, by setting option feature.barrier to 'ON' in +barrier translator. This would be performed on all the bricks present in +that node, belonging to the set of volumes that are being snapshotted. 
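To make the enable/queue/flush semantics concrete, here is a minimal, self-contained model (illustration only; the real translator queues call stubs as described under 'Low-level details' below, and none of the names here are actual GlusterFS symbols). While barriering is enabled, acknowledgements of barrier-class FOPs are parked on a queue; disabling releases them in arrival order:

    #include <stdio.h>
    #include <stdlib.h>

    /* Toy model: a queued "FOP acknowledgement" is just an id plus a callback. */
    typedef void (*ack_cb_t)(int fop_id);

    struct held_ack {
        int              fop_id;
        ack_cb_t         cb;
        struct held_ack *next;
    };

    static int              barrier_enabled;   /* 0 = PASS_THROUGH, 1 = QUEUEING */
    static struct held_ack *queue_head, *queue_tail;

    static void send_ack(int fop_id)
    {
        printf("ack sent for fop %d\n", fop_id);
    }

    /* Completion (callback) path of a barrier-class FOP: the operation itself
     * has already been applied; only the acknowledgement may be held back. */
    static void fop_complete(int fop_id)
    {
        if (!barrier_enabled) {                 /* PASS_THROUGH */
            send_ack(fop_id);
            return;
        }
        struct held_ack *a = malloc(sizeof(*a)); /* QUEUEING */
        a->fop_id = fop_id;
        a->cb     = send_ack;
        a->next   = NULL;
        if (queue_tail)
            queue_tail->next = a;
        else
            queue_head = a;
        queue_tail = a;
    }

    /* Disable: flush everything that was held back, oldest first. */
    static void barrier_disable(void)
    {
        barrier_enabled = 0;
        while (queue_head) {
            struct held_ack *a = queue_head;
            queue_head = a->next;
            a->cb(a->fop_id);
            free(a);
        }
        queue_tail = NULL;
    }

    int main(void)
    {
        fop_complete(1);       /* barrier off: acknowledged immediately        */
        barrier_enabled = 1;   /* snapshot starts: hold acks from here on      */
        fop_complete(2);
        fop_complete(3);
        barrier_disable();     /* snapshot done: fops 2 and 3 acknowledged now */
        return 0;
    }

The point the model captures is that the FOP itself is not blocked; only its acknowledgement to the application is held back, which is what keeps barriering transparent to applications while the snapshot is taken.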
+ +Disable of barriering can happen in synchronous or asynchronous mode. +The choice is left to the consumer of this feature. + +On disable, all FOPs queued up will be dequeued. Simultaneously the +subsequent barrier request(s) will be served. + +Barrier option enable/disable is persisted into the volfile. This is to +make the feature available for consumers in asynchronous mode, like any +other (configurable) feature. + +Barrier feature also has timeout option based on which dequeuing would +get triggered if the consumer fails to send the disable request. + +Low-level details of Barrier translator working +----------------------------------------------- + +The translator operates in one of two states, namely QUEUEING and +PASS\_THROUGH. + +When barriering is enabled, the translator moves to QUEUEING state. It +queues outgoing FOPs thereafter in the call back path. + +When barriering is disabled, the translator moves to PASS\_THROUGH state +and does not queue when it is in PASS\_THROUGH state. Additionally, the +queued FOPs are 'released', when the translator moves from QUEUEING to +PASS\_THROUGH state. + +It has a translator global queue (doubly linked lists, see +libglusterfs/src/list.h) where the FOPs are queued in the form of a call +stub (see libglusterfs/src/call-stub.[ch]) + +When the FOP has succeeded, but barrier translator failed to queue in +the call back, the barrier translator would disable barriering and +release any queued FOPs, barrier would inform the consumer about this +failure on succesive disable request. + +Interfaces +---------- + +### Application Programming Interface + +- An RPC procedure is added at the brick side, which allows any client + [sic] to set the feature.barrier option of the barrier translator + with a given value. +- Glusterd would be using this to set server-side-barriering on, on a + brick. + +Performance Considerations +-------------------------- + +- The barriering of FOPs may be perceived as a performance degrade by + the applications. Since this is a hard requirement for snapshot, the + onus is on the snapshot feature to reduce the window for which + barriering is enabled. + +### Scalability + +- In glusterd, each brick operation is executed in a serial manner. + So, the latency of enabling barriering is a function of the no. of + bricks present on the node of the set of volumes being snapshotted. + This is not a scalability limitation of the mechanism of enabling + barriering but a limitation in the brick operations mechanism in + glusterd. + +Migration Considerations +------------------------ + +The barrier translator is introduced with op-version 4. It is a +server-side translator and does not impact older clients even when this +feature is enabled. + +Installation and deployment +--------------------------- + +- Barrier xlator is not packaged with glusterfs-server rpm. With this + changes, this has to be added to the rpm. diff --git a/Feature Planning/GlusterFS 3.6/Thousand Node Gluster.md b/Feature Planning/GlusterFS 3.6/Thousand Node Gluster.md new file mode 100644 index 0000000..54c3e13 --- /dev/null +++ b/Feature Planning/GlusterFS 3.6/Thousand Node Gluster.md @@ -0,0 +1,150 @@ +Goal +---- + +Thousand-node scalability for glusterd + +Summary +======= + +This "feature" is really a set of infrastructure changes that will +enable glusterd to manage a thousand servers gracefully. 
+ +Owners +====== + +Krishnan Parthasarathi <kparthas@redhat.com> +Jeff Darcy <jdarcy@redhat.com> + +Current status +============== + +Proposed, awaiting summit for approval. + +Related Feature Requests and Bugs +================================= + +N/A + +Detailed Description +==================== + +There are three major areas of change included in this proposal. + +- Replace the current order-n-squared heartbeat/membership protocol + with a much smaller "monitor cluster" based on Paxos or + [Raft](https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf), + to which I/O servers check in. + +- Use the monitor cluster to designate specific functions or roles - + e.g. self-heal, rebalance, leadership in an NSR subvolume - to I/O + servers in a coordinated and globally optimal fashion. + +- Replace the current system of replicating configuration data on all + servers (providing practically no guarantee of consistency if one is + absent during a configuration change) with storage of configuration + data in the monitor cluster. + +Benefit to GlusterFS +==================== + +Scaling of our management plane to 1000+ nodes, enabling competition +with other projects such as HDFS or Ceph which already have or claim +such scalability. + +Scope +===== + +Nature of proposed change +------------------------- + +Functionality very similar to what we need in the monitor cluster +already exists in some of the Raft implementations, notably +[etcd](https://github.com/coreos/etcd). Such a component could provide +the services described above to a modified glusterd running on each +server. The changes to glusterd would mostly consist of removing the +current heartbeat and config-storage code, replacing it with calls into +(and callbacks from) the monitor cluster. + +Implications on manageability +----------------------------- + +Enabling/starting monitor daemons on those few nodes that have them must +be done separately from starting glusterd. Since the changes mostly are +to how each glusterd interacts with others and with its own local +storage back end, interactions with the CLI or with glusterfsd need not +change. + +Implications on presentation layer +---------------------------------- + +N/A + +Implications on persistence layer +--------------------------------- + +N/A + +Implications on 'GlusterFS' backend +----------------------------------- + +N/A + +Modification to GlusterFS metadata +---------------------------------- + +The monitor daemons need space for their data, much like that currently +maintained in /var/lib/glusterd currently. + +Implications on 'glusterd' +-------------------------- + +Drastic. See sections above. + +How To Test +=========== + +A new set of tests for the monitor-cluster functionality will need to be +developed, perhaps derived from those for the external project if we +adopt one. Most tests related to our multi-node testing facilities +(cluster.rc) will also need to change. Tests which merely invoke the CLI +should require little if any change. + +User Experience +=============== + +Minimal change. + +Dependencies +============ + +A mature/stable enough implementation of Raft or a similar protocol. +Failing that, we'd need to develop our own service along similar lines. + +Documentation +============= + +TBD. + +Status +====== + +In design. + +The choice of technology and approaches are being discussed on the +-devel ML. 
+ +- "Proposal for Glusterd-2.0" - + [1](http://www.gluster.org/pipermail/gluster-users/2014-September/018639.html) + +: Though the discussion has become passive, the question is whether we + choose to implement consensus algorithm inside our project or depend + on external projects that provide similar service. + +- "Management volume proposal" - + [2](http://www.gluster.org/pipermail/gluster-devel/2014-November/042944.html) + +: This has limitations due to the circular dependency making it + infeasible. + +Comments and Discussion +======================= diff --git a/Feature Planning/GlusterFS 3.6/afrv2.md b/Feature Planning/GlusterFS 3.6/afrv2.md new file mode 100644 index 0000000..a1767c7 --- /dev/null +++ b/Feature Planning/GlusterFS 3.6/afrv2.md @@ -0,0 +1,244 @@ +Feature +------- + +This feature is major code re-factor of current afr along with a key +design change in the way changelog extended attributes are stored in +afr. + +Summary +------- + +This feature introduces design change in afr which separates ongoing +transaction, pending operation count for files/directories. + +Owners +------ + +Anand Avati +Pranith Kumar Karampuri + +Current status +-------------- + +The feature is in final stages of review at +<http://review.gluster.org/6010> + +Detailed Description +-------------------- + +How AFR works: + +In order to keep track of what copies of the file are modified and up to +date, and what copies require to be healed, AFR keeps state information +in the extended attributes of the file called changelog extended +attributes. These extended attributes stores that copy's view of how up +to date the other copies are. The extended attributes are modified in a +transaction which consists of 5 phases - LOCK, PRE-OP, OP, POST-OP and +UNLOCK. In the PRE-OP phase the extended attributes are updated to store +the intent of modification (in the OP phase.) + +In the POST-OP phase, depending on how many servers crashed mid way and +on how many servers the OP was applied successfully, a corresponding +change is made in the extended attributes (of the surviving copies) to +represent the staleness of the copies which missed the OP phase. + +Further, when those lagging servers become available, healing decisions +are taken based on these extended attribute values. + +Today, a PRE-OP increments the pending counters of all elements in the +array (where each element represents a server, and therefore one of the +members of the array represents that server itself.) The POST-OP +decrements those counters which represent servers where the operation +was successful. The update is performed on all the servers which have +made it till the POST-OP phase. The decision of whether a server crashed +in the middle of a transaction or whether the server lived through the +transaction and witnessed the other server crash, is inferred by +inspecting the extended attributes of all servers together. Because +there is no distinction between these counters as to how many of those +increments represent "in transit" operations and how many of those are +retained without decrement to represent "pending counters", there is +value in adding clarity to the system by separating the two. + +The change is to now have only one dirty flag on each server per file. +We also make the PRE-OP increment only that dirty flag rather than all +the elements of the pending array. 
The dirty flag must be set before performing the operation; based on which servers the operation failed on, the pending counters representing those failed servers are set on the remaining ones in the POST-OP phase. The dirty counter is also cleared at the end of the POST-OP. This means that for successful operations only the dirty flag (one integer) is incremented and decremented per server per file. However, if a pending counter is set because of an operation failure, that counter is an unambiguous "finger pointing" at the other server; if a file has both a pending counter and a dirty flag, the dirty flag does not undermine the "strength" of the pending counter. This change completely removes today's ambiguity over whether a pending counter represents a still ongoing operation (or one that crashed in transit) or a surely missed operation.

Benefit to GlusterFS
--------------------

It increases the clarity of whether a file has any ongoing transactions and any pending self-heals. The code also becomes more maintainable.

Scope
-----

### Nature of proposed change

- Remove client-side self-healing completely (opendir, openfd, lookup)
- Re-work readdir-failover to work reliably in case of NFS
- Remove unused/dead lock recovery code
- Consistently use xdata in both calls and callbacks in all FOPs
- Per-inode event generation, used to force inode ctx refresh
- Implement dirty flag support (in place of pending counts)
- Eliminate inode ctx structure, use read subvol bits + event_generation
- Implement inode ctx refreshing based on event generation
- Provide backward compatibility in transactions
- Remove unused variables and functions
- Make code more consistent in style and pattern
- Regularize and clean up inode-write transaction code
- Regularize and clean up dir-write transaction code
- Regularize and clean up common FOPs
- Reorganize transaction framework code
- Skip setting xattrs in pending dict if nothing is pending
- Re-write self-healing code using syncops
- Re-write simpler self-heal-daemon

### Implications on manageability

None

### Implications on presentation layer

None

### Implications on persistence layer

None

### Implications on 'GlusterFS' backend

None

### Modification to GlusterFS metadata

This changes the way pending counts vs. ongoing transactions are represented in the changelog extended attributes.

### Implications on 'glusterd'

None

How To Test
-----------

The same test cases as for AFRv1 hold.
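To make the new changelog scheme described in the Detailed Description concrete, the standalone sketch below walks a PRE-OP/OP/POST-OP cycle over one dirty flag and per-server pending counters. The structure and function names are illustrative only; in AFR these values live in the changelog extended attributes, and the real implementation is the patch under review above.

    /* Illustrative sketch of the AFRv2 changelog idea, not the real code.
     * Each replica keeps, per file: one dirty flag plus one pending
     * counter per server in the replica set. */
    #include <stdbool.h>
    #include <stdio.h>

    #define REPLICAS 2

    typedef struct {
        int dirty;             /* incremented in PRE-OP, decremented in POST-OP */
        int pending[REPLICAS]; /* blame counters for servers that missed the OP */
    } changelog_t;

    /* PRE-OP: record that a modification is about to happen. */
    static void pre_op(changelog_t *c) {
        c->dirty++;
    }

    /* POST-OP: blame the servers where the OP failed, then clear dirty. */
    static void post_op(changelog_t *c, const bool op_succeeded[REPLICAS]) {
        for (int i = 0; i < REPLICAS; i++)
            if (!op_succeeded[i])
                c->pending[i]++;   /* unambiguous: server i missed this OP */
        c->dirty--;
    }

    int main(void) {
        /* View of the changelog kept on server 0 for one file. */
        changelog_t on_s0 = {0};

        /* Successful write: only the dirty flag moves up and down. */
        pre_op(&on_s0);
        post_op(&on_s0, (bool[REPLICAS]){ true, true });

        /* Write that failed on server 1: server 0 now blames server 1. */
        pre_op(&on_s0);
        post_op(&on_s0, (bool[REPLICAS]){ true, false });

        printf("dirty=%d pending[s0]=%d pending[s1]=%d\n",
               on_s0.dirty, on_s0.pending[0], on_s0.pending[1]);
        /* Expected: dirty=0 pending[s0]=0 pending[s1]=1 -> heal s1 from s0. */
        return 0;
    }

A non-zero pending counter with a zero dirty flag is a definite missed operation, which is exactly the ambiguity the old all-counters-incremented scheme could not resolve.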
+ +User Experience +--------------- + +None + +Dependencies +------------ + +None + +Documentation +------------- + +--- + +Status +------ + +The feature is in final stages of review at +<http://review.gluster.org/6010> + +Comments and Discussion +----------------------- + +--- + +Summary +------- + +<Brief Description of the Feature> + +Owners +------ + +<Feature Owners - Ideally includes you :-)> + +Current status +-------------- + +<Provide details on related existing features, if any and why this new feature is needed> + +Detailed Description +-------------------- + +<Detailed Feature Description> + +Benefit to GlusterFS +-------------------- + +<Describe Value additions to GlusterFS> + +Scope +----- + +### Nature of proposed change + +<modification to existing code, new translators ...> + +### Implications on manageability + +<Glusterd, GlusterCLI, Web Console, REST API> + +### Implications on presentation layer + +<NFS/SAMBA/UFO/FUSE/libglusterfsclient Integration> + +### Implications on persistence layer + +<LVM, XFS, RHEL ...> + +### Implications on 'GlusterFS' backend + +<brick's data format, layout changes> + +### Modification to GlusterFS metadata + +<extended attributes used, internal hidden files to keep the metadata...> + +### Implications on 'glusterd' + +<persistent store, configuration changes, brick-op...> + +How To Test +----------- + +<Description on Testing the feature> + +User Experience +--------------- + +<Changes in CLI, effect on User experience...> + +Dependencies +------------ + +<Dependencies, if any> + +Documentation +------------- + +<Documentation for the feature> + +Status +------ + +<Status of development - Design Ready, In development, Completed> + +Comments and Discussion +----------------------- + +<Follow here> diff --git a/Feature Planning/GlusterFS 3.6/better-ssl.md b/Feature Planning/GlusterFS 3.6/better-ssl.md new file mode 100644 index 0000000..44136d5 --- /dev/null +++ b/Feature Planning/GlusterFS 3.6/better-ssl.md @@ -0,0 +1,137 @@ +Feature +======= + +Better SSL Support + +1 Summary +========= + +Our SSL support is currently incomplete in several areas. This "feature" +covers several enhancements (see Detailed Description below) to close +gaps and make it more user-friendly. + +2 Owners +======== + +Jeff Darcy <jdarcy@redhat.com> + +3 Current status +================ + +Some patches already submitted. + +4 Detailed Description +====================== + +These are the items necessary to make our SSL support more of a useful +differentiating feature vs. other projects. + +- Enable SSL for the management plane (glusterd). There are currently + several bugs and UI issues precluding this. + +- Allow SSL identities to be used for authorization as well as + authentication (and encryption). At a minimum this would apply to + the I/O path, restricting specific volumes to specific + SSL-identified principals. It might also apply to the management + path, restricting certain actions (and/or actions on certain + volumes) to certain principals. Ultimately this could be the basis + for full role-based access control, but that's not in scope + currently. + +- Provide more options, e.g. for cipher suites or certificate-signing + +- Fix bugs related to increased concurrency levels from the + multi-threaded transport. + +5 Benefit to GlusterFS +====================== + +Sufficient security to support deployment in environments where security +is a non-negotiable requirement (e.g. government). 
Sufficient usability +to support deployment by anyone who merely desires additional security. +Improved performance in some cases, due to the multi-threaded transport. + +6 Scope +======= + +6.1. Nature of proposed change +------------------------------ + +Most of the proposed changes do not actually involve the SSL transport +itself, but are in surrounding components instead. The exception is the +addition of options, which should be pretty simple. However, bugs +related to increased concurrency levels could show up anywhere, most +likely in our more complex translators (e.g. DHT or AFR), and will need +to be fixed on a case-by-case basis. + +6.2. Implications on manageability +---------------------------------- + +Additional configuration will be necessary to enable SSL for glusterd. +Additional commands will also be needed to manage certificates and keys; +the [HekaFS +documentation](https://git.fedorahosted.org/cgit/CloudFS.git/tree/doc) +can serve as an example of what's needed. + +6.3. Implications on presentation layer +--------------------------------------- + +N/A + +6.4. Implications on persistence layer +-------------------------------------- + +N/A + +6.5. Implications on 'GlusterFS' backend +---------------------------------------- + +N/A + +6.6. Modification to GlusterFS metadata +--------------------------------------- + +N/A + +6.7. Implications on 'glusterd' +------------------------------- + +Significant changes to how glusterd calls the transport layer (and +expects to be called in return) will be necessary to fix bugs and to +enable SSL on its connections. + +7 How To Test +============= + +New tests will be needed for each major change in the detailed +description. Also, to improve test coverage and smoke out all of the +concurrency bugs, it might be desirable to change the test framework to +allow running in a mode where SSL is enabled for all tests. + +8 User Experience +================= + +Correspondent to "implications on manageability" section above. + +9 Dependencies +============== + +Currently we use OpenSSL, so its idiosyncrasies guide implementation +choices and timelines. Sometimes it even affects the user experience, +e.g. in terms of what options exist for cipher suites or certificate +depth. It's possible that it will prove advantageous to switch to +another SSL/TLS package with a better interface, probably PolarSSL +(which often responds to new threats more quickly than OpenSSL). + +10 Documentation +================ + +TBD, likely extensive (see "User Experience" section). + +11 Status +========= + +Awaiting approval. + +12 Comments and Discussion +========================== diff --git a/Feature Planning/GlusterFS 3.6/disperse.md b/Feature Planning/GlusterFS 3.6/disperse.md new file mode 100644 index 0000000..e2bad37 --- /dev/null +++ b/Feature Planning/GlusterFS 3.6/disperse.md @@ -0,0 +1,142 @@ +Feature +======= + +Dispersed volume translator + +Summary +======= + +The disperse translator is a new type of volume for GlusterFS that can +be used to offer a configurable level of fault tolerance while +optimizing the disk space waste. It can be seen as a RAID5-like volume. + +Owners +====== + +Xavier Hernandez <xhernandez@datalab.es> + +Current status +============== + +A working version is included in GlusterFS 3.6 + +Detailed Description +==================== + +The disperse translator is based on erasure codes to allow the recovery +of the data stored on one or more bricks in case of failure. 
The number of bricks that can fail without losing data is configurable.

Each brick stores only a portion of each block of data. Some of these portions are called parity or redundancy blocks. They are computed using a mathematical transformation so that they can be used to recover the content of the portion stored on another brick.

Each volume is composed of a set of N bricks (as many as you want), and R of them are used to store the redundancy information. In this configuration, if each brick has capacity C, the total space available on the volume will be (N - R) \* C.

A versioning system is used to detect inconsistencies and initiate a self-heal if appropriate.

All these operations are made on the fly, transparently to the user.

Benefit to GlusterFS
====================

It can be used to create volumes with a configurable level of redundancy (like replicate), while optimizing disk usage.

Scope
=====

Nature of proposed change
-------------------------

The dispersed volume is implemented by a client-side translator that is responsible for encoding/decoding the brick contents.

Implications on manageability
-----------------------------

The new type of volume is configured like any other one. However, the healing operations are quite different and it may be necessary to handle them separately.

Implications on presentation layer
----------------------------------

N/A

Implications on persistence layer
---------------------------------

N/A

Implications on 'GlusterFS' backend
-----------------------------------

N/A

Modification to GlusterFS metadata
----------------------------------

Three new extended attributes are created to manage a dispersed file:

- trusted.ec.config: Contains information about the parameters used to
  encode the file.
- trusted.ec.version: Tracks the number of changes made to the file.
- trusted.ec.size: Tracks the real size of the file.

Implications on 'glusterd'
--------------------------

glusterd and the CLI have been modified to add the functionality needed to create and manage dispersed volumes.

How To Test
===========

There is a new glusterd syntax to create dispersed volumes:

    gluster volume create <volname> [disperse [<count>]] [redundancy <count>] <bricks>

Both 'disperse' and 'redundancy' are optional, but at least one of them must be present to create a dispersed volume. The <count> of 'disperse' is also optional: if not specified, the number of bricks given on the command line is taken as the <count> value. To create a distributed-disperse volume, it is necessary to specify 'disperse' with a <count> value smaller than the total number of bricks.

When 'redundancy' is not specified, its default value is computed so that it generates an optimal configuration. A configuration is optimal if *number of bricks - redundancy* is a power of 2. If such a value exists and is greater than one, a warning is shown asking to validate the number. If it doesn't exist, 1 is taken and another warning is shown.

Once created, the dispersed volume can be started, mounted and used like any other volume.

User Experience
===============

Almost the same. Only a new volume type is added.

Dependencies
============

None

Documentation
=============

Not available yet.

Status
======

First implementation ready.
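To make the sizing and default-redundancy rules above concrete, the standalone sketch below computes usable capacity as (N - R) \* C and picks a default redundancy so that *bricks - redundancy* is a power of two greater than one, falling back to 1 when no such value exists. This is one plausible reading of the rule as stated in this document, not the actual glusterd code.

    /* Standalone illustration of the dispersed-volume sizing rules. */
    #include <stdio.h>

    static int is_power_of_two(int x) {
        return x > 1 && (x & (x - 1)) == 0;
    }

    /* Pick a default redundancy: smallest R >= 1 such that
     * bricks - R is a power of two greater than one; otherwise 1. */
    static int default_redundancy(int bricks) {
        for (int r = 1; r < bricks; r++)
            if (is_power_of_two(bricks - r))
                return r;      /* "optimal" configuration */
        return 1;              /* no such value: fall back to 1 */
    }

    int main(void) {
        int bricks = 6;                /* e.g. created with 'disperse 6' */
        long long brick_cap_gb = 1000; /* capacity C of each brick */

        int r = default_redundancy(bricks);        /* 6 - 2 = 4 = 2^2 */
        long long usable = (bricks - r) * brick_cap_gb;

        printf("bricks=%d redundancy=%d usable=%lld GB\n", bricks, r, usable);
        /* For 6 bricks of 1000 GB each: redundancy=2, usable=4000 GB. */
        return 0;
    }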
Comments and Discussion
=======================
diff --git a/Feature Planning/GlusterFS 3.6/glusterd volume locks.md b/Feature Planning/GlusterFS 3.6/glusterd volume locks.md
new file mode 100644
index 0000000..a8f8ebd
--- /dev/null
+++ b/Feature Planning/GlusterFS 3.6/glusterd volume locks.md
@@ -0,0 +1,48 @@
As of today, most gluster commands take a cluster-wide lock before performing their respective operations. As a result, two gluster commands that have no interdependency with each other cannot be executed simultaneously. To remove this restriction we propose to replace the cluster-wide lock with a volume-specific lock, so that two operations on two different volumes can be performed simultaneously.

1. We classify all gluster operations into three different classes: create volume, delete volume, and volume-specific operations.

2. At any given point in time, two simultaneous operations (create, delete or volume-specific) should be allowed, as long as the two operations are not happening on the same volume.

3. If two simultaneous operations are performed on the same volume, the operation which manages to acquire the volume lock will succeed, while the other will fail. Both might also fail to acquire the volume lock on the cluster, in which case both operations will fail.

In order to achieve this, we propose a locking engine which receives lock requests from these three types of operations. Each such request for a particular volume contests for the same volume lock (based on the volume name and the node-uuid). For example, a delete volume command for volume1 and a volume status command for volume1 will contest for the same lock (comprising the volume name and the uuid of the node winning the lock), in which case one of these commands will succeed and the other will fail to acquire the lock.

If, on the other hand, two operations are simultaneously performed on different volumes, they should proceed smoothly, as both operations request two different locks from the locking engine and will succeed in acquiring them in parallel.

We maintain a global list of volume locks (using a dict for the list) where the key is the volume name and the value stores the uuid of the originator glusterd. These locks are held and released per volume transaction.

In order to achieve multiple gluster operations occurring at the same time, we also separate opinfos in the op-state-machine as part of this patch. To do so, we generate a unique transaction-id (uuid) per gluster transaction. An opinfo is then associated with this transaction id, which is used throughout the transaction. We maintain a run-time global list (using a dict) of transaction-ids and their respective opinfos to achieve this.

Gluster devel Mailing Thread:
<http://lists.gnu.org/archive/html/gluster-devel/2013-09/msg00042.html>
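As an illustration of the locking engine described above, the standalone sketch below keeps a table keyed by volume name that records the uuid of the glusterd holding the lock: acquiring fails when another owner already holds the lock for that volume, while operations on different volumes proceed in parallel. The table layout and function names are illustrative; glusterd actually keeps this state in a dict, tied to a per-transaction id.

    /* Illustrative volume-lock engine: one lock per volume name,
     * owned by the uuid of the originator glusterd. Not glusterd code. */
    #include <stdio.h>
    #include <string.h>

    #define MAX_LOCKS 64

    typedef struct {
        char volname[64];
        char owner_uuid[64];
        int  in_use;
    } vol_lock_t;

    static vol_lock_t locks[MAX_LOCKS];

    /* Returns 0 on success, -1 if the volume is already locked by another owner
     * or the table is full. */
    static int volume_lock(const char *volname, const char *uuid) {
        int free_slot = -1;
        for (int i = 0; i < MAX_LOCKS; i++) {
            if (locks[i].in_use && strcmp(locks[i].volname, volname) == 0)
                return -1;              /* contention: the earlier op wins */
            if (!locks[i].in_use && free_slot < 0)
                free_slot = i;
        }
        if (free_slot < 0)
            return -1;
        locks[free_slot].in_use = 1;
        snprintf(locks[free_slot].volname, sizeof(locks[free_slot].volname),
                 "%s", volname);
        snprintf(locks[free_slot].owner_uuid, sizeof(locks[free_slot].owner_uuid),
                 "%s", uuid);
        return 0;
    }

    /* Release only if the caller is the owner recorded for that volume. */
    static void volume_unlock(const char *volname, const char *uuid) {
        for (int i = 0; i < MAX_LOCKS; i++)
            if (locks[i].in_use &&
                strcmp(locks[i].volname, volname) == 0 &&
                strcmp(locks[i].owner_uuid, uuid) == 0)
                locks[i].in_use = 0;
    }

    int main(void) {
        /* Two operations on different volumes proceed in parallel ... */
        printf("status vol1: %d\n", volume_lock("vol1", "uuid-node-a")); /* 0 */
        printf("delete vol2: %d\n", volume_lock("vol2", "uuid-node-b")); /* 0 */
        /* ... but a second operation on vol1 fails to acquire the lock. */
        printf("delete vol1: %d\n", volume_lock("vol1", "uuid-node-b")); /* -1 */
        volume_unlock("vol1", "uuid-node-a");
        volume_unlock("vol2", "uuid-node-b");
        return 0;
    }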
\ No newline at end of file diff --git a/Feature Planning/GlusterFS 3.6/heterogeneous-bricks.md b/Feature Planning/GlusterFS 3.6/heterogeneous-bricks.md new file mode 100644 index 0000000..a769b56 --- /dev/null +++ b/Feature Planning/GlusterFS 3.6/heterogeneous-bricks.md @@ -0,0 +1,136 @@ +Feature +------- + +Support heterogeneous (different size) bricks. + +Summary +------- + +DHT is currently very naive about brick sizes, assigning equal "weight" +to each brick/subvolume for purposes of placing new files even though +the bricks might actually have different sizes. It would be better if +DHT assigned greater weight (i.e. would create more files) on bricks +with more total or free space. + +This proposal came out of a [mailing-list +discussion](http://www.gluster.org/pipermail/gluster-users/2014-January/038638.html) + +Owners +------ + +- Raghavendra G (rgowdapp@redhat.com) + +Current status +-------------- + +There is a +[script](https://github.com/gluster/glusterfs/blob/master/extras/rebalance.py) +representing much of the necessary logic, using DHT's "custom layout" +feature and other tricks. + +The most basic kind of heterogeneous-brick-friendly rebalancing has been +implemented. [patch](http://review.gluster.org/#/c/8020/) + +Detailed Description +-------------------- + +There should be (at least) three options: + +- Assign subvolume weights based on **total** space. + +- Assign subvolume weights based on **free** space. + +- Assign all (or nearly all) weight to specific subvolumes. + +The last option is useful for those who expand a volume by adding bricks +and intend to let the system "rebalance automatically" by directing new +files to the new bricks, without migrating any old data. Once the +appropriate weights have been calculated, the rebalance command should +apply the results recursively to all directories within the volume +(except those with custom layouts) and DHT should assign layouts to new +directories in accordance with the same weights. + +Benefit to GlusterFS +-------------------- + +Better support for adding new bricks that are a different size than the +old, which is common as disk capacity tends to improve with each +generation (as noted in the ML discussion). + +Better support for adding capacity without an expensive (data-migrating) +rebalance operation. + +Scope +----- + +This will involve changes to all current rebalance code - CLI, glusterd, +DHT, and probably others. + +### Implications on manageability + +New CLI options. + +### Implications on presentation layer + +None. + +### Implications on persistence layer + +None. + +### Implications on 'GlusterFS' backend + +None, unless we want to add a "policy" xattr on the root inode to be +consulted when new directories are created (could also be done via +translator options). + +### Modification to GlusterFS metadata + +Same as previous. + +### Implications on 'glusterd' + +New fields in rebalance-related RPCs. + +How To Test +----------- + +For each policy: + +1. Create a volume with small bricks (ramdisk-based if need be). + +1. Fill the bricks to varying degrees. + +1. (optional) Add more empty bricks. + +1. Rebalance using the target policy. + +1. Create some dozens/hundreds of new files. + +1. Verify that the distribution of new files matches what is expected + for the given policy. + +User Experience +--------------- + +New options for the "rebalance" command. + +Dependencies +------------ + +None. + +Documentation +------------- + +TBD + +Status +------ + +Original author has abandoned this change. 
If anyone else wants to make +a \*positive\* contribution to fix a long-standing user concern, feel +free. + +Comments and Discussion +----------------------- diff --git a/Feature Planning/GlusterFS 3.6/index.md b/Feature Planning/GlusterFS 3.6/index.md new file mode 100644 index 0000000..f4d83db --- /dev/null +++ b/Feature Planning/GlusterFS 3.6/index.md @@ -0,0 +1,96 @@ +GlusterFS 3.6 Release Planning +------------------------------ + +Tentative Dates: + +4th Mar, 2014 - Feature proposal freeze + +17th Jul, 2014 - Feature freeze & Branching + +12th Sep, 2014 - Community Test Weekend \#1 + +21st Sep, 2014 - 3.6.0 Beta Release + +22nd Sep, 2014 - Community Test Week + +31st Oct, 2014 - 3.6.0 GA + +Feature proposal for GlusterFS 3.6 +---------------------------------- + +### Features in 3.6.0 + +- [Features/better-ssl](./better-ssl.md): + Various improvements to SSL support. + +- [Features/heterogeneous-bricks](./heterogeneous-bricks.md): + Support different-sized bricks. + +- [Features/disperse](./disperse.md): + Volumes based on erasure codes. + +- [Features/glusterd-volume-locks](./glusterd volume locks.md): + Volume wide locks for glusterd + +- [Features/persistent-AFR-changelog-xattributes](./Persistent AFR Changelog xattributes.md): + Persistent naming scheme for client xlator names and AFR changelog + attributes. + +- [Features/better-logging](./Better Logging.md): + Gluster logging enhancements to support message IDs per message + +- [Features/Better peer identification](./Better Peer Identification.md) + +- [Features/Gluster Volume Snapshot](./Gluster Volume Snapshot.md) + +- [Features/Gluster User Serviceable Snapshots](./Gluster User Serviceable Snapshots.md) + +- **[Features/afrv2](./afrv2.md)**: Afr refactor. + +- [Features/RDMA Improvements](./RDMA Improvements.md): + Improvements for RDMA + +- [Features/Server-side Barrier feature](./Server-side Barrier feature.md): + A supplementary feature for the + [Features/Gluster Volume Snapshot](./Gluster Volume Snapshot.md) which maintains the consistency across the snapshots. + +### Features beyond 3.6.0 + +- [Features/Smallfile Perf](../GlusterFS 3.7/Small File Performance.md): + Small-file performance enhancement. + +- [Features/data-classification](../GlusterFS 3.7/Data Classification.md): + Tiering, rack-aware placement, and more. + +- [Features/new-style-replication](./New Style Replication.md): + Log-based, chain replication. + +- [Features/thousand-node-glusterd](./Thousand Node Gluster.md): + Glusterd changes for higher scale. 
+ +- [Features/Trash](../GlusterFS 3.7/Trash.md): + Trash translator for GlusterFS + +- [Features/Object Count](../GlusterFS 3.7/Object Count.md) +- [Features/SELinux Integration](../GlusterFS 3.7/SE Linux Integration.md) +- [Features/Easy addition of custom translators](../GlusterFS 3.7/Easy addition of Custom Translators.md) +- [Features/Exports Netgroups Authentication](../GlusterFS 3.7/Exports and Netgroups Authentication.md) + [Features/outcast](../GlusterFS 3.7/Outcast.md): Outcast + +- **[Features/Policy based Split-brain Resolution](../GlusterFS 3.7/Policy based Split-brain Resolution.md)**: Policy Based +Split-brain resolution + +- [Features/rest-api](../GlusterFS 3.7/rest-api.md): + REST API for Gluster Volume and Peer Management + +- [Features/Archipelago Integration](../GlusterFS 3.7/Archipelago Integration.md): + Improvements for integration with Archipelago + +Release Criterion +----------------- + +- All new features to be documented in admin guide + +- Feature tested as part of testing days. + +- More to follow
\ No newline at end of file