From 9e9e3c5620882d2f769694996ff4d7e0cf36cc2b Mon Sep 17 00:00:00 2001 From: raghavendra talur Date: Thu, 20 Aug 2015 15:09:31 +0530 Subject: Create basic directory structure All new features specs go into in_progress directory. Once signed off, it should be moved to done directory. For now, This change moves all the Gluster 4.0 feature specs to in_progress. All other specs are under done/release-version. More cleanup required will be done incrementally. Change-Id: Id272d301ba8c434cbf7a9a966ceba05fe63b230d BUG: 1206539 Signed-off-by: Raghavendra Talur Reviewed-on: http://review.gluster.org/11969 Reviewed-by: Humble Devassy Chirammal Reviewed-by: Prashanth Pai Tested-by: Humble Devassy Chirammal --- done/GlusterFS 3.6/Better Logging.md | 348 ++++++++++++++++++++ done/GlusterFS 3.6/Better Peer Identification.md | 172 ++++++++++ .../Gluster User Serviceable Snapshots.md | 39 +++ done/GlusterFS 3.6/Gluster Volume Snapshot.md | 354 +++++++++++++++++++++ done/GlusterFS 3.6/New Style Replication.md | 230 +++++++++++++ .../Persistent AFR Changelog xattributes.md | 178 +++++++++++ done/GlusterFS 3.6/RDMA Improvements.md | 101 ++++++ done/GlusterFS 3.6/Server-side Barrier feature.md | 213 +++++++++++++ done/GlusterFS 3.6/Thousand Node Gluster.md | 150 +++++++++ done/GlusterFS 3.6/afrv2.md | 244 ++++++++++++++ done/GlusterFS 3.6/better-ssl.md | 137 ++++++++ done/GlusterFS 3.6/disperse.md | 142 +++++++++ done/GlusterFS 3.6/glusterd volume locks.md | 48 +++ done/GlusterFS 3.6/heterogeneous-bricks.md | 136 ++++++++ done/GlusterFS 3.6/index.md | 96 ++++++ 15 files changed, 2588 insertions(+) create mode 100644 done/GlusterFS 3.6/Better Logging.md create mode 100644 done/GlusterFS 3.6/Better Peer Identification.md create mode 100644 done/GlusterFS 3.6/Gluster User Serviceable Snapshots.md create mode 100644 done/GlusterFS 3.6/Gluster Volume Snapshot.md create mode 100644 done/GlusterFS 3.6/New Style Replication.md create mode 100644 done/GlusterFS 3.6/Persistent AFR Changelog xattributes.md create mode 100644 done/GlusterFS 3.6/RDMA Improvements.md create mode 100644 done/GlusterFS 3.6/Server-side Barrier feature.md create mode 100644 done/GlusterFS 3.6/Thousand Node Gluster.md create mode 100644 done/GlusterFS 3.6/afrv2.md create mode 100644 done/GlusterFS 3.6/better-ssl.md create mode 100644 done/GlusterFS 3.6/disperse.md create mode 100644 done/GlusterFS 3.6/glusterd volume locks.md create mode 100644 done/GlusterFS 3.6/heterogeneous-bricks.md create mode 100644 done/GlusterFS 3.6/index.md (limited to 'done/GlusterFS 3.6') diff --git a/done/GlusterFS 3.6/Better Logging.md b/done/GlusterFS 3.6/Better Logging.md new file mode 100644 index 0000000..6aad602 --- /dev/null +++ b/done/GlusterFS 3.6/Better Logging.md @@ -0,0 +1,348 @@ +Feature +------- + +Gluster logging enhancements to support message IDs per message + +Summary +------- + +Enhance gluster logging to provide the following features, SubFeature +--\> SF + +- SF1: Add message IDs to message + +- SF2: Standardize error num reporting across messages + +- SF3: Enable repetitive message suppression in logs + +- SF4: Log location and hierarchy standardization (in case anything is +further required here, analysis pending) + +- SF5: Enable per sub-module logging level configuration + +- SF6: Enable logging to other frameworks, than just the current gluster +logs + +- SF7: Generate a catalogue of these message, with message ID, message, +reason for occurrence, recovery/troubleshooting steps. 
+ +Owners +------ + +Balamurugan Arumugam +Krishnan Parthasarathi +Krutika Dhananjay +Shyamsundar Ranganathan + +Current status +-------------- + +### Existing infrastructure: + +Currently gf\_logXXX exists as an infrastructure API for all logging +related needs. This (typically) takes the form, + +gf\_log(dom, levl, fmt...) + +where, + +    dom: Open format string usually the xlator name, or "cli" or volume name etc. +    levl: One of, GF_LOG_EMERG, GF_LOG_ALERT, GF_LOG_CRITICAL, GF_LOG_ERROR, GF_LOG_WARNING, GF_LOG_NOTICE, GF_LOG_INFO, GF_LOG_DEBUG, GF_LOG_TRACE +    fmt: the actual message string, followed by the required arguments in the string + +The log initialization happens through, + +gf\_log\_init (void \*data, const char \*filename, const char \*ident) + +where, + +    data: glusterfs_ctx_t, largely unused in logging other than the required FILE and mutex fields +    filename: file name to log to +    ident: Like syslog ident parameter, largely unused + +The above infrastructure leads to logs of type, (sample extraction from +nfs.log) + +     [2013-12-08 14:17:17.603879] I [socket.c:3485:socket_init] 0-socket.ACL: SSL support is NOT enabled +     [2013-12-08 14:17:17.603937] I [socket.c:3500:socket_init] 0-socket.ACL: using system polling thread +     [2013-12-08 14:17:17.612128] I [nfs.c:934:init] 0-nfs: NFS service started +     [2013-12-08 14:17:17.612383] I [dht-shared.c:311:dht_init_regex] 0-testvol-dht: using regex rsync-hash-regex = ^\.(.+)\.[^.]+$ + +### Limitations/Issues in the infrastructure + +​1) Auto analysis of logs needs to be done based on the final message +string. Automated tools that can help with log message and related +troubleshooting options need to use the final string, which needs to be +intelligently parsed and also may change between releases. It would be +desirable to have message IDs so that such tools and trouble shooting +options can leverage the same in a much easier fashion. + +​2) The log message itself currently does not use the \_ident\_ which +can help as we move to more common logging frameworks like journald, +rsyslog (or syslog as the case maybe) + +​3) errno is the primary identifier of errors across gluster, i.e we do +not have error codes in gluster and use errno values everywhere. The log +messages currently do not lend themselves to standardization like +printing the string equivalent of errno rather than the actual errno +value, which \_could\_ be cryptic to administrators + +​4) Typical logging infrastructures provide suppression (on a +configurable basis) for repetitive messages to prevent log flooding, +this is currently missing in the current infrastructure + +​5) The current infrastructure cannot be used to control log levels at a +per xlator or sub module, as the \_dom\_ passed is a string that change +based on volume name, translator name etc. It would be desirable to have +a better module identification mechanism that can help with this +feature. + +​6) Currently the entire logging infrastructure resides within gluster. +It would be desirable in scaled situations to have centralized logging +and monitoring solutions in place, to be able to better analyse and +monitor the cluster health and take actions. + +This requires some form of pluggable logging frameworks that can be used +within gluster to enable this possibility. 
Currently the existing +framework is used throughout gluster and hence we need only to change +configuration and logging.c to enable logging to other frameworks (as an +example the current syslog plug that was provided). + +It would be desirable to enhance this to provide a more robust framework +for future extensions to other frameworks. This is not a limitation of +the current framework, so much as a re-factor to be able to switch +logging frameworks with more ease. + +​7) For centralized logging in the future, it would need better +identification strings from various gluster processes and hosts, which +is currently missing or suppressed in the logging infrastructure. + +Due to the nature of enhancements proposed, it is required that we +better the current infrastructure for the stated needs and do some +future proofing in terms of newer messages that would be added. + +Detailed Description +-------------------- + +NOTE: Covering details for SF1, SF2, and partially SF3, SF5, SF6. SF4/7 +will be covered in later revisions/phases. + +### Logging API changes: + +​1) Change the logging API as follows, + +From: gf\_log(dom, levl, fmt...) + +To: gf\_msg(dom, levl, errnum, msgid, fmt...) + +Where: + +    dom: Open string as used in the current logging infrastructure (helps in backward compat) +    levl: As in current logging infrastructure (current levels seem sufficient enough to not add more levels for better debuggability etc.) +     +    msgid: A message identifier, unique to this message FMT string and possibly this invocation. (SF1, lending to SF3) +    errnum: The errno that this message is generated for (with an implicit 0 meaning no error number per se with this message) (SF2) + +NOTE: Internally the gf\_msg would still be a macro that would add the +\_\_FILE\_\_ \_\_LINE\_\_ \_\_FUNCTION\_\_ arguments + +​2) Enforce \_ident\_ in the logging initialization API, gf\_log\_init +(void \*data, const char \*filename, const char \*ident) + +Where: + + ident would be the identifier string like, nfs, , brick-, cli, glusterd, as is the case with the log file name that is generated today (lending to SF6) + +#### What this achieves: + +With the above changes, we now have a message ID per message +(\_msgid\_), location of the message in terms of which component +(\_dom\_) and which process (\_ident\_). The further identification of +the message location in terms of host (ip/name) can be done in the +framework, when centralized logging infrastructure is introduced. 
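The note above implies that gf\_msg remains a thin macro over an internal
worker that captures the call site. A minimal sketch of such a wrapper is
shown below; the worker name \_gf\_msg and its exact argument order are
assumptions for illustration, not the final API:

    /* Sketch only: gf_msg() as a macro that records the call site and
     * forwards everything, including the new errnum and msgid arguments,
     * to an internal worker. Worker name and argument order are assumed. */
    #define gf_msg(dom, levl, errnum, msgid, fmt...)                      \
            do {                                                          \
                    _gf_msg (dom, __FILE__, __FUNCTION__, __LINE__,       \
                             levl, errnum, msgid, ##fmt);                 \
            } while (0)

Existing gf\_log call sites can then be migrated mechanically, since only
the two additional arguments (errnum and msgid) need to be supplied.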
+ +#### Log message changes: + +With the above changes to the API the log message can now appear in a +compatibility mode to adhere to current logging format, or be presented +as follows, + +log invoked as: gf\_msg(dom, levl, ENOTSUP, msgidX) + +Example: gf\_msg ("logchecks", GF\_LOG\_CRITICAL, 22, logchecks\_msg\_4, +42, "Forty-Two", 42); + +Where: logchecks\_msg\_4 (GLFS\_COMP\_BASE + 4), "Critical: Format +testing: %d:%s:%x" + +​1) Gluster logging framework (logged as) + + [2014-02-17 08:52:28.038267] I [MSGID: 1002] [logchecks.c:44:go_log] 0-logchecks: Informational: Format testing: 42:Forty-Two:2a [Invalid argument] + +​2) syslog (passed as) + + Feb 17 14:17:42 somari logchecks[26205]: [MSGID: 1002] [logchecks.c:44:go_log] 0-logchecks: Informational: Format testing: 42:Forty-Two:2a [Invalid argument] + +​3) journald (passed as) + +    sd_journal_send("MESSAGE=", +                        "MESSAGE_ID=msgid", +                        "PRIORITY=levl", +                        "CODE_FILE=``", "CODE_LINE=`", "CODE_FUNC=", +                        "ERRNO=errnum", +                        "SYSLOG_IDENTIFIER=" +                        NULL); + +​4) CEE (Common Event Expression) format string passed to any CEE +consumer (say lumberjack) + +Based on generating @CEE JSON string as per specifications and passing +it the infrastructure in question. + +#### Message ID generation: + +​1) Some rules for message IDs + +- Every message, even if it is the same message FMT, will have a unique +message ID - Changes to a specific message string, hence will not change +its ID and also not impact other locations in the code that use the same +message FMT + +​2) A glfs-message-id.h file would contain ranges per component for +individual component based messages to be created without overlapping on +the ranges. + +​3) -message.h would contain something as follows, + +     #define GLFS_COMP_BASE         GLFS_MSGID_COMP_ +     #define GLFS_NUM_MESSAGES       1 +     #define GLFS_MSGID_END          (GLFS_COMP_BASE + GLFS_NUM_MESSAGES + 1) +     /* Messaged with message IDs */ +     #define glfs_msg_start_x GLFS_COMP_BASE, "Invalid: Start of messages" +     /*------------*/ +     #define _msg_1 (GLFS_COMP_BASE + 1), "Test message, replace with"\ +                        " original when using the template" +     /*------------*/ +     #define glfs_msg_end_x GLFS_MSGID_END, "Invalid: End of messages" + +​5) Each call to gf\_msg hence would be, + +    gf_msg(dom, levl, errnum, glfs_msg_x, ...) + +#### Setting per xlator logging levels (SF5): + +short description to be elaborated later + +Leverage this-\>loglevel to override the global loglevel. This can be +also configured from gluster CLI at runtime to change the log levels at +a per xlator level for targeted debugging. 
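A minimal sketch of how this override could be applied when a message is
emitted is given below; the field names assumed here (this-\>loglevel for
the per-xlator override, ctx-\>log.loglevel for the global level) are
illustrative, as is the helper itself:

    /* Sketch only: decide whether a message should be emitted, preferring
     * the per-xlator override when one is set. Assumes the usual xlator.h
     * types; field names are illustrative. */
    static inline gf_boolean_t
    gf_msg_should_log (xlator_t *this, gf_loglevel_t levl)
    {
            gf_loglevel_t effective = this->ctx->log.loglevel; /* global level */

            if (this->loglevel != GF_LOG_NONE)  /* per-xlator override is set */
                    effective = this->loglevel;

            /* lower values are more severe, so emit when levl is at or
             * above the threshold severity */
            return (levl <= effective) ? _gf_true : _gf_false;
    }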
+ +#### Multiple log suppression(SF3): + +short description to be elaborated later + +​1) Save the message string as follows, Msg\_Object(msgid, +msgstring(vasprintf(dom, fmt)), timestamp, repetitions) + +​2) On each message received by the logging infrastructure check the +list of saved last few Msg\_Objects as follows, + +2.1) compare msgid and on success compare msgstring for a match, compare +repetition tolerance time with current TS and saved TS in the +Msg\_Object + +2.1.1) if tolerance is within limits, increment repetitions and do not +print message + +2.1.2) if tolerance is outside limits, print repetition count for saved +message (if any) and print the new message + +2.2) If none of the messages match the current message, knock off the +oldest message in the list printing any repetition count message for the +same, and stash new message into the list + +The key things to remember and act on here would be to, minimize the +string duplication on each message, and also to keep the comparison +quick (hence base it off message IDs and errno to start with) + +#### Message catalogue (SF7): + + + +The idea is to use Doxygen comments in the -message.h per +component, to list information in various sections per message of +consequence and later use Doxygen to publish this catalogue on a per +release basis. + +Benefit to GlusterFS +-------------------- + +The mentioned limitations and auto log analysis benefits would accrue +for GlusterFS + +Scope +----- + +### Nature of proposed change + +All gf\_logXXX function invocations would change to gf\_msgXXX +invocations. + +### Implications on manageability + +None + +### Implications on presentation layer + +None + +### Implications on persistence layer + +None + +### Implications on 'GlusterFS' backend + +None + +### Modification to GlusterFS metadata + +None + +### Implications on 'glusterd' + +None + +How To Test +----------- + +A separate test utility that tests various logs and formats would be +provided to ensure that functionality can be tested independent of +GlusterFS + +User Experience +--------------- + +Users would notice changed logging formats as mentioned above, the +additional field of importance would be the MSGID: + +Dependencies +------------ + +None + +Documentation +------------- + +Intending to add a logging.md (or modify the same) to elaborate on how a +new component should now use the new framework and generate messages +with IDs in the same. + +Status +------ + +In development (see, ) + +Comments and Discussion +----------------------- + + \ No newline at end of file diff --git a/done/GlusterFS 3.6/Better Peer Identification.md b/done/GlusterFS 3.6/Better Peer Identification.md new file mode 100644 index 0000000..a8c6996 --- /dev/null +++ b/done/GlusterFS 3.6/Better Peer Identification.md @@ -0,0 +1,172 @@ +Feature +------- + +**Better peer identification** + +Summary +------- + +This proposal is regarding better identification of peers. + +Owners +------ + +Kaushal Madappa + +Current status +-------------- + +Glusterd currently is inconsistent in the way it identifies peers. This +causes problems when the same peer is referenced with different names in +different gluster commands. + +Detailed Description +-------------------- + +Currently, the way we identify peers is not consistent all through the +gluster code. We use uuids internally and hostnames externally. 
+ +This setup works pretty well when all the peers are on a single network, +have one address, and are referred to in all the gluster commands with +same address. + +But once we start mixing up addresses in the commands (ip, shortnames, +fqdn) and bring in multiple networks we have problems. + +The problems were discussed in the following mailing list threads and +some solutions were proposed. + +- How do we identify peers? [^1] +- RFC - "Connection Groups" concept [^2] + +The solution to the multi-network problem is dependent on the solution +to the peer identification problem. So it'll be good to target fixing +the peer identification problem asap, ie. in 3.6, and take up the +networks problem later. + +Benefit to GlusterFS +-------------------- + +Sanity. It will be great to have all internal identifiers for peers +happening through a UUID, and being translated into a host/IP at the +most superficial layer. + +Scope +----- + +### Nature of proposed change + +The following changes will be done in Glusterd to improve peer +identification. + +1. Peerinfo struct will be extended to have a list of associated + hostnames/addresses, instead of a single hostname as it is + currently. The import/export and store/restore functions will be + changed to handle this. CLI will be updated to show this list of + addresses in peer status and pool list commands. +2. Peer probe will be changed to append an address to the peerinfo + address list, when we observe that the given address belongs to an + existing peer. +3. Have a new API for translation between hostname/addresses into + UUIDs. This new API will be used in all places where + hostnames/addresses were being validated, including peer probe, peer + detach, volume create, add-brick, remove-brick etc. +4. A new command - 'gluster peer add-address ' + - which appends to the address list will be implemented if time + permits. +5. A new command - 'gluster peer rename ' - which will + rename all occurrences of a peer with the newly given name will be + implemented if time permits. + +Changes 1-3 are the base for the other changes and will the primary +deliverables for this feature. + +### Implications on manageability + +The primary changes will bring about some changes to the CLI output of +'peer status' and 'pool list' commands. The normal and XML outputs for +these commands will contain a list of addresses for each peer, instead +of a single hostname. + +Tools depending on the output of these commands will need to be updated. + +**TODO**: *Add sample outputs* + +The new commands 'peer add-address' and 'peer rename' will improve +manageability of peers. + +### Implications on presentation layer + +None + +### Implications on persistence layer + +None + +### Implications on 'GlusterFS' backend + +None + +### Modification to GlusterFS metadata + +None + +### Implications on 'glusterd' + + + +How To Test +----------- + +**TODO:** *Add test cases* + +User Experience +--------------- + +User experience will improve for commands which used peer identifiers +(volume create/add-brick/remove-brick, peer probe, peer detach), as the +the user will no longer face errors caused by mixed usage of +identifiers. + +Dependencies +------------ + +None. + +Documentation +------------- + +The new behaviour of the peer probe command will need to be documented. +The new commands will need to be documented as well. + +**TODO:** *Add more documentations* + +Status +------ + +The feature is under development on forge [^3] and github [^4]. 
This +github merge request [^5] can be used for performing preliminary +reviews. Once we are satisfied with the changes, it will be posted for +review on gerrit. + +Comments and Discussion +----------------------- + +There are open issues around node crash + re-install with same IP (but +new UUID) which need to be addressed in this effort. + +Links +----- + + + + +[^1]: + +[^2]: + +[^3]: + +[^4]: + +[^5]: diff --git a/done/GlusterFS 3.6/Gluster User Serviceable Snapshots.md b/done/GlusterFS 3.6/Gluster User Serviceable Snapshots.md new file mode 100644 index 0000000..9af7062 --- /dev/null +++ b/done/GlusterFS 3.6/Gluster User Serviceable Snapshots.md @@ -0,0 +1,39 @@ +Feature +------- + +Enable user-serviceable snapshots for GlusterFS Volumes based on +GlusterFS-Snapshot feature + +Owners +------ + +Anand Avati +Anand Subramanian +Raghavendra Bhat +Varun Shastry + +Summary +------- + +Each snapshot capable GlusterFS Volume will contain a .snaps directory +through which a user will be able to access previously point-in-time +snapshot copies of his data. This will be enabled through a hidden +.snaps folder in each directory or sub-directory within the volume. +These user-serviceable snapshot copies will be read-only. + +Tests +----- + +​1) Enable uss (gluster volume set features.uss enable) A +snap daemon should get started for the volume. It should be visible in +gluster volume status command. 2) entering the snapshot world ls on +.snaps from any directory within the filesystem should be successful and +should show the list of snapshots as directories. 3) accessing the +snapshots One of the snapshots can be entered and it should show the +contents of the directory from which .snaps was entered, when the +snapshot was taken. NOTE: If the directory was not present when a +snapshot was taken (say snap1) and created later, then entering snap1 +directory (or any access) will fail with stale file handle. 4) Reading +from snapshots Any kind of read operations from the snapshots should be +successful. But any modifications to snapshot data is not allowed. +Snapshots are read-only \ No newline at end of file diff --git a/done/GlusterFS 3.6/Gluster Volume Snapshot.md b/done/GlusterFS 3.6/Gluster Volume Snapshot.md new file mode 100644 index 0000000..468992a --- /dev/null +++ b/done/GlusterFS 3.6/Gluster Volume Snapshot.md @@ -0,0 +1,354 @@ +Feature +------- + +Snapshot of Gluster Volume + +Summary +------- + +Gluster volume snapshot will provide point-in-time copy of a GlusterFS +volume. This snapshot is an online-snapshot therefore file-system and +its associated data continue to be available for the clients, while the +snapshot is being taken. + +Snapshot of a GlusterFS volume will create another read-only volume +which will be a point-in-time copy of the original volume. Users can use +this read-only volume to recover any file(s) they want. Snapshot will +also provide restore feature which will help the user to recover an +entire volume. The restore operation will replace the original volume +with the snapshot volume. + +Owner(s) +-------- + +Rajesh Joseph + +Copyright +--------- + +Copyright (c) 2013-2014 Red Hat, Inc. + +This feature is licensed under your choice of the GNU Lesser General +Public License, version 3 or any later version (LGPLv3 or later), or the +GNU General Public License, version 2 (GPLv2), in all cases as published +by the Free Software Foundation. 
+ +Current status +-------------- + +Gluster volume snapshot support is provided in GlusterFS 3.6 + +Detailed Description +-------------------- + +GlusterFS snapshot feature will provide a crash consistent point-in-time +copy of Gluster volume(s). This snapshot is an online-snapshot therefore +file-system and its associated data continue to be available for the +clients, while the snapshot is being taken. As of now we are not +planning to provide application level crash consistency. That means if a +snapshot is restored then applications need to rely on journals or other +technique to recover or cleanup some of the operations performed on +GlusterFS volume. + +A GlusterFS volume is made up of multiple bricks spread across multiple +nodes. Each brick translates to a directory path on a given file-system. +The current snapshot design is based on thinly provisioned LVM2 snapshot +feature. Therefore as a prerequisite the Gluster bricks should be on +thinly provisioned LVM. For a single lvm, taking a snapshot would be +straight forward for the admin, but this is compounded in a GlusterFS +volume which has bricks spread across multiple LVM’s across multiple +nodes. Gluster volume snapshot feature aims to provide a set of +interfaces from which the admin can snap and manage the snapshots for +Gluster volumes. + +Gluster volume snapshot is nothing but snapshots of all the bricks in +the volume. So ideally all the bricks should be snapped at the same +time. But with real-life latencies (processor and network) this may not +hold true all the time. Therefore we need to make sure that during +snapshot the file-system is in consistent state. Therefore we barrier +few operation so that the file-system remains in a healthy state during +snapshot. + +For details about barrier [Server Side +Barrier](http://www.gluster.org/community/documentation/index.php/Features/Server-side_Barrier_feature) + +Benefit to GlusterFS +-------------------- + +Snapshot of glusterfs volume allows users to + +- A point in time checkpoint from which to recover/failback +- Allow read-only snaps to be the source of backups. + +Scope +----- + +### Nature of proposed change + +Gluster cli will be modified to provide new commands for snapshot +management. The entire snapshot core implementation will be done in +glusterd. + +Apart from this Snapshot will also make use of quiescing xlator for +doing quiescing. This will be a server side translator which will +quiesce will fops which can modify disk state. The quescing will be done +till the snapshot operation is complete. + +### Implications on manageability + +Snapshot will provide new set of cli commands to manage snapshots. REST +APIs are not planned for this release. + +### Implications on persistence layer + +Snapshot will create new volume per snapshot. These volumes are stored +in /var/lib/glusterd/snaps folder. Apart from this each volume will have +additional snapshot related information stored in snap\_list.info file +in its respective vol folder. + +### Implications on 'glusterd' + +Snapshot information and snapshot volume details are stored in +persistent stores. + +How To Test +----------- + +For testing this feature one needs to have mulitple thinly provisioned +volumes or else need to create LVM using loop back devices. + +Details of how to create thin volume can be found at the following link + + +Each brick needs to be in a independent LVM. And these LVMs should be +thinly provisioned. From these bricks create Gluster volume. 
This volume +can then be used for snapshot testing. + +See the User Experience section for various commands of snapshot. + +User Experience +--------------- + +##### Snapshot creation + + snapshot create   [description ] [force] + +This command will create a sapshot of the volume identified by volname. +snapname is a mandatory field and the name should be unique in the +entire cluster. Users can also provide an optional description to be +saved along with the snap (max 1024 characters). force keyword is used +if some bricks of orginal volume is down and still you want to take the +snapshot. + +##### Listing of available snaps + + gluster snapshot list [snap-name] [vol ] + +This command is used to list all snapshots taken, or for a specified +volume. If snap-name is provided then it will list the details of that +snap. + +##### Configuring the snapshot behavior + + gluster snapshot config [vol-name] + +This command will display existing config values for a volume. If volume +name is not provided then config values of all the volume is displayed. + + gluster snapshot config [vol-name] [ ] [ ] [force] + +The above command can be used to change the existing config values. If +vol-name is provided then config value of that volume is changed, else +it will set/change the system limit. + +The system limit is the default value of the config for all the volume. +Volume specific limit cannot cross the system limit. If a volume +specific limit is not provided then system limit will be considered. + +If any of this limit is decreased and the current snap count of the +system/volume is more than the limit then the command will fail. If user +still want to decrease the limit then force option should be used. + +**snap-max-limit**: Maximum snapshot limit for a volume. Snapshots +creation will fail if snap count reach this limit. + +**snap-max-soft-limit**: Maximum snapshot limit for a volume. Snapshots +can still be created if snap count reaches this limit. An auto-deletion +will be triggered if this limit is reached. The oldest snaps will be +deleted if snap count reaches this limit. This is represented as +percentage value. + +##### Status of snapshots + + gluster snapshot status ([snap-name] | [volume ]) + +Shows the status of all the snapshots or the specified snapshot. The +status will include the brick details, LVM details, process details, +etc. + +##### Activating a snap volume + +By default the snapshot created will be in an inactive state. Use the +following commands to activate snapshot. + + gluster snapshot activate  + +##### Deactivating a snap volume + + gluster snapshot deactivate  + +The above command will deactivate an active snapshot + +##### Deleting snaps + + gluster snapshot delete  + +This command will delete the specified snapshot. + +##### Restoring snaps + + gluster snapshot restore  + +This command restores an already taken snapshot of single or multiple +volumes. Snapshot restore is an offline activity therefore if any volume +which is part of the given snap is online then the restore operation +will fail. + +Once the snapshot is restored it will be deleted from the list of +snapshot. + +Dependencies +------------ + +To provide support for a crash-consistent snapshot feature Gluster core +com- ponents itself should be crash-consistent. As of now Gluster as a +whole is not crash-consistent. In this section we will identify those +Gluster components which are not crash-consistent. + +**Geo-Replication**: Geo-replication provides master-slave +synchronization option to Gluster. 
Geo-replication maintains state +information for completing the sync operation. Therefore ideally when a +snapshot is taken then both the master and slave snapshot should be +taken. And both master and slave snapshot should be in mutually +consistent state. + +Geo-replication make use of change-log to do the sync. By default the +change-log is stored .glusterfs folder in every brick. But the +change-log path is configurable. If change-log is part of the brick then +snapshot will contain the change-log changes as well. But if it is not +then it needs to be saved separately during a snapshot. + +Following things should be considered for making change-log +crash-consistent: + +- Change-log is part of the brick of the same volume. +- Change-log is outside the brick. As of now there is no size limit on + the + +change-log files. We need to answer following questions here + +- - Time taken to make a copy of the entire change-log. Will affect + the + +overall time of snapshot operation. + +- - The location where it can be copied. Will impact the disk usage + of + +the target disk or file-system. + +- Some part of change-log is present in the brick and some are outside + +the brick. This situation will arrive when change-log path is changed +in-between. + +- Change-log is saved in another volume and this volume forms a CG + with + +the volume about to be snapped. + +**Note**: Considering the above points we have decided not to support +change-log stored outside the bricks. + +For this release automatic snapshot of both master and slave session is +not supported. If required user need to explicitly take snapshot of both +master and slave. Following steps need to be followed while taking +snapshot of a master and slave setup + +- Stop geo-replication manually. +- Snapshot all the slaves first. +- When the slave snapshot is done then initiate master snapshot. +- When both the snapshot is complete geo-syncronization can be started + again. + +**Gluster Quota**: Quota enables an admin to specify per directory +quota. Quota makes use of marker translator to enforce quota. As of now +the marker framework is not completely crash-consistent. As part of +snapshot feature we need to address following issues. + +- If a snapshot is taken while the contribution size of a file is + being updated then you might end up with a snapshot where there is a + mismatch between the actual size of the file and the contribution of + the file. These in-consistencies can only be rectified when a + look-up is issued on the snapshot volume for the same file. As a + workaround admin needs to issue an explicit file-system crawl to + rectify the problem. +- For NFS, quota makes use of pgfid to build a path from gfid and + enforce quota. As of now pgfid update is not crash-consistent. +- Quota saves its configuration in file-system under /var/lib/glusterd + folder. As part of snapshot feature we need to save this file. + +**NFS**: NFS uses a single graph to represent all the volumes in the +system. And to make all the snapshot volume accessible these snapshot +volumes should be added to this graph. This brings in another +restriction, i.e. all the snapshot names should be unique and +additionally snap name should not clash with any other volume name as +well. + +To handle this situation we have decided to use an internal uuid as snap +name. And keep a mapping of this uuid and user given snap name in an +internal structure. 
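Such a mapping needs very little state; a rough sketch of what the internal
structure could hold is shown below (member names and sizes are illustrative
and do not reflect the actual glusterd definition):

    /* Sketch only: per-snapshot mapping between the internal UUID (used as
     * the volume name in the NFS graph) and the user-supplied snap name. */
    #include <uuid/uuid.h>

    struct snap_name_map {
            uuid_t                 snap_uuid;      /* internal, unique name */
            char                   snap_name[256]; /* user-facing snap name */
            struct snap_name_map  *next;           /* kept in an in-memory list */
    };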
+ +Another restriction with NFS is that when a newly created volume +(snapshot volume) is started it will restart NFS server. Therefore we +decided when snapshot is taken it will be in stopped state. Later when +snapshot volume is needed it can be started explicitly. + +**DHT**: DHT xlator decides which node to look for a file/directory. +Some of the DHT fop are not atomic in nature, e.g rename (both file and +directory). Also these operations are not transactional in nature. That +means if a crash happens the data in server might be in an inconsistent +state. Depending upon the time of snapshot and which DHT operation is in +what state there can be an inconsistent snapshot. + +**AFR**: AFR is the high-availability module in Gluster. AFR keeps track +of fresh and correct copy of data using extended attributes. Therefore +it is important that before taking snapshot these extended attributes +are written into the disk. To make sure these attributes are written to +disk snapshot module will issue explicit sync after the +barrier/quiescing. + +The other issue with the current AFR is that it writes the volume name +to the extended attribute of all the files. AFR uses this for +self-healing. When snapshot is taken of such a volume the snapshotted +volume will also have the same volume name. Therefore AFR needs to +create a mapping of the real volume name and the extended entry name in +the volfile. So that correct name can be referred during self-heal. + +Another dependency on AFR is that currently there is no direct API or +call back function which will tell that AFR self-healing is completed on +a volume. This feature is required to heal a snapshot volume before +restore. + +Documentation +------------- + +Status +------ + +In development + +Comments and Discussion +----------------------- + + diff --git a/done/GlusterFS 3.6/New Style Replication.md b/done/GlusterFS 3.6/New Style Replication.md new file mode 100644 index 0000000..ffd8167 --- /dev/null +++ b/done/GlusterFS 3.6/New Style Replication.md @@ -0,0 +1,230 @@ +Goal +---- + +More partition-tolerant replication, with higher performance for most +use cases. + +Summary +------- + +NSR is a new synchronous replication translator, complementing or +perhaps some day replacing AFR. + +Owners +------ + +Jeff Darcy +Venky Shankar + +Current status +-------------- + +Design and prototype (nearly) complete, implementation beginning. + +Related Feature Requests and Bugs +--------------------------------- + +[AFR bugs related to "split +brain"](https://bugzilla.redhat.com/buglist.cgi?classification=Community&component=replicate&list_id=3040567&product=GlusterFS&query_format=advanced&short_desc=split&short_desc_type=allwordssubstr) + +[AFR bugs related to +"perf"](https://bugzilla.redhat.com/buglist.cgi?classification=Community&component=replicate&list_id=3040572&product=GlusterFS&query_format=advanced&short_desc=perf&short_desc_type=allwordssubstr) + +(Both lists are undoubtedly partial because not all bugs in these areas +using these specific words. In particular, "GFID mismatch" bugs are +really a kind of split brain, but aren't represented.) + +Detailed Description +-------------------- + +NSR is designed to have the following features. + +- Server based - "chain" replication can use bandwidth of both client + and server instead of splitting client bandwidth N ways. + +- Journal based - for reduced network traffic in normal operation, + plus faster recovery and greater resistance to "split brain" errors. 
+ +- Variable consistency model - based on + [Dynamo](http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf) + to provide options trading some consistency for greater availability + and/or performance. + +- Newer, smaller codebase - reduces technical debt, enables higher + replica counts, more informative status reporting and logging, and + other future features (e.g. ordered asynchronous replication). + +Benefit to GlusterFS +==================== + +Faster, more robust, more manageable/maintainable replication. + +Scope +===== + +Nature of proposed change +------------------------- + +At least two new translators will be necessary. + +- A simple client-side translator to route requests to the current + leader among the bricks in a replica set. + +- A server-side translator to handle the "heavy lifting" of + replication, recovery, etc. + +Implications on manageability +----------------------------- + +At a high level, commands to enable, configure, and manage NSR will be +very similar to those already used for AFR. At a lower level, the +options affecting things things like quorum, consistency, and placement +of journals will all be completely different. + +Implications on presentation layer +---------------------------------- + +Minimal. Most changes will be to simplify or remove special handling for +AFR's unique behavior (especially around lookup vs. self-heal). + +Implications on persistence layer +--------------------------------- + +N/A + +Implications on 'GlusterFS' backend +----------------------------------- + +The journal for each brick in an NSR volume might (for performance +reasons) be placed on one or more local volumes other than the one +containing the brick's data. Special requirements around AIO, fsync, +etc. will be less than with AFR. + +Modification to GlusterFS metadata +---------------------------------- + +NSR will not use the same xattrs as AFR, reducing the need for larger +inodes. + +Implications on 'glusterd' +-------------------------- + +Volgen must be able to configure the client-side and server-side parts +of NSR, instead of AFR on the client side and index (which will no +longer be necessary) on the server side. Other interactions with +glusterd should remain mostly the same. + +How To Test +=========== + +Most basic AFR tests - e.g. reading/writing data, killing nodes, +starting/stopping self-heal - would apply to NSR as well. Tests that +embed assumptions about AFR xattrs or other internal artifacts will need +to be re-written. + +User Experience +=============== + +Minimal change, mostly related to new options. + +Dependencies +============ + +NSR depends on a cluster-management framework that can provide +membership tracking, leader election, and robust consistent key/value +data storage. This is expected to be developed in parallel as part of +the glusterd-scalability feature, but can be implemented (in simplified +form) within NSR itself if necessary. + +Documentation +============= + +TBD. + +Status +====== + +Some parts of earlier implementation updated to current tree, others in +the middle of replacement. 
+ +- [New design](http://review.gluster.org/#/c/8915/) + +- [Basic translator code](http://review.gluster.org/#/c/8913/) (needs + update to new code-generation infractructure) + +- [GF\_FOP\_IPC](http://review.gluster.org/#/c/8812/) + +- [etcd support](http://review.gluster.org/#/c/8887/) + +- [New code-generation + infrastructure](http://review.gluster.org/#/c/9411/) + +- [New data-logging + translator](https://forge.gluster.org/~jdarcy/glusterfs-core/jdarcys-glusterfs-data-logging) + +Comments and Discussion +======================= + +My biggest concern with journal-based replication comes from my previous +use of DRBD. They do an "activity log"[^1] which sounds like the same +basic concept. Once that log filled, I experienced cascading failure. +When the journal can be filled faster than it's emptied this could cause +the problem I experienced. + +So what I'm looking to be convinced is how journalled replication +maintains full redundancy and how it will prevent the journal input from +exceeding the capacity of the journal output or at least how it won't +fail if this should happen. + +[jjulian](User:Jjulian "wikilink") +([talk](User talk:Jjulian "wikilink")) 17:21, 13 August 2013 (UTC) + +
+This is akin to a CAP Theorem[^2][^3] problem. If your nodes can't +communicate, what do you do with writes? Our replication approach has +traditionally been CP - enforce quorum, allow writes only among the +majority - and for the sake of satisfying user expectations (or POSIX) +pretty much has to remain CP at least by default. I personally think we +need to allow an AP choice as well, which is why the quorum levels in +NSR are tunable to get that result. + +So, what do we do if a node runs out of journal space? Well, it's unable +to function normally, i.e. it's failed, so it can't count toward quorum. +This would immediately lead to loss of write availability in a two-node +replica set, and could happen easily enough in a three-node replica set +if two similarly configured nodes ran out of journal space +simultaneously. A significant part of the complexity in our design is +around pruning no-longer-needed journal segments, precisely because this +is an icky problem, but even with all the pruning in the world it could +still happen eventually. Therefore the design also includes the notion +of arbiters, which can be quorum-only or can also have their own +journals (with no or partial data). Therefore, your quorum for +admission/journaling purposes can be significantly higher than your +actual replica count. So what options do we have to avoid or deal with +journal exhaustion? + +- Add more journal space (it's just files, so this can be done + reactively during an extended outage). + +- Add arbiters. + +- Decrease the quorum levels. + +- Manually kick a node out of the replica set. + +- Add admission control, artificially delaying new requests as the + journal becomes full. (This one requires more code.) + +If you do \*none\* of these things then yeah, you're scrod. That said, +do you think these options seem sufficient? + +[Jdarcy](User:Jdarcy "wikilink") ([talk](User talk:Jdarcy "wikilink")) +15:27, 29 August 2013 (UTC) + + + +[^1]: + +[^2]: + +[^3]: diff --git a/done/GlusterFS 3.6/Persistent AFR Changelog xattributes.md b/done/GlusterFS 3.6/Persistent AFR Changelog xattributes.md new file mode 100644 index 0000000..e21b788 --- /dev/null +++ b/done/GlusterFS 3.6/Persistent AFR Changelog xattributes.md @@ -0,0 +1,178 @@ +Feature +------- + +Provide a unique and consistent name for AFR changelog extended +attributes/ client translator names in the volume graph. + +Summary +------- + +Make AFR changelog extended attribute names independent of brick +position in the graph, which ensures that there will be no potential +misdirected self-heals during remove-brick operation. + +Owners +------ + +Ravishankar N +Pranith Kumar K + +Current status +-------------- + +Patches merged in master. + + + + + +Detailed Description +-------------------- + +BACKGROUND ON THE PROBLEM: =========================== AFR makes use of +changelog extended attributes on a per file basis which records pending +operations on that file and is used to determine the sources and sinks +when healing needs to be done. As of today, AFR uses the client +translator names (from the volume graph) as the names of the changelog +attributes. For eg. 
for a replica 3 volume, each file on every brick has +the following extended attributes: + + trusted.afr.-client-0-->maps to Brick0 + trusted.afr.-client-1-->maps to Brick1 + trusted.afr.-client-2-->maps to Brick2 + +​1) Now when any brick is removed (say Brick1), the graph is regenerated +and AFR maps the xattrs to the bricks so: + + trusted.afr.-client-0-->maps to Brick0 + trusted.afr.-client-1-->maps to Brick2  + +Thus the xattr 'trusted.afr.testvol-client-1' which earlier referred to +Brick1's attributes now refer to Brick-2's. If there are pending +self-heals prior to the remove-brick happened, healing could possibly +happen in the wrong direction thereby causing data loss. + +​2) The second problem is a dependency with Snapshot feature. Snapshot +volumes have new names (UUID based) and thus the (client)xlator names +are different. Eg: \<-client-0\> will now be +\<-client-0\>. When AFR uses these names to query for its +changelog xattrs but the files on the bricks have the old changelog +xattrs. Hence the heal information is completely lost. + +WHAT IS THE EXACT ISSUE WE ARE SOLVING OR OBJECTIVE OF THE +FEATURE/DESIGN? +========================================================================== +In a nutshell, the solution is to generate unique and persistent names +for the client translators so that even if any of the bricks are +removed, the translator names always map to the same bricks. In turn, +AFR, which uses these names for the changelog xattr names also refer to +the correct bricks. + +SOLUTION: + +The solution is explained as a sequence of steps: + +- The client translator names will still use the existing + nomenclature, except that now they are monotonically increasing + (-client-0,1,2...) and are not dependent on the brick + position.Let us call these names as brick-IDs. These brick IDs are + also written to the brickinfo files (in + /var/lib/glusterd/vols//bricks/\*) by glusterd during + volume creation. When the volfile is generated, these brick + brick-IDs form the client xlator names. + +- Whenever a brick operation is performed, the names are retained for + existing bricks irrespective of their position in the graph. New + bricks get the monotonically increasing brick-ID while names for + existing bricks are obtained from the brickinfo file. + +- Note that this approach does not affect client versions (old/new) in + anyway because the clients just use the volume config provided by + the volfile server. + +- For retaining backward compatibility, We need to check two items: + (a)Under what condition is remove brick allowed; (b)When is brick-ID + written to brickinfo file. + +For the above 2 items, the implementation rules will be thus: + +​i) This feature is implemented in 3.6. Lets say its op-version is 5. + +​ii) We need to implement a check to allow remove-brick only if cluster +opversion is \>=5 + +​iii) The brick-ID is written to brickinfo when the nodes are upgraded +(during glusterd restore) and when a peer is probed (i.e. during volfile +import). + +Benefit to GlusterFS +-------------------- + +Even if there are pending self-heals, remove-brick operations can be +carried out safely without fear of incorrect heals which may cause data +loss. + +Scope +----- + +### Nature of proposed change + +Modifications will be made in restore, volfile import and volgen +portions of glusterd. 
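The ID-retention rule described above is small in itself; a rough sketch is
given below (function and variable names are illustrative and do not reflect
glusterd's actual brickinfo handling): an existing brick keeps the brick-ID
recorded in its brickinfo file, and only a newly added brick consumes the
next monotonically increasing index.

    /* Sketch only: assign a position-independent brick-ID. Existing bricks
     * keep the ID persisted in their brickinfo file; new bricks get the
     * next free index, so changelog xattr names never shift on remove-brick. */
    #include <stdio.h>

    static void
    assign_brick_id (char *brick_id, size_t len, const char *volname,
                     const char *stored_id, int *next_index)
    {
            if (stored_id && stored_id[0])
                    snprintf (brick_id, len, "%s", stored_id);   /* keep old ID */
            else
                    snprintf (brick_id, len, "%s-client-%d",
                              volname, (*next_index)++);         /* new brick */
    }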
+ +### Implications on manageability + +N/A + +### Implications on presentation layer + +N/A + +### Implications on persistence layer + +N/A + +### Implications on 'GlusterFS' backend + +N/A + +### Modification to GlusterFS metadata + +N/A + +### Implications on 'glusterd' + +As described earlier. + +How To Test +----------- + +remove-brick operation needs to be carried out on rep/dist-rep volumes +having pending self-heals and it must be verified that no data is lost. +snapshots of the volumes must also be able to access files without any +issues. + +User Experience +--------------- + +N/A + +Dependencies +------------ + +None. + +Documentation +------------- + +TBD + +Status +------ + +See 'Current status' section. + +Comments and Discussion +----------------------- + + \ No newline at end of file diff --git a/done/GlusterFS 3.6/RDMA Improvements.md b/done/GlusterFS 3.6/RDMA Improvements.md new file mode 100644 index 0000000..1e71729 --- /dev/null +++ b/done/GlusterFS 3.6/RDMA Improvements.md @@ -0,0 +1,101 @@ +Feature +------- + +**RDMA Improvements** + +Summary +------- + +This proposal is regarding getting RDMA volumes out of tech preview. + +Owners +------ + +Raghavendra Gowdappa +Vijay Bellur + +Current status +-------------- + +Work in progress + +Detailed Description +-------------------- + +Fix known & unknown issues in volumes with transport type rdma so that +RDMA can be used as the interconnect between client - servers & between +servers. + +- Performance Issues - Had found that performance was bad when + compared with plain ib-verbs send/recv v/s RDMA reads and writes. +- Co-existence with tcp - There seemed to be some memory corruptions + when we had both tcp and rdma transports. +- librdmacm for connection management - with this there is a + requirement that the brick has to listen on an IPoIB address and + this affects our current ability where a peer has the flexibility to + connect to either ethernet or infiniband address. Another related + feature Better peer identification will help us to resolve this + issue. +- More testing required + +Benefit to GlusterFS +-------------------- + +Scope +----- + +### Nature of proposed change + +Bug-fixes to transport/rdma + +### Implications on manageability + +Remove the warning about creation of rdma volumes in CLI. + +### Implications on presentation layer + +TBD + +### Implications on persistence layer + +No impact + +### Implications on 'GlusterFS' backend + +No impact + +### Modification to GlusterFS metadata + +No impact + +### Implications on 'glusterd' + +No impact + +How To Test +----------- + +TBD + +User Experience +--------------- + +TBD + +Dependencies +------------ + +Better Peer identification + +Documentation +------------- + +TBD + +Status +------ + +In development + +Comments and Discussion +----------------------- diff --git a/done/GlusterFS 3.6/Server-side Barrier feature.md b/done/GlusterFS 3.6/Server-side Barrier feature.md new file mode 100644 index 0000000..c13e25a --- /dev/null +++ b/done/GlusterFS 3.6/Server-side Barrier feature.md @@ -0,0 +1,213 @@ +Server-side barrier feature +=========================== + +- Author(s): Varun Shastry, Krishnan Parthasarathi +- Date: Jan 28 2014 +- Bugzilla: +- Document ID: BZ1060002 +- Document Version: 1 +- Obsoletes: NA + +Abstract +-------- + +Snapshot feature needs a mechanism in GlusterFS, where acknowledgements +to file operations (FOPs) are held back until the snapshot of all the +bricks of the volume are taken. 
+ +The barrier feature would stop holding back FOPs after a configurable +'barrier-timeout' seconds. This is to prevent an accidental lockdown of +the volume. + +This mechanism should have the following properties: + +- Should keep 'barriering' transparent to the applications. +- Should not acknowledge FOPs that fall into the barrier class. A FOP + that when acknowledged to the application, could lead to the + snapshot of the volume become inconsistent, is a barrier class FOP. + +With the below example of 'unlink' how a FOP is classified as barrier +class is explained. + +For the following sequence of events, assuming unlink FOP was not +barriered. Assume a replicate volume with two bricks, namely b1 and b2. + + b1 b2 + time ---------------------------------- + | t1 snapshot + | t2 unlink /a unlink /a + \/ t3 mkdir /a mkdir /a + t4 snapshot + +The result of the sequence of events will store /a as a file in snapshot +b1 while /a is stored as directory in snapshot b2. This leads to split +brain problem of the AFR and in other way inconsistency of the volume. + +Copyright +--------- + +Copyright (c) 2014 Red Hat, Inc. + +This feature is licensed under your choice of the GNU Lesser General +Public License, version 3 or any later version (LGPLv3 or later), or the +GNU General Public License, version 2 (GPLv2), in all cases as published +by the Free Software Foundation. + +Introduction +------------ + +The volume snapshot feature snapshots a volume by snapshotting +individual bricks, that are available, using the lvm-snapshot +technology. As part of using lvm-snapshot, the design requires bricks to +be free from few set of modifications (fops in Barrier Class) to avoid +the inconsistency. This is where the server-side barriering of FOPs +comes into picture. + +Terminology +----------- + +- barrier(ing) - To make barrier fops temporarily inactive or + disabled. +- available - A brick is said to be available when the corresponding + glusterfsd process is running and serving file operations. +- FOP - File Operation + +High Level Design +----------------- + +### Architecture/Design Overview + +- Server-side barriering, for Snapshot, must be enabled/disabled on + the bricks of a volume in a synchronous manner. ie, any command + using this would be blocked until barriering is enabled/disabled. + The brick process would provide this mechanism via an RPC. +- Barrier translator would be placed immediately above io-threads + translator in the server/brick stack. +- Barrier translator would queue FOPs when enabled. On disable, the + translator dequeues all the FOPs, while serving new FOPs from + application. By default, barriering is disabled. +- The barrier feature would stop blocking the acknowledgements of FOPs + after a configurable 'barrier-timeout' seconds. This is to prevent + an accidental lockdown of the volume. +- Operations those fall into barrier class are listed below. Any other + fop not listed below does not fall into this category and hence are + not barriered. + - rmdir + - unlink + - rename + - [f]truncate + - fsync + - write with O\_SYNC flag + - [f]removexattr + +### Design Feature + +Following timeline diagram depicts message exchanges between glusterd +and brick during enable and disable of barriering. This diagram assumes +that enable operation is synchronous and disable is asynchronous. See +below for alternatives. 
+ + glusterd (snapshot) barrier @ brick + ------------------ --------------- + t1 | | + t2 | continue to pass through + | all the fops + t3 send 'enable' | + t4 | * starts barriering the fops + | * send back the ack + t5 receive the ack | + | | + t6 | <take snap> | + | . | + | . | + | . | + | </take snap> | + | | + t7 send disable | + (does not wait for the ack) | + t8 | release all the holded fops + | and no more barriering + | | + t9 | continue in PASS_THROUGH mode + +Glusterd would send an RPC (described in API section), to enable +barriering on a brick, by setting option feature.barrier to 'ON' in +barrier translator. This would be performed on all the bricks present in +that node, belonging to the set of volumes that are being snapshotted. + +Disable of barriering can happen in synchronous or asynchronous mode. +The choice is left to the consumer of this feature. + +On disable, all FOPs queued up will be dequeued. Simultaneously the +subsequent barrier request(s) will be served. + +Barrier option enable/disable is persisted into the volfile. This is to +make the feature available for consumers in asynchronous mode, like any +other (configurable) feature. + +Barrier feature also has timeout option based on which dequeuing would +get triggered if the consumer fails to send the disable request. + +Low-level details of Barrier translator working +----------------------------------------------- + +The translator operates in one of two states, namely QUEUEING and +PASS\_THROUGH. + +When barriering is enabled, the translator moves to QUEUEING state. It +queues outgoing FOPs thereafter in the call back path. + +When barriering is disabled, the translator moves to PASS\_THROUGH state +and does not queue when it is in PASS\_THROUGH state. Additionally, the +queued FOPs are 'released', when the translator moves from QUEUEING to +PASS\_THROUGH state. + +It has a translator global queue (doubly linked lists, see +libglusterfs/src/list.h) where the FOPs are queued in the form of a call +stub (see libglusterfs/src/call-stub.[ch]) + +When the FOP has succeeded, but barrier translator failed to queue in +the call back, the barrier translator would disable barriering and +release any queued FOPs, barrier would inform the consumer about this +failure on succesive disable request. + +Interfaces +---------- + +### Application Programming Interface + +- An RPC procedure is added at the brick side, which allows any client + [sic] to set the feature.barrier option of the barrier translator + with a given value. +- Glusterd would be using this to set server-side-barriering on, on a + brick. + +Performance Considerations +-------------------------- + +- The barriering of FOPs may be perceived as a performance degrade by + the applications. Since this is a hard requirement for snapshot, the + onus is on the snapshot feature to reduce the window for which + barriering is enabled. + +### Scalability + +- In glusterd, each brick operation is executed in a serial manner. + So, the latency of enabling barriering is a function of the no. of + bricks present on the node of the set of volumes being snapshotted. + This is not a scalability limitation of the mechanism of enabling + barriering but a limitation in the brick operations mechanism in + glusterd. + +Migration Considerations +------------------------ + +The barrier translator is introduced with op-version 4. It is a +server-side translator and does not impact older clients even when this +feature is enabled. 
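For reference, the QUEUEING and PASS\_THROUGH behaviour described under
'Low-level details of Barrier translator working' above reduces to roughly
the following sketch (locking and the barrier-timeout timer are omitted;
names are illustrative and this is not the in-tree translator code):

    /* Sketch only: the two states described in the low-level section above.
     * Assumes libglusterfs list.h and call-stub.h, both cited there. */
    typedef enum { BARRIER_PASS_THROUGH, BARRIER_QUEUEING } barrier_state_t;

    typedef struct {
            barrier_state_t   state;
            struct list_head  queued;  /* call stubs held back while QUEUEING */
    } barrier_priv_t;

    /* called in the callback path of a barrier-class fop */
    static int
    barrier_hold (barrier_priv_t *priv, call_stub_t *stub)
    {
            if (priv->state == BARRIER_PASS_THROUGH)
                    return 0;                      /* acknowledge immediately */
            list_add_tail (&stub->list, &priv->queued);
            return 1;                              /* acknowledgement held back */
    }

    /* on disable, or on barrier-timeout expiry */
    static void
    barrier_release_all (barrier_priv_t *priv)
    {
            call_stub_t *stub = NULL, *tmp = NULL;

            priv->state = BARRIER_PASS_THROUGH;
            list_for_each_entry_safe (stub, tmp, &priv->queued, list) {
                    list_del_init (&stub->list);
                    call_resume (stub);    /* send the pending acknowledgement */
            }
    }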

Installation and deployment
---------------------------

- The barrier xlator is not packaged with the glusterfs-server rpm.
  With this change, it has to be added to the rpm.

diff --git a/done/GlusterFS 3.6/Thousand Node Gluster.md b/done/GlusterFS 3.6/Thousand Node Gluster.md
new file mode 100644
index 0000000..54c3e13
--- /dev/null
+++ b/done/GlusterFS 3.6/Thousand Node Gluster.md
@@ -0,0 +1,150 @@

Goal
----

Thousand-node scalability for glusterd

Summary
=======

This "feature" is really a set of infrastructure changes that will
enable glusterd to manage a thousand servers gracefully.

Owners
======

Krishnan Parthasarathi
Jeff Darcy

Current status
==============

Proposed, awaiting summit for approval.

Related Feature Requests and Bugs
=================================

N/A

Detailed Description
====================

There are three major areas of change included in this proposal.

- Replace the current order-n-squared heartbeat/membership protocol
  with a much smaller "monitor cluster" based on Paxos or
  [Raft](https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf),
  to which I/O servers check in.

- Use the monitor cluster to designate specific functions or roles -
  e.g. self-heal, rebalance, leadership in an NSR subvolume - to I/O
  servers in a coordinated and globally optimal fashion.

- Replace the current system of replicating configuration data on all
  servers (providing practically no guarantee of consistency if one is
  absent during a configuration change) with storage of configuration
  data in the monitor cluster.

Benefit to GlusterFS
====================

Scaling of our management plane to 1000+ nodes, enabling competition
with other projects such as HDFS or Ceph, which already have or claim
such scalability.

Scope
=====

Nature of proposed change
-------------------------

Functionality very similar to what we need in the monitor cluster
already exists in some of the Raft implementations, notably
[etcd](https://github.com/coreos/etcd). Such a component could provide
the services described above to a modified glusterd running on each
server. The changes to glusterd would mostly consist of removing the
current heartbeat and config-storage code, replacing it with calls into
(and callbacks from) the monitor cluster.

Implications on manageability
-----------------------------

Enabling/starting monitor daemons on those few nodes that have them must
be done separately from starting glusterd. Since the changes are mostly
to how each glusterd interacts with others and with its own local
storage back end, interactions with the CLI or with glusterfsd need not
change.

Implications on presentation layer
----------------------------------

N/A

Implications on persistence layer
---------------------------------

N/A

Implications on 'GlusterFS' backend
-----------------------------------

N/A

Modification to GlusterFS metadata
----------------------------------

The monitor daemons need space for their data, much like that currently
maintained in /var/lib/glusterd.

Implications on 'glusterd'
--------------------------

Drastic. See sections above.

How To Test
===========

A new set of tests for the monitor-cluster functionality will need to be
developed, perhaps derived from those for the external project if we
adopt one. Most tests related to our multi-node testing facilities
(cluster.rc) will also need to change.
Tests which merely invoke the CLI should require little if any change.

User Experience
===============

Minimal change.

Dependencies
============

A mature/stable enough implementation of Raft or a similar protocol.
Failing that, we'd need to develop our own service along similar lines.

Documentation
=============

TBD.

Status
======

In design.

The choice of technology and approach is being discussed on the
-devel ML.

- "Proposal for Glusterd-2.0" -
  [1](http://www.gluster.org/pipermail/gluster-users/2014-September/018639.html)

: Though the discussion has become passive, the question is whether we
  choose to implement a consensus algorithm inside our project or
  depend on external projects that provide a similar service.

- "Management volume proposal" -
  [2](http://www.gluster.org/pipermail/gluster-devel/2014-November/042944.html)

: This has limitations due to a circular dependency, which makes it
  infeasible.

Comments and Discussion
=======================

diff --git a/done/GlusterFS 3.6/afrv2.md b/done/GlusterFS 3.6/afrv2.md
new file mode 100644
index 0000000..a1767c7
--- /dev/null
+++ b/done/GlusterFS 3.6/afrv2.md
@@ -0,0 +1,244 @@

Feature
-------

This feature is a major code refactor of the current AFR, along with a
key design change in the way changelog extended attributes are stored
in AFR.

Summary
-------

This feature introduces a design change in AFR which separates the
ongoing-transaction count from the pending-operation count for
files/directories.

Owners
------

Anand Avati
Pranith Kumar Karampuri

Current status
--------------

The feature is in the final stages of review at


Detailed Description
--------------------

How AFR works:

In order to keep track of which copies of the file are modified and up
to date, and which copies need to be healed, AFR keeps state information
in extended attributes of the file, called changelog extended
attributes. These extended attributes store that copy's view of how up
to date the other copies are. The extended attributes are modified in a
transaction which consists of 5 phases - LOCK, PRE-OP, OP, POST-OP and
UNLOCK. In the PRE-OP phase the extended attributes are updated to store
the intent of modification (in the OP phase).

In the POST-OP phase, depending on how many servers crashed midway and
on how many servers the OP was applied successfully, a corresponding
change is made in the extended attributes (of the surviving copies) to
represent the staleness of the copies which missed the OP phase.

Further, when those lagging servers become available, healing decisions
are taken based on these extended attribute values.

Today, a PRE-OP increments the pending counters of all elements in the
array (where each element represents a server, and therefore one of the
members of the array represents that server itself). The POST-OP
decrements those counters which represent servers where the operation
was successful. The update is performed on all the servers which have
made it to the POST-OP phase. The decision of whether a server crashed
in the middle of a transaction, or whether the server lived through the
transaction and witnessed the other server crash, is inferred by
inspecting the extended attributes of all servers together.
Because there is no distinction between these counters as to how many of
those increments represent "in transit" operations and how many of those
are retained without decrement to represent "pending counters", there is
value in adding clarity to the system by separating the two.

The change is to have only one dirty flag on each server per file. We
also make the PRE-OP increment only that dirty flag rather than all the
elements of the pending array. The dirty flag must be set before
performing the operation, and based on which of the servers the
operation failed on, we set the pending counters representing these
failed servers on the remaining ones in the POST-OP phase. The dirty
counter is also cleared at the end of the POST-OP. This means that in
successful operations only the dirty flag (one integer) is incremented
and decremented per server per file. However, if a pending counter is
set because of an operation failure, then the flag is an unambiguous
"finger pointing" at the other server. That is, if a file has a pending
counter AND a dirty flag, the dirty flag does not undermine the
"strength" of the pending counter. This change completely removes
today's ambiguity of whether a pending counter represents a still
ongoing operation (or one crashed in transit) versus a surely missed
operation. A simplified sketch of this accounting is shown after the
How To Test section below.

Benefit to GlusterFS
--------------------

It makes it clear whether a file has any ongoing transactions and any
pending self-heals, and the code is more maintainable now.

Scope
-----

### Nature of proposed change

- Remove client-side self-healing completely (opendir, openfd, lookup)
- Re-work readdir-failover to work reliably in case of NFS
- Remove unused/dead lock recovery code
- Consistently use xdata in both calls and callbacks in all FOPs
- Per-inode event generation, used to force inode ctx refresh
- Implement dirty flag support (in place of pending counts)
- Eliminate inode ctx structure, use read subvol bits +
  event\_generation
- Implement inode ctx refreshing based on event generation
- Provide backward compatibility in transactions
- Remove unused variables and functions
- Make code more consistent in style and pattern
- Regularize and clean up inode-write transaction code
- Regularize and clean up dir-write transaction code
- Regularize and clean up common FOPs
- Reorganize transaction framework code
- Skip setting xattrs in pending dict if nothing is pending
- Re-write self-healing code using syncops
- Re-write a simpler self-heal daemon

### Implications on manageability

None

### Implications on presentation layer

None

### Implications on persistence layer

None

### Implications on 'GlusterFS' backend

None

### Modification to GlusterFS metadata

This changes the way pending counts and ongoing transactions are
represented in the changelog extended attributes.

### Implications on 'glusterd'

None

How To Test
-----------

The same test cases as for AFRv1 hold.
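
To make the accounting described in the Detailed Description concrete
for review and testing, the following minimal, self-contained C sketch
models how the dirty flag and the pending counters evolve across a
transaction. The structure and function names (struct changelog,
pre\_op, post\_op) are hypothetical simplifications, not the actual AFR
implementation; real AFR keeps these values in extended attributes on
each brick and updates them through xattrop, and the sketch approximates
"brick reachable" by "OP succeeded".

    /* Simplified model of the per-file changelog accounting described above.
     * Real AFR stores 'dirty' and 'pending' as extended attributes per brick;
     * here they are plain integers. */
    #include <stdio.h>

    #define NBRICKS 2

    struct changelog {
            int dirty;                  /* ongoing-transaction flag          */
            int pending[NBRICKS];       /* ops that brick i is known to miss */
    };

    /* PRE-OP: raise only the dirty flag on every participating brick. */
    static void pre_op(struct changelog log[NBRICKS])
    {
            for (int b = 0; b < NBRICKS; b++)
                    log[b].dirty++;
    }

    /* POST-OP: on each brick that saw the OP, record which bricks missed it,
     * then clear the dirty flag.  Bricks that missed the OP never reach
     * POST-OP and are left with a stale dirty flag. */
    static void post_op(struct changelog log[NBRICKS], const int op_ok[NBRICKS])
    {
            for (int b = 0; b < NBRICKS; b++) {
                    if (!op_ok[b])
                            continue;
                    for (int other = 0; other < NBRICKS; other++)
                            if (!op_ok[other])
                                    log[b].pending[other]++;
                    log[b].dirty--;
            }
    }

    int main(void)
    {
            struct changelog log[NBRICKS] = { { 0 } };
            int op_ok[NBRICKS] = { 1, 0 };   /* brick 1 missed the OP */

            pre_op(log);
            post_op(log, op_ok);

            for (int b = 0; b < NBRICKS; b++)
                    printf("brick %d: dirty=%d pending={%d,%d}\n", b,
                           log[b].dirty, log[b].pending[0], log[b].pending[1]);
            /* brick 0: dirty=0 pending={0,1}  -> brick 1 needs heal
             * brick 1: dirty=1 pending={0,0}  -> missed the transaction midway */
            return 0;
    }

For a two-brick replica where brick 1 misses the OP, the sketch leaves
brick 0 with dirty=0 and pending={0,1} (an unambiguous pointer at brick
1), while brick 1 retains only its stale dirty flag - exactly the
distinction between a surely missed operation and one crashed in transit
that the new scheme provides.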

User Experience
---------------

None

Dependencies
------------

None

Documentation
-------------

---

Status
------

The feature is in the final stages of review at


Comments and Discussion
-----------------------

---

Summary
-------



Owners
------



Current status
--------------



Detailed Description
--------------------



Benefit to GlusterFS
--------------------



Scope
-----

### Nature of proposed change



### Implications on manageability



### Implications on presentation layer



### Implications on persistence layer



### Implications on 'GlusterFS' backend



### Modification to GlusterFS metadata