diff --git a/done/GlusterFS 3.7/Scheduling of Snapshot.md b/done/GlusterFS 3.7/Scheduling of Snapshot.md
new file mode 100644
index 0000000..0b2b49c
--- /dev/null
+++ b/done/GlusterFS 3.7/Scheduling of Snapshot.md
@@ -0,0 +1,229 @@
+Feature
+-------
+
+Scheduling Of Snapshot
+
+Summary
+-------
+
+GlusterFS volume snapshots provide point-in-time copies of a GlusterFS
+volume. Currently, GlusterFS volume snapshots can be easily scheduled
+by setting up cron jobs on one of the nodes in the GlusterFS trusted
+storage pool. This creates a single point of failure (SPOF): scheduled
+jobs are missed if the node running the cron jobs goes down.
+
+We can avoid the SPOF by distributing the cron jobs to all nodes of the
+trusted storage pool.
+
+Owner(s)
+--------
+
+Avra Sengupta <asengupt@redhat.com>
+
+Copyright
+---------
+
+Copyright (c) 2015 Red Hat, Inc. <http://www.redhat.com>
+
+This feature is licensed under your choice of the GNU Lesser General
+Public License, version 3 or any later version (LGPLv3 or later), or the
+GNU General Public License, version 2 (GPLv2), in all cases as published
+by the Free Software Foundation.
+
+Detailed Description
+--------------------
+
+The solution to the above problem involves the use of:
+
+- A shared storage - A gluster volume by the name of
+    "gluster\_shared\_storage" is used as shared storage across nodes
+    to coordinate the scheduling operations. This shared storage is
+    mounted at /var/run/gluster/shared\_storage on all the nodes.
+
+- An agent - This agent will perform the actual snapshot commands,
+ instead of cron. It will contain the logic to perform coordinated
+ snapshots.
+
+- A helper script - This script will allow the user to initialise the
+ scheduler on the local node, enable/disable scheduling,
+ add/edit/list/delete snapshot schedules.
+
+- cronie - The default cron daemon shipped with RHEL. It invokes
+    the agent at the intervals specified by the user, and the agent
+    then performs the snapshot operation on the volume named in the
+    schedule.
+
+Initial Setup
+-------------
+
+The administrator needs to create a shared storage volume that is
+available to all nodes across the cluster. A GlusterFS volume by the
+name of "gluster\_shared\_storage" should be created for this purpose.
+It is preferable that the \*shared volume\* be a replicate volume to
+avoid a SPOF.
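+
+A possible way to create such a volume (the node names and brick paths
+here are purely illustrative):
+
+``\
+`# gluster volume create gluster_shared_storage replica 3 node1:/bricks/shared node2:/bricks/shared node3:/bricks/shared`\
+`# gluster volume start gluster_shared_storage`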
+
+Once the shared storage is created, it should be mounted on all nodes
+in the trusted storage pool that will be participating in the
+scheduling. The location where the shared\_storage must be mounted
+(/var/run/gluster/shared\_storage) on these nodes is fixed and is not
+configurable. Each node participating in the scheduling then needs to
+initialise the snapshot scheduler by invoking the following:
+
+snap\_scheduler.py init
+
+NOTE: This command needs to be run on all the nodes participating in
+the scheduling.
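+
+For example, on each participating node (mounting via localhost is
+just one option; any server in the pool can be used):
+
+``\
+`# mount -t glusterfs localhost:/gluster_shared_storage /var/run/gluster/shared_storage`\
+`# snap_scheduler.py init`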
+
+Helper Script
+-------------
+
+The helper script (snap\_scheduler.py) is used to initialise the
+scheduler on the local node, enable/disable scheduling, and
+add/edit/list/delete snapshot schedules.
+
+a) snap\_scheduler.py init
+
+This command initialises the snap\_scheduler and interfaces it with
+the crond running on the local node. This is the first step to be
+performed before executing any scheduling-related commands from a
+node.
+
+NOTE: The helper script needs to be run with this option on all the
+nodes participating in the scheduling. The other options of the helper
+script can be run independently from any node where initialisation has
+been completed successfully.
+
+b) snap\_scheduler.py enable
+
+The snap scheduler is disabled by default after initialisation. This
+command enables the snap scheduler.
+
+c) snap\_scheduler.py disable
+
+This command disables the snap scheduler.
+
+d) snap\_scheduler.py status
+
+This command displays the current status (Enabled/Disabled) of the
+snap scheduler.
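+
+For example, to turn on the scheduler and then verify its status:
+
+``\
+`# snap_scheduler.py enable`\
+`# snap_scheduler.py status`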
+
+e) snap\_scheduler.py add "Job Name" "Schedule" "Volume Name"
+
+This command adds a new snapshot schedule. It takes three arguments,
+each of which must be provided within double quotes (""):
+
+-\> Job Name: This name uniquely identifies the particular schedule
+and can be used to reference the schedule for future operations such
+as edit/delete. If a schedule already exists for the specified Job
+Name, the add command will fail.
+
+-\> Schedule: The schedules are accepted in the format crond
+understands:
+
+``\
+`# Example of job definition:`\
+`# .---------------- minute (0 - 59)`\
+`# |  .------------- hour (0 - 23)`\
+`# |  |  .---------- day of month (1 - 31)`\
+`# |  |  |  .------- month (1 - 12) OR jan,feb,mar,apr ...`\
+`# |  |  |  |  .---- day of week (0 - 6) (Sunday=0 or 7) OR sun,mon,tue,wed,thu,fri,sat`\
+`# |  |  |  |  |`\
+`# *  *  *  *  *  user-name  command to be executed`
+
+Although all valid cron schedules are accepted, the supported
+granularity of snapshot schedules is currently capped at half-hourly
+snapshots.
+
+-\> Volume Name: The name of the volume on which the scheduled snapshot
+operation will be performed.
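+
+For example, to schedule a half-hourly snapshot of a volume (the job
+and volume names are illustrative):
+
+``\
+`# snap_scheduler.py add "Job0" "*/30 * * * *" "test_vol"`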
+
+f) snap\_scheduler.py edit "Job Name" "Schedule" "Volume Name"
+
+This command edits an existing snapshot schedule. It takes the same
+three arguments that the add option takes. All the arguments must be
+provided within double quotes (""). If a schedule does not exist for
+the specified Job Name, the edit command will fail.
+
+g) snap\_scheduler.py delete "Job Name"
+
+This command deletes an existing snapshot schedule. It takes the job
+name of the schedule as its argument. The argument must be provided
+within double quotes (""). If a schedule does not exist for the
+specified Job Name, the delete command will fail.
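+
+For example, changing the illustrative schedule added above to hourly,
+and then removing it:
+
+``\
+`# snap_scheduler.py edit "Job0" "0 * * * *" "test_vol"`\
+`# snap_scheduler.py delete "Job0"`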
+
+h) snap\_scheduler.py list
+
+This command lists the existing snapshot schedules in the following
+manner:
+
+``\
+`# snap_scheduler.py list`\
+`JOB_NAME         SCHEDULE         OPERATION        VOLUME NAME      `\
+`--------------------------------------------------------------------`\
+`Job0             * * * * *        Snapshot Create  test_vol    `
+
+The Agent
+---------
+
+The schedules created with the help of the helper script are read by
+crond, which invokes the agent (gcron.py) at the scheduled intervals
+to perform the snapshot operations on the specified volumes. The agent
+uses the following algorithm to coordinate across nodes.
+
+Pseudocode:
+
+``\
+`start_time = get current time`\
+`lock_file = job_name passed as an argument`\
+`vol_name = volume name passed as an argument`\
+`try POSIX locking the $lock_file`\
+`    if lock is obtained, then`\
+`        mod_time = get modification time of $lock_file`\
+`        if $mod_time < $start_time, then`\
+`            take snapshot of $vol_name`\
+`            if snapshot failed, then`\
+`                log the failure`\
+`            update modification time of $lock_file to current time`\
+`        unlock the $lock_file`
+
+The coordination with the instances running on other nodes is handled
+by the use of POSIX locks. All instances of the script attempt to lock
+the lock\_file, which is essentially an empty file named after the
+job, and the one that obtains the lock takes the snapshot.
+
+To prevent redoing a completed task, the script makes use of the mtime
+attribute of the lock\_file. At the beginning of its execution, the
+script saves its start time. Once the script obtains the lock on the
+lock\_file, and before taking the snapshot, it compares the mtime of
+the lock\_file with the saved start time. The snapshot is taken only
+if the mtime is smaller than the start time. Once the snapshot command
+completes, the script updates the mtime of the lock\_file to the
+current time before unlocking.
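+
+A minimal runnable sketch of this logic in Python, assuming the lock
+file already exists on the shared storage; the snapshot name
+("Scheduled-snap") and the use of basic logging (the real agent logs
+to syslog) are illustrative, not the actual agent's behaviour:
+
+``\
+`import fcntl, logging, os, subprocess, sys, time`\
+``\
+`logging.basicConfig()  # the real agent logs to syslog instead`\
+`start_time = time.time()`\
+`lock_file = sys.argv[1]  # empty per-job lock file on the shared storage`\
+`vol_name = sys.argv[2]   # volume name passed as an argument`\
+``\
+`fd = os.open(lock_file, os.O_RDWR)`\
+`try:`\
+`    # Non-blocking POSIX lock; only one node in the pool obtains it.`\
+`    fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)`\
+`except IOError:`\
+`    os.close(fd)`\
+`    sys.exit(0)  # another node holds the lock and takes the snapshot`\
+`try:`\
+`    # Snapshot only if no other node has done it since this run began.`\
+`    if os.path.getmtime(lock_file) < start_time:`\
+`        ret = subprocess.call(["gluster", "snapshot", "create",`\
+`                               "Scheduled-snap", vol_name])`\
+`        if ret != 0:`\
+`            logging.error("Snapshot of %s failed", vol_name)`\
+`        os.utime(lock_file, None)  # mark this schedule as done`\
+`finally:`\
+`    fcntl.lockf(fd, fcntl.LOCK_UN)`\
+`    os.close(fd)`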
+
+If a snapshot command fails, the script logs the failure (in syslog)
+and continues with its operation. It does not retry the failed
+snapshot within the current schedule, but will attempt it again at the
+next scheduled run. It is left to the administrator to monitor the
+logs and decide what to do after a failure.
+
+Assumptions and Limitations
+---------------------------
+
+It is assumed that all nodes in the cluster have their times synced
+using NTP or any other mechanism. This is a hard requirement for this
+feature to work.
+
+The administrator needs to have Python 2.7 or higher, as well as the
+argparse module, installed in order to use the helper script
+(snap\_scheduler.py).
+
+There is a latency of one minute between issuing a command through
+the helper script and that command taking effect. Hence, snapshot
+schedules with per-minute granularity are currently not supported.
+
+The administrator can, however, leverage the scheduler to schedule
+snapshots at half-hourly/hourly/daily/weekly/monthly/yearly periodic
+intervals. Custom schedules can also be created by specifying the
+minute of the hour, the day of the week, the day of the month, and
+the month of the year at which the snapshot operation should run.
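+
+For instance, daily, weekly and monthly schedules (again with
+illustrative job and volume names) could look like:
+
+``\
+`# snap_scheduler.py add "Daily-snap" "0 0 * * *" "test_vol"`\
+`# snap_scheduler.py add "Weekly-snap" "0 0 * * 0" "test_vol"`\
+`# snap_scheduler.py add "Monthly-snap" "0 0 1 * *" "test_vol"`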