From 9e9e3c5620882d2f769694996ff4d7e0cf36cc2b Mon Sep 17 00:00:00 2001 From: raghavendra talur Date: Thu, 20 Aug 2015 15:09:31 +0530 Subject: Create basic directory structure All new features specs go into in_progress directory. Once signed off, it should be moved to done directory. For now, This change moves all the Gluster 4.0 feature specs to in_progress. All other specs are under done/release-version. More cleanup required will be done incrementally. Change-Id: Id272d301ba8c434cbf7a9a966ceba05fe63b230d BUG: 1206539 Signed-off-by: Raghavendra Talur Reviewed-on: http://review.gluster.org/11969 Reviewed-by: Humble Devassy Chirammal Reviewed-by: Prashanth Pai Tested-by: Humble Devassy Chirammal --- done/GlusterFS 3.5/AFR CLI enhancements.md | 204 ++++++++++ done/GlusterFS 3.5/Brick Failure Detection.md | 151 +++++++ done/GlusterFS 3.5/Disk Encryption.md | 443 +++++++++++++++++++++ done/GlusterFS 3.5/Exposing Volume Capabilities.md | 161 ++++++++ done/GlusterFS 3.5/File Snapshot.md | 101 +++++ .../Onwire Compression-Decompression.md | 96 +++++ done/GlusterFS 3.5/Quota Scalability.md | 99 +++++ done/GlusterFS 3.5/Virt store usecase.md | 140 +++++++ done/GlusterFS 3.5/Zerofill.md | 192 +++++++++ done/GlusterFS 3.5/gfid access.md | 89 +++++ done/GlusterFS 3.5/index.md | 32 ++ done/GlusterFS 3.5/libgfapi with qemu libvirt.md | 222 +++++++++++ done/GlusterFS 3.5/readdir ahead.md | 117 ++++++ 13 files changed, 2047 insertions(+) create mode 100644 done/GlusterFS 3.5/AFR CLI enhancements.md create mode 100644 done/GlusterFS 3.5/Brick Failure Detection.md create mode 100644 done/GlusterFS 3.5/Disk Encryption.md create mode 100644 done/GlusterFS 3.5/Exposing Volume Capabilities.md create mode 100644 done/GlusterFS 3.5/File Snapshot.md create mode 100644 done/GlusterFS 3.5/Onwire Compression-Decompression.md create mode 100644 done/GlusterFS 3.5/Quota Scalability.md create mode 100644 done/GlusterFS 3.5/Virt store usecase.md create mode 100644 done/GlusterFS 3.5/Zerofill.md create mode 100644 done/GlusterFS 3.5/gfid access.md create mode 100644 done/GlusterFS 3.5/index.md create mode 100644 done/GlusterFS 3.5/libgfapi with qemu libvirt.md create mode 100644 done/GlusterFS 3.5/readdir ahead.md (limited to 'done/GlusterFS 3.5') diff --git a/done/GlusterFS 3.5/AFR CLI enhancements.md b/done/GlusterFS 3.5/AFR CLI enhancements.md new file mode 100644 index 0000000..88f4980 --- /dev/null +++ b/done/GlusterFS 3.5/AFR CLI enhancements.md @@ -0,0 +1,204 @@ +Feature +------- + +AFR CLI enhancements + +SUMMARY +------- + +Presently the AFR reporting via CLI has lots of problems in the +representation of logs because of which they may not be able to use the +data effectively. This feature is to correct these problems and provide +a coherent mechanism to present heal status,information and the logs +associated. + +Owners +------ + +Venkatesh Somayajulu +Raghavan + +Current status +-------------- + +There are many bugs related to this which indicates the current status +and why these requirements are required. + +​1) 924062 - gluster volume heal info shows only gfids in some cases and +sometimes names. This is very confusing for the end user. + +​2) 852294 - gluster volume heal info hangs/crashes when there is a +large number of entries to be healed. + +​3) 883698 - when self heal daemon is turned off, heal info does not +show any output. But healing can happen because of lookups from IO path. +Hence list of entries to be healed still needs to be shown. 
+ +​4) 921025 - directories are not reported when list of split brain +entries needs to be displayed. + +​5) 981185 - when self heal daemon process is offline, volume heal info +gives error as "staging failure" + +​6) 952084 - We need a command to resolve files in split brain state. + +​7) 986309 - We need to report source information for files which got +healed during a self heal session. + +​8) 986317 - Sometimes list of files to get healed also includes files +to which IO s being done since the entries for these files could be in +the xattrop directory. This could be confusing for the user. + +There is a master bug 926044 that sums up most of the above problems. It +does give the QA perspective of the current representation out of the +present reporting infrastructure. + +Detailed Description +-------------------- + +​1) One common thread among all the above complaints is that the +information presented to the user is FUD because of the following +reasons: + +(a) Split brain itself is a scary scenario especially with VMs. +(b) The data that we present to the users cannot be used in a stable + manner for them to get to the list of these files. For ex: we + need to give mechanisms by which he can automate the resolution out + of split brain. +(c) The logs that are generated are all the more scarier since we + see repetition of some error lines running into hundreds of lines. + Our mailing lists are filled with such emails from end users. + +Any data is useless unless it is associated with an event. For self +heal, the event that leads to self heal is the loss of connectivity to a +brick from a client. So all healing info and especially split brain +should be associated with such events. + +The following is hence the proposed mechanism: + +(a) Every loss of a brick from client's perspective is logged and + available via some ID. The information provides the time from when + the brick went down to when it came up. Also it should also report + the number of IO transactions(modifies) that hapenned during this + event. +(b) The list of these events are available via some CLI command. The + actual command needs to be detailed as part of this feature. +(c) All volume info commands regarding list of files to be healed, + files healed and split brain files should be associated with this + event(s). + +​2) Provide a mechanism to show statistics at a volume and replica group +level. It should show the number of files to be healed and number of +split brain files at both the volume and replica group level. + +​3) Provide a mechanism to show per volume list of files to be +healed/files healed/split brain in the following info: + +This should have the following information: + +(a) File name +(b) Bricks location +(c) Event association (brick going down) +(d) Source +(v) Sink + +​4) Self heal crawl statistics - Introduce new CLI commands for showing +more information on self heal crawl per volume. + +(a) Display why a self heal crawl ran (timeouts, brick coming up) +(b) Start time and end time +(c) Number of files it attempted to heal +(d) Location of the self heal daemon + +​5) Scale the logging infrastructure to handle huge number of file list +that needs to be displayed as part of the logging. + +(a) Right now the system crashes or hangs in case of a high number + of files. +(b) It causes CLI timeouts arbitrarily. The latencies involved in + the logging have to be studied (profiled) and mechanisms to + circumvent them have to be introduced. +(c) All files are displayed on the output. 
Have a better way of representing them.

Options are:

(a) Maybe write to a glusterd log file or have a separate directory
    for afr heal logs.
(b) Have a status kind of command. This will display the current
    status of the log building and perhaps a batched way of
    representing it when there is a huge list.

6) We should provide a mechanism by which the user can heal split
brain according to some pre-established policies:

(a) Let the system figure out the latest files (assuming all nodes
    are in time sync) and choose the copies that have the latest time.
(b) Choose one particular brick as the source for split brain and
    heal all split brains from this brick.
(c) Just remove the split brain information from the changelog. We
    leave the exercise of repairing split brain to the user, who would
    rewrite to the split-brained files. (Right now the user is forced
    to remove xattrs manually for this step.)

Benefits to GlusterFS
---------------------

Makes the end user more aware of healing status and provides statistics.

Scope
-----

6.1. Nature of proposed change

Modifications to AFR, CLI and glusterd code.

6.2. Implications on manageability

New CLI commands to be added. Existing commands to be improved.

6.3. Implications on presentation layer

N/A

6.4. Implications on persistence layer

N/A

6.5. Implications on 'GlusterFS' backend

N/A

6.6. Modification to GlusterFS metadata

N/A

6.7. Implications on 'glusterd'

Changes for healing-specific commands will be introduced.

How To Test
-----------

See the documentation section.

User Experience
---------------

*Changes in CLI, effect on User experience...*

Documentation
-------------



Status
------

Patches :



Status:

Merged
\ No newline at end of file
diff --git a/done/GlusterFS 3.5/Brick Failure Detection.md b/done/GlusterFS 3.5/Brick Failure Detection.md
new file mode 100644
index 0000000..9952698
--- /dev/null
+++ b/done/GlusterFS 3.5/Brick Failure Detection.md
@@ -0,0 +1,151 @@

Feature
-------

Brick Failure Detection

Summary
-------

This feature attempts to identify storage/file system failures and
disable the failed brick without disrupting the remainder of the node's
operation.

Owners
------

Vijay Bellur with help from Niels de Vos (or the other way around)

Current status
--------------

Currently, if an underlying storage or file system failure happens, a
brick process will continue to function. In some cases, a brick can hang
due to failures in the underlying system. Due to such hangs in brick
processes, applications running on glusterfs clients can hang.

Detailed Description
--------------------

Detecting failures on the filesystem that a brick uses makes it possible
to handle errors that are caused from outside of the Gluster
environment.

There have been hanging brick processes when the underlying storage of a
brick became unavailable. A hanging brick process can still use the
network and respond to clients, but actual I/O to the storage is
impossible and can cause noticeable delays on the client side.

Benefit to GlusterFS
--------------------

Provide better detection of storage subsystem failures and prevent
bricks from hanging.

Scope
-----

### Nature of proposed change

Add a health-checker to the posix xlator that periodically checks the
status of the filesystem (implies checking of functional
storage-hardware).
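As a sketch of how this surfaces to an administrator (the volume name
`myvol` below is hypothetical), the check frequency is tunable per
volume and a brick that failed the check shows up as offline:

    # gluster volume set myvol storage.health-check-interval 10
    # gluster volume status myvol

The option and its default of 30 seconds are described in the
Documentation section below.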
+ +### Implications on manageability + +When a brick process detects that the underlaying storage is not +responding anymore, the process will exit. There is no automated way +that the brick process gets restarted, the sysadmin will need to fix the +problem with the storage first. + +After correcting the storage (hardware or filesystem) issue, the +following command will start the brick process again: + + # gluster volume start force + +### Implications on presentation layer + +None + +### Implications on persistence layer + +None + +### Implications on 'GlusterFS' backend + +None + +### Modification to GlusterFS metadata + +None + +### Implications on 'glusterd' + +'glusterd' can detect that the brick process has exited, +`gluster volume status` will show that the brick process is not running +anymore. System administrators checking the logs should be able to +triage the cause. + +How To Test +----------- + +The health-checker thread that is part of each brick process will get +started automatically when a volume has been started. Verifying its +functionality can be done in different ways. + +On virtual hardware: + +- disconnect the disk from the VM that holds the brick + +On real hardware: + +- simulate a RAID-card failure by unplugging the card or cables + +On a system that uses LVM for the bricks: + +- use device-mapper to load an error-table for the disk, see [this + description](http://review.gluster.org/5176). + +On any system (writing to random offsets of the block device, more +difficult to trigger): + +1. cause corruption on the filesystem that holds the brick +2. read contents from the brick, hoping to hit the corrupted area +3. the filsystem should abort after hitting a bad spot, the + health-checker should notice that shortly afterwards + +User Experience +--------------- + +No more hanging brick processes when storage-hardware or the filesystem +fails. + +Dependencies +------------ + +Posix translator, not available for the BD-xlator. + +Documentation +------------- + +The health-checker is enabled by default and runs a check every 30 +seconds. This interval can be changed per volume with: + + # gluster volume set storage.health-check-interval + +If `SECONDS` is set to 0, the health-checker will be disabled. + +For further details refer: + + +Status +------ + +glusterfs-3.4 and newer include a health-checker for the posix xlator, +which was introduced with [bug +971774](https://bugzilla.redhat.com/971774): + +- [posix: add a simple + health-checker](http://review.gluster.org/5176)? + +Comments and Discussion +----------------------- diff --git a/done/GlusterFS 3.5/Disk Encryption.md b/done/GlusterFS 3.5/Disk Encryption.md new file mode 100644 index 0000000..4c6ab89 --- /dev/null +++ b/done/GlusterFS 3.5/Disk Encryption.md @@ -0,0 +1,443 @@ +Feature +======= + +Transparent encryption. Allows a volume to be encrypted "at rest" on the +server using keys only available on the client. + +1 Summary +========= + +Distributed systems impose tighter requirements to at-rest encryption. +This is because your encrypted data will be stored on servers, which are +de facto untrusted. In particular, your private encrypted data can be +subjected to analysis and tampering, which eventually will lead to its +revealing, if it is not properly protected. Specifically, usually it is +not enough to just encrypt data. In distributed systems serious +protection of your personal data is possible only in conjunction with a +special process, which is called authentication. 
GlusterFS provides such +enhanced service: In GlusterFS encryption is enhanced with +authentication. Currently we provide protection from "silent tampering". +This is a kind of tampering, which is hard to detect, because it doesn't +break POSIX compliance. Specifically, we protect encryption-specific +file's metadata. Such metadata includes unique file's object id (GFID), +cipher algorithm id, cipher block size and other attributes used by the +encryption process. + +1.1 Restrictions +---------------- + +​1. We encrypt only file content. The feature of transparent encryption +doesn't protect file names: they are neither encrypted, nor verified. +Protection of file names is not so critical as protection of +encryption-specific file's metadata: any attacks based on tampering file +names will break POSIX compliance and result in massive corruption, +which is easy to detect. + +​2. The feature of transparent encryption doesn't work in NFS-mounts of +GlusterFS volumes: NFS's file handles introduce security issues, which +are hard to resolve. NFS mounts of encrypted GlusterFS volumes will +result in failed file operations (see section "Encryption in different +types of mount sessions" for more details). + +​3. The feature of transparent encryption is incompatible with GlusterFS +performance translators quick-read, write-behind and open-behind. + +2 Owners +======== + +Jeff Darcy +Edward Shishkin + +3 Current status +================ + +Merged to the upstream. + +4 Detailed Description +====================== + +See Summary. + +5 Benefit to GlusterFS +====================== + +Besides the justifications that have applied to on-disk encryption just +about forever, recent events have raised awareness significantly. +Encryption using keys that are physically present at the server leaves +data vulnerable to physical seizure of the server. Encryption using keys +that are kept by the same organization entity leaves data vulnerable to +"insider threat" plus coercion or capture at the organization level. For +many, especially various kinds of service providers, only pure +client-side encryption provides the necessary levels of privacy and +deniability. + +Competitively, other projects - most notably +[Tahoe-LAFS](https://leastauthority.com/) - are already using recently +heightened awareness of these issues to attract users who would be +better served by our performance/scalability, usability, and diversity +of interfaces. Only the lack of proper encryption holds us back in these +cases. + +6 Scope +======= + +6.1. Nature of proposed change +------------------------------ + +This is a new client-side translator, using user-provided key +information plus information stored in xattrs to encrypt data +transparently as it's written and decrypt when it's read. + +6.2. Implications on manageability +---------------------------------- + +User needs to manage a per-volume master key (MK). That is: + +​1) Generate an independent MK for every volume which is to be +encrypted. Note, that one MK is created for the whole life of the +volume. + +​2) Provide MK on the client side at every mount in accordance with the +location, which has been specified at volume create time, or overridden +via respective mount option (see section How To Test). + +​3) Keep MK between mount sessions. Note that after successful mount MK +may be removed from the specified location. In this case user should +retain MK safely till next mount session. + +MK is a 256-bit secret string, which is known only to user. 
Generating and retaining MK is the user's responsibility.

WARNING!!! Losing MK will make the content of all regular files of your
volume inaccessible. It is possible to mount a volume with an improper
MK, however such mount sessions will allow access only to file names, as
they are not encrypted.

Recommendations on MK generation

MK has to be a high-entropy key, appropriately generated by a key
derivation algorithm. One possible way is using rand(1) provided by the
OpenSSL package. You need to specify the option "-hex" for proper
output format. For example, the next command prints a generated key to
the standard output:

    $ openssl rand -hex 32

6.3. Implications on presentation layer
---------------------------------------

N/A

6.4. Implications on persistence layer
--------------------------------------

N/A

6.5. Implications on 'GlusterFS' backend
----------------------------------------

All encrypted files on the servers contain padding at the end of the
file; that is, the size of every encrypted file on the servers is a
multiple of the cipher block size. The real file size is stored as a
file xattr with the key "trusted.glusterfs.crypt.att.size". The
translation padded-file-size -> real-file-size (and backward) is
performed by the crypt translator.

6.6. Modification to GlusterFS metadata
---------------------------------------

Encryption-specific metadata in a specified format is stored as a file
xattr with the key "trusted.glusterfs.crypt.att.cfmt". The current
format of the metadata string is described in slide #27 of the following
[design document](http://www.gluster.org/community/documentation/index.php/File:GlusterFS_transparent_encryption.pdf)

6.7. Options of the crypt translator
------------------------------------

- data-cipher-alg

Specifies the cipher algorithm for file data encryption. Currently only
one option is available: AES_XTS. This is a hidden option.

- block-size

Specifies the size (in bytes) of the logical chunk which is encrypted as
a whole unit in the file body. If cipher modes with initial vectors are
used for encryption, then the initial vector gets reset for every such
chunk. Available values are: "512", "1024", "2048" and "4096". Default
value is "4096".

- data-key-size

Specifies the size (in bits) of the data cipher key. For AES_XTS the
available values are: "256" and "512". Default value is "256". The
larger key size ("512") is for stronger security.

- master-key

Specifies the pathname of a regular file, or symlink. Defines the
location of the master volume key on the trusted client machine.

7 Getting Started With Crypt Translator
=======================================

1. Create a volume <vol-name>.

2. Turn on the crypt xlator:

    # gluster volume set <vol-name> encryption on

3. Turn off the performance xlators that encryption is currently
incompatible with:

    # gluster volume set <vol-name> performance.quick-read off
    # gluster volume set <vol-name> performance.write-behind off
    # gluster volume set <vol-name> performance.open-behind off

4. (optional) Set the location of the volume master key:

    # gluster volume set <vol-name> encryption.master-key <master-key-location>

where <master-key-location> is an absolute pathname of the file which
will contain the volume master key (see section implications on
manageability).

5. (optional) Override default options of the crypt xlator:

    # gluster volume set <vol-name> encryption.data-key-size <key-size>

where <key-size> should have one of the following values:
"256" (default), "512".
+ + # gluster volume set  encryption.block-size  + +where should have one of the following values: "512", +"1024", "2048", "4096"(default). + +​6. Define location of the master key on your client machine, if it +wasn't specified at section 4 above, or you want it to be different from +the , specified at section 4. + +​7. On the client side make sure that the file with name + (or defined at section +6) exists and contains respective per-volume master key (see section +implications on manageability). This key has to be in hex form, i.e. +should be represented by 64 symbols from the set {'0', ..., '9', 'a', +..., 'f'}. The key should start at the beginning of the file. All +symbols at offsets \>= 64 are ignored. + +NOTE: (or defined at +step 6) can be a symlink. In this case make sure that the target file of +this symlink exists and contains respective per-volume master key. + +​8. Mount the volume on the client side as usual. If you +specified a location of the master key at section 6, then use the mount +option + +--xlator-option=.master-key= + +where is location of master key specified at +section 6, is suffixed with "-crypt". For +example, if you created a volume "myvol" in the step 1, then +suffixed\_vol\_name is "myvol-crypt". + +​9. During mount your client machine receives configuration info from +the untrusted server, so this step is extremely important! Check, that +your volume is really encrypted, and that it is encrypted with the +proper master key (see FAQ \#1,\#2). + +​10. (optional) After successful mount the file which contains master +key may be removed. NOTE: Next mount session will require the master-key +again. Keeping the master key between mount sessions is in user's +competence (see section implications on manageability). + +8 How to test +============= + +From a correctness standpoint, it's sufficient to run normal tests with +encryption enabled. From a security standpoint, there's a whole +discipline devoted to analysing the stored data for weaknesses, and +engagement with practitioners of that discipline will be necessary to +develop the right tests. + +9 Dependencies +============== + +Crypt translator requires OpenSSL of version \>= 1.0.1 + +10 Documentation +================ + +10.1 Basic design concepts +-------------------------- + +The basic design concepts are described in the following [pdf +slides](http://www.gluster.org/community/documentation/index.php/File:GlusterFS_transparent_encryption.pdf) + +10.2 Procedure of security open +------------------------------- + +So, in accordance with the basic design concepts above, before every +access to a file's body (by read(2), write(2), truncate(2), etc) we need +to make sure that the file's metadata is trusted. Otherwise, we risk to +deal with untrusted file's data. + +To make sure that file's metadata is trusted, file is subjected to a +special procedure of security open. The procedure of security open is +performed by crypt translator at FOP-\>open() (crypt\_open) time by the +function open\_format(). Currently this is a hardcoded composition of 2 +checks: + +1. verification of file's GFID by the file name; +2. verification of file's metadata by the verified GFID; + +If the security open succeeds, then the cache of trusted client machine +is replenished with file descriptor and file's inode, and user can +access the file's content by read(2), write(2), ftruncate(2), etc. +system calls, which accept file descriptor as argument. + +However, file API also allows to accept file body without opening the +file. 
For example, truncate(2), which accepts pathname instead of file +descriptor. To make sure that file's metadata is trusted, we create a +temporal file descriptor and mandatory call crypt\_open() before +truncating the file's body. + +10.3 Encryption in different types of mount sessions +---------------------------------------------------- + +Everything described in the section above is valid only for FUSE-mounts. +Besides, GlusterFS also supports so-called NFS-mounts. From the +standpoint of security the key difference between the mentioned types of +mount sessions is that in NFS-mount sessions file operations instead of +file name accept a so-called file handle (which is actually GFID). It +creates problems, since the file name is a basic point for verification. +As it follows from the section above, using the step 1, we can replenish +the cache of trusted machine with trusted file handles (GFIDs), and +perform a security open only by trusted GFID (by the step 2). However, +in this case we need to make sure that there is no leaks of non-trusted +GFIDs (and, moreover, such leaks won't be introduced by the development +process in future). This is possible only with changed GFID format: +everywhere in GlusterFS GFID should appear as a pair (uuid, +is\_verified), where is\_verified is a boolean variable, which is true, +if this GFID passed off the procedure of verification (step 1 in the +section above). + +The next problem is that current NFS protocol doesn't encrypt the +channel between NFS client and NFS server. It means that in NFS-mounts +of GlusterFS volumes NFS client and GlusterFS client should be the same +(trusted) machine. + +Taking into account the described problems, encryption in GlusterFS is +not supported in NFS-mount sessions. + +10.4 Class of cipher algorithms for file data encryption that can be supported by the crypt translator +------------------------------------------------------------------------------------------------------ + +We'll assume that any symmetric block cipher algorithm is completely +determined by a pair (alg\_id, mode\_id), where alg\_id is an algorithm +defined on elementary cipher blocks (e.g. AES), and mode\_id is a mode +of operation (e.g. ECB, XTS, etc). + +Technically, the crypt translator is able to support any symmetric block +cipher algorithms via additional options of the crypt translator. +However, in practice the set of supported algorithms is narrowed because +of various security and organization issues. Currently we support only +one algotithm. This is AES\_XTS. + +10.5 Bibliography +----------------- + +1. Recommendations for for Block Cipher Modes of Operation (NIST + Special Publication 800-38A). +2. Recommendation for Block Cipher Modes of Operation: The XTS-AES Mode + for Confidentiality on Storage Devices (NIST Special Publication + 800-38E). +3. Recommendation for Key Derivation Using Pseudorandom Functions, + (NIST Special Publication 800-108). +4. Recommendation for Block Cipher Modes of Operation: The CMAC Mode + for Authentication, (NIST Special Publication 800-38B). +5. Recommendation for Block Cipher Modes of Operation: Methods for Key + Wrapping, (NIST Special Publication 800-38F). +6. FIPS PUB 198-1 The Keyed-Hash Message Authentication Code (HMAC). +7. David A. McGrew, John Viega "The Galois/Counter Mode of Operation + (GCM)". + +11 FAQ +====== + +**1. How to make sure that my volume is really encrypted?** + +Check the respective graph of translators on your trusted client +machine. 
This graph is created at mount time and is stored by default in +the file /usr/local/var/log/glusterfs/mountpoint.log + +Here "mountpoint" is the absolute name of the mountpoint, where "/" are +replaced with "-". For example, if your volume is mounted to +/mnt/testfs, then you'll need to check the file +/usr/local/var/log/glusterfs/mnt-testfs.log + +Make sure that this graph contains the crypt translator, which looks +like the following: + + 13: volume xvol-crypt + 14:     type encryption/crypt + 15:     option master-key /home/edward/mykey + 16:     subvolumes xvol-dht + 17: end-volume + +**2. How to make sure that my volume is encrypted with a proper master +key?** + +Check the graph of translators on your trusted client machine (see the +FAQ\#1). Make sure that the option "master-key" of the crypt translator +specifies correct location of the master key on your trusted client +machine. + +**3. Can I change the encryption status of a volume?** + +You can change encryption status (enable/disable encryption) only for +empty volumes. Otherwise it will be incorrect (you'll end with IO +errors, data corruption and security problems). We strongly recommend to +decide once and forever at volume creation time, whether your volume has +to be encrypted, or not. + +**4. I am able to mount my encrypted volume with improper master keys +and get list of file names for every directory. Is it normal?** + +Yes, it is normal. It doesn't contradict the announced functionality: we +encrypt only file's content. File names are not encrypted, so it doesn't +make sense to hide them on the trusted client machine. + +**5. What is the reason for only supporting AES-XTS? This mode is not +using Intel's AES-NI instruction thus not utilizing hardware feature..** + +Distributed file systems impose tighter requirements to at-rest +encryption. We offer more than "at-rest-encryption". We offer "at-rest +encryption and authentication in distributed systems with non-trusted +servers". Data and metadata on the server can be easily subjected to +tampering and analysis with the purpose to reveal secret user's data. +And we have to resist to this tampering by performing data and metadata +authentication. + +Unfortunately, it is technically hard to implement full-fledged data +authentication via a stackable file system (GlusterFS translator), so we +have decided to perform a "light" authentication by using a special +cipher mode, which is resistant to tampering. Currently OpenSSL supports +only one such mode: this is XTS. Tampering of ciphertext created in XTS +mode will lead to unpredictable changes in the plain text. That said, +user will see "unpredictable gibberish" on the client side. Of course, +this is not an "official way" to detect tampering, but this is much +better than nothing. The "official way" (creating/checking MACs) we use +for metadata authentication. + +Other modes like CBC, CFB, OFB, etc supported by OpenSSL are strongly +not recommended for use in distributed systems with non-trusted servers. +For example, CBC mode doesn't "survive" overwrite of a logical block in +a file. It means that with every such overwrite (standard file system +operation) we'll need to re-encrypt the whole(!) file with different +key. CFB and OFB modes are sensitive to tampering: there is a way to +perform \*predictable\* changes in plaintext, which is unacceptable. + +Yes, XTS is slow (at least its current implementation in OpenSSL), but +we don't promise, that CFB, OFB with full-fledged authentication will be +faster. So.. 
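To recap the setup flow from section 7 as one hedged example (the
volume name `myvol` and server name `server1` are illustrative;
`/home/edward/mykey` matches the key path shown in FAQ 1):

    $ openssl rand -hex 32 > /home/edward/mykey
    # gluster volume set myvol encryption on
    # gluster volume set myvol performance.quick-read off
    # gluster volume set myvol performance.write-behind off
    # gluster volume set myvol performance.open-behind off
    # gluster volume set myvol encryption.master-key /home/edward/mykey
    # mount -t glusterfs server1:/myvol /mnt/myvol

After mounting, verify the crypt translator and the master-key option
in the client volfile log, as described in FAQ 1 and 2 above.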
diff --git a/done/GlusterFS 3.5/Exposing Volume Capabilities.md b/done/GlusterFS 3.5/Exposing Volume Capabilities.md new file mode 100644 index 0000000..0f72fbc --- /dev/null +++ b/done/GlusterFS 3.5/Exposing Volume Capabilities.md @@ -0,0 +1,161 @@ +Feature +------- + +Provide a capability to: + +- Probe the type (posix or bd) of volume. +- Provide list of capabilities of a xlator/volume. For example posix + xlator could support zerofill, BD xlator could support offloaded + copy, thin provisioning etc + +Summary +------- + +With multiple storage translators (posix and bd) being supported in +GlusterFS, it becomes necessary to know the volume type so that user can +issue appropriate calls that are relevant only to the a given volume +type. Hence there needs to be a way to expose the type of the storage +translator of the volume to the user. + +BD xlator is capable of providing server offloaded file copy, +server/storage offloaded zeroing of a file etc. This capabilities should +be visible to the client/user, so that these features can be exploited. + +Owners +------ + +M. Mohan Kumar +Bharata B Rao. + +Current status +-------------- + +BD xlator exports capability information through gluster volume info +(and --xml) output. For eg: + +*snip of gluster volume info output for a BD based volume* + + Xlator 1: BD + Capability 1: thin + +*snip of gluster volume info --xml output for a BD based volume* + + +    +     BD +      +       thin +      +    + + +But this capability information should also exposed through some other +means so that a host which is not part of Gluster peer could also avail +this capabilities. + +Exposing about type of volume (ie posix or BD) is still in conceptual +state currently and needs discussion. + +Detailed Description +-------------------- + +1. Type +- BD translator supports both regular files and block device, +i,e., one can create files on GlusterFS volume backed by BD +translator and this file could end up as regular posix file or a +logical volume (block device) based on the user's choice. User +can do a setxattr on the created file to convert it to a logical +volume. +- Users of BD backed volume like QEMU would like to know that it +is working with BD type of volume so that it can issue an +additional setxattr call after creating a VM image on GlusterFS +backend. This is necessary to ensure that the created VM image +is backed by LV instead of file. +- There are different ways to expose this information (BD type of +volume) to user. One way is to export it via a getxattr call. + +2. Capabilities +- BD xlator supports new features such as server offloaded file +copy, thin provisioned VM images etc (there is a patch posted to +Gerrit to add server offloaded file zeroing in posix xlator). +There is no standard way of exploiting these features from +client side (such as syscall to exploit server offloaded copy). +So these features need to be exported to the client so that they +can be used. BD xlator V2 patch exports these capabilities +information through gluster volume info (and --xml) output. But +if a client is not part of GlusterFS peer it can't run volume +info command to get the list of capabilities of a given +GlusterFS volume. Also GlusterFS block driver in qemu need to +get the capability list so that these features are used. + +Benefit to GlusterFS +-------------------- + +Enables proper consumption of BD xlator and client exploits new features +added in both posix and BD xlator. 
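A hedged sketch of how a client might consume this interface (the xattr
key names are the ones proposed in the Scope section below and are not
final; the mount point `/mnt/myvol` is hypothetical):

    $ getfattr -n volume_type /mnt/myvol    # BD xlator would answer; posix returns ENODATA
    $ getfattr -n caps /mnt/myvol           # proposed capability-list query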
+ +### Scope + +Nature of proposed change +------------------------- + +- Quickest way to expose volume type to a client can be achieved by + using getxattr fop. When a client issues getxattr("volume\_type") on + a root gfid, bd xlator will return 1 implying its BD xlator. But + posix xlator will return ENODATA and client code can interpret this + as posix xlator. + +- Also capability list can be returned via getxattr("caps") for root + gfid. + +Implications on manageability +----------------------------- + +None. + +Implications on presentation layer +---------------------------------- + +N/A + +Implications on persistence layer +--------------------------------- + +N/A + +Implications on 'GlusterFS' backend +----------------------------------- + +N/A + +Modification to GlusterFS metadata +---------------------------------- + +N/A + +Implications on 'glusterd' +-------------------------- + +N/A + +How To Test +----------- + +User Experience +--------------- + +Dependencies +------------ + +Documentation +------------- + +Status +------ + +Patch : + +Status : Merged + +Comments and Discussion +----------------------- diff --git a/done/GlusterFS 3.5/File Snapshot.md b/done/GlusterFS 3.5/File Snapshot.md new file mode 100644 index 0000000..b2d6c69 --- /dev/null +++ b/done/GlusterFS 3.5/File Snapshot.md @@ -0,0 +1,101 @@ +Feature +------- + +File Snapshots in GlusterFS + +### Summary + +Ability to take snapshots of files in GlusterFS + +### Owners + +Anand Avati + +### Source code + +Patch for this feature - + +### Detailed Description + +The feature adds file snapshotting support to GlusterFS. '' To use this +feature the file format should be QCOW2 (from QEMU)'' . The patch takes +the block layer code from Qemu and converts it into a translator in +gluster. + +### Benefit to GlusterFS + +Better integration with Openstack Cinder, and in general ability to take +snapshots of files (typically VM images) + +### Usage + +*To take snapshot of a file, the file format should be QCOW2. To set +file type as qcow2 check step \#2 below* + +​1. Turning on snapshot feature : + + gluster volume set `` features.file-snapshot on + +​2. To set qcow2 file format: + + setfattr -n trusted.glusterfs.block-format -v qcow2:10GB  + +​3. To create a snapshot: + + setfattr -n trusted.glusterfs.block-snapshot-create -v  + +​4. To apply/revert back to a snapshot: + + setfattr -n trusted.glusterfs.block-snapshot-goto -v   + +### Scope + +#### Nature of proposed change + +The work is going to be a new translator. Very minimal changes to +existing code (minor change in syncops) + +#### Implications on manageability + +Will need ability to load/unload the translator in the stack. + +#### Implications on presentation layer + +Feature must be presentation layer independent. + +#### Implications on persistence layer + +No implications + +#### Implications on 'GlusterFS' backend + +Internal snapshots - No implications. External snapshots - there will be +hidden directories added. + +#### Modification to GlusterFS metadata + +New xattr will be added to identify files which are 'snapshot managed' +vs raw files. + +#### Implications on 'glusterd' + +Yet another turn on/off feature for glusterd. Volgen will have to add a +new translator in the generated graph. + +### How To Test + +Snapshots can be tested by taking snapshots along with checksum of the +state of the file, making further changes and going back to old snapshot +and verify the checksum again. + +### Dependencies + +Dependent QEMU code is imported into the codebase. 
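As a hedged end-to-end illustration of the Usage steps above (the
volume name `myvol`, mount point and snapshot name are hypothetical):

    # gluster volume set myvol features.file-snapshot on
    $ setfattr -n trusted.glusterfs.block-format -v qcow2:10GB /mnt/myvol/vm1.img
    $ setfattr -n trusted.glusterfs.block-snapshot-create -v snap1 /mnt/myvol/vm1.img
    $ setfattr -n trusted.glusterfs.block-snapshot-goto -v snap1 /mnt/myvol/vm1.img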
+ +### Documentation + + + +### Status + +Merged in master and available in Gluster3.5 \ No newline at end of file diff --git a/done/GlusterFS 3.5/Onwire Compression-Decompression.md b/done/GlusterFS 3.5/Onwire Compression-Decompression.md new file mode 100644 index 0000000..a26aa7a --- /dev/null +++ b/done/GlusterFS 3.5/Onwire Compression-Decompression.md @@ -0,0 +1,96 @@ +Feature +======= + +On-Wire Compression/Decompression + +1. Summary +========== + +Translator to compress/decompress data in flight between client and +server. + +2. Owners +========= + +- Venky Shankar +- Prashanth Pai + +3. Current Status +================= + +Code has already been merged. Needs more testing. + +The [initial submission](http://review.gluster.org/3251) contained a +`compress` option, which introduced [some +confusion](https://bugzilla.redhat.com/1053670). [A correction has been +sent](http://review.gluster.org/6765) to rename the user visible options +to start with `network.compression`. + +TODO + +- Make xlator pluggable to add support for other compression methods +- Add support for lz4 compression: + +4. Detailed Description +======================= + +- When a writev call occurs, the client compresses the data before + sending it to server. On the server, compressed data is + decompressed. Similarly, when a readv call occurs, the server + compresses the data before sending it to client. On the client, the + compressed data is decompressed. Thus the amount of data sent over + the wire is minimized. + +- Compression/Decompression is done using Zlib library. + +- During normal operation, this is the format of data sent over wire: + + trailer(8 bytes). The trailer contains the CRC32 + checksum and length of original uncompressed data. This is used for + validation. + +5. Usage +======== + +Turning on compression xlator: + + # gluster volume set  network.compression on + +Configurable options: + + # gluster volume set  network.compression.compression-level 8 + # gluster volume set  network.compression.min-size 50 + +6. Benefits to GlusterFS +======================== + +Fewer bytes transferred over the network. + +7. Issues +========= + +- Issues with striped volumes. Compression xlator cannot work with + striped volumes + +- Issues with write-behind: Mount point hangs when writing a file with + write-behind xlator turned on. To overcome this, turn off + write-behind entirely OR set "performance.strict-write-ordering" to + on. + +- Issues with AFR: AFR v1 currently does not propagate xdata. + This issue has + been resolved in AFR v2. + +8. Dependencies +=============== + +Zlib library + +9. Documentation +================ + + + +10. Status +========== + +Code merged upstream. \ No newline at end of file diff --git a/done/GlusterFS 3.5/Quota Scalability.md b/done/GlusterFS 3.5/Quota Scalability.md new file mode 100644 index 0000000..f3b0a0d --- /dev/null +++ b/done/GlusterFS 3.5/Quota Scalability.md @@ -0,0 +1,99 @@ +Feature +------- + +Quota Scalability + +Summary +------- + +Support upto 65536 quota configurations per volume. + +Owners +------ + +Krishnan Parthasarathi +Vijay Bellur + +Current status +-------------- + +Current implementation of Directory Quota cannot scale beyond a few +hundred configured limits per volume. The aim of this feature is to +support upto 65536 quota configurations per volume. 
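For context, a hedged sketch of the per-directory quota workflow whose
configuration count this feature scales (volume and path names are
hypothetical):

    # gluster volume quota myvol enable
    # gluster volume quota myvol limit-usage /home/user1 10GB
    # gluster volume quota myvol list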
+ +Detailed Description +-------------------- + +TBD + +Benefit to GlusterFS +-------------------- + +More quotas can be configured in a single volume thereby leading to +support GlusterFS for use cases like home directory. + +Scope +----- + +### Nature of proposed change + +- Move quota enforcement translator to the server +- Introduce a new quota daemon which helps in aggregating directory + consumption on the server +- Enhance marker's accounting to be modular +- Revamp configuration persistence and CLI listing for better scale +- Allow configuration of soft limits in addition to hard limits. + +### Implications on manageability + +Mostly the CLI will be backward compatible. New CLI to be introduced +needs to be enumerated here. + +### Implications on presentation layer + +None + +### Implications on persistence layer + +None + +### Implications on 'GlusterFS' backend + +None + +### Modification to GlusterFS metadata + +- Addition of a new extended attribute for storing configured hard and +soft limits on directories. + +### Implications on 'glusterd' + +- New file based configuration persistence + +How To Test +----------- + +TBD + +User Experience +--------------- + +TBD + +Dependencies +------------ + +None + +Documentation +------------- + +TBD + +Status +------ + +In development + +Comments and Discussion +----------------------- diff --git a/done/GlusterFS 3.5/Virt store usecase.md b/done/GlusterFS 3.5/Virt store usecase.md new file mode 100644 index 0000000..3e649b2 --- /dev/null +++ b/done/GlusterFS 3.5/Virt store usecase.md @@ -0,0 +1,140 @@ + Work In Progress + Author - Satheesaran Sundaramoorthi + + +**Introduction** +---------------- + +Gluster volumes are used to host Virtual Machines Images. (i.e) Virtual +machines Images are stored on gluster volumes. This usecase is popularly +known as *virt-store* usecase. + +This document explains more about, + +1. Enabling gluster volumes for virt-store usecase +2. Common Pitfalls +3. FAQs +4. References + +**Enabling gluster volumes for virt-store** +------------------------------------------- + +This section describes how to enable gluster volumes for virt store +usecase + +#### Volume Types + +Ideally gluster volumes serving virt-store, should provide +high-availability for the VMs running on it. If the volume is not +avilable, the VMs may move in to unusable state. So, its best +recommended to use **replica** or **distribute-replicate** volume for +this usecase + +*If you are new to GlusterFS, you can take a look at +[QuickStart](http://gluster.readthedocs.org/en/latest/Quick-Start-Guide/Quickstart/) guide or the [admin +guide](http://gluster.readthedocs.org/en/latest/Administrator%20Guide/README/)* + +#### Tunables + +The set of volume options are recommended for virt-store usecase, which +adds performance boost. Following are those options, + + quick-read=off + read-ahead=off + io-cache=off + stat-prefetch=off + eager-lock=enable + remote-dio=enable + quorum-type=auto + server-quorum-type=server + +- quick-read is meant for improving small-file read performance,which + is no longer reasonable for VM Image files +- read-ahead is turned off. VMs have their own way of doing that. This + is pretty usual to leave it to VM to determine the read-ahead +- io-cache is turned off +- stat-prefetch is turned off. stat-prefetch, caches the metadata + related to files and this is no longer a concern for VM Images (why + ?) +- eager-lock is turned on (why?) 
+- remote-dio is turned on,so in open() and creat() calls, O\_DIRECT + flag will be filtered at the client protocol level so server will + still continue to cache the file. +- quorum-type is set to auto. This basically enables client side + quorum. When client side quorum is enabled, there exists the rule + such that atleast half of the bricks in the replica group should be + UP and running. If not, the replica group would become read-only +- server-quorum-type is set to server. This basically enables + server-side quorum. This lays a condition that in a cluster, atleast + half the number of nodes, should be UP. If not the bricks ( read as + brick processes) will be killed, and thereby volume goes offline + +#### Applying the Tunables on the volume + +There are number of ways to do it. + +1. Make use of group-virt.example file +2. Copy & Paste + +##### Make use of group-virt.example file + +This is the method best suited and recommended. +*/etc/glusterfs/group-virt.example* has all options recommended for +virt-store as explained earlier. Copy this file, +*/etc/glusterfs/group-virt.example* to */var/lib/glusterd/groups/virt* + + cp /etc/glusterfs/group-virt.example /var/lib/glusterd/groups/virt + +Optimize the volume with all the options available in this *virt* file +in a single go + + gluster volume set group virt + +NOTE: No restart of the volume is required Verify the same with the +command, + + gluster volume info + +In forthcoming releases, this file will be automatically put in +*/var/lib/glusterd/groups/* and you can directly apply it on the volume + +##### Copy & Paste + +Copy all options from the above +section,[Virt-store-usecase\#Tunables](Virt-store-usecase#Tunables "wikilink") +and put in a file named *virt* in */var/lib/glusterd/groups/virt* Apply +all the options on the volume, + + gluster volume set group virt + +NOTE: This is not recommended, as the recommended volume options may/may +not change in future.Always stick to *virt* file available with the rpms + +#### Adding Ownership to Volume + +You can add uid:gid to the volume, + + gluster volume set storage.owner-uid + gluster volume set storage.owner-gid + +For example, when the volume would be accessed by qemu/kvm, you need to +add ownership as 107:107, + + gluster volume set storage.owner-uid 107 + gluster volume set storage.owner-gid 107 + +It would be 36:36 in the case of oVirt/RHEV, 165:165 in the case of +OpenStack Block Service (cinder),161:161 in case of OpenStack Image +Service (glance) is accessing this volume + +NOTE: Not setting the correct ownership may lead to "Permission Denied" +errors when accessing the image files residing on the volume + +**Common Pitfalls** +------------------- + +**FAQs** +-------- + +**References** +-------------- \ No newline at end of file diff --git a/done/GlusterFS 3.5/Zerofill.md b/done/GlusterFS 3.5/Zerofill.md new file mode 100644 index 0000000..43b279d --- /dev/null +++ b/done/GlusterFS 3.5/Zerofill.md @@ -0,0 +1,192 @@ +Feature +------- + +zerofill API for GlusterFS + +Summary +------- + +zerofill() API would allow creation of pre-allocated and zeroed-out +files on GlusterFS volumes by offloading the zeroing part to server +and/or storage (storage offloads use SCSI WRITESAME). + +Owners +------ + +Bharata B Rao +M. Mohankumar + +Current status +-------------- + +Patch on gerrit: + +Detailed Description +-------------------- + +Add support for a new ZEROFILL fop. Zerofill writes zeroes to a file in +the specified range. 
This fop will be useful when a whole file needs to +be initialized with zero (could be useful for zero filled VM disk image +provisioning or during scrubbing of VM disk images). + +Client/application can issue this FOP for zeroing out. Gluster server +will zero out required range of bytes ie server offloaded zeroing. In +the absence of this fop, client/application has to repetitively issue +write (zero) fop to the server, which is very inefficient method because +of the overheads involved in RPC calls and acknowledgements. + +WRITESAME is a SCSI T10 command that takes a block of data as input and +writes the same data to other blocks and this write is handled +completely within the storage and hence is known as offload . Linux ,now +has support for SCSI WRITESAME command which is exposed to the user in +the form of BLKZEROOUT ioctl. BD Xlator can exploit BLKZEROOUT ioctl to +implement this fop. Thus zeroing out operations can be completely +offloaded to the storage device , making it highly efficient. + +The fop takes two arguments offset and size. It zeroes out 'size' number +of bytes in an opened file starting from 'offset' position. + +Benefit to GlusterFS +-------------------- + +Benefits GlusterFS in virtualization by providing the ability to quickly +create pre-allocated and zeroed-out VM disk image by using +server/storage off-loads. + +### Scope + +Nature of proposed change +------------------------- + +An FOP supported in libgfapi and FUSE. + +Implications on manageability +----------------------------- + +None. + +Implications on presentation layer +---------------------------------- + +N/A + +Implications on persistence layer +--------------------------------- + +N/A + +Implications on 'GlusterFS' backend +----------------------------------- + +N/A + +Modification to GlusterFS metadata +---------------------------------- + +N/A + +Implications on 'glusterd' +-------------------------- + +N/A + +How To Test +----------- + +Test server offload by measuring the time taken for creating a fully +allocated and zeroed file on Posix backend. + +Test storage offload by measuring the time taken for creating a fully +allocated and zeroed file on BD backend. + +User Experience +--------------- + +Fast provisioning of VM images when GlusterFS is used as a file system +backend for KVM virtualization. + +Dependencies +------------ + +zerofill() support in BD backend depends on the new BD translator - + + +Documentation +------------- + +This feature add support for a new ZEROFILL fop. Zerofill writes zeroes +to a file in the specified range. This fop will be useful when a whole +file needs to be initialized with zero (could be useful for zero filled +VM disk image provisioning or during scrubbing of VM disk images). + +Client/application can issue this FOP for zeroing out. Gluster server +will zero out required range of bytes ie server offloaded zeroing. In +the absence of this fop, client/application has to repetitively issue +write (zero) fop to the server, which is very inefficient method because +of the overheads involved in RPC calls and acknowledgements. + +WRITESAME is a SCSI T10 command that takes a block of data as input and +writes the same data to other blocks and this write is handled +completely within the storage and hence is known as offload . Linux ,now +has support for SCSI WRITESAME command which is exposed to the user in +the form of BLKZEROOUT ioctl. BD Xlator can exploit BLKZEROOUT ioctl to +implement this fop. 
Thus zeroing out operations can be completely +offloaded to the storage device , making it highly efficient. + +The fop takes two arguments offset and size. It zeroes out 'size' number +of bytes in an opened file starting from 'offset' position. + +This feature adds zerofill support to the following areas: + +-  libglusterfs +-  io-stats +-  performance/md-cache,open-behind +-  quota +-  cluster/afr,dht,stripe +-  rpc/xdr +-  protocol/client,server +-  io-threads +-  marker +-  storage/posix +-  libgfapi + +Client applications can exploit this fop by using glfs\_zerofill +introduced in libgfapi.FUSE support to this fop has not been added as +there is no system call for this fop. + +Here is a performance comparison of server offloaded zeofill vs zeroing +out using repeated writes. + + [root@llmvm02 remote]# time ./offloaded aakash-test log 20 + + real    3m34.155s + user    0m0.018s + sys 0m0.040s + + +  [root@llmvm02 remote]# time ./manually aakash-test log 20 + + real    4m23.043s + user    0m2.197s + sys 0m14.457s +  [root@llmvm02 remote]# time ./offloaded aakash-test log 25; + + real    4m28.363s + user    0m0.021s + sys 0m0.025s + [root@llmvm02 remote]# time ./manually aakash-test log 25 + + real    5m34.278s + user    0m2.957s + sys 0m18.808s + +The argument log is a file which we want to set for logging purpose and +the third argument is size in GB . + +As we can see there is a performance improvement of around 20% with this +fop. + +Status +------ + +Patch : Status : Merged \ No newline at end of file diff --git a/done/GlusterFS 3.5/gfid access.md b/done/GlusterFS 3.5/gfid access.md new file mode 100644 index 0000000..db64076 --- /dev/null +++ b/done/GlusterFS 3.5/gfid access.md @@ -0,0 +1,89 @@ +### Instructions + +**Feature** + +'gfid-access' translator to provide access to data in glusterfs using a virtual path. + +**1 Summary** + +This particular Translator is designed to provide direct access to files in glusterfs using its gfid.'GFID' is glusterfs's inode numbers for a file to identify it uniquely. + +**2 Owners** + +Amar Tumballi  +Raghavendra G  +Anand Avati  + +**3 Current status** + +With glusterfs-3.4.0, glusterfs provides only path based access.A feature is added in 'fuse' layer in the current master branch, +but its desirable to have it as a separate translator for long time +maintenance. + +**4 Detailed Description** + +With this method, we can consume the data in changelog translator +(which is logging 'gfid' internally) very efficiently. + +**5 Benefit to GlusterFS** + +Provides a way to access files quickly with direct gfid. + +​**6. Scope** + +6.1. Nature of proposed change + +* A new translator. +* Fixes in 'glusterfsd.c' to add this translator automatically based +on mount time option. +* change to mount.glusterfs to parse this new option  +(single digit number or lines changed) + +6.2. Implications on manageability + +* No CLI required. +* mount.glusterfs script gets a new option. + +6.3. Implications on presentation layer + +* A new virtual access path is made available. But all access protocols work seemlessly, as the complexities are handled internally. + +6.4. Implications on persistence layer + +* None + +6.5. Implications on 'GlusterFS' backend + +* None + +6.6. Modification to GlusterFS metadata + +* None + +6.7. Implications on 'glusterd' + +* None + +7 How To Test + +* Mount glusterfs client with '-o aux-gfid-mount' and access files using '/mount/point/.gfid/ '. + +8 User Experience + +* A new virtual path available for users. 
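A hedged sketch of the access path (server, volume and file names are
hypothetical; `glusterfs.gfid.string` is a virtual xattr that reports a
file's GFID):

    # mount -t glusterfs -o aux-gfid-mount server1:/myvol /mnt/myvol
    # getfattr -n glusterfs.gfid.string /mnt/myvol/dir/file
    # cat "/mnt/myvol/.gfid/<gfid-printed-above>"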
+ +9 Dependencies + +* None + +10 Documentation + +This wiki. + +11 Status + +Patch sent upstream. More review comments required. (http://review.gluster.org/5497) + +12 Comments and Discussion + +Please do give comments :-) \ No newline at end of file diff --git a/done/GlusterFS 3.5/index.md b/done/GlusterFS 3.5/index.md new file mode 100644 index 0000000..e8c2c88 --- /dev/null +++ b/done/GlusterFS 3.5/index.md @@ -0,0 +1,32 @@ +GlusterFS 3.5 Release +--------------------- + +Tentative Dates: + +Latest: 13-Nov, 2014 GlusterFS 3.5.3 + +17th Apr, 2014 - 3.5.0 GA + +GlusterFS 3.5 +------------- + +### Features in 3.5.0 + +- [Features/AFR CLI enhancements](./AFR CLI enhancements.md) +- [Features/exposing volume capabilities](./Exposing Volume Capabilities.md) +- [Features/File Snapshot](./File Snapshot.md) +- [Features/gfid-access](./gfid access.md) +- [Features/On-Wire Compression + Decompression](./Onwire Compression-Decompression.md) +- [Features/Quota Scalability](./Quota Scalability.md) +- [Features/readdir ahead](./readdir ahead.md) +- [Features/zerofill](./Zerofill.md) +- [Features/Brick Failure Detection](./Brick Failure Detection.md) +- [Features/disk-encryption](./Disk Encryption.md) +- Changelog based parallel geo-replication +- Improved block device translator + +Proposing New Features +---------------------- + +New feature proposals should be built using the New Feature Template in +the GlusterFS 3.7 planning page diff --git a/done/GlusterFS 3.5/libgfapi with qemu libvirt.md b/done/GlusterFS 3.5/libgfapi with qemu libvirt.md new file mode 100644 index 0000000..2309016 --- /dev/null +++ b/done/GlusterFS 3.5/libgfapi with qemu libvirt.md @@ -0,0 +1,222 @@ + Work In Progress + Author - Satheesaran Sundaramoorthi + + +**Purpose** +----------- + +Gluster volume can be used to store VM Disk images. This usecase is +popularly known as 'Virt-Store' usecase. Earlier, gluster volume had to +be fuse mounted and images are created/accessed over the fuse mount. + +With the introduction of GlusterFS libgfapi, QEMU supports glusterfs +through libgfapi directly. This we call as *QEMU driver for glusterfs*. +These document explains about the way to make use of QEMU driver for +glusterfs + +Steps for the entire procedure could be split in to 2 views viz,the +document from + +1. Steps to be done on gluster volume side +2. Steps to be done on Hypervisor side + +**Steps to be done on gluster side** +------------------------------------ + +These are the steps that needs to be done on the gluster side Precisely +this involves + +1. Creating "Trusted Storage Pool" +2. Creating a volume +3. Tuning the volume for virt-store +4. Tuning glusterd to accept requests from QEMU +5. Tuning glusterfsd to accept requests from QEMU +6. Setting ownership on the volume +7. Starting the volume + +##### Creating "Trusted Storage Pool" + +Install glusterfs rpms on the NODE. You can create a volume with a +single node. You can also scale up the cluster, as we call as *Trusted +Storage Pool*, by adding more nodes to the cluster + + gluster peer probe  + +##### Creating a volume + +It is highly recommended to have replicate volume or +distribute-replicate volume for virt-store usecase, as it would add high +availability and fault-tolerance. Remember the plain distribute works +equally well + + gluster volume create replica 2 .. 
+
+##### Tuning the volume for virt-store
+
+There are recommended settings available for virt-store. These provide
+good performance characteristics when enabled on the volume that is
+used for *virt-store*.
+
+Refer to
+[Virt-store-usecase\#Tunables](Virt-store-usecase#Tunables "wikilink")
+for the recommended tunables, and to
+[Virt-store-usecase\#Applying\_the\_Tunables\_on\_the\_volume](Virt-store-usecase#Applying_the_Tunables_on_the_volume "wikilink")
+for applying them on the volume.
+
+##### Tuning glusterd to accept requests from QEMU
+
+By default, glusterd accepts requests only from applications that
+connect from a port number less than 1024 and blocks the rest. QEMU
+uses a port number greater than 1024, so to make glusterd accept
+requests from QEMU, edit the glusterd vol file,
+*/etc/glusterfs/glusterd.vol*, and add the following:
+
+    option rpc-auth-allow-insecure on
+
+Note: If you have installed glusterfs from source, you can find the
+glusterd vol file at */usr/local/etc/glusterfs/glusterd.vol*.
+
+Restart glusterd after adding that option:
+
+    service glusterd restart
+
+##### Tuning glusterfsd to accept requests from QEMU
+
+Enable the option *allow-insecure* on the particular volume:
+
+    gluster volume set <volname> server.allow-insecure on
+
+**IMPORTANT:** As of now (April 2, 2014) there is a bug whereby
+*allow-insecure* is not applied dynamically on a volume. You need to
+restart the volume for the change to take effect.
+
+##### Setting ownership on the volume
+
+Set qemu:qemu ownership on the volume:
+
+    gluster volume set <volname> storage.owner-uid 107
+    gluster volume set <volname> storage.owner-gid 107
+
+**IMPORTANT:** The UID and GID can differ per Linux distribution, or
+even per installation. The UID/GID should be the one from the *qemu*
+or *kvm* user; you can get the IDs with these commands:
+
+    id qemu
+    getent group kvm
+
+##### Starting the volume
+
+Start the volume:
+
+    gluster volume start <volname>
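+
+Before moving to the hypervisor, it can help to confirm that the
+options above took effect and that the volume is running (a sketch;
+the volume name is the illustrative one used earlier):
+
+    # should list server.allow-insecure and the storage.owner-* options
+    gluster volume info vmstore
+
+    # confirms the volume is started and its bricks are online
+    gluster volume status vmstore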
+
+**Steps to be done on Hypervisor Side**
+---------------------------------------
+
+The hypervisor is the machine which spawns the virtual machines. It
+should preferably be a bare-metal machine with ample memory and
+computing power. The following steps need to be done on the
+hypervisor:
+
+1. Install qemu-kvm
+2. Install libvirt
+3. Create a VM image
+4. Add ownership to the image file
+5. Create a libvirt XML to define the Virtual Machine
+6. Define the VM
+7. Start the VM
+8. Verification
+
+##### Install qemu-kvm
+
+##### Install libvirt
+
+##### Create a VM Image
+
+Images can be created using the *qemu-img* utility:
+
+    qemu-img create -f <format> gluster://<server>/<volname>/<image> <size>
+
+- format - This can be raw or qcow2
+- server - One of the gluster node's IPs or FQDNs
+- volname - gluster volume name
+- image - image file name
+- size - size of the image
+
+Here is a sample:
+
+    qemu-img create -f qcow2 gluster://host.sample.com/vol1/vm1.img 10G
+
+##### Add ownership to the Image file
+
+NFS- or FUSE-mount the glusterfs volume and change the ownership of
+the image file to qemu:qemu:
+
+    mount -t nfs -o vers=3 <server>:/<volname> <mountpoint>
+
+Change the ownership of the image file that was earlier created using
+the *qemu-img* utility:
+
+    chown qemu:qemu <mountpoint>/<image>
+
+##### Create libvirt XML to define Virtual Machine
+
+*virt-install* is a python wrapper which is mostly used to create a VM
+from a set of params, but *virt-install* doesn't support network
+filesystems, so create the libvirt XML by hand. See to it that the
+disk section is formatted in such a way that the QEMU driver for
+glusterfs is used. A representative disk element looks like this (the
+host, volume, and image names are placeholders):
+
+    <disk type='network' device='disk'>
+      <driver name='qemu' type='qcow2' cache='none'/>
+      <source protocol='gluster' name='<volname>/vm1.img'>
+        <host name='host.sample.com' port='24007' transport='tcp'/>
+      </source>
+      <target dev='vda' bus='virtio'/>
+    </disk>
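+
+For reference, the same image can also be attached directly on the
+QEMU command line, without libvirt (a sketch; the memory/CPU flags and
+names are illustrative, not part of this document's procedure):
+
+    # boot with the gluster-backed disk as a virtio device
+    qemu-system-x86_64 -m 2048 -smp 2 \
+        -drive file=gluster://host.sample.com/vol1/vm1.img,if=virtio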
+
+##### Define the VM from XML
+
+Define the VM from the XML file that was created earlier:
+
+    virsh define <xml-file>
+
+Verify that the VM was created successfully:
+
+    virsh list --all
+
+##### Start the VM
+
+Start the VM:
+
+    virsh start <vm-name>
+
+##### Verification
+
+You can verify the disk image file that is being used by the VM:
+
+    virsh domblklist <vm-name>
+
+The above should show the volume name and image name. Here is an
+example:
+
+    [root@test ~]# virsh domblklist vm-test2
+    Target     Source
+    ------------------------------------------------
+    vda        distrepvol/test.img
+    hdc        -
\ No newline at end of file
diff --git a/done/GlusterFS 3.5/readdir ahead.md b/done/GlusterFS 3.5/readdir ahead.md
new file mode 100644
index 0000000..fe34a97
--- /dev/null
+++ b/done/GlusterFS 3.5/readdir ahead.md
@@ -0,0 +1,117 @@
+Feature
+-------
+
+readdir-ahead
+
+Summary
+-------
+
+Provide read-ahead support for directories to improve sequential
+directory read performance.
+
+Owners
+------
+
+Brian Foster
+
+Current status
+--------------
+
+Gluster currently does not attempt to improve directory read
+performance. As a result, simple operations (e.g., ls) on large
+directories are slow.
+
+Detailed Description
+--------------------
+
+The read-ahead feature for directories is analogous to read-ahead for
+files. The objective is to detect sequential directory read operations
+and establish a pipeline for directory content. When a readdir request
+is received and fulfilled, preemptively issue subsequent readdir
+requests to the server in anticipation of those requests from the
+user. If sequential readdir requests are received, the directory
+content is already immediately available in the client. If subsequent
+requests are not sequential or not received, said data is simply
+dropped and the optimization is bypassed.
+
+Benefit to GlusterFS
+--------------------
+
+Improved read performance of large directories.
+
+### Scope
+
+Nature of proposed change
+-------------------------
+
+readdir-ahead support is enabled through a new client-side translator.
+
+Implications on manageability
+-----------------------------
+
+None beyond the ability to enable and disable the translator.
+
+Implications on presentation layer
+----------------------------------
+
+N/A
+
+Implications on persistence layer
+---------------------------------
+
+N/A
+
+Implications on 'GlusterFS' backend
+-----------------------------------
+
+N/A
+
+Modification to GlusterFS metadata
+----------------------------------
+
+N/A
+
+Implications on 'glusterd'
+--------------------------
+
+N/A
+
+How To Test
+-----------
+
+Performance testing. Verify that sequential reads of large directories
+complete faster (e.g., ls, xfs\_io -c readdir).
+
+User Experience
+---------------
+
+Improved performance on sequential read workloads. The translator
+should otherwise be invisible, and should not degrade performance or
+disrupt behavior in any way.
+
+Dependencies
+------------
+
+N/A
+
+Documentation
+-------------
+
+Set the associated config option to enable or disable directory
+read-ahead on a volume:
+
+    gluster volume set <volname> readdir-ahead [enable|disable]
+
+readdir-ahead is disabled by default.
+
+Status
+------
+
+Development complete for the initial version. Minor changes and bug
+fixes likely.
+
+Future versions might expand to provide generic caching and more
+flexible behavior.
+
+Comments and Discussion
+-----------------------
\ No newline at end of file
-- cgit