From 9e9e3c5620882d2f769694996ff4d7e0cf36cc2b Mon Sep 17 00:00:00 2001 From: raghavendra talur Date: Thu, 20 Aug 2015 15:09:31 +0530 Subject: Create basic directory structure All new features specs go into in_progress directory. Once signed off, it should be moved to done directory. For now, This change moves all the Gluster 4.0 feature specs to in_progress. All other specs are under done/release-version. More cleanup required will be done incrementally. Change-Id: Id272d301ba8c434cbf7a9a966ceba05fe63b230d BUG: 1206539 Signed-off-by: Raghavendra Talur Reviewed-on: http://review.gluster.org/11969 Reviewed-by: Humble Devassy Chirammal Reviewed-by: Prashanth Pai Tested-by: Humble Devassy Chirammal --- .../GlusterFS 3.5/AFR CLI enhancements.md | 204 ---------- .../GlusterFS 3.5/Brick Failure Detection.md | 151 ------- Feature Planning/GlusterFS 3.5/Disk Encryption.md | 443 --------------------- .../GlusterFS 3.5/Exposing Volume Capabilities.md | 161 -------- Feature Planning/GlusterFS 3.5/File Snapshot.md | 101 ----- .../Onwire Compression-Decompression.md | 96 ----- .../GlusterFS 3.5/Quota Scalability.md | 99 ----- .../GlusterFS 3.5/Virt store usecase.md | 140 ------- Feature Planning/GlusterFS 3.5/Zerofill.md | 192 --------- Feature Planning/GlusterFS 3.5/gfid access.md | 89 ----- Feature Planning/GlusterFS 3.5/index.md | 32 -- .../GlusterFS 3.5/libgfapi with qemu libvirt.md | 222 ----------- Feature Planning/GlusterFS 3.5/readdir ahead.md | 117 ------ 13 files changed, 2047 deletions(-) delete mode 100644 Feature Planning/GlusterFS 3.5/AFR CLI enhancements.md delete mode 100644 Feature Planning/GlusterFS 3.5/Brick Failure Detection.md delete mode 100644 Feature Planning/GlusterFS 3.5/Disk Encryption.md delete mode 100644 Feature Planning/GlusterFS 3.5/Exposing Volume Capabilities.md delete mode 100644 Feature Planning/GlusterFS 3.5/File Snapshot.md delete mode 100644 Feature Planning/GlusterFS 3.5/Onwire Compression-Decompression.md delete mode 100644 Feature Planning/GlusterFS 3.5/Quota Scalability.md delete mode 100644 Feature Planning/GlusterFS 3.5/Virt store usecase.md delete mode 100644 Feature Planning/GlusterFS 3.5/Zerofill.md delete mode 100644 Feature Planning/GlusterFS 3.5/gfid access.md delete mode 100644 Feature Planning/GlusterFS 3.5/index.md delete mode 100644 Feature Planning/GlusterFS 3.5/libgfapi with qemu libvirt.md delete mode 100644 Feature Planning/GlusterFS 3.5/readdir ahead.md (limited to 'Feature Planning/GlusterFS 3.5') diff --git a/Feature Planning/GlusterFS 3.5/AFR CLI enhancements.md b/Feature Planning/GlusterFS 3.5/AFR CLI enhancements.md deleted file mode 100644 index 88f4980..0000000 --- a/Feature Planning/GlusterFS 3.5/AFR CLI enhancements.md +++ /dev/null @@ -1,204 +0,0 @@ -Feature -------- - -AFR CLI enhancements - -SUMMARY -------- - -Presently the AFR reporting via CLI has lots of problems in the -representation of logs because of which they may not be able to use the -data effectively. This feature is to correct these problems and provide -a coherent mechanism to present heal status,information and the logs -associated. - -Owners ------- - -Venkatesh Somayajulu -Raghavan - -Current status --------------- - -There are many bugs related to this which indicates the current status -and why these requirements are required. - -​1) 924062 - gluster volume heal info shows only gfids in some cases and -sometimes names. This is very confusing for the end user. 
- -​2) 852294 - gluster volume heal info hangs/crashes when there is a -large number of entries to be healed. - -​3) 883698 - when self heal daemon is turned off, heal info does not -show any output. But healing can happen because of lookups from IO path. -Hence list of entries to be healed still needs to be shown. - -​4) 921025 - directories are not reported when list of split brain -entries needs to be displayed. - -​5) 981185 - when self heal daemon process is offline, volume heal info -gives error as "staging failure" - -​6) 952084 - We need a command to resolve files in split brain state. - -​7) 986309 - We need to report source information for files which got -healed during a self heal session. - -​8) 986317 - Sometimes list of files to get healed also includes files -to which IO s being done since the entries for these files could be in -the xattrop directory. This could be confusing for the user. - -There is a master bug 926044 that sums up most of the above problems. It -does give the QA perspective of the current representation out of the -present reporting infrastructure. - -Detailed Description --------------------- - -​1) One common thread among all the above complaints is that the -information presented to the user is FUD because of the following -reasons: - -(a) Split brain itself is a scary scenario especially with VMs. -(b) The data that we present to the users cannot be used in a stable - manner for them to get to the list of these files. For ex: we - need to give mechanisms by which he can automate the resolution out - of split brain. -(c) The logs that are generated are all the more scarier since we - see repetition of some error lines running into hundreds of lines. - Our mailing lists are filled with such emails from end users. - -Any data is useless unless it is associated with an event. For self -heal, the event that leads to self heal is the loss of connectivity to a -brick from a client. So all healing info and especially split brain -should be associated with such events. - -The following is hence the proposed mechanism: - -(a) Every loss of a brick from client's perspective is logged and - available via some ID. The information provides the time from when - the brick went down to when it came up. Also it should also report - the number of IO transactions(modifies) that hapenned during this - event. -(b) The list of these events are available via some CLI command. The - actual command needs to be detailed as part of this feature. -(c) All volume info commands regarding list of files to be healed, - files healed and split brain files should be associated with this - event(s). - -​2) Provide a mechanism to show statistics at a volume and replica group -level. It should show the number of files to be healed and number of -split brain files at both the volume and replica group level. - -​3) Provide a mechanism to show per volume list of files to be -healed/files healed/split brain in the following info: - -This should have the following information: - -(a) File name -(b) Bricks location -(c) Event association (brick going down) -(d) Source -(v) Sink - -​4) Self heal crawl statistics - Introduce new CLI commands for showing -more information on self heal crawl per volume. 
- -(a) Display why a self heal crawl ran (timeouts, brick coming up) -(b) Start time and end time -(c) Number of files it attempted to heal -(d) Location of the self heal daemon - -​5) Scale the logging infrastructure to handle huge number of file list -that needs to be displayed as part of the logging. - -(a) Right now the system crashes or hangs in case of a high number - of files. -(b) It causes CLI timeouts arbitrarily. The latencies involved in - the logging have to be studied (profiled) and mechanisms to - circumvent them have to be introduced. -(c) All files are displayed on the output. Have a better way of - representing them. - -Options are: - -(a) Maybe write to a glusterd log file or have a seperate directory - for afr heal logs. -(b) Have a status kind of command. This will display the current - status of the log building and maybe have batched way of - representing when there is a huge list. - -​6) We should provide mechanism where the user can heal split brain by -some pre-established policies: - -(a) Let the system figure out the latest files (assuming all nodes - are in time sync) and choose the copies that have the latest time. -(b) Choose one particular brick as the source for split brain and - heal all split brains from this brick. -(c) Just remove the split brain information from changelog. We leave - the exercise to the user to repair split brain where in he would - rewrite to the split brained files. (right now the user is forced to - remove xattrs manually for this step). - -Benefits to GlusterFS --------------------- - -Makes the end user more aware of healing status and provides statistics. - -Scope ------ - -6.1. Nature of proposed change - -Modification to AFR and CLI and glusterd code - -6.2. Implications on manageability - -New CLI commands to be added. Existing commands to be improved. - -6.3. Implications on presentation layer - -N/A - -6.4. Implications on persistence layer - -N/A - -6.5. Implications on 'GlusterFS' backend - -N/A - -6.6. Modification to GlusterFS metadata - -N/A - -6.7. Implications on 'glusterd' - -Changes for healing specific commands will be introduced. - -How To Test ------------ - -See documentation session - -User Experience ---------------- - -*Changes in CLI, effect on User experience...* - -Documentation -------------- - - - -Status ------- - -Patches : - - - -Status: - -Merged \ No newline at end of file diff --git a/Feature Planning/GlusterFS 3.5/Brick Failure Detection.md b/Feature Planning/GlusterFS 3.5/Brick Failure Detection.md deleted file mode 100644 index 9952698..0000000 --- a/Feature Planning/GlusterFS 3.5/Brick Failure Detection.md +++ /dev/null @@ -1,151 +0,0 @@ -Feature -------- - -Brick Failure Detection - -Summary -------- - -This feature attempts to identify storage/file system failures and -disable the failed brick without disrupting the remainder of the node's -operation. - -Owners ------- - -Vijay Bellur with help from Niels de Vos (or the other way around) - -Current status --------------- - -Currently, if the underlying storage or file system failure happens, a -brick process will continue to function. In some cases, a brick can hang -due to failures in the underlying system. Due to such hangs in brick -processes, applications running on glusterfs clients can hang. - -Detailed Description --------------------- - -Detecting failures on the filesystem that a brick uses makes it possible -to handle errors that are caused from outside of the Gluster -environment. 
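For illustration, such an external failure can be induced on a test
system with device-mapper. This is only a sketch with hypothetical
device and mapping names (the How To Test section below references the
same approach):

    # the brick is assumed to live on the LV /dev/vg0/brick1 (mapping vg0-brick1)
    SIZE=$(blockdev --getsz /dev/vg0/brick1)
    # swap in a device-mapper table that fails every I/O request
    dmsetup suspend vg0-brick1
    dmsetup load vg0-brick1 --table "0 $SIZE error"
    dmsetup resume vg0-brick1
    # the health-checker should detect the failing filesystem within its
    # check interval and the brick process should exit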
There have been hanging brick processes when the underlying storage of a
brick became unavailable. A hanging brick process can still use the
network and respond to clients, but actual I/O to the storage is
impossible and can cause noticeable delays on the client side.

Benefit to GlusterFS
--------------------

Provide better detection of storage subsystem failures and prevent
bricks from hanging.

Scope
-----

### Nature of proposed change

Add a health-checker to the posix xlator that periodically checks the
status of the filesystem (implies checking of functional
storage-hardware).

### Implications on manageability

When a brick process detects that the underlying storage is not
responding anymore, the process will exit. There is no automated way
that the brick process gets restarted; the sysadmin will need to fix the
problem with the storage first.

After correcting the storage (hardware or filesystem) issue, the
following command will start the brick process again:

    # gluster volume start <volname> force

### Implications on presentation layer

None

### Implications on persistence layer

None

### Implications on 'GlusterFS' backend

None

### Modification to GlusterFS metadata

None

### Implications on 'glusterd'

'glusterd' can detect that the brick process has exited, and
`gluster volume status` will show that the brick process is not running
anymore. System administrators checking the logs should be able to
triage the cause.

How To Test
-----------

The health-checker thread that is part of each brick process will get
started automatically when a volume has been started. Verifying its
functionality can be done in different ways.

On virtual hardware:

- disconnect the disk from the VM that holds the brick

On real hardware:

- simulate a RAID-card failure by unplugging the card or cables

On a system that uses LVM for the bricks:

- use device-mapper to load an error-table for the disk, see [this
  description](http://review.gluster.org/5176).

On any system (writing to random offsets of the block device, more
difficult to trigger):

1. cause corruption on the filesystem that holds the brick
2. read contents from the brick, hoping to hit the corrupted area
3. the filesystem should abort after hitting a bad spot, and the
   health-checker should notice that shortly afterwards

User Experience
---------------

No more hanging brick processes when storage-hardware or the filesystem
fails.

Dependencies
------------

Posix translator, not available for the BD-xlator.

Documentation
-------------

The health-checker is enabled by default and runs a check every 30
seconds. This interval can be changed per volume with:

    # gluster volume set <volname> storage.health-check-interval <seconds>

If `<seconds>` is set to 0, the health-checker will be disabled.

For further details refer:
 

Status
------

glusterfs-3.4 and newer include a health-checker for the posix xlator,
which was introduced with [bug
971774](https://bugzilla.redhat.com/971774):

- [posix: add a simple
  health-checker](http://review.gluster.org/5176)

Comments and Discussion
-----------------------
diff --git a/Feature Planning/GlusterFS 3.5/Disk Encryption.md b/Feature Planning/GlusterFS 3.5/Disk Encryption.md
deleted file mode 100644
index 4c6ab89..0000000
--- a/Feature Planning/GlusterFS 3.5/Disk Encryption.md
+++ /dev/null
@@ -1,443 +0,0 @@
Feature
=======

Transparent encryption.
Allows a volume to be encrypted "at rest" on the -server using keys only available on the client. - -1 Summary -========= - -Distributed systems impose tighter requirements to at-rest encryption. -This is because your encrypted data will be stored on servers, which are -de facto untrusted. In particular, your private encrypted data can be -subjected to analysis and tampering, which eventually will lead to its -revealing, if it is not properly protected. Specifically, usually it is -not enough to just encrypt data. In distributed systems serious -protection of your personal data is possible only in conjunction with a -special process, which is called authentication. GlusterFS provides such -enhanced service: In GlusterFS encryption is enhanced with -authentication. Currently we provide protection from "silent tampering". -This is a kind of tampering, which is hard to detect, because it doesn't -break POSIX compliance. Specifically, we protect encryption-specific -file's metadata. Such metadata includes unique file's object id (GFID), -cipher algorithm id, cipher block size and other attributes used by the -encryption process. - -1.1 Restrictions ----------------- - -​1. We encrypt only file content. The feature of transparent encryption -doesn't protect file names: they are neither encrypted, nor verified. -Protection of file names is not so critical as protection of -encryption-specific file's metadata: any attacks based on tampering file -names will break POSIX compliance and result in massive corruption, -which is easy to detect. - -​2. The feature of transparent encryption doesn't work in NFS-mounts of -GlusterFS volumes: NFS's file handles introduce security issues, which -are hard to resolve. NFS mounts of encrypted GlusterFS volumes will -result in failed file operations (see section "Encryption in different -types of mount sessions" for more details). - -​3. The feature of transparent encryption is incompatible with GlusterFS -performance translators quick-read, write-behind and open-behind. - -2 Owners -======== - -Jeff Darcy -Edward Shishkin - -3 Current status -================ - -Merged to the upstream. - -4 Detailed Description -====================== - -See Summary. - -5 Benefit to GlusterFS -====================== - -Besides the justifications that have applied to on-disk encryption just -about forever, recent events have raised awareness significantly. -Encryption using keys that are physically present at the server leaves -data vulnerable to physical seizure of the server. Encryption using keys -that are kept by the same organization entity leaves data vulnerable to -"insider threat" plus coercion or capture at the organization level. For -many, especially various kinds of service providers, only pure -client-side encryption provides the necessary levels of privacy and -deniability. - -Competitively, other projects - most notably -[Tahoe-LAFS](https://leastauthority.com/) - are already using recently -heightened awareness of these issues to attract users who would be -better served by our performance/scalability, usability, and diversity -of interfaces. Only the lack of proper encryption holds us back in these -cases. - -6 Scope -======= - -6.1. Nature of proposed change ------------------------------- - -This is a new client-side translator, using user-provided key -information plus information stored in xattrs to encrypt data -transparently as it's written and decrypt when it's read. - -6.2. 
Implications on manageability
----------------------------------

The user needs to manage a per-volume master key (MK). That is:

1) Generate an independent MK for every volume which is to be
encrypted. Note that one MK is created for the whole life of the
volume.

2) Provide the MK on the client side at every mount, in accordance with
the location which has been specified at volume create time, or
overridden via the respective mount option (see section How To Test).

3) Keep the MK between mount sessions. Note that after a successful
mount the MK may be removed from the specified location. In this case
the user should retain the MK safely till the next mount session.

The MK is a 256-bit secret string, which is known only to the user.
Generating and retaining the MK is the user's responsibility.

WARNING!!! Losing the MK will make the content of all regular files of
your volume inaccessible. It is possible to mount a volume with an
improper MK, however such mount sessions will allow access only to file
names, as they are not encrypted.

Recommendations on MK generation

The MK has to be a high-entropy key, appropriately generated by a key
derivation algorithm. One of the possible ways is using rand(1) provided
by the OpenSSL package. You need to specify the option "-hex" for proper
output format. For example, the next command prints a generated key to
the standard output:

    $ openssl rand -hex 32

6.3. Implications on presentation layer
---------------------------------------

N/A

6.4. Implications on persistence layer
--------------------------------------

N/A

6.5. Implications on 'GlusterFS' backend
----------------------------------------

All encrypted files on the servers contain padding at the end of the
file. That is, the size of every encrypted file on the servers is a
multiple of the cipher block size. The real file size is stored as a
file xattr with the key "trusted.glusterfs.crypt.att.size". The
translation padded-file-size -> real-file-size (and backward) is
performed by the crypt translator.

6.6. Modification to GlusterFS metadata
---------------------------------------

Encryption-specific metadata in a specified format is stored as a file
xattr with the key "trusted.glusterfs.crypt.att.cfmt". The current
format of the metadata string is described in slide \#27 of the
following [design
document](http://www.gluster.org/community/documentation/index.php/File:GlusterFS_transparent_encryption.pdf)

6.7. Options of the crypt translator
------------------------------------

- data-cipher-alg

Specifies the cipher algorithm for file data encryption. Currently only
one option is available: AES\_XTS. This is a hidden option.

- block-size

Specifies the size (in bytes) of a logical chunk which is encrypted as a
whole unit in the file body. If cipher modes with initial vectors are
used for encryption, then the initial vector gets reset for every such
chunk. Available values are: "512", "1024", "2048" and "4096". Default
value is "4096".

- data-key-size

Specifies the size (in bits) of the data cipher key. For AES\_XTS the
available values are: "256" and "512". Default value is "256". The
larger key size ("512") is for stronger security.

- master-key

Specifies the pathname of a regular file, or symlink. Defines the
location of the master volume key on the trusted client machine.

7 Getting Started With Crypt Translator
=======================================

1. Create a volume.

2.
Turn on crypt xlator: - - # gluster volume set `` encryption on - -​3. Turn off performance xlators that currently encryption is -incompatible with: - - # gluster volume set  performance.quick-read off - # gluster volume set  performance.write-behind off - # gluster volume set  performance.open-behind off - -​4. (optional) Set location of the volume master key: - - # gluster volume set  encryption.master-key  - -where is an absolute pathname of the file, which -will contain the volume master key (see section implications on -manageability). - -​5. (optional) Override default options of crypt xlator: - - # gluster volume set  encryption.data-key-size  - -where should have one of the following values: -"256"(default), "512". - - # gluster volume set  encryption.block-size  - -where should have one of the following values: "512", -"1024", "2048", "4096"(default). - -​6. Define location of the master key on your client machine, if it -wasn't specified at section 4 above, or you want it to be different from -the , specified at section 4. - -​7. On the client side make sure that the file with name - (or defined at section -6) exists and contains respective per-volume master key (see section -implications on manageability). This key has to be in hex form, i.e. -should be represented by 64 symbols from the set {'0', ..., '9', 'a', -..., 'f'}. The key should start at the beginning of the file. All -symbols at offsets \>= 64 are ignored. - -NOTE: (or defined at -step 6) can be a symlink. In this case make sure that the target file of -this symlink exists and contains respective per-volume master key. - -​8. Mount the volume on the client side as usual. If you -specified a location of the master key at section 6, then use the mount -option - ---xlator-option=.master-key= - -where is location of master key specified at -section 6, is suffixed with "-crypt". For -example, if you created a volume "myvol" in the step 1, then -suffixed\_vol\_name is "myvol-crypt". - -​9. During mount your client machine receives configuration info from -the untrusted server, so this step is extremely important! Check, that -your volume is really encrypted, and that it is encrypted with the -proper master key (see FAQ \#1,\#2). - -​10. (optional) After successful mount the file which contains master -key may be removed. NOTE: Next mount session will require the master-key -again. Keeping the master key between mount sessions is in user's -competence (see section implications on manageability). - -8 How to test -============= - -From a correctness standpoint, it's sufficient to run normal tests with -encryption enabled. From a security standpoint, there's a whole -discipline devoted to analysing the stored data for weaknesses, and -engagement with practitioners of that discipline will be necessary to -develop the right tests. - -9 Dependencies -============== - -Crypt translator requires OpenSSL of version \>= 1.0.1 - -10 Documentation -================ - -10.1 Basic design concepts --------------------------- - -The basic design concepts are described in the following [pdf -slides](http://www.gluster.org/community/documentation/index.php/File:GlusterFS_transparent_encryption.pdf) - -10.2 Procedure of security open -------------------------------- - -So, in accordance with the basic design concepts above, before every -access to a file's body (by read(2), write(2), truncate(2), etc) we need -to make sure that the file's metadata is trusted. Otherwise, we risk to -deal with untrusted file's data. 
- -To make sure that file's metadata is trusted, file is subjected to a -special procedure of security open. The procedure of security open is -performed by crypt translator at FOP-\>open() (crypt\_open) time by the -function open\_format(). Currently this is a hardcoded composition of 2 -checks: - -1. verification of file's GFID by the file name; -2. verification of file's metadata by the verified GFID; - -If the security open succeeds, then the cache of trusted client machine -is replenished with file descriptor and file's inode, and user can -access the file's content by read(2), write(2), ftruncate(2), etc. -system calls, which accept file descriptor as argument. - -However, file API also allows to accept file body without opening the -file. For example, truncate(2), which accepts pathname instead of file -descriptor. To make sure that file's metadata is trusted, we create a -temporal file descriptor and mandatory call crypt\_open() before -truncating the file's body. - -10.3 Encryption in different types of mount sessions ----------------------------------------------------- - -Everything described in the section above is valid only for FUSE-mounts. -Besides, GlusterFS also supports so-called NFS-mounts. From the -standpoint of security the key difference between the mentioned types of -mount sessions is that in NFS-mount sessions file operations instead of -file name accept a so-called file handle (which is actually GFID). It -creates problems, since the file name is a basic point for verification. -As it follows from the section above, using the step 1, we can replenish -the cache of trusted machine with trusted file handles (GFIDs), and -perform a security open only by trusted GFID (by the step 2). However, -in this case we need to make sure that there is no leaks of non-trusted -GFIDs (and, moreover, such leaks won't be introduced by the development -process in future). This is possible only with changed GFID format: -everywhere in GlusterFS GFID should appear as a pair (uuid, -is\_verified), where is\_verified is a boolean variable, which is true, -if this GFID passed off the procedure of verification (step 1 in the -section above). - -The next problem is that current NFS protocol doesn't encrypt the -channel between NFS client and NFS server. It means that in NFS-mounts -of GlusterFS volumes NFS client and GlusterFS client should be the same -(trusted) machine. - -Taking into account the described problems, encryption in GlusterFS is -not supported in NFS-mount sessions. - -10.4 Class of cipher algorithms for file data encryption that can be supported by the crypt translator ------------------------------------------------------------------------------------------------------- - -We'll assume that any symmetric block cipher algorithm is completely -determined by a pair (alg\_id, mode\_id), where alg\_id is an algorithm -defined on elementary cipher blocks (e.g. AES), and mode\_id is a mode -of operation (e.g. ECB, XTS, etc). - -Technically, the crypt translator is able to support any symmetric block -cipher algorithms via additional options of the crypt translator. -However, in practice the set of supported algorithms is narrowed because -of various security and organization issues. Currently we support only -one algotithm. This is AES\_XTS. - -10.5 Bibliography ------------------ - -1. Recommendations for for Block Cipher Modes of Operation (NIST - Special Publication 800-38A). -2. 
Recommendation for Block Cipher Modes of Operation: The XTS-AES Mode - for Confidentiality on Storage Devices (NIST Special Publication - 800-38E). -3. Recommendation for Key Derivation Using Pseudorandom Functions, - (NIST Special Publication 800-108). -4. Recommendation for Block Cipher Modes of Operation: The CMAC Mode - for Authentication, (NIST Special Publication 800-38B). -5. Recommendation for Block Cipher Modes of Operation: Methods for Key - Wrapping, (NIST Special Publication 800-38F). -6. FIPS PUB 198-1 The Keyed-Hash Message Authentication Code (HMAC). -7. David A. McGrew, John Viega "The Galois/Counter Mode of Operation - (GCM)". - -11 FAQ -====== - -**1. How to make sure that my volume is really encrypted?** - -Check the respective graph of translators on your trusted client -machine. This graph is created at mount time and is stored by default in -the file /usr/local/var/log/glusterfs/mountpoint.log - -Here "mountpoint" is the absolute name of the mountpoint, where "/" are -replaced with "-". For example, if your volume is mounted to -/mnt/testfs, then you'll need to check the file -/usr/local/var/log/glusterfs/mnt-testfs.log - -Make sure that this graph contains the crypt translator, which looks -like the following: - - 13: volume xvol-crypt - 14:     type encryption/crypt - 15:     option master-key /home/edward/mykey - 16:     subvolumes xvol-dht - 17: end-volume - -**2. How to make sure that my volume is encrypted with a proper master -key?** - -Check the graph of translators on your trusted client machine (see the -FAQ\#1). Make sure that the option "master-key" of the crypt translator -specifies correct location of the master key on your trusted client -machine. - -**3. Can I change the encryption status of a volume?** - -You can change encryption status (enable/disable encryption) only for -empty volumes. Otherwise it will be incorrect (you'll end with IO -errors, data corruption and security problems). We strongly recommend to -decide once and forever at volume creation time, whether your volume has -to be encrypted, or not. - -**4. I am able to mount my encrypted volume with improper master keys -and get list of file names for every directory. Is it normal?** - -Yes, it is normal. It doesn't contradict the announced functionality: we -encrypt only file's content. File names are not encrypted, so it doesn't -make sense to hide them on the trusted client machine. - -**5. What is the reason for only supporting AES-XTS? This mode is not -using Intel's AES-NI instruction thus not utilizing hardware feature..** - -Distributed file systems impose tighter requirements to at-rest -encryption. We offer more than "at-rest-encryption". We offer "at-rest -encryption and authentication in distributed systems with non-trusted -servers". Data and metadata on the server can be easily subjected to -tampering and analysis with the purpose to reveal secret user's data. -And we have to resist to this tampering by performing data and metadata -authentication. - -Unfortunately, it is technically hard to implement full-fledged data -authentication via a stackable file system (GlusterFS translator), so we -have decided to perform a "light" authentication by using a special -cipher mode, which is resistant to tampering. Currently OpenSSL supports -only one such mode: this is XTS. Tampering of ciphertext created in XTS -mode will lead to unpredictable changes in the plain text. That said, -user will see "unpredictable gibberish" on the client side. 
Of course, -this is not an "official way" to detect tampering, but this is much -better than nothing. The "official way" (creating/checking MACs) we use -for metadata authentication. - -Other modes like CBC, CFB, OFB, etc supported by OpenSSL are strongly -not recommended for use in distributed systems with non-trusted servers. -For example, CBC mode doesn't "survive" overwrite of a logical block in -a file. It means that with every such overwrite (standard file system -operation) we'll need to re-encrypt the whole(!) file with different -key. CFB and OFB modes are sensitive to tampering: there is a way to -perform \*predictable\* changes in plaintext, which is unacceptable. - -Yes, XTS is slow (at least its current implementation in OpenSSL), but -we don't promise, that CFB, OFB with full-fledged authentication will be -faster. So.. diff --git a/Feature Planning/GlusterFS 3.5/Exposing Volume Capabilities.md b/Feature Planning/GlusterFS 3.5/Exposing Volume Capabilities.md deleted file mode 100644 index 0f72fbc..0000000 --- a/Feature Planning/GlusterFS 3.5/Exposing Volume Capabilities.md +++ /dev/null @@ -1,161 +0,0 @@ -Feature -------- - -Provide a capability to: - -- Probe the type (posix or bd) of volume. -- Provide list of capabilities of a xlator/volume. For example posix - xlator could support zerofill, BD xlator could support offloaded - copy, thin provisioning etc - -Summary -------- - -With multiple storage translators (posix and bd) being supported in -GlusterFS, it becomes necessary to know the volume type so that user can -issue appropriate calls that are relevant only to the a given volume -type. Hence there needs to be a way to expose the type of the storage -translator of the volume to the user. - -BD xlator is capable of providing server offloaded file copy, -server/storage offloaded zeroing of a file etc. This capabilities should -be visible to the client/user, so that these features can be exploited. - -Owners ------- - -M. Mohan Kumar -Bharata B Rao. - -Current status --------------- - -BD xlator exports capability information through gluster volume info -(and --xml) output. For eg: - -*snip of gluster volume info output for a BD based volume* - - Xlator 1: BD - Capability 1: thin - -*snip of gluster volume info --xml output for a BD based volume* - - -    -     BD -      -       thin -      -    - - -But this capability information should also exposed through some other -means so that a host which is not part of Gluster peer could also avail -this capabilities. - -Exposing about type of volume (ie posix or BD) is still in conceptual -state currently and needs discussion. - -Detailed Description --------------------- - -1. Type -- BD translator supports both regular files and block device, -i,e., one can create files on GlusterFS volume backed by BD -translator and this file could end up as regular posix file or a -logical volume (block device) based on the user's choice. User -can do a setxattr on the created file to convert it to a logical -volume. -- Users of BD backed volume like QEMU would like to know that it -is working with BD type of volume so that it can issue an -additional setxattr call after creating a VM image on GlusterFS -backend. This is necessary to ensure that the created VM image -is backed by LV instead of file. -- There are different ways to expose this information (BD type of -volume) to user. One way is to export it via a getxattr call. - -2. 
Capabilities -- BD xlator supports new features such as server offloaded file -copy, thin provisioned VM images etc (there is a patch posted to -Gerrit to add server offloaded file zeroing in posix xlator). -There is no standard way of exploiting these features from -client side (such as syscall to exploit server offloaded copy). -So these features need to be exported to the client so that they -can be used. BD xlator V2 patch exports these capabilities -information through gluster volume info (and --xml) output. But -if a client is not part of GlusterFS peer it can't run volume -info command to get the list of capabilities of a given -GlusterFS volume. Also GlusterFS block driver in qemu need to -get the capability list so that these features are used. - -Benefit to GlusterFS --------------------- - -Enables proper consumption of BD xlator and client exploits new features -added in both posix and BD xlator. - -### Scope - -Nature of proposed change -------------------------- - -- Quickest way to expose volume type to a client can be achieved by - using getxattr fop. When a client issues getxattr("volume\_type") on - a root gfid, bd xlator will return 1 implying its BD xlator. But - posix xlator will return ENODATA and client code can interpret this - as posix xlator. - -- Also capability list can be returned via getxattr("caps") for root - gfid. - -Implications on manageability ------------------------------ - -None. - -Implications on presentation layer ----------------------------------- - -N/A - -Implications on persistence layer ---------------------------------- - -N/A - -Implications on 'GlusterFS' backend ------------------------------------ - -N/A - -Modification to GlusterFS metadata ----------------------------------- - -N/A - -Implications on 'glusterd' --------------------------- - -N/A - -How To Test ------------ - -User Experience ---------------- - -Dependencies ------------- - -Documentation -------------- - -Status ------- - -Patch : - -Status : Merged - -Comments and Discussion ------------------------ diff --git a/Feature Planning/GlusterFS 3.5/File Snapshot.md b/Feature Planning/GlusterFS 3.5/File Snapshot.md deleted file mode 100644 index b2d6c69..0000000 --- a/Feature Planning/GlusterFS 3.5/File Snapshot.md +++ /dev/null @@ -1,101 +0,0 @@ -Feature -------- - -File Snapshots in GlusterFS - -### Summary - -Ability to take snapshots of files in GlusterFS - -### Owners - -Anand Avati - -### Source code - -Patch for this feature - - -### Detailed Description - -The feature adds file snapshotting support to GlusterFS. '' To use this -feature the file format should be QCOW2 (from QEMU)'' . The patch takes -the block layer code from Qemu and converts it into a translator in -gluster. - -### Benefit to GlusterFS - -Better integration with Openstack Cinder, and in general ability to take -snapshots of files (typically VM images) - -### Usage - -*To take snapshot of a file, the file format should be QCOW2. To set -file type as qcow2 check step \#2 below* - -​1. Turning on snapshot feature : - - gluster volume set `` features.file-snapshot on - -​2. To set qcow2 file format: - - setfattr -n trusted.glusterfs.block-format -v qcow2:10GB  - -​3. To create a snapshot: - - setfattr -n trusted.glusterfs.block-snapshot-create -v  - -​4. To apply/revert back to a snapshot: - - setfattr -n trusted.glusterfs.block-snapshot-goto -v   - -### Scope - -#### Nature of proposed change - -The work is going to be a new translator. 
Very minimal changes to -existing code (minor change in syncops) - -#### Implications on manageability - -Will need ability to load/unload the translator in the stack. - -#### Implications on presentation layer - -Feature must be presentation layer independent. - -#### Implications on persistence layer - -No implications - -#### Implications on 'GlusterFS' backend - -Internal snapshots - No implications. External snapshots - there will be -hidden directories added. - -#### Modification to GlusterFS metadata - -New xattr will be added to identify files which are 'snapshot managed' -vs raw files. - -#### Implications on 'glusterd' - -Yet another turn on/off feature for glusterd. Volgen will have to add a -new translator in the generated graph. - -### How To Test - -Snapshots can be tested by taking snapshots along with checksum of the -state of the file, making further changes and going back to old snapshot -and verify the checksum again. - -### Dependencies - -Dependent QEMU code is imported into the codebase. - -### Documentation - - - -### Status - -Merged in master and available in Gluster3.5 \ No newline at end of file diff --git a/Feature Planning/GlusterFS 3.5/Onwire Compression-Decompression.md b/Feature Planning/GlusterFS 3.5/Onwire Compression-Decompression.md deleted file mode 100644 index a26aa7a..0000000 --- a/Feature Planning/GlusterFS 3.5/Onwire Compression-Decompression.md +++ /dev/null @@ -1,96 +0,0 @@ -Feature -======= - -On-Wire Compression/Decompression - -1. Summary -========== - -Translator to compress/decompress data in flight between client and -server. - -2. Owners -========= - -- Venky Shankar -- Prashanth Pai - -3. Current Status -================= - -Code has already been merged. Needs more testing. - -The [initial submission](http://review.gluster.org/3251) contained a -`compress` option, which introduced [some -confusion](https://bugzilla.redhat.com/1053670). [A correction has been -sent](http://review.gluster.org/6765) to rename the user visible options -to start with `network.compression`. - -TODO - -- Make xlator pluggable to add support for other compression methods -- Add support for lz4 compression: - -4. Detailed Description -======================= - -- When a writev call occurs, the client compresses the data before - sending it to server. On the server, compressed data is - decompressed. Similarly, when a readv call occurs, the server - compresses the data before sending it to client. On the client, the - compressed data is decompressed. Thus the amount of data sent over - the wire is minimized. - -- Compression/Decompression is done using Zlib library. - -- During normal operation, this is the format of data sent over wire: - + trailer(8 bytes). The trailer contains the CRC32 - checksum and length of original uncompressed data. This is used for - validation. - -5. Usage -======== - -Turning on compression xlator: - - # gluster volume set  network.compression on - -Configurable options: - - # gluster volume set  network.compression.compression-level 8 - # gluster volume set  network.compression.min-size 50 - -6. Benefits to GlusterFS -======================== - -Fewer bytes transferred over the network. - -7. Issues -========= - -- Issues with striped volumes. Compression xlator cannot work with - striped volumes - -- Issues with write-behind: Mount point hangs when writing a file with - write-behind xlator turned on. To overcome this, turn off - write-behind entirely OR set "performance.strict-write-ordering" to - on. 
- -- Issues with AFR: AFR v1 currently does not propagate xdata. - This issue has - been resolved in AFR v2. - -8. Dependencies -=============== - -Zlib library - -9. Documentation -================ - - - -10. Status -========== - -Code merged upstream. \ No newline at end of file diff --git a/Feature Planning/GlusterFS 3.5/Quota Scalability.md b/Feature Planning/GlusterFS 3.5/Quota Scalability.md deleted file mode 100644 index f3b0a0d..0000000 --- a/Feature Planning/GlusterFS 3.5/Quota Scalability.md +++ /dev/null @@ -1,99 +0,0 @@ -Feature -------- - -Quota Scalability - -Summary -------- - -Support upto 65536 quota configurations per volume. - -Owners ------- - -Krishnan Parthasarathi -Vijay Bellur - -Current status --------------- - -Current implementation of Directory Quota cannot scale beyond a few -hundred configured limits per volume. The aim of this feature is to -support upto 65536 quota configurations per volume. - -Detailed Description --------------------- - -TBD - -Benefit to GlusterFS --------------------- - -More quotas can be configured in a single volume thereby leading to -support GlusterFS for use cases like home directory. - -Scope ------ - -### Nature of proposed change - -- Move quota enforcement translator to the server -- Introduce a new quota daemon which helps in aggregating directory - consumption on the server -- Enhance marker's accounting to be modular -- Revamp configuration persistence and CLI listing for better scale -- Allow configuration of soft limits in addition to hard limits. - -### Implications on manageability - -Mostly the CLI will be backward compatible. New CLI to be introduced -needs to be enumerated here. - -### Implications on presentation layer - -None - -### Implications on persistence layer - -None - -### Implications on 'GlusterFS' backend - -None - -### Modification to GlusterFS metadata - -- Addition of a new extended attribute for storing configured hard and -soft limits on directories. - -### Implications on 'glusterd' - -- New file based configuration persistence - -How To Test ------------ - -TBD - -User Experience ---------------- - -TBD - -Dependencies ------------- - -None - -Documentation -------------- - -TBD - -Status ------- - -In development - -Comments and Discussion ------------------------ diff --git a/Feature Planning/GlusterFS 3.5/Virt store usecase.md b/Feature Planning/GlusterFS 3.5/Virt store usecase.md deleted file mode 100644 index 3e649b2..0000000 --- a/Feature Planning/GlusterFS 3.5/Virt store usecase.md +++ /dev/null @@ -1,140 +0,0 @@ - Work In Progress - Author - Satheesaran Sundaramoorthi - - -**Introduction** ----------------- - -Gluster volumes are used to host Virtual Machines Images. (i.e) Virtual -machines Images are stored on gluster volumes. This usecase is popularly -known as *virt-store* usecase. - -This document explains more about, - -1. Enabling gluster volumes for virt-store usecase -2. Common Pitfalls -3. FAQs -4. References - -**Enabling gluster volumes for virt-store** -------------------------------------------- - -This section describes how to enable gluster volumes for virt store -usecase - -#### Volume Types - -Ideally gluster volumes serving virt-store, should provide -high-availability for the VMs running on it. If the volume is not -avilable, the VMs may move in to unusable state. 
So, it is best
recommended to use a **replica** or **distribute-replicate** volume for
this usecase.

*If you are new to GlusterFS, you can take a look at the
[QuickStart](http://gluster.readthedocs.org/en/latest/Quick-Start-Guide/Quickstart/) guide or the [admin
guide](http://gluster.readthedocs.org/en/latest/Administrator%20Guide/README/)*

#### Tunables

The following set of volume options is recommended for the virt-store
usecase, as they add a performance boost:

    quick-read=off
    read-ahead=off
    io-cache=off
    stat-prefetch=off
    eager-lock=enable
    remote-dio=enable
    quorum-type=auto
    server-quorum-type=server

- quick-read is meant for improving small-file read performance, which
  is no longer relevant for VM image files
- read-ahead is turned off. VMs have their own way of doing that, and
  it is usual to leave it to the VM to determine the read-ahead
- io-cache is turned off
- stat-prefetch is turned off. stat-prefetch caches the metadata
  related to files, and this is no longer a concern for VM images (why?)
- eager-lock is turned on (why?)
- remote-dio is turned on, so in open() and creat() calls the O\_DIRECT
  flag will be filtered at the client protocol level and the server
  will still continue to cache the file
- quorum-type is set to auto. This enables client-side quorum. When
  client-side quorum is enabled, at least half of the bricks in the
  replica group should be up and running. If not, the replica group
  becomes read-only
- server-quorum-type is set to server. This enables server-side quorum.
  It lays down the condition that at least half the number of nodes in
  a cluster should be up. If not, the bricks (read: brick processes)
  will be killed, and thereby the volume goes offline

#### Applying the Tunables on the volume

There are a number of ways to do it:

1. Make use of the group-virt.example file
2. Copy & paste

##### Make use of group-virt.example file

This is the method best suited and recommended.
*/etc/glusterfs/group-virt.example* has all options recommended for
virt-store as explained earlier.
Copy this file, -*/etc/glusterfs/group-virt.example* to */var/lib/glusterd/groups/virt* - - cp /etc/glusterfs/group-virt.example /var/lib/glusterd/groups/virt - -Optimize the volume with all the options available in this *virt* file -in a single go - - gluster volume set group virt - -NOTE: No restart of the volume is required Verify the same with the -command, - - gluster volume info - -In forthcoming releases, this file will be automatically put in -*/var/lib/glusterd/groups/* and you can directly apply it on the volume - -##### Copy & Paste - -Copy all options from the above -section,[Virt-store-usecase\#Tunables](Virt-store-usecase#Tunables "wikilink") -and put in a file named *virt* in */var/lib/glusterd/groups/virt* Apply -all the options on the volume, - - gluster volume set group virt - -NOTE: This is not recommended, as the recommended volume options may/may -not change in future.Always stick to *virt* file available with the rpms - -#### Adding Ownership to Volume - -You can add uid:gid to the volume, - - gluster volume set storage.owner-uid - gluster volume set storage.owner-gid - -For example, when the volume would be accessed by qemu/kvm, you need to -add ownership as 107:107, - - gluster volume set storage.owner-uid 107 - gluster volume set storage.owner-gid 107 - -It would be 36:36 in the case of oVirt/RHEV, 165:165 in the case of -OpenStack Block Service (cinder),161:161 in case of OpenStack Image -Service (glance) is accessing this volume - -NOTE: Not setting the correct ownership may lead to "Permission Denied" -errors when accessing the image files residing on the volume - -**Common Pitfalls** -------------------- - -**FAQs** --------- - -**References** --------------- \ No newline at end of file diff --git a/Feature Planning/GlusterFS 3.5/Zerofill.md b/Feature Planning/GlusterFS 3.5/Zerofill.md deleted file mode 100644 index 43b279d..0000000 --- a/Feature Planning/GlusterFS 3.5/Zerofill.md +++ /dev/null @@ -1,192 +0,0 @@ -Feature -------- - -zerofill API for GlusterFS - -Summary -------- - -zerofill() API would allow creation of pre-allocated and zeroed-out -files on GlusterFS volumes by offloading the zeroing part to server -and/or storage (storage offloads use SCSI WRITESAME). - -Owners ------- - -Bharata B Rao -M. Mohankumar - -Current status --------------- - -Patch on gerrit: - -Detailed Description --------------------- - -Add support for a new ZEROFILL fop. Zerofill writes zeroes to a file in -the specified range. This fop will be useful when a whole file needs to -be initialized with zero (could be useful for zero filled VM disk image -provisioning or during scrubbing of VM disk images). - -Client/application can issue this FOP for zeroing out. Gluster server -will zero out required range of bytes ie server offloaded zeroing. In -the absence of this fop, client/application has to repetitively issue -write (zero) fop to the server, which is very inefficient method because -of the overheads involved in RPC calls and acknowledgements. - -WRITESAME is a SCSI T10 command that takes a block of data as input and -writes the same data to other blocks and this write is handled -completely within the storage and hence is known as offload . Linux ,now -has support for SCSI WRITESAME command which is exposed to the user in -the form of BLKZEROOUT ioctl. BD Xlator can exploit BLKZEROOUT ioctl to -implement this fop. Thus zeroing out operations can be completely -offloaded to the storage device , making it highly efficient. - -The fop takes two arguments offset and size. 
It zeroes out 'size' number
of bytes in an opened file starting from the 'offset' position.

Benefit to GlusterFS
--------------------

Benefits GlusterFS in virtualization by providing the ability to quickly
create a pre-allocated and zeroed-out VM disk image by using
server/storage off-loads.

### Scope

Nature of proposed change
-------------------------

An FOP supported in libgfapi and FUSE.

Implications on manageability
-----------------------------

None.

Implications on presentation layer
----------------------------------

N/A

Implications on persistence layer
---------------------------------

N/A

Implications on 'GlusterFS' backend
-----------------------------------

N/A

Modification to GlusterFS metadata
----------------------------------

N/A

Implications on 'glusterd'
--------------------------

N/A

How To Test
-----------

Test server offload by measuring the time taken for creating a fully
allocated and zeroed file on the Posix backend.

Test storage offload by measuring the time taken for creating a fully
allocated and zeroed file on the BD backend.

User Experience
---------------

Fast provisioning of VM images when GlusterFS is used as a file system
backend for KVM virtualization.

Dependencies
------------

zerofill() support in the BD backend depends on the new BD translator.

Documentation
-------------

This feature adds zerofill support to the following areas:

-  libglusterfs
-  io-stats
-  performance/md-cache,open-behind
-  quota
-  cluster/afr,dht,stripe
-  rpc/xdr
-  protocol/client,server
-  io-threads
-  marker
-  storage/posix
-  libgfapi

Client applications can exploit this fop by using glfs\_zerofill
introduced in libgfapi. FUSE support for this fop has not been added, as
there is no system call for it.

Here is a performance comparison of server offloaded zerofill vs zeroing
out using repeated writes.
- - [root@llmvm02 remote]# time ./offloaded aakash-test log 20 - - real    3m34.155s - user    0m0.018s - sys 0m0.040s - - -  [root@llmvm02 remote]# time ./manually aakash-test log 20 - - real    4m23.043s - user    0m2.197s - sys 0m14.457s -  [root@llmvm02 remote]# time ./offloaded aakash-test log 25; - - real    4m28.363s - user    0m0.021s - sys 0m0.025s - [root@llmvm02 remote]# time ./manually aakash-test log 25 - - real    5m34.278s - user    0m2.957s - sys 0m18.808s - -The argument log is a file which we want to set for logging purpose and -the third argument is size in GB . - -As we can see there is a performance improvement of around 20% with this -fop. - -Status ------- - -Patch : Status : Merged \ No newline at end of file diff --git a/Feature Planning/GlusterFS 3.5/gfid access.md b/Feature Planning/GlusterFS 3.5/gfid access.md deleted file mode 100644 index db64076..0000000 --- a/Feature Planning/GlusterFS 3.5/gfid access.md +++ /dev/null @@ -1,89 +0,0 @@ -### Instructions - -**Feature** - -'gfid-access' translator to provide access to data in glusterfs using a virtual path. - -**1 Summary** - -This particular Translator is designed to provide direct access to files in glusterfs using its gfid.'GFID' is glusterfs's inode numbers for a file to identify it uniquely. - -**2 Owners** - -Amar Tumballi  -Raghavendra G  -Anand Avati  - -**3 Current status** - -With glusterfs-3.4.0, glusterfs provides only path based access.A feature is added in 'fuse' layer in the current master branch, -but its desirable to have it as a separate translator for long time -maintenance. - -**4 Detailed Description** - -With this method, we can consume the data in changelog translator -(which is logging 'gfid' internally) very efficiently. - -**5 Benefit to GlusterFS** - -Provides a way to access files quickly with direct gfid. - -​**6. Scope** - -6.1. Nature of proposed change - -* A new translator. -* Fixes in 'glusterfsd.c' to add this translator automatically based -on mount time option. -* change to mount.glusterfs to parse this new option  -(single digit number or lines changed) - -6.2. Implications on manageability - -* No CLI required. -* mount.glusterfs script gets a new option. - -6.3. Implications on presentation layer - -* A new virtual access path is made available. But all access protocols work seemlessly, as the complexities are handled internally. - -6.4. Implications on persistence layer - -* None - -6.5. Implications on 'GlusterFS' backend - -* None - -6.6. Modification to GlusterFS metadata - -* None - -6.7. Implications on 'glusterd' - -* None - -7 How To Test - -* Mount glusterfs client with '-o aux-gfid-mount' and access files using '/mount/point/.gfid/ '. - -8 User Experience - -* A new virtual path available for users. - -9 Dependencies - -* None - -10 Documentation - -This wiki. - -11 Status - -Patch sent upstream. More review comments required. 
(http://review.gluster.org/5497) - -12 Comments and Discussion - -Please do give comments :-) \ No newline at end of file diff --git a/Feature Planning/GlusterFS 3.5/index.md b/Feature Planning/GlusterFS 3.5/index.md deleted file mode 100644 index e8c2c88..0000000 --- a/Feature Planning/GlusterFS 3.5/index.md +++ /dev/null @@ -1,32 +0,0 @@ -GlusterFS 3.5 Release ---------------------- - -Tentative Dates: - -Latest: 13-Nov, 2014 GlusterFS 3.5.3 - -17th Apr, 2014 - 3.5.0 GA - -GlusterFS 3.5 -------------- - -### Features in 3.5.0 - -- [Features/AFR CLI enhancements](./AFR CLI enhancements.md) -- [Features/exposing volume capabilities](./Exposing Volume Capabilities.md) -- [Features/File Snapshot](./File Snapshot.md) -- [Features/gfid-access](./gfid access.md) -- [Features/On-Wire Compression + Decompression](./Onwire Compression-Decompression.md) -- [Features/Quota Scalability](./Quota Scalability.md) -- [Features/readdir ahead](./readdir ahead.md) -- [Features/zerofill](./Zerofill.md) -- [Features/Brick Failure Detection](./Brick Failure Detection.md) -- [Features/disk-encryption](./Disk Encryption.md) -- Changelog based parallel geo-replication -- Improved block device translator - -Proposing New Features ----------------------- - -New feature proposals should be built using the New Feature Template in -the GlusterFS 3.7 planning page diff --git a/Feature Planning/GlusterFS 3.5/libgfapi with qemu libvirt.md b/Feature Planning/GlusterFS 3.5/libgfapi with qemu libvirt.md deleted file mode 100644 index 2309016..0000000 --- a/Feature Planning/GlusterFS 3.5/libgfapi with qemu libvirt.md +++ /dev/null @@ -1,222 +0,0 @@ - Work In Progress - Author - Satheesaran Sundaramoorthi - - -**Purpose** ------------ - -Gluster volume can be used to store VM Disk images. This usecase is -popularly known as 'Virt-Store' usecase. Earlier, gluster volume had to -be fuse mounted and images are created/accessed over the fuse mount. - -With the introduction of GlusterFS libgfapi, QEMU supports glusterfs -through libgfapi directly. This we call as *QEMU driver for glusterfs*. -These document explains about the way to make use of QEMU driver for -glusterfs - -Steps for the entire procedure could be split in to 2 views viz,the -document from - -1. Steps to be done on gluster volume side -2. Steps to be done on Hypervisor side - -**Steps to be done on gluster side** ------------------------------------- - -These are the steps that needs to be done on the gluster side Precisely -this involves - -1. Creating "Trusted Storage Pool" -2. Creating a volume -3. Tuning the volume for virt-store -4. Tuning glusterd to accept requests from QEMU -5. Tuning glusterfsd to accept requests from QEMU -6. Setting ownership on the volume -7. Starting the volume - -##### Creating "Trusted Storage Pool" - -Install glusterfs rpms on the NODE. You can create a volume with a -single node. You can also scale up the cluster, as we call as *Trusted -Storage Pool*, by adding more nodes to the cluster - - gluster peer probe  - -##### Creating a volume - -It is highly recommended to have replicate volume or -distribute-replicate volume for virt-store usecase, as it would add high -availability and fault-tolerance. Remember the plain distribute works -equally well - - gluster volume create replica 2 .. 
-
-##### Creating a volume
-
-It is highly recommended to have a replicate or distribute-replicate
-volume for the virt-store usecase, as it adds high availability and
-fault-tolerance, though a plain distribute volume works equally well:
-
-    gluster volume create <volname> replica 2 <brick1> <brick2> ..
-
-where each <brick> is <hostname>:<brick-path>.
-
-Note: It is recommended to create sub-directories inside the brick
-mountpoint and use those while creating a volume. For example, say
-*/home/brick1* is the mountpoint of an XFS filesystem; you can create a
-sub-directory */home/brick1/b1* inside it and use that as a brick. You
-can also use space available in the root filesystem for bricks; the
-gluster CLI, by default, throws a warning in that case, which you can
-override with the *force* option:
-
-    gluster volume create <volname> replica 2 <brick1> <brick2> .. force
-
-*If you are new to GlusterFS, you can take a look at the
-[QuickStart](http://gluster.readthedocs.org/en/latest/Quick-Start-Guide/Quickstart/) guide.*
-
-##### Tuning the volume for virt-store
-
-There is a set of recommended settings for virt-store. They provide
-good performance characteristics when enabled on a volume used for
-*virt-store*.
-
-Refer to
-[Virt-store-usecase\#Tunables](Virt-store-usecase#Tunables "wikilink")
-for the recommended tunables, and to
-[Virt-store-usecase\#Applying\_the\_Tunables\_on\_the\_volume](Virt-store-usecase#Applying_the_Tunables_on_the_volume "wikilink")
-for applying them on the volume.
-
-##### Tuning glusterd to accept requests from QEMU
-
-glusterd accepts requests only from applications that connect from a
-port number less than 1024 and blocks others. QEMU uses a port number
-greater than 1024; to make glusterd accept requests from QEMU, edit the
-glusterd vol file, */etc/glusterfs/glusterd.vol*, and add the
-following:
-
-    option rpc-auth-allow-insecure on
-
-Note: If you have installed glusterfs from source, you can find the
-glusterd vol file at */usr/local/etc/glusterfs/glusterd.vol*.
-
-Restart glusterd after adding that option to the glusterd vol file:
-
-    service glusterd restart
-
-##### Tuning glusterfsd to accept requests from QEMU
-
-Enable the option *allow-insecure* on the particular volume:
-
-    gluster volume set <volname> server.allow-insecure on
-
-**IMPORTANT:** As of now (April 2, 2014) there is a bug whereby
-*allow-insecure* is not applied dynamically on a volume. You need to
-restart the volume for the change to take effect.
-
-##### Setting ownership on the volume
-
-Set the ownership of qemu:qemu on the volume:
-
-    gluster volume set <volname> storage.owner-uid 107
-    gluster volume set <volname> storage.owner-gid 107
-
-**IMPORTANT:** The UID and GID can differ per Linux distribution, or
-even per installation. The UID/GID should be the one from the *qemu* or
-*kvm* user; you can get the IDs with these commands:
-
-    id qemu
-    getent group kvm
-
-##### Starting the volume
-
-Start the volume:
-
-    gluster volume start <volname>
-
-**Steps to be done on Hypervisor Side**
----------------------------------------
-
-The hypervisor is simply the machine which spawns the virtual machines.
-This machine should be a bare-metal host with ample memory and
-computing power. The following steps need to be done on the hypervisor
-(the first two are sketched right after this list):
-
-1. Install qemu-kvm
-2. Install libvirt
-3. Create a VM image
-4. Add ownership to the image file
-5. Create libvirt XML to define the Virtual Machine
-6. Define the VM
-7. Start the VM
-8. Verification
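-
-On a yum-based distribution, the first two steps usually reduce to
-installing the virtualization packages and starting libvirtd; a rough
-sketch (package and service names can vary across distributions and
-versions):
-
-    # install the KVM hypervisor bits and the libvirt management layer
-    yum install qemu-kvm libvirt
-
-    # libvirtd must be running before VMs can be defined or started
-    service libvirtd start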
-
-##### Install qemu-kvm
-
-##### Install libvirt
-
-##### Create a VM Image
-
-Images can be created using the *qemu-img* utility:
-
-    qemu-img create -f <format> gluster://<server>/<vol-name>/<image> <size>
-
-- format - This can be raw or qcow2
-- server - One of the gluster node's IP or FQDN
-- vol-name - gluster volume name
-- image - Image file name
-- size - Size of the image
-
-Here is a sample:
-
-    qemu-img create -f qcow2 gluster://host.sample.com/vol1/vm1.img 10G
-
-##### Add ownership to the Image file
-
-NFS or FUSE mount the glusterfs volume and change the ownership of the
-image file to qemu:qemu:
-
-    mount -t nfs -o vers=3 <server>:/<vol-name> <mount-point>
-
-Change the ownership of the image file that was earlier created using
-the *qemu-img* utility:
-
-    chown qemu:qemu <mount-point>/<image>
-
-##### Create libvirt XML to define Virtual Machine
-
-*virt-install* is a python wrapper which is mostly used to create a VM
-from a set of params; however, *virt-install* doesn't support any
-network filesystem.
-
-Create a libvirt XML by hand instead, and see to it that the disk
-section is formatted in such a way that the QEMU driver for glusterfs
-is used. This can be seen in the following example disk description
-(the volume, image and host names are the sample ones used above;
-adapt them to your setup):
-
-    <disk type='network' device='disk'>
-      <driver name='qemu' type='qcow2' cache='none'/>
-      <source protocol='gluster' name='vol1/vm1.img'>
-        <host name='host.sample.com' port='24007'/>
-      </source>
-      <target dev='vda' bus='virtio'/>
-    </disk>
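-
-Before defining the VM, it can be worth confirming that QEMU can reach
-the image over libgfapi at all; a quick check, reusing the same sample
-names (this assumes a QEMU build with glusterfs support):
-
-    # read the image header straight over gluster://; failures here
-    # usually point back at the allow-insecure or ownership steps above
-    qemu-img info gluster://host.sample.com/vol1/vm1.img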
-
-##### Define the VM from XML
-
-Define the VM from the XML file that was created earlier:
-
-    virsh define <xml-file>
-
-Verify that the VM is created successfully:
-
-    virsh list --all
-
-##### Start the VM
-
-Start the VM:
-
-    virsh start <vm-name>
-
-##### Verification
-
-You can verify the disk image file that is being used by the VM:
-
-    virsh domblklist <vm-name>
-
-The above should show the volume name and the image name. Here is an
-example:
-
-    [root@test ~]# virsh domblklist vm-test2
-    Target     Source
-    ------------------------------------------------
-    vda        distrepvol/test.img
-    hdc        -
\ No newline at end of file
diff --git a/Feature Planning/GlusterFS 3.5/readdir ahead.md b/Feature Planning/GlusterFS 3.5/readdir ahead.md
deleted file mode 100644
index fe34a97..0000000
--- a/Feature Planning/GlusterFS 3.5/readdir ahead.md
+++ /dev/null
@@ -1,117 +0,0 @@
-Feature
--------
-
-readdir-ahead
-
-Summary
--------
-
-Provide read-ahead support for directories to improve sequential
-directory read performance.
-
-Owners
-------
-
-Brian Foster
-
-Current status
---------------
-
-Gluster currently does not attempt to improve directory read
-performance. As a result, simple operations (e.g., ls) on large
-directories are slow.
-
-Detailed Description
---------------------
-
-The read-ahead feature for directories is analogous to read-ahead for
-files. The objective is to detect sequential directory read operations
-and establish a pipeline for directory content. When a readdir request
-is received and fulfilled, subsequent readdir requests are issued to
-the server preemptively, in anticipation of those requests from the
-user. If sequential readdir requests are received, the directory
-content is immediately available in the client. If subsequent requests
-are not sequential or not received, the prefetched data is simply
-dropped and the optimization is bypassed.
-
-Benefit to GlusterFS
---------------------
-
-Improved read performance of large directories.
-
-### Scope
-
-Nature of proposed change
--------------------------
-
-readdir-ahead support is enabled through a new client-side translator.
-
-Implications on manageability
------------------------------
-
-None beyond the ability to enable and disable the translator.
-
-Implications on presentation layer
-----------------------------------
-
-N/A
-
-Implications on persistence layer
----------------------------------
-
-N/A
-
-Implications on 'GlusterFS' backend
------------------------------------
-
-N/A
-
-Modification to GlusterFS metadata
-----------------------------------
-
-N/A
-
-Implications on 'glusterd'
---------------------------
-
-N/A
-
-How To Test
------------
-
-Performance testing. Verify that sequential reads of large directories
-complete faster (e.g., via ls or xfs\_io -c readdir).
-
-User Experience
----------------
-
-Improved performance on sequential read workloads. The translator
-should otherwise be invisible, and should not degrade performance or
-disrupt behavior in any way.
-
-Dependencies
-------------
-
-N/A
-
-Documentation
--------------
-
-Set the associated config option to enable or disable directory
-read-ahead on a volume:
-
-    gluster volume set <volname> readdir-ahead [enable|disable]
-
-readdir-ahead is disabled by default.
-
-Status
-------
-
-Development complete for the initial version. Minor changes and bug
-fixes likely.
-
-Future versions might expand to provide generic caching and more
-flexible behavior.
-
-Comments and Discussion
------------------------
\ No newline at end of file
-- cgit