glusterfs.git/libglusterfs/src/mem-types.h, branch v3.7.9

libglusterfs: Use GF_CALLOC/GF_FREE instead of CALLOC/FREE

2015-07-09T05:00:26+00:00

- Also removed numbers for the types as the string form of type is printed in
  statedump now, so the numbers are not needed anymore.

 >Change-Id: I6e8c15a1dc8cb6187842f96f1d46ec0f26a602b4
 >BUG: 1237381
 >Signed-off-by: Pranith Kumar K 
 >Reviewed-on: http://review.gluster.org/11495
 >Tested-by: Gluster Build System 
 >Tested-by: NetBSD Build System 
 >Reviewed-by: Vijay Bellur 
 >(cherry picked from commit e7f2547f89dbcd90cdb3714f63620a36bdc2ef3a)

Change-Id: I60dbe14ba0c9855982e235001ee86747e3fa87a0
BUG: 1238476
Signed-off-by: Pranith Kumar K 
Reviewed-on: http://review.gluster.org/11501
Tested-by: NetBSD Build System 
Tested-by: Gluster Build System

rebalance: Introducing local crawl and parallel migration

2015-05-07T09:37:02+00:00

The current patch address two part of the design proposed.
1. Rebalance multiple files in parallel
2. Crawl only bricks that belong to the current node

Brief design explanation for the above two points.

1. Rebalance multiple files in parallel:
   -------------------------------------
The existing rebalance engine is single threaded. Hence, introduced
multiple threads which will be running parallel to the crawler. The
current rebalance migration is converted to a "Producer-Consumer"
frame work.

Where Producer is : Crawler
      Consumer is : Migrating Threads

Crawler: Crawler is the main thread. The job of the crawler is now
limited to fix-layout of each directory and add the files which are
eligible for the migration to a global queue in a round robin manner
so that we will use all the disk resources efficiently. Hence, the
crawler will not be "blocked" by migration process.

Producer: Producer will monitor the global queue. If any file is
added to this queue, it will dqueue that entry and migrate the file.
Currently 20 migration threads are spawned at the beginning of the
rebalance process. Hence, multiple file migration happens in parallel.

2. Crawl only bricks that belong to the current node:
   --------------------------------------------------
As rebalance process is spawned per node, it migrates only the files
that belongs to it's own node for the sake of load balancing. But it
also reads entries from the whole cluster, which is not necessary as
readdir hits other nodes.

New Design:
        As part of the new design the rebalancer decides the subvols
that are local to the rebalancer node by checking the node-uuid of
root directory prior to the crawler starts. Hence, readdir won't hit
the whole cluster  as it has already the context of local subvols and
also node-uuid request for each file can be avoided. This makes the
rebalance process "more scalable".

Change-Id: I6f1b44086a09df8ca23935fd213509c70cc0c050
BUG: 1217381
Signed-off-by: Susant Palai 
Reviewed-on: http://review.gluster.org/10466
Tested-by: Gluster Build System 
Tested-by: NetBSD Build System
Reviewed-by: N Balachandran

glusterd: Replace transaction peers lists

2015-04-13T06:30:02+00:00

Transaction peer lists were used in GlusterD to peers belonging to a
transaction. This was needed to prevent newly added peers performing
partial transactions, which could be incorrect.

This was accomplished by creating a seperate transaction peers list at
the beginning of every transaction. A transaction peers list referenced
the peerinfo data structures of the peers which were present at the
beginning of the transaction. RCU protection of peerinfos referenced by
the transaction peers list is a hard problem and difficult to do
correctly.

To have proper RCU protection of peerinfos, the transaction peers lists
have been replaced by an alternative method to identify peers that
belong to a transaction. The alternative method is to the global peers
list along with generation numbers to identify peers that should belong
to a transaction.

This change introduces a global peer list generation number, and a
generation number for each peerinfo object. Whenever a peerinfo object
is created, the global generation number is bumped, and the peerinfos
generation number is set to the bumped global generation.

With the above changes, the algorithm to identify peers belonging to a
transaction with RCU protection is as follows,
- At the beginning of a transaction, the current global generation
  number is saved
- To identify if a peers belonging to the transaction,
  - Start a RCU read critical section
  - For each peer in the global peers list,
    - If the peers generation number is not greater than the saved
      generation number, continue with the action on the peer
  - End the RCU read critical section

The above algorithm guarantees that,
- The peer list is not modified when a transaction is iterating through
  it
- The transaction actions are only done on peers that were present when
  the transaction started

But, as a transaction could iterate over the peers list multiple times,
the algorithm cannot guarantee that same set of peers will be selected
every time. A peer could get deleted between two iterations of the list
within a transaction. This problem existed with transaction peers list
as well, but unlike before now it will not lead to invalid memory access
and potential crashes. This problem will be addressed seprately.

This change was developed on the git branch at [1]. This commit is a
combination of the following commits on the development branch.
  52ded5b Add timespec_cmp
  44aedd8 Add create timestamp to peerinfo
  7bcbea5 Fix some silly mistakes
  13e3241 Add start time to opinfo
  17a6727 Use timestamp comparisions to identify xaction peers instead
          of a xaction peer list
  3be05b6 Correct check for peerinfo age
  70d5b58 Use read-critical sections for peer list iteration
  ba4dbca Use peerinfo timestamp checks in op-sm instead of xaction peer
          list
  d63f811 Add more peer status checks when iterating peers list in
          glusterd-syncop
  1998a2a Timestamp based peer list traversal of mgmtv3 xactions
  f3c1a42 Remove transaction peer lists
  b8b08ee Remove unused labels
  32e5f5b Remove 'npeers' usage
  a075fb7 Remove 'npeers' from mgmt-v3 framework
  12c9df2 Use generation number instead of timestamps.
  9723021 Remove timespec_cmp
  80ae2c6 Remove timespec.h include
  a9479b0 Address review comments on 10147/4

[1]: https://github.com/kshlm/glusterfs/tree/urcu

Change-Id: I9be1033525c0a89276f5b5d83dc2eb061918b97f
BUG: 1205186
Signed-off-by: Kaushal M 
Reviewed-on: http://review.gluster.org/10147
Tested-by: Gluster Build System 
Reviewed-by: Atin Mukherjee 
Reviewed-by: Anand Nekkunti 
Reviewed-by: Krishnan Parthasarathi 
Tested-by: Krishnan Parthasarathi

glusterd: Maintain local xaction_peer list for op-sm

2015-03-26T07:10:45+00:00

http://review.gluster.org/9269 addresses maintaining local xaction_peers in
syncop and mgmt_v3 framework. This patch is to maintain local xaction_peers list
for op-sm framework as well.

Change-Id: Idd8484463fed196b3b18c2df7f550a3302c6e138
BUG: 1204727
Signed-off-by: Atin Mukherjee 
Reviewed-on: http://review.gluster.org/9972
Reviewed-by: Anand Nekkunti 
Tested-by: Gluster Build System 
Reviewed-by: Krishnan Parthasarathi 
Tested-by: Krishnan Parthasarathi

features/bit-rot: Implementation of bit-rot xlator

2015-03-24T17:55:32+00:00

This is the "Signer" -- responsible for signing files with their
checksums upon last file descriptor close (last release()).
The event notification facility provided by the changelog xlator
is made use of.

Moreover, checksums are as of now SHA256 hash of the object data
and is the only available hash at this point of time. Therefore,
there is no special "what hash to use" type check, although it's
does not take much to add various hashing algorithms to sign
objects with. Signatures are stored in extended attributes of the
objects along with the the type of hashing used to calculate the
signature. This makes thing future proof when other hash types
are added. The signature  infrastructure is provided by bitrot
stub: a little piece of code that sits over the POSIX xlator
providing interfaces to "get or set" objects signature and it's
staleness.

Since objects are signed upon receiving release() notification,
pre-existing data which are "never" modified would never be
signed. To counter this, an initial crawler thread is spawned
The crawler scans the entire brick for objects that are unsigned
or "missed" signing due to the server going offline (node reboots,
crashes, etc..) and triggers an explicit sign. This would also
sign objects when bit-rot is enabled for a volume and/or after
upgrade.

Change-Id: I1d9a98bee6cad1c39c35c53c8fb0fc4bad2bf67b
BUG: 1170075
Original-Author: Raghavendra Bhat 
Signed-off-by: Venky Shankar 
Reviewed-on: http://review.gluster.org/9711
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

glusterfsd: add "print-netgroups" and "print-exports" command

2015-03-19T05:16:34+00:00

NFS now has the ability to use a separate file for "netgroups" and
"exports". An administrator should have the ability to check the
validity of the files before applying the configuration.

The "glusterfsd" command now has the following additional arguments that
can be used to check the configuration:

   --print-netgroups: Validate the netgroups file and print it out
   --print-exports: Validate the exports file and print it out

BUG: 1143880
Change-Id: I24c40d50110d49d8290f9fd916742f7e4d0df85f
URL: http://www.gluster.org/community/documentation/index.php/Features/Exports_Netgroups_Authentication
Original-author: Shreyas Siravara 
CC: Richard Wareing 
CC: Jiffin Tony Thottan 
Signed-off-by: Niels de Vos 
Reviewed-on: http://review.gluster.org/9365
Reviewed-by: Vijay Bellur 
Tested-by: Vijay Bellur

libglusterfs/rot-buffs: rotational buffers

2015-03-19T04:22:33+00:00

This patch introduces rotational buffers aiming at the classic
multiple producer and multiple consumer problem. A fixed set
of buffer list is allocated during initialization, where each
list consist of a list of buffers. Each buffer is an iovec
pointing to a memory region of fixed allocation size. Multiple
producers write data to these buffers. A buffer list starts with
a single buffer (iovec) and allocates more when required (although
this can be preallocatd in multiples of k).

rot-buffs allow multiple producers to write data parallely with
a bit of extra cost of taking locks. Therefore, it's much suited
for large writes. Multiple producers are allowed to write in the
buffer parallely by "reserving" write space for selected number
of bytes and returning pointer to the start of the reserved area.
The write size is selected by the producer before it starts the
write (which is often known). Therefore, the write itself need not
be serialized -- just the space reservation needs to be done safely.

The other part is when a consumer kicks in to consume what has
been produced. At this point, a buffer list switch is performed.
The "current" buffer list pointer is safely pointed to the next
available buffer list. New writes are now directed to the just
switched buffer list (the old buffer list is now considered out
of rotation). Note that the old buffer still may have producers
in progress (pending writes), so the consumer has to wait till
the writers are drained. Currently this is the slow path for
producers (write completion) and needs to be improved.

Currently, there is special handling for cases where the number
of consumers match (or exceed) the number of producers, which
could result in writer starvation. In this scenario, when a
consumers requests a buffer list for consumption, a check is
performed for writer starvation and consumption is denied
until at least another buffer list is ready of the producer
for writes, i.e., one (or more) consumer(s) completed, thereby
putting the buffer list back in rotation.

[
   NOTE:
   I've not performance tested this producer-consumer model
   yet. It's being used in changelog for event notification.
   The list of buffers (iovecs) are directly passed to RPC
   layer.
]

Change-Id: I88d235522b05ab82509aba861374a2312bff57f2
BUG: 1170075
Signed-off-by: Venky Shankar 
Reviewed-on: http://review.gluster.org/9706
Tested-by: Vijay Bellur

features/quota : Introducing inode quota

2015-03-19T01:24:12+00:00

==========================================================================
                             Inode quota
==========================================================================
= Currently, the only way to retrieve the number of files/objects in a   =
= directory or volume is to do a crawl of the entire directory/volume.   =
= This is expensive and is not scalable.                                 =
=                                                                        =
= The proposed mechanism will provide an easier alternative to determine =
= the count of files/objects in a directory or volume.                   =
=                                                                        =
= The new mechanism proposes to store count of objects/files as part of  =
= an extended attribute of a directory. Each directory's extended        =
= attribute value will indicate the number of files/objects present      =
= in a tree with the directory being considered as the root of the tree. =
=                                                                        =
= The count value can be accessed by performing a getxattr().            =
= Cluster translators like afr, dht and stripe will perform aggregation  =
= of count values from various bricks when getxattr() happens on the key =
= associated with file/object count.                                     =

A new interface is introduced:
------------------------------
        limit-objects  : limit the number of inodes at directory level
        list-objects   : list the directories where the limit is set
        remove-objects : remove the limit from the directory

==========================================================================

CLI COMMAND:
gluster volume quota  limit-objects   []

*  is a hard-limit for number of objects limitation for path ""
  If hard-limit is exceeded, creation of file/directory is no longer
permitted.

*  is a soft-limit for number of objects creation for path ""
  If soft-limit is exceeded, a warning is issued for each creation.

CLI COMMAND:
gluster volume quota  remove-objects [path]

==========================================================================

CLI COMMAND:
gluster volume quota  list-objects [path] ...

Sample output:
------------------
  Path                   Hard-limit Soft-limit   Used  Available
Soft-limit exceeded?
Hard-limit exceeded?
  ------------------------------------------------------------------------
--------------------------------------
  /dir                      10       80%          10       0
Yes
        Yes

==========================================================================

[root@snapshot-28 dir]# ls
a  b  file11  file12  file13  file14  file15  file16  file17
[root@snapshot-28 dir]# touch a1
touch: cannot touch `a1': Disk quota exceeded
* Nine files are created in directory "dir" and directory is included in
* the
count too. Hence the limit "10" is reached and further file creation
fails

==========================================================================

Note: We have also done some re-factoring in cli for volume name
validation. New function cli_validate_volname is created

==========================================================================

Change-Id: I1823497de4f790a2a20ebb1770293472ea33ee2b
BUG: 1190108
Signed-off-by: Sachin Pandit 
Signed-off-by: vmallika 
Reviewed-on: http://review.gluster.org/9769
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

Adding Libgfdb to GlusterFS

2015-03-18T17:36:42+00:00

*************************************************************************
			Libgfdb						|
*************************************************************************
Libgfdb provides abstract mechanism to record extra/rich metadata
required for data maintenance, such as data tiering/classification.
It provides consumer with API for recording and querying, keeping
the consumer abstracted from the data store used beneath for storing data.
It works in a plug-and-play model, where data stores can be plugged-in.
Presently we have plugin for Sqlite3. In the future will provide recording
and querying performance optimizer. In the current implementation the schema
of metadata is fixed.

Schema:
~~~~~~
      GF_FILE_TB Table:
      ~~~~~~~~~~~~~~~~~
      This table has one entry per file inode. It holds the metadata required to
      make decisions in data maintenance.
      GF_ID (Primary key)	: File GFID (Universal Unique IDentifier in the namespace)
      W_SEC, W_MSEC 		: Write wind time in sec & micro-sec
      UW_SEC, UW_MSEC		: Write un-wind time in sec & micro-sec
      W_READ_SEC, W_READ_MSEC 	: Read wind time in sec & micro-sec
      UW_READ_SEC, UW_READ_MSEC : Read un-wind time in sec & micro-sec
      WRITE_FREQ_CNTR INTEGER	: Write Frequency Counter
      READ_FREQ_CNTR INTEGER	: Read Frequency Counter

      GF_FLINK_TABLE:
      ~~~~~~~~~~~~~~
      This table has all the hardlinks to a file inode.
      GF_ID		: File GFID               (Composite Primary Key)``|
      GF_PID		: Parent Directory GFID  (Composite Primary Key)   |-> Primary Key
      FNAME 		: File Base Name          (Composite Primary Key)__|
      FPATH 		: File Full Path (Its redundant for now, this will go)
      W_DEL_FLAG 	: This Flag is used for crash consistancy, when a link is unlinked.
                  	  i.e Set to 1 during unlink wind and during unwind this record
                          is deleted
      LINK_UPDATE 	: This Flag is used when a link is changed i.e rename.
                          Set to 1 when rename wind and set to 0 in rename unwind

Libgfdb API:
~~~~~~~~~~~
Refer libglusterfs/src/gfdb/gfdb_data_store.h

Change-Id: I2e9fbab3878ce630a7f41221ef61017dc43db11f
BUG: 1194753
Signed-off-by: Joseph Fernandes 
Signed-off-by: Dan Lambright 
Signed-off-by: Joseph Fernandes 
Reviewed-on: http://review.gluster.org/9683
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

core : using gluster-like memory allocation for parse-utility feature

2015-03-09T17:42:46+00:00

Change-Id: I58dc7e0dc8d4ac4e10795e0536fcd0e1722116ed
BUG: 1143880
Signed-off-by: Jiffin Tony Thottan 
Reviewed-on: http://review.gluster.org/9830
Reviewed-by: Niels de Vos 
Tested-by: Gluster Build System