Tiering
=======

* Feature page: http://www.gluster.org/community/documentation/index.php/Features/data-classification
* Design: goo.gl/bkU5qv

Theory of operation
-------------------

The tiering feature enables different storage types to be used by the same
logical volume. In Gluster 3.7, the two types are classified as "cold" and
"hot", and are represented as two groups of bricks. The hot group acts as a
cache for the cold group. The bricks within the two groups themselves are
arranged according to standard Gluster volume conventions, e.g. replicated,
distributed replicated, or dispersed.

A normal gluster volume can become a tiered volume by "attaching" bricks to
it. The attached bricks become the "hot" group. The bricks within the
original gluster volume are the "cold" bricks. For example, the original
volume may be dispersed on HDD, and the "hot" tier could be
distributed-replicated SSDs.

Once this new "tiered" volume is built, I/Os to it are subjected to caching
heuristics:

* All I/Os are forwarded to the hot tier.

* If a lookup fails on the hot tier, the I/O is forwarded to the cold tier.
  This is a "cache miss".

* Files on the hot tier that are not touched within some time are demoted
  (moved) to the cold tier (see performance parameters, below).

* Files on the cold tier that are touched one or more times are promoted
  (moved) to the hot tier (see performance parameters, below).

This resembles implementations by Ceph and the Linux data management (DM)
component.

Performance enhancements being considered include:

* Biasing migration of large files over small.

* Only demoting when the hot tier is close to full.

* Write-back cache for database updates.

Code organization
-----------------

The design endeavors to be upward compatible with future migration
policies, such as scheduled file migration, data classification, etc. For
example, the caching logic is self-contained and separate from the file
migration. A different set of migration policies could use the same
underlying migration engine. The I/O tracking and metadata store components
are intended to be reusable for purposes other than caching semantics.

Meta data:

A database stores metadata about the files. Entries within it are added or
removed by the changetimerecorder translator. The database is queried by
the migration daemon, and the results of those queries determine which
files are migrated. The database resides within the libgfdb subdirectory.
There is one database for each brick.

The database is currently sqlite. However, the libgfdb library API is not
tied to sqlite, and a different database could be used. For more
information on libgfdb, see the doc file libgfdb.txt.

I/O tracking:

The changetimerecorder server-side translator generates metadata about I/Os
as they happen. The metadata is entered into the database after the I/O
completes. Internal I/Os are not included.

Migration daemon:

When a tiered volume is created, a migration daemon starts. There is one
daemon per node for every tiered volume. The daemon sleeps and then
periodically queries the database for files to promote or demote. The query
callbacks assemble files into a list, which is then enumerated. The
frequencies at which promotions and demotions happen are subject to user
configuration.

Selected files are migrated between the tiers using existing DHT migration
logic. The tier translator will leverage DHT rebalance performance
enhancements.
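Below is a minimal, self-contained C sketch of the query-and-collect
pattern described above. It is illustrative only: the real daemon goes
through the libgfdb API rather than issuing raw SQL, and the table and
column names used here (`gf_file_tb`, `gfid`, `write_sec`) are placeholders,
not the actual libgfdb schema.

```c
/*
 * Sketch of the migration daemon's query pattern: run a query against a
 * brick's sqlite database and let the callback assemble candidate files
 * into a list, which is then enumerated.  Table/column names are
 * hypothetical; the real code uses the libgfdb API, not raw SQL.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sqlite3.h>

struct file_entry {
    char gfid[64];
    struct file_entry *next;
};

/* sqlite3_exec() callback: append one row (one candidate file) to the list. */
static int
collect_file (void *data, int ncols, char **values, char **names)
{
    struct file_entry **head = data;
    struct file_entry *entry;

    (void) names;
    if (ncols < 1 || !values[0])
        return 0;                     /* skip rows with no gfid */

    entry = calloc (1, sizeof (*entry));
    if (!entry)
        return 1;                     /* abort the query */

    snprintf (entry->gfid, sizeof (entry->gfid), "%s", values[0]);
    entry->next = *head;
    *head = entry;
    return 0;
}

int
main (int argc, char *argv[])
{
    sqlite3 *db = NULL;
    struct file_entry *list = NULL;
    char *err = NULL;

    if (argc < 2 || sqlite3_open (argv[1], &db) != SQLITE_OK) {
        fprintf (stderr, "cannot open brick database\n");
        return 1;
    }

    /* Hypothetical query: files not written in the last hour, i.e.
       demotion candidates.  Threshold and schema are placeholders. */
    const char *sql =
        "SELECT gfid FROM gf_file_tb "
        "WHERE write_sec < strftime('%s','now') - 3600;";

    if (sqlite3_exec (db, sql, collect_file, &list, &err) != SQLITE_OK) {
        fprintf (stderr, "query failed: %s\n", err);
        sqlite3_free (err);
    }

    /* Enumerate the assembled list; the daemon would hand each file to
       the DHT migration code at this point. */
    while (list) {
        struct file_entry *next = list->next;
        printf ("candidate: %s\n", list->gfid);
        free (list);
        list = next;
    }

    sqlite3_close (db);
    return 0;
}
```

The same pattern applies to promotion, with the query instead selecting
recently read or written files from the cold tier's databases.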
tier translator:

The tier translator is the root node in tiered volumes. The first subvolume
is the cold tier, and the second is the hot tier.

DHT logic for forwarding I/Os is largely unchanged. Exceptions are handled
according to the dht_methods_t structure, which forks control according to
DHT or tier type (a simplified sketch of this dispatch appears below, after
the glusterd section). The major exception is that DHT's layout is not used
for choosing hashed subvolumes; rather, the hot tier is always the hashed
subvolume.

Changes to DHT were made to allow "stacking", i.e. DHT over DHT:

* readdir operations remember the index of the "leaf node" in the volume
  graph (client id), rather than a unique index for each DHT instance.

* Each DHT instance uses a unique extended attribute for tracking
  migration.

* In certain cases, it is legal for tiered volumes to have unpopulated
  inodes (whereas this would be an error in DHT's case).

Currently, tiered volume expansion (adding and removing bricks) is
unsupported.

glusterd:

The tiered volume tree is a composition of two other volumes. The glusterd
daemon builds it. Existing logic for adding and removing bricks is heavily
leveraged to attach and detach tiers, and to perform statistics collection.
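To make the dht_methods_t fork described in the tier translator section
more concrete, here is a minimal, self-contained C sketch of the dispatch
pattern. The structure and function names are hypothetical and do not match
the real dht_methods_t in the GlusterFS sources; it only shows how a method
table lets plain DHT hash a name across its subvolumes while the tier
variant always selects the hot (second) subvolume.

```c
/*
 * Illustrative sketch of a DHT-vs-tier method table.  Field and function
 * names are hypothetical; the real dht_methods_t dispatches several
 * operations, not just hashed-subvolume selection.
 */
#include <stdio.h>

#define SUBVOL_COLD 0   /* first subvolume: cold tier */
#define SUBVOL_HOT  1   /* second subvolume: hot tier */

struct xlator_methods {
    /* pick the subvolume that "hashes" a given file name */
    int (*pick_hashed_subvol) (const char *name, int subvol_count);
};

/* Plain DHT: hash the name across all subvolumes (toy hash, not the
   Davies-Meyer hash the real code uses). */
static int
dht_pick_hashed_subvol (const char *name, int subvol_count)
{
    unsigned int hash = 5381;
    while (*name)
        hash = hash * 33 + (unsigned char) *name++;
    return hash % subvol_count;
}

/* Tier: the hot tier is always the hashed subvolume. */
static int
tier_pick_hashed_subvol (const char *name, int subvol_count)
{
    (void) name;
    (void) subvol_count;
    return SUBVOL_HOT;
}

static const struct xlator_methods dht_methods  = { dht_pick_hashed_subvol };
static const struct xlator_methods tier_methods = { tier_pick_hashed_subvol };

int
main (void)
{
    const char *files[] = { "alpha", "beta", "gamma" };
    int n = sizeof (files) / sizeof (files[0]);

    for (int i = 0; i < n; i++)
        printf ("%-6s  dht -> subvol %d   tier -> subvol %d\n",
                files[i],
                dht_methods.pick_hashed_subvol (files[i], 4),
                tier_methods.pick_hashed_subvol (files[i], 2));
    return 0;
}
```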