dht: improve transform/detransform of d_off (and be ext4 safe)

The scheme to encode brick d_off and brick id into global d_off has
two approaches. Since brick d_off and global d_off are both 64 bits
wide, we need to be careful about how the brick id is encoded.

Filesystems like XFS always give a d_off which fits within 32 bits. So
we have another 32 bits (actually 31, in this scheme, as seen ahead) to
encode the brick id - which is typically plenty.

Filesystems like the recent EXT4 utilize up to 63 low bits in d_off,
as the d_off is calculated based on a hash function value. This leaves
us no "unused" bits to encode the brick id.

However both these filesystems (EXT4 more importantly) are "tolerant"
in terms of the accuracy of the value presented back in seekdir().
i.e., a seekdir(val) actually seeks to the entry which has the
"closest" true offset.

This "two-prong" scheme exploits this behavior - which seems to be the
best middle ground amongst various approaches and has all the
advantages of the old approach:

- Works against XFS and EXT4, the two most common filesystems out
  there. (which wasn't an "advantage" of the old approach as it is
  broken against EXT4)

- Probably works against most of the others as well. The ones which
  would NOT work are those which return HUGE d_offs _and_ are NOT
  tolerant to seekdir() to the "closest" true offset.

- Nothing to "remember in memory" or evict "old entries".

- Works fine across NFS server reboots and also NFS head failover.

- Tolerant to seekdir() to arbitrary locations.

Algorithm:

Each d_off can be encoded in either of the two schemes. There is no
requirement to encode all d_offs of a directory or a reply-set in the
same scheme.

The topmost bit of the 64 bits is used to specify the "type" of
encoding of this particular d_off. If the topmost bit (bit-63) is 1,
it indicates that the encoding scheme holds a HUGE d_off. If the
topmost bit is 0, it indicates that the "small" d_off encoding scheme
is used.

The goal of the "small" d_off encoding is to stay as dense as possible
towards the lower bits even in the global d_off.

The goal of the HUGE d_off encoding is to stay as accurate (close) as
possible to the "true" d_off after a round of encoding and decoding.

If DHT has N subvolumes, we need ROOF(Log2(N)) "bits" to encode the
brick ID (call it "n").

SMALL d_off
===========

Encoding
--------

If the top n + 1 bits are free in a brick offset, then we leave the
top bit as 0 and set the remaining bits based on the old formula:

  hi_mask = 0xffffffffffffffff

  hi_mask = ~(hi_mask >> (n + 1))

  if ((hi_mask & d_off_brick) != 0)
    do_large_d_off_encoding ()

  d_off_global = (d_off_brick * N) + brick_id

Decoding
--------

If the top bit in the global offset is 0, it indicates that this is
the encoding formula used. So decoding such a global offset will be
like the old formula:

  if ((d_off_global & 0x8000000000000000) != 0)
    do_large_d_off_decoding()

  d_off_brick = (d_off_global / N)

  brick_id = d_off_global % N

HUGE d_off
==========

Encoding
--------

If the top n + 1 bits are NOT free in a given brick offset, then we
set the top bit as 1 in the global offset. The low n bits are replaced
by brick_id.

  low_mask = 0xffffffffffffffff << n // where n is ROOF(Log2(N))

  d_off_global = (0x8000000000000000 | (d_off_brick & low_mask)) + brick_id

  if (d_off_global == 0xffffffffffffffff)
    discard_entry();

Decoding
--------

If the top bit in the global offset is set to 1, it indicates that the
encoding formula used is the one above. So decoding would look like:

  hi_mask = (0xffffffffffffffff << n)

  low_mask = ~(hi_mask)

  d_off_brick = (global_d_off & hi_mask & 0x7fffffffffffffff)

  brick_id = global_d_off & low_mask

If "losing" the low n bits in this decoding of d_off_brick looks
"scary", we need to realize that till recently EXT4 used to only
return what can now be expressed as (d_off_global >> 32). The extra
31 bits of hash added by EXT4 recently only decrease the probability
of a collision and do not eliminate it completely anyway. In a way,
the "lost" n bits are made up for by decreasing the probability of
collision by sharding the files into N bricks / EXT directories --
call it "hash hedging", if you will :-)

Thanks-to: Zach Brown <zab@redhat.com>
Change-Id: Ieba9a7071829d51860b7c131982f12e0136b9855
BUG: 838784
Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-on: http://review.gluster.org/4711
Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
Tested-by: Gluster Build System <jenkins@build.gluster.com>
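The two encodings described in the commit message can be tried out in isolation. The following is a minimal standalone sketch of that description, not the code from the patch: the names `bits_for`, `encode_d_off`, `decode_d_off` and `TOP_BIT_64` are hypothetical, the full 64-bit width from the commit message is assumed, and the `discard_entry()` case for an all-ones result is left to the caller.

```c
#include <assert.h>
#include <stdint.h>

#define TOP_BIT_64 (UINT64_C(1) << 63)

/* ROOF(Log2(N)): bits needed to hold brick ids 0 .. n_bricks-1. */
unsigned bits_for(uint64_t n_bricks)
{
    unsigned bits = 0;
    uint64_t ctrl = 1;

    while (ctrl < n_bricks) {
        ctrl *= 2;
        bits++;
    }
    return bits;
}

/* Encode a brick-local d_off and a brick id into one global d_off. */
uint64_t encode_d_off(uint64_t d_off_brick, uint64_t brick_id,
                      uint64_t n_bricks)
{
    /* Assumption of this sketch: at least two bricks, sane brick id. */
    assert(n_bricks >= 2 && n_bricks <= (UINT64_C(1) << 32));
    assert(brick_id < n_bricks);

    unsigned n = bits_for(n_bricks);
    uint64_t hi_mask = ~(UINT64_MAX >> (n + 1));

    if ((d_off_brick & hi_mask) == 0) {
        /* small d_off: top n + 1 bits are free, result keeps bit 63 clear */
        return d_off_brick * n_bricks + brick_id;
    }

    /* HUGE d_off: set bit 63, overwrite the low n bits with the brick id.
     * A result of 0xffffffffffffffff must be discarded by the caller. */
    uint64_t low_mask = UINT64_MAX << n;
    return TOP_BIT_64 | (d_off_brick & low_mask) | brick_id;
}

/* Split a global d_off back into brick-local d_off and brick id. */
void decode_d_off(uint64_t d_off_global, uint64_t n_bricks,
                  uint64_t *d_off_brick, uint64_t *brick_id)
{
    if (d_off_global & TOP_BIT_64) {
        /* HUGE d_off: the low n bits of the brick offset are lost */
        uint64_t hi_mask = UINT64_MAX << bits_for(n_bricks);

        *d_off_brick = d_off_global & hi_mask & ~TOP_BIT_64;
        *brick_id = d_off_global & ~hi_mask;
    } else {
        /* small d_off: exact inverse of (d_off_brick * N) + brick_id */
        *d_off_brick = d_off_global / n_bricks;
        *brick_id = d_off_global % n_bricks;
    }
}
```

With N = 4 (so n = 2), a small offset such as 1000 on brick 3 round-trips exactly, while a 63-bit offset on brick 2 comes back with its low two bits cleared, which is the inaccuracy that the seekdir() tolerance is expected to absorb.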
author: Anand Avati <avati@redhat.com> 2013-03-17 03:32:44 -0700
committer: Anand Avati <avati@redhat.com> 2013-04-01 13:13:01 -0700
commit: e0616e9314c8323dc59fca7cad6972f08d72b936
tree: 760b424e0ee21caff6193722622fb2750add2aa1 /xlators
parent: 25053c9bdaf16f150815fb99f725bd037a49e97e
1 files changed, 82 insertions, 5 deletions
diff --git a/xlators/cluster/dht/src/dht-helper.c b/xlators/cluster/dht/src/dht-helper.c
index 4f50e214f2b..e91352c4ccb 100644
--- a/xlators/cluster/dht/src/dht-helper.c
+++ b/xlators/cluster/dht/src/dht-helper.c
@@ -40,6 +40,43 @@ dht_frame_return (call_frame_t *frame)
 }
 
 
+static uint64_t
+dht_bits_for (uint64_t num)
+{
+	uint64_t bits = 0, ctrl = 1;
+
+	while (ctrl < num) {
+		ctrl *= 2;
+		bits ++;
+	}
+
+	return bits;
+}
+
+/*
+ * A slightly "updated" version of the algorithm described in the commit log
+ * is used here.
+ *
+ * The only enhancement is that:
+ *
+ * - The number of bits used by the backend filesystem for HUGE d_off which
+ *   is described as 63, and
+ * - The number of bits used by the d_off presented by the transformation
+ *   upwards which is described as 64, are both made "configurable."
+ */
+
+
+#define BACKEND_D_OFF_BITS 63
+#define PRESENT_D_OFF_BITS 63
+
+#define ONE 1ULL
+#define MASK (~0ULL)
+#define PRESENT_MASK (MASK >> (64 - PRESENT_D_OFF_BITS))
+#define BACKEND_MASK (MASK >> (64 - BACKEND_D_OFF_BITS))
+
+#define TOP_BIT (ONE << (PRESENT_D_OFF_BITS - 1))
+#define SHIFT_BITS (max (0, (BACKEND_D_OFF_BITS - PRESENT_D_OFF_BITS + 1)))
+
 int
 dht_itransform (xlator_t *this, xlator_t *subvol, uint64_t x, uint64_t *y_p)
 {
@@ -47,6 +84,9 @@ dht_itransform (xlator_t *this, xlator_t *subvol, uint64_t x, uint64_t *y_p)
         int         cnt = 0;
         int         max = 0;
         uint64_t    y = 0;
+        uint64_t    hi_mask = 0;
+        uint64_t    off_mask = 0;
+        int         max_bits = 0;
 
         if (x == ((uint64_t) -1)) {
                 y = (uint64_t) -1;
@@ -60,7 +100,23 @@ dht_itransform (xlator_t *this, xlator_t *subvol, uint64_t x, uint64_t *y_p)
         max = conf->subvolume_cnt;
         cnt = dht_subvol_cnt (this, subvol);
 
-        y = ((x * max) + cnt);
+	if (max == 1) {
+		y = x;
+		goto out;
+	}
+
+        max_bits = dht_bits_for (max);
+
+        hi_mask = ~(PRESENT_MASK >> (max_bits + 1));
+
+        if (x & hi_mask) {
+                /* HUGE d_off */
+                off_mask = MASK << max_bits;
+                y = TOP_BIT | ((x >> SHIFT_BITS) & off_mask) | cnt;
+        } else {
+                /* small d_off */
+                y = ((x * max) + cnt);
+        }
 
 out:
         if (y_p)
@@ -135,16 +191,38 @@ dht_deitransform (xlator_t *this, uint64_t y, xlator_t **subvol_p,
         int         max = 0;
         uint64_t    x = 0;
         xlator_t   *subvol = 0;
+        int         max_bits = 0;
+        uint64_t    off_mask = 0;
+        uint64_t    host_mask = 0;
 
         if (!this->private)
-                goto out;
+                return -1;
 
         conf = this->private;
         max = conf->subvolume_cnt;
 
-        cnt = y % max;
-        x   = y / max;
+	if (max == 1) {
+		x = y;
+		cnt = 0;
+		goto out;
+	}
+
+        if (y & TOP_BIT) {
+                /* HUGE d_off */
+                max_bits = dht_bits_for (max);
+                off_mask = (MASK << max_bits);
+                host_mask = ~(off_mask);
+
+                x = ((y & ~TOP_BIT) & off_mask) << SHIFT_BITS;
+
+                cnt = y & host_mask;
+	} else {
+                /* small d_off */
+                cnt = y % max;
+                x = y / max;
+        }
 
+out:
         subvol = conf->subvolumes[cnt];
 
         if (subvol_p)
@@ -153,7 +231,6 @@ dht_deitransform (xlator_t *this, uint64_t y, xlator_t **subvol_p,
         if (x_p)
                 *x_p = x;
 
-out:
         return 0;
 }
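The patch differs from the commit-log description in that both widths are 63 bits, so the type flag lands on bit 62 and HUGE offsets are shifted right by one before masking. The sketch below restates that arithmetic outside of GlusterFS to make the derived constants and the resulting precision loss visible. It is an illustration only: `subvol_bits`, `itransform_sketch` and `deitransform_sketch` are hypothetical names, the xlator and `conf` plumbing is replaced by plain integer arguments, and the `max()` macro is spelled out as a conditional expression.

```c
#include <assert.h>
#include <stdint.h>

#define BACKEND_D_OFF_BITS 63
#define PRESENT_D_OFF_BITS 63

#define MASK         (~UINT64_C(0))
#define PRESENT_MASK (MASK >> (64 - PRESENT_D_OFF_BITS))
#define TOP_BIT      (UINT64_C(1) << (PRESENT_D_OFF_BITS - 1))
/* max (0, BACKEND_D_OFF_BITS - PRESENT_D_OFF_BITS + 1), written out */
#define SHIFT_BITS   ((BACKEND_D_OFF_BITS - PRESENT_D_OFF_BITS + 1) > 0 ? \
                      (BACKEND_D_OFF_BITS - PRESENT_D_OFF_BITS + 1) : 0)

/* Same loop as dht_bits_for(): smallest b with 2^b >= num. */
unsigned subvol_bits(uint64_t num)
{
    unsigned bits = 0;
    uint64_t ctrl = 1;

    while (ctrl < num) {
        ctrl *= 2;
        bits++;
    }
    return bits;
}

/* Brick offset x on subvolume index cnt -> global offset. */
uint64_t itransform_sketch(uint64_t x, uint64_t cnt, uint64_t max_subvols)
{
    assert(max_subvols >= 1 && cnt < max_subvols);

    if (x == UINT64_MAX)        /* (uint64_t) -1 is passed through as is */
        return UINT64_MAX;
    if (max_subvols == 1)       /* single subvolume: nothing to encode */
        return x;

    unsigned max_bits = subvol_bits(max_subvols);
    uint64_t hi_mask = ~(PRESENT_MASK >> (max_bits + 1));

    if (x & hi_mask) {
        /* HUGE d_off: drop SHIFT_BITS low bits, then the low max_bits */
        uint64_t off_mask = MASK << max_bits;
        return TOP_BIT | ((x >> SHIFT_BITS) & off_mask) | cnt;
    }
    /* small d_off */
    return x * max_subvols + cnt;
}

/* Global offset y -> brick offset and subvolume index. */
void deitransform_sketch(uint64_t y, uint64_t max_subvols,
                         uint64_t *x_p, uint64_t *cnt_p)
{
    if (max_subvols == 1) {
        *x_p = y;
        *cnt_p = 0;
    } else if (y & TOP_BIT) {
        /* HUGE d_off */
        uint64_t off_mask = MASK << subvol_bits(max_subvols);

        *x_p = ((y & ~TOP_BIT) & off_mask) << SHIFT_BITS;
        *cnt_p = y & ~off_mask;
    } else {
        /* small d_off */
        *cnt_p = y % max_subvols;
        *x_p = y / max_subvols;
    }
}
```

With both widths at 63, `TOP_BIT` is bit 62 and `SHIFT_BITS` is 1, so a HUGE offset loses its low `max_bits + 1` bits on a round trip rather than the n bits described in the commit log. For four subvolumes that means a decoded offset within 8 of the original.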