Feature
-------
Implement a negative lookup cache and hence improve create performance

Summary
-------
Before creating any file, lookups (1 in FUSE, 5-6 in SMB, etc.) are sent to verify whether the
file already exists. Serving these lookups from a cache when possible increases create
performance by multiple folds for SMB access and by some percentage for FUSE/NFS access.
A lookup that fails with ENOENT is called a negative lookup.

Owners
------
Poornima G <pgurusid@redhat.com>
Pranith Kumar Karampuri <pkarampu@redhat.com>
Rajesh Joseph <rjoseph@redhat.com>

Current status
--------------
In design phase.

Related Feature Requests and Bugs
---------------------------------

Detailed Description
--------------------

### Design:
To identify whether a lookup is on a non-existent file, we need to compare the filename in the
lookup against the filenames in the parent directory. Hence, we need some kind of dentry cache
and a way to ensure the consistency of that cache.

#### Which cache consistency model?

1. Cache entries and use directory leases for consistency
   A directory RW lease is taken by any client that is caching the existing entries and creating
   new entries in the directory. The lease is broken when any other client tries to create an
   entry in the directory. With this approach we would first need to implement directory leases
   and only then the negative lookup cache xlator. Directory lease implementation in the current
   dht setup becomes very complex, as directories are replicated on all the bricks and leases
   would have to be acquired on all of them. Since a lease is an in-memory state on the bricks,
   it also becomes harder to heal that state and to handle disconnects, network partitions, etc.
   Hence approach 2.

2. Cache entries and use upcall for consistency
   Directory entries are cached; whenever any other client creates an entry, the cache is
   invalidated and rebuilt afresh. With this there is a small window between the moment a file is
   actually created and the moment the upcall reaches the client. In this window the cache can
   return ENOENT for a lookup even though the file already exists on the bricks. But this window
   can occur even otherwise, because fops from multiple clients are not serialized. Applications
   that poll for a file being created are therefore not a suitable use case for caching. Hence we
   are going with this approach (a sketch of the invalidation path follows this list).
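
For illustration, a minimal sketch in C of that invalidation path. The names here
(`nlc_dir_ctx`, `nlc_handle_upcall`, the `DENTRY_CHANGED` flag) are hypothetical placeholders,
not the actual upcall or xlator API:

```c
/* Hypothetical sketch: drop a directory's cached entries when an upcall
 * reporting a dentry change arrives for that directory. The context is
 * simplified to two validity flags. */
#include <pthread.h>
#include <stdbool.h>

#define DENTRY_CHANGED 0x1 /* hypothetical upcall flag */

struct nlc_dir_ctx {
    pthread_mutex_t lock;
    bool pe_valid; /* is the positive-entry list trustworthy? */
    bool ne_valid; /* is the negative-entry list trustworthy? */
};

static void
nlc_handle_upcall(struct nlc_dir_ctx *ctx, int flags)
{
    if (!(flags & DENTRY_CHANGED))
        return;
    pthread_mutex_lock(&ctx->lock);
    /* Another client created or removed an entry: both lists are suspect.
     * Drop them; they are rebuilt lazily by later readdirp/lookups. */
    ctx->pe_valid = false;
    ctx->ne_valid = false;
    pthread_mutex_unlock(&ctx->lock);
}

int
main(void)
{
    struct nlc_dir_ctx ctx = { .lock = PTHREAD_MUTEX_INITIALIZER,
                               .pe_valid = true, .ne_valid = true };
    nlc_handle_upcall(&ctx, DENTRY_CHANGED); /* both flags drop to false */
    return 0;
}
```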

#### What to cache?
Positive entries (entries that are already present on disk) and/or negative entries (entries
that were looked up and found to be ENOENT). The names of the entries are cached, but the design
can be extended to cache a hash of the name instead ([1], suggested by Pranith). The cache is
bounded by an upper limit configured by the user.

The positive entries (PE):
-  Populated as a part of readdirp, and as a part of mkdir followed by creates inside that
   directory. Lookups and other fops do not populate the positive entries (the list can grow
   long and adds no value there).
-  Freed on receiving an upcall (with the dentry change flag) or when the cache timeout expires.

The negative entries (NE):
-  Populated only when a lookup/stat returns ENOENT. FUSE mostly sends only one lookup before a
   create, so the negative entry cache is of little use there; but SMB access sends multiple
   lookups/stats before creating a file, hence the negative entry cache. It can exist even when
   the positive entry cache is invalid.
-  Freed on receiving an upcall (with the dentry change flag) or when the cache timeout expires.

The PE list and the NE list can exist independently of each other, and each can be optionally disabled. A rough sketch of a per-directory cache context holding both lists is given below.
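
As a rough illustration (type and field names are assumptions, not the actual xlator code), a
per-directory cache context might look like this:

```c
/* Hypothetical per-directory cache context: independent, bounded PE and
 * NE lists plus a timestamp for timeout-based expiry. */
#include <stdint.h>
#include <time.h>

enum nlc_pe_state {
    PE_INVALID, /* nothing cached */
    PE_PARTIAL, /* some entries cached, list incomplete */
    PE_FULL,    /* complete listing cached, e.g. after readdirp */
};

struct nlc_entry {
    char             *name; /* or a hash of it, see [1] */
    struct nlc_entry *next;
};

struct nlc_dir_ctx {
    enum nlc_pe_state pe_state;
    struct nlc_entry *pe_list;    /* entries known to exist */
    uint64_t          pe_count;
    struct nlc_entry *ne_list;    /* names known to return ENOENT; usable
                                     even when pe_state == PE_INVALID */
    uint64_t          ne_count;   /* both counts bounded by the user limit */
    time_t            cache_time; /* basis for the whole-cache timeout */
};
```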

#### Which fops get served from the cache?
1. Lookup/stat, if it is a negative lookup
2. Getxattr on get_real_filename (a case-insensitive lookup)

Pseudocode for the lookup/stat and getxattr fops (what the draft called PE_VALID corresponds to
PE_FULL below: only a complete PE list can prove that a name is absent):

```
lookup/stat () {
    if (an inode is found in the inode table for this name)
        goto out;
    if (parent NE state == NE_VALID) {
        if (name found in NE list)
            STACK_UNWIND (ENOENT);      /* served from the cache */
        else
            goto out;
    }
    if (parent PE state == PE_FULL) {
        if (name not found in PE list)
            STACK_UNWIND (ENOENT);      /* complete listing proves absence */
        else
            goto out;
    } else if (parent PE state == PE_PARTIAL || parent PE state == PE_INVALID) {
        goto out;
    }
out:
    STACK_WIND;                         /* fall through to the bricks */
}

getxattr (get_real_filename) {
    if (inode PE state == PE_FULL || inode PE state == PE_PARTIAL) {
        if (name found in case-insensitive PE search)
            STACK_UNWIND (found filename);
        else if (inode PE state == PE_FULL)
            STACK_UNWIND (ENOENT);
        else                            /* PE_PARTIAL: cannot prove absence */
            STACK_WIND;
    } else {
        STACK_WIND;
    }
}
```
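
The same decision logic for lookup, factored as a pure C function so it can be read and tested
in isolation; this is an illustrative sketch with made-up names, not the xlator code:

```c
/* Illustrative mirror of the lookup pseudocode above: decide whether a
 * lookup can be answered ENOENT from the cache or must go to the bricks. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

enum pe_state { PE_INVALID, PE_PARTIAL, PE_FULL };
enum decision { SERVE_ENOENT, WIND_TO_BRICKS };

static bool
in_list(const char **list, size_t n, const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(list[i], name) == 0)
            return true;
    return false;
}

static enum decision
nlc_lookup_decision(bool ne_valid, const char **ne, size_t ne_n,
                    enum pe_state pe, const char **pe_list, size_t pe_n,
                    const char *name)
{
    if (ne_valid) {
        if (in_list(ne, ne_n, name))
            return SERVE_ENOENT;  /* cached negative entry */
        return WIND_TO_BRICKS;    /* unknown name: ask the bricks */
    }
    if (pe == PE_FULL && !in_list(pe_list, pe_n, name))
        return SERVE_ENOENT;      /* complete listing proves absence */
    return WIND_TO_BRICKS;        /* PE hit, partial, or invalid */
}

int
main(void)
{
    const char *ne[] = { "missing.txt" };
    const char *pe[] = { "present.txt" };
    /* Prints 0 (SERVE_ENOENT): the name is in the valid NE list. */
    printf("%d\n", nlc_lookup_decision(true, ne, 1, PE_FULL, pe, 1,
                                       "missing.txt"));
    return 0;
}
```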

#### Data structures to store the cache?
The first thought would be a linked list of filenames. However, a string compare against every
entry in the PE and NE lists is CPU-intensive and may cause a performance hit in some cases.
Lookups can be categorized as: fresh +ve lookups, subsequent (+ve) lookups, and -ve lookups.

Subsequent lookups: The inode is already present in the inode table, so we need not search the
                    PE/NE lists; these do not suffer a performance hit.

Fresh +ve lookups:  In the worst case the name has to be compared with every entry in the NE and
                    PE lists. This causes some performance hit.

Fresh -ve lookups:  In the worst case the name has to be compared with every entry in the NE list
                    to find that it is not there, and if PE is INVALID/PARTIAL the lookup has to
                    go to the bricks anyway. (PE_FULL is not a bad case, as the result can be
                    returned from the cache instead of a network fop.)

Fresh +ve lookups will take some performance hit, hence we need something better than a
sequential string compare of names:
- Store a hash instead of the name, as suggested in [1]. In that case get_real_filename cannot be
  served unless we store two hashes (one of the name as-is, for lookups, and one of the
  lower-cased name, for get_real_filename); see the sketch after this list.
- Store the entries in an htree, so that the search time reduces; but adding to and removing from
  a tree is not an O(1) operation.
- Set an upper limit on the number of NEs and PEs in the linked lists.

For the initial implementation we are going with a linked list for the negative entries, and a
list of inode refs for the positive entries.
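
A rough sketch of the two-hash idea from the list above. The struct, names, and the choice of
DJB2 as the hash are illustrative assumptions, not the actual implementation:

```c
/* Illustrative two-hash entry: one hash of the name as-is serves exact
 * lookups, one of the lower-cased name serves get_real_filename. */
#include <ctype.h>
#include <stdint.h>
#include <stdio.h>

/* DJB2: a simple, widely used string hash; any hash function would do. */
static uint32_t
djb2_hash(const char *s, int lower)
{
    uint32_t h = 5381;
    for (; *s; s++)
        h = h * 33 + (lower ? (uint32_t)tolower((unsigned char)*s)
                            : (uint32_t)(unsigned char)*s);
    return h;
}

struct nlc_hashed_entry {
    uint32_t name_hash;  /* case-preserving: serves lookup/stat */
    uint32_t lname_hash; /* lower-cased: serves get_real_filename */
};

int
main(void)
{
    struct nlc_hashed_entry e = {
        .name_hash  = djb2_hash("File.TXT", 0),
        .lname_hash = djb2_hash("File.TXT", 1),
    };
    /* A case-insensitive probe hashes the lower-cased query name. */
    printf("match: %d\n", e.lname_hash == djb2_hash("file.txt", 1));
    return 0;
}
```

Note that with hashes alone a match can be a collision, so a hash hit (especially on the NE
list, where it would produce a spurious ENOENT) would need either the full name kept alongside
for verification or a fallback STACK_WIND; a hash miss, on the other hand, is definitive.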

#### Other things to note?
- The timeout can apply to the whole of an inode's cache, in which case any further update of
  the cache does not refresh the cache time. Alternatively, the timeout can be per entry of the
  PE/NE lists, but that leads to a lot of bookkeeping and timers. The expiry cleanup can be done
  by a separate reaper thread, or we can register a timer handler for each inode in the
  timerwheel (which scales well). A sketch of the freshness check is given below.
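
Independent of whether a reaper thread or the timerwheel frees expired entries, the
whole-inode-cache freshness test itself could look like this minimal sketch (names and the
default timeout are assumptions):

```c
/* Hypothetical whole-inode-cache freshness check: the cache carries one
 * timestamp, set when it is (re)built, and is treated as stale once the
 * configured timeout elapses. */
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

#define NLC_CACHE_TIMEOUT 60 /* seconds; placeholder default */

struct nlc_dir_cache {
    time_t cache_time; /* when this cache was (re)built */
};

static bool
nlc_cache_is_fresh(const struct nlc_dir_cache *c)
{
    /* Updating individual entries does not bump cache_time, so the whole
     * cache expires NLC_CACHE_TIMEOUT seconds after it was built. */
    return (time(NULL) - c->cache_time) < NLC_CACHE_TIMEOUT;
}

int
main(void)
{
    struct nlc_dir_cache c = { .cache_time = time(NULL) - 120 };
    printf("fresh: %d\n", nlc_cache_is_fresh(&c)); /* prints 0: stale */
    return 0;
}
```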

[1] http://lists.gluster.org/pipermail/gluster-users/2016-August/028107.html

Benefit to GlusterFS
--------------------

Improves file/directory create performance.

Scope
-----

#### Nature of proposed change

- Adds a new xlator, plus glusterd changes to enable that xlator

#### Implications on manageability

N/A

#### Implications on presentation layer

N/A

#### Implications on persistence layer

N/A

#### Implications on 'GlusterFS' backend

N/A

#### Modification to GlusterFS metadata

N/A

#### Implications on 'glusterd'

GlusterD changes are integral to this feature, and are described above.

How To Test
-----------

For the most part, testing is of the "do no harm" sort; the most thorough test of this feature
is to run our current regression suite.
Some specific test cases include creates on all kinds of volumes and from multiple clients:
- distribute
- replicate
- shard
- disperse
- tier
Also, creates while:
- rebalance in progress
- tiering migration in progress
- self heal in progress

All of these test cases should be run while the memory consumption of the process is
monitored.

User Experience
---------------

Faster file and directory creates.

Dependencies
------------

N/A

Documentation
-------------

TBD (very little)

Status
------

Development in progress

Comments and Discussion
-----------------------

N/A