Feature
-------

zerofill API for GlusterFS

Summary
-------

The zerofill() API allows creation of pre-allocated, zeroed-out files
on GlusterFS volumes by offloading the zeroing to the server and/or the
storage (storage offload uses SCSI WRITE SAME).

Owners
------

Bharata B Rao  
M. Mohankumar

Current status
--------------

Patch on gerrit: <http://review.gluster.org/5327>

Detailed Description
--------------------

This feature adds support for a new ZEROFILL fop, which writes zeroes to
a file over a specified range. The fop is useful when a whole file needs
to be initialized with zeroes, for example when provisioning zero-filled
VM disk images or when scrubbing VM disk images.

A client or application issues this fop to zero out a range, and the
Gluster server zeroes the requested bytes itself (server-offloaded
zeroing). Without this fop, the client or application has to repeatedly
issue write fops carrying zeroes to the server, which is inefficient
because of the overhead of the RPC calls and acknowledgements involved.
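For comparison, the client-side alternative that the fop replaces can be sketched as a loop of positional writes. This is a minimal sketch, not GlusterFS code: zero_range_by_writes() is a hypothetical helper that zeroes 'size' bytes starting at 'offset' on a local file descriptor using POSIX pwrite(). On a GlusterFS mount, each of these calls would become a separate write fop with its own RPC round trip.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not part of GlusterFS or libgfapi): zero 'size'
 * bytes of fd starting at 'offset' by issuing repeated pwrite() calls of
 * at most 'chunk' bytes each.
 * Returns the number of write calls issued, or -1 on error. */
long zero_range_by_writes(int fd, off_t offset, off_t size, size_t chunk)
{
    if (offset < 0 || size < 0 || chunk == 0)
        return -1;

    char *zeroes = calloc(1, chunk);       /* one zero buffer, reused for every write */
    if (zeroes == NULL)
        return -1;

    long calls = 0;
    while (size > 0) {
        size_t len = (size < (off_t)chunk) ? (size_t)size : chunk;
        ssize_t n = pwrite(fd, zeroes, len, offset);
        calls++;
        if (n <= 0) {                      /* error or no progress: give up */
            free(zeroes);
            return -1;
        }
        offset += n;                       /* a short write resumes where it stopped */
        size -= n;
    }

    free(zeroes);
    return calls;
}
```

The call count grows linearly with the range: zeroing 20 GiB in 128 KiB writes, for example, takes 163,840 calls, whereas the zerofill fop sends the range once.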

WRITE SAME is a SCSI (T10) command that takes a block of data as input
and writes that same data to a range of other blocks. Because the write
is handled entirely within the storage device, it is known as an
offload. Linux now supports the SCSI WRITE SAME command and exposes it
to user space through the BLKZEROOUT ioctl. The BD translator (xlator)
can use BLKZEROOUT to implement this fop, so that zeroing operations are
offloaded completely to the storage device, making them highly
efficient.
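A minimal sketch of the user-space side of this, assuming a Linux system: the hypothetical helper zero_block_range() (not taken from the BD translator) passes a {start, length} byte range to the BLKZEROOUT ioctl on an open block device. The kernel decides how to zero the range; on devices that support WRITE SAME it can offload the work, otherwise it falls back to writing zeroes itself.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* BLKZEROOUT */

/* Hypothetical helper: zero the byte range [start, start + len) of the
 * block device open on fd. The kernel rejects ranges that are not
 * 512-byte aligned with EINVAL; the same check is done here up front.
 * Returns 0 on success, -1 with errno set on failure. */
int zero_block_range(int fd, uint64_t start, uint64_t len)
{
    if (start % 512 != 0 || len % 512 != 0) {
        errno = EINVAL;
        return -1;
    }

    uint64_t range[2] = { start, len };   /* BLKZEROOUT takes {start, length} in bytes */
    return ioctl(fd, BLKZEROOUT, range);
}
```

In real use, fd would come from open() on the block device opened for writing, and the caller needs sufficient privileges on that device.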

The fop takes two arguments, offset and size, and zeroes out 'size'
bytes of an open file starting at position 'offset'.

Benefit to GlusterFS
--------------------

It benefits GlusterFS in virtualization use cases by making it possible
to quickly create pre-allocated, zeroed-out VM disk images using server
or storage offloads.

### Scope

Nature of proposed change
-------------------------

A new fop, supported in libgfapi.

Implications on manageability
-----------------------------

None.

Implications on presentation layer
----------------------------------

N/A

Implications on persistence layer
---------------------------------

N/A

Implications on 'GlusterFS' backend
-----------------------------------

N/A

Modification to GlusterFS metadata
----------------------------------

N/A

Implications on 'glusterd'
--------------------------

N/A

How To Test
-----------

Test server offload by measuring the time taken to create a fully
allocated, zeroed file on the POSIX backend.

Test storage offload by measuring the time taken to create a fully
allocated, zeroed file on the BD backend.
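Besides timing, a test can check that the result really is fully allocated and zeroed. A minimal sketch, assuming a Linux client where st_blocks is reported in 512-byte units; file_is_allocated_and_zeroed() is a hypothetical test helper, and its block-count comparison is only a rough allocation check (metadata blocks can mask a small hole).

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical test helper: returns 1 if the first 'size' bytes of fd are
 * backed by allocated blocks and read back as zeroes, 0 if not, -1 on error.
 * Assumes st_blocks counts 512-byte units, as it does on Linux. */
int file_is_allocated_and_zeroed(int fd, off_t size)
{
    struct stat st;
    if (fstat(fd, &st) != 0)
        return -1;
    if (st.st_size < size || (off_t)st.st_blocks * 512 < size)
        return 0;                          /* too short, or (partly) sparse */

    char buf[4096];
    off_t pos = 0;
    while (pos < size) {
        size_t want = (size - pos < (off_t)sizeof buf) ? (size_t)(size - pos)
                                                       : sizeof buf;
        ssize_t n = pread(fd, buf, want, pos);
        if (n <= 0)
            return -1;
        for (ssize_t i = 0; i < n; i++)
            if (buf[i] != 0)
                return 0;                  /* found a non-zero byte */
        pos += n;
    }
    return 1;
}
```

A test run would time the zerofill call (or the repeated-write baseline) and then call this helper on the resulting file.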

User Experience
---------------

Fast provisioning of VM images when GlusterFS is used as a file system
backend for KVM virtualization.

Dependencies
------------

zerofill() support in the BD backend depends on the new BD translator:
<http://review.gluster.org/#/c/4809/>

Documentation
-------------

This feature adds zerofill support to the following areas:

-  libglusterfs
-  io-stats
-  performance/md-cache, open-behind
-  quota
-  cluster/afr, dht, stripe
-  rpc/xdr
-  protocol/client, server
-  io-threads
-  marker
-  storage/posix
-  libgfapi

Client applications can use this fop through glfs\_zerofill, introduced
in libgfapi. FUSE support for this fop has not been added, as there is
no corresponding system call.

Here is a performance comparison of server-offloaded zerofill versus
zeroing out with repeated writes:

    [root@llmvm02 remote]# time ./offloaded aakash-test log 20

    real    3m34.155s
    user    0m0.018s
    sys     0m0.040s

    [root@llmvm02 remote]# time ./manually aakash-test log 20

    real    4m23.043s
    user    0m2.197s
    sys     0m14.457s

    [root@llmvm02 remote]# time ./offloaded aakash-test log 25

    real    4m28.363s
    user    0m0.021s
    sys     0m0.025s

    [root@llmvm02 remote]# time ./manually aakash-test log 25

    real    5m34.278s
    user    0m2.957s
    sys     0m18.808s

The argument log names the file used for logging, and the third
argument is the size in GB.

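In these runs, the offloaded fop reduced elapsed (real) time by roughly
20% (about 18.6% for 20 GB and 19.7% for 25 GB). Client CPU time
(user + sys) also dropped from roughly 17 s and 22 s to under 0.1 s,
because the client no longer has to send the zeroes itself.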
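The percentage reductions can be reproduced from the "real" times in the runs above as (manual - offloaded) / manual. A small sketch of that arithmetic; the helper names are illustrative only:

```c
/* Convert a time(1) "real" value such as 3m34.155s into seconds. */
double to_seconds(int minutes, double seconds)
{
    return minutes * 60.0 + seconds;
}

/* Percentage by which the offloaded run's elapsed time is shorter than
 * the manual (repeated-write) run's elapsed time. */
double percent_reduction(double manual_s, double offloaded_s)
{
    return 100.0 * (manual_s - offloaded_s) / manual_s;
}
```

With the 20 GB times, percent_reduction(to_seconds(4, 23.043), to_seconds(3, 34.155)) evaluates to about 18.6; the 25 GB times give about 19.7.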

Status
------

Patch: <http://review.gluster.org/5327>  
Status: Merged