
Commit 0dac83b

committed
Add a "big theory statement" comment to code
Signed-off-by: Don Brady <[email protected]>
1 parent 7a3ba15 commit 0dac83b

File tree

1 file changed: +175 −0 lines changed

module/zfs/vdev_raidz.c

Lines changed: 175 additions & 0 deletions
@@ -141,6 +141,181 @@
	VDEV_RAIDZ_64MUL_2((x), mask); \
}

/*
 * Big Theory Statement for how a RAIDZ VDEV is expanded
 *
 * An existing RAIDZ VDEV can be expanded by attaching a new disk. Expansion
 * works with all three RAIDZ parity choices (RAIDZ1, RAIDZ2, and RAIDZ3).
 * VDEVs that have been previously expanded can be expanded again.
 *
 * The RAIDZ VDEV must be healthy (able to write to all the drives in the
 * VDEV) when an expansion starts. The expansion will pause if any disk in
 * the VDEV fails, and resume once the VDEV is healthy again. All other
 * operations on the pool can continue while an expansion is in progress (e.g.
 * read/write, snapshot, zpool add, etc). Following a reboot or export/import,
 * the expansion resumes where it left off.
 *
 * == Reflowing the Data ==
 *
 * The expansion involves reflowing (copying) the data from the current set
 * of disks to spread it across the new set, which now has one more disk. This
 * reflow operation is similar to reflowing text when the column width of a
 * text editor window is expanded. The text doesn't change, but the location
 * of the text changes to accommodate the new width. An example reflow result
 * for a 4-wide RAIDZ1 to a 5-wide is shown below.
 *
 *                               Reflow End State
 *            Each letter indicates a parity group (logical stripe)
 *
 *           Before expansion                         After Expansion
 *     D1     D2     D3     D4              D1     D2     D3     D4     D5
 *  +------+------+------+------+        +------+------+------+------+------+
 *  |      |      |      |      |        |      |      |      |      |      |
 *  |  A   |  A   |  A   |  A   |        |  A   |  A   |  A   |  A   |  B   |
 *  |     1|     2|     3|     4|        |     1|     2|     3|     4|     5|
 *  +------+------+------+------+        +------+------+------+------+------+
 *  |      |      |      |      |        |      |      |      |      |      |
 *  |  B   |  B   |  C   |  C   |        |  B   |  C   |  C   |  C   |  C   |
 *  |     5|     6|     7|     8|        |     6|     7|     8|     9|    10|
 *  +------+------+------+------+        +------+------+------+------+------+
 *  |      |      |      |      |        |      |      |      |      |      |
 *  |  C   |  C   |  D   |  D   |        |  D   |  D   |  E   |  E   |  E   |
 *  |     9|    10|    11|    12|        |    11|    12|    13|    14|    15|
 *  +------+------+------+------+        +------+------+------+------+------+
 *  |      |      |      |      |        |      |      |      |      |      |
 *  |  E   |  E   |  E   |  E   |  -->   |  E   |  F   |  F   |  G   |  G   |
 *  |    13|    14|    15|    16|        |    16|    17|    18|    19|    20|
 *  +------+------+------+------+        +------+------+------+------+------+
 *  |      |      |      |      |        |      |      |      |      |      |
 *  |  F   |  F   |  G   |  G   |        |  G   |  G   |  H   |  H   |  H   |
 *  |    17|    18|    19|    20|        |    21|    22|    23|    24|    25|
 *  +------+------+------+------+        +------+------+------+------+------+
 *  |      |      |      |      |        |      |      |      |      |      |
 *  |  G   |  G   |  H   |  H   |        |  H   |  I   |  I   |  J   |  J   |
 *  |    21|    22|    23|    24|        |    26|    27|    28|    29|    30|
 *  +------+------+------+------+        +------+------+------+------+------+
 *  |      |      |      |      |        |      |      |      |      |      |
 *  |  H   |  H   |  I   |  I   |        |  J   |  J   |      |      |  K   |
 *  |    25|    26|    27|    28|        |    31|    32|    33|    34|    35|
 *  +------+------+------+------+        +------+------+------+------+------+
 *
 * This reflow approach has several advantages. There is no need to read or
 * modify the block pointers or recompute any block checksums. The reflow
 * doesn't need to know where the parity sectors reside. We can read and
 * write data sequentially, and the copy can occur in a background thread in
 * open context. The design also allows for fast discovery of what data to
 * copy.
 *
 * The VDEV metaslabs are processed, one at a time, to copy the block data
 * and make it flow across all the disks. The metaslab is disabled for
 * allocations during the copy. As an optimization, we only copy the
 * allocated data, which can be determined by looking at the metaslab range
 * tree. During the copy we must maintain the redundancy guarantees of the
 * RAIDZ VDEV (i.e. as many disks as there are parity devices can fail).
 * This means we cannot overwrite data during the reflow that would be
 * needed if a disk is lost.
 *
 * After the reflow completes, all newly-written blocks will have the new
 * layout, i.e. they will have the parity to data ratio implied by the new
 * number of disks in the RAIDZ group.
 *
 * Even though the reflow copies all the allocated space, it only rearranges
 * the existing data + parity. This has a few implications about blocks that
 * were written before the reflow completes:
 *
 * - Old blocks will still use the same amount of space (i.e. they will have
 *   the parity to data ratio implied by the old number of disks in the RAIDZ
 *   group).
 * - Reading old blocks will be slightly slower than before the reflow, for
 *   two reasons. First, we will have to read from all disks in the RAIDZ
 *   VDEV, rather than being able to skip the children that contain only
 *   parity of this block (because the parity and data of a single block are
 *   now spread out across all the disks). Second, in most cases there will
 *   be an extra bcopy, needed to rearrange the data back to its original
 *   layout in memory.
 *
 * == Scratch Area ==
 *
 * As we copy the block data, we can only progress to the point that writes
 * will not overlap with blocks whose progress has not yet been recorded on
 * disk. Since partially-copied rows are always read from the old location,
 * we need to stop one row before the sector-wise overlap, to prevent any
 * row-wise overlap. Initially this would limit us to copying one sector at
 * a time. The amount we can safely copy is known as the chunk size.
 *
 * Ideally we want to copy at least 2 * (new_width)^2 so that we have a
 * separation of 2*(new_width+1) and a chunk size of new_width+2. With the
 * max RAIDZ width of 255 and 4K sectors this would be 2MB per disk. In
 * practice the widths will likely be single digits, so we can get a
 * substantial chunk size using only a few MB of scratch per disk.
 *
 * To speed up the initial copy, we use a scratch area that is persisted to
 * disk and holds a large amount of reflowed state. It lets us always read
 * back partially-written stripes if a disk fails or the copy is interrupted
 * (e.g. by a crash) during the initial copying phase, and it gets us past
 * the small chunk size restriction. At a minimum, the scratch space must be
 * large enough to get us to the point that one row does not overlap itself
 * when moved (i.e. new_width^2). But going larger is even better. We use
 * the 3.5 MiB reserved "boot" space that resides after the ZFS disk labels
 * as our scratch space to handle overwriting the initial part of the VDEV.
 *
 *    0     256K   512K                    4M
 *    +------+------+-----------------------+-------------------------------
 *    | VDEV | VDEV |   Boot Block (3.5M)   |  Allocatable space ...
 *    |  L0  |  L1  |       Reserved        |      (Metaslabs)
 *    +------+------+-----------------------+-------------------------------
 *                       Scratch Area
 *
 * == Reflow Progress Updates ==
 *
 * After the initial scratch-based reflow, the expansion process works
 * similarly to device removal. We create a new open context thread which
 * reflows the data, and periodically kicks off sync tasks to update logical
 * state. In this case, the state is the committed progress (offset of the
 * next data to copy). We need to persist the completed offset on disk, so
 * that if we crash we know which format each VDEV offset is in.
 *
 * == Time Dependent Geometry ==
 *
 * In RAIDZ, blocks are read from disk in a column by column fashion. For a
 * multi-row block, the second sector is in the first column, not in the
 * second column. This allows us to issue full reads for each column directly
 * into the request buffer. The block data is thus laid out sequentially in
 * a column-by-column fashion.
 *
 * After a block is reflowed, the sectors that were all in the original
 * column data can now reside in different columns. When reading from an
 * expanded VDEV, we need to know the logical stripe width for each block
 * so we can reconstitute the block's data after the reads are completed.
 * Likewise, when we perform the combinatorial reconstruction we need to
 * know the original width so we can retry combinations from the past
 * layouts.
 *
 * Time dependent geometry is what we call having blocks with different
 * layouts (stripe widths) in the same VDEV. This time-dependent geometry
 * uses the block's birth time (+ the time expansion ended) to establish
 * the correct width for a given block. After an expansion completes, we
 * record the time for blocks written with a particular width (geometry).
 *
 * == On Disk Format Changes ==
 *
 * There is a new pool feature flag, 'raidz_expansion', whose reference
 * count is the number of RAIDZ VDEVs that have been expanded.
 *
 * The uberblock has a new ub_raidz_reflow_info field that holds the scratch
 * space state (i.e. active or not) and the next offset that needs to be
 * reflowed (progress state).
 *
 * The blocks on an expanded RAIDZ VDEV can have different logical stripe
 * widths.
 *
 * The top-level RAIDZ VDEV has two new entries in the nvlist:
 *
 * 'raidz_expand_txgs' array: logical stripe widths by txg are recorded here
 * 'raidz_expanding' boolean: present during reflow, removed after completion
 *
 * And finally the VDEV's top-level ZAP adds the following entries:
 *
 * VDEV_TOP_ZAP_RAIDZ_EXPAND_STATE
 * VDEV_TOP_ZAP_RAIDZ_EXPAND_START_TIME
 * VDEV_TOP_ZAP_RAIDZ_EXPAND_END_TIME
 * VDEV_TOP_ZAP_RAIDZ_EXPAND_BYTES_COPIED
 */

/*
 * For testing only: pause the raidz expansion after reflowing this amount.
 * (accessed by ZTS and ztest)
