|
141 | 141 | VDEV_RAIDZ_64MUL_2((x), mask); \
|
142 | 142 | }
|
143 | 143 |
|
| 144 | + |
| 145 | +/* |
| 146 | + * Big Theory Statement for how a RAIDZ VDEV is expanded |
| 147 | + * |
| 148 | + * An existing RAIDZ VDEV can be expanded by attaching a new disk. Expansion |
| 149 | + * works with all three RAIDZ parity levels: RAIDZ1, RAIDZ2, and RAIDZ3. VDEVs |
| 150 | + * that have been previously expanded can be expanded again. |
| 151 | + * |
| 152 | + * The RAIDZ VDEV must be healthy (must be able to write to all the drives in |
| 153 | + * the VDEV) when an expansion starts. The expansion will pause if any |
| 154 | + * disk in the VDEV fails, and resume once the VDEV is healthy again. All other |
| 155 | + * operations on the pool can continue while an expansion is in progress (e.g. |
| 156 | + * read/write, snapshot, zpool add, etc). Following a reboot or export/import, |
| 157 | + * the expansion resumes where it left off. |
| 158 | + * |
| 159 | + * == Reflowing the Data == |
| 160 | + * |
| 161 | + * The expansion involves reflowing (copying) the data from the current set |
| 162 | + * of disks to spread it across the new set which now has one more disk. This |
| 163 | + * reflow operation is similar to reflowing text when the column width of a |
| 164 | + * text editor window is expanded. The text doesn’t change but the location of |
| 165 | + * the text changes to accommodate the new width. An example reflow result for |
| 166 | + * a 4-wide RAIDZ1 to a 5-wide is shown below. |
| 167 | + * |
| 168 | + * Reflow End State |
| 169 | + * Each letter indicates a parity group (logical stripe) |
| 170 | + * |
| 171 | + * Before expansion After Expansion |
| 172 | + * D1 D2 D3 D4 D1 D2 D3 D4 D5 |
| 173 | + * +------+------+------+------+ +------+------+------+------+------+ |
| 174 | + * | | | | | | | | | | | |
| 175 | + * | A | A | A | A | | A | A | A | A | B | |
| 176 | + * | 1| 2| 3| 4| | 1| 2| 3| 4| 5| |
| 177 | + * +------+------+------+------+ +------+------+------+------+------+ |
| 178 | + * | | | | | | | | | | | |
| 179 | + * | B | B | C | C | | B | C | C | C | C | |
| 180 | + * | 5| 6| 7| 8| | 6| 7| 8| 9| 10| |
| 181 | + * +------+------+------+------+ +------+------+------+------+------+ |
| 182 | + * | | | | | | | | | | | |
| 183 | + * | C | C | D | D | | D | D | E | E | E | |
| 184 | + * | 9| 10| 11| 12| | 11| 12| 13| 14| 15| |
| 185 | + * +------+------+------+------+ +------+------+------+------+------+ |
| 186 | + * | | | | | | | | | | | |
| 187 | + * | E | E | E | E | --> | E | F | F | G | G | |
| 188 | + * | 13| 14| 15| 16| | 16| 17| 18|p 19| 20| |
| 189 | + * +------+------+------+------+ +------+------+------+------+------+ |
| 190 | + * | | | | | | | | | | | |
| 191 | + * | F | F | G | G | | G | G | H | H | H | |
| 192 | + * | 17| 18| 19| 20| | 21| 22| 23| 24| 25| |
| 193 | + * +------+------+------+------+ +------+------+------+------+------+ |
| 194 | + * | | | | | | | | | | | |
| 195 | + * | G | G | H | H | | H | I | I | J | J | |
| 196 | + * | 21| 22| 23| 24| | 26| 27| 28| 29| 30| |
| 197 | + * +------+------+------+------+ +------+------+------+------+------+ |
| 198 | + * | | | | | | | | | | | |
| 199 | + * | H | H | I | I | | J | J | | | K | |
| 200 | + * | 25| 26| 27| 28| | 31| 32| 33| 34| 35| |
| 201 | + * +------+------+------+------+ +------+------+------+------+------+ |
| 202 | + * |
| 203 | + * This reflow approach has several advantages. There is no need to read or |
| 204 | + * modify the block pointers or recompute any block checksums. The reflow |
| 205 | + * doesn’t need to know where the parity sectors reside. We can read and write |
| 206 | + * data sequentially and the copy can occur in a background thread in open |
| 207 | + * context. The design also allows for fast discovery of what data to copy. |
| 208 | + * |
| 209 | + * The VDEV metaslabs are processed, one at a time, to copy the block data to |
| 210 | + * have it flow across all the disks. The metaslab is disabled for allocations |
| 211 | + * during the copy. As an optimization, we only copy the allocated data which |
| 212 | + * can be determined by looking at the metaslab range tree. During the copy we |
| 213 | + * must maintain the redundancy guarantees of the RAIDZ VDEV (i.e. up to parity |
| 214 | + * count disks can fail). This means we cannot overwrite data during the reflow |
| 215 | + * that would be needed if a disk is lost. |
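As an illustration of the copy strategy above, the sketch below (hypothetical types and names, not the actual ZFS range-tree API) walks only the allocated segments of one metaslab and copies each extent, leaving free space untouched:

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

typedef struct alloc_seg {
        uint64_t as_start;      /* offset of an allocated extent */
        uint64_t as_size;       /* length of the extent */
} alloc_seg_t;

/* Stand-in for copying one allocated extent into the expanded layout. */
static void
reflow_copy_segment(uint64_t start, uint64_t size)
{
        printf("copy [%llu, %llu)\n", (unsigned long long)start,
            (unsigned long long)(start + size));
}

int
main(void)
{
        /* Allocated extents of one (hypothetical) metaslab's range tree. */
        alloc_seg_t segs[] = { { 0, 1 << 20 }, { 8 << 20, 4 << 20 } };

        for (size_t i = 0; i < sizeof (segs) / sizeof (segs[0]); i++)
                reflow_copy_segment(segs[i].as_start, segs[i].as_size);
        return (0);
}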
| 216 | + * |
| 217 | + * After the reflow completes, all newly-written blocks will have the new |
| 218 | + * layout, i.e. they will have the parity to data ratio implied by the new |
| 219 | + * number of disks in the RAIDZ group. |
| 220 | + * |
| 221 | + * Even though the reflow copies all the allocated space, it only rearranges |
| 222 | + * the existing data + parity. This has a few implications about blocks that |
| 223 | + * were written before the reflow completes: |
| 224 | + * |
| 225 | + * - Old blocks will still use the same amount of space (i.e. they will have |
| 226 | + * the parity to data ratio implied by the old number of disks in the RAIDZ |
| 227 | + * group). |
| 228 | + * - Reading old blocks will be slightly slower than before the reflow, for |
| 229 | + * two reasons. First, we will have to read from all disks in the RAIDZ |
| 230 | + * VDEV, rather than being able to skip the children that contain only |
| 231 | + * parity of this block (because the parity and data of a single block are |
| 232 | + * now spread out across all the disks). Second, in most cases there will |
| 233 | + * be an extra bcopy, needed to rearrange the data back to its original |
| 234 | + * layout in memory. |
| 235 | + * |
| 236 | + * == Scratch Area == |
| 237 | + * |
| 238 | + * As we copy the block data, we can only progress to the point that writes |
| 239 | + * will not overlap with blocks whose progress has not yet been recorded on |
| 240 | + * disk. Since partially-copied rows are always read from the old location, |
| 241 | + * we need to stop one row before the sector-wise overlap, to prevent any |
| 242 | + * row-wise overlap. Initially this would limit us to copying one sector at |
| 243 | + * a time. The amount we can safely copy is known as the chunk size. |
| 244 | + * |
| 245 | + * Ideally we want to copy at least 2 * (new_width)^2 sectors so that we have a |
| 246 | + * separation of 2*(new_width+1) and a chunk size of new_width+2. With the max |
| 247 | + * RAIDZ width of 255 and 4K sectors this would be 2MB per disk. In practice |
| 248 | + * the widths will likely be single digits so we can get a substantial chunk |
| 249 | + * size using only a few MB of scratch per disk. |
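As a sanity check of that arithmetic, here is a small standalone sketch (illustrative only; the names are not from the ZFS source): copying 2 * new_width^2 sectors works out to 2 * new_width sectors per disk, which for the 255-wide, 4K-sector case is roughly 2 MB per disk.

#include <stdint.h>
#include <stdio.h>

/* Total sectors we would like to copy in one safe chunk: 2 * new_width^2. */
static uint64_t
reflow_copy_sectors(uint64_t new_width)
{
        return (2 * new_width * new_width);
}

int
main(void)
{
        uint64_t new_width = 255;       /* maximum RAIDZ width */
        uint64_t sector = 4096;         /* 4K sectors */
        uint64_t per_disk = reflow_copy_sectors(new_width) / new_width;

        /* 2 * new_width = 510 sectors, i.e. about 2 MB per disk. */
        printf("per-disk scratch: %llu bytes\n",
            (unsigned long long)(per_disk * sector));
        return (0);
}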
| 250 | + * |
| 251 | + * To speed up the initial copy, we use a scratch area that is persisted to |
| 252 | + * disk which holds a large amount of reflowed state. We can always read the |
| 253 | + * partially written stripes when a disk fails or the copy is interrupted |
| 254 | + * (crash) during the initial copying phase and also get past a small chunk |
| 255 | + * size restriction. At a minimum, the scratch space must be large enough to |
| 256 | + * get us to the point that one row does not overlap itself when moved |
| 257 | + * (i.e. new_width^2). But going larger is even better. We use the 3.5 MiB |
| 258 | + * reserved "boot" space that resides after the ZFS disk labels as our scratch |
| 259 | + * space to handle overwriting the initial part of the VDEV. |
| 260 | + * |
| 261 | + * 0 256K 512K 4M |
| 262 | + * +------+------+-----------------------+----------------------------- |
| 263 | + * | VDEV | VDEV | Boot Block (3.5M) | Allocatable space ... |
| 264 | + * | L0 | L1 | Reserved | (Metaslabs) |
| 265 | + * +------+------+-----------------------+------------------------------- |
| 266 | + * Scratch Area |
| 267 | + * |
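The diagram above can be restated as a few constants. The sketch below only illustrates the arithmetic (these are not the actual ZFS label definitions): two 256K labels, the 3.5 MiB reserved boot region used as the scratch area, and allocatable space starting at the 4 MiB boundary.

#include <stdint.h>
#include <assert.h>

#define EX_LABEL_SIZE   (256ULL << 10)          /* 256K per label (L0, L1) */
#define EX_BOOT_OFFSET  (2 * EX_LABEL_SIZE)     /* boot region starts at 512K */
#define EX_BOOT_SIZE    ((7ULL << 20) / 2)      /* 3.5 MiB scratch area */
#define EX_ALLOC_START  (EX_BOOT_OFFSET + EX_BOOT_SIZE)

int
main(void)
{
        /* Allocatable (metaslab) space begins at the 4 MiB boundary. */
        assert(EX_ALLOC_START == (4ULL << 20));
        return (0);
}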
| 268 | + * == Reflow Progress Updates == |
| 269 | + * After the initial scratch-based reflow, the expansion process works |
| 270 | + * similarly to device removal. We create a new open context thread which |
| 271 | + * reflows the data, and periodically kicks off sync tasks to update logical |
| 272 | + * state. In this case, state is the committed progress (offset of next data |
| 273 | + * to copy). We need to persist the completed offset on disk, so that if we |
| 274 | + * crash we know which format each VDEV offset is in. |
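A minimal sketch of how that persisted offset is used (hypothetical names; the real code tracks more state): everything below the committed offset has already been copied into the expanded layout, while everything at or above it must still be read with the old geometry.

#include <stdint.h>
#include <stdbool.h>
#include <assert.h>

typedef struct reflow_state {
        uint64_t rs_committed_offset;   /* persisted progress, in bytes */
} reflow_state_t;

/* Everything below the committed offset has already been reflowed. */
static bool
offset_uses_new_layout(const reflow_state_t *rs, uint64_t vdev_offset)
{
        return (vdev_offset < rs->rs_committed_offset);
}

int
main(void)
{
        reflow_state_t rs = { .rs_committed_offset = 1 << 20 };

        assert(offset_uses_new_layout(&rs, 4096));      /* already copied */
        assert(!offset_uses_new_layout(&rs, 2 << 20));  /* still old layout */
        return (0);
}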
| 275 | + * |
| 276 | + * == Time Dependent Geometry == |
| 277 | + * |
| 278 | + * In RAIDZ, blocks are read from disk in a column by column fashion. For a |
| 279 | + * multi-row block, the second sector is in the first column, not in the second |
| 280 | + * column. This allows us to issue full reads for each column directly into |
| 281 | + * the request buffer. The block data is thus laid out sequentially in a |
| 282 | + * column-by-column fashion. |
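The column-by-column layout described above amounts to a column-major mapping. The sketch below (illustrative only, not ZFS code) shows how a logical sector index maps to a (column, row) position so that each column can be read with a single contiguous I/O.

#include <stdint.h>
#include <assert.h>

/* Map logical sector i of a block to its (column, row) position. */
static void
sector_to_column_row(uint64_t i, uint64_t rows, uint64_t *col, uint64_t *row)
{
        *col = i / rows;
        *row = i % rows;
}

int
main(void)
{
        uint64_t col, row;

        /* In a 2-row block, the second logical sector stays in column 0. */
        sector_to_column_row(1, 2, &col, &row);
        assert(col == 0 && row == 1);
        return (0);
}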
| 283 | + * |
| 284 | + * After a block is reflowed, the sectors that were all in a single original |
| 285 | + * column can now reside in different columns. When reading from an expanded |
| 286 | + * VDEV, we need to know the logical stripe width for each block so we can |
| 287 | + * reconstitute the block’s data after the reads are completed. Likewise, |
| 288 | + * when we perform the combinatorial reconstruction we need to know the |
| 289 | + * original width so we can retry combinations from the past layouts. |
| 290 | + * |
| 291 | + * Time dependent geometry is what we call having blocks with different layouts |
| 292 | + * (stripe widths) in the same VDEV. This time-dependent geometry uses the |
| 293 | + * block’s birth time (+ the time expansion ended) to establish the correct |
| 294 | + * width for a given block. After an expansion completes, we record the time |
| 295 | + * for blocks written with a particular width (geometry). |
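To make the idea concrete, here is a sketch under assumed semantics (hypothetical names, not the ZFS implementation) that derives a block's logical stripe width from its birth txg and a sorted list of expansion-completion txgs, as recorded in the 'raidz_expand_txgs' array described below. It assumes blocks born in the completion txg itself already use the new width.

#include <stdint.h>
#include <stddef.h>
#include <assert.h>

static uint64_t
logical_width_for_txg(uint64_t current_width, const uint64_t *expand_txgs,
    size_t nexpand, uint64_t birth_txg)
{
        uint64_t width = current_width;

        for (size_t i = 0; i < nexpand; i++) {
                if (birth_txg < expand_txgs[i])
                        width--;        /* born before this expansion finished */
        }
        return (width);
}

int
main(void)
{
        /* A 4-wide VDEV expanded twice, completing in txgs 100 and 200. */
        const uint64_t txgs[] = { 100, 200 };

        assert(logical_width_for_txg(6, txgs, 2, 50) == 4);
        assert(logical_width_for_txg(6, txgs, 2, 150) == 5);
        assert(logical_width_for_txg(6, txgs, 2, 250) == 6);
        return (0);
}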
| 296 | + * |
| 297 | + * == On Disk Format Changes == |
| 298 | + * |
| 299 | + * A new pool feature flag, 'raidz_expansion', is added; its reference count is |
| 300 | + * the number of RAIDZ VDEVs that have been expanded. |
| 301 | + * |
| 302 | + * The uberblock has a new ub_raidz_reflow_info field that holds the scratch |
| 303 | + * space state (i.e. active or not) and the next offset that needs to be |
| 304 | + * reflowed (progress state). |
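As an illustration of what such a field might carry (this packing is hypothetical, not the actual uberblock encoding), the sketch below folds a scratch-area-valid flag and the next reflow offset into a single 64-bit word:

#include <stdint.h>
#include <stdbool.h>
#include <assert.h>

#define EX_SCRATCH_VALID        (1ULL << 63)    /* scratch area in use */
#define EX_OFFSET_MASK          (~EX_SCRATCH_VALID)

static uint64_t
reflow_info_pack(bool scratch_valid, uint64_t next_offset)
{
        return ((scratch_valid ? EX_SCRATCH_VALID : 0) |
            (next_offset & EX_OFFSET_MASK));
}

int
main(void)
{
        uint64_t info = reflow_info_pack(true, 12345);

        assert((info & EX_OFFSET_MASK) == 12345);
        assert((info & EX_SCRATCH_VALID) != 0);
        return (0);
}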
| 305 | + * |
| 306 | + * The blocks on an expanded RAIDZ VDEV can have different logical stripe widths. |
| 307 | + * |
| 308 | + * The top-level RAIDZ VDEV has two new entries in the nvlist: |
| 309 | + * 'raidz_expand_txgs' array: logical stripe widths by txg are recorded here |
| 310 | + * 'raidz_expanding' boolean: present during reflow and removed after completion |
| 311 | + * |
| 312 | + * And finally the VDEV's top-level ZAP adds the following entries: |
| 313 | + * VDEV_TOP_ZAP_RAIDZ_EXPAND_STATE |
| 314 | + * VDEV_TOP_ZAP_RAIDZ_EXPAND_START_TIME |
| 315 | + * VDEV_TOP_ZAP_RAIDZ_EXPAND_END_TIME |
| 316 | + * VDEV_TOP_ZAP_RAIDZ_EXPAND_BYTES_COPIED |
| 317 | + */ |
| 318 | + |
144 | 319 | /*
|
145 | 320 | * For testing only: pause the raidz expansion after reflowing this amount.
|
146 | 321 | * (accessed by ZTS and ztest)
|
|