Introduction
============

This document describes a collection of device-mapper targets that
between them implement thin-provisioning and snapshots.

The main highlight of this implementation, compared to the previous
implementation of snapshots, is that it allows many virtual devices to
be stored on the same data volume. This simplifies administration and
allows the sharing of data between volumes, thus reducing disk usage.

Another significant feature is support for an arbitrary depth of
recursive snapshots (snapshots of snapshots of snapshots ...). The
previous implementation of snapshots did this by chaining together
lookup tables, and so performance was O(depth). This new
implementation uses a single data structure to avoid this degradation
with depth. Fragmentation may still be an issue, however, in some
scenarios.

Metadata is stored on a separate device from data, giving the
administrator some freedom, for example to:

- Improve metadata resilience by storing metadata on a mirrored volume
  but data on a non-mirrored one.

- Improve performance by storing the metadata on SSD.

Status
======

These targets are considered safe for production use. But different
use cases will have different performance characteristics, for example
due to fragmentation of the data volume.

If you find this software is not performing as expected please mail
dm-devel@redhat.com with details and we'll try our best to improve
things for you.

Userspace tools for checking and repairing the metadata are available
as 'thin_check' and 'thin_repair'. The name of the package that
provides these utilities varies by distribution (on a Red Hat
distribution it is named 'device-mapper-persistent-data').

Cookbook
========

This section describes some quick recipes for using thin provisioning.
They use the dmsetup program to control the device-mapper driver
directly. End users will be advised to use a higher-level volume
manager such as LVM2 once support has been added.

Pool device
-----------

The pool device ties together the metadata volume and the data volume.
It maps I/O linearly to the data volume and updates the metadata via
two mechanisms:

- Function calls from the thin targets

- Device-mapper 'messages' from userspace which control the creation
  of new virtual devices amongst other things.

Setting up a fresh pool device
------------------------------

Setting up a pool device requires a valid metadata device, and a data
device. If you do not have an existing metadata device you can make
one by zeroing the first 4k to indicate empty metadata.

    dd if=/dev/zero of=$metadata_dev bs=4096 count=1

The amount of metadata you need will vary according to how many blocks
are shared between thin devices (i.e. through snapshots). If you have
less sharing than average you'll need a larger-than-average metadata
device.

As a guide, we suggest you calculate the number of bytes to use in the
metadata device as 48 * $data_dev_size / $data_block_size but round it
up to 2MB if the answer is smaller. If you're creating large numbers
of snapshots which are recording large amounts of change, you may find
you need to increase this.

The largest size supported is 16GB: if the device is larger, a warning
will be issued and the excess space will not be used.
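As a rough illustration of that guideline, here is a shell sketch; the
device size and variable names are hypothetical:

    # Hypothetical example: a 1TiB data device (in 512-byte sectors)
    # using 64KB data blocks.
    data_dev_size=2147483648
    data_block_size=128
    metadata_bytes=$((48 * data_dev_size / data_block_size))
    # Round up to 2MB if the answer is smaller.
    min_bytes=$((2 * 1024 * 1024))
    [ "$metadata_bytes" -lt "$min_bytes" ] && metadata_bytes=$min_bytes
    echo $metadata_bytes    # 805306368 bytes (768MB) in this example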
Reloading a pool table
----------------------

You may reload a pool's table; indeed, this is how the pool is resized
if it runs out of space. (N.B. While specifying a different metadata
device when reloading is not forbidden at the moment, things will go
wrong if it does not route I/O to exactly the same on-disk location as
previously.)
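As a sketch of such a resize (the sizes here are hypothetical, and the
table syntax is the same as that shown in the next section): after
growing the data device to 20GiB (41943040 sectors), reload the pool
with the new length:

    dmsetup suspend pool
    dmsetup reload pool --table "0 41943040 thin-pool $metadata_dev \
        $data_dev $data_block_size $low_water_mark"
    dmsetup resume pool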
Using an existing pool device
-----------------------------

    dmsetup create pool \
        --table "0 20971520 thin-pool $metadata_dev $data_dev \
        $data_block_size $low_water_mark"

$data_block_size gives the smallest unit of disk space that can be
allocated at a time, expressed in units of 512-byte sectors.
$data_block_size must be between 128 (64KB) and 2097152 (1GB) and a
multiple of 128 (64KB). $data_block_size cannot be changed after the
thin-pool is created. People primarily interested in thin provisioning
may want to use a value such as 1024 (512KB). People doing lots of
snapshotting may want a smaller value such as 128 (64KB). If you are
not zeroing newly-allocated data, a larger $data_block_size in the
region of 256000 (128MB) is suggested.

$low_water_mark is expressed in blocks of size $data_block_size. If
free space on the data device drops below this level then a dm event
will be triggered which a userspace daemon should catch allowing it to
extend the pool device. Only one such event will be sent.

No extra event is triggered if a just-resumed device's free space is
already below the low water mark; however, resuming a device always
triggers an event, so a userspace daemon should verify that free space
exceeds the low water mark when handling this event.

A low water mark for the metadata device is maintained in the kernel
and will trigger a dm event if free space on the metadata device drops
below it.

Updating on-disk metadata
-------------------------

On-disk metadata is committed every time a FLUSH or FUA bio is written.
If no such requests are made then commits will occur every second. This
means the thin-provisioning target behaves like a physical disk that
has a volatile write cache. If power is lost you may lose some recent
writes. The metadata should always be consistent in spite of any crash.

If data space is exhausted the pool will either error or queue IO
according to the configuration (see: error_if_no_space). If metadata
space is exhausted or a metadata operation fails: the pool will error
IO until the pool is taken offline and repair is performed to 1) fix
any potential inconsistencies and 2) clear the flag that indicates
repair is needed. Once the pool's metadata device is repaired it may
be resized, which will allow the pool to return to normal operation.
Note that if a pool is flagged as needing repair, the pool's data and
metadata devices cannot be resized until repair is performed. It
should also be noted that when the pool's metadata space is exhausted
the current metadata transaction is aborted. Given that the pool will
cache IO whose completion may have already been acknowledged to upper
IO layers (e.g. filesystem) it is strongly suggested that consistency
checks (e.g. fsck) be performed on those layers when repair of the
pool is required.

Thin provisioning
-----------------

i) Creating a new thinly-provisioned volume.

    To create a new thinly-provisioned volume you must send a message
    to an active pool device, /dev/mapper/pool in this example.

        dmsetup message /dev/mapper/pool 0 "create_thin 0"

    Here '0' is an identifier for the volume, a 24-bit number. It's up
    to the caller to allocate and manage these identifiers. If the
    identifier is already in use, the message will fail with -EEXIST.

ii) Using a thinly-provisioned volume.

    Thinly-provisioned volumes are activated using the 'thin' target:

        dmsetup create thin --table "0 2097152 thin /dev/mapper/pool 0"

    The last parameter is the identifier for the thinp device.

Internal snapshots
------------------

i) Creating an internal snapshot.

    Snapshots are created with another message to the pool.

    N.B. If the origin device that you wish to snapshot is active, you
    must suspend it before creating the snapshot to avoid corruption.
    This is NOT enforced at the moment, so please be careful!

        dmsetup suspend /dev/mapper/thin
        dmsetup message /dev/mapper/pool 0 "create_snap 1 0"
        dmsetup resume /dev/mapper/thin

    Here '1' is the identifier for the volume, a 24-bit number. '0' is
    the identifier for the origin device.

ii) Using an internal snapshot.

    Once created, the user doesn't have to worry about any connection
    between the origin and the snapshot. Indeed the snapshot is no
    different from any other thinly-provisioned device and can be
    snapshotted itself via the same method (see the sketch below).
    It's perfectly legal to have only one of them active, and there's
    no ordering requirement on activating or removing them both.
    (This differs from conventional device-mapper snapshots.)

    Activate it exactly the same way as any other thinly-provisioned
    volume:

        dmsetup create snap --table "0 2097152 thin /dev/mapper/pool 1"
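As a sketch of recursive snapshots, the snapshot itself can be
snapshotted in exactly the same way; the new identifier '2' and the
device name 'snap2' here are arbitrary:

    dmsetup suspend /dev/mapper/snap
    dmsetup message /dev/mapper/pool 0 "create_snap 2 1"
    dmsetup resume /dev/mapper/snap
    dmsetup create snap2 --table "0 2097152 thin /dev/mapper/pool 2"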
External snapshots
------------------

You can use an external _read only_ device as an origin for a
thinly-provisioned volume. Any read to an unprovisioned area of the
thin device will be passed through to the origin. Writes trigger the
allocation of new blocks as usual.

One use case for this is VM hosts that want to run guests on
thinly-provisioned volumes but have the base image on another device
(possibly shared between many VMs).

You must not write to the origin device if you use this technique!
Of course, you may write to the thin device and take internal snapshots
of the thin volume.

i) Creating a snapshot of an external device

    This is the same as creating a thin device. You don't mention the
    origin at this stage.

        dmsetup message /dev/mapper/pool 0 "create_thin 0"

ii) Using a snapshot of an external device.

    Append an extra parameter to the thin target specifying the origin:

        dmsetup create snap --table "0 2097152 thin /dev/mapper/pool 0 /dev/image"

    N.B. All descendants (internal snapshots) of this snapshot require
    the same extra origin parameter.

Deactivation
------------

All devices using a pool must be deactivated before the pool itself
can be.

    dmsetup remove thin
    dmsetup remove snap
    dmsetup remove pool

Reference
=========

'thin-pool' target
------------------

i) Constructor

    thin-pool <metadata dev> <data dev> <data block size (sectors)> \
              <low water mark (blocks)> [<number of feature args> [<arg>]*]

    Optional feature arguments:

    skip_block_zeroing: Skip the zeroing of newly-provisioned blocks.

    ignore_discard: Disable discard support.

    no_discard_passdown: Don't pass discards down to the underlying
        data device, but just remove the mapping.

    read_only: Don't allow any changes to be made to the pool
        metadata. This mode is only available after the thin-pool has
        been created and first used in full read/write mode. It cannot
        be specified on initial thin-pool creation.

    error_if_no_space: Error IOs, instead of queueing, if no space.

    Data block size must be between 64KB (128 sectors) and 1GB
    (2097152 sectors) inclusive.
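For example, a hypothetical pool created with two feature arguments
(the argument count precedes the arguments themselves):

    dmsetup create pool \
        --table "0 20971520 thin-pool $metadata_dev $data_dev \
        128 32768 2 skip_block_zeroing error_if_no_space"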
ii) Status

    <transaction id> <used metadata blocks>/<total metadata blocks>
    <used data blocks>/<total data blocks> <held metadata root>
    ro|rw|out_of_data_space [no_]discard_passdown
    [error|queue]_if_no_space needs_check|- metadata_low_watermark

    transaction id:
        A 64-bit number used by userspace to help synchronise with
        metadata from volume managers.

    used data blocks / total data blocks:
        If the number of free blocks drops below the pool's low water
        mark a dm event will be sent to userspace. This event is
        edge-triggered and it will occur only once after each resume
        so volume manager writers should register for the event and
        then check the target's status.

    held metadata root:
        The location, in blocks, of the metadata root that has been
        'held' for userspace read access. '-' indicates there is no
        held root.

    discard_passdown|no_discard_passdown:
        Whether or not discards are actually being passed down to the
        underlying device. Even if this is enabled when the table is
        loaded, it can become disabled if the underlying device
        doesn't support it.

    ro|rw|out_of_data_space:
        If the pool encounters certain types of device failures it
        will drop into a read-only metadata mode in which no changes
        to the pool metadata (like allocating new blocks) are
        permitted.

        In serious cases where even a read-only mode is deemed unsafe
        no further I/O will be permitted and the status will just
        contain the string 'Fail'. The userspace recovery tools should
        then be used.

    error_if_no_space|queue_if_no_space:
        If the pool runs out of data or metadata space, the pool will
        either queue or error the IO destined to the data device. The
        default is to queue the IO until more space is added or the
        'no_space_timeout' expires. The 'no_space_timeout' dm-thin-pool
        module parameter can be used to change this timeout -- it
        defaults to 60 seconds but may be disabled using a value of 0.

    needs_check:
        A metadata operation has failed, resulting in the needs_check
        flag being set in the metadata's superblock. The metadata
        device must be deactivated and checked/repaired before the
        thin-pool can be made fully operational again. '-' indicates
        needs_check is not set.

    metadata_low_watermark:
        Value of metadata low watermark in blocks. The kernel sets
        this value internally but userspace needs to know this value
        to determine if an event was caused by crossing this
        threshold.

iii) Messages

    create_thin <dev id>

        Create a new thinly-provisioned device.
        <dev id> is an arbitrary unique 24-bit identifier chosen by
        the caller.

    create_snap <dev id> <origin id>

        Create a new snapshot of another thinly-provisioned device.
        <dev id> is an arbitrary unique 24-bit identifier chosen by
        the caller.
        <origin id> is the identifier of the thinly-provisioned device
        of which the new device will be a snapshot.

    delete <dev id>

        Deletes a thin device. Irreversible.

    set_transaction_id <current id> <new id>

        Userland volume managers, such as LVM, need a way to
        synchronise their external metadata with the internal metadata
        of the pool target. The thin-pool target offers to store an
        arbitrary 64-bit transaction id and return it on the target's
        status line. To avoid races you must provide what you think
        the current transaction id is when you change it with this
        compare-and-swap message.

    reserve_metadata_snap

        Reserve a copy of the data mapping btree for use by userland.
        This allows userland to inspect the mappings as they were when
        this message was executed. Use the pool's status command to
        get the root block associated with the metadata snapshot.

    release_metadata_snap

        Release a previously reserved copy of the data mapping btree.

'thin' target
-------------

i) Constructor

    thin <pool dev> <dev id> [<external origin dev>]

    pool dev:
        the thin-pool device, e.g. /dev/mapper/my_pool or 253:0

    dev id:
        the internal device identifier of the device to be activated.

    external origin dev:
        an optional block device outside the pool to be treated as a
        read-only snapshot origin: reads to unprovisioned areas of the
        thin target will be mapped to this device.

The pool doesn't store any size against the thin devices. If you load
a thin target that is smaller than you've been using previously, then
you'll have no access to blocks mapped beyond the end. If you load a
target that is bigger than before, then extra blocks will be
provisioned as and when needed.

ii) Status

    <nr mapped sectors> <highest mapped sector>

    If the pool has encountered device errors and failed, the status
    will just contain the string 'Fail'. The userspace recovery tools
    should then be used.

    In the case where <nr mapped sectors> is 0, there is no highest
    mapped sector and the value of <highest mapped sector> is
    unspecified.
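As a closing sketch, the messages and status lines above can be
combined to inspect a pool's metadata snapshot (device names as in the
cookbook section):

    dmsetup status pool
    dmsetup message /dev/mapper/pool 0 reserve_metadata_snap
    dmsetup status pool    # <held metadata root> is now a block number
    dmsetup message /dev/mapper/pool 0 release_metadata_snap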