Wound/Wait Deadlock-Proof Mutex Design
======================================

Please read mutex-design.txt first, as it applies to wait/wound mutexes too.

Motivation for WW-Mutexes
-------------------------

GPUs do operations that commonly involve many buffers. Those buffers
can be shared across contexts/processes, exist in different memory
domains (for example VRAM vs system memory), and so on. And with
PRIME / dmabuf, they can even be shared across devices. So there are
a handful of situations where the driver needs to wait for buffers to
become ready. If you think about this in terms of waiting on a buffer
mutex for it to become available, this presents a problem because
there is no way to guarantee that buffers appear in an execbuf/batch in
the same order in all contexts. That is directly under control of
userspace, and a result of the sequence of GL calls that an application
makes, which creates the potential for deadlock. The problem gets
more complex when you consider that the kernel may need to migrate the
buffer(s) into VRAM before the GPU operates on the buffer(s), which
may in turn require evicting some other buffers (and you don't want to
evict other buffers which are already queued up to the GPU), but for a
simplified understanding of the problem you can ignore this.

The algorithm that the TTM graphics subsystem came up with for dealing with
this problem is quite simple. For each group of buffers (execbuf) that needs
to be locked, the caller is assigned a unique reservation id/ticket
from a global counter. In case of deadlock while locking all the buffers
associated with an execbuf, the one with the lowest reservation ticket (i.e.
the oldest task) wins, and the one with the higher reservation id (i.e. the
younger task) unlocks all of the buffers that it has already locked, and then
tries again.

In the RDBMS literature, a reservation ticket is associated with a transaction.
The deadlock handling approach is called Wait-Die. The name is based on
the actions of a locking thread when it encounters an already locked mutex.
If the transaction holding the lock is younger, the locking transaction waits.
If the transaction holding the lock is older, the locking transaction backs off
and dies. Hence Wait-Die.
There is also another algorithm called Wound-Wait:
If the transaction holding the lock is younger, the locking transaction
wounds the transaction holding the lock, requesting it to die.
If the transaction holding the lock is older, the locking transaction waits
for the other transaction. Hence Wound-Wait.
The two algorithms are both fair in that a transaction will eventually succeed.
The Wound-Wait algorithm is typically stated to generate fewer backoffs than
Wait-Die but is, on the other hand, associated with more work than
Wait-Die when recovering from a backoff. Wound-Wait is also a preemptive
algorithm in that transactions are wounded by other transactions, and that
requires a reliable way to pick up the wounded condition and preempt the
running transaction. Note that this is not the same as process preemption. A
Wound-Wait transaction is considered preempted when it dies (returning
-EDEADLK) following a wound.

Concepts
--------

Compared to normal mutexes two additional concepts/objects show up in the lock
interface for w/w mutexes:

Acquire context: To ensure eventual forward progress it is important that a task
trying to acquire locks doesn't grab a new reservation id, but keeps the one it
acquired when starting the lock acquisition. This ticket is stored in the
acquire context. Furthermore the acquire context keeps track of debugging state
to catch w/w mutex interface abuse. An acquire context represents a
transaction.
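
As a rough illustration of the Wait-Die and Wound-Wait rules described under
Motivation, the decision a locker takes when it finds a mutex already held can
be modelled in plain userspace C. This is only a sketch under stated
assumptions: the names used here (struct txn, enum ww_action, wait_die,
wound_wait) are made up for this example and are not part of the kernel
interface, and wraparound of the ticket counter is ignored.

```c
#include <stdbool.h>

/* Hypothetical names for this sketch only; none of this is kernel API. */
enum ww_action {
	ACT_WAIT,	/* requester sleeps until the lock is released */
	ACT_DIE,	/* requester backs off: drops its locks and retries */
	ACT_WOUND,	/* requester asks the holder to die, then waits */
};

struct txn {
	unsigned long stamp;	/* ticket from a global counter; lower = older */
};

/* Assumes unique stamps and no counter wraparound. */
static bool txn_is_older(const struct txn *a, const struct txn *b)
{
	return a->stamp < b->stamp;
}

/* Wait-Die: an older requester waits, a younger requester dies. */
static enum ww_action wait_die(const struct txn *req, const struct txn *holder)
{
	return txn_is_older(req, holder) ? ACT_WAIT : ACT_DIE;
}

/* Wound-Wait: an older requester wounds the holder, a younger one waits. */
static enum ww_action wound_wait(const struct txn *req, const struct txn *holder)
{
	return txn_is_older(req, holder) ? ACT_WOUND : ACT_WAIT;
}
```

In both rules the older transaction is never the one that backs off. Because a
transaction keeps its stamp across retries, it eventually becomes the oldest
contender, which is why the ticket has to live in the acquire context rather
than being re-drawn on every attempt.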

W/w class: In contrast to normal mutexes the lock class needs to be explicit for
w/w mutexes, since it is required to initialize the acquire context. The lock
class also specifies what algorithm to use, Wound-Wait or Wait-Die.

Furthermore there are three different classes of w/w lock acquire functions:

* Normal lock acquisition with a context, using ww_mutex_lock.

* Slowpath lock acquisition on the contending lock, used by the task that just
  killed its transaction after having dropped all already acquired locks.
  These functions have the _slow postfix.

  From a simple semantics point-of-view the _slow functions are not strictly
  required, since simply calling the normal ww_mutex_lock functions on the
  contending lock (after having dropped all other already acquired locks) will
  work correctly. After all, if no other ww mutex has been acquired yet there's
  no deadlock potential and hence the ww_mutex_lock call will block and not
  prematurely return -EDEADLK. The advantage of the _slow functions is in
  interface safety:
  - ww_mutex_lock has a __must_check int return type, whereas ww_mutex_lock_slow
    has a void return type. Note that since ww mutex code needs loops/retries
    anyway the __must_check doesn't result in spurious warnings, even though the
    very first lock operation can never fail.
  - When full debugging is enabled ww_mutex_lock_slow checks that all acquired
    ww mutexes have been released (preventing deadlocks) and makes sure that we
    block on the contending lock (preventing spinning through the -EDEADLK
    slowpath until the contended lock can be acquired).

* Functions to only acquire a single w/w mutex, which results in the exact same
  semantics as a normal mutex. This is done by calling ww_mutex_lock with a NULL
  context.

  Again this is not strictly required.
  But often you only want to acquire a single lock, in which case it's
  pointless to set up an acquire context (and so better to avoid grabbing a
  deadlock avoidance ticket).

Of course, all the usual variants for handling wake-ups due to signals are also
provided.

Usage
-----

The algorithm (Wait-Die vs Wound-Wait) is chosen by using either
DEFINE_WW_CLASS() (Wound-Wait) or DEFINE_WD_CLASS() (Wait-Die).
As a rough rule of thumb, use Wound-Wait iff you
expect the number of simultaneous competing transactions to be typically small,
and you want to reduce the number of rollbacks.

There are three different ways to acquire locks within the same w/w class.
Common definitions for methods #1 and #2:

static DEFINE_WW_CLASS(ww_class);

struct obj {
	struct ww_mutex lock;
	/* obj data */
};

struct obj_entry {
	struct list_head head;
	struct obj *obj;
};

Method 1, using a list in execbuf->buffers that's not allowed to be reordered.
This is useful if a list of required objects is already tracked somewhere.
Furthermore the lock helper can propagate the -EALREADY return code back to
the caller as a signal that an object is twice on the list. This is useful if
the list is constructed from userspace input and the ABI requires userspace to
not have duplicate entries (e.g. for a gpu commandbuffer submission ioctl).

int lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
{
	struct obj *res_obj = NULL;
	struct obj_entry *contended_entry = NULL;
	struct obj_entry *entry;
	int ret;

	ww_acquire_init(ctx, &ww_class);

retry:
	list_for_each_entry (entry, list, head) {
		/* res_obj was already locked through the slowpath, skip it */
		if (entry->obj == res_obj) {
			res_obj = NULL;
			continue;
		}
		ret = ww_mutex_lock(&entry->obj->lock, ctx);
		if (ret < 0) {
			contended_entry = entry;
			goto err;
		}
	}

	ww_acquire_done(ctx);
	return 0;

err:
	/* unlock everything acquired before the failing entry */
	list_for_each_entry_continue_reverse (entry, list, head)
		ww_mutex_unlock(&entry->obj->lock);

	if (res_obj)
		ww_mutex_unlock(&res_obj->lock);

	if (ret == -EDEADLK) {
		/* we lost out in a seqno race, lock and retry.. */
		ww_mutex_lock_slow(&contended_entry->obj->lock, ctx);
		res_obj = contended_entry->obj;
		goto retry;
	}
	ww_acquire_fini(ctx);

	return ret;
}

Method 2, using a list in execbuf->buffers that can be reordered. Same semantics
of duplicate entry detection using -EALREADY as method 1 above. But the
list-reordering allows for a bit more idiomatic code.

int lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
{
	struct obj_entry *entry, *entry2;
	int ret;

	ww_acquire_init(ctx, &ww_class);

	list_for_each_entry (entry, list, head) {
		ret = ww_mutex_lock(&entry->obj->lock, ctx);
		if (ret < 0) {
			entry2 = entry;

			list_for_each_entry_continue_reverse (entry2, list, head)
				ww_mutex_unlock(&entry2->obj->lock);

			if (ret != -EDEADLK) {
				ww_acquire_fini(ctx);
				return ret;
			}

			/* we lost out in a seqno race, lock and retry.. */
			ww_mutex_lock_slow(&entry->obj->lock, ctx);

			/*
			 * Move entry to the head of the list; this makes
			 * entry->head.next point at the first unlocked
			 * entry, restarting the for loop.
			 */
			list_del(&entry->head);
			list_add(&entry->head, list);
		}
	}

	ww_acquire_done(ctx);
	return 0;
}

Unlocking works the same way for both methods #1 and #2:

void unlock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
{
	struct obj_entry *entry;

	list_for_each_entry (entry, list, head)
		ww_mutex_unlock(&entry->obj->lock);

	ww_acquire_fini(ctx);
}

Method 3 is useful if the list of objects is constructed ad-hoc and not upfront,
e.g. when adjusting edges in a graph where each node has its own ww_mutex lock,
and edges can only be changed when holding the locks of all involved nodes. w/w
mutexes are a natural fit for such a case for two reasons:
- They can handle lock-acquisition in any order which allows us to start walking
  a graph from a starting point and then iteratively discover new edges and
  lock down the nodes those edges connect to.
- Due to the -EALREADY return code signalling that a given object is already
  held there's no need for additional book-keeping to break cycles in the graph
  or keep track of which locks are already held (when using more than one node
  as a starting point).

Note that this approach differs in two important ways from the above methods:
- Since the list of objects is dynamically constructed (and might very well be
  different when retrying due to hitting the -EDEADLK die condition) there's
  no need to keep any object on a persistent list when it's not locked. We can
  therefore move the list_head into the object itself.
- On the other hand the dynamic object list construction also means that the
  -EALREADY return code can't be propagated.

Note also that methods #1 and #2 and method #3 can be combined, e.g. to first
lock a list of starting nodes (passed in from userspace) using one of the above
methods.
Then lock any additional objects affected by the operations using
method #3 below. The backoff/retry procedure will be a bit more involved, since
when the dynamic locking step hits -EDEADLK we also need to unlock all the
objects acquired with the fixed list. But the w/w mutex debug checks will catch
any interface misuse for these cases.

Also, method 3 can't fail the lock acquisition step since it doesn't return
-EALREADY. Of course this would be different when using the _interruptible
variants, but that's outside of the scope of these examples here.

struct obj {
	struct ww_mutex ww_mutex;
	struct list_head locked_list;
};

static DEFINE_WW_CLASS(ww_class);

void __unlock_objs(struct list_head *list)
{
	struct obj *entry, *temp;

	list_for_each_entry_safe (entry, temp, list, locked_list) {
		/*
		 * need to do that before unlocking, since only the current
		 * lock holder is allowed to use object
		 */
		list_del(&entry->locked_list);
		ww_mutex_unlock(&entry->ww_mutex);
	}
}

void lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
{
	struct obj *obj;
	int ret;

	ww_acquire_init(ctx, &ww_class);

retry:
	/* re-init loop start state */
	loop {
		/*
		 * magic code which walks over a graph and decides which
		 * objects to lock; it sets obj to the next candidate
		 */

		ret = ww_mutex_lock(&obj->ww_mutex, ctx);
		if (ret == -EALREADY) {
			/* we have that one already, get to the next object */
			continue;
		}
		if (ret == -EDEADLK) {
			__unlock_objs(list);

			ww_mutex_lock_slow(&obj->ww_mutex, ctx);
			list_add(&obj->locked_list, list);
			goto retry;
		}

		/* locked a new object, add it to the list */
		list_add_tail(&obj->locked_list, list);
	}

	ww_acquire_done(ctx);
}

void unlock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
{
	__unlock_objs(list);
	ww_acquire_fini(ctx);
}

Method 4: Only lock one single object.
In that case deadlock detection and
prevention is obviously overkill, since by grabbing just one lock you can't
produce a deadlock within just one class. To simplify this case the w/w mutex
api can be used with a NULL context.

Implementation Details
----------------------

Design:
  ww_mutex currently encapsulates a struct mutex; this means no extra overhead
  for normal mutex locks, which are far more common. As such there is only a
  small increase in code size if wait/wound mutexes are not used.

  We maintain the following invariants for the wait list:
  (1) Waiters with an acquire context are sorted by stamp order; waiters
      without an acquire context are interspersed in FIFO order.
  (2) For Wait-Die, among waiters with contexts, only the first one can have
      other locks acquired already (ctx->acquired > 0). Note that this waiter
      may come after other waiters without contexts in the list.

  The Wound-Wait preemption is implemented with a lazy-preemption scheme:
  The wounded status of the transaction is checked only when there is
  contention for a new lock and hence a true chance of deadlock. In that
  situation, if the transaction is wounded, it backs off, clears the
  wounded status and retries. A great benefit of implementing preemption in
  this way is that the wounded transaction can identify a contending lock to
  wait for before restarting the transaction. Just blindly restarting the
  transaction would likely make the transaction end up in a situation where
  it would have to back off again.

  In general, not much contention is expected. The locks are typically used to
  serialize access to resources for devices, and optimization focus should
  therefore be directed towards the uncontended cases.

Lockdep:
  Special care has been taken to warn for as many cases of api abuse
  as possible.
  Some common api abuses will be caught with
  CONFIG_DEBUG_MUTEXES, but CONFIG_PROVE_LOCKING is recommended.

  Some of the errors which will be warned about:
  - Forgetting to call ww_acquire_fini or ww_acquire_init.
  - Attempting to lock more mutexes after ww_acquire_done.
  - Attempting to lock the wrong mutex after -EDEADLK and
    unlocking all mutexes.
  - Attempting to lock the right mutex after -EDEADLK,
    before unlocking all mutexes.
  - Calling ww_mutex_lock_slow before -EDEADLK was returned.
  - Unlocking mutexes with the wrong unlock function.
  - Calling one of the ww_acquire_* functions twice on the same context.
  - Using a different ww_class for the mutex than for the ww_acquire_ctx.
  - Normal lockdep errors that can result in deadlocks.

  Some of the lockdep errors that can result in deadlocks:
  - Calling ww_acquire_init to initialize a second ww_acquire_ctx before
    having called ww_acquire_fini on the first.
  - 'normal' deadlocks that can occur.

FIXME: Update this section once we have the TASK_DEADLOCK task state flag magic
implemented.