=======================================================
Semantics and Behavior of Atomic and Bitmask Operations
=======================================================

:Author: David S. Miller

This document is intended to serve as a guide to Linux port
maintainers on how to implement atomic counter, bitops, and spinlock
interfaces properly.

Atomic Type And Operations
==========================

The atomic_t type should be defined as a signed integer and
the atomic_long_t type as a signed long integer.  Also, they should
be made opaque such that any kind of cast to a normal C integer type
will fail.  Something like the following should suffice::

        typedef struct { int counter; } atomic_t;
        typedef struct { long counter; } atomic_long_t;

Historically, counter has been declared volatile.  This is now discouraged.
See :ref:`Documentation/process/volatile-considered-harmful.rst
<volatile_considered_harmful>` for the complete rationale.

local_t is very similar to atomic_t.  If the counter is per CPU and only
updated by one CPU, local_t is probably more appropriate.  Please see
:ref:`Documentation/core-api/local_ops.rst <local_ops>` for the semantics of
local_t.

The first operations to implement for atomic_t's are the initializers and
plain writes. ::

        #define ATOMIC_INIT(i)          { (i) }
        #define atomic_set(v, i)        ((v)->counter = (i))

The first macro is used in definitions, such as::

        static atomic_t my_counter = ATOMIC_INIT(1);

The initializer is atomic in that the return values of the atomic operations
are guaranteed to correctly reflect the initialized value if the initializer
is used before runtime.  If the initializer is used at runtime, a proper
implicit or explicit read memory barrier is needed before reading the value
with atomic_read from another thread.

As with all of the ``atomic_`` interfaces, replace the leading ``atomic_``
with ``atomic_long_`` to operate on atomic_long_t.

The second interface can be used at runtime, as in::

        struct foo { atomic_t counter; };
        ...

        struct foo *k;

        k = kmalloc(sizeof(*k), GFP_KERNEL);
        if (!k)
                return -ENOMEM;
        atomic_set(&k->counter, 0);

The setting is atomic in that the return values of the atomic operations by
all threads are guaranteed to correctly reflect either the value that has
been set with this operation or the value set with another operation.  A
proper implicit or explicit memory barrier is needed before the value set
with the operation is guaranteed to be readable with atomic_read from
another thread.

Next, we have::

        #define atomic_read(v)  ((v)->counter)

which simply reads the counter value currently visible to the calling thread.
The read is atomic in that the return value is guaranteed to be one of the
values initialized or modified with the interface operations if a proper
implicit or explicit memory barrier is used after possible runtime
initialization by any other thread and the value is modified only with the
interface operations.  atomic_read does not guarantee that the runtime
initialization by any other thread is visible yet, so the user of the
interface must take care of that with a proper implicit or explicit memory
barrier.

.. warning::

        ``atomic_read()`` and ``atomic_set()`` DO NOT IMPLY BARRIERS!

        Some architectures may choose to use the volatile keyword, barriers, or
        inline assembly to guarantee some degree of immediacy for atomic_read()
        and atomic_set().  This is not uniformly guaranteed, and may change in
        the future, so all users of atomic_t should treat atomic_read() and
        atomic_set() as simple C statements that may be reordered or optimized
        away entirely by the compiler or processor, and explicitly invoke the
        appropriate compiler and/or memory barrier for each use case.  Failure
        to do so will result in code that may suddenly break when used with
        different architectures or compiler optimizations, or even changes in
        unrelated code which changes how the compiler optimizes the section
        accessing atomic_t variables.
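Since no ordering is implied, a caller that uses atomic_set() to publish a
flag telling other CPUs that some ordinary data is ready must supply the
barriers itself.  The following is only a sketch, using hypothetical
shared_data and data_ready variables, of how smp_mb() might be paired on
both sides::

        /* Sketch only: shared_data and data_ready are hypothetical. */

        /* Writer: make shared_data visible before the flag is set. */
        shared_data = 42;
        smp_mb();
        atomic_set(&data_ready, 1);

        /* Reader: order the flag check before the read of shared_data. */
        if (atomic_read(&data_ready)) {
                smp_mb();
                do_something_with(shared_data);
        }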
Properly aligned pointers, longs, ints, and chars (and unsigned
equivalents) may be atomically loaded from and stored to in the same
sense as described for atomic_read() and atomic_set().  The READ_ONCE()
and WRITE_ONCE() macros should be used to prevent the compiler from using
optimizations that might otherwise optimize accesses out of existence on
the one hand, or that might create unsolicited accesses on the other.

For example, consider the following code::

        while (a > 0)
                do_something();

If the compiler can prove that do_something() does not store to the
variable a, then the compiler is within its rights to transform this into
the following::

        if (a > 0)
                for (;;)
                        do_something();

If you don't want the compiler to do this (and you probably don't), then
you should use something like the following::

        while (READ_ONCE(a) > 0)
                do_something();

Alternatively, you could place a barrier() call in the loop.

For another example, consider the following code::

        tmp_a = a;
        do_something_with(tmp_a);
        do_something_else_with(tmp_a);

If the compiler can prove that do_something_with() does not store to the
variable a, then the compiler is within its rights to manufacture an
additional load as follows::

        tmp_a = a;
        do_something_with(tmp_a);
        tmp_a = a;
        do_something_else_with(tmp_a);

This could fatally confuse your code if it expected the same value
to be passed to do_something_with() and do_something_else_with().

The compiler would be likely to manufacture this additional load if
do_something_with() was an inline function that made very heavy use
of registers: reloading from variable a could save a flush to the
stack and later reload.  To prevent the compiler from attacking your
code in this manner, write the following::

        tmp_a = READ_ONCE(a);
        do_something_with(tmp_a);
        do_something_else_with(tmp_a);

For a final example, consider the following code, assuming that the
variable a is set at boot time before the second CPU is brought online
and never changed later, so that memory barriers are not needed::

        if (a)
                b = 9;
        else
                b = 42;

The compiler is within its rights to manufacture an additional store
by transforming the above code into the following::

        b = 42;
        if (a)
                b = 9;

This could come as a fatal surprise to other code running concurrently
that expected b to never have the value 42 if a was zero.  To prevent
the compiler from doing this, write something like::

        if (a)
                WRITE_ONCE(b, 9);
        else
                WRITE_ONCE(b, 42);

Don't even -think- about doing this without proper use of memory barriers,
locks, or atomic operations if variable a can change at runtime!

.. warning::

        ``READ_ONCE()`` OR ``WRITE_ONCE()`` DO NOT IMPLY A BARRIER!
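As a further illustration of the point about properly aligned pointers, a
pointer that is published once and then read locklessly can be handed around
with WRITE_ONCE()/READ_ONCE() so that neither access is torn or duplicated,
but, as the warning says, any ordering against the pointed-to data still has
to be supplied explicitly.  A sketch, with a hypothetical global_ptr and a
hypothetical val field::

        /* Sketch only: global_ptr and the val field are hypothetical. */

        /* Publisher: initialize the object, then publish the pointer. */
        p->val = 1;
        smp_mb();                       /* order the init before the publish */
        WRITE_ONCE(global_ptr, p);

        /* Consumer: one untorn load of the pointer, ordered before the
         * dereference by an explicit barrier.
         */
        p = READ_ONCE(global_ptr);
        if (p) {
                smp_mb();
                do_something_with(p->val);
        }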
Now, we move on to the atomic operation interfaces typically implemented with
the help of assembly code. ::

        void atomic_add(int i, atomic_t *v);
        void atomic_sub(int i, atomic_t *v);
        void atomic_inc(atomic_t *v);
        void atomic_dec(atomic_t *v);

These four routines add and subtract integral values to/from the given
atomic_t value.  The first two routines pass explicit integers by
which to make the adjustment, whereas the latter two use an implicit
adjustment value of "1".

One very important aspect of these routines is that they DO NOT
require any explicit memory barriers.  They need only perform the
atomic_t counter update in an SMP safe manner.

Next, we have::

        int atomic_inc_return(atomic_t *v);
        int atomic_dec_return(atomic_t *v);

These routines add 1 and subtract 1, respectively, from the given
atomic_t and return the new counter value after the operation is
performed.

Unlike the above routines, it is required that these primitives
include explicit memory barriers that are performed before and after
the operation.  It must be done such that all memory operations before
and after the atomic operation calls are strongly ordered with respect
to the atomic operation itself.

For example, it should behave as if a smp_mb() call existed both
before and after the atomic operation.

If the atomic instructions used in an implementation provide explicit
memory barrier semantics which satisfy the above requirements, that is
fine as well.

Let's move on::

        int atomic_add_return(int i, atomic_t *v);
        int atomic_sub_return(int i, atomic_t *v);

These behave just like atomic_{inc,dec}_return() except that an
explicit counter adjustment is given instead of the implicit "1".
This means that like atomic_{inc,dec}_return(), the memory barrier
semantics are required.

Next::

        int atomic_inc_and_test(atomic_t *v);
        int atomic_dec_and_test(atomic_t *v);

These two routines increment and decrement by 1, respectively, the
given atomic counter.  They return a boolean indicating whether the
resulting counter value was zero or not.

Again, these primitives provide explicit memory barrier semantics around
the atomic operation::

        int atomic_sub_and_test(int i, atomic_t *v);

This is identical to atomic_dec_and_test() except that an explicit
decrement is given instead of the implicit "1".  This primitive must
provide explicit memory barrier semantics around the operation::

        int atomic_add_negative(int i, atomic_t *v);

The given increment is added to the given atomic counter value.  A boolean
is returned which indicates whether the resulting counter value is negative.
This primitive must provide explicit memory barrier semantics around
the operation.
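To make the barrier requirement concrete for implementers, here is a minimal
sketch of how a port that only has an unordered read-modify-write primitive
might build a fully ordered atomic_add_return().  The port_add_return_relaxed()
helper is hypothetical and stands in for whatever unordered primitive the
port provides::

        static inline int atomic_add_return(int i, atomic_t *v)
        {
                int ret;

                smp_mb();       /* order prior memory operations before the update */
                ret = port_add_return_relaxed(i, v);    /* hypothetical unordered RMW */
                smp_mb();       /* order the update before subsequent operations */

                return ret;
        }

If the port's update instruction already acts as a full barrier, the explicit
smp_mb() calls can of course be dropped.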
Then::

        int atomic_xchg(atomic_t *v, int new);

This performs an atomic exchange operation on the atomic variable v, setting
the given new value.  It returns the old value that the atomic variable v had
just before the operation.

atomic_xchg must provide explicit memory barriers around the operation. ::

        int atomic_cmpxchg(atomic_t *v, int old, int new);

This performs an atomic compare exchange operation on the atomic value v,
with the given old and new values.  Like all atomic_xxx operations,
atomic_cmpxchg will only satisfy its atomicity semantics as long as all
other accesses of \*v are performed through atomic_xxx operations.

atomic_cmpxchg must provide explicit memory barriers around the operation,
although if the comparison fails then no memory ordering guarantees are
required.

The semantics for atomic_cmpxchg are the same as those defined for 'cas'
below.

Finally::

        int atomic_add_unless(atomic_t *v, int a, int u);

If the atomic value v is not equal to u, this function adds a to v, and
returns non-zero.  If v is equal to u then it returns zero.  This is done as
an atomic operation.

atomic_add_unless must provide explicit memory barriers around the
operation unless it fails (returns 0).

atomic_inc_not_zero() is equivalent to atomic_add_unless(v, 1, 0).

If a caller requires memory barrier semantics around an atomic_t
operation which does not return a value, a set of interfaces are
defined which accomplish this::

        void smp_mb__before_atomic(void);
        void smp_mb__after_atomic(void);

Preceding a non-value-returning read-modify-write atomic operation with
smp_mb__before_atomic() and following it with smp_mb__after_atomic()
provides the same full ordering that is provided by value-returning
read-modify-write atomic operations.

For example, smp_mb__before_atomic() can be used like so::

        obj->dead = 1;
        smp_mb__before_atomic();
        atomic_dec(&obj->ref_count);

It makes sure that all memory operations preceding the atomic_dec()
call are strongly ordered with respect to the atomic counter
operation.  In the above example, it guarantees that the assignment of
"1" to obj->dead will be globally visible to other cpus before the
atomic counter decrement.

Without the explicit smp_mb__before_atomic() call, the implementation
could legally allow the atomic counter update to become visible to
other cpus before the "obj->dead = 1;" assignment.
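smp_mb__after_atomic() is the counterpart for ordering the atomic update
before later memory operations.  As an illustrative sketch, using a
hypothetical "published" field::

        atomic_inc(&obj->ref_count);
        smp_mb__after_atomic();
        obj->published = 1;     /* "published" is hypothetical */

Here the barrier guarantees that the reference count increment is globally
visible to other cpus before the store to obj->published.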
A missing memory barrier in the cases where they are required by the
atomic_t implementation above can have disastrous results.  Here is
an example, which follows a pattern occurring frequently in the Linux
kernel.  It is the use of atomic counters to implement reference
counting, and it works such that once the counter falls to zero it can
be guaranteed that no other entity can be accessing the object::

        static void obj_list_add(struct obj *obj, struct list_head *head)
        {
                obj->active = 1;
                list_add(&obj->list, head);
        }

        static void obj_list_del(struct obj *obj)
        {
                list_del(&obj->list);
                obj->active = 0;
        }

        static void obj_destroy(struct obj *obj)
        {
                BUG_ON(obj->active);
                kfree(obj);
        }

        struct obj *obj_list_peek(struct list_head *head)
        {
                if (!list_empty(head)) {
                        struct obj *obj;

                        obj = list_entry(head->next, struct obj, list);
                        atomic_inc(&obj->refcnt);
                        return obj;
                }
                return NULL;
        }

        void obj_poke(void)
        {
                struct obj *obj;

                spin_lock(&global_list_lock);
                obj = obj_list_peek(&global_list);
                spin_unlock(&global_list_lock);

                if (obj) {
                        obj->ops->poke(obj);
                        if (atomic_dec_and_test(&obj->refcnt))
                                obj_destroy(obj);
                }
        }

        void obj_timeout(struct obj *obj)
        {
                spin_lock(&global_list_lock);
                obj_list_del(obj);
                spin_unlock(&global_list_lock);

                if (atomic_dec_and_test(&obj->refcnt))
                        obj_destroy(obj);
        }

.. note::

        This is a simplification of the ARP queue management in the generic
        neighbour discovery code of the networking subsystem.  Olaf Kirch
        found a bug wrt. memory barriers in kfree_skb() that exposed the
        atomic_t memory barrier requirements quite clearly.

Given the above scheme, it must be the case that the obj->active
update done by the obj list deletion be visible to other processors
before the atomic counter decrement is performed.

Otherwise, the counter could fall to zero, yet obj->active would still
be set, thus triggering the assertion in obj_destroy().  The error
sequence looks like this::

        cpu 0                           cpu 1
        obj_poke()                      obj_timeout()
        obj = obj_list_peek();
        ... gains ref to obj, refcnt=2
                                        obj_list_del(obj);
                                        obj->active = 0 ...
                                        ... visibility delayed ...
        atomic_dec_and_test()
        ... refcnt drops to 1 ...
                                        atomic_dec_and_test()
                                        ... refcnt drops to 0 ...
                                        obj_destroy()
                                        BUG() triggers since obj->active
                                        still seen as one
        obj->active update visibility occurs

With the memory barrier semantics required of the atomic_t operations
which return values, the above sequence of memory visibility can never
happen.  Specifically, in the above case the atomic_dec_and_test()
counter decrement would not become globally visible until the
obj->active update does.

As a historical note, 32-bit Sparc used to only allow usage of
24-bits of its atomic_t type.  This was because it used 8 bits
as a spinlock for SMP safety.  Sparc32 lacked a "compare and swap"
type instruction.  However, 32-bit Sparc has since been moved over
to a "hash table of spinlocks" scheme that allows the full 32-bit
counter to be realized.  Essentially, an array of spinlocks is
indexed into based upon the address of the atomic_t being operated
on, and that lock protects the atomic operation.  Parisc uses the
same scheme.

Another note is that the atomic_t operations returning values are
extremely slow on an old 386.
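For ports that, like sparc32, lack suitable atomic instructions, the "hash
table of spinlocks" scheme described above is straightforward to sketch.
The hash size and shift below are arbitrary choices for illustration, and
the lock array is assumed to be initialized with spin_lock_init() during
boot::

        /* Sketch only: hash size and shift are arbitrary for illustration. */
        #define ATOMIC_HASH_SIZE        4
        #define ATOMIC_HASH(v)  \
                (&atomic_locks[((unsigned long)(v) >> 4) % ATOMIC_HASH_SIZE])

        static spinlock_t atomic_locks[ATOMIC_HASH_SIZE];

        int atomic_add_return(int i, atomic_t *v)
        {
                unsigned long flags;
                int ret;

                spin_lock_irqsave(ATOMIC_HASH(v), flags);
                ret = (v->counter += i);
                spin_unlock_irqrestore(ATOMIC_HASH(v), flags);

                return ret;
        }

Ordering requirements are omitted here for brevity; a real port must also
honour the memory barrier semantics described above.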
Atomic Bitmask
==============

We will now cover the atomic bitmask operations.  You will find that
their SMP and memory barrier semantics are similar in shape and scope
to the atomic_t ops above.

Native atomic bit operations are defined to operate on objects aligned
to the size of an "unsigned long" C data type, and are at least of that
size.  The endianness of the bits within each "unsigned long" is the
native endianness of the cpu. ::

        void set_bit(unsigned long nr, volatile unsigned long *addr);
        void clear_bit(unsigned long nr, volatile unsigned long *addr);
        void change_bit(unsigned long nr, volatile unsigned long *addr);

These routines set, clear, and change, respectively, the bit number
indicated by "nr" in the bit mask pointed to by "addr".

They must execute atomically, yet there are no implicit memory barrier
semantics required of these interfaces. ::

        int test_and_set_bit(unsigned long nr, volatile unsigned long *addr);
        int test_and_clear_bit(unsigned long nr, volatile unsigned long *addr);
        int test_and_change_bit(unsigned long nr, volatile unsigned long *addr);

Like the above, except that these routines return a boolean which
indicates whether the changed bit was set _BEFORE_ the atomic bit
operation.

.. warning::

        It is incredibly important that the value be a boolean, i.e. "0" or "1".
        Do not try to be fancy and save a few instructions by declaring the
        above to return "long" and just returning something like "old_val &
        mask" because that will not work.

For one thing, this return value gets truncated to int in many code
paths using these interfaces, so on 64-bit if the bit is set in the
upper 32-bits then testers will never see that.

One great example of where this problem crops up is the thread_info
flag operations.  Routines such as test_and_set_ti_thread_flag() chop
the return value into an int.  There are other places where things
like this occur as well.

These routines, like the atomic_t counter operations returning values,
must provide explicit memory barrier semantics around their execution.
All memory operations before the atomic bit operation call must be
made visible globally before the atomic bit operation is made visible.
Likewise, the atomic bit operation must be visible globally before any
subsequent memory operation is made visible.  For example::

        obj->dead = 1;
        if (test_and_set_bit(0, &obj->flags))
                /* ... */;
        obj->killed = 1;

The implementation of test_and_set_bit() must guarantee that
"obj->dead = 1;" is visible to cpus before the atomic memory operation
done by test_and_set_bit() becomes visible.  Likewise, the atomic
memory operation done by test_and_set_bit() must become visible before
"obj->killed = 1;" is visible.

Finally there is the basic operation::

        int test_bit(unsigned long nr, __const__ volatile unsigned long *addr);

Which returns a boolean indicating if bit "nr" is set in the bitmask
pointed to by "addr".

If explicit memory barriers are required around {set,clear}_bit() (which do
not return a value, and thus do not need to provide memory barrier
semantics), two interfaces are provided::

        void smp_mb__before_atomic(void);
        void smp_mb__after_atomic(void);

They are used as follows, and are akin to their atomic_t operation
brothers::

        /* All memory operations before this call will
         * be globally visible before the clear_bit().
         */
        smp_mb__before_atomic();
        clear_bit( ... );

        /* The clear_bit() will be visible before all
         * subsequent memory operations.
         */
        smp_mb__after_atomic();

There are two special bitops with lock barrier semantics (acquire/release,
same as spinlocks).  These operate in the same way as their non-_lock/unlock
postfixed variants, except that they provide acquire/release semantics,
respectively.  This means they can be used for bit_spin_trylock and
bit_spin_unlock type operations without specifying any more barriers. ::

        int test_and_set_bit_lock(unsigned long nr, unsigned long *addr);
        void clear_bit_unlock(unsigned long nr, unsigned long *addr);
        void __clear_bit_unlock(unsigned long nr, unsigned long *addr);

The __clear_bit_unlock version is non-atomic, however it still implements
unlock barrier semantics.  This can be useful if the lock itself is protecting
the other bits in the word.
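As an illustration, a tiny open-coded bit spinlock can be built from these
two operations.  The OBJ_LOCK_BIT name and the flags field are hypothetical,
and real code would normally just use bit_spin_lock()/bit_spin_unlock()::

        /* Sketch only: OBJ_LOCK_BIT and obj->flags are hypothetical. */

        /* Acquire: test_and_set_bit_lock() orders the critical section
         * after the lock acquisition.
         */
        while (test_and_set_bit_lock(OBJ_LOCK_BIT, &obj->flags))
                cpu_relax();

        /* ... critical section protected by OBJ_LOCK_BIT ... */

        /* Release: clear_bit_unlock() orders the critical section
         * before the lock release.
         */
        clear_bit_unlock(OBJ_LOCK_BIT, &obj->flags);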
Finally, there are non-atomic versions of the bitmask operations
provided.  They are used in contexts where some other higher-level SMP
locking scheme is being used to protect the bitmask, and thus less
expensive non-atomic operations may be used in the implementation.
They have names similar to the above bitmask operation interfaces,
except that two underscores are prefixed to the interface name. ::

        void __set_bit(unsigned long nr, volatile unsigned long *addr);
        void __clear_bit(unsigned long nr, volatile unsigned long *addr);
        void __change_bit(unsigned long nr, volatile unsigned long *addr);
        int __test_and_set_bit(unsigned long nr, volatile unsigned long *addr);
        int __test_and_clear_bit(unsigned long nr, volatile unsigned long *addr);
        int __test_and_change_bit(unsigned long nr, volatile unsigned long *addr);

These non-atomic variants also do not require any special memory
barrier semantics.

The routines xchg() and cmpxchg() must provide the same exact
memory-barrier semantics as the atomic and bit operations returning
values.

.. note::

        If someone wants to use xchg(), cmpxchg() and their variants,
        linux/atomic.h should be included rather than asm/cmpxchg.h, unless the
        code is in arch/* and can take care of itself.

Spinlocks and rwlocks have memory barrier expectations as well.
The rule to follow is simple:

1) When acquiring a lock, the implementation must make it globally
   visible before any subsequent memory operation.

2) When releasing a lock, the implementation must make it such that
   all previous memory operations are globally visible before the
   lock release.
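As a sketch of what these two rules mean to an implementation, here is a toy
test-and-set lock built from the fully ordered xchg() primitive.  The
toy_lock type is purely illustrative::

        /* Sketch only: toy_lock is an illustrative type, not a kernel API. */
        struct toy_lock { unsigned int locked; };

        static inline void toy_lock_acquire(struct toy_lock *l)
        {
                /* xchg() is fully ordered, which satisfies rule 1:
                 * nothing after this call can become visible before
                 * the lock is seen as taken.
                 */
                while (xchg(&l->locked, 1))
                        cpu_relax();
        }

        static inline void toy_lock_release(struct toy_lock *l)
        {
                /* Rule 2: all prior memory operations must be globally
                 * visible before the lock word is cleared.
                 */
                smp_mb();
                WRITE_ONCE(l->locked, 0);
        }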
Which finally brings us to _atomic_dec_and_lock().  There is an
architecture-neutral version implemented in lib/dec_and_lock.c,
but most platforms will wish to optimize this in assembler. ::

        int _atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock);

Atomically decrement the given counter, and if it will drop to zero
atomically acquire the given spinlock and perform the decrement
of the counter to zero.  If it does not drop to zero, do nothing
with the spinlock.

It is actually pretty simple to get the memory barrier correct.
Simply satisfy the spinlock grab requirements, which is to make
sure the spinlock operation is globally visible before any
subsequent memory operation.

We can demonstrate this operation more clearly if we define
an abstract atomic operation::

        long cas(long *mem, long old, long new);

"cas" stands for "compare and swap".  It atomically:

1) Compares "old" with the value currently at "mem".
2) If they are equal, "new" is written to "mem".
3) Regardless, the current value at "mem" is returned.

As an example usage, here is what an atomic counter update
might look like::

        void example_atomic_inc(long *counter)
        {
                long old, new, ret;

                while (1) {
                        old = *counter;
                        new = old + 1;

                        ret = cas(counter, old, new);
                        if (ret == old)
                                break;
                }
        }

Let's use cas() in order to build a pseudo-C atomic_dec_and_lock()::

        int _atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock)
        {
                long old, new, ret;
                int went_to_zero;

                went_to_zero = 0;
                while (1) {
                        old = atomic_read(atomic);
                        new = old - 1;
                        if (new == 0) {
                                went_to_zero = 1;
                                spin_lock(lock);
                        }
                        ret = cas(atomic, old, new);
                        if (ret == old)
                                break;
                        if (went_to_zero) {
                                spin_unlock(lock);
                                went_to_zero = 0;
                        }
                }

                return went_to_zero;
        }

Now, as far as memory barriers go, as long as spin_lock()
strictly orders all subsequent memory operations (including
the cas()) with respect to itself, things will be fine.

Said another way, _atomic_dec_and_lock() must guarantee that
a counter dropping to zero is never made visible before the
spinlock being acquired.

.. note::

        This also means that for the case where the counter is not
        dropping to zero, there are no memory ordering requirements.