Lines Matching +full:cpu +full:- +full:read

19 documentation at tools/memory-model/.  Nevertheless, even this memory
37 Note also that it is possible that a barrier may be a no-op for an
48 - Device operations.
49 - Guarantees.
53 - Varieties of memory barrier.
54 - What may not be assumed about memory barriers?
55 - Data dependency barriers (historical).
56 - Control dependencies.
57 - SMP barrier pairing.
58 - Examples of memory barrier sequences.
59 - Read memory barriers vs load speculation.
60 - Multicopy atomicity.
64 - Compiler barrier.
65 - CPU memory barriers.
69 - Lock acquisition functions.
70 - Interrupt disabling functions.
71 - Sleep and wake-up functions.
72 - Miscellaneous functions.
74 (*) Inter-CPU acquiring barrier effects.
76 - Acquires vs memory accesses.
80 - Interprocessor interaction.
81 - Atomic operations.
82 - Accessing devices.
83 - Interrupts.
89 (*) The effects of the CPU cache.
91 - Cache coherency.
92 - Cache coherency vs DMA.
93 - Cache coherency vs MMIO.
97 - And then there's the Alpha.
98 - Virtual Machine Guests.
102 - Circular buffers.
116 +-------+ : +--------+ : +-------+
119 | CPU 1 |<----->| Memory |<----->| CPU 2 |
122 +-------+ : +--------+ : +-------+
127 | : +--------+ : |
130 +---------->| Device |<----------+
133 : +--------+ :
136 Each CPU executes a program that generates memory access operations. In the
137 abstract CPU, memory operation ordering is very relaxed, and a CPU may actually
144 CPU are perceived by the rest of the system as the operations cross the
145 interface between the CPU and rest of the system (the dotted lines).
150 CPU 1 CPU 2
159 STORE A=3, STORE B=4, y=LOAD A->3, x=LOAD B->4
160 STORE A=3, STORE B=4, x=LOAD B->4, y=LOAD A->3
161 STORE A=3, y=LOAD A->3, STORE B=4, x=LOAD B->4
162 STORE A=3, y=LOAD A->3, x=LOAD B->2, STORE B=4
163 STORE A=3, x=LOAD B->2, STORE B=4, y=LOAD A->3
164 STORE A=3, x=LOAD B->2, y=LOAD A->3, STORE B=4
165 STORE B=4, STORE A=3, y=LOAD A->3, x=LOAD B->4
177 Furthermore, the stores committed by a CPU to the memory system may not be
178 perceived by the loads made by another CPU in the same order as the stores were
184 CPU 1 CPU 2
191 the address retrieved from P by CPU 2. At the end of the sequence, any of the
198 Note that CPU 2 will never try to load C into D because the CPU will load P
203 -----------------
209 port register (D). To read internal register 5, the following code might then
221 the address _after_ attempting to read the register.
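The address/data port pattern described above can be sketched as a toy model in plain C. Everything here is hypothetical (a real driver would use readb()/writeb() on ioremapped addresses, whose accessors provide the required ordering); the model just shows why the data-port read must follow the address-port write:

```c
#include <assert.h>

/* Toy model of an indexed device: writing the register number to the
 * address port (A) selects which internal register the data port (D)
 * exposes.  If the two accesses were reordered, D would be read before
 * the right register had been selected. */
static unsigned char dev_regs[16];   /* the device's internal registers */
static unsigned char addr_port;      /* address port register (A) */

static unsigned char read_internal_reg(unsigned char reg)
{
    addr_port = reg;                 /* STORE *A = reg  (select register)  */
    return dev_regs[addr_port];      /* LOAD  *D        (must come second) */
}
```

In this single-threaded model the data dependency alone keeps the accesses ordered; on real hardware it is the I/O accessors (or explicit barriers) that do so.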
225 ----------
227 There are some minimal guarantees that may be expected of a CPU:
229 (*) On any given CPU, dependent memory accesses will be issued in order, with
234 the CPU will issue the following memory operations:
239 emits a memory-barrier instruction, so that a DEC Alpha CPU will
247 (*) Overlapping loads and stores within a particular CPU will appear to be
248 ordered within that CPU. This means that for:
252 the CPU will only issue the following sequence of memory operations:
260 the CPU will only issue:
310 And there are anti-guarantees:
313 generate code to modify these using non-atomic read-modify-write
320 non-atomic read-modify-write sequences can cause an update to one
327 "char", two-byte alignment for "short", four-byte alignment for
328 "int", and either four-byte or eight-byte alignment for "long",
329 on 32-bit and 64-bit systems, respectively. Note that these
331 using older pre-C11 compilers (for example, gcc 4.6). The portion
337 of adjacent bit-fields all having nonzero width
343 NOTE 2: A bit-field and an adjacent non-bit-field member
345 to two bit-fields, if one is declared inside a nested
347 are separated by a zero-length bit-field declaration,
348 or if they are separated by a non-bit-field member
350 bit-fields in the same structure if all members declared
351 between them are also bit-fields, no matter what the
352 sizes of those intervening bit-fields happen to be.
360 in random order, but this can be a problem for CPU-CPU interaction and for I/O.
362 CPU to restrict the order.
376 ---------------------------
390 A CPU can be viewed as committing a sequence of store operations to the
394 [!] Note that write barriers should normally be paired with read or data
400 A data dependency barrier is a weaker form of read barrier. In the case
412 committing sequences of stores to the memory system that the CPU being
413 considered can then perceive. A data dependency barrier issued by the CPU
415 load touches one of a sequence of stores from another CPU, then by the
427 a full read barrier or better is required. See the "Control dependencies"
434 (3) Read (or load) memory barriers.
436 A read barrier is a data dependency barrier plus a guarantee that all the
441 A read barrier is a partial ordering on loads only; it is not required to
444 Read memory barriers imply data dependency barriers, and so can substitute
447 [!] Note that read barriers should normally be paired with write barriers;
460 General memory barriers imply both read and write memory barriers, and so
468 This acts as a one-way permeable barrier. It guarantees that all memory
483 This also acts as a one-way permeable barrier. It guarantees that all
494 -not- guaranteed to act as a full memory barrier. However, after an
505 RELEASE variants in addition to fully-ordered and relaxed (no barrier
511 between two CPUs or between a CPU and a device. If it can be guaranteed that
522 ----------------------------------------------
528 instruction; the barrier can be considered to draw a line in that CPU's
531 (*) There is no guarantee that issuing a memory barrier on one CPU will have
532 any direct effect on another CPU or any other hardware in the system. The
533 indirect effect will be the order in which the second CPU sees the effects
534 of the first CPU's accesses occur, but see the next point:
536 (*) There is no guarantee that a CPU will see the correct order of effects
537 from a second CPU's accesses, even _if_ the second CPU uses a memory
538 barrier, unless the first CPU _also_ uses a matching memory barrier (see
541 (*) There is no guarantee that some intervening piece of off-the-CPU
542 hardware[*] will not reorder the memory accesses. CPU cache coherency
546 [*] For information on bus mastering DMA and coherency please read:
548 Documentation/driver-api/pci/pci.rst
549 Documentation/core-api/dma-api-howto.rst
550 Documentation/core-api/dma-api.rst
554 -------------------------------------
558 to this section are those working on DEC Alpha architecture-specific code
561 data-dependency barriers.
567 CPU 1 CPU 2
582 But! CPU 2's perception of P may be updated _before_ its perception of B, thus
594 CPU 1 CPU 2
610 even-numbered cache lines and the other bank processes odd-numbered cache
611 lines. The pointer P might be stored in an odd-numbered cache line, and the
612 variable B might be stored in an even-numbered cache line. Then, if the
613 even-numbered bank of the reading CPU's cache is extremely busy while the
614 odd-numbered bank is idle, one can see the new value of the pointer P (&B),
618 A data-dependency barrier is not required to order dependent writes
622 But please carefully read the "CONTROL DEPENDENCIES" section and the
626 CPU 1 CPU 2
635 Therefore, no data-dependency barrier is required to order the read into
637 even without a data-dependency barrier:
642 of dependency ordering is to -prevent- writes to the data structure, along
649 the CPU containing it. See the section on "Multicopy atomicity" for
663 --------------------
669 A load-load control dependency requires a full read memory barrier, not
680 dependency, but rather a control dependency that the CPU may short-circuit
687 <read barrier>
691 However, stores are not speculated. This means that ordering -is- provided
692 for load-store control dependencies, as in the following example:
707 variable 'a' is always non-zero, it would be well within its rights
712 b = 1; /* BUG: Compiler and CPU can both reorder!!! */
737 /* WRITE_ONCE(b, 1); -- moved up, BUG!!! */
740 /* WRITE_ONCE(b, 1); -- moved up, BUG!!! */
745 'b', which means that the CPU is within its rights to reorder them:
760 In contrast, without explicit memory barriers, two-legged-if control
796 Given this transformation, the CPU is not required to respect the ordering
817 You must also be careful not to rely too much on boolean short-circuit
832 out-guess your code. More generally, although READ_ONCE() does force
836 In addition, control dependencies apply only to the then-clause and
837 else-clause of the if-statement in question. In particular, it does
838 not necessarily apply to code following the if-statement:
846 WRITE_ONCE(c, 1); /* BUG: No ordering against the read from 'a'. */
852 conditional-move instructions, as in this fanciful pseudo-assembly
862 A weakly ordered CPU would have no dependency of any sort between the load
865 In short, control dependencies apply only to the stores in the then-clause
866 and else-clause of the if-statement in question (including functions
867 invoked by those two clauses), not to code following that if-statement.
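A minimal load-store control dependency in the READ_ONCE()/WRITE_ONCE() form the document recommends can be sketched as follows. The macro definitions are simplified stand-ins for the kernel's versions (see include/asm-generic/rwonce.h in recent kernels):

```c
#include <assert.h>

/* Simplified sketches of the kernel's READ_ONCE()/WRITE_ONCE(): the
 * volatile casts stop the compiler from hoisting, sinking, inventing,
 * or fusing the accesses. */
#define READ_ONCE(x)      (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v)  (*(volatile __typeof__(x) *)&(x) = (v))

static int a, b;

static void consume(void)
{
    int q = READ_ONCE(a);
    if (q)
        WRITE_ONCE(b, 1);   /* control-dependent on the load of 'a':
                             * stores are not speculated, so this store
                             * cannot be issued before 'a' is loaded */
}
```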
871 to the CPU containing it. See the section on "Multicopy atomicity"
878 However, they do -not- guarantee any other sort of ordering:
887 to carry out the stores. Please note that it is -not- sufficient
893 (*) Control dependencies require at least one run-time conditional
905 (*) Control dependencies apply only to the then-clause and else-clause
906 of the if-statement containing the control dependency, including
908 do -not- apply to code following the if-statement containing the
913 (*) Control dependencies do -not- provide multicopy atomicity. If you
921 -------------------
923 When dealing with CPU-CPU interactions, certain types of memory barrier should
931 a release barrier, a read barrier, or a general barrier. Similarly a
932 read barrier, control dependency, or a data dependency barrier pairs
936 CPU 1 CPU 2
941 <read barrier>
946 CPU 1 CPU 2
956 CPU 1 CPU 2
967 Basically, the read barrier always has to be there, even though it can be of
971 match the loads after the read barrier or the data dependency barrier, and vice
974 CPU 1 CPU 2
976 WRITE_ONCE(a, 1); }---- --->{ v = READ_ONCE(c);
978 <write barrier> \ <read barrier>
980 WRITE_ONCE(d, 4); }---- --->{ y = READ_ONCE(b);
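The write-barrier/read-barrier pairing above can be sketched in portable user-space C11, with release and acquire fences standing in for smp_wmb() and smp_rmb() (an illustrative analogue, not kernel code):

```c
#include <stdatomic.h>
#include <pthread.h>
#include <assert.h>

static int payload;               /* plain data published by the writer */
static atomic_int flag;           /* the "data is ready" signal */

static void *writer(void *arg)
{
    (void)arg;
    payload = 42;                                   /* STORE the data ...    */
    atomic_thread_fence(memory_order_release);      /* smp_wmb() stand-in    */
    atomic_store_explicit(&flag, 1, memory_order_relaxed);
    return NULL;
}

static void *reader(void *arg)
{
    (void)arg;
    while (!atomic_load_explicit(&flag, memory_order_relaxed))
        ;                                           /* wait for the flag     */
    atomic_thread_fence(memory_order_acquire);      /* smp_rmb() stand-in    */
    assert(payload == 42);   /* guaranteed: the two fences pair with each other */
    return NULL;
}
```

Drop either fence and the reader's assertion may fail on weakly ordered hardware, which is exactly the pairing requirement described above.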
984 ------------------------------------
989 CPU 1
1003 +-------+ : :
1004 | | +------+
1005 | |------>| C=3 | } /\
1006 | | : +------+ }----- \ -----> Events perceptible to
1008 | | : +------+ }
1009 | CPU 1 | : | B=2 | }
1010 | | +------+ }
1011 | | wwwwwwwwwwwwwwww } <--- At this point the write barrier
1012 | | +------+ } requires all stores prior to the
1014 | | : +------+ } further stores may take place
1015 | |------>| D=4 | }
1016 | | +------+
1017 +-------+ : :
1020 | memory system by CPU 1
1024 Secondly, data dependency barriers act as partial orderings on data-dependent
1027 CPU 1 CPU 2
1037 Without intervention, CPU 2 may perceive the events on CPU 1 in some
1038 effectively random order, despite the write barrier issued by CPU 1:
1040 +-------+ : : : :
1041 | | +------+ +-------+ | Sequence of update
1042 | |------>| B=2 |----- --->| Y->8 | | of perception on
1043 | | : +------+ \ +-------+ | CPU 2
1044 | CPU 1 | : | A=1 | \ --->| C->&Y | V
1045 | | +------+ | +-------+
1047 | | +------+ | : :
1048 | | : | C=&B |--- | : : +-------+
1049 | | : +------+ \ | +-------+ | |
1050 | |------>| D=4 | ----------->| C->&B |------>| |
1051 | | +------+ | +-------+ | |
1052 +-------+ : : | : : | |
1054 | : : | CPU 2 |
1055 | +-------+ | |
1056 Apparently incorrect ---> | | B->7 |------>| |
1057 perception of B (!) | +-------+ | |
1059 | +-------+ | |
1060 The load of X holds ---> \ | X->9 |------>| |
1061 up the maintenance \ +-------+ | |
1062 of coherence of B ----->| B->2 | +-------+
1063 +-------+
1067 In the above example, CPU 2 perceives that B is 7, despite the load of *C
1071 and the load of *C (ie: B) on CPU 2:
1073 CPU 1 CPU 2
1086 +-------+ : : : :
1087 | | +------+ +-------+
1088 | |------>| B=2 |----- --->| Y->8 |
1089 | | : +------+ \ +-------+
1090 | CPU 1 | : | A=1 | \ --->| C->&Y |
1091 | | +------+ | +-------+
1093 | | +------+ | : :
1094 | | : | C=&B |--- | : : +-------+
1095 | | : +------+ \ | +-------+ | |
1096 | |------>| D=4 | ----------->| C->&B |------>| |
1097 | | +------+ | +-------+ | |
1098 +-------+ : : | : : | |
1100 | : : | CPU 2 |
1101 | +-------+ | |
1102 | | X->9 |------>| |
1103 | +-------+ | |
1104 Makes sure all effects ---> \ ddddddddddddddddd | |
1105 prior to the store of C \ +-------+ | |
1106 are perceptible to ----->| B->2 |------>| |
1107 subsequent loads +-------+ | |
1108 : : +-------+
1111 And thirdly, a read barrier acts as a partial order on loads. Consider the
1114 CPU 1 CPU 2
1123 Without intervention, CPU 2 may then choose to perceive the events on CPU 1 in
1124 some effectively random order, despite the write barrier issued by CPU 1:
1126 +-------+ : : : :
1127 | | +------+ +-------+
1128 | |------>| A=1 |------ --->| A->0 |
1129 | | +------+ \ +-------+
1130 | CPU 1 | wwwwwwwwwwwwwwww \ --->| B->9 |
1131 | | +------+ | +-------+
1132 | |------>| B=2 |--- | : :
1133 | | +------+ \ | : : +-------+
1134 +-------+ : : \ | +-------+ | |
1135 ---------->| B->2 |------>| |
1136 | +-------+ | CPU 2 |
1137 | | A->0 |------>| |
1138 | +-------+ | |
1139 | : : +-------+
1141 \ +-------+
1142 ---->| A->1 |
1143 +-------+
1147 If, however, a read barrier were to be placed between the load of B and the
1148 load of A on CPU 2:
1150 CPU 1 CPU 2
1157 <read barrier>
1160 then the partial ordering imposed by CPU 1 will be perceived correctly by CPU
1163 +-------+ : : : :
1164 | | +------+ +-------+
1165 | |------>| A=1 |------ --->| A->0 |
1166 | | +------+ \ +-------+
1167 | CPU 1 | wwwwwwwwwwwwwwww \ --->| B->9 |
1168 | | +------+ | +-------+
1169 | |------>| B=2 |--- | : :
1170 | | +------+ \ | : : +-------+
1171 +-------+ : : \ | +-------+ | |
1172 ---------->| B->2 |------>| |
1173 | +-------+ | CPU 2 |
1176 At this point the read ----> \ rrrrrrrrrrrrrrrrr | |
1177 barrier causes all effects \ +-------+ | |
1178 prior to the storage of B ---->| A->1 |------>| |
1179 to be perceptible to CPU 2 +-------+ | |
1180 : : +-------+
1184 contained a load of A either side of the read barrier:
1186 CPU 1 CPU 2
1194 <read barrier>
1200 +-------+ : : : :
1201 | | +------+ +-------+
1202 | |------>| A=1 |------ --->| A->0 |
1203 | | +------+ \ +-------+
1204 | CPU 1 | wwwwwwwwwwwwwwww \ --->| B->9 |
1205 | | +------+ | +-------+
1206 | |------>| B=2 |--- | : :
1207 | | +------+ \ | : : +-------+
1208 +-------+ : : \ | +-------+ | |
1209 ---------->| B->2 |------>| |
1210 | +-------+ | CPU 2 |
1213 | +-------+ | |
1214 | | A->0 |------>| 1st |
1215 | +-------+ | |
1216 At this point the read ----> \ rrrrrrrrrrrrrrrrr | |
1217 barrier causes all effects \ +-------+ | |
1218 prior to the storage of B ---->| A->1 |------>| 2nd |
1219 to be perceptible to CPU 2 +-------+ | |
1220 : : +-------+
1223 But it may be that the update to A from CPU 1 becomes perceptible to CPU 2
1224 before the read barrier completes anyway:
1226 +-------+ : : : :
1227 | | +------+ +-------+
1228 | |------>| A=1 |------ --->| A->0 |
1229 | | +------+ \ +-------+
1230 | CPU 1 | wwwwwwwwwwwwwwww \ --->| B->9 |
1231 | | +------+ | +-------+
1232 | |------>| B=2 |--- | : :
1233 | | +------+ \ | : : +-------+
1234 +-------+ : : \ | +-------+ | |
1235 ---------->| B->2 |------>| |
1236 | +-------+ | CPU 2 |
1239 \ +-------+ | |
1240 ---->| A->1 |------>| 1st |
1241 +-------+ | |
1243 +-------+ | |
1244 | A->1 |------>| 2nd |
1245 +-------+ | |
1246 : : +-------+
1254 READ MEMORY BARRIERS VS LOAD SPECULATION
1255 ----------------------------------------
1259 other loads, and so do the load in advance - even though they haven't actually
1261 actual load instruction to potentially complete immediately because the CPU
1264 It may turn out that the CPU didn't actually need the value - perhaps because a
1265 branch circumvented the load - in which case it can discard the value or just
1270 CPU 1 CPU 2
1279 : : +-------+
1280 +-------+ | |
1281 --->| B->2 |------>| |
1282 +-------+ | CPU 2 |
1284 +-------+ | |
1285 The CPU being busy doing a ---> --->| A->0 |~~~~ | |
1286 division speculates on the +-------+ ~ | |
1290 Once the divisions are complete --> : : ~-->| |
1291 the CPU can then perform the : : | |
1292 LOAD with immediate effect : : +-------+
1295 Placing a read barrier or a data dependency barrier just before the second
1298 CPU 1 CPU 2
1303 <read barrier>
1310 : : +-------+
1311 +-------+ | |
1312 --->| B->2 |------>| |
1313 +-------+ | CPU 2 |
1315 +-------+ | |
1316 The CPU being busy doing a ---> --->| A->0 |~~~~ | |
1317 division speculates on the +-------+ ~ | |
1324 : : ~-->| |
1326 : : +-------+
1329 but if there was an update or an invalidation from another CPU pending, then
1332 : : +-------+
1333 +-------+ | |
1334 --->| B->2 |------>| |
1335 +-------+ | CPU 2 |
1337 +-------+ | |
1338 The CPU being busy doing a ---> --->| A->0 |~~~~ | |
1339 division speculates on the +-------+ ~ | |
1345 +-------+ | |
1346 The speculation is discarded ---> --->| A->1 |------>| |
1347 and an updated value is +-------+ | |
1348 retrieved : : +-------+
1352 --------------------
1361 time to all -other- CPUs. The remainder of this document discusses this
1366 CPU 1 CPU 2 CPU 3
1370 <general barrier> <read barrier>
1373 Suppose that CPU 2's load from X returns 1, which it then stores to Y,
1374 and CPU 3's load from Y returns 1. This indicates that CPU 1's store
1375 to X precedes CPU 2's load from X and that CPU 2's store to Y precedes
1376 CPU 3's load from Y. In addition, the memory barriers guarantee that
1377 CPU 2 executes its load before its store, and CPU 3 loads from Y before
1378 it loads from X. The question is then "Can CPU 3's load from X return 0?"
1380 Because CPU 3's load from X in some sense comes after CPU 2's load, it
1381 is natural to expect that CPU 3's load from X must therefore return 1.
1383 on CPU B follows a load from the same variable executing on CPU A (and
1384 CPU A did not originally store the value which it read), then on
1385 multicopy-atomic systems, CPU B's load must return either the same value
1386 that CPU A's load did or some later value. However, the Linux kernel
1390 for any lack of multicopy atomicity. In the example, if CPU 2's load
1391 from X returns 1 and CPU 3's load from Y returns 1, then CPU 3's load
1394 However, dependencies, read barriers, and write barriers are not always
1395 able to compensate for non-multicopy atomicity. For example, suppose
1396 that CPU 2's general barrier is removed from the above example, leaving
1399 CPU 1 CPU 2 CPU 3
1403 <data dependency> <read barrier>
1406 This substitution allows non-multicopy atomicity to run rampant: in
1407 this example, it is perfectly legal for CPU 2's load from X to return 1,
1408 CPU 3's load from Y to return 1, and its load from X to return 0.
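The three-CPU example can be written as a user-space litmus test, with C11 seq_cst fences playing the role of the general barrier (a sketch only; the kernel's litmus tests live under tools/memory-model/). With the full barrier in place on CPU 2, the 1/1/0 outcome is forbidden:

```c
#include <stdatomic.h>
#include <pthread.h>

static atomic_int X, Y;
static int r2, r3y, r3x;          /* values observed by "CPUs" 2 and 3 */

static void *cpu1(void *a) { (void)a; atomic_store(&X, 1); return NULL; }

static void *cpu2(void *a)
{
    (void)a;
    r2 = atomic_load(&X);                         /* r1 = LOAD X        */
    atomic_thread_fence(memory_order_seq_cst);    /* <general barrier>  */
    atomic_store(&Y, r2);                         /* STORE Y = r1       */
    return NULL;
}

static void *cpu3(void *a)
{
    (void)a;
    r3y = atomic_load(&Y);                        /* r2 = LOAD Y        */
    atomic_thread_fence(memory_order_seq_cst);    /* read barrier or stronger */
    r3x = atomic_load(&X);                        /* r3 = LOAD X        */
    return NULL;
}
```

Weakening cpu2()'s fence to a mere data dependency would permit the 1/1/0 result on non-multicopy-atomic hardware, as the text explains.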
1410 The key point is that although CPU 2's data dependency orders its load
1411 and store, it does not guarantee to order CPU 1's store. Thus, if this
1412 example runs on a non-multicopy-atomic system where CPUs 1 and 2 share a
1413 store buffer or a level of cache, CPU 2 might have early access to CPU 1's
1417 General barriers can compensate not only for non-multicopy atomicity,
1418 but can also generate additional ordering that can ensure that -all-
1419 CPUs will perceive the same order of -all- operations. In contrast, a
1420 chain of release-acquire pairs does not provide this additional ordering,
1461 Furthermore, because of the release-acquire relationship between cpu0()
1467 However, the ordering provided by a release-acquire chain is local
1478 writes in order, CPUs not involved in the release-acquire chain might
1480 the weak memory-barrier instructions used to implement smp_load_acquire()
1483 store to u as happening -after- cpu1()'s load from v, even though
1489 -not- ensure that any particular value will be read. Therefore, the
1510 (*) CPU memory barriers.
1514 ----------------
1521 This is a general barrier -- there are no read-read or write-write
1531 interrupt-handler code and the code that was interrupted.
1537 optimizations that, while perfectly safe in single-threaded code, can
1542 to the same variable, and in some cases, the CPU is within its
1550 Prevent both the compiler and the CPU from doing this as follows:
1566 for single-threaded code, is almost certainly not what the developer
1587 single-threaded code, but can be fatal in concurrent code:
1594 a was modified by some other CPU between the "while" statement and
1605 single-threaded code, so you need to tell the compiler about cases
1619 This transformation is a win for single-threaded code because it
1621 will carry out its proof assuming that the current CPU is the only
1638 the code into near-nonexistence. (It will still load from the
1643 Again, the compiler assumes that the current CPU is the only one
1654 surprise if some other CPU might have stored to variable 'a' in the
1666 between process-level code and an interrupt handler:
1682 win for single-threaded code:
1727 though the CPU of course need not do so.
1743 In single-threaded code, this is not only safe, but also saves
1745 could cause some other CPU to see a spurious value of 42 -- even
1746 if variable 'a' was never zero -- when loading variable 'b'.
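The document's cure for such invented stores is WRITE_ONCE(); in portable C11 a relaxed atomic store gives the same "no invented stores" guarantee. A sketch of the if/else example (the function and variable names are illustrative):

```c
#include <stdatomic.h>

static atomic_int b;

/* With a plain variable, the compiler could legally transform this into
 * "b = 42; if (cond) b = val;", letting another thread briefly observe a
 * spurious 42.  Atomic (or WRITE_ONCE-style volatile) stores may not be
 * invented or duplicated, so only the chosen value is ever stored. */
static void set_b(int cond, int val)
{
    if (cond)
        atomic_store_explicit(&b, val, memory_order_relaxed);
    else
        atomic_store_explicit(&b, 42, memory_order_relaxed);
}
```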
1755 damaging, but they can result in cache-line bouncing and thus in
1760 with a single memory-reference instruction, prevents "load tearing"
1763 16-bit store instructions with 7-bit immediate fields, the compiler
1764 might be tempted to use two 16-bit store-immediate instructions to
1765 implement the following 32-bit store:
1772 This optimization can therefore be a win in single-threaded code.
1796 implement these three assignment statements as a pair of 32-bit
1797 loads followed by a pair of 32-bit stores. This would result in
1812 Please note that these compiler barriers have no direct effect on the CPU,
1816 CPU MEMORY BARRIERS
1817 -------------------
1819 The Linux kernel has eight basic CPU memory barriers:
1825 READ rmb() smp_rmb()
1843 systems because it is assumed that a CPU will appear to be self-consistent,
1854 windows. These barriers are required even on non-SMP systems as they affect
1856 compiler and the CPU from reordering them.
1885 obj->dead = 1;
1887 atomic_dec(&obj->ref_count);
1899 of writes or reads of shared memory accessible to both the CPU and a
1904 to the device or the CPU, and a doorbell to notify it when new
1907 if (desc->status != DEVICE_OWN) {
1908 /* do not read data until we own descriptor */
1911 /* read/modify data */
1912 read_data = desc->data;
1913 desc->data = write_data;
1919 desc->status = DEVICE_OWN;
1926 before we read the data from the descriptor, and the dma_wmb() allows
1935 relaxed I/O accessors and the Documentation/core-api/dma-api.rst file for
1944 For example, after a non-temporal write to pmem region, we use pmem_wmb()
1950 For load from persistent memory, existing read memory barriers are sufficient
1951 to ensure read ordering.
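The descriptor ownership hand-off described earlier in this section can be modelled in user-space C11, with acquire and release fences as stand-ins for dma_rmb() and dma_wmb() (the ownership states and function are hypothetical, for illustration only):

```c
#include <stdatomic.h>

enum { CPU_OWN, DEVICE_OWN };     /* hypothetical ownership states */

struct descriptor {
    atomic_int status;            /* who owns the descriptor right now */
    int data;                     /* payload shared with the device */
};

/* One pass over a descriptor.  The acquire fence keeps the data read
 * after the ownership check; the release fence publishes the new data
 * before ownership is handed back.  Returns the data read, or -1 if
 * the device still owns the descriptor. */
static int process(struct descriptor *desc, int write_data)
{
    if (atomic_load_explicit(&desc->status, memory_order_relaxed) == DEVICE_OWN)
        return -1;                                /* do not read data we don't own */

    atomic_thread_fence(memory_order_acquire);    /* dma_rmb() stand-in */
    int read_data = desc->data;                   /* read/modify data */
    desc->data = write_data;
    atomic_thread_fence(memory_order_release);    /* dma_wmb() stand-in */
    atomic_store_explicit(&desc->status, DEVICE_OWN, memory_order_relaxed);
    return read_data;
}
```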
1966 --------------------------
2013 one-way barriers is that the effects of instructions outside of a critical
2033 another CPU not holding that lock. In short, an ACQUIRE followed by a
2034 RELEASE may -not- be assumed to be a full memory barrier.
2037 not imply a full memory barrier. Therefore, the CPU's execution of the
2056 One key point is that we are only talking about the CPU doing
2059 -could- occur.
2061 But suppose the CPU reordered the operations. In this case,
2062 the unlock precedes the lock in the assembly code. The CPU
2065 try to sleep, but more on that later). The CPU will eventually
2074 a sleep-unlock race, but the locking primitive needs to resolve
2079 anything at all - especially with respect to I/O accesses - unless combined
2082 See also the section on "Inter-CPU acquiring barrier effects".
2112 -----------------------------
2120 SLEEP AND WAKE-UP FUNCTIONS
2121 ---------------------------
2142 CPU 1
2146 STORE current->state
2185 CPU 1 (Sleeper) CPU 2 (Waker)
2189 STORE current->state ...
2191 LOAD event_indicated if ((LOAD task->state) & TASK_NORMAL)
2192 STORE task->state
2194 where "task" is the thread being woken up and it equals CPU 1's "current".
2201 CPU 1 CPU 2
2237 order multiple stores before the wake-up with respect to loads of those stored
2273 -----------------------
2281 INTER-CPU ACQUIRING BARRIER EFFECTS
2290 ---------------------------
2295 CPU 1 CPU 2
2304 Then there is no guarantee as to what order CPU 3 will see the accesses to *A
2323 be a problem as a single-threaded linear piece of code will still appear to
2337 --------------------------
2339 When there's a system with more than one processor, more than one CPU in the
2364 (1) read the next pointer from this waiter's record to know where the
2367 (2) read the pointer to the waiter's task structure;
2377 LOAD waiter->list.next;
2378 LOAD waiter->task;
2379 STORE waiter->task;
2389 if the task pointer is cleared _before_ the next pointer in the list is read,
2390 another CPU might start processing the waiter and might clobber the waiter's
2391 stack before the up*() function has a chance to read the next pointer.
2395 CPU 1 CPU 2
2401 LOAD waiter->task;
2402 STORE waiter->task;
2410 LOAD waiter->list.next;
2411 --- OOPS ---
2418 LOAD waiter->list.next;
2419 LOAD waiter->task;
2421 STORE waiter->task;
2431 On a UP system - where this wouldn't be a problem - the smp_mb() is just a
2433 right order without actually intervening in the CPU. Since there's only one
2434 CPU, that CPU's dependency ordering logic will take care of everything else.
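The corrected up*() sequence above can be sketched in user-space C11, with a seq_cst fence playing the role of smp_mb() (the record layout and function are hypothetical):

```c
#include <stdatomic.h>
#include <stddef.h>

/* Once 'task' is cleared, the woken thread may free its stack (and with
 * it this waiter record) at any moment, so the next pointer must be
 * fetched before the wake-up store is made visible. */
struct waiter {
    struct waiter *next;          /* following entry in the wait list */
    atomic_int task;              /* non-zero while the owner still sleeps */
};

static struct waiter *wake_one(struct waiter *w)
{
    struct waiter *next = w->next;              /* LOAD waiter->list.next */
    int task = atomic_load(&w->task);           /* LOAD waiter->task      */
    (void)task;                                 /* ... would be used to wake it */
    atomic_thread_fence(memory_order_seq_cst);  /* smp_mb()               */
    atomic_store(&w->task, 0);                  /* STORE waiter->task: wake-up */
    return next;                                /* safe: read before the wake-up */
}
```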
2438 -----------------
2449 -----------------
2451 Many devices can be memory mapped, and so appear to the CPU as if they're just
2455 However, having a clever CPU or a clever compiler creates a potential problem
2457 device in the requisite order if the CPU or the compiler thinks it is more
2458 efficient to reorder, combine or merge accesses - something that would cause
2462 routines - such as inb() or writel() - which know how to make such accesses
2468 See Documentation/driver-api/device-io.rst for more information.
2472 ----------
2478 This may be alleviated - at least in part - by disabling local interrupts (a
2480 the interrupt-disabled section in the driver. While the driver's interrupt
2481 routine is executing, the driver's core may not run on the same CPU, and its
2487 under interrupt-disablement and then the driver's interrupt handler is invoked:
2506 accesses performed in an interrupt - and vice versa - unless implicit or
2516 likely, then interrupt-disabling locks should be used to guarantee ordering.
2524 specific. Therefore, drivers which are inherently non-portable may rely on
2540 by the same CPU thread to a particular device will arrive in program
2543 2. A writeX() issued by a CPU thread holding a spinlock is ordered
2544 before a writeX() to the same peripheral from another CPU thread
2550 3. A writeX() by a CPU thread to the peripheral will first wait for the
2552 propagated to, the same thread. This ensures that writes by the CPU
2554 visible to a DMA engine when the CPU writes to its MMIO control
2557 4. A readX() by a CPU thread from the peripheral will complete before
2559 ensures that reads by the CPU from an incoming DMA buffer allocated
2564 5. A readX() by a CPU thread from the peripheral will complete before
2566 This ensures that two MMIO register writes by the CPU to a peripheral
2567 will arrive at least 1us apart if the first write is immediately read
2576 The ordering properties of __iomem pointers obtained with non-default
2586 bullets 2-5 above) but they are still guaranteed to be ordered with
2587 respect to other accesses from the same CPU thread to the same
2594 register-based, memory-mapped FIFOs residing on peripherals that are not
2600 The inX() and outX() accessors are intended to access legacy port-mapped
2605 Since many CPU architectures ultimately access these peripherals via an
2611 Device drivers may expect outX() to emit a non-posted write transaction
2629 little-endian and will therefore perform byte-swapping operations on big-endian
2637 It has to be assumed that the conceptual CPU is weakly-ordered but that it will
2641 of arch-specific code.
2643 This means that it must be considered that the CPU will execute its instruction
2644 stream in any order it feels like - or even in parallel - provided that if an
2650 [*] Some instructions have more than one effect - such as changing the
2651 condition codes, changing registers or changing memory - and different
2654 A CPU may also discard any instruction sequence that winds up having no
2665 THE EFFECTS OF THE CPU CACHE
2672 As far as the way a CPU interacts with another part of the system through the
2673 caches goes, the memory system has to include the CPU's caches, and memory
2674 barriers for the most part act at the interface between the CPU and its cache
2677 <--- CPU ---> : <----------- Memory ----------->
2679 +--------+ +--------+ : +--------+ +-----------+
2680 | | | | : | | | | +--------+
2681 | CPU | | Memory | : | CPU | | | | |
2682 | Core |--->| Access |----->| Cache |<-->| | | |
2683 | | | Queue | : | | | |--->| Memory |
2685 +--------+ +--------+ : +--------+ | | | |
2686 : | Cache | +--------+
2688 : | Mechanism | +--------+
2689 +--------+ +--------+ : +--------+ | | | |
2691 | CPU | | Memory | : | CPU | | |--->| Device |
2692 | Core |--->| Access |----->| Cache |<-->| | | |
2694 | | | | : | | | | +--------+
2695 +--------+ +--------+ : +--------+ +-----------+
2700 CPU that issued it since it may have been satisfied within the CPU's own cache,
2703 cacheline over to the accessing CPU and propagate the effects upon conflict.
2705 The CPU core may execute instructions in any order it deems fit, provided the
2713 accesses cross from the CPU side of things to the memory side of things, and
2717 [!] Memory barriers are _not_ needed within a given CPU, as CPUs always see
2722 the use of any special device communication instructions the CPU may have.
2726 ----------------------
2732 the kernel must flush the overlapping bits of cache on each CPU (and maybe
2736 cache lines being written back to RAM from a CPU's cache after the device has
2737 installed its own data, or cache lines present in the CPU's cache may simply
2739 is discarded from the CPU's cache and reloaded. To deal with this, the
2741 cache on each CPU.
2743 See Documentation/core-api/cachetlb.rst for more information on cache management.
2747 -----------------------
2750 a window in the CPU's memory space that has different properties assigned than
2765 A programmer might take it for granted that the CPU will perform memory
2766 operations in exactly the order specified, so that if the CPU is, for example,
2775 they would then expect that the CPU will complete the memory operation for each
2796 of the CPU buses and caches;
2803 (*) the CPU's data cache may affect the ordering, and while cache-coherency
2804 mechanisms may alleviate this - once the store has actually hit the cache
2805 - there's no guarantee that the coherency management will be propagated in
2808 So what another CPU, say, might actually observe from the above piece of code
2816 However, it is guaranteed that a CPU will be self-consistent: it will see its
2835 The code above may cause the CPU to generate the full sequence of memory
2843 are -not- optional in the above example, as there are architectures
2844 where a given CPU might reorder successive loads to the same location.
2851 the CPU even sees them.
2874 and the LOAD operation never appear outside of the CPU.
2878 --------------------------
2880 The DEC Alpha CPU is one of the most relaxed CPUs there is. Not only that,
2881 some versions of the Alpha CPU have a split data cache, permitting them to have
2882 two semantically-related cache lines updated at separate times. This is where
2893 ----------------------
2898 barriers for this use-case would be possible but often suboptimal.
2900 To handle this case optimally, low-level virt_mb() etc macros are available.
2902 identical code for SMP and non-SMP systems. For example, virtual machine guests
2916 ----------------
2921 Documentation/core-api/circular-buffers.rst
2935 Chapter 5.6: Read/Write Ordering
2938 Chapter 7.1: Memory-Access Ordering
2941 ARM Architecture Reference Manual (ARMv8, for ARMv8-A architecture profile)
2944 IA-32 Intel Architecture Software Developer's Manual, Volume 3:
2959 Chapter 15: Sparc-V9 Memory Models
2975 Solaris Internals, Core Kernel Architecture, p63-68: