<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        "http://www.w3.org/TR/html4/loose.dtd">
        <html>
        <head><title>A Tour Through TREE_RCU's Grace-Period Memory Ordering</title>
        <meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
        </head><body>

           <p>August 8, 2017</p>
           <p>This article was contributed by Paul E.&nbsp;McKenney</p>

<h3>Introduction</h3>

<p>This document gives a rough visual overview of how Tree RCU's
grace-period memory ordering guarantee is provided.

<ol>
<li>	<a href="#What Is Tree RCU's Grace Period Memory Ordering Guarantee?">
	What Is Tree RCU's Grace Period Memory Ordering Guarantee?</a>
<li>	<a href="#Tree RCU Grace Period Memory Ordering Building Blocks">
	Tree RCU Grace Period Memory Ordering Building Blocks</a>
<li>	<a href="#Tree RCU Grace Period Memory Ordering Components">
	Tree RCU Grace Period Memory Ordering Components</a>
<li>	<a href="#Putting It All Together">Putting It All Together</a>
</ol>

<h3><a name="What Is Tree RCU's Grace Period Memory Ordering Guarantee?">
What Is Tree RCU's Grace Period Memory Ordering Guarantee?</a></h3>

<p>RCU grace periods provide extremely strong memory-ordering guarantees
for non-idle non-offline code.
Any code that happens after the end of a given RCU grace period is guaranteed
to see the effects of all accesses prior to the beginning of that grace
period that are within RCU read-side critical sections.
Similarly, any code that happens before the beginning of a given RCU grace
period is guaranteed to not see the effects of all accesses following the end
of that grace period that are within RCU read-side critical sections.

<p>This guarantee is particularly pervasive for <tt>synchronize_sched()</tt>,
for which RCU-sched read-side critical sections include any region
of code for which preemption is disabled.
Given that each individual machine instruction can be thought of as
an extremely small region of preemption-disabled code, one can think of
<tt>synchronize_sched()</tt> as <tt>smp_mb()</tt> on steroids.

<p>RCU updaters use this guarantee by splitting their updates into
two phases, one of which is executed before the grace period and
the other of which is executed after the grace period.
In the most common use case, phase one removes an element from
a linked RCU-protected data structure, and phase two frees that element.
For this to work, any readers that have witnessed state prior to the
phase-one update (in the common case, removal) must not witness state
following the phase-two update (in the common case, freeing).
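
<p>For example, the following sketch shows this two-phase pattern for
a list element (standard RCU list usage; the element <tt>p</tt> and
the lock <tt>mylock</tt> are assumed for illustration):

<pre>
/* Phase one: unlink the element so no new readers can find it. */
spin_lock(&amp;mylock);
list_del_rcu(&amp;p-&gt;list);
spin_unlock(&amp;mylock);

/* Wait for a grace period, so that all pre-existing readers complete. */
synchronize_rcu();

/* Phase two: no reader can still hold a reference, so free it. */
kfree(p);
</pre>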

<p>The RCU implementation provides this guarantee using a network
of lock-based critical sections, memory barriers, and per-CPU
processing, as is described in the following sections.

<h3><a name="Tree RCU Grace Period Memory Ordering Building Blocks">
Tree RCU Grace Period Memory Ordering Building Blocks</a></h3>

<p>The workhorse for RCU's grace-period memory ordering is the
critical section for the <tt>rcu_node</tt> structure's
<tt>-&gt;lock</tt>.
These critical sections use helper functions for lock acquisition, including
<tt>raw_spin_lock_rcu_node()</tt>,
<tt>raw_spin_lock_irq_rcu_node()</tt>, and
<tt>raw_spin_lock_irqsave_rcu_node()</tt>.
Their lock-release counterparts are
<tt>raw_spin_unlock_rcu_node()</tt>,
<tt>raw_spin_unlock_irq_rcu_node()</tt>, and
<tt>raw_spin_unlock_irqrestore_rcu_node()</tt>,
respectively.
For completeness, a
<tt>raw_spin_trylock_rcu_node()</tt>
is also provided.
The key point is that the lock-acquisition functions, including
<tt>raw_spin_trylock_rcu_node()</tt>, all invoke
<tt>smp_mb__after_unlock_lock()</tt> immediately after successful
acquisition of the lock.
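
<p>A minimal sketch of one such acquisition helper, matching the
description above (the in-kernel macro differs in detail, for example
in exactly how it accesses the lock field):

<pre>
/* Sketch: acquire the rcu_node structure's -&gt;lock, then promote
 * the prior unlock plus this lock to a full memory barrier. */
#define raw_spin_lock_rcu_node(p)				\
do {								\
	raw_spin_lock(&amp;(p)-&gt;lock);			\
	smp_mb__after_unlock_lock();				\
} while (0)
</pre>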

<p>Therefore, for any given <tt>rcu_node</tt> structure, any access
happening before one of the above lock-release functions will be seen
by all CPUs as happening before any access happening after a later
one of the above lock-acquisition functions.
Furthermore, any access happening before one of the
above lock-release functions on any given CPU will be seen by all
CPUs as happening before any access happening after a later one
of the above lock-acquisition functions executing on that same CPU,
even if the lock-release and lock-acquisition functions are operating
on different <tt>rcu_node</tt> structures.
Tree RCU uses these two ordering guarantees to form an ordering
network among all CPUs that were in any way involved in the grace
period, including any CPUs that came online or went offline during
the grace period in question.

<p>The following litmus test exhibits the ordering effects of these
lock-acquisition and lock-release functions:
<pre>
int x, y, z;

void task0(void)
{
  raw_spin_lock_rcu_node(rnp);
  WRITE_ONCE(x, 1);
  r1 = READ_ONCE(y);
  raw_spin_unlock_rcu_node(rnp);
}

void task1(void)
{
  raw_spin_lock_rcu_node(rnp);
  WRITE_ONCE(y, 1);
  r2 = READ_ONCE(z);
  raw_spin_unlock_rcu_node(rnp);
}

void task2(void)
{
  WRITE_ONCE(z, 1);
  smp_mb();
  r3 = READ_ONCE(x);
}

WARN_ON(r1 == 0 &amp;&amp; r2 == 0 &amp;&amp; r3 == 0);
</pre>

<p>The <tt>WARN_ON()</tt> is evaluated at &ldquo;the end of time&rdquo;,
after all changes have propagated throughout the system.
Without the <tt>smp_mb__after_unlock_lock()</tt> provided by the
acquisition functions, this <tt>WARN_ON()</tt> could trigger, for example,
on PowerPC.
The <tt>smp_mb__after_unlock_lock()</tt> invocations prevent this
<tt>WARN_ON()</tt> from triggering.

<p>This approach must be extended to include idle CPUs, which need
RCU's grace-period memory ordering guarantee to extend to any
RCU read-side critical sections preceding and following the current
idle sojourn.
This case is handled by calls to the strongly ordered
<tt>atomic_add_return()</tt> read-modify-write atomic operation that
is invoked within <tt>rcu_dynticks_eqs_enter()</tt> at idle-entry
time and within <tt>rcu_dynticks_eqs_exit()</tt> at idle-exit time.
The grace-period kthread invokes <tt>rcu_dynticks_snap()</tt> and
<tt>rcu_dynticks_in_eqs_since()</tt> (both of which invoke
an <tt>atomic_add_return()</tt> of zero) to detect idle CPUs.
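
<p>A minimal sketch of the idle-entry side, assuming a per-CPU
<tt>-&gt;dynticks</tt> counter and an increment of one (the actual
functions manipulate additional state and include debug checks):

<pre>
/* Sketch: the value-returning atomic implies full ordering, so all
 * pre-idle accesses are ordered before the grace-period machinery's
 * later sampling of -&gt;dynticks. */
static void rcu_dynticks_eqs_enter(void)
{
	struct rcu_dynticks *rdtp = this_cpu_ptr(&amp;rcu_dynticks);

	atomic_add_return(1, &amp;rdtp-&gt;dynticks); /* Full smp_mb() on both sides. */
}
</pre>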

<table>
<tr><th>&nbsp;</th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
	But what about CPUs that remain offline for the entire
	grace period?
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
	Such CPUs will be offline at the beginning of the grace period,
	so the grace period won't expect quiescent states from them.
	Races between grace-period start and CPU-hotplug operations
	are mediated by the CPU's leaf <tt>rcu_node</tt> structure's
	<tt>-&gt;lock</tt> as described above.
</font></td></tr>
<tr><td>&nbsp;</td></tr>
</table>

<p>The approach must be extended to handle one final case, that
of waking a task blocked in <tt>synchronize_rcu()</tt>.
This task might be bound to a CPU that is not yet aware that
the grace period has ended, and thus might not yet be subject to
the grace period's memory ordering.
Therefore, there is an <tt>smp_mb()</tt> after the return from
<tt>wait_for_completion()</tt> in the <tt>synchronize_rcu()</tt>
code path.

<table>
<tr><th>&nbsp;</th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
	What?  Where???
	I don't see any <tt>smp_mb()</tt> after the return from
	<tt>wait_for_completion()</tt>!!!
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
	That would be because I spotted the need for that
	<tt>smp_mb()</tt> during the creation of this documentation,
	and it is therefore unlikely to hit mainline before v4.14.
	Kudos to Lance Roy, Will Deacon, Peter Zijlstra, and
	Jonathan Cameron for asking questions that sensitized me
	to the rather elaborate sequence of events that demonstrate
	the need for this memory barrier.
</font></td></tr>
<tr><td>&nbsp;</td></tr>
</table>

<p>Tree RCU's grace-period memory-ordering guarantees rely most
heavily on the <tt>rcu_node</tt> structure's <tt>-&gt;lock</tt>
field, so much so that it is necessary to abbreviate this pattern
in the diagrams in the next section.
For example, consider the <tt>rcu_prepare_for_idle()</tt> function
shown below, which is one of several functions that enforce ordering
of newly arrived RCU callbacks against future grace periods:

<pre>
 1 static void rcu_prepare_for_idle(void)
 2 {
 3   bool needwake;
 4   struct rcu_data *rdp;
 5   struct rcu_dynticks *rdtp = this_cpu_ptr(&amp;rcu_dynticks);
 6   struct rcu_node *rnp;
 7   struct rcu_state *rsp;
 8   int tne;
 9
10   if (IS_ENABLED(CONFIG_RCU_NOCB_CPU_ALL) ||
11       rcu_is_nocb_cpu(smp_processor_id()))
12     return;
13   tne = READ_ONCE(tick_nohz_active);
14   if (tne != rdtp-&gt;tick_nohz_enabled_snap) {
15     if (rcu_cpu_has_callbacks(NULL))
16       invoke_rcu_core();
17     rdtp-&gt;tick_nohz_enabled_snap = tne;
18     return;
19   }
20   if (!tne)
21     return;
22   if (rdtp-&gt;all_lazy &amp;&amp;
23       rdtp-&gt;nonlazy_posted != rdtp-&gt;nonlazy_posted_snap) {
24     rdtp-&gt;all_lazy = false;
25     rdtp-&gt;nonlazy_posted_snap = rdtp-&gt;nonlazy_posted;
26     invoke_rcu_core();
27     return;
28   }
29   if (rdtp-&gt;last_accelerate == jiffies)
30     return;
31   rdtp-&gt;last_accelerate = jiffies;
32   for_each_rcu_flavor(rsp) {
33     rdp = this_cpu_ptr(rsp-&gt;rda);
34     if (!rcu_segcblist_pend_cbs(&amp;rdp-&gt;cblist))
35       continue;
36     rnp = rdp-&gt;mynode;
37     raw_spin_lock_rcu_node(rnp);
38     needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
39     raw_spin_unlock_rcu_node(rnp);
40     if (needwake)
41       rcu_gp_kthread_wake(rsp);
42   }
43 }
</pre>

<p>But the only part of <tt>rcu_prepare_for_idle()</tt> that really
matters for this discussion is lines&nbsp;37&ndash;39.
We will therefore abbreviate this function as follows:

</p><p><img src="rcu_node-lock.svg" alt="rcu_node-lock.svg">

<p>The box represents the <tt>rcu_node</tt> structure's <tt>-&gt;lock</tt>
critical section, with the double line on top representing the additional
<tt>smp_mb__after_unlock_lock()</tt>.

<h3><a name="Tree RCU Grace Period Memory Ordering Components">
Tree RCU Grace Period Memory Ordering Components</a></h3>

<p>Tree RCU's grace-period memory-ordering guarantee is provided by
a number of RCU components:

<ol>
<li>	<a href="#Callback Registry">Callback Registry</a>
<li>	<a href="#Grace-Period Initialization">Grace-Period Initialization</a>
<li>	<a href="#Self-Reported Quiescent States">
	Self-Reported Quiescent States</a>
<li>	<a href="#Dynamic Tick Interface">Dynamic Tick Interface</a>
<li>	<a href="#CPU-Hotplug Interface">CPU-Hotplug Interface</a>
<li>	<a href="#Forcing Quiescent States">Forcing Quiescent States</a>
<li>	<a href="#Grace-Period Cleanup">Grace-Period Cleanup</a>
<li>	<a href="#Callback Invocation">Callback Invocation</a>
</ol>

<p>Each of the following sections looks at the corresponding component
in detail.

<h4><a name="Callback Registry">Callback Registry</a></h4>

<p>If RCU's grace-period guarantee is to mean anything at all, any
access that happens before a given invocation of <tt>call_rcu()</tt>
must also happen before the corresponding grace period.
The implementation of this portion of RCU's grace period guarantee
is shown in the following figure:

</p><p><img src="TreeRCU-callback-registry.svg" alt="TreeRCU-callback-registry.svg">

<p>Because <tt>call_rcu()</tt> normally acts only on CPU-local state,
it provides no ordering guarantees, either for itself or for
phase one of the update (which again will usually be removal of
an element from an RCU-protected data structure).
It simply enqueues the <tt>rcu_head</tt> structure on a per-CPU list,
which cannot become associated with a grace period until a later
call to <tt>rcu_accelerate_cbs()</tt>, as shown in the diagram above.
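
<p>For example, a typical caller embeds an <tt>rcu_head</tt> in its
own data structure and hands it to <tt>call_rcu()</tt> (standard
<tt>call_rcu()</tt> usage; the structure and function names here are
made up for illustration):

<pre>
struct foo {
	struct list_head list;
	struct rcu_head rh;
};

static void foo_reclaim(struct rcu_head *rhp)
{
	kfree(container_of(rhp, struct foo, rh)); /* Phase two. */
}

static void foo_remove(struct foo *p)
{
	list_del_rcu(&amp;p-&gt;list);      /* Phase one. */
	call_rcu(&amp;p-&gt;rh, foo_reclaim); /* Only enqueues locally. */
}
</pre>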

<p>One set of code paths shown on the left invokes
<tt>rcu_accelerate_cbs()</tt> via
<tt>note_gp_changes()</tt>, either directly from <tt>call_rcu()</tt> (if
the current CPU is inundated with queued <tt>rcu_head</tt> structures)
or more likely from an <tt>RCU_SOFTIRQ</tt> handler.
Another code path in the middle is taken only in kernels built with
<tt>CONFIG_RCU_FAST_NO_HZ=y</tt>, which invokes
<tt>rcu_accelerate_cbs()</tt> via <tt>rcu_prepare_for_idle()</tt>.
The final code path on the right is taken only in kernels built with
<tt>CONFIG_HOTPLUG_CPU=y</tt>, which invokes
<tt>rcu_accelerate_cbs()</tt> via
<tt>rcu_advance_cbs()</tt>, <tt>rcu_migrate_callbacks()</tt>,
<tt>rcutree_migrate_callbacks()</tt>, and <tt>takedown_cpu()</tt>,
which in turn is invoked on a surviving CPU after the outgoing
CPU has been completely offlined.

<p>There are a few other code paths within grace-period processing
that opportunistically invoke <tt>rcu_accelerate_cbs()</tt>.
However, either way, all of the CPU's recently queued <tt>rcu_head</tt>
structures are associated with a future grace-period number under
the protection of the CPU's leaf <tt>rcu_node</tt> structure's
<tt>-&gt;lock</tt>.
In all cases, there is full ordering against any prior critical section
for that same <tt>rcu_node</tt> structure's <tt>-&gt;lock</tt>, and
also full ordering against any of the current task's or CPU's prior critical
sections for any <tt>rcu_node</tt> structure's <tt>-&gt;lock</tt>.

<p>The next section will show how this ordering ensures that any
accesses prior to the <tt>call_rcu()</tt> (particularly including phase
one of the update)
happen before the start of the corresponding grace period.

<table>
<tr><th>&nbsp;</th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
	But what about <tt>synchronize_rcu()</tt>?
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
	The <tt>synchronize_rcu()</tt> passes <tt>call_rcu()</tt>
	to <tt>wait_rcu_gp()</tt>, which invokes it.
	So either way, it eventually comes down to <tt>call_rcu()</tt>.
</font></td></tr>
<tr><td>&nbsp;</td></tr>
</table>

<h4><a name="Grace-Period Initialization">Grace-Period Initialization</a></h4>

<p>Grace-period initialization is carried out by
the grace-period kernel thread, which makes several passes over the
<tt>rcu_node</tt> tree within the <tt>rcu_gp_init()</tt> function.
This means that showing the full flow of ordering through the
grace-period computation will require duplicating this tree.
If you find this confusing, please note that the state of the
<tt>rcu_node</tt> tree changes over time, just like Heraclitus's river.
However, to keep the <tt>rcu_node</tt> river tractable, the
grace-period kernel thread's traversals are presented in multiple
parts, starting in this section with the various phases of
grace-period initialization.

<p>The first ordering-related grace-period initialization action is to
advance the <tt>rcu_state</tt> structure's <tt>-&gt;gp_seq</tt>
grace-period-number counter, as shown below:

</p><p><img src="TreeRCU-gp-init-1.svg" alt="TreeRCU-gp-init-1.svg" width="75%">

<p>The actual increment is carried out using <tt>smp_store_release()</tt>,
which helps reject false-positive RCU CPU stall detection.
Note that only the root <tt>rcu_node</tt> structure is touched.
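
<p>In other words, this step amounts to something like the following
sketch (a direct-update approximation; the actual code funnels through
sequence-counter helper functions):

<pre>
/* Sketch: publish the advanced grace-period number.  The release
 * ordering ensures that anyone reading the new -&gt;gp_seq value
 * also sees the prior grace-period state. */
smp_store_release(&amp;rsp-&gt;gp_seq, rsp-&gt;gp_seq + 1);
</pre>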

<p>The first pass through the <tt>rcu_node</tt> tree updates bitmasks
based on CPUs having come online or gone offline since the start of
the previous grace period.
In the common case where the number of online CPUs for this <tt>rcu_node</tt>
structure has not transitioned to or from zero,
this pass will scan only the leaf <tt>rcu_node</tt> structures.
However, if the number of online CPUs for a given leaf <tt>rcu_node</tt>
structure has transitioned from zero,
<tt>rcu_init_new_rnp()</tt> will be invoked for the first incoming CPU.
Similarly, if the number of online CPUs for a given leaf <tt>rcu_node</tt>
structure has transitioned to zero,
<tt>rcu_cleanup_dead_rnp()</tt> will be invoked for the last outgoing CPU.
The diagram below shows the path of ordering if the leftmost
<tt>rcu_node</tt> structure onlines its first CPU and if the next
<tt>rcu_node</tt> structure has no online CPUs
(or, alternatively if the leftmost <tt>rcu_node</tt> structure offlines
its last CPU and if the next <tt>rcu_node</tt> structure has no online CPUs).

</p><p><img src="TreeRCU-gp-init-2.svg" alt="TreeRCU-gp-init-2.svg" width="75%">

<p>The final <tt>rcu_gp_init()</tt> pass through the <tt>rcu_node</tt>
tree traverses breadth-first, setting each <tt>rcu_node</tt> structure's
<tt>-&gt;gp_seq</tt> field to the newly advanced value from the
<tt>rcu_state</tt> structure, as shown in the following diagram.

</p><p><img src="TreeRCU-gp-init-3.svg" alt="TreeRCU-gp-init-3.svg" width="75%">

<p>This change will also cause each CPU's next call to
<tt>__note_gp_changes()</tt>
to notice that a new grace period has started, as described in the next
section.
But because the grace-period kthread started the grace period at the
root (with the advancing of the <tt>rcu_state</tt> structure's
<tt>-&gt;gp_seq</tt> field) before setting each leaf <tt>rcu_node</tt>
structure's <tt>-&gt;gp_seq</tt> field, each CPU's observation of
the start of the grace period will happen after the actual start
of the grace period.

<table>
<tr><th>&nbsp;</th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
	But what about the CPU that started the grace period?
	Why wouldn't it see the start of the grace period right when
	it started that grace period?
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
	In some deep philosophical and overly anthropomorphized
	sense, yes, the CPU starting the grace period is immediately
	aware of having done so.
	However, if we instead assume that RCU is not self-aware,
	then even the CPU starting the grace period does not really
	become aware of the start of this grace period until its
	first call to <tt>__note_gp_changes()</tt>.
	On the other hand, this CPU potentially gets early notification
	because it invokes <tt>__note_gp_changes()</tt> during its
	last <tt>rcu_gp_init()</tt> pass through its leaf
	<tt>rcu_node</tt> structure.
</font></td></tr>
<tr><td>&nbsp;</td></tr>
</table>

<h4><a name="Self-Reported Quiescent States">
Self-Reported Quiescent States</a></h4>

<p>When all entities that might block the grace period have reported
quiescent states (or as described in a later section, had quiescent
states reported on their behalf), the grace period can end.
Online non-idle CPUs report their own quiescent states, as shown
in the following diagram:

</p><p><img src="TreeRCU-qs.svg" alt="TreeRCU-qs.svg" width="75%">

<p>This is for the last CPU to report a quiescent state, which signals
the end of the grace period.
Earlier quiescent states would push up the <tt>rcu_node</tt> tree
only until they encountered an <tt>rcu_node</tt> structure that
is waiting for additional quiescent states.
However, ordering is nevertheless preserved because some later quiescent
state will acquire that <tt>rcu_node</tt> structure's <tt>-&gt;lock</tt>.

<p>Any number of events can lead up to a CPU invoking
<tt>note_gp_changes()</tt> (or alternatively, directly invoking
<tt>__note_gp_changes()</tt>), at which point that CPU will notice
the start of a new grace period while holding its leaf
<tt>rcu_node</tt> lock.
Therefore, all execution shown in this diagram happens after the
start of the grace period.
In addition, this CPU will consider any RCU read-side critical
section that started before the invocation of <tt>__note_gp_changes()</tt>
to have started before the grace period, and thus a critical
section that the grace period must wait on.

<table>
<tr><th>&nbsp;</th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
	But an RCU read-side critical section might have started
	after the beginning of the grace period
	(the advancing of <tt>-&gt;gp_seq</tt> from earlier), so why should
	the grace period wait on such a critical section?
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
	It is indeed not necessary for the grace period to wait on such
	a critical section.
	However, it is permissible to wait on it.
	And it is furthermore important to wait on it, as this
	lazy approach is far more scalable than a &ldquo;big bang&rdquo;
	all-at-once grace-period start could possibly be.
</font></td></tr>
<tr><td>&nbsp;</td></tr>
</table>

<p>If the CPU does a context switch, a quiescent state will be
noted by <tt>rcu_note_context_switch()</tt> on the left.
On the other hand, if the CPU takes a scheduler-clock interrupt
while executing in usermode, a quiescent state will be noted by
<tt>rcu_check_callbacks()</tt> on the right.
Either way, the passage through a quiescent state will be noted
in a per-CPU variable.

<p>The next time an <tt>RCU_SOFTIRQ</tt> handler executes on
this CPU (for example, after the next scheduler-clock
interrupt), <tt>__rcu_process_callbacks()</tt> will invoke
<tt>rcu_check_quiescent_state()</tt>, which will notice the
recorded quiescent state, and invoke
<tt>rcu_report_qs_rdp()</tt>.
If <tt>rcu_report_qs_rdp()</tt> verifies that the quiescent state
really does apply to the current grace period, it invokes
<tt>rcu_report_qs_rnp()</tt>, which traverses up the <tt>rcu_node</tt>
tree as shown at the bottom of the diagram, clearing bits from
each <tt>rcu_node</tt> structure's <tt>-&gt;qsmask</tt> field,
and propagating up the tree when the result is zero.

<p>Note that traversal passes upwards out of a given <tt>rcu_node</tt>
structure only if the current CPU is reporting the last quiescent
state for the subtree headed by that <tt>rcu_node</tt> structure.
A key point is that if a CPU's traversal stops at a given <tt>rcu_node</tt>
structure, then there will be a later traversal by another CPU
(or perhaps the same one) that proceeds upwards
from that point, and the <tt>rcu_node</tt> <tt>-&gt;lock</tt>
guarantees that the first CPU's quiescent state happens before the
remainder of the second CPU's traversal.
Applying this line of thought repeatedly shows that all CPUs'
quiescent states happen before the last CPU traverses through
the root <tt>rcu_node</tt> structure, the &ldquo;last CPU&rdquo;
being the one that clears the last bit in the root <tt>rcu_node</tt>
structure's <tt>-&gt;qsmask</tt> field.
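
<p>The heart of this traversal might be sketched as follows, assuming
that the leaf <tt>rcu_node</tt> structure's <tt>-&gt;lock</tt> is
already held on entry and that <tt>mask</tt> holds the reporting CPU's
bit (the actual <tt>rcu_report_qs_rnp()</tt> also handles
preemptible-RCU blocked readers and other details):

<pre>
/* Sketch: clear our bit; stop if this subtree is still blocked. */
for (;;) {
	rnp-&gt;qsmask &amp;= ~mask;
	if (rnp-&gt;qsmask != 0) {
		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
		return;			/* Others still in progress. */
	}
	mask = rnp-&gt;grpmask;		/* Our bit at the next level up. */
	if (!rnp-&gt;parent)
		break;			/* Root reached: GP can end. */
	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
	rnp = rnp-&gt;parent;
	raw_spin_lock_irqsave_rcu_node(rnp, flags);
}
</pre>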

<h4><a name="Dynamic Tick Interface">Dynamic Tick Interface</a></h4>

<p>Due to energy-efficiency considerations, RCU is forbidden from
disturbing idle CPUs.
CPUs are therefore required to notify RCU when entering or leaving idle
state, which they do via fully ordered value-returning atomic operations
on a per-CPU variable.
The ordering effects are as shown below:

</p><p><img src="TreeRCU-dyntick.svg" alt="TreeRCU-dyntick.svg" width="50%">

<p>The RCU grace-period kernel thread samples the per-CPU idleness
variable while holding the corresponding CPU's leaf <tt>rcu_node</tt>
structure's <tt>-&gt;lock</tt>.
This means that any RCU read-side critical sections that precede the
idle period (the oval near the top of the diagram above) will happen
before the end of the current grace period.
Similarly, the beginning of the current grace period will happen before
any RCU read-side critical sections that follow the
idle period (the oval near the bottom of the diagram above).

<p>Plumbing this into the full grace-period execution is described
<a href="#Forcing Quiescent States">below</a>.

<h4><a name="CPU-Hotplug Interface">CPU-Hotplug Interface</a></h4>

<p>RCU is also forbidden from disturbing offline CPUs, which might well
be powered off and removed from the system completely.
CPUs are therefore required to notify RCU of their comings and goings
as part of the corresponding CPU hotplug operations.
The ordering effects are shown below:

</p><p><img src="TreeRCU-hotplug.svg" alt="TreeRCU-hotplug.svg" width="50%">

<p>Because CPU hotplug operations are much less frequent than idle transitions,
they are heavier weight, and thus acquire the CPU's leaf <tt>rcu_node</tt>
structure's <tt>-&gt;lock</tt> and update this structure's
<tt>-&gt;qsmaskinitnext</tt>.
The RCU grace-period kernel thread samples this mask to detect CPUs
having gone offline since the beginning of this grace period.
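
<p>A sketch of the online side of this notification, using the field
and helper names from the description above (the offline side clears
the same bit; error checking is omitted):

<pre>
/* Sketch: mark the incoming CPU in its leaf rcu_node structure so
 * that future grace periods will wait on it. */
raw_spin_lock_irqsave_rcu_node(rnp, flags);
rnp-&gt;qsmaskinitnext |= rdp-&gt;grpmask;
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
</pre>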

<p>Plumbing this into the full grace-period execution is described
<a href="#Forcing Quiescent States">below</a>.

<h4><a name="Forcing Quiescent States">Forcing Quiescent States</a></h4>

<p>As noted above, idle and offline CPUs cannot report their own
quiescent states, and therefore the grace-period kernel thread
must do the reporting on their behalf.
This process is called &ldquo;forcing quiescent states&rdquo;; it is
repeated every few jiffies, and its ordering effects are shown below:

</p><p><img src="TreeRCU-gp-fqs.svg" alt="TreeRCU-gp-fqs.svg" width="100%">

<p>Each pass of quiescent state forcing is guaranteed to traverse the
leaf <tt>rcu_node</tt> structures, and if there are no new quiescent
states due to recently idled and/or offlined CPUs, then only the
leaves are traversed.
However, if there is a newly offlined CPU as illustrated on the left
or a newly idled CPU as illustrated on the right, the corresponding
quiescent state will be driven up towards the root.
As with self-reported quiescent states, the upwards driving stops
once it reaches an <tt>rcu_node</tt> structure that has quiescent
states outstanding from other CPUs.
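
<p>Schematically, one forcing pass looks something like the following
(here <tt>idle_or_offline_cpus()</tt> is a hypothetical helper standing
in for the dyntick-idle and <tt>-&gt;qsmaskinitnext</tt> checks):

<pre>
/* Sketch: scan the leaves, reporting quiescent states on behalf
 * of CPUs that are idle or offline. */
rcu_for_each_leaf_node(rsp, rnp) {
	raw_spin_lock_irqsave_rcu_node(rnp, flags);
	mask = idle_or_offline_cpus(rnp);	/* Hypothetical helper. */
	if (mask)
		/* Reports the QSes and releases rnp-&gt;lock. */
		rcu_report_qs_rnp(mask, rsp, rnp, gps, flags);
	else
		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
}
</pre>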

<table>
<tr><th>&nbsp;</th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
	The leftmost drive to root stopped before it reached
	the root <tt>rcu_node</tt> structure, which means that
	there are still CPUs subordinate to that structure on
	which the current grace period is waiting.
	Given that, how is it possible that the rightmost drive
	to root ended the grace period?
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
	Good analysis!
	It is in fact impossible in the absence of bugs in RCU.
	But this diagram is complex enough as it is, so simplicity
	overrode accuracy.
	You can think of it as poetic license, or you can think of
	it as misdirection that is resolved in the
	<a href="#Putting It All Together">stitched-together diagram</a>.
</font></td></tr>
<tr><td>&nbsp;</td></tr>
</table>

<h4><a name="Grace-Period Cleanup">Grace-Period Cleanup</a></h4>

<p>Grace-period cleanup first scans the <tt>rcu_node</tt> tree
breadth-first advancing all the <tt>-&gt;gp_seq</tt> fields, then it
advances the <tt>rcu_state</tt> structure's <tt>-&gt;gp_seq</tt> field.
The ordering effects are shown below:

</p><p><img src="TreeRCU-gp-cleanup.svg" alt="TreeRCU-gp-cleanup.svg" width="75%">

<p>As indicated by the oval at the bottom of the diagram, once
grace-period cleanup is complete, the next grace period can begin.
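
<p>The shape of these two cleanup steps might be sketched as follows,
assuming direct field updates and a <tt>new_gp_seq</tt> value computed
earlier (the actual code works through sequence-counter helpers):

<pre>
/* Sketch: first mark the grace period ended in each rcu_node
 * structure, then in the rcu_state structure. */
rcu_for_each_node_breadth_first(rsp, rnp) {
	raw_spin_lock_irq_rcu_node(rnp);
	WRITE_ONCE(rnp-&gt;gp_seq, new_gp_seq);
	raw_spin_unlock_irq_rcu_node(rnp);
}
smp_store_release(&amp;rsp-&gt;gp_seq, new_gp_seq); /* GP officially over. */
</pre>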

<table>
<tr><th>&nbsp;</th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
	But when precisely does the grace period end?
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
	There is no useful single point at which the grace period
	can be said to end.
	The earliest reasonable candidate is as soon as the last
	CPU has reported its quiescent state, but it may be some
	milliseconds before RCU becomes aware of this.
	The latest reasonable candidate is once the <tt>rcu_state</tt>
	structure's <tt>-&gt;gp_seq</tt> field has been updated,
	but it is quite possible that some CPUs have already completed
	phase two of their updates by that time.
	In short, if you are going to work with RCU, you need to
	learn to embrace uncertainty.
</font></td></tr>
<tr><td>&nbsp;</td></tr>
</table>

<h4><a name="Callback Invocation">Callback Invocation</a></h4>

<p>Once a given CPU's leaf <tt>rcu_node</tt> structure's
<tt>-&gt;gp_seq</tt> field has been updated, that CPU can begin
invoking its RCU callbacks that were waiting for this grace period
to end.
These callbacks are identified by <tt>rcu_advance_cbs()</tt>,
which is usually invoked by <tt>__note_gp_changes()</tt>.
As shown in the diagram below, this invocation can be triggered by
the scheduling-clock interrupt (<tt>rcu_check_callbacks()</tt> on
the left) or by idle entry (<tt>rcu_cleanup_after_idle()</tt> on
the right, but only for kernels built with
<tt>CONFIG_RCU_FAST_NO_HZ=y</tt>).
Either way, <tt>RCU_SOFTIRQ</tt> is raised, which results in
<tt>rcu_do_batch()</tt> invoking the callbacks, which in turn
allows those callbacks to carry out (either directly or indirectly
via wakeup) the needed phase-two processing for each update.

</p><p><img src="TreeRCU-callback-invocation.svg" alt="TreeRCU-callback-invocation.svg" width="60%">

<p>Please note that callback invocation can also be prompted by any
number of corner-case code paths, for example, when a CPU notes that
it has excessive numbers of callbacks queued.
In all cases, the CPU acquires its leaf <tt>rcu_node</tt> structure's
<tt>-&gt;lock</tt> before invoking callbacks, which preserves the
required ordering against the newly completed grace period.

<p>However, if the callback function communicates with other CPUs,
for example, by doing a wakeup, then it is that function's responsibility
to maintain ordering.
For example, if the callback function wakes up a task that runs on
some other CPU, proper ordering must be in place in both the callback
function and the task being awakened.
To see why this is important, consider the top half of the
<a href="#Grace-Period Cleanup">grace-period cleanup</a> diagram.
The callback might be running on a CPU corresponding to the leftmost
leaf <tt>rcu_node</tt> structure, and awaken a task that is to run on
a CPU corresponding to the rightmost leaf <tt>rcu_node</tt> structure,
and the grace-period kernel thread might not yet have reached the
rightmost leaf.
In this case, the grace period's memory ordering might not yet have
reached that CPU, so again the callback function and the awakened
task must supply proper ordering.
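
<p>One way to supply that ordering is a release-acquire handoff,
sketched below with a hypothetical callback, object, and waitqueue
(the names are made up for illustration):

<pre>
/* Callback side: publish grace-period completion, then wake. */
static void my_cb(struct rcu_head *rhp)
{
	struct my_obj *p = container_of(rhp, struct my_obj, rh);

	smp_store_release(&amp;p-&gt;gp_done, true);
	wake_up(&amp;my_wq);
}

/* Awakened side: the acquire load pairs with the release above,
 * so phase-two processing sees the completed grace period. */
wait_event(my_wq, smp_load_acquire(&amp;p-&gt;gp_done));
do_phase_two(p);
</pre>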

<h3><a name="Putting It All Together">Putting It All Together</a></h3>

<p>A stitched-together diagram is
<a href="Tree-RCU-Diagram.html">here</a>.

<h3><a name="Legal Statement">
Legal Statement</a></h3>

<p>This work represents the view of the author and does not necessarily
represent the view of IBM.

</p><p>Linux is a registered trademark of Linus Torvalds.

</p><p>Other company, product, and service names may be trademarks or
service marks of others.

</body></html>