<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        "http://www.w3.org/TR/html4/loose.dtd">
        <html>
        <head><title>A Tour Through TREE_RCU's Grace-Period Memory Ordering</title>
        <meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">

           <p>August 8, 2017</p>
           <p>This article was contributed by Paul E.&nbsp;McKenney</p>

<h3>Introduction</h3>

<p>This document gives a rough visual overview of how Tree RCU's
grace-period memory ordering guarantee is provided.

<ol>
<li>	<a href="#What Is Tree RCU's Grace Period Memory Ordering Guarantee?">
	What Is Tree RCU's Grace Period Memory Ordering Guarantee?</a>
<li>	<a href="#Tree RCU Grace Period Memory Ordering Building Blocks">
	Tree RCU Grace Period Memory Ordering Building Blocks</a>
<li>	<a href="#Tree RCU Grace Period Memory Ordering Components">
	Tree RCU Grace Period Memory Ordering Components</a>
<li>	<a href="#Putting It All Together">Putting It All Together</a>
</ol>

<h3><a name="What Is Tree RCU's Grace Period Memory Ordering Guarantee?">
What Is Tree RCU's Grace Period Memory Ordering Guarantee?</a></h3>

<p>RCU grace periods provide extremely strong memory-ordering guarantees
for non-idle non-offline code.
Any code that happens after the end of a given RCU grace period is guaranteed
to see the effects of all accesses prior to the beginning of that grace
period that are within RCU read-side critical sections.
Similarly, any code that happens before the beginning of a given RCU grace
period is guaranteed to see the effects of all accesses following the end
of that grace period that are within RCU read-side critical sections.

<p>Note well that RCU-sched read-side critical sections include any region
of code for which preemption is disabled.
Given that each individual machine instruction can be thought of as
an extremely small region of preemption-disabled code, one can think of
<tt>synchronize_rcu()</tt> as <tt>smp_mb()</tt> on steroids.

<p>RCU updaters use this guarantee by splitting their updates into
two phases, one of which is executed before the grace period and
the other of which is executed after the grace period.
In the most common use case, phase one removes an element from
a linked RCU-protected data structure, and phase two frees that element.
For this to work, any readers that have witnessed state prior to the
phase-one update (in the common case, removal) must not witness state
following the phase-two update (in the common case, freeing).

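<p>As a concrete illustration, consider the following minimal sketch of
the two-phase pattern.
The <tt>struct foo</tt>, <tt>foo_lock</tt>, and function names are
hypothetical; only <tt>list_del_rcu()</tt>, <tt>call_rcu()</tt>, and
friends are actual kernel APIs:

<pre>
struct foo {
	struct list_head list;
	int data;
	struct rcu_head rh;
};

/* Phase two:  runs only after a full grace period has elapsed. */
static void free_foo_cb(struct rcu_head *rhp)
{
	kfree(container_of(rhp, struct foo, rh));
}

void remove_foo(struct foo *fp)
{
	spin_lock(&amp;foo_lock);              /* Hypothetical updater-side lock. */
	list_del_rcu(&amp;fp-&gt;list);           /* Phase one:  unlink the element. */
	spin_unlock(&amp;foo_lock);
	call_rcu(&amp;fp-&gt;rh, free_foo_cb);    /* Schedule phase two. */
}
</pre>

<p>Any reader that can still see the unlinked element must have entered
its RCU read-side critical section before the grace period began, so the
grace period will not end, and <tt>free_foo_cb()</tt> will not run, until
that reader is done.
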
<p>The RCU implementation provides this guarantee using a network
of lock-based critical sections, memory barriers, and per-CPU
processing, as is described in the following sections.

<h3><a name="Tree RCU Grace Period Memory Ordering Building Blocks">
Tree RCU Grace Period Memory Ordering Building Blocks</a></h3>

<p>The workhorse for RCU's grace-period memory ordering is the
critical section for the <tt>rcu_node</tt> structure's
<tt>-&gt;lock</tt>.
These critical sections use helper functions for lock acquisition, including
<tt>raw_spin_lock_rcu_node()</tt>,
<tt>raw_spin_lock_irq_rcu_node()</tt>, and
<tt>raw_spin_lock_irqsave_rcu_node()</tt>.
Their lock-release counterparts are
<tt>raw_spin_unlock_rcu_node()</tt>,
<tt>raw_spin_unlock_irq_rcu_node()</tt>, and
<tt>raw_spin_unlock_irqrestore_rcu_node()</tt>,
respectively.
For completeness, a
<tt>raw_spin_trylock_rcu_node()</tt>
is also provided.
The key point is that the lock-acquisition functions, including
<tt>raw_spin_trylock_rcu_node()</tt>, all invoke
<tt>smp_mb__after_unlock_lock()</tt> immediately after successful
acquisition of the lock.
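
<p>The acquisition side can be expressed as a simple wrapper macro.
The following is a hedged sketch of the pattern rather than a verbatim
copy of the kernel's implementation:

<pre>
/* Sketch of the acquisition-side pattern:  acquire the rcu_node
 * structure's -&gt;lock, then immediately upgrade the ordering so that
 * this critical section is seen to follow all prior critical
 * sections for any rcu_node structure's -&gt;lock. */
#define raw_spin_lock_rcu_node(rnp)			\
do {							\
	raw_spin_lock(&amp;(rnp)-&gt;lock);			\
	smp_mb__after_unlock_lock();			\
} while (0)
</pre>

<p>On most architectures <tt>smp_mb__after_unlock_lock()</tt> is a no-op;
on PowerPC it emits the full memory barrier needed to give the
unlock+lock sequence the required transitive ordering.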

<p>Therefore, for any given <tt>rcu_node</tt> structure, any access
happening before one of the above lock-release functions will be seen
by all CPUs as happening before any access happening after a later
one of the above lock-acquisition functions.
Furthermore, any access happening before one of the
above lock-release functions on any given CPU will be seen by all
CPUs as happening before any access happening after a later one
of the above lock-acquisition functions executing on that same CPU,
even if the lock-release and lock-acquisition functions are operating
on different <tt>rcu_node</tt> structures.
Tree RCU uses these two ordering guarantees to form an ordering
network among all CPUs that were in any way involved in the grace
period, including any CPUs that came online or went offline during
the grace period in question.

<p>The following litmus test exhibits the ordering effects of these
lock-acquisition and lock-release functions:

<pre>
int x, y, z;

void task0(void)
{
  raw_spin_lock_rcu_node(rnp);
  WRITE_ONCE(x, 1);
  r1 = READ_ONCE(y);
  raw_spin_unlock_rcu_node(rnp);
}

void task1(void)
{
  raw_spin_lock_rcu_node(rnp);
  WRITE_ONCE(y, 1);
  r2 = READ_ONCE(z);
  raw_spin_unlock_rcu_node(rnp);
}

void task2(void)
{
  WRITE_ONCE(z, 1);
  smp_mb();
  r3 = READ_ONCE(x);
}

WARN_ON(r1 == 0 &amp;&amp; r2 == 0 &amp;&amp; r3 == 0);
</pre>

<p>The <tt>WARN_ON()</tt> is evaluated at &ldquo;the end of time&rdquo;,
after all changes have propagated throughout the system.
Without the <tt>smp_mb__after_unlock_lock()</tt> provided by the
acquisition functions, this <tt>WARN_ON()</tt> could trigger, for example,
on PowerPC.
The <tt>smp_mb__after_unlock_lock()</tt> invocations prevent this
<tt>WARN_ON()</tt> from triggering.

<p>This approach must be extended to include idle CPUs, which need
RCU's grace-period memory ordering guarantee to extend to any
RCU read-side critical sections preceding and following the current
idle sojourn.
This case is handled by calls to the strongly ordered
<tt>atomic_add_return()</tt> read-modify-write atomic operation that
is invoked within <tt>rcu_dynticks_eqs_enter()</tt> at idle-entry
time and within <tt>rcu_dynticks_eqs_exit()</tt> at idle-exit time.
The grace-period kthread invokes <tt>rcu_dynticks_snap()</tt> and
<tt>rcu_dynticks_in_eqs_since()</tt> (both of which invoke
an <tt>atomic_add_return()</tt> of zero) to detect idle CPUs.
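
<p>The following is a hedged sketch of this counter scheme, using
illustrative names and assuming the convention that an even counter
value denotes an extended quiescent state (idle):

<pre>
/* Illustrative per-CPU dynticks counter:  even means idle. */
static DEFINE_PER_CPU(atomic_t, dynticks_ctr);

static void eqs_enter(void)  /* Idle entry. */
{
	/* Full ordering:  prior critical sections happen before
	 * the counter becomes even. */
	atomic_add_return(1, this_cpu_ptr(&amp;dynticks_ctr));
}

static int dynticks_snap(int cpu)  /* Grace-period kthread. */
{
	/* An atomic_add_return() of zero is a full-barrier read. */
	return atomic_add_return(0, per_cpu_ptr(&amp;dynticks_ctr, cpu));
}

static bool in_eqs_since(int cpu, int snap)
{
	/* Idle at snapshot time, or idle at some point since? */
	return !(snap &amp; 0x1) || snap != dynticks_snap(cpu);
}
</pre>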

<table>
<tr><th>&nbsp;</th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
	But what about CPUs that remain offline for the entire
	grace period?
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
	Such CPUs will be offline at the beginning of the grace period,
	so the grace period won't expect quiescent states from them.
	Races between grace-period start and CPU-hotplug operations
	are mediated by the CPU's leaf <tt>rcu_node</tt> structure's
	<tt>-&gt;lock</tt> as described above.
</font></td></tr>
<tr><td>&nbsp;</td></tr>
</table>

<p>The approach must be extended to handle one final case, that
of waking a task blocked in <tt>synchronize_rcu()</tt>.
This task might be affined to a CPU that is not yet aware that
the grace period has ended, and thus might not yet be subject to
the grace period's memory ordering.
Therefore, there is an <tt>smp_mb()</tt> after the return from
<tt>wait_for_completion()</tt> in the <tt>synchronize_rcu()</tt>
code path.

<table>
<tr><th>&nbsp;</th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
	What?  Where???
	I don't see any <tt>smp_mb()</tt> after the return from
	<tt>wait_for_completion()</tt>!!!
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
	That would be because I spotted the need for that
	<tt>smp_mb()</tt> during the creation of this documentation,
	and it is therefore unlikely to hit mainline before v4.14.
	Kudos to Lance Roy, Will Deacon, Peter Zijlstra, and
	Jonathan Cameron for asking questions that sensitized me
	to the rather elaborate sequence of events that demonstrate
	the need for this memory barrier.
</font></td></tr>
<tr><td>&nbsp;</td></tr>
</table>
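
<p>The shape of this code path can be sketched as follows.
This is a simplified rendition of the completion-based wait (the
kernel's actual implementation routes through <tt>wait_rcu_gp()</tt>),
with the newly added <tt>smp_mb()</tt> shown in place:

<pre>
struct rcu_synchronize {
	struct rcu_head head;
	struct completion completion;
};

/* Callback:  runs after the grace period, wakes the waiter. */
static void wakeme_after_rcu(struct rcu_head *head)
{
	struct rcu_synchronize *rs =
		container_of(head, struct rcu_synchronize, head);

	complete(&amp;rs-&gt;completion);
}

static void synchronize_rcu_sketch(void)
{
	struct rcu_synchronize rs;

	init_completion(&amp;rs.completion);
	call_rcu(&amp;rs.head, wakeme_after_rcu);
	wait_for_completion(&amp;rs.completion);
	smp_mb();  /* Order this CPU's post-grace-period accesses. */
}
</pre>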

<p>Tree RCU's grace-period memory-ordering guarantees rely most
heavily on the <tt>rcu_node</tt> structure's <tt>-&gt;lock</tt>
field, so much so that it is necessary to abbreviate this pattern
in the diagrams in the next section.
For example, consider the <tt>rcu_prepare_for_idle()</tt> function
shown below, which is one of several functions that enforce ordering
of newly arrived RCU callbacks against future grace periods:
<pre>
 1 static void rcu_prepare_for_idle(void)
 2 {
 3   bool needwake;
 4   struct rcu_data *rdp;
 5   struct rcu_dynticks *rdtp = this_cpu_ptr(&amp;rcu_dynticks);
 6   struct rcu_node *rnp;
 7   struct rcu_state *rsp;
 8   int tne;
 9
10   if (IS_ENABLED(CONFIG_RCU_NOCB_CPU_ALL) ||
11       rcu_is_nocb_cpu(smp_processor_id()))
12     return;
13   tne = READ_ONCE(tick_nohz_active);
14   if (tne != rdtp-&gt;tick_nohz_enabled_snap) {
15     if (rcu_cpu_has_callbacks(NULL))
16       invoke_rcu_core();
17     rdtp-&gt;tick_nohz_enabled_snap = tne;
18     return;
19   }
20   if (!tne)
21     return;
22   if (rdtp-&gt;all_lazy &amp;&amp;
23       rdtp-&gt;nonlazy_posted != rdtp-&gt;nonlazy_posted_snap) {
24     rdtp-&gt;all_lazy = false;
25     rdtp-&gt;nonlazy_posted_snap = rdtp-&gt;nonlazy_posted;
26     invoke_rcu_core();
27     return;
28   }
29   if (rdtp-&gt;last_accelerate == jiffies)
30     return;
31   rdtp-&gt;last_accelerate = jiffies;
32   for_each_rcu_flavor(rsp) {
33     rdp = this_cpu_ptr(rsp-&gt;rda);
34     if (!rcu_segcblist_pend_cbs(&amp;rdp-&gt;cblist))
35       continue;
36     rnp = rdp-&gt;mynode;
37     raw_spin_lock_rcu_node(rnp);
38     needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
39     raw_spin_unlock_rcu_node(rnp);
40     if (needwake)
41       rcu_gp_kthread_wake(rsp);
42   }
43 }
</pre>

<p>But the only parts of <tt>rcu_prepare_for_idle()</tt> that really
matter for this discussion are lines&nbsp;37&ndash;39.
We will therefore abbreviate this function as follows:

</p><p><img src="rcu_node-lock.svg" alt="rcu_node-lock.svg">

<p>The box represents the <tt>rcu_node</tt> structure's <tt>-&gt;lock</tt>
critical section, with the double line on top representing the additional
<tt>smp_mb__after_unlock_lock()</tt>.

<h3><a name="Tree RCU Grace Period Memory Ordering Components">
Tree RCU Grace Period Memory Ordering Components</a></h3>

<p>Tree RCU's grace-period memory-ordering guarantee is provided by
a number of RCU components:

<ol>
<li>	<a href="#Callback Registry">Callback Registry</a>
<li>	<a href="#Grace-Period Initialization">Grace-Period Initialization</a>
<li>	<a href="#Self-Reported Quiescent States">
	Self-Reported Quiescent States</a>
<li>	<a href="#Dynamic Tick Interface">Dynamic Tick Interface</a>
<li>	<a href="#CPU-Hotplug Interface">CPU-Hotplug Interface</a>
<li>	<a href="#Forcing Quiescent States">Forcing Quiescent States</a>
<li>	<a href="#Grace-Period Cleanup">Grace-Period Cleanup</a>
<li>	<a href="#Callback Invocation">Callback Invocation</a>
</ol>

<p>Each of the following sections looks at the corresponding component
in detail.

<h4><a name="Callback Registry">Callback Registry</a></h4>

<p>If RCU's grace-period guarantee is to mean anything at all, any
access that happens before a given invocation of <tt>call_rcu()</tt>
must also happen before the corresponding grace period.
The implementation of this portion of RCU's grace period guarantee
is shown in the following figure:

</p><p><img src="TreeRCU-callback-registry.svg" alt="TreeRCU-callback-registry.svg">

<p>Because <tt>call_rcu()</tt> normally acts only on CPU-local state,
it provides no ordering guarantees, either for itself or for
phase one of the update (which again will usually be removal of
an element from an RCU-protected data structure).
It simply enqueues the <tt>rcu_head</tt> structure on a per-CPU list,
which cannot become associated with a grace period until a later
call to <tt>rcu_accelerate_cbs()</tt>, as shown in the diagram above.
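
<p>The enqueue operation itself can be sketched as follows.
This is a simplified, hedged rendition (the <tt>cb_list</tt> variable
and function name are illustrative, and the real <tt>rcu_head</tt>
definition lives in the kernel's header files); note the complete
absence of memory barriers, which is safe only because the list is
CPU-local:

<pre>
struct rcu_head {
	struct rcu_head *next;
	void (*func)(struct rcu_head *head);
};

static DEFINE_PER_CPU(struct rcu_head *, cb_list);

void call_rcu_sketch(struct rcu_head *rhp, void (*func)(struct rcu_head *))
{
	unsigned long flags;

	rhp-&gt;func = func;
	local_irq_save(flags);	/* Excludes interrupts, not other CPUs. */
	rhp-&gt;next = __this_cpu_read(cb_list);
	__this_cpu_write(cb_list, rhp);	/* No ordering implied. */
	local_irq_restore(flags);
}
</pre>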

<p>One set of code paths shown on the left invokes
<tt>rcu_accelerate_cbs()</tt> via
<tt>note_gp_changes()</tt>, either directly from <tt>call_rcu()</tt> (if
the current CPU is inundated with queued <tt>rcu_head</tt> structures)
or more likely from an <tt>RCU_SOFTIRQ</tt> handler.
Another code path in the middle is taken only in kernels built with
<tt>CONFIG_RCU_FAST_NO_HZ=y</tt>, which invokes
<tt>rcu_accelerate_cbs()</tt> via <tt>rcu_prepare_for_idle()</tt>.
The final code path on the right is taken only in kernels built with
<tt>CONFIG_HOTPLUG_CPU=y</tt>, which invokes
<tt>rcu_accelerate_cbs()</tt> via
<tt>rcu_advance_cbs()</tt>, <tt>rcu_migrate_callbacks()</tt>,
<tt>rcutree_migrate_callbacks()</tt>, and <tt>takedown_cpu()</tt>,
which in turn is invoked on a surviving CPU after the outgoing
CPU has been completely offlined.

<p>There are a few other code paths within grace-period processing
that opportunistically invoke <tt>rcu_accelerate_cbs()</tt>.
However, either way, all of the CPU's recently queued <tt>rcu_head</tt>
structures are associated with a future grace-period number under
the protection of the CPU's leaf <tt>rcu_node</tt> structure's
<tt>-&gt;lock</tt>.
In all cases, there is full ordering against any prior critical section
for that same <tt>rcu_node</tt> structure's <tt>-&gt;lock</tt>, and
also full ordering against any of the current task's or CPU's prior critical
sections for any <tt>rcu_node</tt> structure's <tt>-&gt;lock</tt>.

<p>The next section will show how this ordering ensures that any
accesses prior to the <tt>call_rcu()</tt> (particularly including phase
one of the update)
happen before the start of the corresponding grace period.

<table>
<tr><th>&nbsp;</th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
	But what about <tt>synchronize_rcu()</tt>?
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
	The <tt>synchronize_rcu()</tt> function passes <tt>call_rcu()</tt>
	to <tt>wait_rcu_gp()</tt>, which invokes it.
	So it too eventually comes down to <tt>call_rcu()</tt>.
</font></td></tr>
<tr><td>&nbsp;</td></tr>
</table>

<h4><a name="Grace-Period Initialization">Grace-Period Initialization</a></h4>

<p>Grace-period initialization is carried out by
the grace-period kernel thread, which makes several passes over the
<tt>rcu_node</tt> tree within the <tt>rcu_gp_init()</tt> function.
This means that showing the full flow of ordering through the
grace-period computation will require duplicating this tree.
If you find this confusing, please note that the state of the
<tt>rcu_node</tt> tree changes over time, just like Heraclitus's river.
However, to keep the <tt>rcu_node</tt> river tractable, the
grace-period kernel thread's traversals are presented in multiple
parts, starting in this section with the various phases of
grace-period initialization.

<p>The first ordering-related grace-period initialization action is to
advance the <tt>rcu_state</tt> structure's <tt>-&gt;gp_seq</tt>
grace-period-number counter, as shown below:

</p><p><img src="TreeRCU-gp-init-1.svg" alt="TreeRCU-gp-init-1.svg" width="75%">

<p>The actual increment is carried out using <tt>smp_store_release()</tt>,
which helps reject false-positive RCU CPU stall detection.
Note that only the root <tt>rcu_node</tt> structure is touched.
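
<p>A hedged one-line sketch of this release-ordered advance, assuming
<tt>rsp</tt> points to the <tt>rcu_state</tt> structure and treating
the counter arithmetic as illustrative:

<pre>
/* Publish the new grace-period number; readers pair with
 * smp_load_acquire() to avoid false-positive stall warnings. */
smp_store_release(&amp;rsp-&gt;gp_seq, rsp-&gt;gp_seq + 1);
</pre>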

<p>The first pass through the <tt>rcu_node</tt> tree updates bitmasks
based on CPUs having come online or gone offline since the start of
the previous grace period.
In the common case where the number of online CPUs for this <tt>rcu_node</tt>
structure has not transitioned to or from zero,
this pass will scan only the leaf <tt>rcu_node</tt> structures.
However, if the number of online CPUs for a given leaf <tt>rcu_node</tt>
structure has transitioned from zero,
<tt>rcu_init_new_rnp()</tt> will be invoked for the first incoming CPU.
Similarly, if the number of online CPUs for a given leaf <tt>rcu_node</tt>
structure has transitioned to zero,
<tt>rcu_cleanup_dead_rnp()</tt> will be invoked for the last outgoing CPU.
The diagram below shows the path of ordering if the leftmost
<tt>rcu_node</tt> structure onlines its first CPU and if the next
<tt>rcu_node</tt> structure has no online CPUs
(or, alternatively, if the leftmost <tt>rcu_node</tt> structure offlines
its last CPU and if the next <tt>rcu_node</tt> structure has no online CPUs).

</p><p><img src="TreeRCU-gp-init-2.svg" alt="TreeRCU-gp-init-2.svg" width="75%">

<p>The final <tt>rcu_gp_init()</tt> pass through the <tt>rcu_node</tt>
tree traverses breadth-first, setting each <tt>rcu_node</tt> structure's
<tt>-&gt;gp_seq</tt> field to the newly advanced value from the
<tt>rcu_state</tt> structure, as shown in the following diagram.

</p><p><img src="TreeRCU-gp-init-3.svg" alt="TreeRCU-gp-init-3.svg" width="75%">
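
<p>The heart of this pass might be sketched as follows, using the
kernel's breadth-first traversal helper; this is a simplified
rendition, not the full <tt>rcu_gp_init()</tt> loop body:

<pre>
rcu_for_each_node_breadth_first(rsp, rnp) {
	raw_spin_lock_irq_rcu_node(rnp); /* Includes smp_mb__after_unlock_lock(). */
	WRITE_ONCE(rnp-&gt;gp_seq, rsp-&gt;gp_seq); /* Announce the new grace period. */
	raw_spin_unlock_irq_rcu_node(rnp);
}
</pre>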

<p>This change will also cause each CPU's next call to
<tt>__note_gp_changes()</tt>
to notice that a new grace period has started, as described in the next
section.
But because the grace-period kthread started the grace period at the
root (with the advancing of the <tt>rcu_state</tt> structure's
<tt>-&gt;gp_seq</tt> field) before setting each leaf <tt>rcu_node</tt>
structure's <tt>-&gt;gp_seq</tt> field, each CPU's observation of
the start of the grace period will happen after the actual start
of the grace period.

<table>
<tr><th>&nbsp;</th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
	But what about the CPU that started the grace period?
	Why wouldn't it see the start of the grace period right when
	it started that grace period?
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
	In some deep philosophical and overly anthropomorphized
	sense, yes, the CPU starting the grace period is immediately
	aware of having done so.
	However, if we instead assume that RCU is not self-aware,
	then even the CPU starting the grace period does not really
	become aware of the start of this grace period until its
	first call to <tt>__note_gp_changes()</tt>.
	On the other hand, this CPU potentially gets early notification
	because it invokes <tt>__note_gp_changes()</tt> during its
	last <tt>rcu_gp_init()</tt> pass through its leaf
	<tt>rcu_node</tt> structure.
</font></td></tr>
<tr><td>&nbsp;</td></tr>
</table>

<h4><a name="Self-Reported Quiescent States">
Self-Reported Quiescent States</a></h4>

<p>When all entities that might block the grace period have reported
quiescent states (or as described in a later section, had quiescent
states reported on their behalf), the grace period can end.
Online non-idle CPUs report their own quiescent states, as shown
in the following diagram:

</p><p><img src="TreeRCU-qs.svg" alt="TreeRCU-qs.svg" width="75%">

<p>This is for the last CPU to report a quiescent state, which signals
the end of the grace period.
Earlier quiescent states would push up the <tt>rcu_node</tt> tree
only until they encountered an <tt>rcu_node</tt> structure that
is waiting for additional quiescent states.
However, ordering is nevertheless preserved because some later quiescent
state will acquire that <tt>rcu_node</tt> structure's <tt>-&gt;lock</tt>.

<p>Any number of events can lead up to a CPU invoking
<tt>note_gp_changes()</tt> (or alternatively, directly invoking
<tt>__note_gp_changes()</tt>), at which point that CPU will notice
the start of a new grace period while holding its leaf
<tt>rcu_node</tt> lock.
Therefore, all execution shown in this diagram happens after the
start of the grace period.
In addition, this CPU will consider any RCU read-side critical
section that started before the invocation of <tt>__note_gp_changes()</tt>
to have started before the grace period, and thus a critical
section that the grace period must wait on.

<table>
<tr><th>&nbsp;</th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
	But an RCU read-side critical section might have started
	after the beginning of the grace period
	(the advancing of <tt>-&gt;gp_seq</tt> from earlier), so why should
	the grace period wait on such a critical section?
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
	It is indeed not necessary for the grace period to wait on such
	a critical section.
	However, it is permissible to wait on it.
	And it is furthermore important to wait on it, as this
	lazy approach is far more scalable than a &ldquo;big bang&rdquo;
	all-at-once grace-period start could possibly be.
</font></td></tr>
<tr><td>&nbsp;</td></tr>
</table>

<p>If the CPU does a context switch, a quiescent state will be
noted by <tt>rcu_note_context_switch()</tt> on the left.
On the other hand, if the CPU takes a scheduler-clock interrupt
while executing in usermode, a quiescent state will be noted by
<tt>rcu_sched_clock_irq()</tt> on the right.
Either way, the passage through a quiescent state will be noted
in a per-CPU variable.

<p>The next time an <tt>RCU_SOFTIRQ</tt> handler executes on
this CPU (for example, after the next scheduler-clock
interrupt), <tt>rcu_core()</tt> will invoke
<tt>rcu_check_quiescent_state()</tt>, which will notice the
recorded quiescent state, and invoke
<tt>rcu_report_qs_rdp()</tt>.
If <tt>rcu_report_qs_rdp()</tt> verifies that the quiescent state
really does apply to the current grace period, it invokes
<tt>rcu_report_qs_rnp()</tt>, which traverses up the <tt>rcu_node</tt>
tree as shown at the bottom of the diagram, clearing bits from
each <tt>rcu_node</tt> structure's <tt>-&gt;qsmask</tt> field,
and propagating up the tree when the result is zero.

<p>Note that traversal passes upwards out of a given <tt>rcu_node</tt>
structure only if the current CPU is reporting the last quiescent
state for the subtree headed by that <tt>rcu_node</tt> structure.
A key point is that if a CPU's traversal stops at a given <tt>rcu_node</tt>
structure, then there will be a later traversal by another CPU
(or perhaps the same one) that proceeds upwards
from that point, and the <tt>rcu_node</tt> <tt>-&gt;lock</tt>
guarantees that the first CPU's quiescent state happens before the
remainder of the second CPU's traversal.
Applying this line of thought repeatedly shows that all CPUs'
quiescent states happen before the last CPU traverses through
the root <tt>rcu_node</tt> structure, the &ldquo;last CPU&rdquo;
being the one that clears the last bit in the root <tt>rcu_node</tt>
structure's <tt>-&gt;qsmask</tt> field.
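
<p>The upward traversal can be sketched as follows; this is a
simplified rendition of the <tt>rcu_report_qs_rnp()</tt> loop, with
error checking and wakeups omitted:

<pre>
for (;;) {
	rnp-&gt;qsmask &amp;= ~mask;	/* Clear this subtree's bit(s). */
	if (rnp-&gt;qsmask != 0) {
		/* Other quiescent states still outstanding here,
		 * so some later report will continue upwards. */
		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
		return;
	}
	mask = rnp-&gt;grpmask;	/* This rcu_node's bit in its parent. */
	if (rnp-&gt;parent == NULL)
		break;		/* At the root:  grace period can end. */
	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
	rnp = rnp-&gt;parent;
	raw_spin_lock_irqsave_rcu_node(rnp, flags); /* Full ordering. */
}
</pre>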

<h4><a name="Dynamic Tick Interface">Dynamic Tick Interface</a></h4>

<p>Due to energy-efficiency considerations, RCU is forbidden from
disturbing idle CPUs.
CPUs are therefore required to notify RCU when entering or leaving idle
state, which they do via fully ordered value-returning atomic operations
on a per-CPU variable.
The ordering effects are as shown below:

</p><p><img src="TreeRCU-dyntick.svg" alt="TreeRCU-dyntick.svg" width="50%">

<p>The RCU grace-period kernel thread samples the per-CPU idleness
variable while holding the corresponding CPU's leaf <tt>rcu_node</tt>
structure's <tt>-&gt;lock</tt>.
This means that any RCU read-side critical sections that precede the
idle period (the oval near the top of the diagram above) will happen
before the end of the current grace period.
Similarly, the beginning of the current grace period will happen before
any RCU read-side critical sections that follow the
idle period (the oval near the bottom of the diagram above).
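
<p>The sampling step can be sketched as follows, reusing the
illustrative <tt>dynticks_snap()</tt> helper from earlier (again a
hedged sketch rather than the kernel's exact code):

<pre>
raw_spin_lock_irq_rcu_node(rnp);		/* Full ordering. */
rdp-&gt;dynticks_snap = dynticks_snap(cpu);	/* Full-barrier read. */
raw_spin_unlock_irq_rcu_node(rnp);
/* A later forcing pass compares against rdp-&gt;dynticks_snap to
 * determine whether this CPU has passed through an idle period. */
</pre>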

<p>Plumbing this into the full grace-period execution is described
<a href="#Forcing Quiescent States">below</a>.

<h4><a name="CPU-Hotplug Interface">CPU-Hotplug Interface</a></h4>

<p>RCU is also forbidden from disturbing offline CPUs, which might well
be powered off and removed from the system completely.
CPUs are therefore required to notify RCU of their comings and goings
as part of the corresponding CPU hotplug operations.
The ordering effects are shown below:

</p><p><img src="TreeRCU-hotplug.svg" alt="TreeRCU-hotplug.svg" width="50%">

<p>Because CPU hotplug operations are much less frequent than idle transitions,
they are heavier weight, and thus acquire the CPU's leaf <tt>rcu_node</tt>
structure's <tt>-&gt;lock</tt> and update this structure's
<tt>-&gt;qsmaskinitnext</tt>.
The RCU grace-period kernel thread samples this mask to detect CPUs
having gone offline since the beginning of this grace period.
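
<p>For example, marking an incoming CPU online might be sketched as
follows (simplified from the kernel's CPU-bringup path):

<pre>
raw_spin_lock_irqsave_rcu_node(rnp, flags);
rnp-&gt;qsmaskinitnext |= rdp-&gt;grpmask;	/* This CPU is now online... */
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
/* ...and the offline path clears the same bit under the same lock,
 * so grace-period initialization sees a consistent snapshot. */
</pre>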

<p>Plumbing this into the full grace-period execution is described
<a href="#Forcing Quiescent States">below</a>.

<h4><a name="Forcing Quiescent States">Forcing Quiescent States</a></h4>

<p>As noted above, idle and offline CPUs cannot report their own
quiescent states, and therefore the grace-period kernel thread
must do the reporting on their behalf.
This process is called &ldquo;forcing quiescent states&rdquo;, it is
repeated every few jiffies, and its ordering effects are shown below:

</p><p><img src="TreeRCU-gp-fqs.svg" alt="TreeRCU-gp-fqs.svg" width="100%">

<p>Each pass of quiescent state forcing is guaranteed to traverse the
leaf <tt>rcu_node</tt> structures, and if there are no new quiescent
states due to recently idled and/or offlined CPUs, then only the
leaves are traversed.
However, if there is a newly offlined CPU as illustrated on the left
or a newly idled CPU as illustrated on the right, the corresponding
quiescent state will be driven up towards the root.
As with self-reported quiescent states, the upwards driving stops
once it reaches an <tt>rcu_node</tt> structure that has quiescent
states outstanding from other CPUs.
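
<p>One forcing pass over the leaves might be sketched as follows, in
the style of the kernel's <tt>force_qs_rnp()</tt>; this is a
simplified rendition with helper names and signatures treated as
illustrative:

<pre>
rcu_for_each_leaf_node(rsp, rnp) {
	unsigned long mask = 0;

	raw_spin_lock_irqsave_rcu_node(rnp, flags);
	for (cpu = rnp-&gt;grplo; cpu &lt;= rnp-&gt;grphi; cpu++) {
		rdp = per_cpu_ptr(rsp-&gt;rda, cpu);
		/* Still blocking this grace period, but idle since
		 * the earlier snapshot?  Report on its behalf. */
		if ((rnp-&gt;qsmask &amp; rdp-&gt;grpmask) &amp;&amp;
		    in_eqs_since(cpu, rdp-&gt;dynticks_snap))
			mask |= rdp-&gt;grpmask;
	}
	if (mask)
		rcu_report_qs_rnp(mask, rsp, rnp, flags); /* Releases lock. */
	else
		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
}
</pre>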

<table>
<tr><th>&nbsp;</th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
	The leftmost drive to root stopped before it reached
	the root <tt>rcu_node</tt> structure, which means that
	there are still CPUs subordinate to that structure on
	which the current grace period is waiting.
	Given that, how is it possible that the rightmost drive
	to root ended the grace period?
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
	Good analysis!
	It is in fact impossible in the absence of bugs in RCU.
	But this diagram is complex enough as it is, so simplicity
	overrode accuracy.
	You can think of it as poetic license, or you can think of
	it as misdirection that is resolved in the
	<a href="#Putting It All Together">stitched-together diagram</a>.
</font></td></tr>
<tr><td>&nbsp;</td></tr>
</table>

<h4><a name="Grace-Period Cleanup">Grace-Period Cleanup</a></h4>

<p>Grace-period cleanup first scans the <tt>rcu_node</tt> tree
breadth-first, advancing all the <tt>-&gt;gp_seq</tt> fields, then it
advances the <tt>rcu_state</tt> structure's <tt>-&gt;gp_seq</tt> field.
The ordering effects are shown below:

</p><p><img src="TreeRCU-gp-cleanup.svg" alt="TreeRCU-gp-cleanup.svg" width="75%">

<p>As indicated by the oval at the bottom of the diagram, once
grace-period cleanup is complete, the next grace period can begin.
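
<p>The cleanup scan mirrors the initialization pass shown earlier,
with the <tt>rcu_state</tt> structure's counter advanced only after
every <tt>rcu_node</tt> structure has been updated (again a hedged
sketch with illustrative counter arithmetic, not the full cleanup
code):

<pre>
new_gp_seq = rsp-&gt;gp_seq + 1;
rcu_for_each_node_breadth_first(rsp, rnp) {
	raw_spin_lock_irq_rcu_node(rnp);
	WRITE_ONCE(rnp-&gt;gp_seq, new_gp_seq);
	raw_spin_unlock_irq_rcu_node(rnp);
}
WRITE_ONCE(rsp-&gt;gp_seq, new_gp_seq); /* Officially ends the grace period. */
</pre>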

<table>
<tr><th>&nbsp;</th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
	But when precisely does the grace period end?
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
	There is no useful single point at which the grace period
	can be said to end.
	The earliest reasonable candidate is as soon as the last
	CPU has reported its quiescent state, but it may be some
	milliseconds before RCU becomes aware of this.
	The latest reasonable candidate is once the <tt>rcu_state</tt>
	structure's <tt>-&gt;gp_seq</tt> field has been updated,
	but it is quite possible that some CPUs have already completed
	phase two of their updates by that time.
	In short, if you are going to work with RCU, you need to
	learn to embrace uncertainty.
</font></td></tr>
<tr><td>&nbsp;</td></tr>
</table>

<h4><a name="Callback Invocation">Callback Invocation</a></h4>

<p>Once a given CPU's leaf <tt>rcu_node</tt> structure's
<tt>-&gt;gp_seq</tt> field has been updated, that CPU can begin
invoking its RCU callbacks that were waiting for this grace period
to end.
These callbacks are identified by <tt>rcu_advance_cbs()</tt>,
which is usually invoked by <tt>__note_gp_changes()</tt>.
As shown in the diagram below, this invocation can be triggered by
the scheduling-clock interrupt (<tt>rcu_sched_clock_irq()</tt> on
the left) or by idle entry (<tt>rcu_cleanup_after_idle()</tt> on
the right, but only for kernels built with
<tt>CONFIG_RCU_FAST_NO_HZ=y</tt>).
Either way, <tt>RCU_SOFTIRQ</tt> is raised, which results in
<tt>rcu_do_batch()</tt> invoking the callbacks, which in turn
allows those callbacks to carry out (either directly or indirectly
via wakeup) the needed phase-two processing for each update.

</p><p><img src="TreeRCU-callback-invocation.svg" alt="TreeRCU-callback-invocation.svg" width="60%">

<p>Please note that callback invocation can also be prompted by any
number of corner-case code paths, for example, when a CPU notes that
it has excessive numbers of callbacks queued.
In all cases, the CPU acquires its leaf <tt>rcu_node</tt> structure's
<tt>-&gt;lock</tt> before invoking callbacks, which preserves the
required ordering against the newly completed grace period.
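
<p>The invocation loop itself can be sketched as follows, where
<tt>ready_list</tt> is an illustrative name for callbacks whose grace
period has ended, extraction of which is assumed to have already
happened under the leaf <tt>rcu_node</tt> structure's
<tt>-&gt;lock</tt>:

<pre>
struct rcu_head *rhp, *next;

/* Invoke each ready callback; each one performs phase two of its
 * update, for example, freeing the previously removed element. */
for (rhp = ready_list; rhp != NULL; rhp = next) {
	next = rhp-&gt;next;
	rhp-&gt;func(rhp);
}
</pre>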

<p>However, if the callback function communicates to other CPUs,
for example, doing a wakeup, then it is that function's responsibility
to maintain ordering.
For example, if the callback function wakes up a task that runs on
some other CPU, proper ordering must be in place in both the callback
function and the task being awakened.
To see why this is important, consider the top half of the
<a href="#Grace-Period Cleanup">grace-period cleanup</a> diagram.
The callback might be running on a CPU corresponding to the leftmost
leaf <tt>rcu_node</tt> structure, and awaken a task that is to run on
a CPU corresponding to the rightmost leaf <tt>rcu_node</tt> structure,
and the grace-period kernel thread might not yet have reached the
rightmost leaf.
In this case, the grace period's memory ordering might not yet have
reached that CPU, so again the callback function and the awakened
task must supply proper ordering.

<h3><a name="Putting It All Together">Putting It All Together</a></h3>

<p>A stitched-together diagram is
<a href="Tree-RCU-Diagram.html">here</a>.

<h3><a name="Legal Statement">
Legal Statement</a></h3>

<p>This work represents the view of the author and does not necessarily
represent the view of IBM.

</p><p>Linux is a registered trademark of Linus Torvalds.

</p><p>Other company, product, and service names may be trademarks or
service marks of others.

</body></html>