<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
	"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head><title>A Tour Through TREE_RCU's Grace-Period Memory Ordering</title>
<meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
</head><body>

<p>August 8, 2017</p>
<p>This article was contributed by Paul E. McKenney</p>

<h3>Introduction</h3>

<p>This document gives a rough visual overview of how Tree RCU's
grace-period memory ordering guarantee is provided.

<ol>
<li>	<a href="#What Is Tree RCU's Grace Period Memory Ordering Guarantee?">
	What Is Tree RCU's Grace Period Memory Ordering Guarantee?</a>
<li>	<a href="#Tree RCU Grace Period Memory Ordering Building Blocks">
	Tree RCU Grace Period Memory Ordering Building Blocks</a>
<li>	<a href="#Tree RCU Grace Period Memory Ordering Components">
	Tree RCU Grace Period Memory Ordering Components</a>
<li>	<a href="#Putting It All Together">Putting It All Together</a>
</ol>

<h3><a name="What Is Tree RCU's Grace Period Memory Ordering Guarantee?">
What Is Tree RCU's Grace Period Memory Ordering Guarantee?</a></h3>

<p>RCU grace periods provide extremely strong memory-ordering guarantees
for non-idle non-offline code.
Any code that happens after the end of a given RCU grace period is guaranteed
to see the effects of all accesses prior to the beginning of that grace
period that are within RCU read-side critical sections.
Similarly, any code that happens before the beginning of a given RCU grace
period is guaranteed to not see the effects of all accesses following the
end of that grace period that are within RCU read-side critical sections.

<p>This guarantee is particularly pervasive for <tt>synchronize_sched()</tt>,
for which RCU-sched read-side critical sections include any region
of code for which preemption is disabled.
Given that each individual machine instruction can be thought of as
an extremely small region of preemption-disabled code, one can think of
<tt>synchronize_sched()</tt> as <tt>smp_mb()</tt> on steroids.

<p>RCU updaters use this guarantee by splitting their updates into
two phases, one of which is executed before the grace period and
the other of which is executed after the grace period.
In the most common use case, phase one removes an element from
a linked RCU-protected data structure, and phase two frees that element.
For this to work, any readers that have witnessed state prior to the
phase-one update (in the common case, removal) must not witness state
following the phase-two update (in the common case, freeing).

<p>The RCU implementation provides this guarantee using a network
of lock-based critical sections, memory barriers, and per-CPU
processing, as is described in the following sections.

<h3><a name="Tree RCU Grace Period Memory Ordering Building Blocks">
Tree RCU Grace Period Memory Ordering Building Blocks</a></h3>

<p>The workhorse for RCU's grace-period memory ordering is the
critical section for the <tt>rcu_node</tt> structure's
<tt>->lock</tt>.
These critical sections use helper functions for lock acquisition, including
<tt>raw_spin_lock_rcu_node()</tt>,
<tt>raw_spin_lock_irq_rcu_node()</tt>, and
<tt>raw_spin_lock_irqsave_rcu_node()</tt>.
Their lock-release counterparts are
<tt>raw_spin_unlock_rcu_node()</tt>,
<tt>raw_spin_unlock_irq_rcu_node()</tt>, and
<tt>raw_spin_unlock_irqrestore_rcu_node()</tt>,
respectively.
For completeness, a
<tt>raw_spin_trylock_rcu_node()</tt>
is also provided.
The key point is that the lock-acquisition functions, including
<tt>raw_spin_trylock_rcu_node()</tt>, all invoke
<tt>smp_mb__after_unlock_lock()</tt> immediately after successful
acquisition of the lock.
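
<p>Although the kernel's actual helpers differ in details (for example,
lockdep annotations and private-field accessors), a minimal sketch of
one of these acquisition helpers that is consistent with the above
description looks like this:

<pre>
/*
 * Hedged sketch, not a verbatim copy of the kernel's macro: acquire
 * an rcu_node structure's ->lock, then upgrade the acquisition to
 * full memory-barrier strength.
 */
#define raw_spin_lock_rcu_node(rnp)					\
do {									\
	raw_spin_lock(&(rnp)->lock);					\
	smp_mb__after_unlock_lock(); /* Full ordering. */		\
} while (0)
</pre>

<p>On architectures whose unlock-lock pairs already provide full
ordering, <tt>smp_mb__after_unlock_lock()</tt> can be (and is) a no-op.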
<p>Therefore, for any given <tt>rcu_node</tt> structure, any access
happening before one of the above lock-release functions will be seen
by all CPUs as happening before any access happening after a later
one of the above lock-acquisition functions.
Furthermore, any access happening before one of the
above lock-release functions on any given CPU will be seen by all
CPUs as happening before any access happening after a later one
of the above lock-acquisition functions executing on that same CPU,
even if the lock-release and lock-acquisition functions are operating
on different <tt>rcu_node</tt> structures.
Tree RCU uses these two ordering guarantees to form an ordering
network among all CPUs that were in any way involved in the grace
period, including any CPUs that came online or went offline during
the grace period in question.

<p>The following litmus test exhibits the ordering effects of these
lock-acquisition and lock-release functions:

<pre>
 1 int x, y, z;
 2
 3 void task0(void)
 4 {
 5   raw_spin_lock_rcu_node(rnp);
 6   WRITE_ONCE(x, 1);
 7   r1 = READ_ONCE(y);
 8   raw_spin_unlock_rcu_node(rnp);
 9 }
10
11 void task1(void)
12 {
13   raw_spin_lock_rcu_node(rnp);
14   WRITE_ONCE(y, 1);
15   r2 = READ_ONCE(z);
16   raw_spin_unlock_rcu_node(rnp);
17 }
18
19 void task2(void)
20 {
21   WRITE_ONCE(z, 1);
22   smp_mb();
23   r3 = READ_ONCE(x);
24 }
25
26 WARN_ON(r1 == 0 && r2 == 0 && r3 == 0);
</pre>

<p>The <tt>WARN_ON()</tt> is evaluated at “the end of time”,
after all changes have propagated throughout the system.
Without the <tt>smp_mb__after_unlock_lock()</tt> provided by the
acquisition functions, this <tt>WARN_ON()</tt> could trigger, for example
on PowerPC.
The <tt>smp_mb__after_unlock_lock()</tt> invocations prevent this
<tt>WARN_ON()</tt> from triggering.

<p>This approach must be extended to include idle CPUs, which need
RCU's grace-period memory ordering guarantee to extend to any
RCU read-side critical sections preceding and following the current
idle sojourn.
This case is handled by calls to the strongly ordered
<tt>atomic_add_return()</tt> read-modify-write atomic operation that
is invoked within <tt>rcu_dynticks_eqs_enter()</tt> at idle-entry
time and within <tt>rcu_dynticks_eqs_exit()</tt> at idle-exit time.
The grace-period kthread invokes <tt>rcu_dynticks_snap()</tt> and
<tt>rcu_dynticks_in_eqs_since()</tt> (both of which invoke
an <tt>atomic_add_return()</tt> of zero) to detect idle CPUs.

<table>
<tr><th>&nbsp;</th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
	But what about CPUs that remain offline for the entire
	grace period?
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
	Such CPUs will be offline at the beginning of the grace period,
	so the grace period won't expect quiescent states from them.
	Races between grace-period start and CPU-hotplug operations
	are mediated by the CPU's leaf <tt>rcu_node</tt> structure's
	<tt>->lock</tt> as described above.
</font></td></tr>
<tr><td>&nbsp;</td></tr>
</table>
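
<p>Returning to the idle-CPU case, the idle-entry side can be sketched
as follows.
This is a hedged simplification, assuming an illustrative encoding in
which an even <tt>->dynticks</tt> value indicates an extended quiescent
state; the mainline encoding differs in detail:

<pre>
/* Hedged sketch of idle entry, with an illustrative counter encoding. */
static void rcu_dynticks_eqs_enter(void)
{
	struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);

	/*
	 * The fully ordered atomic_add_return() orders this CPU's
	 * prior RCU read-side critical sections before any later
	 * sampling of ->dynticks by the grace-period kthread.
	 */
	atomic_add_return(1, &rdtp->dynticks);
}
</pre>

<p><tt>rcu_dynticks_eqs_exit()</tt> performs the mirror-image increment,
ordering the newly sampled value before any subsequent RCU read-side
critical sections.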
<p>The approach must be extended to handle one final case, that
of waking a task blocked in <tt>synchronize_rcu()</tt>.
This task might be affined to a CPU that is not yet aware that
the grace period has ended, and thus might not yet be subject to
the grace period's memory ordering.
Therefore, there is an <tt>smp_mb()</tt> after the return from
<tt>wait_for_completion()</tt> in the <tt>synchronize_rcu()</tt>
code path.

<table>
<tr><th>&nbsp;</th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
	What? Where???
	I don't see any <tt>smp_mb()</tt> after the return from
	<tt>wait_for_completion()</tt>!!!
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
	That would be because I spotted the need for that
	<tt>smp_mb()</tt> during the creation of this documentation,
	and it is therefore unlikely to hit mainline before v4.14.
	Kudos to Lance Roy, Will Deacon, Peter Zijlstra, and
	Jonathan Cameron for asking questions that sensitized me
	to the rather elaborate sequence of events that demonstrate
	the need for this memory barrier.
</font></td></tr>
<tr><td>&nbsp;</td></tr>
</table>

<p>Tree RCU's grace-period memory-ordering guarantees rely most
heavily on the <tt>rcu_node</tt> structure's <tt>->lock</tt>
field, so much so that it is necessary to abbreviate this pattern
in the diagrams in the next section.
For example, consider the <tt>rcu_prepare_for_idle()</tt> function
shown below, which is one of several functions that enforce ordering
of newly arrived RCU callbacks against future grace periods:

<pre>
 1 static void rcu_prepare_for_idle(void)
 2 {
 3   bool needwake;
 4   struct rcu_data *rdp;
 5   struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
 6   struct rcu_node *rnp;
 7   struct rcu_state *rsp;
 8   int tne;
 9
10   if (IS_ENABLED(CONFIG_RCU_NOCB_CPU_ALL) ||
11       rcu_is_nocb_cpu(smp_processor_id()))
12     return;
13   tne = READ_ONCE(tick_nohz_active);
14   if (tne != rdtp->tick_nohz_enabled_snap) {
15     if (rcu_cpu_has_callbacks(NULL))
16       invoke_rcu_core();
17     rdtp->tick_nohz_enabled_snap = tne;
18     return;
19   }
20   if (!tne)
21     return;
22   if (rdtp->all_lazy &&
23       rdtp->nonlazy_posted != rdtp->nonlazy_posted_snap) {
24     rdtp->all_lazy = false;
25     rdtp->nonlazy_posted_snap = rdtp->nonlazy_posted;
26     invoke_rcu_core();
27     return;
28   }
29   if (rdtp->last_accelerate == jiffies)
30     return;
31   rdtp->last_accelerate = jiffies;
32   for_each_rcu_flavor(rsp) {
33     rdp = this_cpu_ptr(rsp->rda);
34     if (!rcu_segcblist_pend_cbs(&rdp->cblist))
35       continue;
36     rnp = rdp->mynode;
37     raw_spin_lock_rcu_node(rnp);
38     needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
39     raw_spin_unlock_rcu_node(rnp);
40     if (needwake)
41       rcu_gp_kthread_wake(rsp);
42   }
43 }
</pre>

<p>But the only part of <tt>rcu_prepare_for_idle()</tt> that really
matters for this discussion is lines 37–39.
We will therefore abbreviate this function as follows:

</p><p><img src="rcu_node-lock.svg" alt="rcu_node-lock.svg">

<p>The box represents the <tt>rcu_node</tt> structure's <tt>->lock</tt>
critical section, with the double line on top representing the additional
<tt>smp_mb__after_unlock_lock()</tt>.
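
<p>To make the earlier <tt>wait_for_completion()</tt> discussion
concrete, here is a hedged sketch of the <tt>synchronize_rcu()</tt>
code path, modeled on <tt>wait_rcu_gp()</tt> but omitting the
multi-flavor and expedited-grace-period machinery:

<pre>
/* Hedged sketch of synchronize_rcu(), modeled on wait_rcu_gp(). */
void synchronize_rcu_sketch(void)
{
	struct rcu_synchronize rs;

	init_rcu_head_on_stack(&rs.head);
	init_completion(&rs.completion);
	call_rcu(&rs.head, wakeme_after_rcu); /* Invokes complete(). */
	wait_for_completion(&rs.completion);
	smp_mb(); /* Extend grace-period ordering to this task. */
	destroy_rcu_head_on_stack(&rs.head);
}
</pre>

<p>The final <tt>smp_mb()</tt> is the memory barrier referenced in the
Quick Quiz above.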
<h3><a name="Tree RCU Grace Period Memory Ordering Components">
Tree RCU Grace Period Memory Ordering Components</a></h3>

<p>Tree RCU's grace-period memory-ordering guarantee is provided by
a number of RCU components:

<ol>
<li>	<a href="#Callback Registry">Callback Registry</a>
<li>	<a href="#Grace-Period Initialization">Grace-Period Initialization</a>
<li>	<a href="#Self-Reported Quiescent States">
	Self-Reported Quiescent States</a>
<li>	<a href="#Dynamic Tick Interface">Dynamic Tick Interface</a>
<li>	<a href="#CPU-Hotplug Interface">CPU-Hotplug Interface</a>
<li>	<a href="#Forcing Quiescent States">Forcing Quiescent States</a>
<li>	<a href="#Grace-Period Cleanup">Grace-Period Cleanup</a>
<li>	<a href="#Callback Invocation">Callback Invocation</a>
</ol>

<p>Each of the following sections looks at the corresponding component
in detail.

<h4><a name="Callback Registry">Callback Registry</a></h4>

<p>If RCU's grace-period guarantee is to mean anything at all, any
access that happens before a given invocation of <tt>call_rcu()</tt>
must also happen before the corresponding grace period.
The implementation of this portion of RCU's grace period guarantee
is shown in the following figure:

</p><p><img src="TreeRCU-callback-registry.svg" alt="TreeRCU-callback-registry.svg">

<p>Because <tt>call_rcu()</tt> normally acts only on CPU-local state,
it provides no ordering guarantees, either for itself or for
phase one of the update (which again will usually be removal of
an element from an RCU-protected data structure).
It simply enqueues the <tt>rcu_head</tt> structure on a per-CPU list,
which cannot become associated with a grace period until a later
call to <tt>rcu_accelerate_cbs()</tt>, as shown in the diagram above.

<p>One set of code paths shown on the left invokes
<tt>rcu_accelerate_cbs()</tt> via
<tt>note_gp_changes()</tt>, either directly from <tt>call_rcu()</tt> (if
the current CPU is inundated with queued <tt>rcu_head</tt> structures)
or more likely from an <tt>RCU_SOFTIRQ</tt> handler.
Another code path in the middle is taken only in kernels built with
<tt>CONFIG_RCU_FAST_NO_HZ=y</tt>, which invokes
<tt>rcu_accelerate_cbs()</tt> via <tt>rcu_prepare_for_idle()</tt>.
The final code path on the right is taken only in kernels built with
<tt>CONFIG_HOTPLUG_CPU=y</tt>, which invokes
<tt>rcu_accelerate_cbs()</tt> via
<tt>rcu_advance_cbs()</tt>, <tt>rcu_migrate_callbacks()</tt>,
<tt>rcutree_migrate_callbacks()</tt>, and <tt>takedown_cpu()</tt>,
which in turn is invoked on a surviving CPU after the outgoing
CPU has been completely offlined.

<p>There are a few other code paths within grace-period processing
that opportunistically invoke <tt>rcu_accelerate_cbs()</tt>.
However, either way, all of the CPU's recently queued <tt>rcu_head</tt>
structures are associated with a future grace-period number under
the protection of the CPU's leaf <tt>rcu_node</tt> structure's
<tt>->lock</tt>.
In all cases, there is full ordering against any prior critical section
for that same <tt>rcu_node</tt> structure's <tt>->lock</tt>, and
also full ordering against any of the current task's or CPU's prior critical
sections for any <tt>rcu_node</tt> structure's <tt>->lock</tt>.
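
<p>As a concrete example of the two-phase update pattern whose phase
one must be ordered before the grace period, consider the following
hedged sketch; <tt>struct foo</tt>, <tt>foo_free()</tt>,
<tt>foo_remove()</tt>, and <tt>foo_lock</tt> are illustrative names,
not kernel APIs:

<pre>
struct foo {
	struct list_head list;
	struct rcu_head rh;
};

static DEFINE_SPINLOCK(foo_lock);

static void foo_free(struct rcu_head *rhp)
{
	/* Phase two: free, after the grace period has ended. */
	kfree(container_of(rhp, struct foo, rh));
}

void foo_remove(struct foo *fp)
{
	spin_lock(&foo_lock);
	list_del_rcu(&fp->list); /* Phase one: remove. */
	spin_unlock(&foo_lock);
	call_rcu(&fp->rh, foo_free); /* Enqueue only; no ordering yet. */
}
</pre>

<p>The ordering described above is what guarantees that any reader that
witnessed <tt>fp</tt> on the list cannot also witness the effects of
<tt>foo_free()</tt>.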
<p>The next section will show how this ordering ensures that any
accesses prior to the <tt>call_rcu()</tt> (particularly including phase
one of the update)
happen before the start of the corresponding grace period.

<table>
<tr><th>&nbsp;</th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
	But what about <tt>synchronize_rcu()</tt>?
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
	The <tt>synchronize_rcu()</tt> passes <tt>call_rcu()</tt>
	to <tt>wait_rcu_gp()</tt>, which invokes it.
	So either way, it eventually comes down to <tt>call_rcu()</tt>.
</font></td></tr>
<tr><td>&nbsp;</td></tr>
</table>

<h4><a name="Grace-Period Initialization">Grace-Period Initialization</a></h4>

<p>Grace-period initialization is carried out by
the grace-period kernel thread, which makes several passes over the
<tt>rcu_node</tt> tree within the <tt>rcu_gp_init()</tt> function.
This means that showing the full flow of ordering through the
grace-period computation will require duplicating this tree.
If you find this confusing, please note that the state of the
<tt>rcu_node</tt> tree changes over time, just like Heraclitus's river.
However, to keep the <tt>rcu_node</tt> river tractable, the
grace-period kernel thread's traversals are presented in multiple
parts, starting in this section with the various phases of
grace-period initialization.

<p>The first ordering-related grace-period initialization action is to
advance the <tt>rcu_state</tt> structure's <tt>->gp_seq</tt>
grace-period-number counter, as shown below:

</p><p><img src="TreeRCU-gp-init-1.svg" alt="TreeRCU-gp-init-1.svg" width="75%">

<p>The actual increment is carried out using <tt>smp_store_release()</tt>,
which helps reject false-positive RCU CPU stall detection.
Note that only the root <tt>rcu_node</tt> structure is touched.

<p>The first pass through the <tt>rcu_node</tt> tree updates bitmasks
based on CPUs having come online or gone offline since the start of
the previous grace period.
In the common case where the number of online CPUs for this <tt>rcu_node</tt>
structure has not transitioned to or from zero,
this pass will scan only the leaf <tt>rcu_node</tt> structures.
However, if the number of online CPUs for a given leaf <tt>rcu_node</tt>
structure has transitioned from zero,
<tt>rcu_init_new_rnp()</tt> will be invoked for the first incoming CPU.
Similarly, if the number of online CPUs for a given leaf <tt>rcu_node</tt>
structure has transitioned to zero,
<tt>rcu_cleanup_dead_rnp()</tt> will be invoked for the last outgoing CPU.
The diagram below shows the path of ordering if the leftmost
<tt>rcu_node</tt> structure onlines its first CPU and if the next
<tt>rcu_node</tt> structure has no online CPUs
(or, alternatively if the leftmost <tt>rcu_node</tt> structure offlines
its last CPU and if the next <tt>rcu_node</tt> structure has no online CPUs).

</p><p><img src="TreeRCU-gp-init-2.svg" alt="TreeRCU-gp-init-2.svg" width="75%">

<p>The final <tt>rcu_gp_init()</tt> pass through the <tt>rcu_node</tt>
tree traverses breadth-first, setting each <tt>rcu_node</tt> structure's
<tt>->gp_seq</tt> field to the newly advanced value from the
<tt>rcu_state</tt> structure, as shown in the following diagram.

</p><p><img src="TreeRCU-gp-init-3.svg" alt="TreeRCU-gp-init-3.svg" width="75%">
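
<p>In code, this final pass reduces to something like the following
hedged sketch, which omits the unrelated per-node initialization that
mainline carries out under the same lock:

<pre>
/* Hedged sketch of rcu_gp_init()'s final breadth-first pass. */
rcu_for_each_node_breadth_first(rsp, rnp) {
	raw_spin_lock_irq_rcu_node(rnp); /* Full ordering on acquisition. */
	WRITE_ONCE(rnp->gp_seq, rsp->gp_seq); /* Publish new grace period. */
	raw_spin_unlock_irq_rcu_node(rnp);
}
</pre>

<p>Because each <tt>->gp_seq</tt> store happens within a <tt>->lock</tt>
critical section that follows the root-level advance, no CPU can observe
the new grace period before that grace period has actually started.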
<p>This change will also cause each CPU's next call to
<tt>__note_gp_changes()</tt>
to notice that a new grace period has started, as described in the next
section.
But because the grace-period kthread started the grace period at the
root (with the advancing of the <tt>rcu_state</tt> structure's
<tt>->gp_seq</tt> field) before setting each leaf <tt>rcu_node</tt>
structure's <tt>->gp_seq</tt> field, each CPU's observation of
the start of the grace period will happen after the actual start
of the grace period.

<table>
<tr><th>&nbsp;</th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
	But what about the CPU that started the grace period?
	Why wouldn't it see the start of the grace period right when
	it started that grace period?
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
	In some deep philosophical and overly anthropomorphized
	sense, yes, the CPU starting the grace period is immediately
	aware of having done so.
	However, if we instead assume that RCU is not self-aware,
	then even the CPU starting the grace period does not really
	become aware of the start of this grace period until its
	first call to <tt>__note_gp_changes()</tt>.
	On the other hand, this CPU potentially gets early notification
	because it invokes <tt>__note_gp_changes()</tt> during its
	last <tt>rcu_gp_init()</tt> pass through its leaf
	<tt>rcu_node</tt> structure.
</font></td></tr>
<tr><td>&nbsp;</td></tr>
</table>

<h4><a name="Self-Reported Quiescent States">
Self-Reported Quiescent States</a></h4>

<p>When all entities that might block the grace period have reported
quiescent states (or as described in a later section, had quiescent
states reported on their behalf), the grace period can end.
Online non-idle CPUs report their own quiescent states, as shown
in the following diagram:

</p><p><img src="TreeRCU-qs.svg" alt="TreeRCU-qs.svg" width="75%">

<p>This is for the last CPU to report a quiescent state, which signals
the end of the grace period.
Earlier quiescent states would push up the <tt>rcu_node</tt> tree
only until they encountered an <tt>rcu_node</tt> structure that
is waiting for additional quiescent states.
However, ordering is nevertheless preserved because some later quiescent
state will acquire that <tt>rcu_node</tt> structure's <tt>->lock</tt>.

<p>Any number of events can lead up to a CPU invoking
<tt>note_gp_changes()</tt> (or alternatively, directly invoking
<tt>__note_gp_changes()</tt>), at which point that CPU will notice
the start of a new grace period while holding its leaf
<tt>rcu_node</tt> lock.
Therefore, all execution shown in this diagram happens after the
start of the grace period.
In addition, this CPU will consider any RCU read-side critical
section that started before the invocation of <tt>__note_gp_changes()</tt>
to have started before the grace period, and thus a critical
section that the grace period must wait on.

<table>
<tr><th>&nbsp;</th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
	But an RCU read-side critical section might have started
	after the beginning of the grace period
	(the advancing of <tt>->gp_seq</tt> from earlier), so why should
	the grace period wait on such a critical section?
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
	It is indeed not necessary for the grace period to wait on such
	a critical section.
	However, it is permissible to wait on it.
	And it is furthermore important to wait on it, as this
	lazy approach is far more scalable than a “big bang”
	all-at-once grace-period start could possibly be.
</font></td></tr>
<tr><td>&nbsp;</td></tr>
</table>
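
<p>For reference, a hedged sketch of the <tt>note_gp_changes()</tt>
locking pattern follows; the lockless fastpath check that skips the
lock when nothing has changed is omitted:

<pre>
/* Hedged sketch of note_gp_changes(); fastpath check omitted. */
static void note_gp_changes(struct rcu_state *rsp, struct rcu_data *rdp)
{
	unsigned long flags;
	bool needwake;
	struct rcu_node *rnp = rdp->mynode;

	local_irq_save(flags);
	if (!raw_spin_trylock_rcu_node(rnp)) { /* irqs already disabled. */
		local_irq_restore(flags);
		return; /* Try again later. */
	}
	needwake = __note_gp_changes(rsp, rnp, rdp);
	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
	if (needwake)
		rcu_gp_kthread_wake(rsp);
}
</pre>

<p>The key point is the <tt>raw_spin_trylock_rcu_node()</tt>, which
provides the same full ordering as the unconditional acquisition
functions whenever it succeeds.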
<p>If the CPU does a context switch, a quiescent state will be
noted by <tt>rcu_note_context_switch()</tt> on the left.
On the other hand, if the CPU takes a scheduler-clock interrupt
while executing in usermode, a quiescent state will be noted by
<tt>rcu_check_callbacks()</tt> on the right.
Either way, the passage through a quiescent state will be noted
in a per-CPU variable.

<p>The next time an <tt>RCU_SOFTIRQ</tt> handler executes on
this CPU (for example, after the next scheduler-clock
interrupt), <tt>__rcu_process_callbacks()</tt> will invoke
<tt>rcu_check_quiescent_state()</tt>, which will notice the
recorded quiescent state, and invoke
<tt>rcu_report_qs_rdp()</tt>.
If <tt>rcu_report_qs_rdp()</tt> verifies that the quiescent state
really does apply to the current grace period, it invokes
<tt>rcu_report_qs_rnp()</tt>, which traverses up the <tt>rcu_node</tt>
tree as shown at the bottom of the diagram, clearing bits from
each <tt>rcu_node</tt> structure's <tt>->qsmask</tt> field,
and propagating up the tree when the result is zero.

<p>Note that traversal passes upwards out of a given <tt>rcu_node</tt>
structure only if the current CPU is reporting the last quiescent
state for the subtree headed by that <tt>rcu_node</tt> structure.
A key point is that if a CPU's traversal stops at a given <tt>rcu_node</tt>
structure, then there will be a later traversal by another CPU
(or perhaps the same one) that proceeds upwards
from that point, and the <tt>rcu_node</tt> <tt>->lock</tt>
guarantees that the first CPU's quiescent state happens before the
remainder of the second CPU's traversal.
Applying this line of thought repeatedly shows that all CPUs'
quiescent states happen before the last CPU traverses through
the root <tt>rcu_node</tt> structure, the “last CPU”
being the one that clears the last bit in the root <tt>rcu_node</tt>
structure's <tt>->qsmask</tt> field.

<h4><a name="Dynamic Tick Interface">Dynamic Tick Interface</a></h4>

<p>Due to energy-efficiency considerations, RCU is forbidden from
disturbing idle CPUs.
CPUs are therefore required to notify RCU when entering or leaving idle
state, which they do via fully ordered value-returning atomic operations
on a per-CPU variable.
The ordering effects are as shown below:

</p><p><img src="TreeRCU-dyntick.svg" alt="TreeRCU-dyntick.svg" width="50%">

<p>The RCU grace-period kernel thread samples the per-CPU idleness
variable while holding the corresponding CPU's leaf <tt>rcu_node</tt>
structure's <tt>->lock</tt>.
This means that any RCU read-side critical sections that precede the
idle period (the oval near the top of the diagram above) will happen
before the end of the current grace period.
Similarly, the beginning of the current grace period will happen before
any RCU read-side critical sections that follow the
idle period (the oval near the bottom of the diagram above).
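
<p>The sampling side can be sketched as follows, again assuming the
illustrative even-is-idle encoding of <tt>->dynticks</tt> used in the
earlier idle-entry sketch:

<pre>
/* Hedged sketch of the grace-period kthread's idleness sampling. */
int rcu_dynticks_snap(struct rcu_dynticks *rdtp)
{
	return atomic_add_return(0, &rdtp->dynticks); /* Fully ordered. */
}

static bool rcu_dynticks_in_eqs(int snap)
{
	return !(snap & 0x1); /* Even: extended quiescent state. */
}

static bool rcu_dynticks_in_eqs_since(struct rcu_dynticks *rdtp, int snap)
{
	return snap != rcu_dynticks_snap(rdtp); /* Counter advanced? */
}
</pre>

<p>Because the <tt>atomic_add_return()</tt> of zero is fully ordered,
the sampled value cleanly divides the CPU's execution into accesses
ordered before and after the sample.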
<p>Plumbing this into the full grace-period execution is described
<a href="#Forcing Quiescent States">below</a>.

<h4><a name="CPU-Hotplug Interface">CPU-Hotplug Interface</a></h4>

<p>RCU is also forbidden from disturbing offline CPUs, which might well
be powered off and removed from the system completely.
CPUs are therefore required to notify RCU of their comings and goings
as part of the corresponding CPU hotplug operations.
The ordering effects are shown below:

</p><p><img src="TreeRCU-hotplug.svg" alt="TreeRCU-hotplug.svg" width="50%">

<p>Because CPU hotplug operations are much less frequent than idle transitions,
they are heavier weight, and thus acquire the CPU's leaf <tt>rcu_node</tt>
structure's <tt>->lock</tt> and update this structure's
<tt>->qsmaskinitnext</tt>.
The RCU grace-period kernel thread samples this mask to detect CPUs
having gone offline since the beginning of this grace period.

<p>Plumbing this into the full grace-period execution is described
<a href="#Forcing Quiescent States">below</a>.

<h4><a name="Forcing Quiescent States">Forcing Quiescent States</a></h4>

<p>As noted above, idle and offline CPUs cannot report their own
quiescent states, and therefore the grace-period kernel thread
must do the reporting on their behalf.
This process is called “forcing quiescent states”; it is
repeated every few jiffies, and its ordering effects are shown below:

</p><p><img src="TreeRCU-gp-fqs.svg" alt="TreeRCU-gp-fqs.svg" width="100%">

<p>Each pass of quiescent-state forcing is guaranteed to traverse the
leaf <tt>rcu_node</tt> structures, and if there are no new quiescent
states due to recently idled and/or offlined CPUs, then only the
leaves are traversed.
However, if there is a newly offlined CPU as illustrated on the left
or a newly idled CPU as illustrated on the right, the corresponding
quiescent state will be driven up towards the root.
As with self-reported quiescent states, the upwards driving stops
once it reaches an <tt>rcu_node</tt> structure that has quiescent
states outstanding from other CPUs.

<table>
<tr><th>&nbsp;</th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
	The leftmost drive to root stopped before it reached
	the root <tt>rcu_node</tt> structure, which means that
	there are still CPUs subordinate to that structure on
	which the current grace period is waiting.
	Given that, how is it possible that the rightmost drive
	to root ended the grace period?
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
	Good analysis!
	It is in fact impossible in the absence of bugs in RCU.
	But this diagram is complex enough as it is, so simplicity
	overrode accuracy.
	You can think of it as poetic license, or you can think of
	it as misdirection that is resolved in the
	<a href="#Putting It All Together">stitched-together diagram</a>.
</font></td></tr>
<tr><td>&nbsp;</td></tr>
</table>
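
<p>A hedged sketch of one forcing pass follows, modeled on
<tt>force_qs_rnp()</tt> with a simplified <tt>rcu_report_qs_rnp()</tt>
signature and with several checks (including the offline-CPU handling)
omitted; the function <tt>f()</tt> snapshots or re-checks each CPU's
<tt>->dynticks</tt> state:

<pre>
/* Hedged sketch of one quiescent-state-forcing pass. */
static void force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *rdp))
{
	int cpu;
	unsigned long flags;
	unsigned long mask;
	struct rcu_node *rnp;

	rcu_for_each_leaf_node(rsp, rnp) {
		mask = 0;
		raw_spin_lock_irqsave_rcu_node(rnp, flags);
		if (!rnp->qsmask) { /* No CPUs here block this grace period. */
			raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
			continue;
		}
		for (cpu = rnp->grplo; cpu <= rnp->grphi; cpu++) {
			unsigned long bit = 1UL << (cpu - rnp->grplo);

			if ((rnp->qsmask & bit) &&
			    f(per_cpu_ptr(rsp->rda, cpu)))
				mask |= bit;
		}
		if (mask)
			/* Drive up the tree; releases rnp->lock. */
			rcu_report_qs_rnp(mask, rsp, rnp, flags);
		else
			raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
	}
}
</pre>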
<h4><a name="Grace-Period Cleanup">Grace-Period Cleanup</a></h4>

<p>Grace-period cleanup first scans the <tt>rcu_node</tt> tree
breadth-first advancing all the <tt>->gp_seq</tt> fields, then it
advances the <tt>rcu_state</tt> structure's <tt>->gp_seq</tt> field.
The ordering effects are shown below:

</p><p><img src="TreeRCU-gp-cleanup.svg" alt="TreeRCU-gp-cleanup.svg" width="75%">

<p>As indicated by the oval at the bottom of the diagram, once
grace-period cleanup is complete, the next grace period can begin.

<table>
<tr><th>&nbsp;</th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
	But when precisely does the grace period end?
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
	There is no useful single point at which the grace period
	can be said to end.
	The earliest reasonable candidate is as soon as the last
	CPU has reported its quiescent state, but it may be some
	milliseconds before RCU becomes aware of this.
	The latest reasonable candidate is once the <tt>rcu_state</tt>
	structure's <tt>->gp_seq</tt> field has been updated,
	but it is quite possible that some CPUs have already completed
	phase two of their updates by that time.
	In short, if you are going to work with RCU, you need to
	learn to embrace uncertainty.
</font></td></tr>
<tr><td>&nbsp;</td></tr>
</table>

<h4><a name="Callback Invocation">Callback Invocation</a></h4>

<p>Once a given CPU's leaf <tt>rcu_node</tt> structure's
<tt>->gp_seq</tt> field has been updated, that CPU can begin
invoking its RCU callbacks that were waiting for this grace period
to end.
These callbacks are identified by <tt>rcu_advance_cbs()</tt>,
which is usually invoked by <tt>__note_gp_changes()</tt>.
As shown in the diagram below, this invocation can be triggered by
the scheduling-clock interrupt (<tt>rcu_check_callbacks()</tt> on
the left) or by idle entry (<tt>rcu_cleanup_after_idle()</tt> on
the right, but only for kernels built with
<tt>CONFIG_RCU_FAST_NO_HZ=y</tt>).
Either way, <tt>RCU_SOFTIRQ</tt> is raised, which results in
<tt>rcu_do_batch()</tt> invoking the callbacks, which in turn
allows those callbacks to carry out (either directly or indirectly
via wakeup) the needed phase-two processing for each update.

</p><p><img src="TreeRCU-callback-invocation.svg" alt="TreeRCU-callback-invocation.svg" width="60%">

<p>Please note that callback invocation can also be prompted by any
number of corner-case code paths, for example, when a CPU notes that
it has excessive numbers of callbacks queued.
In all cases, the CPU acquires its leaf <tt>rcu_node</tt> structure's
<tt>->lock</tt> before invoking callbacks, which preserves the
required ordering against the newly completed grace period.

<p>However, if the callback function communicates to other CPUs,
for example, doing a wakeup, then it is that function's responsibility
to maintain ordering.
For example, if the callback function wakes up a task that runs on
some other CPU, proper ordering must be in place in both the callback
function and the task being awakened.
To see why this is important, consider the top half of the
<a href="#Grace-Period Cleanup">grace-period cleanup</a> diagram.
The callback might be running on a CPU corresponding to the leftmost
leaf <tt>rcu_node</tt> structure, and awaken a task that is to run on
a CPU corresponding to the rightmost leaf <tt>rcu_node</tt> structure,
and the grace-period kernel thread might not yet have reached the
rightmost leaf.
In this case, the grace period's memory ordering might not yet have
reached that CPU, so again the callback function and the awakened
task must supply proper ordering.
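
<p>One way to supply that ordering is via a completion, as in the
following hedged sketch; <tt>struct foo</tt> and its fields are again
illustrative names rather than kernel APIs:

<pre>
/* Hedged sketch: a callback that hands off to a task on another CPU. */
struct foo {
	struct rcu_head rh;
	struct completion done;
};

static void foo_cb(struct rcu_head *rhp)
{
	struct foo *fp = container_of(rhp, struct foo, rh);

	complete(&fp->done); /* Wakes up foo_wait() below. */
}

/* Waiter, possibly running on some other CPU: */
static void foo_wait(struct foo *fp)
{
	wait_for_completion(&fp->done);
	smp_mb(); /* Ensure this task sees the grace period's ordering. */
}
</pre>

<p>This is the same pattern used by <tt>synchronize_rcu()</tt> itself,
as shown in the sketch earlier in this document.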
<h3><a name="Putting It All Together">Putting It All Together</a></h3>

<p>A stitched-together diagram is
<a href="Tree-RCU-Diagram.html">here</a>.

<h3><a name="Legal Statement">
Legal Statement</a></h3>

<p>This work represents the view of the author and does not necessarily
represent the view of IBM.

</p><p>Linux is a registered trademark of Linus Torvalds.

</p><p>Other company, product, and service names may be trademarks or
service marks of others.

</body></html>