<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head><title>A Tour Through TREE_RCU's Grace-Period Memory Ordering</title>
<meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
</head><body>

<p>August 8, 2017</p>
<p>This article was contributed by Paul E. McKenney</p>

<h3>Introduction</h3>

<p>This document gives a rough visual overview of how Tree RCU's
grace-period memory ordering guarantee is provided.

<ol>
<li>	<a href="#What Is Tree RCU's Grace Period Memory Ordering Guarantee?">
	What Is Tree RCU's Grace Period Memory Ordering Guarantee?</a>
<li>	<a href="#Tree RCU Grace Period Memory Ordering Building Blocks">
	Tree RCU Grace Period Memory Ordering Building Blocks</a>
<li>	<a href="#Tree RCU Grace Period Memory Ordering Components">
	Tree RCU Grace Period Memory Ordering Components</a>
<li>	<a href="#Putting It All Together">Putting It All Together</a>
</ol>

<h3><a name="What Is Tree RCU's Grace Period Memory Ordering Guarantee?">
What Is Tree RCU's Grace Period Memory Ordering Guarantee?</a></h3>

<p>RCU grace periods provide extremely strong memory-ordering guarantees
for non-idle non-offline code.
Any code that happens after the end of a given RCU grace period is guaranteed
to see the effects of all accesses prior to the beginning of that grace
period that are within RCU read-side critical sections.
Similarly, any code that happens before the beginning of a given RCU grace
period is guaranteed not to see the effects of any accesses following the
end of that grace period that are within RCU read-side critical sections.

<p>Note well that RCU-sched read-side critical sections include any region
of code for which preemption is disabled.
Given that each individual machine instruction can be thought of as
an extremely small region of preemption-disabled code, one can think of
<tt>synchronize_rcu()</tt> as <tt>smp_mb()</tt> on steroids.

<p>RCU updaters use this guarantee by splitting their updates into
two phases, one of which is executed before the grace period and
the other of which is executed after the grace period.
In the most common use case, phase one removes an element from
a linked RCU-protected data structure, and phase two frees that element.
For this to work, any readers that have witnessed state prior to the
phase-one update (in the common case, removal) must not witness state
following the phase-two update (in the common case, freeing).

<p>The RCU implementation provides this guarantee using a network
of lock-based critical sections, memory barriers, and per-CPU
processing, as is described in the following sections.

<h3><a name="Tree RCU Grace Period Memory Ordering Building Blocks">
Tree RCU Grace Period Memory Ordering Building Blocks</a></h3>

<p>The workhorse for RCU's grace-period memory ordering is the
critical section for the <tt>rcu_node</tt> structure's
<tt>->lock</tt>.
These critical sections use helper functions for lock acquisition, including
<tt>raw_spin_lock_rcu_node()</tt>,
<tt>raw_spin_lock_irq_rcu_node()</tt>, and
<tt>raw_spin_lock_irqsave_rcu_node()</tt>.
Their lock-release counterparts are
<tt>raw_spin_unlock_rcu_node()</tt>,
<tt>raw_spin_unlock_irq_rcu_node()</tt>, and
<tt>raw_spin_unlock_irqrestore_rcu_node()</tt>,
respectively.
For completeness, a
<tt>raw_spin_trylock_rcu_node()</tt>
is also provided.
The key point is that the lock-acquisition functions, including
<tt>raw_spin_trylock_rcu_node()</tt>, all invoke
<tt>smp_mb__after_unlock_lock()</tt> immediately after successful
acquisition of the lock.
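<p>For concreteness, the following sketch shows how such an
acquisition helper can be composed.
It is patterned after the helpers in <tt>kernel/rcu/rcu.h</tt>, but
exact details vary across kernel versions:

<pre>
/*
 * Sketch of an rcu_node ->lock acquisition helper (approximate and
 * version-dependent).  The extra barrier gives a prior release of any
 * rcu_node ->lock plus this acquisition the effect of a full memory
 * barrier.
 */
#define raw_spin_lock_rcu_node(p)				\
do {								\
	raw_spin_lock(&ACCESS_PRIVATE(p, lock));		\
	smp_mb__after_unlock_lock();				\
} while (0)
</pre>

<p>As of this writing, <tt>smp_mb__after_unlock_lock()</tt> is a no-op
everywhere except PowerPC, where an unlock followed by a lock does not
otherwise imply a full barrier, so that it must emit <tt>smp_mb()</tt>.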
<p>Therefore, for any given <tt>rcu_node</tt> structure, any access
happening before one of the above lock-release functions will be seen
by all CPUs as happening before any access happening after a later
one of the above lock-acquisition functions.
Furthermore, any access happening before one of the
above lock-release functions on any given CPU will be seen by all
CPUs as happening before any access happening after a later one
of the above lock-acquisition functions executing on that same CPU,
even if the lock-release and lock-acquisition functions are operating
on different <tt>rcu_node</tt> structures.
Tree RCU uses these two ordering guarantees to form an ordering
network among all CPUs that were in any way involved in the grace
period, including any CPUs that came online or went offline during
the grace period in question.

<p>The following litmus test exhibits the ordering effects of these
lock-acquisition and lock-release functions:

<pre>
int x, y, z;

void task0(void)
{
  raw_spin_lock_rcu_node(rnp);
  WRITE_ONCE(x, 1);
  r1 = READ_ONCE(y);
  raw_spin_unlock_rcu_node(rnp);
}

void task1(void)
{
  raw_spin_lock_rcu_node(rnp);
  WRITE_ONCE(y, 1);
  r2 = READ_ONCE(z);
  raw_spin_unlock_rcu_node(rnp);
}

void task2(void)
{
  WRITE_ONCE(z, 1);
  smp_mb();
  r3 = READ_ONCE(x);
}

WARN_ON(r1 == 0 && r2 == 0 && r3 == 0);
</pre>

<p>The <tt>WARN_ON()</tt> is evaluated at “the end of time”,
after all changes have propagated throughout the system.
Without the <tt>smp_mb__after_unlock_lock()</tt> provided by the
acquisition functions, this <tt>WARN_ON()</tt> could trigger, for example
on PowerPC.
The <tt>smp_mb__after_unlock_lock()</tt> invocations prevent this
<tt>WARN_ON()</tt> from triggering.

<p>This approach must be extended to include idle CPUs, which need
RCU's grace-period memory ordering guarantee to extend to any
RCU read-side critical sections preceding and following the current
idle sojourn.
This case is handled by calls to the strongly ordered
<tt>atomic_add_return()</tt> read-modify-write atomic operation that
is invoked within <tt>rcu_dynticks_eqs_enter()</tt> at idle-entry
time and within <tt>rcu_dynticks_eqs_exit()</tt> at idle-exit time.
The grace-period kthread invokes <tt>rcu_dynticks_snap()</tt> and
<tt>rcu_dynticks_in_eqs_since()</tt> (both of which invoke
an <tt>atomic_add_return()</tt> of zero) to detect idle CPUs.

<table>
<tr><th>&nbsp;</th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
	But what about CPUs that remain offline for the entire
	grace period?
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
	Such CPUs will be offline at the beginning of the grace period,
	so the grace period won't expect quiescent states from them.
	Races between grace-period start and CPU-hotplug operations
	are mediated by the CPU's leaf <tt>rcu_node</tt> structure's
	<tt>->lock</tt> as described above.
</font></td></tr>
<tr><td>&nbsp;</td></tr>
</table>
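<p>Returning to idle CPUs, the following minimal sketch shows the
idle-entry and idle-exit counter manipulation.
It is patterned after <tt>rcu_dynticks_eqs_enter()</tt> and
<tt>rcu_dynticks_eqs_exit()</tt>, but the real code's control bits and
additional sanity checks are omitted, and the even-means-idle encoding
shown in the comments is an assumption for illustration:

<pre>
/*
 * Minimal sketch of idle entry/exit (approximate and version-dependent).
 * The fully ordered atomic_add_return() prevents any prior RCU read-side
 * critical section's accesses from leaking into the idle sojourn, and
 * any later critical section's accesses from leaking back before it.
 */
static void rcu_dynticks_eqs_enter(void)
{
	struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
	int seq;

	seq = atomic_add_return(1, &rdtp->dynticks);	/* Full ordering. */
	WARN_ON_ONCE(seq & 0x1);   /* Assumed encoding: even means idle. */
}

static void rcu_dynticks_eqs_exit(void)
{
	struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
	int seq;

	seq = atomic_add_return(1, &rdtp->dynticks);	/* Full ordering. */
	WARN_ON_ONCE(!(seq & 0x1));	/* Odd means non-idle. */
}
</pre>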
<p>The approach must be extended to handle one final case, that
of waking a task blocked in <tt>synchronize_rcu()</tt>.
This task might be affinitied to a CPU that is not yet aware that
the grace period has ended, and thus might not yet be subject to
the grace period's memory ordering.
Therefore, there is an <tt>smp_mb()</tt> after the return from
<tt>wait_for_completion()</tt> in the <tt>synchronize_rcu()</tt>
code path.

<table>
<tr><th>&nbsp;</th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
	What?  Where???
	I don't see any <tt>smp_mb()</tt> after the return from
	<tt>wait_for_completion()</tt>!!!
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
	That would be because I spotted the need for that
	<tt>smp_mb()</tt> during the creation of this documentation,
	and it is therefore unlikely to hit mainline before v4.14.
	Kudos to Lance Roy, Will Deacon, Peter Zijlstra, and
	Jonathan Cameron for asking questions that sensitized me
	to the rather elaborate sequence of events that demonstrate
	the need for this memory barrier.
</font></td></tr>
<tr><td>&nbsp;</td></tr>
</table>

<p>Tree RCU's grace-period memory-ordering guarantees rely most
heavily on the <tt>rcu_node</tt> structure's <tt>->lock</tt>
field, so much so that it is necessary to abbreviate this pattern
in the diagrams in the next section.
For example, consider the <tt>rcu_prepare_for_idle()</tt> function
shown below, which is one of several functions that enforce ordering
of newly arrived RCU callbacks against future grace periods:

<pre>
 1 static void rcu_prepare_for_idle(void)
 2 {
 3   bool needwake;
 4   struct rcu_data *rdp;
 5   struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
 6   struct rcu_node *rnp;
 7   struct rcu_state *rsp;
 8   int tne;
 9
10   if (IS_ENABLED(CONFIG_RCU_NOCB_CPU_ALL) ||
11       rcu_is_nocb_cpu(smp_processor_id()))
12     return;
13   tne = READ_ONCE(tick_nohz_active);
14   if (tne != rdtp->tick_nohz_enabled_snap) {
15     if (rcu_cpu_has_callbacks(NULL))
16       invoke_rcu_core();
17     rdtp->tick_nohz_enabled_snap = tne;
18     return;
19   }
20   if (!tne)
21     return;
22   if (rdtp->all_lazy &&
23       rdtp->nonlazy_posted != rdtp->nonlazy_posted_snap) {
24     rdtp->all_lazy = false;
25     rdtp->nonlazy_posted_snap = rdtp->nonlazy_posted;
26     invoke_rcu_core();
27     return;
28   }
29   if (rdtp->last_accelerate == jiffies)
30     return;
31   rdtp->last_accelerate = jiffies;
32   for_each_rcu_flavor(rsp) {
33     rdp = this_cpu_ptr(rsp->rda);
34     if (!rcu_segcblist_pend_cbs(&rdp->cblist))
35       continue;
36     rnp = rdp->mynode;
37     raw_spin_lock_rcu_node(rnp);
38     needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
39     raw_spin_unlock_rcu_node(rnp);
40     if (needwake)
41       rcu_gp_kthread_wake(rsp);
42   }
43 }
</pre>

<p>But the only lines of <tt>rcu_prepare_for_idle()</tt> that really
matter for this discussion are lines 37–39.
We will therefore abbreviate this function as follows:

</p><p><img src="rcu_node-lock.svg" alt="rcu_node-lock.svg">

<p>The box represents the <tt>rcu_node</tt> structure's <tt>->lock</tt>
critical section, with the double line on top representing the additional
<tt>smp_mb__after_unlock_lock()</tt>.
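<p>In code form, the box therefore stands for nothing more than this
locking pattern, restated from lines 36–39 of the listing above with
comments added:

<pre>
rnp = rdp->mynode;
raw_spin_lock_rcu_node(rnp);    /* Acquire ->lock, then smp_mb__after_unlock_lock(). */
needwake = rcu_accelerate_cbs(rsp, rnp, rdp); /* Assign callbacks to a future grace period. */
raw_spin_unlock_rcu_node(rnp);  /* Release ->lock. */
</pre>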
<h3><a name="Tree RCU Grace Period Memory Ordering Components">
Tree RCU Grace Period Memory Ordering Components</a></h3>

<p>Tree RCU's grace-period memory-ordering guarantee is provided by
a number of RCU components:

<ol>
<li>	<a href="#Callback Registry">Callback Registry</a>
<li>	<a href="#Grace-Period Initialization">Grace-Period Initialization</a>
<li>	<a href="#Self-Reported Quiescent States">
	Self-Reported Quiescent States</a>
<li>	<a href="#Dynamic Tick Interface">Dynamic Tick Interface</a>
<li>	<a href="#CPU-Hotplug Interface">CPU-Hotplug Interface</a>
<li>	<a href="#Forcing Quiescent States">Forcing Quiescent States</a>
<li>	<a href="#Grace-Period Cleanup">Grace-Period Cleanup</a>
<li>	<a href="#Callback Invocation">Callback Invocation</a>
</ol>

<p>Each of the following sections looks at the corresponding component
in detail.

<h4><a name="Callback Registry">Callback Registry</a></h4>

<p>If RCU's grace-period guarantee is to mean anything at all, any
access that happens before a given invocation of <tt>call_rcu()</tt>
must also happen before the corresponding grace period.
The implementation of this portion of RCU's grace-period guarantee
is shown in the following figure:

</p><p><img src="TreeRCU-callback-registry.svg" alt="TreeRCU-callback-registry.svg">

<p>Because <tt>call_rcu()</tt> normally acts only on CPU-local state,
it provides no ordering guarantees, either for itself or for
phase one of the update (which again will usually be removal of
an element from an RCU-protected data structure).
It simply enqueues the <tt>rcu_head</tt> structure on a per-CPU list,
which cannot become associated with a grace period until a later
call to <tt>rcu_accelerate_cbs()</tt>, as shown in the diagram above.

<p>One set of code paths shown on the left invokes
<tt>rcu_accelerate_cbs()</tt> via
<tt>note_gp_changes()</tt>, either directly from <tt>call_rcu()</tt> (if
the current CPU is inundated with queued <tt>rcu_head</tt> structures)
or more likely from an <tt>RCU_SOFTIRQ</tt> handler.
Another code path in the middle is taken only in kernels built with
<tt>CONFIG_RCU_FAST_NO_HZ=y</tt>, which invokes
<tt>rcu_accelerate_cbs()</tt> via <tt>rcu_prepare_for_idle()</tt>.
The final code path on the right is taken only in kernels built with
<tt>CONFIG_HOTPLUG_CPU=y</tt>, which invokes
<tt>rcu_accelerate_cbs()</tt> via
<tt>rcu_advance_cbs()</tt>, <tt>rcu_migrate_callbacks()</tt>,
<tt>rcutree_migrate_callbacks()</tt>, and <tt>takedown_cpu()</tt>,
which in turn is invoked on a surviving CPU after the outgoing
CPU has been completely offlined.

<p>There are a few other code paths within grace-period processing
that opportunistically invoke <tt>rcu_accelerate_cbs()</tt>.
However, either way, all of the CPU's recently queued <tt>rcu_head</tt>
structures are associated with a future grace-period number under
the protection of the CPU's leaf <tt>rcu_node</tt> structure's
<tt>->lock</tt>.
In all cases, there is full ordering against any prior critical section
for that same <tt>rcu_node</tt> structure's <tt>->lock</tt>, and
also full ordering against any of the current task's or CPU's prior critical
sections for any <tt>rcu_node</tt> structure's <tt>->lock</tt>.
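<p>For concreteness, here is a minimal sketch of the enqueue step
described above.
Function and field names are approximate and version-dependent, and the
real <tt>call_rcu()</tt> also handles debug checks, laziness, and
callback-overload processing:

<pre>
/*
 * Minimal sketch of call_rcu()'s enqueue step (approximate and
 * version-dependent).  Note the complete absence of memory barriers:
 * the callback is queued CPU-locally, and becomes associated with a
 * grace period only later, under the leaf rcu_node structure's ->lock.
 */
void call_rcu_sketch(struct rcu_head *head, rcu_callback_t func)
{
	unsigned long flags;
	struct rcu_data *rdp;
	struct rcu_state *rsp = &rcu_sched_state; /* E.g., the RCU-sched flavor. */

	head->func = func;
	head->next = NULL;
	local_irq_save(flags);		/* CPU-local: no ordering needed. */
	rdp = this_cpu_ptr(rsp->rda);	/* This CPU's rcu_data structure. */
	rcu_segcblist_enqueue(&rdp->cblist, head); /* Argument list varies by version. */
	local_irq_restore(flags);
}
</pre>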
<p>The next section will show how this ordering ensures that any
accesses prior to the <tt>call_rcu()</tt> (particularly including phase
one of the update)
happen before the start of the corresponding grace period.

<table>
<tr><th>&nbsp;</th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
	But what about <tt>synchronize_rcu()</tt>?
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
	The <tt>synchronize_rcu()</tt> function passes <tt>call_rcu()</tt>
	to <tt>wait_rcu_gp()</tt>, which invokes it.
	So either way, it eventually comes down to <tt>call_rcu()</tt>.
</font></td></tr>
<tr><td>&nbsp;</td></tr>
</table>

<h4><a name="Grace-Period Initialization">Grace-Period Initialization</a></h4>

<p>Grace-period initialization is carried out by
the grace-period kernel thread, which makes several passes over the
<tt>rcu_node</tt> tree within the <tt>rcu_gp_init()</tt> function.
This means that showing the full flow of ordering through the
grace-period computation will require duplicating this tree.
If you find this confusing, please note that the state of the
<tt>rcu_node</tt> tree changes over time, just like Heraclitus's river.
However, to keep the <tt>rcu_node</tt> river tractable, the
grace-period kernel thread's traversals are presented in multiple
parts, starting in this section with the various phases of
grace-period initialization.

<p>The first ordering-related grace-period initialization action is to
advance the <tt>rcu_state</tt> structure's <tt>->gp_seq</tt>
grace-period-number counter, as shown below:

</p><p><img src="TreeRCU-gp-init-1.svg" alt="TreeRCU-gp-init-1.svg" width="75%">

<p>The actual increment is carried out using <tt>smp_store_release()</tt>,
which helps reject false-positive RCU CPU stall detection.
Note that only the root <tt>rcu_node</tt> structure is touched.

<p>The first pass through the <tt>rcu_node</tt> tree updates bitmasks
based on CPUs having come online or gone offline since the start of
the previous grace period.
In the common case where the number of online CPUs for this <tt>rcu_node</tt>
structure has not transitioned to or from zero,
this pass will scan only the leaf <tt>rcu_node</tt> structures.
However, if the number of online CPUs for a given leaf <tt>rcu_node</tt>
structure has transitioned from zero,
<tt>rcu_init_new_rnp()</tt> will be invoked for the first incoming CPU.
Similarly, if the number of online CPUs for a given leaf <tt>rcu_node</tt>
structure has transitioned to zero,
<tt>rcu_cleanup_dead_rnp()</tt> will be invoked for the last outgoing CPU.
The diagram below shows the path of ordering if the leftmost
<tt>rcu_node</tt> structure onlines its first CPU and if the next
<tt>rcu_node</tt> structure has no online CPUs
(or, alternatively, if the leftmost <tt>rcu_node</tt> structure offlines
its last CPU and if the next <tt>rcu_node</tt> structure has no online CPUs).

</p><p><img src="TreeRCU-gp-init-2.svg" alt="TreeRCU-gp-init-2.svg" width="75%">

<p>The final <tt>rcu_gp_init()</tt> pass through the <tt>rcu_node</tt>
tree traverses breadth-first, setting each <tt>rcu_node</tt> structure's
<tt>->gp_seq</tt> field to the newly advanced value from the
<tt>rcu_state</tt> structure, as shown in the following diagram.

</p><p><img src="TreeRCU-gp-init-3.svg" alt="TreeRCU-gp-init-3.svg" width="75%">
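<p>A condensed sketch of this final pass appears below.
It is patterned after the corresponding loop in <tt>rcu_gp_init()</tt>,
but the real loop also initializes preemptible-RCU state, applies
optional initialization delays, and notes the change for the current CPU:

<pre>
/*
 * Condensed sketch of rcu_gp_init()'s final pass (approximate).
 * Because each node's ->gp_seq is published under that node's ->lock,
 * and because the traversal is breadth-first, the root's new value is
 * ordered before each leaf's new value.
 */
rcu_for_each_node_breadth_first(rsp, rnp) {
	raw_spin_lock_irq_rcu_node(rnp); /* Includes smp_mb__after_unlock_lock(). */
	rnp->qsmask = rnp->qsmaskinit;	 /* CPUs this node must hear from. */
	WRITE_ONCE(rnp->gp_seq, rsp->gp_seq); /* Publish the new grace period. */
	raw_spin_unlock_irq_rcu_node(rnp);
}
</pre>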
<p>This change will also cause each CPU's next call to
<tt>__note_gp_changes()</tt>
to notice that a new grace period has started, as described in the next
section.
But because the grace-period kthread started the grace period at the
root (with the advancing of the <tt>rcu_state</tt> structure's
<tt>->gp_seq</tt> field) before setting each leaf <tt>rcu_node</tt>
structure's <tt>->gp_seq</tt> field, each CPU's observation of
the start of the grace period will happen after the actual start
of the grace period.

<table>
<tr><th>&nbsp;</th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
	But what about the CPU that started the grace period?
	Why wouldn't it see the start of the grace period right when
	it started that grace period?
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
	In some deep philosophical and overly anthropomorphized
	sense, yes, the CPU starting the grace period is immediately
	aware of having done so.
	However, if we instead assume that RCU is not self-aware,
	then even the CPU starting the grace period does not really
	become aware of the start of this grace period until its
	first call to <tt>__note_gp_changes()</tt>.
	On the other hand, this CPU potentially gets early notification
	because it invokes <tt>__note_gp_changes()</tt> during its
	last <tt>rcu_gp_init()</tt> pass through its leaf
	<tt>rcu_node</tt> structure.
</font></td></tr>
<tr><td>&nbsp;</td></tr>
</table>

<h4><a name="Self-Reported Quiescent States">
Self-Reported Quiescent States</a></h4>

<p>When all entities that might block the grace period have reported
quiescent states (or as described in a later section, had quiescent
states reported on their behalf), the grace period can end.
Online non-idle CPUs report their own quiescent states, as shown
in the following diagram:

</p><p><img src="TreeRCU-qs.svg" alt="TreeRCU-qs.svg" width="75%">

<p>This is for the last CPU to report a quiescent state, which signals
the end of the grace period.
Earlier quiescent states would push up the <tt>rcu_node</tt> tree
only until they encountered an <tt>rcu_node</tt> structure that
is waiting for additional quiescent states.
However, ordering is nevertheless preserved because some later quiescent
state will acquire that <tt>rcu_node</tt> structure's <tt>->lock</tt>.

<p>Any number of events can lead up to a CPU invoking
<tt>note_gp_changes()</tt> (or alternatively, directly invoking
<tt>__note_gp_changes()</tt>), at which point that CPU will notice
the start of a new grace period while holding its leaf
<tt>rcu_node</tt> lock.
Therefore, all execution shown in this diagram happens after the
start of the grace period.
In addition, this CPU will consider any RCU read-side critical
section that started before the invocation of <tt>__note_gp_changes()</tt>
to have started before the grace period, and thus a critical
section that the grace period must wait on.

<table>
<tr><th>&nbsp;</th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
	But an RCU read-side critical section might have started
	after the beginning of the grace period
	(the advancing of <tt>->gp_seq</tt> from earlier), so why should
	the grace period wait on such a critical section?
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
	It is indeed not necessary for the grace period to wait on such
	a critical section.
	However, it is permissible to wait on it.
	And it is furthermore important to wait on it, as this
	lazy approach is far more scalable than a “big bang”
	all-at-once grace-period start could possibly be.
</font></td></tr>
<tr><td>&nbsp;</td></tr>
</table>
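<p>For reference, here is a condensed sketch of
<tt>note_gp_changes()</tt>.
It is approximate and version-dependent; in particular, the real
function first checks whether anything has actually changed before
bothering with the lock:

<pre>
/*
 * Condensed sketch of note_gp_changes() (approximate).  The new grace
 * period is noticed while holding the leaf rcu_node structure's ->lock,
 * so everything this CPU does afterwards is ordered after the start of
 * that grace period.
 */
static void note_gp_changes(struct rcu_state *rsp, struct rcu_data *rdp)
{
	unsigned long flags;
	bool needwake;
	struct rcu_node *rnp;

	local_irq_save(flags);
	rnp = rdp->mynode;
	if (!raw_spin_trylock_rcu_node(rnp)) {	/* Contended?  Retry later. */
		local_irq_restore(flags);
		return;
	}
	needwake = __note_gp_changes(rsp, rnp, rdp); /* Under the leaf ->lock. */
	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
	if (needwake)
		rcu_gp_kthread_wake(rsp);
}
</pre>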
<p>If the CPU does a context switch, a quiescent state will be
noted by <tt>rcu_note_context_switch()</tt> on the left.
On the other hand, if the CPU takes a scheduler-clock interrupt
while executing in usermode, a quiescent state will be noted by
<tt>rcu_sched_clock_irq()</tt> on the right.
Either way, the passage through a quiescent state will be noted
in a per-CPU variable.

<p>The next time an <tt>RCU_SOFTIRQ</tt> handler executes on
this CPU (for example, after the next scheduler-clock
interrupt), <tt>rcu_core()</tt> will invoke
<tt>rcu_check_quiescent_state()</tt>, which will notice the
recorded quiescent state and invoke
<tt>rcu_report_qs_rdp()</tt>.
If <tt>rcu_report_qs_rdp()</tt> verifies that the quiescent state
really does apply to the current grace period, it invokes
<tt>rcu_report_qs_rnp()</tt>, which traverses up the <tt>rcu_node</tt>
tree as shown at the bottom of the diagram, clearing bits from
each <tt>rcu_node</tt> structure's <tt>->qsmask</tt> field,
and propagating up the tree when the result is zero.

<p>Note that traversal passes upwards out of a given <tt>rcu_node</tt>
structure only if the current CPU is reporting the last quiescent
state for the subtree headed by that <tt>rcu_node</tt> structure.
A key point is that if a CPU's traversal stops at a given <tt>rcu_node</tt>
structure, then there will be a later traversal by another CPU
(or perhaps the same one) that proceeds upwards
from that point, and the <tt>rcu_node</tt> <tt>->lock</tt>
guarantees that the first CPU's quiescent state happens before the
remainder of the second CPU's traversal.
Applying this line of thought repeatedly shows that all CPUs'
quiescent states happen before the last CPU traverses through
the root <tt>rcu_node</tt> structure, the “last CPU”
being the one that clears the last bit in the root <tt>rcu_node</tt>
structure's <tt>->qsmask</tt> field.

<h4><a name="Dynamic Tick Interface">Dynamic Tick Interface</a></h4>

<p>Due to energy-efficiency considerations, RCU is forbidden from
disturbing idle CPUs.
CPUs are therefore required to notify RCU when entering or leaving idle
state, which they do via fully ordered value-returning atomic operations
on a per-CPU variable.
The ordering effects are as shown below:

</p><p><img src="TreeRCU-dyntick.svg" alt="TreeRCU-dyntick.svg" width="50%">

<p>The RCU grace-period kernel thread samples the per-CPU idleness
variable while holding the corresponding CPU's leaf <tt>rcu_node</tt>
structure's <tt>->lock</tt>.
This means that any RCU read-side critical sections that precede the
idle period (the oval near the top of the diagram above) will happen
before the end of the current grace period.
Similarly, the beginning of the current grace period will happen before
any RCU read-side critical sections that follow the
idle period (the oval near the bottom of the diagram above).
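<p>A minimal sketch of the sampling side appears below.
It is patterned after <tt>rcu_dynticks_snap()</tt> and
<tt>rcu_dynticks_in_eqs_since()</tt>, with the real code's control-bit
masking omitted:

<pre>
/*
 * Minimal sketch of the grace-period kthread's idleness sampling
 * (approximate and version-dependent).  The atomic_add_return() of
 * zero is a fully ordered read of the per-CPU counter.
 */
static int rcu_dynticks_snap(struct rcu_dynticks *rdtp)
{
	return atomic_add_return(0, &rdtp->dynticks);	/* Ordered sample. */
}

/* Has this CPU passed through an extended quiescent state since "snap"? */
static bool rcu_dynticks_in_eqs_since(struct rcu_dynticks *rdtp, int snap)
{
	return snap != rcu_dynticks_snap(rdtp); /* Counter moved, so yes. */
}
</pre>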
<p>Plumbing this into the full grace-period execution is described
<a href="#Forcing Quiescent States">below</a>.

<h4><a name="CPU-Hotplug Interface">CPU-Hotplug Interface</a></h4>

<p>RCU is also forbidden from disturbing offline CPUs, which might well
be powered off and removed from the system completely.
CPUs are therefore required to notify RCU of their comings and goings
as part of the corresponding CPU-hotplug operations.
The ordering effects are shown below:

</p><p><img src="TreeRCU-hotplug.svg" alt="TreeRCU-hotplug.svg" width="50%">

<p>Because CPU-hotplug operations are much less frequent than idle transitions,
they are heavier weight, and thus acquire the CPU's leaf <tt>rcu_node</tt>
structure's <tt>->lock</tt> and update this structure's
<tt>->qsmaskinitnext</tt> field.
The RCU grace-period kernel thread samples this mask to detect CPUs
having gone offline since the beginning of this grace period.

<p>Plumbing this into the full grace-period execution is described
<a href="#Forcing Quiescent States">below</a>.

<h4><a name="Forcing Quiescent States">Forcing Quiescent States</a></h4>

<p>As noted above, idle and offline CPUs cannot report their own
quiescent states, and therefore the grace-period kernel thread
must do the reporting on their behalf.
This process is called “forcing quiescent states”; it is
repeated every few jiffies, and its ordering effects are shown below:

</p><p><img src="TreeRCU-gp-fqs.svg" alt="TreeRCU-gp-fqs.svg" width="100%">

<p>Each pass of quiescent-state forcing is guaranteed to traverse the
leaf <tt>rcu_node</tt> structures, and if there are no new quiescent
states due to recently idled and/or offlined CPUs, then only the
leaves are traversed.
However, if there is a newly offlined CPU as illustrated on the left
or a newly idled CPU as illustrated on the right, the corresponding
quiescent state will be driven up towards the root.
As with self-reported quiescent states, the upwards driving stops
once it reaches an <tt>rcu_node</tt> structure that has quiescent
states outstanding from other CPUs.

<table>
<tr><th>&nbsp;</th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
	The leftmost drive to root stopped before it reached
	the root <tt>rcu_node</tt> structure, which means that
	there are still CPUs subordinate to that structure on
	which the current grace period is waiting.
	Given that, how is it possible that the rightmost drive
	to root ended the grace period?
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
	Good analysis!
	It is in fact impossible in the absence of bugs in RCU.
	But this diagram is complex enough as it is, so simplicity
	overrode accuracy.
	You can think of it as poetic license, or you can think of
	it as misdirection that is resolved in the
	<a href="#Putting It All Together">stitched-together diagram</a>.
</font></td></tr>
<tr><td>&nbsp;</td></tr>
</table>
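<p>A condensed sketch of one such forcing pass appears below.
It is patterned after <tt>force_qs_rnp()</tt>, but is approximate and
version-dependent, and the <tt>cpu_is_idle_or_offline()</tt> predicate
is a hypothetical stand-in for the real dyntick-idle and
<tt>->qsmaskinitnext</tt> checks:

<pre>
/*
 * Condensed sketch of one quiescent-state-forcing pass (approximate).
 * For each leaf rcu_node structure, collect the still-blocking CPUs
 * that are now idle or offline, then report quiescent states on
 * their behalf.
 */
rcu_for_each_leaf_node(rsp, rnp) {
	unsigned long flags;
	unsigned long mask = 0;
	int cpu;

	raw_spin_lock_irqsave_rcu_node(rnp, flags);
	for_each_leaf_node_possible_cpu(rnp, cpu)
		if ((rnp->qsmask & leaf_node_cpu_bit(rnp, cpu)) &&
		    cpu_is_idle_or_offline(cpu))	/* Hypothetical predicate. */
			mask |= leaf_node_cpu_bit(rnp, cpu);
	if (mask)
		rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags); /* Releases ->lock. */
	else
		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
}
</pre>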
<h4><a name="Grace-Period Cleanup">Grace-Period Cleanup</a></h4>

<p>Grace-period cleanup first scans the <tt>rcu_node</tt> tree
breadth-first, advancing all the <tt>->gp_seq</tt> fields, then it
advances the <tt>rcu_state</tt> structure's <tt>->gp_seq</tt> field.
The ordering effects are shown below:

</p><p><img src="TreeRCU-gp-cleanup.svg" alt="TreeRCU-gp-cleanup.svg" width="75%">

<p>As indicated by the oval at the bottom of the diagram, once
grace-period cleanup is complete, the next grace period can begin.

<table>
<tr><th>&nbsp;</th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
	But when precisely does the grace period end?
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
	There is no useful single point at which the grace period
	can be said to end.
	The earliest reasonable candidate is as soon as the last
	CPU has reported its quiescent state, but it may be some
	milliseconds before RCU becomes aware of this.
	The latest reasonable candidate is once the <tt>rcu_state</tt>
	structure's <tt>->gp_seq</tt> field has been updated,
	but it is quite possible that some CPUs have already completed
	phase two of their updates by that time.
	In short, if you are going to work with RCU, you need to
	learn to embrace uncertainty.
</font></td></tr>
<tr><td>&nbsp;</td></tr>
</table>


<h4><a name="Callback Invocation">Callback Invocation</a></h4>

<p>Once a given CPU's leaf <tt>rcu_node</tt> structure's
<tt>->gp_seq</tt> field has been updated, that CPU can begin
invoking its RCU callbacks that were waiting for this grace period
to end.
These callbacks are identified by <tt>rcu_advance_cbs()</tt>,
which is usually invoked by <tt>__note_gp_changes()</tt>.
As shown in the diagram below, this invocation can be triggered by
the scheduling-clock interrupt (<tt>rcu_sched_clock_irq()</tt> on
the left) or by idle entry (<tt>rcu_cleanup_after_idle()</tt> on
the right, but only for kernels built with
<tt>CONFIG_RCU_FAST_NO_HZ=y</tt>).
Either way, <tt>RCU_SOFTIRQ</tt> is raised, which results in
<tt>rcu_do_batch()</tt> invoking the callbacks, which in turn
allows those callbacks to carry out (either directly or indirectly
via wakeup) the needed phase-two processing for each update.

</p><p><img src="TreeRCU-callback-invocation.svg" alt="TreeRCU-callback-invocation.svg" width="60%">

<p>Please note that callback invocation can also be prompted by any
number of corner-case code paths, for example, when a CPU notes that
it has excessive numbers of callbacks queued.
In all cases, the CPU acquires its leaf <tt>rcu_node</tt> structure's
<tt>->lock</tt> before invoking callbacks, which preserves the
required ordering against the newly completed grace period.

<p>However, if the callback function communicates to other CPUs,
for example, doing a wakeup, then it is that function's responsibility
to maintain ordering.
For example, if the callback function wakes up a task that runs on
some other CPU, proper ordering must be in place in both the callback
function and the task being awakened.
To see why this is important, consider the top half of the
<a href="#Grace-Period Cleanup">grace-period cleanup</a> diagram.
The callback might be running on a CPU corresponding to the leftmost
leaf <tt>rcu_node</tt> structure, and awaken a task that is to run on
a CPU corresponding to the rightmost leaf <tt>rcu_node</tt> structure,
and the grace-period kernel thread might not yet have reached the
rightmost leaf.
In this case, the grace period's memory ordering might not yet have
reached that CPU, so again the callback function and the awakened
task must supply proper ordering.
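<p>Before moving on, here is a condensed sketch of the invocation step
itself.
It is patterned after <tt>rcu_do_batch()</tt>, but is approximate and
version-dependent, with batch limits, counters, and re-enqueueing of
excess callbacks omitted:

<pre>
/*
 * Condensed sketch of callback invocation (approximate).  Each
 * callback carries out phase two of its update, for example by
 * kfree()ing the element removed in phase one.
 */
struct rcu_cblist rcl = RCU_CBLIST_INITIALIZER(rcl);
struct rcu_head *rhp;
unsigned long flags;

local_irq_save(flags);
rcu_segcblist_extract_done_cbs(&rdp->cblist, &rcl); /* GP already ended. */
local_irq_restore(flags);

for (rhp = rcu_cblist_dequeue(&rcl); rhp; rhp = rcu_cblist_dequeue(&rcl))
	rhp->func(rhp);		/* Phase two of the update. */
</pre>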
<h3><a name="Putting It All Together">Putting It All Together</a></h3>

<p>A stitched-together diagram is
<a href="Tree-RCU-Diagram.html">here</a>.

<h3><a name="Legal Statement">
Legal Statement</a></h3>

<p>This work represents the view of the author and does not necessarily
represent the view of IBM.

</p><p>Linux is a registered trademark of Linus Torvalds.

</p><p>Other company, product, and service names may be trademarks or
service marks of others.

</body></html>