Data-Structure Relationships
============================

.. kernel-figure:: BigTreeClassicRCU.svg

This diagram shows an enclosing ``rcu_state`` structure containing a
tree of ``rcu_node`` structures. Each leaf node of the ``rcu_node`` tree
has up to 16 ``rcu_data`` structures associated with it. For example, a
number of Linux distributions set ``NR_CPUS=4096``,
which results in a three-level ``rcu_node`` tree.

The purpose of this combining tree is to allow per-CPU events
such as quiescent states, dyntick-idle transitions,
and CPU-hotplug operations to be processed efficiently and scalably.
Quiescent states are recorded by the per-CPU ``rcu_data`` structures,
and other events are recorded by the leaf-level ``rcu_node``
structures. All of these events are combined at each level of the tree
until finally grace periods are completed at the tree's root
``rcu_node`` structure.

As can be seen from the diagram, on a 64-bit system
a two-level tree with 64 leaves can accommodate 1,024 CPUs, with a fanout
of 64 at the root and a fanout of 16 at the leaves.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Why isn't the fanout at the leaves also 64?                           |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| Because there are more types of events that affect the leaf-level     |
| ``rcu_node`` structures than further up the tree. Therefore, if the   |
| leaf ``rcu_node`` structures have fanout of 64, the contention on     |
| these structures' ``->lock`` fields becomes excessive.                |
| Experimentation on systems with hundreds or thousands of CPUs may     |
| demonstrate that the fanout for the non-leaf ``rcu_node`` structures  |
| must also be reduced. If you run into contention problems on the      |
| non-leaf ``rcu_node`` structures, you may use the                     |
| ``CONFIG_RCU_FANOUT`` kernel configuration parameter to reduce the    |
| non-leaf fanout as needed.                                            |
+-----------------------------------------------------------------------+

If your system has more than 1,024 CPUs (or more than 512 CPUs on a
32-bit system), then RCU will automatically add more levels to the tree.
For example, if you are crazy enough to build a 64-bit system with
65,536 CPUs, RCU would configure the ``rcu_node`` tree as follows:

.. kernel-figure:: HugeTreeClassicRCU.svg

RCU currently permits up to a four-level tree, which on a 64-bit system
accommodates up to 4,194,304 CPUs, though only a mere 524,288 CPUs for
32-bit systems. On the other hand, you can set both
``CONFIG_RCU_FANOUT`` and ``CONFIG_RCU_FANOUT_LEAF`` to
2, which would result in a 16-CPU test using a 4-level tree. This can be
useful for testing large-system capabilities on small test machines.

This multi-level combining tree allows us to get most of the performance
and scalability benefits of partitioning, even though RCU grace-period
detection is inherently a global operation. The trick here is that only
the last CPU to report a quiescent state into a given ``rcu_node``
structure need advance to the ``rcu_node`` structure at the next level
up the tree. This means that at the leaf-level ``rcu_node`` structure,
only one access out of sixteen will progress up the tree, and for the
internal ``rcu_node`` structures, the situation is even more extreme:
Only one access out of sixty-four will progress up the tree. Because the
vast majority of the CPUs do not progress up the tree, no matter how
many CPUs there are in the system, at most 64 quiescent-state reports
per grace period will make it all the way to the root ``rcu_node``
structure, thereby keeping lock contention at the root acceptably low.
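
To make the level arithmetic concrete, here is a stand-alone user-space
C sketch (an illustration only, not kernel code; the helper name is made
up for this example) that computes how many levels a given CPU count
requires with the documented fanouts::

    #include <stdio.h>

    /*
     * Illustrative only: number of rcu_node levels needed for ncpus,
     * given the leaf fanout (16 by default) and the non-leaf fanout
     * (64 by default on 64-bit systems).
     */
    static int rcu_tree_levels(long ncpus, int leaf_fanout, int fanout)
    {
            long capacity = leaf_fanout;
            int levels = 1;

            while (capacity < ncpus) {
                    capacity *= fanout;
                    levels++;
            }
            return levels;
    }

    int main(void)
    {
            printf("1,024 CPUs need %d levels\n",
                   rcu_tree_levels(1024, 16, 64));
            printf("4,096 CPUs need %d levels\n",
                   rcu_tree_levels(4096, 16, 64));
            printf("4,194,304 CPUs need %d levels\n",
                   rcu_tree_levels(4194304, 16, 64));
            return 0;
    }

This reproduces the figures above: two levels suffice for 1,024 CPUs,
three for 4,096, and four for the 4,194,304-CPU maximum.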

.. kernel-figure:: BigTreePreemptRCUBHdyntickCB.svg

It is important to note that different data structures can have very
different ideas about RCU's state at any given time. For but one
example, awareness of the start or end of a given RCU grace period
propagates slowly through the data structures. This slow propagation is
absolutely necessary for RCU to have good read-side performance. If this
balkanized implementation seems foreign to you, one useful trick is to
consider each instance of these data structures to be a different
person, each having the usual slightly different view of reality.

The general role of each of these data structures is as follows:

#. ``rcu_state``: This structure forms the interconnection between the
   ``rcu_node`` and ``rcu_data`` structures, tracks grace periods,
   serves as short-term repository for callbacks orphaned by CPU-hotplug
   events, maintains ``rcu_barrier()`` state, maintains expedited
   grace-period state, and maintains state used to force quiescent
   states when grace periods extend too long.
#. ``rcu_node``: This structure forms the combining tree that propagates
   quiescent-state information from the leaves to the root, and also
   propagates grace-period information from the root to the leaves. It
   provides local copies of the grace-period state in order to allow
   this information to be accessed in a synchronized manner without
   suffering the scalability limitations that would otherwise be imposed
   by global locking. In ``CONFIG_PREEMPT_RCU`` kernels, it manages the
   lists of tasks that have blocked while in their current RCU read-side
   critical section. In ``CONFIG_PREEMPT_RCU`` kernels built with
   ``CONFIG_RCU_BOOST``, it manages the per-\ ``rcu_node``
   priority-boosting kernel threads (kthreads) and state. Finally, it
   records CPU-hotplug state in order to determine which CPUs should be
   ignored during a given grace period.
#. ``rcu_data``: This per-CPU structure is the focus of quiescent-state
   detection and RCU callback queuing. It also tracks its relationship
   to the corresponding leaf ``rcu_node`` structure to allow
   more-efficient propagation of quiescent states up the ``rcu_node``
   combining tree. Like the ``rcu_node`` structure, it provides a local
   copy of the grace-period information to allow for-free synchronized
   access to this information from the corresponding CPU. Finally, this
   structure records past dyntick-idle state for the corresponding CPU
   and also tracks statistics.
#. ``rcu_head``: This structure represents RCU callbacks, and is the
   only structure allocated and managed by RCU users. The ``rcu_head``
   structure is normally embedded within the RCU-protected data
   structure.

The ``rcu_state`` structure is the base structure that represents the
state of RCU in the system. This structure forms the interconnection
between the ``rcu_node`` and ``rcu_data`` structures, tracks grace
periods, contains the lock used to synchronize with CPU-hotplug events,
and maintains state used to force quiescent states when grace periods
extend too long.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Wait a minute! You said that the ``rcu_node`` structures formed a     |
| tree, but they are declared as a flat array! What gives?              |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| The tree is laid out in the array. The first node in the array is     |
| the head, the next set of nodes in the array are children of the      |
| head node, and so on until the last set of nodes in the array are     |
| the leaves.                                                           |
+-----------------------------------------------------------------------+

The ``rcu_node`` tree is embedded into the ``->node[]`` array as shown
in the following figure:

.. kernel-figure:: TreeMapping.svg

One interesting consequence of this mapping is that a breadth-first
traversal of the tree is implemented as a simple linear scan of the
array, which is in fact what the ``rcu_for_each_node_breadth_first()``
macro does.

Each entry of the ``->level`` array references the first ``rcu_node``
structure on the corresponding level of the tree, for example, as shown
below:

.. kernel-figure:: TreeMappingLevel.svg

For whatever it is worth, if you draw the tree to be tree-shaped rather
than array-shaped, it is easy to draw a planar representation:

.. kernel-figure:: TreeLevel.svg

Finally, the ``->rda`` field references a per-CPU pointer to the
corresponding CPU's ``rcu_data`` structure.

Grace-Period Tracking
~~~~~~~~~~~~~~~~~~~~~

RCU grace periods are numbered, and the ``->gp_seq`` field contains the
current grace-period sequence number. The bottom two bits are the state
of the current grace period, which can be zero for not yet started or
one for in progress. In other words, if the bottom two bits of
``->gp_seq`` are zero, then RCU is idle. Any other value in the bottom
two bits indicates that something is broken. This field is protected by
the root ``rcu_node`` structure's ``->lock`` field.
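
The following stand-alone sketch illustrates this encoding. It is
modeled loosely on the kernel's ``rcu_seq_start()``, ``rcu_seq_end()``,
and ``rcu_seq_state()`` helpers, but simplified for illustration::

    #include <stdio.h>

    #define RCU_SEQ_CTR_SHIFT  2
    #define RCU_SEQ_STATE_MASK ((1UL << RCU_SEQ_CTR_SHIFT) - 1)

    /* Bottom two bits: zero means idle, one means in progress. */
    static unsigned long seq_state(unsigned long s)
    {
            return s & RCU_SEQ_STATE_MASK;
    }

    /* Start a grace period: bottom bits go from zero to one. */
    static void seq_start(unsigned long *sp)
    {
            *sp += 1;
    }

    /* End a grace period: advance the counter, bottom bits back to zero. */
    static void seq_end(unsigned long *sp)
    {
            *sp = (*sp | RCU_SEQ_STATE_MASK) + 1;
    }

    int main(void)
    {
            unsigned long gp_seq = 0;

            seq_start(&gp_seq); /* state now 1: grace period in progress */
            printf("gp_seq = %#lx, state = %lu\n", gp_seq, seq_state(gp_seq));
            seq_end(&gp_seq);   /* state back to 0: RCU is idle */
            printf("gp_seq = %#lx, state = %lu\n", gp_seq, seq_state(gp_seq));
            return 0;
    }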

There are ``->gp_seq`` fields in the ``rcu_node`` and ``rcu_data``
structures as well. The fields in the ``rcu_state`` structure represent
the most current values, and those of the other structures are compared
in order to detect the beginnings and ends of grace periods in a
distributed fashion. The values flow from ``rcu_state`` to ``rcu_node``
(down the tree from the root to the leaves) to ``rcu_data``.

The ``->gp_max`` field tracks the duration of the longest grace period
in jiffies. It is protected by the root ``rcu_node``'s ``->lock``.

The ``->name`` and ``->abbr`` fields distinguish between preemptible RCU
(“rcu_preempt” and “p”) and non-preemptible RCU (“rcu_sched” and “s”).
These fields are used for diagnostic and tracing purposes.

The ``rcu_node`` structures form the combining tree that propagates
quiescent-state information from the leaves to the root and also that
propagates grace-period information from the root down to the leaves.
They provide local copies of the grace-period state in order to allow
this information to be accessed in a synchronized manner without
suffering the scalability limitations that would otherwise be imposed by
global locking. In ``CONFIG_PREEMPT_RCU`` kernels, they manage the lists
of tasks that have blocked while in their current RCU read-side critical
section. In ``CONFIG_PREEMPT_RCU`` kernels built with
``CONFIG_RCU_BOOST``, they manage the per-\ ``rcu_node``
priority-boosting kernel threads (kthreads) and state. Finally, they
record CPU-hotplug state in order to determine which CPUs should be
ignored during a given grace period.

The ``->parent`` pointer references the ``rcu_node`` one level up in the
tree, and is ``NULL`` for the root ``rcu_node``. The RCU implementation
makes heavy use of this field to push quiescent states up the tree. The
``->level`` field gives the level in the tree, with the root being at
level zero, its children at level one, and so on. The ``->grpnum`` field
gives this node's position within the child nodes of its parent, so this
number can range between 0 and 31 on 32-bit systems and between 0 and 63
on 64-bit systems. The ``->level`` and ``->grpnum`` fields are used only
during initialization and for tracing. The ``->grpmask`` field is the
bitmask counterpart of ``->grpnum``, and therefore always has exactly
one bit set. This mask is used to clear the bit corresponding to this
``rcu_node`` structure in its parent's bitmasks, which are described
later. Finally, the ``->grplo`` and ``->grphi`` fields contain the
lowest and highest numbered CPUs served by this ``rcu_node`` structure.
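
An abridged sketch of the fields discussed so far follows; the full
definition in kernel/rcu/tree.h contains many more fields::

    struct rcu_node {
            raw_spinlock_t lock;     /* protects the fields below */
            unsigned long gp_seq;    /* grace-period sequence number */
            unsigned long qsmask;    /* children still owing a QS report */
            unsigned long grpmask;   /* this node's bit in the parent's */
                                     /*  masks: always 1UL << grpnum */
            int grplo;               /* lowest-numbered CPU served here */
            int grphi;               /* highest-numbered CPU served here */
            u8 grpnum;               /* position among parent's children */
            u8 level;                /* tree level; the root is level zero */
            struct rcu_node *parent; /* NULL at the root */
    };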

.. _grace-period-tracking-1:

Grace-Period Tracking
~~~~~~~~~~~~~~~~~~~~~

The ``rcu_node`` structures' ``->gp_seq`` fields are the counterparts of
the field of the same name in the ``rcu_state`` structure. They each may
lag up to one step behind their ``rcu_state`` counterpart. If the bottom
two bits of a given ``rcu_node`` structure's ``->gp_seq`` field are
zero, then this ``rcu_node`` structure believes that RCU is idle.

The ``->gp_seq_needed`` fields record the furthest-in-the-future grace
period request seen by the corresponding ``rcu_node`` structure. The
request is considered fulfilled when the value of the ``->gp_seq`` field
equals or exceeds that of the ``->gp_seq_needed`` field.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Suppose that this ``rcu_node`` structure doesn't see a request for a  |
| very long time. Won't wrapping of the ``->gp_seq`` field cause        |
| problems?                                                             |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| No, because if the ``->gp_seq_needed`` field lags behind the          |
| ``->gp_seq`` field, the ``->gp_seq_needed`` field will be updated at  |
| the end of the grace period. Modulo-arithmetic comparisons therefore  |
| will always get the correct answer, even with wrapping.               |
+-----------------------------------------------------------------------+
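
Such wrap-tolerant comparisons are what the kernel's ``ULONG_CMP_GE()``
macro provides; the stand-alone sketch below shows why they keep working
across a wrap::

    #include <limits.h>
    #include <stdio.h>

    /* True if a is at or after b, even if the counter wrapped between. */
    #define ULONG_CMP_GE(a, b) (ULONG_MAX / 2 >= (a) - (b))

    int main(void)
    {
            unsigned long needed = ULONG_MAX - 2; /* request before wrap */
            unsigned long gp_seq = 5;             /* current, after wrap */

            /* Naive ">=" says no; modulo arithmetic correctly says yes. */
            printf("naive:  %d\n", gp_seq >= needed);             /* 0 */
            printf("modulo: %d\n", ULONG_CMP_GE(gp_seq, needed)); /* 1 */
            return 0;
    }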

Quiescent-State Tracking
~~~~~~~~~~~~~~~~~~~~~~~~

These fields manage the propagation of quiescent states up the combining
tree.

The ``->qsmask`` field tracks which of this ``rcu_node`` structure's
children still need to report quiescent states for the current normal
grace period. Such children will have a value of 1 in their respective
bits. Note that the leaf ``rcu_node`` structures should be thought of as
having ``rcu_data`` structures as their children. Similarly, the
``->expmask`` field tracks which of this ``rcu_node`` structure's
children still need to report quiescent states for the current expedited
grace period. An expedited grace period has the same conceptual
properties as a normal grace period, but the expedited implementation
accepts extreme CPU overhead to obtain much lower
grace-period latency, for example, consuming a few tens of microseconds
worth of CPU time to reduce grace-period duration from milliseconds to
tens of microseconds. The ``->qsmaskinit`` field tracks which of this
``rcu_node`` structure's children cover for at least one online CPU.
This mask is used to initialize ``->qsmask``, and ``->expmaskinit`` is
used to initialize ``->expmask``, at the beginning of the normal and
expedited grace periods, respectively.
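
The following simplified sketch shows the resulting upward propagation.
It is loosely based on the kernel's ``rcu_report_qs_rnp()``, with
locking and grace-period-sequence checking omitted, and it reuses the
``struct rcu_node`` sketch shown earlier::

    /*
     * Clear the reporting children's bits in rnp->qsmask.  If this
     * rcu_node still has children that owe quiescent states, stop.
     * Otherwise, propagate this node's own bit one level up, until
     * either a non-empty mask or the root is reached.
     */
    static void report_qs_rnp(struct rcu_node *rnp, unsigned long mask)
    {
            for (;;) {
                    rnp->qsmask &= ~mask;
                    if (rnp->qsmask != 0 || rnp->parent == NULL)
                            break;
                    mask = rnp->grpmask; /* our one bit in the parent */
                    rnp = rnp->parent;
            }
    }

When this loop clears the last bit in the root's ``->qsmask``, the
current grace period can end.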

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Why are these bitmasks protected by locking? Come on, haven't you     |
| heard of lockless programming???                                      |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| Lockless grace-period computation! Such a tantalizing possibility!    |
| But consider the following sequence of events:                        |
|                                                                       |
| #. CPU 0 has been in dyntick-idle mode for quite some time. When it   |
|    wakes up, it notices that the current RCU grace period needs it    |
|    to report a quiescent state.                                       |
| #. CPU 0's scheduling-clock interrupt fires in the middle of an RCU   |
|    read-side critical section, and notices that the RCU core needs    |
|    something, so it commences RCU softirq processing.                 |
| #. Meanwhile, another CPU completes the current grace period and      |
|    starts a new one, so CPU 0 reports its quiescent state against     |
|    the wrong grace period.                                            |
|    That grace period might now end before the RCU read-side critical  |
|    section completes, and disaster will ensue.                        |
|                                                                       |
| So the locking is absolutely required in order to coordinate clearing |
| of the bits with updating of the grace-period sequence number in      |
| ``->gp_seq``.                                                         |
+-----------------------------------------------------------------------+

Blocked-Task Management
~~~~~~~~~~~~~~~~~~~~~~~

``CONFIG_PREEMPT_RCU`` allows tasks to be preempted in the midst of their RCU
read-side critical sections, and these tasks must be tracked explicitly.
The details of exactly why and how they are tracked will be covered in a
separate article on RCU read-side processing. For now, it is enough to
know that the ``rcu_node`` structure tracks them.

The ``->blkd_tasks`` field is a list header for the list of blocked and
preempted tasks. As tasks undergo context switches within RCU read-side
critical sections, their ``task_struct`` structures are enqueued (via
the ``task_struct``'s ``->rcu_node_entry`` field) onto the head of the
``->blkd_tasks`` list for the leaf ``rcu_node`` structure corresponding
to the CPU on which the outgoing context switch executed. As these tasks
later exit their RCU read-side critical sections, they remove themselves
from the list. This list is therefore in reverse time order, so that if
one of the tasks is blocking the current grace period, all subsequent
tasks must also be blocking that same grace period. Therefore, a single
pointer into this list suffices to track all tasks blocking a given
grace period. That pointer is stored in ``->gp_tasks`` for normal grace
periods and in ``->exp_tasks`` for expedited grace periods. These last
two fields are ``NULL`` if their respective grace period is not waiting
on any tasks. If the task that one of these pointers references
removes itself from the ``->blkd_tasks`` list, then that task must
advance the pointer to the next task on the list, or set the pointer to
``NULL`` if there are no more tasks on the list.

For example, suppose that tasks T1, T2, and T3 are all hard-affinitied
to the largest-numbered CPU in the system. Then if task T1 blocked in an
RCU read-side critical section, then an expedited grace period started,
then task T2 blocked in an RCU read-side critical section, then a normal
grace period started, and finally task T3 blocked in an RCU read-side
critical section, then the state of the last leaf ``rcu_node``
structure's blocked-task list would be as shown below:

.. kernel-figure:: blkd_task.svg

Task T1 is blocking both grace periods, task T2 is blocking only the
normal grace period, and task T3 is blocking neither grace period. Note
that these tasks will not remove themselves from this list immediately
upon resuming execution. They will instead remain on the list until they
execute the outermost ``rcu_read_unlock()`` that ends their RCU
read-side critical section.

The ``->wait_blkd_tasks`` field indicates whether or not the current
grace period is waiting on a blocked task.

The ``rcu_node`` array is sized via a series of C-preprocessor
expressions in kernel/rcu/tree.h, a listing of which the following
paragraphs reference by line number.

The maximum number of levels in the ``rcu_node`` structure is currently
limited to four, as specified by lines 21-24 and the structure of the
subsequent “if” statement. For 32-bit systems, this allows
16*32*32*32=524,288 CPUs, which should be sufficient for the next few
years at least. For 64-bit systems, 16*64*64*64=4,194,304 CPUs is
allowed, which should see us through the next decade or so. This
four-level tree also allows kernels built with ``CONFIG_RCU_FANOUT=8``
to support up to 4096 CPUs, which might be useful in very large systems
having eight CPUs per socket. In addition, building kernels with a full
four levels of ``rcu_node`` tree permits better testing of RCU's
combining-tree code.

The ``RCU_FANOUT`` symbol controls how many children are allowed at
each non-leaf level of the ``rcu_node`` tree. If the
``CONFIG_RCU_FANOUT`` Kconfig option is not specified, it is set based
on the word size of the system. Experience has shown that allowing a
single leaf ``rcu_node`` structure to handle 64 CPUs, as permitted by the
number of bits in the ``->qsmask`` field on a 64-bit system, results in
excessive contention for the leaf ``rcu_node`` structures' ``->lock``
fields, which is why the default leaf-level fanout is instead 16.

Lines 11-19 perform this computation.

Lines 21-24 compute the maximum number of CPUs supported by a
single-level (which contains a single ``rcu_node`` structure),
two-level, three-level, and four-level ``rcu_node`` tree, respectively,
given the fanout specified by ``RCU_FANOUT`` and ``RCU_FANOUT_LEAF``.
These maximum numbers of CPUs are retained in the ``RCU_FANOUT_1``,
``RCU_FANOUT_2``, ``RCU_FANOUT_3``, and ``RCU_FANOUT_4`` C-preprocessor
variables, respectively.

These variables are used to control the C-preprocessor ``#if`` statement
spanning lines 26-66 that computes the number of ``rcu_node`` structures
required for each level of the tree, as well as the number of levels
required. The number of levels is placed into the ``NUM_RCU_LVLS``
C-preprocessor variable by lines 27, 35, 44, and 54. The number of
``rcu_node`` structures for the topmost level of the tree is always
exactly one, and this value is unconditionally placed into
``NUM_RCU_LVL_0`` by lines 28, 36, 45, and 55. The rest of the levels
(if any) of the ``rcu_node`` tree are computed by dividing the maximum
number of CPUs by the fanout supported by the number of levels from the
current level down, rounding up. This computation is performed by
lines 37, 46-47, and 56-58. Lines 31-33, 40-42, 50-52, and 62-63 create
initializers for lockdep lock-class names. Finally, lines 64-66 produce
an error if the maximum number of CPUs is too large for the specified
fanout.

Callbacks on the middle segments are associated with specific grace
periods, and different CPUs can have different ideas about which grace
period is current, hence the ``->gp_seq`` field.
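
For reference, the structure itself is small. This abridged version
follows include/linux/rcu_segcblist.h as of the kernels described
here::

    #define RCU_DONE_TAIL        0  /* callbacks ready to invoke */
    #define RCU_WAIT_TAIL        1  /* waiting for the current GP */
    #define RCU_NEXT_READY_TAIL  2  /* waiting for the next GP */
    #define RCU_NEXT_TAIL        3  /* not yet associated with a GP */
    #define RCU_CBLIST_NSEGS     4

    struct rcu_segcblist {
            struct rcu_head *head;
            struct rcu_head **tails[RCU_CBLIST_NSEGS];
            unsigned long gp_seq[RCU_CBLIST_NSEGS];
            long len;
            long len_lazy;
    };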

The ``->head`` pointer references the first callback or is ``NULL`` if
the list contains no callbacks (which is *not* the same as being empty).
Each element of the ``->tails[]`` array references the ``->next``
pointer of the last callback in the corresponding segment of the list,
or the list's ``->head`` pointer if that segment and all previous
segments are empty. If the corresponding segment is empty but some
previous segment is not, then the array element is identical to its
predecessor. Older callbacks are closer to the head of the list, and new
callbacks are added at the tail. This relationship between the
``->head`` pointer, the ``->tails[]`` array, and the callbacks is shown
in this diagram:

.. kernel-figure:: nxtlist.svg

In this figure, the ``->head`` pointer references the first RCU callback
in the list. The ``->tails[RCU_DONE_TAIL]`` array element references the
``->head`` pointer itself, indicating that none of the callbacks is
ready to invoke. The ``->tails[RCU_WAIT_TAIL]`` array element references
callback CB 2's ``->next`` pointer, which indicates that CB 1 and CB 2
are both waiting on the current grace period, give or take possible
disagreements about exactly which grace period is current. The
``->tails[RCU_NEXT_READY_TAIL]`` array element references the same RCU
callback that ``->tails[RCU_WAIT_TAIL]`` does, which indicates that
there are no callbacks waiting on the next RCU grace period. The
``->tails[RCU_NEXT_TAIL]`` array element references CB 4's ``->next``
pointer, indicating that the remaining RCU callbacks have not yet been
assigned to an RCU grace period. Note that the
``->tails[RCU_NEXT_TAIL]`` array element always references the last RCU
callback's ``->next`` pointer unless the callback list is empty, in
which case it references the ``->head`` pointer.
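
Enqueuing is correspondingly simple. Here is a sketch in the spirit of
the kernel's ``rcu_segcblist_enqueue()``, with the atomics and
``WRITE_ONCE()`` annotations omitted::

    /* Add a new callback at the very end: the RCU_NEXT_TAIL segment. */
    static void segcblist_enqueue(struct rcu_segcblist *rsclp,
                                  struct rcu_head *rhp)
    {
            rsclp->len++;   /* ->len counts all queued callbacks */
            rhp->next = NULL;
            *rsclp->tails[RCU_NEXT_TAIL] = rhp;  /* link after last CB */
            rsclp->tails[RCU_NEXT_TAIL] = &rhp->next;
    }

Note that when the list is empty, ``->tails[RCU_NEXT_TAIL]`` references
``->head`` itself, so the same two stores handle both the empty and the
non-empty cases.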

There is one additional important special case for the
``->tails[RCU_NEXT_TAIL]`` array element: It can be ``NULL`` when this
list is *disabled*. Lists are disabled when the corresponding CPU is
offline or when the corresponding CPU's callbacks are offloaded to a
kthread.

The ``->gp_seq[]`` array records grace-period numbers corresponding to
the list segments. This is what allows different CPUs to have different
ideas as to which is the current grace period while still avoiding
premature invocation of their callbacks.

The ``->len`` counter contains the number of callbacks in ``->head``,
and the ``->len_lazy`` field contains the number of those callbacks that
are known to only free memory, and whose invocation can therefore be
safely deferred.

**Important note**: It is the ``->len`` field that determines whether or
not there are callbacks associated with this ``rcu_segcblist``
structure, *not* the ``->head`` pointer. The reason for this is that all
the ready-to-invoke callbacks (that is, those in the ``RCU_DONE_TAIL``
segment) are extracted all at once at callback-invocation time
(``rcu_do_batch``), due to which ``->head`` may be set to ``NULL`` if
there are no not-done callbacks remaining in the ``rcu_segcblist``. If
callback invocation must be postponed, for example, because a
high-priority process just woke up on this CPU, then the remaining
callbacks are placed back on the ``RCU_DONE_TAIL`` segment and
``->head`` once again points to the start of the segment. In short, the
head field can briefly be ``NULL`` even though the CPU has callbacks
present. Therefore, it is not appropriate to test the
``->head`` pointer for ``NULL``.

In contrast, the ``->len`` and ``->len_lazy`` counts are adjusted only
when the corresponding callbacks are invoked. Therefore, the
``->len`` count is zero only if the ``rcu_segcblist`` structure really
is devoid of callbacks. Of course, off-CPU sampling of the ``->len``
count requires careful use of appropriate synchronization, for example,
memory barriers.

The ``rcu_data`` structure maintains the per-CPU state for the RCU
subsystem. The fields in this structure may be accessed only from the
corresponding CPU (and from tracing) unless otherwise stated. This
structure is the focus of quiescent-state detection and RCU callback
queuing. It also tracks its relationship to the corresponding leaf
``rcu_node`` structure to allow more-efficient propagation of quiescent
states up the ``rcu_node`` combining tree. Like the ``rcu_node``
structure, it provides a local copy of the grace-period information to
allow for-free synchronized access to this information from the
corresponding CPU. Finally, this structure records past dyntick-idle
state for the corresponding CPU and also tracks statistics.

The ``->cpu`` field contains the number of the corresponding CPU and the
``->mynode`` field references the corresponding ``rcu_node`` structure.
The ``->mynode`` field is used to propagate quiescent states up the
combining tree. These two fields are constant and therefore do not
require any synchronization.

The ``->grpmask`` field indicates the bit in the ``->mynode->qsmask``
field corresponding to this ``rcu_data`` structure, and is also used
when propagating quiescent states. The ``->beenonline`` flag is set
whenever the corresponding CPU comes online, which means that tracing
need not dump out any ``rcu_data`` structure for which this flag is not
set.

Quiescent-State and Grace-Period Tracking
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``->gp_seq`` field is the counterpart of the field of the same name
in the ``rcu_state`` and ``rcu_node`` structures. The
``->gp_seq_needed`` field is the counterpart of the field of the same
name in the ``rcu_node`` structures. They may each lag up to one behind
their ``rcu_node`` counterparts, but in ``CONFIG_NO_HZ_IDLE`` and
``CONFIG_NO_HZ_FULL`` kernels can lag arbitrarily far behind for CPUs in
dyntick-idle mode (but these counters will catch up upon exit from
dyntick-idle mode). If the lower two bits of a given ``rcu_data``
structure's ``->gp_seq`` are zero, then this ``rcu_data`` structure
believes that RCU is idle.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| All this replication of the grace-period numbers can only cause       |
| massive confusion. Why not just keep a global sequence number and     |
| be done with it???                                                    |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| Because if there was only a single global sequence number, a single   |
| global lock would be required to access it safely. And if we are not  |
| going to have a single global lock, then we                           |
| need to carefully manage the numbers on a per-node basis. Recall from |
| the answer to a previous Quick Quiz that applying a quiescent state   |
| to the wrong grace period has severe consequences.                    |
+-----------------------------------------------------------------------+

The ``->cpu_no_qs`` flag indicates that the CPU has not yet passed
through a quiescent state, while the ``->core_needs_qs`` flag indicates
that the RCU core needs a quiescent state from the corresponding CPU.
The ``->gpwrap`` field indicates that the corresponding CPU has remained
idle for so long that the ``gp_seq`` counter is in danger of overflow,
which will cause the CPU to disregard the values of its counters on its
next exit from idle.

In the absence of CPU-hotplug events, RCU callbacks are invoked by the
same CPU that registered them. This is strictly a cache-locality
optimization: callbacks can and do get invoked on CPUs other than the
one that registered them.

The ``->cblist`` structure is the segmented callback list described
earlier. The CPU advances the callbacks in its ``rcu_data`` structure
whenever it notices that another RCU grace period has completed. The CPU
detects the completion of an RCU grace period by noticing that the value
of its ``rcu_data`` structure's ``->gp_seq`` field differs from that of
its leaf ``rcu_node`` structure. Recall that each ``rcu_node``
structure's ``->gp_seq`` field is updated at the beginnings and ends of
each grace period.

The ``->qlen_last_fqs_check`` and ``->n_force_qs_snap`` coordinate the
forcing of quiescent states from ``call_rcu()`` and friends when
callback lists grow excessively long.

The ``->n_cbs_invoked``, ``->n_cbs_orphaned``, and ``->n_cbs_adopted``
fields count the number of callbacks invoked, sent to other CPUs when
this CPU goes offline, and received from other CPUs when those other
CPUs go offline. The ``->n_nocbs_invoked`` field is used when the CPU's
callbacks are offloaded to a kthread.

Finally, the ``->blimit`` counter is the maximum number of RCU callbacks
that may be invoked at a given time.

Dyntick-Idle Handling
~~~~~~~~~~~~~~~~~~~~~

The ``->dynticks_snap`` field is used to take a snapshot of the
corresponding CPU's dyntick-idle state when forcing quiescent states,
and is therefore accessed from other CPUs. Finally, the
``->dynticks_fqs`` field is used to count the number of times this CPU
is determined to be in dyntick-idle state, and is used for tracing and
debugging purposes.

These fields in the ``rcu_data`` structure maintain the per-CPU
dyntick-idle state for the corresponding CPU. The fields may be accessed
only from the corresponding CPU (and from tracing) unless otherwise
stated.

The ``->dynticks_nesting`` field counts the nesting depth of process
execution, so that in normal circumstances this counter has value zero
or one. NMIs, irqs, and tracers are counted by the
``->dynticks_nmi_nesting`` field. Because NMIs cannot be masked, changes
to this variable have to be undertaken carefully using an algorithm
provided by Andy Lutomirski. The initial transition from idle adds one,
and nested transitions add two, so that a nesting level of five is
represented by a ``->dynticks_nmi_nesting`` value of nine. This counter
can therefore be thought of as counting the number of reasons why this
CPU cannot be permitted to enter dyntick-idle mode, aside from
process-level transitions.
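
A stand-alone sketch of this counter arithmetic follows (an illustration
only; compare the kernel's ``rcu_nmi_enter()`` and ``rcu_nmi_exit()``)::

    #include <assert.h>
    #include <stdio.h>

    /* Entry from idle adds one; nested entries add two. */
    static void nmi_enter(long *nesting)
    {
            *nesting += (*nesting == 0) ? 1 : 2;
    }

    /* Outermost exit crowbars the counter back to zero (idle). */
    static void nmi_exit(long *nesting)
    {
            *nesting = (*nesting == 1) ? 0 : *nesting - 2;
    }

    int main(void)
    {
            long nesting = 0;
            int i;

            for (i = 0; i < 5; i++)  /* nest five levels deep */
                    nmi_enter(&nesting);
            printf("nesting level five = %ld\n", nesting); /* prints 9 */
            for (i = 0; i < 5; i++)
                    nmi_exit(&nesting);
            assert(nesting == 0);    /* back to idle */
            return 0;
    }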

However, it turns out that when running in non-idle kernel context, the
Linux kernel is fully capable of entering interrupt handlers that never
exit and perhaps also vice versa. Therefore, whenever the
``->dynticks_nesting`` field is incremented up from zero, the
``->dynticks_nmi_nesting`` field is set to a large positive number, and
whenever the ``->dynticks_nesting`` field is decremented down to zero,
the ``->dynticks_nmi_nesting`` field is set to zero. Assuming that
the number of misnested interrupts is not sufficient to overflow the
counter, this approach corrects the ``->dynticks_nmi_nesting`` field
every time the corresponding CPU enters the idle loop from process
context.

The ``->dynticks`` field counts the corresponding CPU's transitions to
and from either dyntick-idle or user mode, so that this counter has an
even value when the CPU is in dyntick-idle mode or user mode and an odd
value otherwise. The transitions to/from user mode need to be counted
for user mode adaptive-ticks support (see Documentation/timers/no_hz.rst).
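
A grace-period kthread can therefore snapshot a remote CPU's
``->dynticks`` value and later decide whether that CPU passed through a
quiescent state, as in this simplified stand-alone sketch (compare the
kernel's ``rcu_dynticks_snap()`` and ``rcu_dynticks_in_eqs_since()``;
memory ordering is omitted)::

    #include <stdbool.h>
    #include <stdio.h>

    /* Even counter value: the CPU is in dyntick-idle or user mode. */
    static bool in_eqs(unsigned long dynticks)
    {
            return !(dynticks & 1);
    }

    /* Did the CPU pass through an extended quiescent state since snap? */
    static bool in_eqs_since(unsigned long snap, unsigned long cur)
    {
            return in_eqs(cur) || cur != snap;
    }

    int main(void)
    {
            unsigned long snap = 5; /* odd: CPU was running in the kernel */

            printf("%d\n", in_eqs_since(snap, 5)); /* 0: never left */
            printf("%d\n", in_eqs_since(snap, 6)); /* 1: idle right now */
            printf("%d\n", in_eqs_since(snap, 7)); /* 1: went idle, back */
            return 0;
    }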

The ``->rcu_need_heavy_qs`` field is used to record the fact that the
RCU core code would really like to see a quiescent state from the
corresponding CPU, so much so that it is willing to call for
heavy-weight dyntick-counter operations. This flag is checked by RCU's
context-switch and ``cond_resched()`` code, which provide a momentary
idle sojourn in response.

Finally, the ``->rcu_urgent_qs`` field is used to record the fact that
the RCU core code would like to see a quiescent state from the
corresponding CPU, with the various other fields indicating just how
badly RCU wants this quiescent state. This flag is checked by RCU's
context-switch path (``rcu_note_context_switch``) and the
``cond_resched()`` code.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Why not simply combine the ``->dynticks_nesting`` and                 |
| ``->dynticks_nmi_nesting`` counters into a single counter that just   |
| counts the number of reasons that the corresponding CPU is non-idle?  |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| Because this would fail in the presence of interrupts whose handlers  |
| never return and of handlers that manage to return from a made-up     |
| interrupt.                                                            |
+-----------------------------------------------------------------------+

Additional fields are present for some special-purpose builds, and are
discussed separately.

Each ``rcu_head`` structure represents an RCU callback. These structures
are normally embedded within RCU-protected data structures whose
algorithms use asynchronous grace periods.

The ``->next`` field is used to link the ``rcu_head`` structures
together in the lists within the ``rcu_data`` structures. The ``->func``
field is a pointer to the function to be called when the callback is
ready to be invoked, and this function is passed a pointer to the
``rcu_head`` structure. However, ``kfree_rcu()`` uses the ``->func``
field to record the offset of the ``rcu_head`` structure within the
enclosing RCU-protected data structure.
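
For example, standard kernel usage embeds the ``rcu_head`` in the
protected structure and recovers the enclosing pointer in the callback
(the ``struct foo`` type and function names here are hypothetical)::

    #include <linux/rcupdate.h>
    #include <linux/slab.h>

    struct foo {
            int a;
            struct rcu_head rh;     /* embedded rcu_head */
    };

    /* Invoked after a grace period; rhp points at fp->rh. */
    static void foo_reclaim(struct rcu_head *rhp)
    {
            struct foo *fp = container_of(rhp, struct foo, rh);

            kfree(fp);
    }

    /* Called after fp has been removed from all readers' view. */
    static void foo_remove(struct foo *fp)
    {
            call_rcu(&fp->rh, foo_reclaim);
    }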

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Given that the callback function ``->func`` is passed a pointer to    |
| the ``rcu_head`` structure, how is that function supposed to find the |
| beginning of the enclosing RCU-protected data structure?              |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| In actual practice, there is a separate callback function per type of |
| RCU-protected data structure. The callback function can therefore use |
| the ``container_of()`` macro in the Linux kernel (or other            |
| pointer-manipulation facilities in other software environments) to    |
| find the beginning of the enclosing structure.                        |
+-----------------------------------------------------------------------+

RCU-Specific Fields in the ``task_struct`` Structure
====================================================

The ``->rcu_read_lock_nesting`` field records the nesting level for RCU
read-side critical sections, and the ``->rcu_read_unlock_special`` field
is a bitmask that records special conditions that require
``rcu_read_unlock()`` to do additional work. The ``->rcu_node_entry``
field is used to form lists of tasks that have blocked within
preemptible-RCU read-side critical sections and the
``->rcu_blocked_node`` field references the ``rcu_node`` structure whose
list this task is a member of, or ``NULL`` if it is not blocked within a
preemptible-RCU read-side critical section.

The ``->rcu_tasks_nvcsw`` field tracks the number of voluntary context
switches that this task had undergone at the beginning of the current
tasks-RCU grace period, ``->rcu_tasks_holdout`` is set if the current
tasks-RCU grace period is waiting on this task,
``->rcu_tasks_holdout_list`` is a list element enqueuing this task on
the holdout list, and ``->rcu_tasks_idle_cpu`` tracks which CPU this
idle task is running on.

The ``rcu_get_root()`` function and the
``rcu_for_each_node_breadth_first()`` and ``rcu_for_each_leaf_node()``
macros are as follows::

    static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
    {
            return &rsp->node[0];
    }

    #define rcu_for_each_node_breadth_first(rsp, rnp) \
            for ((rnp) = &(rsp)->node[0]; \
                 (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++)

    #define rcu_for_each_leaf_node(rsp, rnp) \
            for ((rnp) = (rsp)->level[NUM_RCU_LVLS - 1]; \
                 (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++)

The ``rcu_get_root()`` function simply returns a pointer to the first
element of the specified ``rcu_state`` structure's ``->node[]`` array,
which is the root ``rcu_node`` structure.

As noted earlier, the ``rcu_for_each_node_breadth_first()`` macro takes
advantage of the layout of the ``rcu_node`` structures in the
``rcu_state`` structure's ``->node[]`` array, performing a breadth-first
traversal by simply traversing the array in order. Similarly, the
``rcu_for_each_leaf_node()`` macro traverses only the last part of the
array, thus traversing only the leaf ``rcu_node`` structures.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| What does ``rcu_for_each_leaf_node()`` do if the ``rcu_node`` tree    |
| contains only a single node?                                          |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| In the single-node case, ``rcu_for_each_leaf_node()`` traverses the   |
| single node.                                                          |
+-----------------------------------------------------------------------+
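
As a usage illustration, grace-period initialization can visit every
``rcu_node`` structure with the breadth-first macro. This hypothetical
fragment assumes a local ``rsp`` pointer to the ``rcu_state``
structure::

    struct rcu_node *rnp;

    /* Push the new grace-period number down the tree, root first. */
    rcu_for_each_node_breadth_first(rsp, rnp)
            rnp->gp_seq = rsp->gp_seq;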

Finally, in ``CONFIG_NO_HZ_IDLE`` kernels, each CPU's dyntick-idle state
is tracked by dynticks-related fields in the ``rcu_data`` structure. If
you made it this far, you are well prepared to read the code
walkthroughs in the other articles in this series.

I owe thanks to the reviewers of early versions of this document for
helping me get this document into a more human-readable state.