diff options
Diffstat (limited to 'Documentation/RCU/Design')
9 files changed, 4971 insertions, 6102 deletions
diff --git a/Documentation/RCU/Design/Data-Structures/Data-Structures.html b/Documentation/RCU/Design/Data-Structures/Data-Structures.html deleted file mode 100644 index c30c1957c7e6..000000000000 --- a/Documentation/RCU/Design/Data-Structures/Data-Structures.html +++ /dev/null @@ -1,1391 +0,0 @@ -<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" - "http://www.w3.org/TR/html4/loose.dtd"> - <html> - <head><title>A Tour Through TREE_RCU's Data Structures [LWN.net]</title> - <meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1"> - - <p>December 18, 2016</p> - <p>This article was contributed by Paul E. McKenney</p> - -<h3>Introduction</h3> - -This document describes RCU's major data structures and their relationship -to each other. - -<ol> -<li> <a href="#Data-Structure Relationships"> - Data-Structure Relationships</a> -<li> <a href="#The rcu_state Structure"> - The <tt>rcu_state</tt> Structure</a> -<li> <a href="#The rcu_node Structure"> - The <tt>rcu_node</tt> Structure</a> -<li> <a href="#The rcu_segcblist Structure"> - The <tt>rcu_segcblist</tt> Structure</a> -<li> <a href="#The rcu_data Structure"> - The <tt>rcu_data</tt> Structure</a> -<li> <a href="#The rcu_head Structure"> - The <tt>rcu_head</tt> Structure</a> -<li> <a href="#RCU-Specific Fields in the task_struct Structure"> - RCU-Specific Fields in the <tt>task_struct</tt> Structure</a> -<li> <a href="#Accessor Functions"> - Accessor Functions</a> -</ol> - -<h3><a name="Data-Structure Relationships">Data-Structure Relationships</a></h3> - -<p>RCU is for all intents and purposes a large state machine, and its -data structures maintain the state in such a way as to allow RCU readers -to execute extremely quickly, while also processing the RCU grace periods -requested by updaters in an efficient and extremely scalable fashion. -The efficiency and scalability of RCU updaters is provided primarily -by a combining tree, as shown below: - -</p><p><img src="BigTreeClassicRCU.svg" alt="BigTreeClassicRCU.svg" width="30%"> - -</p><p>This diagram shows an enclosing <tt>rcu_state</tt> structure -containing a tree of <tt>rcu_node</tt> structures. -Each leaf node of the <tt>rcu_node</tt> tree has up to 16 -<tt>rcu_data</tt> structures associated with it, so that there -are <tt>NR_CPUS</tt> number of <tt>rcu_data</tt> structures, -one for each possible CPU. -This structure is adjusted at boot time, if needed, to handle the -common case where <tt>nr_cpu_ids</tt> is much less than -<tt>NR_CPUs</tt>. -For example, a number of Linux distributions set <tt>NR_CPUs=4096</tt>, -which results in a three-level <tt>rcu_node</tt> tree. -If the actual hardware has only 16 CPUs, RCU will adjust itself -at boot time, resulting in an <tt>rcu_node</tt> tree with only a single node. - -</p><p>The purpose of this combining tree is to allow per-CPU events -such as quiescent states, dyntick-idle transitions, -and CPU hotplug operations to be processed efficiently -and scalably. -Quiescent states are recorded by the per-CPU <tt>rcu_data</tt> structures, -and other events are recorded by the leaf-level <tt>rcu_node</tt> -structures. -All of these events are combined at each level of the tree until finally -grace periods are completed at the tree's root <tt>rcu_node</tt> -structure. -A grace period can be completed at the root once every CPU -(or, in the case of <tt>CONFIG_PREEMPT_RCU</tt>, task) -has passed through a quiescent state. -Once a grace period has completed, record of that fact is propagated -back down the tree. - -</p><p>As can be seen from the diagram, on a 64-bit system -a two-level tree with 64 leaves can accommodate 1,024 CPUs, with a fanout -of 64 at the root and a fanout of 16 at the leaves. - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - Why isn't the fanout at the leaves also 64? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - Because there are more types of events that affect the leaf-level - <tt>rcu_node</tt> structures than further up the tree. - Therefore, if the leaf <tt>rcu_node</tt> structures have fanout of - 64, the contention on these structures' <tt>->structures</tt> - becomes excessive. - Experimentation on a wide variety of systems has shown that a fanout - of 16 works well for the leaves of the <tt>rcu_node</tt> tree. - </font> - - <p><font color="ffffff">Of course, further experience with - systems having hundreds or thousands of CPUs may demonstrate - that the fanout for the non-leaf <tt>rcu_node</tt> structures - must also be reduced. - Such reduction can be easily carried out when and if it proves - necessary. - In the meantime, if you are using such a system and running into - contention problems on the non-leaf <tt>rcu_node</tt> structures, - you may use the <tt>CONFIG_RCU_FANOUT</tt> kernel configuration - parameter to reduce the non-leaf fanout as needed. - </font> - - <p><font color="ffffff">Kernels built for systems with - strong NUMA characteristics might also need to adjust - <tt>CONFIG_RCU_FANOUT</tt> so that the domains of the - <tt>rcu_node</tt> structures align with hardware boundaries. - However, there has thus far been no need for this. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<p>If your system has more than 1,024 CPUs (or more than 512 CPUs on -a 32-bit system), then RCU will automatically add more levels to the -tree. -For example, if you are crazy enough to build a 64-bit system with 65,536 -CPUs, RCU would configure the <tt>rcu_node</tt> tree as follows: - -</p><p><img src="HugeTreeClassicRCU.svg" alt="HugeTreeClassicRCU.svg" width="50%"> - -</p><p>RCU currently permits up to a four-level tree, which on a 64-bit system -accommodates up to 4,194,304 CPUs, though only a mere 524,288 CPUs for -32-bit systems. -On the other hand, you can set both <tt>CONFIG_RCU_FANOUT</tt> and -<tt>CONFIG_RCU_FANOUT_LEAF</tt> to be as small as 2, which would result -in a 16-CPU test using a 4-level tree. -This can be useful for testing large-system capabilities on small test -machines. - -</p><p>This multi-level combining tree allows us to get most of the -performance and scalability -benefits of partitioning, even though RCU grace-period detection is -inherently a global operation. -The trick here is that only the last CPU to report a quiescent state -into a given <tt>rcu_node</tt> structure need advance to the <tt>rcu_node</tt> -structure at the next level up the tree. -This means that at the leaf-level <tt>rcu_node</tt> structure, only -one access out of sixteen will progress up the tree. -For the internal <tt>rcu_node</tt> structures, the situation is even -more extreme: Only one access out of sixty-four will progress up -the tree. -Because the vast majority of the CPUs do not progress up the tree, -the lock contention remains roughly constant up the tree. -No matter how many CPUs there are in the system, at most 64 quiescent-state -reports per grace period will progress all the way to the root -<tt>rcu_node</tt> structure, thus ensuring that the lock contention -on that root <tt>rcu_node</tt> structure remains acceptably low. - -</p><p>In effect, the combining tree acts like a big shock absorber, -keeping lock contention under control at all tree levels regardless -of the level of loading on the system. - -</p><p>RCU updaters wait for normal grace periods by registering -RCU callbacks, either directly via <tt>call_rcu()</tt> -or indirectly via <tt>synchronize_rcu()</tt> and friends. -RCU callbacks are represented by <tt>rcu_head</tt> structures, -which are queued on <tt>rcu_data</tt> structures while they are -waiting for a grace period to elapse, as shown in the following figure: - -</p><p><img src="BigTreePreemptRCUBHdyntickCB.svg" alt="BigTreePreemptRCUBHdyntickCB.svg" width="40%"> - -</p><p>This figure shows how <tt>TREE_RCU</tt>'s and -<tt>PREEMPT_RCU</tt>'s major data structures are related. -Lesser data structures will be introduced with the algorithms that -make use of them. - -</p><p>Note that each of the data structures in the above figure has -its own synchronization: - -<p><ol> -<li> Each <tt>rcu_state</tt> structures has a lock and a mutex, - and some fields are protected by the corresponding root - <tt>rcu_node</tt> structure's lock. -<li> Each <tt>rcu_node</tt> structure has a spinlock. -<li> The fields in <tt>rcu_data</tt> are private to the corresponding - CPU, although a few can be read and written by other CPUs. -</ol> - -<p>It is important to note that different data structures can have -very different ideas about the state of RCU at any given time. -For but one example, awareness of the start or end of a given RCU -grace period propagates slowly through the data structures. -This slow propagation is absolutely necessary for RCU to have good -read-side performance. -If this balkanized implementation seems foreign to you, one useful -trick is to consider each instance of these data structures to be -a different person, each having the usual slightly different -view of reality. - -</p><p>The general role of each of these data structures is as -follows: - -</p><ol> -<li> <tt>rcu_state</tt>: - This structure forms the interconnection between the - <tt>rcu_node</tt> and <tt>rcu_data</tt> structures, - tracks grace periods, serves as short-term repository - for callbacks orphaned by CPU-hotplug events, - maintains <tt>rcu_barrier()</tt> state, - tracks expedited grace-period state, - and maintains state used to force quiescent states when - grace periods extend too long, -<li> <tt>rcu_node</tt>: This structure forms the combining - tree that propagates quiescent-state - information from the leaves to the root, and also propagates - grace-period information from the root to the leaves. - It provides local copies of the grace-period state in order - to allow this information to be accessed in a synchronized - manner without suffering the scalability limitations that - would otherwise be imposed by global locking. - In <tt>CONFIG_PREEMPT_RCU</tt> kernels, it manages the lists - of tasks that have blocked while in their current - RCU read-side critical section. - In <tt>CONFIG_PREEMPT_RCU</tt> with - <tt>CONFIG_RCU_BOOST</tt>, it manages the - per-<tt>rcu_node</tt> priority-boosting - kernel threads (kthreads) and state. - Finally, it records CPU-hotplug state in order to determine - which CPUs should be ignored during a given grace period. -<li> <tt>rcu_data</tt>: This per-CPU structure is the - focus of quiescent-state detection and RCU callback queuing. - It also tracks its relationship to the corresponding leaf - <tt>rcu_node</tt> structure to allow more-efficient - propagation of quiescent states up the <tt>rcu_node</tt> - combining tree. - Like the <tt>rcu_node</tt> structure, it provides a local - copy of the grace-period information to allow for-free - synchronized - access to this information from the corresponding CPU. - Finally, this structure records past dyntick-idle state - for the corresponding CPU and also tracks statistics. -<li> <tt>rcu_head</tt>: - This structure represents RCU callbacks, and is the - only structure allocated and managed by RCU users. - The <tt>rcu_head</tt> structure is normally embedded - within the RCU-protected data structure. -</ol> - -<p>If all you wanted from this article was a general notion of how -RCU's data structures are related, you are done. -Otherwise, each of the following sections give more details on -the <tt>rcu_state</tt>, <tt>rcu_node</tt> and <tt>rcu_data</tt> data -structures. - -<h3><a name="The rcu_state Structure"> -The <tt>rcu_state</tt> Structure</a></h3> - -<p>The <tt>rcu_state</tt> structure is the base structure that -represents the state of RCU in the system. -This structure forms the interconnection between the -<tt>rcu_node</tt> and <tt>rcu_data</tt> structures, -tracks grace periods, contains the lock used to -synchronize with CPU-hotplug events, -and maintains state used to force quiescent states when -grace periods extend too long, - -</p><p>A few of the <tt>rcu_state</tt> structure's fields are discussed, -singly and in groups, in the following sections. -The more specialized fields are covered in the discussion of their -use. - -<h5>Relationship to rcu_node and rcu_data Structures</h5> - -This portion of the <tt>rcu_state</tt> structure is declared -as follows: - -<pre> - 1 struct rcu_node node[NUM_RCU_NODES]; - 2 struct rcu_node *level[NUM_RCU_LVLS + 1]; - 3 struct rcu_data __percpu *rda; -</pre> - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - Wait a minute! - You said that the <tt>rcu_node</tt> structures formed a tree, - but they are declared as a flat array! - What gives? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - The tree is laid out in the array. - The first node In the array is the head, the next set of nodes in the - array are children of the head node, and so on until the last set of - nodes in the array are the leaves. - </font> - - <p><font color="ffffff">See the following diagrams to see how - this works. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<p>The <tt>rcu_node</tt> tree is embedded into the -<tt>->node[]</tt> array as shown in the following figure: - -</p><p><img src="TreeMapping.svg" alt="TreeMapping.svg" width="40%"> - -</p><p>One interesting consequence of this mapping is that a -breadth-first traversal of the tree is implemented as a simple -linear scan of the array, which is in fact what the -<tt>rcu_for_each_node_breadth_first()</tt> macro does. -This macro is used at the beginning and ends of grace periods. - -</p><p>Each entry of the <tt>->level</tt> array references -the first <tt>rcu_node</tt> structure on the corresponding level -of the tree, for example, as shown below: - -</p><p><img src="TreeMappingLevel.svg" alt="TreeMappingLevel.svg" width="40%"> - -</p><p>The zero<sup>th</sup> element of the array references the root -<tt>rcu_node</tt> structure, the first element references the -first child of the root <tt>rcu_node</tt>, and finally the second -element references the first leaf <tt>rcu_node</tt> structure. - -</p><p>For whatever it is worth, if you draw the tree to be tree-shaped -rather than array-shaped, it is easy to draw a planar representation: - -</p><p><img src="TreeLevel.svg" alt="TreeLevel.svg" width="60%"> - -</p><p>Finally, the <tt>->rda</tt> field references a per-CPU -pointer to the corresponding CPU's <tt>rcu_data</tt> structure. - -</p><p>All of these fields are constant once initialization is complete, -and therefore need no protection. - -<h5>Grace-Period Tracking</h5> - -<p>This portion of the <tt>rcu_state</tt> structure is declared -as follows: - -<pre> - 1 unsigned long gp_seq; -</pre> - -<p>RCU grace periods are numbered, and -the <tt>->gp_seq</tt> field contains the current grace-period -sequence number. -The bottom two bits are the state of the current grace period, -which can be zero for not yet started or one for in progress. -In other words, if the bottom two bits of <tt>->gp_seq</tt> are -zero, then RCU is idle. -Any other value in the bottom two bits indicates that something is broken. -This field is protected by the root <tt>rcu_node</tt> structure's -<tt>->lock</tt> field. - -</p><p>There are <tt>->gp_seq</tt> fields -in the <tt>rcu_node</tt> and <tt>rcu_data</tt> structures -as well. -The fields in the <tt>rcu_state</tt> structure represent the -most current value, and those of the other structures are compared -in order to detect the beginnings and ends of grace periods in a distributed -fashion. -The values flow from <tt>rcu_state</tt> to <tt>rcu_node</tt> -(down the tree from the root to the leaves) to <tt>rcu_data</tt>. - -<h5>Miscellaneous</h5> - -<p>This portion of the <tt>rcu_state</tt> structure is declared -as follows: - -<pre> - 1 unsigned long gp_max; - 2 char abbr; - 3 char *name; -</pre> - -<p>The <tt>->gp_max</tt> field tracks the duration of the longest -grace period in jiffies. -It is protected by the root <tt>rcu_node</tt>'s <tt>->lock</tt>. - -<p>The <tt>->name</tt> and <tt>->abbr</tt> fields distinguish -between preemptible RCU (“rcu_preempt” and “p”) -and non-preemptible RCU (“rcu_sched” and “s”). -These fields are used for diagnostic and tracing purposes. - -<h3><a name="The rcu_node Structure"> -The <tt>rcu_node</tt> Structure</a></h3> - -<p>The <tt>rcu_node</tt> structures form the combining -tree that propagates quiescent-state -information from the leaves to the root and also that propagates -grace-period information from the root down to the leaves. -They provides local copies of the grace-period state in order -to allow this information to be accessed in a synchronized -manner without suffering the scalability limitations that -would otherwise be imposed by global locking. -In <tt>CONFIG_PREEMPT_RCU</tt> kernels, they manage the lists -of tasks that have blocked while in their current -RCU read-side critical section. -In <tt>CONFIG_PREEMPT_RCU</tt> with -<tt>CONFIG_RCU_BOOST</tt>, they manage the -per-<tt>rcu_node</tt> priority-boosting -kernel threads (kthreads) and state. -Finally, they record CPU-hotplug state in order to determine -which CPUs should be ignored during a given grace period. - -</p><p>The <tt>rcu_node</tt> structure's fields are discussed, -singly and in groups, in the following sections. - -<h5>Connection to Combining Tree</h5> - -<p>This portion of the <tt>rcu_node</tt> structure is declared -as follows: - -<pre> - 1 struct rcu_node *parent; - 2 u8 level; - 3 u8 grpnum; - 4 unsigned long grpmask; - 5 int grplo; - 6 int grphi; -</pre> - -<p>The <tt>->parent</tt> pointer references the <tt>rcu_node</tt> -one level up in the tree, and is <tt>NULL</tt> for the root -<tt>rcu_node</tt>. -The RCU implementation makes heavy use of this field to push quiescent -states up the tree. -The <tt>->level</tt> field gives the level in the tree, with -the root being at level zero, its children at level one, and so on. -The <tt>->grpnum</tt> field gives this node's position within -the children of its parent, so this number can range between 0 and 31 -on 32-bit systems and between 0 and 63 on 64-bit systems. -The <tt>->level</tt> and <tt>->grpnum</tt> fields are -used only during initialization and for tracing. -The <tt>->grpmask</tt> field is the bitmask counterpart of -<tt>->grpnum</tt>, and therefore always has exactly one bit set. -This mask is used to clear the bit corresponding to this <tt>rcu_node</tt> -structure in its parent's bitmasks, which are described later. -Finally, the <tt>->grplo</tt> and <tt>->grphi</tt> fields -contain the lowest and highest numbered CPU served by this -<tt>rcu_node</tt> structure, respectively. - -</p><p>All of these fields are constant, and thus do not require any -synchronization. - -<h5>Synchronization</h5> - -<p>This field of the <tt>rcu_node</tt> structure is declared -as follows: - -<pre> - 1 raw_spinlock_t lock; -</pre> - -<p>This field is used to protect the remaining fields in this structure, -unless otherwise stated. -That said, all of the fields in this structure can be accessed without -locking for tracing purposes. -Yes, this can result in confusing traces, but better some tracing confusion -than to be heisenbugged out of existence. - -<h5>Grace-Period Tracking</h5> - -<p>This portion of the <tt>rcu_node</tt> structure is declared -as follows: - -<pre> - 1 unsigned long gp_seq; - 2 unsigned long gp_seq_needed; -</pre> - -<p>The <tt>rcu_node</tt> structures' <tt>->gp_seq</tt> fields are -the counterparts of the field of the same name in the <tt>rcu_state</tt> -structure. -They each may lag up to one step behind their <tt>rcu_state</tt> -counterpart. -If the bottom two bits of a given <tt>rcu_node</tt> structure's -<tt>->gp_seq</tt> field is zero, then this <tt>rcu_node</tt> -structure believes that RCU is idle. -</p><p>The <tt>>gp_seq</tt> field of each <tt>rcu_node</tt> -structure is updated at the beginning and the end -of each grace period. - -<p>The <tt>->gp_seq_needed</tt> fields record the -furthest-in-the-future grace period request seen by the corresponding -<tt>rcu_node</tt> structure. The request is considered fulfilled when -the value of the <tt>->gp_seq</tt> field equals or exceeds that of -the <tt>->gp_seq_needed</tt> field. - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - Suppose that this <tt>rcu_node</tt> structure doesn't see - a request for a very long time. - Won't wrapping of the <tt>->gp_seq</tt> field cause - problems? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - No, because if the <tt>->gp_seq_needed</tt> field lags behind the - <tt>->gp_seq</tt> field, the <tt>->gp_seq_needed</tt> field - will be updated at the end of the grace period. - Modulo-arithmetic comparisons therefore will always get the - correct answer, even with wrapping. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<h5>Quiescent-State Tracking</h5> - -<p>These fields manage the propagation of quiescent states up the -combining tree. - -</p><p>This portion of the <tt>rcu_node</tt> structure has fields -as follows: - -<pre> - 1 unsigned long qsmask; - 2 unsigned long expmask; - 3 unsigned long qsmaskinit; - 4 unsigned long expmaskinit; -</pre> - -<p>The <tt>->qsmask</tt> field tracks which of this -<tt>rcu_node</tt> structure's children still need to report -quiescent states for the current normal grace period. -Such children will have a value of 1 in their corresponding bit. -Note that the leaf <tt>rcu_node</tt> structures should be -thought of as having <tt>rcu_data</tt> structures as their -children. -Similarly, the <tt>->expmask</tt> field tracks which -of this <tt>rcu_node</tt> structure's children still need to report -quiescent states for the current expedited grace period. -An expedited grace period has -the same conceptual properties as a normal grace period, but the -expedited implementation accepts extreme CPU overhead to obtain -much lower grace-period latency, for example, consuming a few -tens of microseconds worth of CPU time to reduce grace-period -duration from milliseconds to tens of microseconds. -The <tt>->qsmaskinit</tt> field tracks which of this -<tt>rcu_node</tt> structure's children cover for at least -one online CPU. -This mask is used to initialize <tt>->qsmask</tt>, -and <tt>->expmaskinit</tt> is used to initialize -<tt>->expmask</tt> and the beginning of the -normal and expedited grace periods, respectively. - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - Why are these bitmasks protected by locking? - Come on, haven't you heard of atomic instructions??? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - Lockless grace-period computation! Such a tantalizing possibility! - </font> - - <p><font color="ffffff">But consider the following sequence of events: - </font> - - <ol> - <li> <font color="ffffff">CPU 0 has been in dyntick-idle - mode for quite some time. - When it wakes up, it notices that the current RCU - grace period needs it to report in, so it sets a - flag where the scheduling clock interrupt will find it. - </font><p> - <li> <font color="ffffff">Meanwhile, CPU 1 is running - <tt>force_quiescent_state()</tt>, - and notices that CPU 0 has been in dyntick idle mode, - which qualifies as an extended quiescent state. - </font><p> - <li> <font color="ffffff">CPU 0's scheduling clock - interrupt fires in the - middle of an RCU read-side critical section, and notices - that the RCU core needs something, so commences RCU softirq - processing. - </font> - <p> - <li> <font color="ffffff">CPU 0's softirq handler - executes and is just about ready - to report its quiescent state up the <tt>rcu_node</tt> - tree. - </font><p> - <li> <font color="ffffff">But CPU 1 beats it to the punch, - completing the current - grace period and starting a new one. - </font><p> - <li> <font color="ffffff">CPU 0 now reports its quiescent - state for the wrong - grace period. - That grace period might now end before the RCU read-side - critical section. - If that happens, disaster will ensue. - </font> - </ol> - - <p><font color="ffffff">So the locking is absolutely required in - order to coordinate clearing of the bits with updating of the - grace-period sequence number in <tt>->gp_seq</tt>. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<h5>Blocked-Task Management</h5> - -<p><tt>PREEMPT_RCU</tt> allows tasks to be preempted in the -midst of their RCU read-side critical sections, and these tasks -must be tracked explicitly. -The details of exactly why and how they are tracked will be covered -in a separate article on RCU read-side processing. -For now, it is enough to know that the <tt>rcu_node</tt> -structure tracks them. - -<pre> - 1 struct list_head blkd_tasks; - 2 struct list_head *gp_tasks; - 3 struct list_head *exp_tasks; - 4 bool wait_blkd_tasks; -</pre> - -<p>The <tt>->blkd_tasks</tt> field is a list header for -the list of blocked and preempted tasks. -As tasks undergo context switches within RCU read-side critical -sections, their <tt>task_struct</tt> structures are enqueued -(via the <tt>task_struct</tt>'s <tt>->rcu_node_entry</tt> -field) onto the head of the <tt>->blkd_tasks</tt> list for the -leaf <tt>rcu_node</tt> structure corresponding to the CPU -on which the outgoing context switch executed. -As these tasks later exit their RCU read-side critical sections, -they remove themselves from the list. -This list is therefore in reverse time order, so that if one of the tasks -is blocking the current grace period, all subsequent tasks must -also be blocking that same grace period. -Therefore, a single pointer into this list suffices to track -all tasks blocking a given grace period. -That pointer is stored in <tt>->gp_tasks</tt> for normal -grace periods and in <tt>->exp_tasks</tt> for expedited -grace periods. -These last two fields are <tt>NULL</tt> if either there is -no grace period in flight or if there are no blocked tasks -preventing that grace period from completing. -If either of these two pointers is referencing a task that -removes itself from the <tt>->blkd_tasks</tt> list, -then that task must advance the pointer to the next task on -the list, or set the pointer to <tt>NULL</tt> if there -are no subsequent tasks on the list. - -</p><p>For example, suppose that tasks T1, T2, and T3 are -all hard-affinitied to the largest-numbered CPU in the system. -Then if task T1 blocked in an RCU read-side -critical section, then an expedited grace period started, -then task T2 blocked in an RCU read-side critical section, -then a normal grace period started, and finally task 3 blocked -in an RCU read-side critical section, then the state of the -last leaf <tt>rcu_node</tt> structure's blocked-task list -would be as shown below: - -</p><p><img src="blkd_task.svg" alt="blkd_task.svg" width="60%"> - -</p><p>Task T1 is blocking both grace periods, task T2 is -blocking only the normal grace period, and task T3 is blocking -neither grace period. -Note that these tasks will not remove themselves from this list -immediately upon resuming execution. -They will instead remain on the list until they execute the outermost -<tt>rcu_read_unlock()</tt> that ends their RCU read-side critical -section. - -<p> -The <tt>->wait_blkd_tasks</tt> field indicates whether or not -the current grace period is waiting on a blocked task. - -<h5>Sizing the <tt>rcu_node</tt> Array</h5> - -<p>The <tt>rcu_node</tt> array is sized via a series of -C-preprocessor expressions as follows: - -<pre> - 1 #ifdef CONFIG_RCU_FANOUT - 2 #define RCU_FANOUT CONFIG_RCU_FANOUT - 3 #else - 4 # ifdef CONFIG_64BIT - 5 # define RCU_FANOUT 64 - 6 # else - 7 # define RCU_FANOUT 32 - 8 # endif - 9 #endif -10 -11 #ifdef CONFIG_RCU_FANOUT_LEAF -12 #define RCU_FANOUT_LEAF CONFIG_RCU_FANOUT_LEAF -13 #else -14 # ifdef CONFIG_64BIT -15 # define RCU_FANOUT_LEAF 64 -16 # else -17 # define RCU_FANOUT_LEAF 32 -18 # endif -19 #endif -20 -21 #define RCU_FANOUT_1 (RCU_FANOUT_LEAF) -22 #define RCU_FANOUT_2 (RCU_FANOUT_1 * RCU_FANOUT) -23 #define RCU_FANOUT_3 (RCU_FANOUT_2 * RCU_FANOUT) -24 #define RCU_FANOUT_4 (RCU_FANOUT_3 * RCU_FANOUT) -25 -26 #if NR_CPUS <= RCU_FANOUT_1 -27 # define RCU_NUM_LVLS 1 -28 # define NUM_RCU_LVL_0 1 -29 # define NUM_RCU_NODES NUM_RCU_LVL_0 -30 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0 } -31 # define RCU_NODE_NAME_INIT { "rcu_node_0" } -32 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0" } -33 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0" } -34 #elif NR_CPUS <= RCU_FANOUT_2 -35 # define RCU_NUM_LVLS 2 -36 # define NUM_RCU_LVL_0 1 -37 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1) -38 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1) -39 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1 } -40 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1" } -41 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1" } -42 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1" } -43 #elif NR_CPUS <= RCU_FANOUT_3 -44 # define RCU_NUM_LVLS 3 -45 # define NUM_RCU_LVL_0 1 -46 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2) -47 # define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1) -48 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2) -49 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2 } -50 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1", "rcu_node_2" } -51 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2" } -52 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2" } -53 #elif NR_CPUS <= RCU_FANOUT_4 -54 # define RCU_NUM_LVLS 4 -55 # define NUM_RCU_LVL_0 1 -56 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_3) -57 # define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2) -58 # define NUM_RCU_LVL_3 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1) -59 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3) -60 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2, NUM_RCU_LVL_3 } -61 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1", "rcu_node_2", "rcu_node_3" } -62 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2", "rcu_node_fqs_3" } -63 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2", "rcu_node_exp_3" } -64 #else -65 # error "CONFIG_RCU_FANOUT insufficient for NR_CPUS" -66 #endif -</pre> - -<p>The maximum number of levels in the <tt>rcu_node</tt> structure -is currently limited to four, as specified by lines 21-24 -and the structure of the subsequent “if” statement. -For 32-bit systems, this allows 16*32*32*32=524,288 CPUs, which -should be sufficient for the next few years at least. -For 64-bit systems, 16*64*64*64=4,194,304 CPUs is allowed, which -should see us through the next decade or so. -This four-level tree also allows kernels built with -<tt>CONFIG_RCU_FANOUT=8</tt> to support up to 4096 CPUs, -which might be useful in very large systems having eight CPUs per -socket (but please note that no one has yet shown any measurable -performance degradation due to misaligned socket and <tt>rcu_node</tt> -boundaries). -In addition, building kernels with a full four levels of <tt>rcu_node</tt> -tree permits better testing of RCU's combining-tree code. - -</p><p>The <tt>RCU_FANOUT</tt> symbol controls how many children -are permitted at each non-leaf level of the <tt>rcu_node</tt> tree. -If the <tt>CONFIG_RCU_FANOUT</tt> Kconfig option is not specified, -it is set based on the word size of the system, which is also -the Kconfig default. - -</p><p>The <tt>RCU_FANOUT_LEAF</tt> symbol controls how many CPUs are -handled by each leaf <tt>rcu_node</tt> structure. -Experience has shown that allowing a given leaf <tt>rcu_node</tt> -structure to handle 64 CPUs, as permitted by the number of bits in -the <tt>->qsmask</tt> field on a 64-bit system, results in -excessive contention for the leaf <tt>rcu_node</tt> structures' -<tt>->lock</tt> fields. -The number of CPUs per leaf <tt>rcu_node</tt> structure is therefore -limited to 16 given the default value of & |