diff options
Diffstat (limited to 'Documentation/RCU/Design/Requirements/Requirements.html')
-rw-r--r-- | Documentation/RCU/Design/Requirements/Requirements.html | 3330 |
1 files changed, 0 insertions, 3330 deletions
diff --git a/Documentation/RCU/Design/Requirements/Requirements.html b/Documentation/RCU/Design/Requirements/Requirements.html deleted file mode 100644 index 5a9238a2883c..000000000000 --- a/Documentation/RCU/Design/Requirements/Requirements.html +++ /dev/null @@ -1,3330 +0,0 @@ -<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" - "http://www.w3.org/TR/html4/loose.dtd"> - <html> - <head><title>A Tour Through RCU's Requirements [LWN.net]</title> - <meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8"> - -<h1>A Tour Through RCU's Requirements</h1> - -<p>Copyright IBM Corporation, 2015</p> -<p>Author: Paul E. McKenney</p> -<p><i>The initial version of this document appeared in the -<a href="https://lwn.net/">LWN</a> articles -<a href="https://lwn.net/Articles/652156/">here</a>, -<a href="https://lwn.net/Articles/652677/">here</a>, and -<a href="https://lwn.net/Articles/653326/">here</a>.</i></p> - -<h2>Introduction</h2> - -<p> -Read-copy update (RCU) is a synchronization mechanism that is often -used as a replacement for reader-writer locking. -RCU is unusual in that updaters do not block readers, -which means that RCU's read-side primitives can be exceedingly fast -and scalable. -In addition, updaters can make useful forward progress concurrently -with readers. -However, all this concurrency between RCU readers and updaters does raise -the question of exactly what RCU readers are doing, which in turn -raises the question of exactly what RCU's requirements are. - -<p> -This document therefore summarizes RCU's requirements, and can be thought -of as an informal, high-level specification for RCU. -It is important to understand that RCU's specification is primarily -empirical in nature; -in fact, I learned about many of these requirements the hard way. -This situation might cause some consternation, however, not only -has this learning process been a lot of fun, but it has also been -a great privilege to work with so many people willing to apply -technologies in interesting new ways. - -<p> -All that aside, here are the categories of currently known RCU requirements: -</p> - -<ol> -<li> <a href="#Fundamental Requirements"> - Fundamental Requirements</a> -<li> <a href="#Fundamental Non-Requirements">Fundamental Non-Requirements</a> -<li> <a href="#Parallelism Facts of Life"> - Parallelism Facts of Life</a> -<li> <a href="#Quality-of-Implementation Requirements"> - Quality-of-Implementation Requirements</a> -<li> <a href="#Linux Kernel Complications"> - Linux Kernel Complications</a> -<li> <a href="#Software-Engineering Requirements"> - Software-Engineering Requirements</a> -<li> <a href="#Other RCU Flavors"> - Other RCU Flavors</a> -<li> <a href="#Possible Future Changes"> - Possible Future Changes</a> -</ol> - -<p> -This is followed by a <a href="#Summary">summary</a>, -however, the answers to each quick quiz immediately follows the quiz. -Select the big white space with your mouse to see the answer. - -<h2><a name="Fundamental Requirements">Fundamental Requirements</a></h2> - -<p> -RCU's fundamental requirements are the closest thing RCU has to hard -mathematical requirements. -These are: - -<ol> -<li> <a href="#Grace-Period Guarantee"> - Grace-Period Guarantee</a> -<li> <a href="#Publish-Subscribe Guarantee"> - Publish-Subscribe Guarantee</a> -<li> <a href="#Memory-Barrier Guarantees"> - Memory-Barrier Guarantees</a> -<li> <a href="#RCU Primitives Guaranteed to Execute Unconditionally"> - RCU Primitives Guaranteed to Execute Unconditionally</a> -<li> <a href="#Guaranteed Read-to-Write Upgrade"> - Guaranteed Read-to-Write Upgrade</a> -</ol> - -<h3><a name="Grace-Period Guarantee">Grace-Period Guarantee</a></h3> - -<p> -RCU's grace-period guarantee is unusual in being premeditated: -Jack Slingwine and I had this guarantee firmly in mind when we started -work on RCU (then called “rclock”) in the early 1990s. -That said, the past two decades of experience with RCU have produced -a much more detailed understanding of this guarantee. - -<p> -RCU's grace-period guarantee allows updaters to wait for the completion -of all pre-existing RCU read-side critical sections. -An RCU read-side critical section -begins with the marker <tt>rcu_read_lock()</tt> and ends with -the marker <tt>rcu_read_unlock()</tt>. -These markers may be nested, and RCU treats a nested set as one -big RCU read-side critical section. -Production-quality implementations of <tt>rcu_read_lock()</tt> and -<tt>rcu_read_unlock()</tt> are extremely lightweight, and in -fact have exactly zero overhead in Linux kernels built for production -use with <tt>CONFIG_PREEMPT=n</tt>. - -<p> -This guarantee allows ordering to be enforced with extremely low -overhead to readers, for example: - -<blockquote> -<pre> - 1 int x, y; - 2 - 3 void thread0(void) - 4 { - 5 rcu_read_lock(); - 6 r1 = READ_ONCE(x); - 7 r2 = READ_ONCE(y); - 8 rcu_read_unlock(); - 9 } -10 -11 void thread1(void) -12 { -13 WRITE_ONCE(x, 1); -14 synchronize_rcu(); -15 WRITE_ONCE(y, 1); -16 } -</pre> -</blockquote> - -<p> -Because the <tt>synchronize_rcu()</tt> on line 14 waits for -all pre-existing readers, any instance of <tt>thread0()</tt> that -loads a value of zero from <tt>x</tt> must complete before -<tt>thread1()</tt> stores to <tt>y</tt>, so that instance must -also load a value of zero from <tt>y</tt>. -Similarly, any instance of <tt>thread0()</tt> that loads a value of -one from <tt>y</tt> must have started after the -<tt>synchronize_rcu()</tt> started, and must therefore also load -a value of one from <tt>x</tt>. -Therefore, the outcome: -<blockquote> -<pre> -(r1 == 0 && r2 == 1) -</pre> -</blockquote> -cannot happen. - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - Wait a minute! - You said that updaters can make useful forward progress concurrently - with readers, but pre-existing readers will block - <tt>synchronize_rcu()</tt>!!! - Just who are you trying to fool??? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - First, if updaters do not wish to be blocked by readers, they can use - <tt>call_rcu()</tt> or <tt>kfree_rcu()</tt>, which will - be discussed later. - Second, even when using <tt>synchronize_rcu()</tt>, the other - update-side code does run concurrently with readers, whether - pre-existing or not. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<p> -This scenario resembles one of the first uses of RCU in -<a href="https://en.wikipedia.org/wiki/DYNIX">DYNIX/ptx</a>, -which managed a distributed lock manager's transition into -a state suitable for handling recovery from node failure, -more or less as follows: - -<blockquote> -<pre> - 1 #define STATE_NORMAL 0 - 2 #define STATE_WANT_RECOVERY 1 - 3 #define STATE_RECOVERING 2 - 4 #define STATE_WANT_NORMAL 3 - 5 - 6 int state = STATE_NORMAL; - 7 - 8 void do_something_dlm(void) - 9 { -10 int state_snap; -11 -12 rcu_read_lock(); -13 state_snap = READ_ONCE(state); -14 if (state_snap == STATE_NORMAL) -15 do_something(); -16 else -17 do_something_carefully(); -18 rcu_read_unlock(); -19 } -20 -21 void start_recovery(void) -22 { -23 WRITE_ONCE(state, STATE_WANT_RECOVERY); -24 synchronize_rcu(); -25 WRITE_ONCE(state, STATE_RECOVERING); -26 recovery(); -27 WRITE_ONCE(state, STATE_WANT_NORMAL); -28 synchronize_rcu(); -29 WRITE_ONCE(state, STATE_NORMAL); -30 } -</pre> -</blockquote> - -<p> -The RCU read-side critical section in <tt>do_something_dlm()</tt> -works with the <tt>synchronize_rcu()</tt> in <tt>start_recovery()</tt> -to guarantee that <tt>do_something()</tt> never runs concurrently -with <tt>recovery()</tt>, but with little or no synchronization -overhead in <tt>do_something_dlm()</tt>. - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - Why is the <tt>synchronize_rcu()</tt> on line 28 needed? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - Without that extra grace period, memory reordering could result in - <tt>do_something_dlm()</tt> executing <tt>do_something()</tt> - concurrently with the last bits of <tt>recovery()</tt>. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<p> -In order to avoid fatal problems such as deadlocks, -an RCU read-side critical section must not contain calls to -<tt>synchronize_rcu()</tt>. -Similarly, an RCU read-side critical section must not -contain anything that waits, directly or indirectly, on completion of -an invocation of <tt>synchronize_rcu()</tt>. - -<p> -Although RCU's grace-period guarantee is useful in and of itself, with -<a href="https://lwn.net/Articles/573497/">quite a few use cases</a>, -it would be good to be able to use RCU to coordinate read-side -access to linked data structures. -For this, the grace-period guarantee is not sufficient, as can -be seen in function <tt>add_gp_buggy()</tt> below. -We will look at the reader's code later, but in the meantime, just think of -the reader as locklessly picking up the <tt>gp</tt> pointer, -and, if the value loaded is non-<tt>NULL</tt>, locklessly accessing the -<tt>->a</tt> and <tt>->b</tt> fields. - -<blockquote> -<pre> - 1 bool add_gp_buggy(int a, int b) - 2 { - 3 p = kmalloc(sizeof(*p), GFP_KERNEL); - 4 if (!p) - 5 return -ENOMEM; - 6 spin_lock(&gp_lock); - 7 if (rcu_access_pointer(gp)) { - 8 spin_unlock(&gp_lock); - 9 return false; -10 } -11 p->a = a; -12 p->b = a; -13 gp = p; /* ORDERING BUG */ -14 spin_unlock(&gp_lock); -15 return true; -16 } -</pre> -</blockquote> - -<p> -The problem is that both the compiler and weakly ordered CPUs are within -their rights to reorder this code as follows: - -<blockquote> -<pre> - 1 bool add_gp_buggy_optimized(int a, int b) - 2 { - 3 p = kmalloc(sizeof(*p), GFP_KERNEL); - 4 if (!p) - 5 return -ENOMEM; - 6 spin_lock(&gp_lock); - 7 if (rcu_access_pointer(gp)) { - 8 spin_unlock(&gp_lock); - 9 return false; -10 } -<b>11 gp = p; /* ORDERING BUG */ -12 p->a = a; -13 p->b = a;</b> -14 spin_unlock(&gp_lock); -15 return true; -16 } -</pre> -</blockquote> - -<p> -If an RCU reader fetches <tt>gp</tt> just after -<tt>add_gp_buggy_optimized</tt> executes line 11, -it will see garbage in the <tt>->a</tt> and <tt>->b</tt> -fields. -And this is but one of many ways in which compiler and hardware optimizations -could cause trouble. -Therefore, we clearly need some way to prevent the compiler and the CPU from -reordering in this manner, which brings us to the publish-subscribe -guarantee discussed in the next section. - -<h3><a name="Publish-Subscribe Guarantee">Publish/Subscribe Guarantee</a></h3> - -<p> -RCU's publish-subscribe guarantee allows data to be inserted -into a linked data structure without disrupting RCU readers. -The updater uses <tt>rcu_assign_pointer()</tt> to insert the -new data, and readers use <tt>rcu_dereference()</tt> to -access data, whether new or old. -The following shows an example of insertion: - -<blockquote> -<pre> - 1 bool add_gp(int a, int b) - 2 { - 3 p = kmalloc(sizeof(*p), GFP_KERNEL); - 4 if (!p) - 5 return -ENOMEM; - 6 spin_lock(&gp_lock); - 7 if (rcu_access_pointer(gp)) { - 8 spin_unlock(&gp_lock); - 9 return false; -10 } -11 p->a = a; -12 p->b = a; -13 rcu_assign_pointer(gp, p); -14 spin_unlock(&gp_lock); -15 return true; -16 } -</pre> -</blockquote> - -<p> -The <tt>rcu_assign_pointer()</tt> on line 13 is conceptually -equivalent to a simple assignment statement, but also guarantees -that its assignment will -happen after the two assignments in lines 11 and 12, -similar to the C11 <tt>memory_order_release</tt> store operation. -It also prevents any number of “interesting” compiler -optimizations, for example, the use of <tt>gp</tt> as a scratch -location immediately preceding the assignment. - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - But <tt>rcu_assign_pointer()</tt> does nothing to prevent the - two assignments to <tt>p->a</tt> and <tt>p->b</tt> - from being reordered. - Can't that also cause problems? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - No, it cannot. - The readers cannot see either of these two fields until - the assignment to <tt>gp</tt>, by which time both fields are - fully initialized. - So reordering the assignments - to <tt>p->a</tt> and <tt>p->b</tt> cannot possibly - cause any problems. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<p> -It is tempting to assume that the reader need not do anything special -to control its accesses to the RCU-protected data, -as shown in <tt>do_something_gp_buggy()</tt> below: - -<blockquote> -<pre> - 1 bool do_something_gp_buggy(void) - 2 { - 3 rcu_read_lock(); - 4 p = gp; /* OPTIMIZATIONS GALORE!!! */ - 5 if (p) { - 6 do_something(p->a, p->b); - 7 rcu_read_unlock(); - 8 return true; - 9 } -10 rcu_read_unlock(); -11 return false; -12 } -</pre> -</blockquote> - -<p> -However, this temptation must be resisted because there are a -surprisingly large number of ways that the compiler -(to say nothing of -<a href="https://h71000.www7.hp.com/wizard/wiz_2637.html">DEC Alpha CPUs</a>) -can trip this code up. -For but one example, if the compiler were short of registers, it -might choose to refetch from <tt>gp</tt> rather than keeping -a separate copy in <tt>p</tt> as follows: - -<blockquote> -<pre> - 1 bool do_something_gp_buggy_optimized(void) - 2 { - 3 rcu_read_lock(); - 4 if (gp) { /* OPTIMIZATIONS GALORE!!! */ -<b> 5 do_something(gp->a, gp->b);</b> - 6 rcu_read_unlock(); - 7 return true; - 8 } - 9 rcu_read_unlock(); -10 return false; -11 } -</pre> -</blockquote> - -<p> -If this function ran concurrently with a series of updates that -replaced the current structure with a new one, -the fetches of <tt>gp->a</tt> -and <tt>gp->b</tt> might well come from two different structures, -which could cause serious confusion. -To prevent this (and much else besides), <tt>do_something_gp()</tt> uses -<tt>rcu_dereference()</tt> to fetch from <tt>gp</tt>: - -<blockquote> -<pre> - 1 bool do_something_gp(void) - 2 { - 3 rcu_read_lock(); - 4 p = rcu_dereference(gp); - 5 if (p) { - 6 do_something(p->a, p->b); - 7 rcu_read_unlock(); - 8 return true; - 9 } -10 rcu_read_unlock(); -11 return false; -12 } -</pre> -</blockquote> - -<p> -The <tt>rcu_dereference()</tt> uses volatile casts and (for DEC Alpha) -memory barriers in the Linux kernel. -Should a -<a href="http://www.rdrop.com/users/paulmck/RCU/consume.2015.07.13a.pdf">high-quality implementation of C11 <tt>memory_order_consume</tt> [PDF]</a> -ever appear, then <tt>rcu_dereference()</tt> could be implemented -as a <tt>memory_order_consume</tt> load. -Regardless of the exact implementation, a pointer fetched by -<tt>rcu_dereference()</tt> may not be used outside of the -outermost RCU read-side critical section containing that -<tt>rcu_dereference()</tt>, unless protection of -the corresponding data element has been passed from RCU to some -other synchronization mechanism, most commonly locking or -<a href="https://www.kernel.org/doc/Documentation/RCU/rcuref.txt">reference counting</a>. - -<p> -In short, updaters use <tt>rcu_assign_pointer()</tt> and readers -use <tt>rcu_dereference()</tt>, and these two RCU API elements -work together to ensure that readers have a consistent view of -newly added data elements. - -<p> -Of course, it is also necessary to remove elements from RCU-protected -data structures, for example, using the following process: - -<ol> -<li> Remove the data element from the enclosing structure. -<li> Wait for all pre-existing RCU read-side critical sections - to complete (because only pre-existing readers can possibly have - a reference to the newly removed data element). -<li> At this point, only the updater has a reference to the - newly removed data element, so it can safely reclaim - the data element, for example, by passing it to <tt>kfree()</tt>. -</ol> - -This process is implemented by <tt>remove_gp_synchronous()</tt>: - -<blockquote> -<pre> - 1 bool remove_gp_synchronous(void) - 2 { - 3 struct foo *p; - 4 - 5 spin_lock(&gp_lock); - 6 p = rcu_access_pointer(gp); - 7 if (!p) { - 8 spin_unlock(&gp_lock); - 9 return false; -10 } -11 rcu_assign_pointer(gp, NULL); -12 spin_unlock(&gp_lock); -13 synchronize_rcu(); -14 kfree(p); -15 return true; -16 } -</pre> -</blockquote> - -<p> -This function is straightforward, with line 13 waiting for a grace -period before line 14 frees the old data element. -This waiting ensures that readers will reach line 7 of -<tt>do_something_gp()</tt> before the data element referenced by -<tt>p</tt> is freed. -The <tt>rcu_access_pointer()</tt> on line 6 is similar to -<tt>rcu_dereference()</tt>, except that: - -<ol> -<li> The value returned by <tt>rcu_access_pointer()</tt> - cannot be dereferenced. - If you want to access the value pointed to as well as - the pointer itself, use <tt>rcu_dereference()</tt> - instead of <tt>rcu_access_pointer()</tt>. -<li> The call to <tt>rcu_access_pointer()</tt> need not be - protected. - In contrast, <tt>rcu_dereference()</tt> must either be - within an RCU read-side critical section or in a code - segment where the pointer cannot change, for example, in - code protected by the corresponding update-side lock. -</ol> - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - Without the <tt>rcu_dereference()</tt> or the - <tt>rcu_access_pointer()</tt>, what destructive optimizations - might the compiler make use of? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - Let's start with what happens to <tt>do_something_gp()</tt> - if it fails to use <tt>rcu_dereference()</tt>. - It could reuse a value formerly fetched from this same pointer. - It could also fetch the pointer from <tt>gp</tt> in a byte-at-a-time - manner, resulting in <i>load tearing</i>, in turn resulting a bytewise - mash-up of two distinct pointer values. - It might even use value-speculation optimizations, where it makes - a wrong guess, but by the time it gets around to checking the - value, an update has changed the pointer to match the wrong guess. - Too bad about any dereferences that returned pre-initialization garbage - in the meantime! - </font> - - <p><font color="ffffff"> - For <tt>remove_gp_synchronous()</tt>, as long as all modifications - to <tt>gp</tt> are carried out while holding <tt>gp_lock</tt>, - the above optimizations are harmless. - However, <tt>sparse</tt> will complain if you - define <tt>gp</tt> with <tt>__rcu</tt> and then - access it without using - either <tt>rcu_access_pointer()</tt> or <tt>rcu_dereference()</tt>. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<p> -In short, RCU's publish-subscribe guarantee is provided by the combination -of <tt>rcu_assign_pointer()</tt> and <tt>rcu_dereference()</tt>. -This guarantee allows data elements to be safely added to RCU-protected -linked data structures without disrupting RCU readers. -This guarantee can be used in combination with the grace-period -guarantee to also allow data elements to be removed from RCU-protected -linked data structures, again without disrupting RCU readers. - -<p> -This guarantee was only partially premeditated. -DYNIX/ptx used an explicit memory barrier for publication, but had nothing -resembling <tt>rcu_dereference()</tt> for subscription, nor did it -have anything resembling the <tt>smp_read_barrier_depends()</tt> -that was later subsumed into <tt>rcu_dereference()</tt> and later -still into <tt>READ_ONCE()</tt>. -The need for these operations made itself known quite suddenly at a -late-1990s meeting with the DEC Alpha architects, back in the days when -DEC was still a free-standing company. -It took the Alpha architects a good hour to convince me that any sort -of barrier would ever be needed, and it then took me a good <i>two</i> hours -to convince them that their documentation did not make this point clear. -More recent work with the C and C++ standards committees have provided -much education on tricks and traps from the compiler. -In short, compilers were much less tricky in the early 1990s, but in -2015, don't even think about omitting <tt>rcu_dereference()</tt>! - -<h3><a name="Memory-Barrier Guarantees">Memory-Barrier Guarantees</a></h3> - -<p> -The previous section's simple linked-data-structure scenario clearly -demonstrates the need for RCU's stringent memory-ordering guarantees on -systems with more than one CPU: - -<ol> -<li> Each CPU that has an RCU read-side critical section that - begins before <tt>synchronize_rcu()</tt> starts is - guaranteed to execute a full memory barrier between the time - that the RCU read-side critical section ends and the time that - <tt>synchronize_rcu()</tt> returns. - Without this guarantee, a pre-existing RCU read-side critical section - might hold a reference to the newly removed <tt>struct foo</tt> - after the <tt>kfree()</tt> on line 14 of - <tt>remove_gp_synchronous()</tt>. -<li> Each CPU that has an RCU read-side critical section that ends - after <tt>synchronize_rcu()</tt> returns is guaranteed - to execute a full memory barrier between the time that - <tt>synchronize_rcu()</tt> begins and the time that the RCU - read-side critical section begins. - Without this guarantee, a later RCU read-side critical section - running after the <tt>kfree()</tt> on line 14 of - <tt>remove_gp_synchronous()</tt> might - later run <tt>do_something_gp()</tt> and find the - newly deleted <tt>struct foo</tt>. -<li> If the task invoking <tt>synchronize_rcu()</tt> remains - on a given CPU, then that CPU is guaranteed to execute a full - memory barrier sometime during the execution of - <tt>synchronize_rcu()</tt>. - This guarantee ensures that the <tt>kfree()</tt> on - line 14 of <tt>remove_gp_synchronous()</tt> really does - execute after the removal on line 11. -<li> If the task invoking <tt>synchronize_rcu()</tt> migrates - among a group of CPUs during that invocation, then each of the - CPUs in that group is guaranteed to execute a full memory barrier - sometime during the execution of <tt>synchronize_rcu()</tt>. - This guarantee also ensures that the <tt>kfree()</tt> on - line 14 of <tt>remove_gp_synchronous()</tt> really does - execute after the removal on - line 11, but also in the case where the thread executing the - <tt>synchronize_rcu()</tt> migrates in the meantime. -</ol> - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - Given that multiple CPUs can start RCU read-side critical sections - at any time without any ordering whatsoever, how can RCU possibly - tell whether or not a given RCU read-side critical section starts - before a given instance of <tt>synchronize_rcu()</tt>? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - If RCU cannot tell whether or not a given - RCU read-side critical section starts before a - given instance of <tt>synchronize_rcu()</tt>, - then it must assume that the RCU read-side critical section - started first. - In other words, a given instance of <tt>synchronize_rcu()</tt> - can avoid waiting on a given RCU read-side critical section only - if it can prove that <tt>synchronize_rcu()</tt> started first. - </font> - - <p><font color="ffffff"> - A related question is “When <tt>rcu_read_lock()</tt> - doesn't generate any code, why does it matter how it relates - to a grace period?” - The answer is that it is not the relationship of - <tt>rcu_read_lock()</tt> itself that is important, but rather - the relationship of the code within the enclosed RCU read-side - critical section to the code preceding and following the - grace period. - If we take this viewpoint, then a given RCU read-side critical - section begins before a given grace period when some access - preceding the grace period observes the effect of some access - within the critical section, in which case none of the accesses - within the critical section may observe the effects of any - access following the grace period. - </font> - - <p><font color="ffffff"> - As of late 2016, mathematical models of RCU take this - viewpoint, for example, see slides 62 and 63 - of the - <a href="http://www2.rdrop.com/users/paulmck/scalability/paper/LinuxMM.2016.10.04c.LCE.pdf">2016 LinuxCon EU</a> - presentation. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - The first and second guarantees require unbelievably strict ordering! - Are all these memory barriers <i> really</i> required? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - Yes, they really are required. - To see why the first guarantee is required, consider the following - sequence of events: - </font> - - <ol> - <li> <font color="ffffff"> - CPU 1: <tt>rcu_read_lock()</tt> - </font> - <li> <font color="ffffff"> - CPU 1: <tt>q = rcu_dereference(gp); - /* Very likely to return p. */</tt> - </font> - <li> <font color="ffffff"> - CPU 0: <tt>list_del_rcu(p);</tt> - </font> - <li> <font color="ffffff"> - CPU 0: <tt>synchronize_rcu()</tt> starts. - </font> - <li> <font color="ffffff"> - CPU 1: <tt>do_something_with(q->a); - /* No smp_mb(), so might happen after kfree(). */</tt> - </font> - <li> <font color="ffffff"> - CPU 1: <tt>rcu_read_unlock()</tt> - </font> - <li> <font color="ffffff"> - CPU 0: <tt>synchronize_rcu()</tt> returns. - </font> - <li> <font color="ffffff"> - CPU 0: <tt>kfree(p);</tt> - </font> - </ol> - - <p><font color="ffffff"> - Therefore, there absolutely must be a full memory barrier between the - end of the RCU read-side critical section and the end of the - grace period. - </font> - - <p><font color="ffffff"> - The sequence of events demonstrating the necessity of the second rule - is roughly similar: - </font> - - <ol> - <li> <font color="ffffff">CPU 0: <tt>list_del_rcu(p);</tt> - </font> - <li> <font color="ffffff">CPU 0: <tt>synchronize_rcu()</tt> starts. - </font> - <li> <font color="ffffff">CPU 1: <tt>rcu_read_lock()</tt> - </font> - <li> <font color="ffffff">CPU 1: <tt>q = rcu_dereference(gp); - /* Might return p if no memory barrier. */</tt> - </font> - <li> <font color="ffffff">CPU 0: <tt>synchronize_rcu()</tt> returns. - </font> - <li> <font color="ffffff">CPU 0: <tt>kfree(p);</tt> - </font> - <li> <font color="ffffff"> - CPU 1: <tt>do_something_with(q->a); /* Boom!!! */</tt> - </font> - <li> <font color="ffffff">CPU 1: <tt>rcu_read_unlock()</tt> - </font> - </ol> - - <p><font color="ffffff"> - And similarly, without a memory barrier between the beginning of the - grace period and the beginning of the RCU read-side critical section, - CPU 1 might end up accessing the freelist. - </font> - - <p><font color="ffffff"> - The “as if” rule of course applies, so that any - implementation that acts as if the appropriate memory barriers - were in place is a correct implementation. - That said, it is much easier to fool yourself into believing - that you have adhered to the as-if rule than it is to actually - adhere to it! -</font></td></tr> -<tr><td> </td></tr> -</table> - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - You claim that <tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt> - generate absolutely no code in some kernel builds. - This means that the compiler might arbitrarily rearrange consecutive - RCU read-side critical sections. - Given such rearrangement, if a given RCU read-side critical section - is done, how can you be sure that all prior RCU read-side critical - sections are done? - Won't the compiler rearrangements make that impossible to determine? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - In cases where <tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt> - generate absolutely no code, RCU infers quiescent states only at - special locations, for example, within the scheduler. - Because calls to <tt>schedule()</tt> had better prevent calling-code - accesses to shared variables from being rearranged across the call to - <tt>schedule()</tt>, if RCU detects the end of a given RCU read-side - critical section, it will necessarily detect the end of all prior - RCU read-side critical sections, no matter how aggressively the - compiler scrambles the code. - </font> - - <p><font color="ffffff"> - Again, this all assumes that the compiler cannot scramble code across - calls to the scheduler, out of interrupt handlers, into the idle loop, - into user-mode code, and so on. - But if your kernel build allows that sort of scrambling, you have broken - far more than just RCU! -</font></td></tr> -<tr><td> </td></tr> -</table> - -<p> -Note that these memory-barrier requirements do not replace the fundamental -RCU requirement that a grace period wait for all pre-existing readers. -On the contrary, the memory barriers called out in this section must operate in -such a way as to <i>enforce</i> this fundamental requirement. -Of course, different implementations enforce this requirement in different -ways, but enforce it they must. - -<h3><a name="RCU Primitives Guaranteed to Execute Unconditionally">RCU Primitives Guaranteed to Execute Unconditionally</a></h3> - -<p> -The common-case RCU primitives are unconditional. -They are invoked, they do their job, and they return, with no possibility -of error, and no need to retry. -This is a key RCU design philosophy. - -<p> -However, this philosophy is pragmatic rather than pigheaded. -If someone comes up with a good justification for a particular conditional -RCU primitive, it might well be implemented and added. -After all, this guarantee was reverse-engineered, not premeditated. -The unconditional nature of the RCU primitives was initially an -accident of implementation, and later experience with synchronization -primitives with conditional primitives caused me to elevate this -accident to a guarantee. -Therefore, the justification for adding a conditional primitive to -RCU would need to be based on detailed and compelling use cases. - -<h3><a name="Guaranteed Read-to-Write Upgrade">Guaranteed Read-to-Write Upgrade</a></h3> - -<p> -As far as RCU is concerned, it is always possible to carry out an -update within an RCU read-side critical section. -For example, that RCU read-side critical section might search for -a given data element, and then might acquire the update-side -spinlock in order to update that element, all while remaining -in that RCU read-side critical section. -Of course, it is necessary to exit the RCU read-side critical section -before invoking <tt>synchronize_rcu()</tt>, however, this -inconvenience can be avoided through use of the -<tt>call_rcu()</tt> and <tt>kfree_rcu()</tt> API members -described later in this document. - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - But how does the upgrade-to-write operation exclude other readers? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - It doesn't, just like normal RCU updates, which also do not exclude - RCU readers. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<p> -This guarantee allows lookup code to be shared between read-side -and update-side code, and was premeditated, appearing in the earliest -DYNIX/ptx RCU documentation. - -<h2><a name="Fundamental Non-Requirements">Fundamental Non-Requirements</a></h2> - -<p> -RCU provides extremely lightweight readers, and its read-side guarantees, -though quite useful, are correspondingly lightweight. -It is therefore all too easy to assume that RCU is guaranteeing more -than it really is. -Of course, the list of things that RCU does not guarantee is infinitely -long, however, the following sections list a few non-guarantees that -have caused confusion. -Except where otherwise noted, these non-guarantees were premeditated. - -<ol> -<li> <a href="#Readers Impose Minimal Ordering"> - Readers Impose Minimal Ordering</a> -<li> <a href="#Readers Do Not Exclude Updaters"> - Readers Do Not Exclude Updaters</a> -<li> <a href="#Updaters Only Wait For Old Readers"> - Updaters Only Wait For Old Readers</a> -<li> <a href="#Grace Periods Don't Partition Read-Side Critical Sections"> - Grace Periods Don't Partition Read-Side Critical Sections</a> -<li> <a href="#Read-Side Critical Sections Don't Partition Grace Periods"> - Read-Side Critical Sections Don't Partition Grace Periods</a> -</ol> - -<h3><a name="Readers Impose Minimal Ordering">Readers |