author     Linus Torvalds <torvalds@linux-foundation.org>    2023-02-21 10:45:51 -0800
committer  Linus Torvalds <torvalds@linux-foundation.org>    2023-02-21 10:45:51 -0800
commit     8cc01d43f882fa1f44d8aa6727a6ea783d8fbe3f
tree       053ed1940a0ddb7ff2972c05637edf820772cbb8 /kernel
parent     8ca8d89b43caf9a02a18414d6eeff966d2b14512
parent     bba8d3d17dc2678f9647962900aa421a18c25320
Merge tag 'rcu.2023.02.10a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull RCU updates from Paul McKenney:
- Documentation updates
- Miscellaneous fixes, perhaps most notably:
- Throttling callback invocation based on the number of callbacks
that are now ready to invoke instead of on the total number of
callbacks
- Several patches that suppress false-positive boot-time
diagnostics, for example, due to lockdep not yet being
initialized
- Make expedited RCU CPU stall warnings dump stacks of any tasks
that are blocking the stalled grace period. (Normal RCU CPU
stall warnings have done this for many years)
- Lazy-callback fixes to avoid delays during boot, suspend, and
resume. (Note that lazy callbacks must be explicitly enabled, so
this should not (yet) affect production use cases)
- Make kfree_rcu() and friends take advantage of polled grace periods,
thus reducing memory footprint by almost two orders of magnitude,
admittedly on a microbenchmark
  This also begins the transition from kfree_rcu(p) to
  kfree_rcu_mightsleep(p). This transition was motivated by bugs where
  the single-argument kfree_rcu(p), which can block, was typed instead
  of the intended two-argument kfree_rcu(p, rh) (see the first sketch
  after this list)
- SRCU updates, perhaps most notably fixing a bug that causes SRCU to
fail when booted on a system with a non-zero boot CPU. This
surprising situation actually happens for kdump kernels on the
powerpc architecture
  This also adds srcu_down_read() and srcu_up_read(), which act like
  srcu_read_lock() and srcu_read_unlock(), but allow an SRCU read-side
  critical section to be handed off from one task to another (see the
  second sketch after this list)
- Clean up the now-useless SRCU Kconfig option
There are a few more commits that are not yet acked or pulled into
maintainer trees, and these will be in a pull request for a later
merge window
- RCU-tasks updates, perhaps most notably these fixes:
- A strange interaction between PID-namespace unshare and the
RCU-tasks grace period that results in a low-probability but
very real hang
- A race between an RCU tasks rude grace period on a single-CPU
system and CPU-hotplug addition of the second CPU that can
result in a too-short grace period
- A race between shrinking RCU tasks down to a single callback
list and queuing a new callback to some other CPU, but where
that queuing is delayed for more than an RCU grace period. This
can result in that callback being stranded on the non-boot CPU
- Torture-test updates and fixes
- Torture-test scripting updates and fixes
- Provide additional RCU CPU stall-warning information in kernels built
with CONFIG_RCU_CPU_STALL_CPUTIME=y, and restore the full five-minute
timeout limit for expedited RCU CPU stall warnings
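
A minimal sketch of the kfree_rcu() naming issue noted above; struct foo,
its rh field, and the release_foo_*() helpers are hypothetical and not
taken from this series:

    #include <linux/rcupdate.h>
    #include <linux/slab.h>

    struct foo {
            struct rcu_head rh;
            int data;
    };

    /* Two-argument form: does not sleep; queues p->rh and returns. */
    static void release_foo_fast(struct foo *p)
    {
            kfree_rcu(p, rh);
    }

    /*
     * Single-argument form: no rcu_head is supplied, so the call may
     * block waiting for a grace period.  The new name makes that
     * explicit, so a stray kfree_rcu(p) can no longer silently turn
     * the non-blocking two-argument form into a blocking call.
     */
    static void release_foo_slow(struct foo *p)
    {
            kfree_rcu_mightsleep(p);
    }

And a minimal sketch of the srcu_down_read()/srcu_up_read() handoff
described above, assuming a hypothetical my_srcu domain and a struct
handoff that carries the returned index through a workqueue:

    #include <linux/slab.h>
    #include <linux/srcu.h>
    #include <linux/workqueue.h>

    DEFINE_SRCU(my_srcu);

    struct handoff {
            struct work_struct work;
            int srcu_idx;                   /* index returned by srcu_down_read() */
    };

    static void finish_request(struct work_struct *work)
    {
            struct handoff *h = container_of(work, struct handoff, work);

            /* ... use SRCU-protected data ... */
            srcu_up_read(&my_srcu, h->srcu_idx);    /* ends the section begun below */
            kfree(h);
    }

    static int start_request(void)
    {
            struct handoff *h = kzalloc(sizeof(*h), GFP_KERNEL);

            if (!h)
                    return -ENOMEM;
            /* Like srcu_read_lock(), but the matching release may run in another task. */
            h->srcu_idx = srcu_down_read(&my_srcu);
            INIT_WORK(&h->work, finish_request);
            schedule_work(&h->work);
            return 0;
    }
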
* tag 'rcu.2023.02.10a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (80 commits)
rcu/kvfree: Add kvfree_rcu_mightsleep() and kfree_rcu_mightsleep()
kernel/notifier: Remove CONFIG_SRCU
init: Remove "select SRCU"
fs/quota: Remove "select SRCU"
fs/notify: Remove "select SRCU"
fs/btrfs: Remove "select SRCU"
fs: Remove CONFIG_SRCU
drivers/pci/controller: Remove "select SRCU"
drivers/net: Remove "select SRCU"
drivers/md: Remove "select SRCU"
drivers/hwtracing/stm: Remove "select SRCU"
drivers/dax: Remove "select SRCU"
drivers/base: Remove CONFIG_SRCU
rcu: Disable laziness if lazy-tracking says so
rcu: Track laziness during boot and suspend
rcu: Remove redundant call to rcu_boost_kthread_setaffinity()
rcu: Allow up to five minutes expedited RCU CPU stall-warning timeouts
rcu: Align the output of RCU CPU stall warning messages
rcu: Add RCU stall diagnosis information
sched: Add helper nr_context_switches_cpu()
...
Diffstat (limited to 'kernel')
 kernel/locking/locktorture.c | 101
 kernel/notifier.c            |   3
 kernel/pid_namespace.c       |  17
 kernel/rcu/Kconfig.debug     |  15
 kernel/rcu/rcu.h             |   8
 kernel/rcu/rcu_segcblist.c   |   2
 kernel/rcu/rcu_segcblist.h   |   2
 kernel/rcu/rcutorture.c      |  12
 kernel/rcu/refscale.c        | 250
 kernel/rcu/srcutree.c        |  98
 kernel/rcu/tasks.h           |  85
 kernel/rcu/tiny.c            |   9
 kernel/rcu/tree.c            | 657
 kernel/rcu/tree.h            |  19
 kernel/rcu/tree_exp.h        |  43
 kernel/rcu/tree_stall.h      |  37
 kernel/rcu/update.c          |  49
 kernel/sched/core.c          |   5
 kernel/torture.c             |   4
19 files changed, 1012 insertions(+), 404 deletions(-)
diff --git a/kernel/locking/locktorture.c b/kernel/locking/locktorture.c index 9c2fb613a55d..f04b1978899d 100644 --- a/kernel/locking/locktorture.c +++ b/kernel/locking/locktorture.c @@ -46,6 +46,9 @@ torture_param(int, shutdown_secs, 0, "Shutdown time (j), <= zero to disable."); torture_param(int, stat_interval, 60, "Number of seconds between stats printk()s"); torture_param(int, stutter, 5, "Number of jiffies to run/halt test, 0=disable"); +torture_param(int, rt_boost, 2, + "Do periodic rt-boost. 0=Disable, 1=Only for rt_mutex, 2=For all lock types."); +torture_param(int, rt_boost_factor, 50, "A factor determining how often rt-boost happens."); torture_param(int, verbose, 1, "Enable verbose debugging printk()s"); @@ -127,15 +130,50 @@ static void torture_lock_busted_write_unlock(int tid __maybe_unused) /* BUGGY, do not use in real life!!! */ } -static void torture_boost_dummy(struct torture_random_state *trsp) +static void __torture_rt_boost(struct torture_random_state *trsp) { - /* Only rtmutexes care about priority */ + const unsigned int factor = rt_boost_factor; + + if (!rt_task(current)) { + /* + * Boost priority once every rt_boost_factor operations. When + * the task tries to take the lock, the rtmutex it will account + * for the new priority, and do any corresponding pi-dance. + */ + if (trsp && !(torture_random(trsp) % + (cxt.nrealwriters_stress * factor))) { + sched_set_fifo(current); + } else /* common case, do nothing */ + return; + } else { + /* + * The task will remain boosted for another 10 * rt_boost_factor + * operations, then restored back to its original prio, and so + * forth. + * + * When @trsp is nil, we want to force-reset the task for + * stopping the kthread. + */ + if (!trsp || !(torture_random(trsp) % + (cxt.nrealwriters_stress * factor * 2))) { + sched_set_normal(current, 0); + } else /* common case, do nothing */ + return; + } +} + +static void torture_rt_boost(struct torture_random_state *trsp) +{ + if (rt_boost != 2) + return; + + __torture_rt_boost(trsp); } static struct lock_torture_ops lock_busted_ops = { .writelock = torture_lock_busted_write_lock, .write_delay = torture_lock_busted_write_delay, - .task_boost = torture_boost_dummy, + .task_boost = torture_rt_boost, .writeunlock = torture_lock_busted_write_unlock, .readlock = NULL, .read_delay = NULL, @@ -179,7 +217,7 @@ __releases(torture_spinlock) static struct lock_torture_ops spin_lock_ops = { .writelock = torture_spin_lock_write_lock, .write_delay = torture_spin_lock_write_delay, - .task_boost = torture_boost_dummy, + .task_boost = torture_rt_boost, .writeunlock = torture_spin_lock_write_unlock, .readlock = NULL, .read_delay = NULL, @@ -206,7 +244,7 @@ __releases(torture_spinlock) static struct lock_torture_ops spin_lock_irq_ops = { .writelock = torture_spin_lock_write_lock_irq, .write_delay = torture_spin_lock_write_delay, - .task_boost = torture_boost_dummy, + .task_boost = torture_rt_boost, .writeunlock = torture_lock_spin_write_unlock_irq, .readlock = NULL, .read_delay = NULL, @@ -275,7 +313,7 @@ __releases(torture_rwlock) static struct lock_torture_ops rw_lock_ops = { .writelock = torture_rwlock_write_lock, .write_delay = torture_rwlock_write_delay, - .task_boost = torture_boost_dummy, + .task_boost = torture_rt_boost, .writeunlock = torture_rwlock_write_unlock, .readlock = torture_rwlock_read_lock, .read_delay = torture_rwlock_read_delay, @@ -318,7 +356,7 @@ __releases(torture_rwlock) static struct lock_torture_ops rw_lock_irq_ops = { .writelock = torture_rwlock_write_lock_irq, .write_delay = 
torture_rwlock_write_delay, - .task_boost = torture_boost_dummy, + .task_boost = torture_rt_boost, .writeunlock = torture_rwlock_write_unlock_irq, .readlock = torture_rwlock_read_lock_irq, .read_delay = torture_rwlock_read_delay, @@ -358,7 +396,7 @@ __releases(torture_mutex) static struct lock_torture_ops mutex_lock_ops = { .writelock = torture_mutex_lock, .write_delay = torture_mutex_delay, - .task_boost = torture_boost_dummy, + .task_boost = torture_rt_boost, .writeunlock = torture_mutex_unlock, .readlock = NULL, .read_delay = NULL, @@ -456,7 +494,7 @@ static struct lock_torture_ops ww_mutex_lock_ops = { .exit = torture_ww_mutex_exit, .writelock = torture_ww_mutex_lock, .write_delay = torture_mutex_delay, - .task_boost = torture_boost_dummy, + .task_boost = torture_rt_boost, .writeunlock = torture_ww_mutex_unlock, .readlock = NULL, .read_delay = NULL, @@ -474,37 +512,6 @@ __acquires(torture_rtmutex) return 0; } -static void torture_rtmutex_boost(struct torture_random_state *trsp) -{ - const unsigned int factor = 50000; /* yes, quite arbitrary */ - - if (!rt_task(current)) { - /* - * Boost priority once every ~50k operations. When the - * task tries to take the lock, the rtmutex it will account - * for the new priority, and do any corresponding pi-dance. - */ - if (trsp && !(torture_random(trsp) % - (cxt.nrealwriters_stress * factor))) { - sched_set_fifo(current); - } else /* common case, do nothing */ - return; - } else { - /* - * The task will remain boosted for another ~500k operations, - * then restored back to its original prio, and so forth. - * - * When @trsp is nil, we want to force-reset the task for - * stopping the kthread. - */ - if (!trsp || !(torture_random(trsp) % - (cxt.nrealwriters_stress * factor * 2))) { - sched_set_normal(current, 0); - } else /* common case, do nothing */ - return; - } -} - static void torture_rtmutex_delay(struct torture_random_state *trsp) { const unsigned long shortdelay_us = 2; @@ -530,10 +537,18 @@ __releases(torture_rtmutex) rt_mutex_unlock(&torture_rtmutex); } +static void torture_rt_boost_rtmutex(struct torture_random_state *trsp) +{ + if (!rt_boost) + return; + + __torture_rt_boost(trsp); +} + static struct lock_torture_ops rtmutex_lock_ops = { .writelock = torture_rtmutex_lock, .write_delay = torture_rtmutex_delay, - .task_boost = torture_rtmutex_boost, + .task_boost = torture_rt_boost_rtmutex, .writeunlock = torture_rtmutex_unlock, .readlock = NULL, .read_delay = NULL, @@ -600,7 +615,7 @@ __releases(torture_rwsem) static struct lock_torture_ops rwsem_lock_ops = { .writelock = torture_rwsem_down_write, .write_delay = torture_rwsem_write_delay, - .task_boost = torture_boost_dummy, + .task_boost = torture_rt_boost, .writeunlock = torture_rwsem_up_write, .readlock = torture_rwsem_down_read, .read_delay = torture_rwsem_read_delay, @@ -652,7 +667,7 @@ static struct lock_torture_ops percpu_rwsem_lock_ops = { .exit = torture_percpu_rwsem_exit, .writelock = torture_percpu_rwsem_down_write, .write_delay = torture_rwsem_write_delay, - .task_boost = torture_boost_dummy, + .task_boost = torture_rt_boost, .writeunlock = torture_percpu_rwsem_up_write, .readlock = torture_percpu_rwsem_down_read, .read_delay = torture_rwsem_read_delay, diff --git a/kernel/notifier.c b/kernel/notifier.c index ab75637fd904..d353e4b5402d 100644 --- a/kernel/notifier.c +++ b/kernel/notifier.c @@ -456,7 +456,6 @@ int raw_notifier_call_chain(struct raw_notifier_head *nh, } EXPORT_SYMBOL_GPL(raw_notifier_call_chain); -#ifdef CONFIG_SRCU /* * SRCU notifier chain routines. 
Registration and unregistration * use a mutex, and call_chain is synchronized by SRCU (no locks). @@ -573,8 +572,6 @@ void srcu_init_notifier_head(struct srcu_notifier_head *nh) } EXPORT_SYMBOL_GPL(srcu_init_notifier_head); -#endif /* CONFIG_SRCU */ - static ATOMIC_NOTIFIER_HEAD(die_chain); int notrace notify_die(enum die_val val, const char *str, diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c index f4f8cb0435b4..fc21c5d5fd5d 100644 --- a/kernel/pid_namespace.c +++ b/kernel/pid_namespace.c @@ -244,7 +244,24 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns) set_current_state(TASK_INTERRUPTIBLE); if (pid_ns->pid_allocated == init_pids) break; + /* + * Release tasks_rcu_exit_srcu to avoid following deadlock: + * + * 1) TASK A unshare(CLONE_NEWPID) + * 2) TASK A fork() twice -> TASK B (child reaper for new ns) + * and TASK C + * 3) TASK B exits, kills TASK C, waits for TASK A to reap it + * 4) TASK A calls synchronize_rcu_tasks() + * -> synchronize_srcu(tasks_rcu_exit_srcu) + * 5) *DEADLOCK* + * + * It is considered safe to release tasks_rcu_exit_srcu here + * because we assume the current task can not be concurrently + * reaped at this point. + */ + exit_tasks_rcu_stop(); schedule(); + exit_tasks_rcu_start(); } __set_current_state(TASK_RUNNING); diff --git a/kernel/rcu/Kconfig.debug b/kernel/rcu/Kconfig.debug index 232e29fe3e5e..2984de629f74 100644 --- a/kernel/rcu/Kconfig.debug +++ b/kernel/rcu/Kconfig.debug @@ -82,7 +82,7 @@ config RCU_CPU_STALL_TIMEOUT config RCU_EXP_CPU_STALL_TIMEOUT int "Expedited RCU CPU stall timeout in milliseconds" depends on RCU_STALL_COMMON - range 0 21000 + range 0 300000 default 0 help If a given expedited RCU grace period extends more than the @@ -92,6 +92,19 @@ config RCU_EXP_CPU_STALL_TIMEOUT says to use the RCU_CPU_STALL_TIMEOUT value converted from seconds to milliseconds. +config RCU_CPU_STALL_CPUTIME + bool "Provide additional RCU stall debug information" + depends on RCU_STALL_COMMON + default n + help + Collect statistics during the sampling period, such as the number of + (hard interrupts, soft interrupts, task switches) and the cputime of + (hard interrupts, soft interrupts, kernel tasks) are added to the + RCU stall report. For multiple continuous RCU stalls, all sampling + periods begin at half of the first RCU stall timeout. + The boot option rcupdate.rcu_cpu_stall_cputime has the same function + as this one, but will override this if it exists. + config RCU_TRACE bool "Enable tracing for RCU" depends on DEBUG_KERNEL diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h index c5aa934de59b..115616ac3bfa 100644 --- a/kernel/rcu/rcu.h +++ b/kernel/rcu/rcu.h @@ -224,6 +224,8 @@ extern int rcu_cpu_stall_ftrace_dump; extern int rcu_cpu_stall_suppress; extern int rcu_cpu_stall_timeout; extern int rcu_exp_cpu_stall_timeout; +extern int rcu_cpu_stall_cputime; +extern bool rcu_exp_stall_task_details __read_mostly; int rcu_jiffies_till_stall_check(void); int rcu_exp_jiffies_till_stall_check(void); @@ -447,14 +449,20 @@ do { \ /* Tiny RCU doesn't expedite, as its purpose in life is instead to be tiny. 
*/ static inline bool rcu_gp_is_normal(void) { return true; } static inline bool rcu_gp_is_expedited(void) { return false; } +static inline bool rcu_async_should_hurry(void) { return false; } static inline void rcu_expedite_gp(void) { } static inline void rcu_unexpedite_gp(void) { } +static inline void rcu_async_hurry(void) { } +static inline void rcu_async_relax(void) { } static inline void rcu_request_urgent_qs_task(struct task_struct *t) { } #else /* #ifdef CONFIG_TINY_RCU */ bool rcu_gp_is_normal(void); /* Internal RCU use. */ bool rcu_gp_is_expedited(void); /* Internal RCU use. */ +bool rcu_async_should_hurry(void); /* Internal RCU use. */ void rcu_expedite_gp(void); void rcu_unexpedite_gp(void); +void rcu_async_hurry(void); +void rcu_async_relax(void); void rcupdate_announce_bootup_oddness(void); #ifdef CONFIG_TASKS_RCU_GENERIC void show_rcu_tasks_gp_kthreads(void); diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c index c54ea2b6a36b..f71fac422c8f 100644 --- a/kernel/rcu/rcu_segcblist.c +++ b/kernel/rcu/rcu_segcblist.c @@ -89,7 +89,7 @@ static void rcu_segcblist_set_len(struct rcu_segcblist *rsclp, long v) } /* Get the length of a segment of the rcu_segcblist structure. */ -static long rcu_segcblist_get_seglen(struct rcu_segcblist *rsclp, int seg) +long rcu_segcblist_get_seglen(struct rcu_segcblist *rsclp, int seg) { return READ_ONCE(rsclp->seglen[seg]); } diff --git a/kernel/rcu/rcu_segcblist.h b/kernel/rcu/rcu_segcblist.h index 431cee212467..4fe877f5f654 100644 --- a/kernel/rcu/rcu_segcblist.h +++ b/kernel/rcu/rcu_segcblist.h @@ -15,6 +15,8 @@ static inline long rcu_cblist_n_cbs(struct rcu_cblist *rclp) return READ_ONCE(rclp->len); } +long rcu_segcblist_get_seglen(struct rcu_segcblist *rsclp, int seg); + /* Return number of callbacks in segmented callback list by summing seglen. 
*/ long rcu_segcblist_n_segment_cbs(struct rcu_segcblist *rsclp); diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index 634df26a2c27..8e6c023212cb 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -399,7 +399,7 @@ static int torture_readlock_not_held(void) return rcu_read_lock_bh_held() || rcu_read_lock_sched_held(); } -static int rcu_torture_read_lock(void) __acquires(RCU) +static int rcu_torture_read_lock(void) { rcu_read_lock(); return 0; @@ -441,7 +441,7 @@ rcu_read_delay(struct torture_random_state *rrsp, struct rt_read_seg *rtrsp) } } -static void rcu_torture_read_unlock(int idx) __releases(RCU) +static void rcu_torture_read_unlock(int idx) { rcu_read_unlock(); } @@ -625,7 +625,7 @@ static struct srcu_struct srcu_ctld; static struct srcu_struct *srcu_ctlp = &srcu_ctl; static struct rcu_torture_ops srcud_ops; -static int srcu_torture_read_lock(void) __acquires(srcu_ctlp) +static int srcu_torture_read_lock(void) { if (cur_ops == &srcud_ops) return srcu_read_lock_nmisafe(srcu_ctlp); @@ -652,7 +652,7 @@ srcu_read_delay(struct torture_random_state *rrsp, struct rt_read_seg *rtrsp) } } -static void srcu_torture_read_unlock(int idx) __releases(srcu_ctlp) +static void srcu_torture_read_unlock(int idx) { if (cur_ops == &srcud_ops) srcu_read_unlock_nmisafe(srcu_ctlp, idx); @@ -814,13 +814,13 @@ static void synchronize_rcu_trivial(void) } } -static int rcu_torture_read_lock_trivial(void) __acquires(RCU) +static int rcu_torture_read_lock_trivial(void) { preempt_disable(); return 0; } -static void rcu_torture_read_unlock_trivial(int idx) __releases(RCU) +static void rcu_torture_read_unlock_trivial(int idx) { preempt_enable(); } diff --git a/kernel/rcu/refscale.c b/kernel/rcu/refscale.c index 435c884c02b5..afa3e1a2f690 100644 --- a/kernel/rcu/refscale.c +++ b/kernel/rcu/refscale.c @@ -76,6 +76,8 @@ torture_param(int, verbose_batched, 0, "Batch verbose debugging printk()s"); // Wait until there are multiple CPUs before starting test. torture_param(int, holdoff, IS_BUILTIN(CONFIG_RCU_REF_SCALE_TEST) ? 10 : 0, "Holdoff time before test start (s)"); +// Number of typesafe_lookup structures, that is, the degree of concurrency. +torture_param(long, lookup_instances, 0, "Number of typesafe_lookup structures."); // Number of loops per experiment, all readers execute operations concurrently. torture_param(long, loops, 10000, "Number of loops per experiment."); // Number of readers, with -1 defaulting to about 75% of the CPUs. @@ -124,7 +126,7 @@ static int exp_idx; // Operations vector for selecting different types of tests. 
struct ref_scale_ops { - void (*init)(void); + bool (*init)(void); void (*cleanup)(void); void (*readsection)(const int nloops); void (*delaysection)(const int nloops, const int udl, const int ndl); @@ -162,8 +164,9 @@ static void ref_rcu_delay_section(const int nloops, const int udl, const int ndl } } -static void rcu_sync_scale_init(void) +static bool rcu_sync_scale_init(void) { + return true; } static struct ref_scale_ops rcu_ops = { @@ -315,9 +318,10 @@ static struct ref_scale_ops refcnt_ops = { // Definitions for rwlock static rwlock_t test_rwlock; -static void ref_rwlock_init(void) +static bool ref_rwlock_init(void) { rwlock_init(&test_rwlock); + return true; } static void ref_rwlock_section(const int nloops) @@ -351,9 +355,10 @@ static struct ref_scale_ops rwlock_ops = { // Definitions for rwsem static struct rw_semaphore test_rwsem; -static void ref_rwsem_init(void) +static bool ref_rwsem_init(void) { init_rwsem(&test_rwsem); + return true; } static void ref_rwsem_section(const int nloops) @@ -523,6 +528,237 @@ static struct ref_scale_ops clock_ops = { .name = "clock" }; +//////////////////////////////////////////////////////////////////////// +// +// Methods leveraging SLAB_TYPESAFE_BY_RCU. +// + +// Item to look up in a typesafe manner. Array of pointers to these. +struct refscale_typesafe { + atomic_t rts_refctr; // Used by all flavors + spinlock_t rts_lock; + seqlock_t rts_seqlock; + unsigned int a; + unsigned int b; +}; + +static struct kmem_cache *typesafe_kmem_cachep; +static struct refscale_typesafe **rtsarray; +static long rtsarray_size; +static DEFINE_TORTURE_RANDOM_PERCPU(refscale_rand); +static bool (*rts_acquire)(struct refscale_typesafe *rtsp, unsigned int *start); +static bool (*rts_release)(struct refscale_typesafe *rtsp, unsigned int start); + +// Conditionally acquire an explicit in-structure reference count. +static bool typesafe_ref_acquire(struct refscale_typesafe *rtsp, unsigned int *start) +{ + return atomic_inc_not_zero(&rtsp->rts_refctr); +} + +// Unconditionally release an explicit in-structure reference count. +static bool typesafe_ref_release(struct refscale_typesafe *rtsp, unsigned int start) +{ + if (!atomic_dec_return(&rtsp->rts_refctr)) { + WRITE_ONCE(rtsp->a, rtsp->a + 1); + kmem_cache_free(typesafe_kmem_cachep, rtsp); + } + return true; +} + +// Unconditionally acquire an explicit in-structure spinlock. +static bool typesafe_lock_acquire(struct refscale_typesafe *rtsp, unsigned int *start) +{ + spin_lock(&rtsp->rts_lock); + return true; +} + +// Unconditionally release an explicit in-structure spinlock. +static bool typesafe_lock_release(struct refscale_typesafe *rtsp, unsigned int start) +{ + spin_unlock(&rtsp->rts_lock); + return true; +} + +// Unconditionally acquire an explicit in-structure sequence lock. +static bool typesafe_seqlock_acquire(struct refscale_typesafe *rtsp, unsigned int *start) +{ + *start = read_seqbegin(&rtsp->rts_seqlock); + return true; +} + +// Conditionally release an explicit in-structure sequence lock. Return +// true if this release was successful, that is, if no retry is required. +static bool typesafe_seqlock_release(struct refscale_typesafe *rtsp, unsigned int start) +{ + return !read_seqretry(&rtsp->rts_seqlock, start); +} + +// Do a read-side critical section with the specified delay in +// microseconds and nanoseconds inserted so as to increase probability +// of failure. 
+static void typesafe_delay_section(const int nloops, const int udl, const int ndl) +{ + unsigned int a; + unsigned int b; + int i; + long idx; + struct refscale_typesafe *rtsp; + unsigned int start; + + for (i = nloops; i >= 0; i--) { + preempt_disable(); + idx = torture_random(this_cpu_ptr(&refscale_rand)) % rtsarray_size; + preempt_enable(); +retry: + rcu_read_lock(); + rtsp = rcu_dereference(rtsarray[idx]); + a = READ_ONCE(rtsp->a); + if (!rts_acquire(rtsp, &start)) { + rcu_read_unlock(); + goto retry; + } + if (a != READ_ONCE(rtsp->a)) { + (void)rts_release(rtsp, start); + rcu_read_unlock(); + goto retry; + } + un_delay(udl, ndl); + // Remember, seqlock read-side release can fail. + if (!rts_release(rtsp, start)) { + rcu_read_unlock(); + goto retry; + } + b = READ_ONCE(rtsp->a); + WARN_ONCE(a != b, "Re-read of ->a changed from %u to %u.\n", a, b); + b = rtsp->b; + rcu_read_unlock(); + WARN_ON_ONCE(a * a != b); + } +} + +// Because the acquisition and release methods are expensive, there +// is no point in optimizing away the un_delay() function's two checks. +// Thus simply define typesafe_read_section() as a simple wrapper around +// typesafe_delay_section(). +static void typesafe_read_section(const int nloops) +{ + typesafe_delay_section(nloops, 0, 0); +} + +// Allocate and initialize one refscale_typesafe structure. +static struct refscale_typesafe *typesafe_alloc_one(void) +{ + struct refscale_typesafe *rtsp; + + rtsp = kmem_cache_alloc(typesafe_kmem_cachep, GFP_KERNEL); + if (!rtsp) + return NULL; + atomic_set(&rtsp->rts_refctr, 1); + WRITE_ONCE(rtsp->a, rtsp->a + 1); + WRITE_ONCE(rtsp->b, rtsp->a * rtsp->a); + return rtsp; +} + +// Slab-allocator constructor for refscale_typesafe structures created +// out of a new slab of system memory. +static void refscale_typesafe_ctor(void *rtsp_in) +{ + struct refscale_typesafe *rtsp = rtsp_in; + + spin_lock_init(&rtsp->rts_lock); + seqlock_init(&rtsp->rts_seqlock); + preempt_disable(); + rtsp->a = torture_random(this_cpu_ptr(&refscale_rand)); + preempt_enable(); +} + +static struct ref_scale_ops typesafe_ref_ops; +static struct ref_scale_ops typesafe_lock_ops; +static struct ref_scale_ops typesafe_seqlock_ops; + +// Initialize for a typesafe test. +static bool typesafe_init(void) +{ + long idx; + long si = lookup_instances; + + typesafe_kmem_cachep = kmem_cache_create("refscale_typesafe", + sizeof(struct refscale_typesafe), sizeof(void *), + SLAB_TYPESAFE_BY_RCU, refscale_typesafe_ctor); + if (!typesafe_kmem_cachep) + return false; + if (si < 0) + si = -si * nr_cpu_ids; + else if (si == 0) + si = nr_cpu_ids; + rtsarray_size = si; + rtsarray = kcalloc(si, sizeof(*rtsarray), GFP_KERNEL); + if (!rtsarray) + return false; + for (idx = 0; idx < rtsarray_size; idx++) { + rtsarray[idx] = typesafe_alloc_one(); + if (!rtsarray[idx]) + return false; + } + if (cur_ops == &typesafe_ref_ops) { + rts_acquire = typesafe_ref_acquire; + rts_release = typesafe_ref_release; + } else if (cur_ops == &typesafe_lock_ops) { + rts_acquire = typesafe_lock_acquire; + rts_release = typesafe_lock_release; + } else if (cur_ops == &typesafe_seqlock_ops) { + rts_acquire = typesafe_seqlock_acquire; + rts_release = typesafe_seqlock_release; + } else { + WARN_ON_ONCE(1); + return false; + } + return true; +} + +// Clean up after a typesafe test. 
+static void typesafe_cleanup(void) +{ + long idx; + + if (rtsarray) { + for (idx = 0; idx < rtsarray_size; idx++) + kmem_cache_free(typesafe_kmem_cachep, rtsarray[idx]); + kfree(rtsarray); + rtsarray = NULL; + rtsarray_size = 0; + } + kmem_cache_destroy(typesafe_kmem_cachep); + typesafe_kmem_cachep = NULL; + rts_acquire = NULL; + rts_release = NULL; +} + +// The typesafe_init() function distinguishes these structures by address. +static struct ref_scale_ops typesafe_ref_ops = { + .init = typesafe_init, + .cleanup = typesafe_cleanup, + .readsection = typesafe_read_section, + .delaysection = typesafe_delay_section, + .name = "typesafe_ref" +}; + +static struct ref_scale_ops typesafe_lock_ops = { + .init = typesafe_init, + .cleanup = typesafe_cleanup, + .readsection = typesafe_read_section, + .delaysection = typesafe_delay_section, + .name = "typesafe_lock" +}; + +static struct ref_scale_ops typesafe_seqlock_ops = { + .init = typesafe_init, + .cleanup = typesafe_cleanup, + .readsection = typesafe_read_section, + .delaysection = typesafe_delay_section, + .name = "typesafe_seqlock" +}; + static void rcu_scale_one_reader(void) { if (readdelay <= 0) @@ -812,6 +1048,7 @@ ref_scale_init(void) static struct ref_scale_ops *scale_ops[] = { &rcu_ops, &srcu_ops, RCU_TRACE_OPS RCU_TASKS_OPS &refcnt_ops, &rwlock_ops, &rwsem_ops, &lock_ops, &lock_irq_ops, &acqrel_ops, &clock_ops, + &typesafe_ref_ops, &typesafe_lock_ops, &typesafe_seqlock_ops, }; if (!torture_init_begin(scale_type, verbose)) @@ -833,7 +1070,10 @@ ref_scale_init(void) goto unwind; } if (cur_ops->init) - cur_ops->init(); + if (!cur_ops->init()) { + firsterr = -EUCLEAN; + goto unwind; + } ref_scale_print_module_parms(cur_ops, "Start of test"); diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c index ca4b5dcec675..ab4ee58af84b 100644 --- a/kernel/rcu/srcutree.c +++ b/kernel/rcu/srcutree.c @@ -154,7 +154,7 @@ static void init_srcu_struct_data(struct srcu_struct *ssp) */ static inline bool srcu_invl_snp_seq(unsigned long s) { - return rcu_seq_state(s) == SRCU_SNP_INIT_SEQ; + return s == SRCU_SNP_INIT_SEQ; } /* @@ -469,24 +469,59 @@ static bool srcu_readers_active_idx_check(struct srcu_struct *ssp, int idx) /* * If the locks are the same as the unlocks, then there must have - * been no readers on this index at some time in between. This does - * not mean that there are no more readers, as one could have read - * the current index but not have incremented the lock counter yet. + * been no readers on this index at some point in this function. + * But there might be more readers, as a task might have read + * the current ->srcu_idx but not yet have incremented its CPU's + * ->srcu_lock_count[idx] counter. In fact, it is possible + * that most of the tasks have been preempted between fetching + * ->srcu_idx and incrementing ->srcu_lock_count[idx]. And there + * could be almost (ULONG_MAX / sizeof(struct task_struct)) tasks + * in a system whose address space was fully populated with memory. + * Call this quantity Nt. * - * So suppose that the updater is preempted here for so long - * that more than ULONG_MAX non-nested readers come and go in - * the meantime. It turns out that this cannot result in overflow - * because if a reader modifies its unlock count after we read it - * above, then that reader's next load of ->srcu_idx is guaranteed - * to get the new value, which will cause it to operate on the - * other bank of counters, where it cannot contribute to the - * overflow of these counters. 
This means that there is a maximum - * of 2*NR_CPUS increments, which cannot overflow given current - * systems, especially not on 64-bit systems. + * So suppose that the updater is preempted at this point in the + * code for a long time. That now-preempted updater has already + * flipped ->srcu_idx (possibly during the preceding grace period), + * done an smp_mb() (again, possibly during the preceding grace + * period), and summed up the ->srcu_unlock_count[idx] counters. + * How many times can a given one of the aforementioned Nt tasks + * increment the old ->srcu_idx value's ->srcu_lock_count[idx] + * counter, in the absence of nesting? * - * OK, how about nesting? This does impose a limit on nesting - * of floor(ULONG_MAX/NR_CPUS/2), which should be sufficient, - * especially on 64-bit systems. + * It can clearly do so once, given that it has already fetched + * the old value of ->srcu_idx and is just about to use that value + * to index its increment of ->srcu_lock_count[idx]. But as soon as + * it leaves that SRCU read-side critical section, it will increment + * ->srcu_unlock_count[idx], which must follow the updater's above + * read from that same value. Thus, as soon the reading task does + * an smp_mb() and a later fetch from ->srcu_idx, that task will be + * guaranteed to get the new index. Except that the increment of + * ->srcu_unlock_count[idx] in __srcu_read_unlock() is after the + * smp_mb(), and the fetch from ->srcu_idx in __srcu_read_lock() + * is before the smp_mb(). Thus, that task might not see the new + * value of ->srcu_idx until the -second- __srcu_read_loc |
