summaryrefslogtreecommitdiff
path: root/kernel/rcu/tree.c
AgeCommit message (Collapse)AuthorFilesLines
2026-03-04rcu/exp: Move expedited kthread worker creation functions above ↵Frederic Weisbecker1-48/+48
rcutree_prepare_cpu() [ Upstream commit c19e5d3b497a3036f800edf751dc7814e3e887e1 ] The expedited kthread worker performing the per node initialization is going to be split into per node kthreads. As such, the future per node kthread creation will need to be called from CPU hotplug callbacks instead of an initcall, right beside the per node boost kthread creation. To prepare for that, move the kthread worker creation above rcutree_prepare_cpu() as a first step to make the review smoother for the upcoming modifications. No intended functional change. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Stable-dep-of: d41e37f26b31 ("rcu: Fix rcu_read_unlock() deadloop due to softirq") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04rcu: s/boost_kthread_mutex/kthread_mutexFrederic Weisbecker1-1/+1
[ Upstream commit 7836b270607676ed1c0c6a4a840a2ede9437a6a1 ] This mutex is currently protecting per node boost kthreads creation and affinity setting across CPU hotplug operations. Since the expedited kworkers will soon be split per node as well, they will be subject to the same concurrency constraints against hotplug. Therefore their creation and affinity tuning operations will be grouped with those of boost kthreads and then rely on the same mutex. To prepare for that, generalize its name. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Stable-dep-of: d41e37f26b31 ("rcu: Fix rcu_read_unlock() deadloop due to softirq") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-28rcu: Fix racy re-initialization of irq_work causing hangsFrederic Weisbecker1-0/+2
commit 61399e0c5410567ef60cb1cda34cca42903842e3 upstream. RCU re-initializes the deferred QS irq work everytime before attempting to queue it. However there are situations where the irq work is attempted to be queued even though it is already queued. In that case re-initializing messes-up with the irq work queue that is about to be handled. The chances for that to happen are higher when the architecture doesn't support self-IPIs and irq work are then all lazy, such as with the following sequence: 1) rcu_read_unlock() is called when IRQs are disabled and there is a grace period involving blocked tasks on the node. The irq work is then initialized and queued. 2) The related tasks are unblocked and the CPU quiescent state is reported. rdp->defer_qs_iw_pending is reset to DEFER_QS_IDLE, allowing the irq work to be requeued in the future (note the previous one hasn't fired yet). 3) A new grace period starts and the node has blocked tasks. 4) rcu_read_unlock() is called when IRQs are disabled again. The irq work is re-initialized (but it's queued! and its node is cleared) and requeued. Which means it's requeued to itself. 5) The irq work finally fires with the tick. But since it was requeued to itself, it loops and hangs. Fix this with initializing the irq work only once before the CPU boots. Fixes: b41642c87716 ("rcu: Fix rcu_read_unlock() deadloop due to IRQ work") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202508071303.c1134cce-lkp@intel.com Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-07-10rcu: Return early if callback is not specifiedUladzislau Rezki (Sony)1-0/+4
[ Upstream commit 33b6a1f155d627f5bd80c7485c598ce45428f74f ] Currently the call_rcu() API does not check whether a callback pointer is NULL. If NULL is passed, rcu_core() will try to invoke it, resulting in NULL pointer dereference and a kernel crash. To prevent this and improve debuggability, this patch adds a check for NULL and emits a kernel stack trace to help identify a faulty caller. Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-19rcu/cpu_stall_cputime: fix the hardirq count for x86 architectureYongliang Gao1-3/+7
[ Upstream commit da6b85598af30e9fec34d82882d7e1e39f3da769 ] When counting the number of hardirqs in the x86 architecture, it is essential to add arch_irq_stat_cpu to ensure accuracy. For example, a CPU loop within the rcu_read_lock function. Before: [ 70.910184] rcu: INFO: rcu_preempt self-detected stall on CPU [ 70.910436] rcu: 3-....: (4999 ticks this GP) idle=*** [ 70.910711] rcu: hardirqs softirqs csw/system [ 70.910870] rcu: number: 0 657 0 [ 70.911024] rcu: cputime: 0 0 2498 ==> 2498(ms) [ 70.911278] rcu: (t=5001 jiffies g=3677 q=29 ncpus=8) After: [ 68.046132] rcu: INFO: rcu_preempt self-detected stall on CPU [ 68.046354] rcu: 2-....: (4999 ticks this GP) idle=*** [ 68.046628] rcu: hardirqs softirqs csw/system [ 68.046793] rcu: number: 2498 663 0 [ 68.046951] rcu: cputime: 0 0 2496 ==> 2496(ms) [ 68.047244] rcu: (t=5000 jiffies g=3825 q=4 ncpus=8) Fixes: be42f00b73a0 ("rcu: Add RCU stall diagnosis information") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202501090842.SfI6QPGS-lkp@intel.com/ Signed-off-by: Yongliang Gao <leonylgao@tencent.com> Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Link: https://lore.kernel.org/r/20250216084109.3109837-1-leonylgao@gmail.com Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-09rcu/kvfree: Fix data-race in __mod_timer / kvfree_call_rcuUladzislau Rezki (Sony)1-2/+12
[ Upstream commit a23da88c6c80e41e0503e0b481a22c9eea63f263 ] KCSAN reports a data race when access the krcp->monitor_work.timer.expires variable in the schedule_delayed_monitor_work() function: <snip> BUG: KCSAN: data-race in __mod_timer / kvfree_call_rcu read to 0xffff888237d1cce8 of 8 bytes by task 10149 on cpu 1: schedule_delayed_monitor_work kernel/rcu/tree.c:3520 [inline] kvfree_call_rcu+0x3b8/0x510 kernel/rcu/tree.c:3839 trie_update_elem+0x47c/0x620 kernel/bpf/lpm_trie.c:441 bpf_map_update_value+0x324/0x350 kernel/bpf/syscall.c:203 generic_map_update_batch+0x401/0x520 kernel/bpf/syscall.c:1849 bpf_map_do_batch+0x28c/0x3f0 kernel/bpf/syscall.c:5143 __sys_bpf+0x2e5/0x7a0 __do_sys_bpf kernel/bpf/syscall.c:5741 [inline] __se_sys_bpf kernel/bpf/syscall.c:5739 [inline] __x64_sys_bpf+0x43/0x50 kernel/bpf/syscall.c:5739 x64_sys_call+0x2625/0x2d60 arch/x86/include/generated/asm/syscalls_64.h:322 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xc9/0x1c0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f write to 0xffff888237d1cce8 of 8 bytes by task 56 on cpu 0: __mod_timer+0x578/0x7f0 kernel/time/timer.c:1173 add_timer_global+0x51/0x70 kernel/time/timer.c:1330 __queue_delayed_work+0x127/0x1a0 kernel/workqueue.c:2523 queue_delayed_work_on+0xdf/0x190 kernel/workqueue.c:2552 queue_delayed_work include/linux/workqueue.h:677 [inline] schedule_delayed_monitor_work kernel/rcu/tree.c:3525 [inline] kfree_rcu_monitor+0x5e8/0x660 kernel/rcu/tree.c:3643 process_one_work kernel/workqueue.c:3229 [inline] process_scheduled_works+0x483/0x9a0 kernel/workqueue.c:3310 worker_thread+0x51d/0x6f0 kernel/workqueue.c:3391 kthread+0x1d1/0x210 kernel/kthread.c:389 ret_from_fork+0x4b/0x60 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 Reported by Kernel Concurrency Sanitizer on: CPU: 0 UID: 0 PID: 56 Comm: kworker/u8:4 Not tainted 6.12.0-rc2-syzkaller-00050-g5b7c893ed5ed #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024 Workqueue: events_unbound kfree_rcu_monitor <snip> kfree_rcu_monitor() rearms the work if a "krcp" has to be still offloaded and this is done without holding krcp->lock, whereas the kvfree_call_rcu() holds it. Fix it by acquiring the "krcp->lock" for kfree_rcu_monitor() so both functions do not race anymore. Reported-by: syzbot+061d370693bdd99f9d34@syzkaller.appspotmail.com Link: https://lore.kernel.org/lkml/ZxZ68KmHDQYU0yfD@pc636/T/ Fixes: 8fc5494ad5fa ("rcu/kvfree: Move need_offload_krc() out of krcp->lock") Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-10-17rcu/nocb: Make IRQs disablement symmetricFrederic Weisbecker1-3/+6
[ Upstream commit b913c3fe685e0aec80130975b0f330fd709ff324 ] Currently IRQs are disabled on call_rcu() and then depending on the context: * If the CPU is in nocb mode: - If the callback is enqueued in the bypass list, IRQs are re-enabled implictly by rcu_nocb_try_bypass() - If the callback is enqueued in the normal list, IRQs are re-enabled implicitly by __call_rcu_nocb_wake() * If the CPU is NOT in nocb mode, IRQs are reenabled explicitly from call_rcu() This makes the code a bit hard to follow, especially as it interleaves with nocb locking. To make the IRQ flags coverage clearer and also in order to prepare for moving all the nocb enqueue code to its own function, always re-enable the IRQ flags explicitly from call_rcu(). Reviewed-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Stable-dep-of: f7345ccc62a4 ("rcu/nocb: Fix rcuog wake-up from offline softirq") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-29rcu: Eliminate rcu_gp_slow_unregister() false positivePaul E. McKenney1-1/+1
[ Upstream commit 0ae9942f03d0d034fdb0a4f44fc99f62a3107987 ] When using rcutorture as a module, there are a number of conditions that can abort the modprobe operation, for example, when attempting to run both RCU CPU stall warning tests and forward-progress tests. This can cause rcu_torture_cleanup() to be invoked on the unwind path out of rcu_rcu_torture_init(), which will mean that rcu_gp_slow_unregister() is invoked without a matching rcu_gp_slow_register(). This will cause a splat because rcu_gp_slow_unregister() is passed rcu_fwd_cb_nodelay, which does not match a NULL pointer. This commit therefore forgives a mismatch involving a NULL pointer, thus avoiding this false-positive splat. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-29rcu: Dump memory object info if callback function is invalidZhen Lei1-0/+1
[ Upstream commit 2cbc482d325ee58001472c4359b311958c4efdd1 ] When a structure containing an RCU callback rhp is (incorrectly) freed and reallocated after rhp is passed to call_rcu(), it is not unusual for rhp->func to be set to NULL. This defeats the debugging prints used by __call_rcu_common() in kernels built with CONFIG_DEBUG_OBJECTS_RCU_HEAD=y, which expect to identify the offending code using the identity of this function. And in kernels build without CONFIG_DEBUG_OBJECTS_RCU_HEAD=y, things are even worse, as can be seen from this splat: Unable to handle kernel NULL pointer dereference at virtual address 0 ... ... PC is at 0x0 LR is at rcu_do_batch+0x1c0/0x3b8 ... ... (rcu_do_batch) from (rcu_core+0x1d4/0x284) (rcu_core) from (__do_softirq+0x24c/0x344) (__do_softirq) from (__irq_exit_rcu+0x64/0x108) (__irq_exit_rcu) from (irq_exit+0x8/0x10) (irq_exit) from (__handle_domain_irq+0x74/0x9c) (__handle_domain_irq) from (gic_handle_irq+0x8c/0x98) (gic_handle_irq) from (__irq_svc+0x5c/0x94) (__irq_svc) from (arch_cpu_idle+0x20/0x3c) (arch_cpu_idle) from (default_idle_call+0x4c/0x78) (default_idle_call) from (do_idle+0xf8/0x150) (do_idle) from (cpu_startup_entry+0x18/0x20) (cpu_startup_entry) from (0xc01530) This commit therefore adds calls to mem_dump_obj(rhp) to output some information, for example: slab kmalloc-256 start ffff410c45019900 pointer offset 0 size 256 This provides the rough size of the memory block and the offset of the rcu_head structure, which as least provides at least a few clues to help locate the problem. If the problem is reproducible, additional slab debugging can be enabled, for example, CONFIG_DEBUG_SLAB=y, which can provide significantly more information. Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-14rcu: Fix rcu_barrier() VS post CPUHP_TEARDOWN_CPU invocationFrederic Weisbecker1-3/+7
[ Upstream commit 55d4669ef1b76823083caecfab12a8bd2ccdcf64 ] When rcu_barrier() calls rcu_rdp_cpu_online() and observes a CPU off rnp->qsmaskinitnext, it means that all accesses from the offline CPU preceding the CPUHP_TEARDOWN_CPU are visible to RCU barrier, including callbacks expiration and counter updates. However interrupts can still fire after stop_machine() re-enables interrupts and before rcutree_report_cpu_dead(). The related accesses happening between CPUHP_TEARDOWN_CPU and rnp->qsmaskinitnext clearing are _NOT_ guaranteed to be seen by rcu_barrier() without proper ordering, especially when callbacks are invoked there to the end, making rcutree_migrate_callback() bypass barrier_lock. The following theoretical race example can make rcu_barrier() hang: CPU 0 CPU 1 ----- ----- //cpu_down() smpboot_park_threads() //ksoftirqd is parked now <IRQ> rcu_sched_clock_irq() invoke_rcu_core() do_softirq() rcu_core() rcu_do_batch() // callback storm // rcu_do_batch() returns // before completing all // of them // do_softirq also returns early because of // timeout. It defers to ksoftirqd but // it's parked </IRQ> stop_machine() take_cpu_down() rcu_barrier() spin_lock(barrier_lock) // observes rcu_segcblist_n_cbs(&rdp->cblist) != 0 <IRQ> do_softirq() rcu_core() rcu_do_batch() //completes all pending callbacks //smp_mb() implied _after_ callback number dec </IRQ> rcutree_report_cpu_dead() rnp->qsmaskinitnext &= ~rdp->grpmask; rcutree_migrate_callback() // no callback, early return without locking // barrier_lock //observes !rcu_rdp_cpu_online(rdp) rcu_barrier_entrain() rcu_segcblist_entrain() // Observe rcu_segcblist_n_cbs(rsclp) == 0 // because no barrier between reading // rnp->qsmaskinitnext and rsclp->len rcu_segcblist_add_len() smp_mb__before_atomic() // will now observe the 0 count and empty // list, but too late, we enqueue regardless WRITE_ONCE(rsclp->len, rsclp->len + v); // ignored barrier callback // rcu barrier stall... This could be solved with a read memory barrier, enforcing the message passing between rnp->qsmaskinitnext and rsclp->len, matching the full memory barrier after rsclp->len addition in rcu_segcblist_add_len() performed at the end of rcu_do_batch(). However the rcu_barrier() is complicated enough and probably doesn't need too many more subtleties. CPU down is a slowpath and the barrier_lock seldom contended. Solve the issue with unconditionally locking the barrier_lock on rcutree_migrate_callbacks(). This makes sure that either rcu_barrier() sees the empty queue or its entrained callback will be migrated. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-26rcu/exp: Handle RCU expedited grace period kworker allocation failureFrederic Weisbecker1-0/+2
[ Upstream commit e7539ffc9a770f36bacedcf0fbfb4bf2f244f4a5 ] Just like is done for the kworker performing nodes initialization, gracefully handle the possible allocation failure of the RCU expedited grace period main kworker. While at it perform a rename of the related checking functions to better reflect the expedited specifics. Reviewed-by: Kalesh Singh <kaleshsingh@google.com> Fixes: 9621fbee44df ("rcu: Move expedited grace period (GP) work to RT kthread_worker") Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-26rcu/exp: Fix RCU expedited parallel grace period kworker allocation failure ↵Frederic Weisbecker1-0/+1
recovery [ Upstream commit a636c5e6f8fc34be520277e69c7c6ee1d4fc1d17 ] Under CONFIG_RCU_EXP_KTHREAD=y, the nodes initialization for expedited grace periods is queued to a kworker. However if the allocation of that kworker failed, the nodes initialization is performed synchronously by the caller instead. Now the check for kworker initialization failure relies on the kworker pointer to be NULL while its value might actually encapsulate an allocation failure error. Make sure to handle this case. Reviewed-by: Kalesh Singh <kaleshsingh@google.com> Fixes: 9621fbee44df ("rcu: Move expedited grace period (GP) work to RT kthread_worker") Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-01-31rcu: Defer RCU kthreads wakeup when CPU is dyingFrederic Weisbecker1-1/+33
[ Upstream commit e787644caf7628ad3269c1fbd321c3255cf51710 ] When the CPU goes idle for the last time during the CPU down hotplug process, RCU reports a final quiescent state for the current CPU. If this quiescent state propagates up to the top, some tasks may then be woken up to complete the grace period: the main grace period kthread and/or the expedited main workqueue (or kworker). If those kthreads have a SCHED_FIFO policy, the wake up can indirectly arm the RT bandwith timer to the local offline CPU. Since this happens after hrtimers have been migrated at CPUHP_AP_HRTIMERS_DYING stage, the timer gets ignored. Therefore if the RCU kthreads are waiting for RT bandwidth to be available, they may never be actually scheduled. This triggers TREE03 rcutorture hangs: rcu: INFO: rcu_preempt self-detected stall on CPU rcu: 4-...!: (1 GPs behind) idle=9874/1/0x4000000000000000 softirq=0/0 fqs=20 rcuc=21071 jiffies(starved) rcu: (t=21035 jiffies g=938281 q=40787 ncpus=6) rcu: rcu_preempt kthread starved for 20964 jiffies! g938281 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0 rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior. rcu: RCU grace-period kthread stack dump: task:rcu_preempt state:R running task stack:14896 pid:14 tgid:14 ppid:2 flags:0x00004000 Call Trace: <TASK> __schedule+0x2eb/0xa80 schedule+0x1f/0x90 schedule_timeout+0x163/0x270 ? __pfx_process_timeout+0x10/0x10 rcu_gp_fqs_loop+0x37c/0x5b0 ? __pfx_rcu_gp_kthread+0x10/0x10 rcu_gp_kthread+0x17c/0x200 kthread+0xde/0x110 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x2b/0x40 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1b/0x30 </TASK> The situation can't be solved with just unpinning the timer. The hrtimer infrastructure and the nohz heuristics involved in finding the best remote target for an unpinned timer would then also need to handle enqueues from an offline CPU in the most horrendous way. So fix this on the RCU side instead and defer the wake up to an online CPU if it's too late for the local one. Reported-by: Paul E. McKenney <paulmck@kernel.org> Fixes: 5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier") Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-01-10rcu: Introduce rcu_cpu_online()Frederic Weisbecker1-0/+7
[ Upstream commit 2be4686d866ad5896f2bb94d82fe892197aea9c7 ] Export the RCU point of view as to when a CPU is considered offline (ie: when does RCU consider that a CPU is sufficiently down in the hotplug process to not feature any possible read side). This will be used by RCU-tasks whose vision of an offline CPU should reasonably match the one of RCU core. Fixes: cff9b2332ab7 ("kernel/sched: Modify initial boot task idle setup") Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-01-10rcu: Break rcu_node_0 --> &rq->__lock orderPeter Zijlstra1-11/+25
[ Upstream commit 85d68222ddc5f4522e456d97d201166acb50f716 ] Commit 851a723e45d1 ("sched: Always clear user_cpus_ptr in do_set_cpus_allowed()") added a kfree() call to free any user provided affinity mask, if present. It was changed later to use kfree_rcu() in commit 9a5418bc48ba ("sched/core: Use kfree_rcu() in do_set_cpus_allowed()") to avoid a circular locking dependency problem. It turns out that even kfree_rcu() isn't safe for avoiding circular locking problem. As reported by kernel test robot, the following circular locking dependency now exists: &rdp->nocb_lock --> rcu_node_0 --> &rq->__lock Solve this by breaking the rcu_node_0 --> &rq->__lock chain by moving the resched_cpu() out from under rcu_node lock. [peterz: heavily borrowed from Waiman's Changelog] [paulmck: applied Z qiang feedback] Fixes: 851a723e45d1 ("sched: Always clear user_cpus_ptr in do_set_cpus_allowed()") Reported-by: kernel test robot <oliver.sang@intel.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/oe-lkp/202310302207.a25f1a30-oliver.sang@intel.com Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-28rcu: kmemleak: Ignore kmemleak false positives when RCU-freeing objectsCatalin Marinas1-0/+9
commit 5f98fd034ca6fd1ab8c91a3488968a0e9caaabf6 upstream. Since the actual slab freeing is deferred when calling kvfree_rcu(), so is the kmemleak_free() callback informing kmemleak of the object deletion. From the perspective of the kvfree_rcu() caller, the object is freed and it may remove any references to it. Since kmemleak does not scan RCU internal data storing the pointer, it will report such objects as leaks during the grace period. Tell kmemleak to ignore such objects on the kvfree_call_rcu() path. Note that the tiny RCU implementation does not have such issue since the objects can be tracked from the rcu_ctrlblk structure. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reported-by: Christoph Paasch <cpaasch@apple.com> Closes: https://lore.kernel.org/all/F903A825-F05F-4B77-A2B5-7356282FBA2C@apple.com/ Cc: <stable@vger.kernel.org> Tested-by: Christoph Paasch <cpaasch@apple.com> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-11-28rcu/tree: Defer setting of jiffies during stall resetJoel Fernandes (Google)1-0/+12
commit b96e7a5fa0ba9cda32888e04f8f4bac42d49a7f8 upstream. There are instances where rcu_cpu_stall_reset() is called when jiffies did not get a chance to update for a long time. Before jiffies is updated, the CPU stall detector can go off triggering false-positives where a just-started grace period appears to be ages old. In the past, we disabled stall detection in rcu_cpu_stall_reset() however this got changed [1]. This is resulting in false-positives in KGDB usecase [2]. Fix this by deferring the update of jiffies to the third run of the FQS loop. This is more robust, as, even if rcu_cpu_stall_reset() is called just before jiffies is read, we would end up pushing out the jiffies read by 3 more FQS loops. Meanwhile the CPU stall detection will be delayed and we will not get any false positives. [1] https://lore.kernel.org/all/20210521155624.174524-2-senozhatsky@chromium.org/ [2] https://lore.kernel.org/all/20230814020045.51950-2-chenhuacai@loongson.cn/ Tested with rcutorture.cpu_stall option as well to verify stall behavior with/without patch. Tested-by: Huacai Chen <chenhuacai@loongson.cn> Reported-by: Binbin Zhou <zhoubinbin@loongson.cn> Closes: https://lore.kernel.org/all/20230814020045.51950-2-chenhuacai@loongson.cn/ Suggested-by: Paul McKenney <paulmck@kernel.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Fixes: a80be428fbc1 ("rcu: Do not disable GP stall detection in rcu_cpu_stall_reset()") Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-16rcu: Mark __rcu_irq_enter_check_tick() ->rcu_urgent_qs loadPaul E. McKenney1-1/+1
The rcu_request_urgent_qs_task() function does a cross-CPU store to ->rcu_urgent_qs, so this commit therefore marks the load in __rcu_irq_enter_check_tick() with READ_ONCE(). Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2023-07-19rcu: Clarify rcu_is_watching() kernel-doc commentPaul E. McKenney1-5/+9
Make it clear that this function always returns either true or false without other planned failure modes. Reported-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2023-06-07Merge branches 'doc.2023.05.10a', 'fixes.2023.05.11a', 'kvfree.2023.05.10a', ↵Paul E. McKenney1-43/+88
'nocb.2023.05.11a', 'rcu-tasks.2023.05.10a', 'torture.2023.05.15a' and 'rcu-urgent.2023.06.06a' into HEAD doc.2023.05.10a: Documentation updates fixes.2023.05.11a: Miscellaneous fixes kvfree.2023.05.10a: kvfree_rcu updates nocb.2023.05.11a: Callback-offloading updates rcu-tasks.2023.05.10a: Tasks RCU updates torture.2023.05.15a: Torture-test updates rcu-urgent.2023.06.06a: Urgent SRCU fix
2023-05-11rcu-tasks: Stop rcu_tasks_invoke_cbs() from using never-onlined CPUsPaul E. McKenney1-1/+11
The rcu_tasks_invoke_cbs() function relies on queue_work_on() to silently fall back to WORK_CPU_UNBOUND when the specified CPU is offline. However, the queue_work_on() function's silent fallback mechanism relies on that CPU having been online at some time in the past. When queue_work_on() is passed a CPU that has never been online, workqueue lockups ensue, which can be bad for your kernel's general health and well-being. This commit therefore checks whether a given CPU has ever been online, and, if not substitutes WORK_CPU_UNBOUND in the subsequent call to queue_work_on(). Why not simply omit the queue_work_on() call entirely? Because this function is flooding callback-invocation notifications to all CPUs, and must deal with possibilities that include a sparse cpu_possible_mask. This commit also moves the setting of the rcu_data structure's ->beenonline field to rcu_cpu_starting(), which executes on the incoming CPU before that CPU has ever enabled interrupts. This ensures that the required workqueues are present. In addition, because the incoming CPU has not yet enabled its interrupts, there cannot yet have been any softirq handlers running on this CPU, which means that the WARN_ON_ONCE(!rdp->beenonline) within the RCU_SOFTIRQ handler cannot have triggered yet. Fixes: d363f833c6d88 ("rcu-tasks: Use workqueues for multiple rcu_tasks_invoke_cbs() invocations") Reported-by: Tejun Heo <tj@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-05-11rcu: Make rcu_cpu_starting() rely on interrupts being disabledPaul E. McKenney1-6/+5
Currently, rcu_cpu_starting() is written so that it might be invoked with interrupts enabled. However, it is always called when interrupts are disabled, either by rcu_init(), notify_cpu_starting(), or from a call point prior to the call to notify_cpu_starting(). But why bother requiring that interrupts be disabled? The purpose is to allow the rcu_data structure's ->beenonline flag to be set after all early processing has completed for the incoming CPU, thus allowing this flag to be used to determine when workqueues have been set up for the incoming CPU, while still allowing this flag to be used as a diagnostic within rcu_core(). This commit therefore makes rcu_cpu_starting() rely on interrupts being disabled. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-05-11rcu: Mark rcu_cpu_kthread() accesses to ->rcu_cpu_has_workPaul E. McKenney1-2/+2
The rcu_data structure's ->rcu_cpu_has_work field can be modified by any CPU attempting to wake up the rcuc kthread. Therefore, this commit marks accesses to this field from the rcu_cpu_kthread() function. This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-05-11rcu: Employ jiffies-based backstop to callback time limitPaul E. McKenney1-8/+20
Currently, if there are more than 100 ready-to-invoke RCU callbacks queued on a given CPU, the rcu_do_batch() function sets a timeout for invocation of the series. This timeout defaulting to three milliseconds, and may be adjusted using the rcutree.rcu_resched_ns kernel boot parameter. This timeout is checked using local_clock(), but the overhead of this function combined with the common-case very small callback-invocation overhead means that local_clock() is checked every 32nd invocation. This works well except for longer-than average callbacks. For example, a series of 500-microsecond-duration callbacks means that local_clock() is checked only once every 16 milliseconds, which makes it difficult to enforce a three-millisecond timeout. This commit therefore adds a Kconfig option RCU_DOUBLE_CHECK_CB_TIME that enables backup timeout checking using the coarser grained but lighter weight jiffies. If the jiffies counter detects a timeout, then local_clock() is consulted even if this is not the 32nd callback. This prevents the aforementioned 16-millisecond latency blow. Reported-by: Domas Mituzas <dmituzas@meta.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-05-11rcu: Check callback-invocation time limit for rcuc kthreadsPaul E. McKenney1-9/+19
Currently, a callback-invocation time limit is enforced only for callbacks invoked from the softirq environment, the rationale being that when callbacks are instead invoked from rcuc and rcuoc kthreads, these callbacks cannot be holding up other softirq vectors. Which is in fact true. However, if an rcuc kthread spends too much time invoking callbacks, it can delay quiescent-state reports from its CPU, which can also be a problem. This commit therefore applies the callback-invocation time limit to callback invocation from the rcuc kthreads as well as from softirq. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-05-09rcu/kvfree: Make drain_page_cache() take early return if cache is disabledZqiang1-0/+3
If the rcutree.rcu_min_cached_objs kernel boot parameter is set to zero, then krcp->page_cache_work will never be triggered to fill page cache. In addition, the put_cached_bnode() will not fill page cache. As a result krcp->bkvcache will always be empty, so there is no need to acquire krcp->lock to get page from krcp->bkvcache. This commit therefore makes drain_page_cache() return immediately if the rcu_min_cached_objs is zero. Signed-off-by: Zqiang <qiang1.zhang@intel.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-05-09rcu/kvfree: Make fill page cache start from krcp->nr_bkv_objsZqiang1-1/+1
When the fill_page_cache_func() function is invoked, it assumes that the cache of pages is completely empty. However, there can be some time between triggering execution of this function and its actual invocation. During this time, kfree_rcu_work() might run, and might fill in part or all of this cache of pages, thus invalidating the fill_page_cache_func() function's assumption. This will not overfill the cache because put_cached_bnode() will reject the extra page. However, it will result in a needless allocation and freeing of one extra page, which might not be helpful under lowish-memory conditions. This commit therefore causes the fill_page_cache_func() to explicitly account for pages that have been placed into the cache shortly before it starts running. Signed-off-by: Zqiang <qiang1.zhang@intel.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-05-09rcu/kvfree: Do not run a page work if a cache is disabledUladzislau Rezki (Sony)1-0/+4
By default the cache size is 5 pages per CPU, but it can be disabled at boot time by setting the rcu_min_cached_objs to zero. When that happens, the current code will uselessly set an hrtimer to schedule refilling this cache with zero pages. This commit therefore streamlines this process by simply refusing the set the hrtimer when rcu_min_cached_objs is zero. Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-05-09rcu/kvfree: Use consistent krcp when growing kfree_rcu() page cacheZqiang1-1/+1
The add_ptr_to_bulk_krc_lock() function is invoked to allocate a new kfree_rcu() page, also known as a kvfree_rcu_bulk_data structure. The kfree_rcu_cpu structure's lock is used to protect this operation, except that this lock must be momentarily dropped when allocating memory. It is clearly important that the lock that is reacquired be the same lock that was acquired initially via krc_this_cpu_lock(). Unfortunately, this same krc_this_cpu_lock() function is used to re-acquire this lock, and if the task migrated to some other CPU during the memory allocation, this will result in the kvfree_rcu_bulk_data structure being added to the wrong CPU's kfree_rcu_cpu structure. This commit therefore replaces that second call to krc_this_cpu_lock() with raw_spin_lock_irqsave() in order to explicitly acquire the lock on the correct kfree_rcu_cpu structure, thus keeping things straight even when the task migrates. Signed-off-by: Zqiang <qiang1.zhang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-05-09rcu/kvfree: Invoke debug_rcu_bhead_unqueue() after checking bnode->gp_snapZqiang1-1/+1
If kvfree_rcu_bulk() sees that the required grace period has failed to elapse, it leaks the memory because readers might still be using it. But in that case, the debug-objects subsystem still marks the relevant structures as having been freed, even though they are instead being leaked. This commit fixes this mismatch by invoking debug_rcu_bhead_unqueue() only when we are actually going to free the objects. Signed-off-by: Zqiang <qiang1.zhang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-05-09rcu/kvfree: Add debug check for GP complete for kfree_rcu_cpu listUladzislau Rezki (Sony)1-1/+7
Under low-memory conditions, kvfree_rcu() will use each object's rcu_head structure to queue objects in a singly linked list headed by the kfree_rcu_cpu structure's ->head field. This list is passed to call_rcu() as a unit, but there is no indication of which grace period this list needs to wait for. This in turn prevents adding debug checks in the kfree_rcu_work() as was done for the two page-of-pointers channels in the kfree_rcu_cpu structure. This commit therefore adds a ->head_free_gp_snap field to the kfree_rcu_cpu_work structure to record this grace-period number. It also adds a WARN_ON_ONCE() to kfree_rcu_monitor() that checks to make sure that the required grace period has in fact elapsed. [ paulmck: Fix kerneldoc issue raised by Stephen Rothwell. ] Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-05-09rcu/kvfree: Add debug to check grace periodsPaul E. McKenney1-18/+19
This commit adds debugging checks to verify that the required RCU grace period has elapsed for each kvfree_rcu_bulk_data structure that arrives at the kvfree_rcu_bulk() function. These checks make use of that structure's ->gp_snap field, which has been upgraded from an unsigned long to an rcu_gp_oldstate structure. This upgrade reduces the chances of false positives to nearly zero, even on 32-bit systems, for which this structure carries 64 bits of state. Cc: Ziwei Dai <ziwei.dai@unisoc.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-04-24Merge tag 'rcu.6.4.april5.2023.3' of ↵Linus Torvalds1-7/+11
git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux Pull RCU updates from Joel Fernandes: - Updates and additions to MAINTAINERS files, with Boqun being added to the RCU entry and Zqiang being added as an RCU reviewer. I have also transitioned from reviewer to maintainer; however, Paul will be taking over sending RCU pull-requests for the next merge window. - Resolution of hotplug warning in nohz code, achieved by fixing cpu_is_hotpluggable() through interaction with the nohz subsystem. Tick dependency modifications by Zqiang, focusing on fixing usage of the TICK_DEP_BIT_RCU_EXP bitmask. - Avoid needless calls to the rcu-lazy shrinker for CONFIG_RCU_LAZY=n kernels, fixed by Zqiang. - Improvements to rcu-tasks stall reporting by Neeraj. - Initial renaming of k[v]free_rcu() to k[v]free_rcu_mightsleep() for increased robustness, affecting several components like mac802154, drbd, vmw_vmci, tracing, and more. A report by Eric Dumazet showed that the API could be unknowingly used in an atomic context, so we'd rather make sure they know what they're asking for by being explicit: https://lore.kernel.org/all/20221202052847.2623997-1-edumazet@google.com/ - Documentation updates, including corrections to spelling, clarifications in comments, and improvements to the srcu_size_state comments. - Better srcu_struct cache locality for readers, by adjusting the size of srcu_struct in support of SRCU usage by Christoph Hellwig. - Teach lockdep to detect deadlocks between srcu_read_lock() vs synchronize_srcu() contributed by Boqun. Previously lockdep could not detect such deadlocks, now it can. - Integration of rcutorture and rcu-related tools, targeted for v6.4 from Boqun's tree, featuring new SRCU deadlock scenarios, test_nmis module parameter, and more - Miscellaneous changes, various code cleanups and comment improvements * tag 'rcu.6.4.april5.2023.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux: (71 commits) checkpatch: Error out if deprecated RCU API used mac802154: Rename kfree_rcu() to kvfree_rcu_mightsleep() rcuscale: Rename kfree_rcu() to kfree_rcu_mightsleep() ext4/super: Rename kfree_rcu() to kfree_rcu_mightsleep() net/mlx5: Rename kfree_rcu() to kfree_rcu_mightsleep() net/sysctl: Rename kvfree_rcu() to kvfree_rcu_mightsleep() lib/test_vmalloc.c: Rename kvfree_rcu() to kvfree_rcu_mightsleep() tracing: Rename kvfree_rcu() to kvfree_rcu_mightsleep() misc: vmw_vmci: Rename kvfree_rcu() to kvfree_rcu_mightsleep() drbd: Rename kvfree_rcu() to kvfree_rcu_mightsleep() rcu: Protect rcu_print_task_exp_stall() ->exp_tasks access rcu: Avoid stack overflow due to __rcu_irq_enter_check_tick() being kprobe-ed rcu-tasks: Report stalls during synchronize_srcu() in rcu_tasks_postscan() rcu: Permit start_poll_synchronize_rcu_expedited() to be invoked early rcu: Remove never-set needwake assignment from rcu_report_qs_rdp() rcu: Register rcu-lazy shrinker only for CONFIG_RCU_LAZY=y kernels rcu: Fix missing TICK_DEP_MASK_RCU_EXP dependency check rcu: Fix set/clear TICK_DEP_BIT_RCU_EXP bitmask race rcu/trace: use strscpy() to instead of strncpy() tick/nohz: Fix cpu_is_hotpluggable() by checking with nohz subsystem ...
2023-04-06rcu/kvfree: Avoid freeing new kfree_rcu() memory after old grace periodZiwei Dai1-8/+19
Memory passed to kvfree_rcu() that is to be freed is tracked by a per-CPU kfree_rcu_cpu structure, which in turn contains pointers to kvfree_rcu_bulk_data structures that contain pointers to memory that has not yet been handed to RCU, along with an kfree_rcu_cpu_work structure that tracks the memory that has already been handed to RCU. These structures track three categories of memory: (1) Memory for kfree(), (2) Memory for kvfree(), and (3) Memory for both that arrived during an OOM episode. The first two categories are tracked in a cache-friendly manner involving a dynamically allocated page of pointers (the aforementioned kvfree_rcu_bulk_data structures), while the third uses a simple (but decidedly cache-unfriendly) linked list through the rcu_head structures in each block of memory. On a given CPU, these three categories are handled as a unit, with that CPU's kfree_rcu_cpu_work structure having one pointer for each of the three categories. Clearly, new memory for a given category cannot be placed in the corresponding kfree_rcu_cpu_work structure until any old memory has had its grace period elapse and thus has been removed. And the kfree_rcu_monitor() function does in fact check for this. Except that the kfree_rcu_monitor() function checks these pointers one at a time. This means that if the previous kfree_rcu() memory passed to RCU had only category 1 and the current one has only category 2, the kfree_rcu_monitor() function will send that current category-2 memory along immediately. This can result in memory being freed too soon, that is, out from under unsuspecting RCU readers. To see this, consider the following sequence of events, in which: o Task A on CPU 0 calls rcu_read_lock(), then uses "from_cset", then is preempted. o CPU 1 calls kfree_rcu(cset, rcu_head) in order to free "from_cset" after a later grace period. Except that "from_cset" is freed right after the previous grace period ended, so that "from_cset" is immediately freed. Task A resumes and references "from_cset"'s member, after which nothing good happens. In full detail: CPU 0 CPU 1 ---------------------- ---------------------- count_memcg_event_mm() |rcu_read_lock() <--- |mem_cgroup_from_task() |// css_set_ptr is the "from_cset" mentioned on CPU 1 |css_set_ptr = rcu_dereference((task)->cgroups) |// Hard irq comes, current task is scheduled out. cgroup_attach_task() |cgroup_migrate() |cgroup_migrate_execute() |css_set_move_task(task, from_cset, to_cset, true) |cgroup_move_task(task, to_cset) |rcu_assign_pointer(.., to_cset) |... |cgroup_migrate_finish() |put_css_set_locked(from_cset) |from_cset->refcount return 0 |kfree_rcu(cset, rcu_head) // free from_cset after new gp |add_ptr_to_bulk_krc_lock() |schedule_delayed_work(&krcp->monitor_work, ..) kfree_rcu_monitor() |krcp->bulk_head[0]'s work attached to krwp->bulk_head_free[] |queue_rcu_work(system_wq, &krwp->rcu_work) |if rwork->rcu.work is not in WORK_STRUCT_PENDING_BIT state, |call_rcu(&rwork->rcu, rcu_work_rcufn) <--- request new gp // There is a perious call_rcu(.., rcu_work_rcufn) // gp end, rcu_work_rcufn() is called. rcu_work_rcufn() |__queue_work(.., rwork->wq, &rwork->work); |kfree_rcu_work() |krwp->bulk_head_free[0] bulk is freed before new gp end!!! |The "from_cset" is freed before new gp end. // the task resumes some time later. |css_set_ptr->subsys[(subsys_id) <--- Caused kernel crash, because css_set_ptr is freed. This commit therefore causes kfree_rcu_monitor() to refrain from moving kfree_rcu() memory to the kfree_rcu_cpu_work structure until the RCU grace period has completed for all three categories. v2: Use helper function instead of inserted code block at kfree_rcu_monitor(). Fixes: 34c881745549 ("rcu: Support kfree_bulk() interface in kfree_rcu()") Fixes: 5f3c8d620447 ("rcu/tree: Maintain separate array for vmalloc ptrs") Reported-by: Mukesh Ojha <quic_mojha@quicinc.com> Signed-off-by: Ziwei Dai <ziwei.dai@unisoc.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Tested-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-04-05Merge branches 'rcu/staging-core', 'rcu/staging-docs' and ↵Joel Fernandes (Google)1-0/+2
'rcu/staging-kfree', remote-tracking branches 'paul/srcu-cf.2023.04.04a', 'fbq/rcu/lockdep.2023.03.27a' and 'fbq/rcu/rcutorture.2023.03.20a' into rcu/sta