summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
3 daysptrace: slightly saner 'get_dumpable()' logicLinus Torvalds2-6/+17
commit 31e62c2ebbfdc3fe3dbdf5e02c92a9dc67087a3a upstream. The 'dumpability' of a task is fundamentally about the memory image of the task - the concept comes from whether it can core dump or not - and makes no sense when you don't have an associated mm. And almost all users do in fact use it only for the case where the task has a mm pointer. But we have one odd special case: ptrace_may_access() uses 'dumpable' to check various other things entirely independently of the MM (typically explicitly using flags like PTRACE_MODE_READ_FSCREDS). Including for threads that no longer have a VM (and maybe never did, like most kernel threads). It's not what this flag was designed for, but it is what it is. The ptrace code does check that the uid/gid matches, so you do have to be uid-0 to see kernel thread details, but this means that the traditional "drop capabilities" model doesn't make any difference for this all. Make it all make a *bit* more sense by saying that if you don't have a MM pointer, we'll use a cached "last dumpability" flag if the thread ever had a MM (it will be zero for kernel threads since it is never set), and require a proper CAP_SYS_PTRACE capability to override. Reported-by: Qualys Security Advisory <qsa@qualys.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kees Cook <kees@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 daystracepoint: balance regfunc() on func_add() failure in tracepoint_add_func()David Carlier1-0/+2
[ Upstream commit fad217e16fded7f3c09f8637b0f6a224d58b5f2e ] When a tracepoint goes through the 0 -> 1 transition, tracepoint_add_func() invokes the subsystem's ext->regfunc() before attempting to install the new probe via func_add(). If func_add() then fails (for example, when allocate_probes() cannot allocate a new probe array under memory pressure and returns -ENOMEM), the function returns the error without calling the matching ext->unregfunc(), leaving the side effects of regfunc() behind with no installed probe to justify them. For syscall tracepoints this is particularly unpleasant: syscall_regfunc() bumps sys_tracepoint_refcount and sets SYSCALL_TRACEPOINT on every task. After a leaked failure, the refcount is stuck at a non-zero value with no consumer, and every task continues paying the syscall trace entry/exit overhead until reboot. Other subsystems providing regfunc()/unregfunc() pairs exhibit similarly scoped persistent state. Mirror the existing 1 -> 0 cleanup and call ext->unregfunc() in the func_add() error path, gated on the same condition used there so the unwind is symmetric with the registration. Fixes: 8cf868affdc4 ("tracing: Have the reg function allow to fail") Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260413190601.21993-1-devnexen@gmail.com Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> [ changed `tp->ext->unregfunc` to `tp->unregfunc` to match older struct layout ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 daysbpf: Fix use-after-free in arena_vm_close on forkAlexei Starovoitov1-3/+16
commit 4fddde2a732de60bb97e3307d4eb69ac5f1d2b74 upstream. arena_vm_open() only bumps vml->mmap_count but never registers the child VMA in arena->vma_list. The vml->vma always points at the parent VMA, so after parent munmap the pointer dangles. If the child then calls bpf_arena_free_pages(), zap_pages() reads the stale vml->vma triggering use-after-free. Fix this by preventing the arena VMA from being inherited across fork with VM_DONTCOPY, and preventing VMA splits via the may_split callback. Also reject mremap with a .mremap callback returning -EINVAL. A same-size mremap(MREMAP_FIXED) on the full arena VMA reaches copy_vma() through the following path: check_prep_vma() - returns 0 early: new_len == old_len skips VM_DONTEXPAND check prep_move_vma() - vm_start == old_addr and vm_end == old_addr + old_len so may_split is never called move_vma() copy_vma_and_data() copy_vma() vm_area_dup() - copies vm_private_data (vml pointer) vm_ops->open() - bumps vml->mmap_count vm_ops->mremap() - returns -EINVAL, rollback unmaps new VMA The refcount ensures the rollback's arena_vm_close does not free the vml shared with the original VMA. Reported-by: Weiming Shi <bestswngs@gmail.com> Reported-by: Xiang Mei <xmei5@asu.edu> Fixes: 317460317a02 ("bpf: Introduce bpf_arena.") Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Link: https://lore.kernel.org/r/20260413194245.21449-1-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 daysexit: prevent preemption of oopsing TASK_DEAD taskJann Horn1-0/+1
commit c1fa0bb633e4a6b11e83ffc57fa5abe8ebb87891 upstream. When an already-exiting task oopses, make_task_dead() currently calls do_task_dead() with preemption enabled. That is forbidden: do_task_dead() calls __schedule(), which has a comment saying "WARNING: must be called with preemption disabled!". If an oopsing task is preempted in do_task_dead(), between becoming TASK_DEAD and entering the scheduler explicitly, bad things happen: finish_task_switch() assumes that once the scheduler has switched away from a TASK_DEAD task, the task can never run again and its stack is no longer needed; but that assumption apparently doesn't hold if the dead task was preempted (the SM_PREEMPT case). This means that the scheduler ends up repeatedly dropping references on the dead task's stack, which can lead to use-after-free or double-free of the entire task stack; in other words, two tasks can end up running on the same stack, resulting in various kinds of memory corruption. (This does not just affect "recursively oopsing" tasks; it is enough to oops once during task exit, for example in a file_operations::release handler) Fixes: 7f80a2fd7db9 ("exit: Stop poorly open coding do_task_dead in make_task_dead") Cc: stable@kernel.org Signed-off-by: Jann Horn <jannh@google.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
4 daysmm: convert mm_lock_seq to a proper seqcountSuren Baghdasaryan1-4/+1
[ Upstream commit eb449bd96954b1c1e491d19066cfd2a010f0aa47 ] Convert mm_lock_seq to be seqcount_t and change all mmap_write_lock variants to increment it, in-line with the usual seqcount usage pattern. This lets us check whether the mmap_lock is write-locked by checking mm_lock_seq.sequence counter (odd=locked, even=unlocked). This will be used when implementing mmap_lock speculation functions. As a result vm_lock_seq is also change to be unsigned to match the type of mm_lock_seq.sequence. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Link: https://lkml.kernel.org/r/20241122174416.1367052-2-surenb@google.com Stable-dep-of: 52f657e34d7b ("x86: shadow stacks: proper error handling for mmap lock") Signed-off-by: Sasha Levin <sashal@kernel.org>
11 daysrtmutex: Use waiter::task instead of current in remove_waiter()Keenan Dong1-5/+8
commit 3bfdc63936dd4773109b7b8c280c0f3b5ae7d349 upstream. remove_waiter() is used by the slowlock paths, but it is also used for proxy-lock rollback in rt_mutex_start_proxy_lock() when invoked from futex_requeue(). In the latter case waiter::task is not current, but remove_waiter() operates on current for the dequeue operation. That results in several problems: 1) the rbtree dequeue happens without waiter::task::pi_lock being held 2) the waiter task's pi_blocked_on state is not cleared, which leaves a dangling pointer primed for UAF around. 3) rt_mutex_adjust_prio_chain() operates on the wrong top priority waiter task Use waiter::task instead of current in all related operations in remove_waiter() to cure those problems. [ tglx: Fixup rt_mutex_adjust_prio_chain(), add a comment and amend the changelog ] Fixes: 8161239a8bcc ("rtmutex: Simplify PI algorithm and make highest prio task get lock") Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Signed-off-by: Keenan Dong <keenanat2000@gmail.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 daystaskstats: set version in TGID exit notificationsYiyang Chen1-0/+1
commit 16c4f0211aaa1ec1422b11b59f64f1abe9009fc0 upstream. delay accounting started populating taskstats records with a valid version field via fill_pid() and fill_tgid(). Later, commit ad4ecbcba728 ("[PATCH] delay accounting taskstats interface send tgid once") changed the TGID exit path to send the cached signal->stats aggregate directly instead of building the outgoing record through fill_tgid(). Unlike fill_tgid(), fill_tgid_exit() only accumulates accounting data and never initializes stats->version. As a result, TGID exit notifications can reach userspace with version == 0 even though PID exit notifications and TASKSTATS_CMD_GET replies carry a valid taskstats version. This is easy to reproduce with `tools/accounting/getdelays.c`. I have a small follow-up patch for that tool which: 1. increases the receive buffer/message size so the pid+tgid combined exit notification is not dropped/truncated 2. prints `stats->version`. With that patch, the reproducer is: Terminal 1: ./getdelays -d -v -l -m 0 Terminal 2: taskset -c 0 python3 -c 'import threading,time; t=threading.Thread(target=time.sleep,args=(0.1,)); t.start(); t.join()' That produces both PID and TGID exit notifications for the same process. The PID exit record reports a valid taskstats version, while the TGID exit record reports `version 0`. This patch (of 2): Set stats->version = TASKSTATS_VERSION after copying the cached TGID aggregate into the outgoing netlink payload so all taskstats records are self-describing again. Link: https://lkml.kernel.org/r/ba83d934e59edd431b693607de573eb9ca059309.1774810498.git.cyyzero16@gmail.com Fixes: ad4ecbcba728 ("[PATCH] delay accounting taskstats interface send tgid once") Signed-off-by: Yiyang Chen <cyyzero16@gmail.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Dr. Thomas Orgis <thomas.orgis@uni-hamburg.de> Cc: Fan Yu <fan.yu9@zte.com.cn> Cc: Wang Yaxin <wang.yaxin@zte.com.cn> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 daysrandomize_kstack: Maintain kstack_offset per taskRyan Roberts1-0/+2
commit 37beb42560165869838e7d91724f3e629db64129 upstream. kstack_offset was previously maintained per-cpu, but this caused a couple of issues. So let's instead make it per-task. Issue 1: add_random_kstack_offset() and choose_random_kstack_offset() expected and required to be called with interrupts and preemption disabled so that it could manipulate per-cpu state. But arm64, loongarch and risc-v are calling them with interrupts and preemption enabled. I don't _think_ this causes any functional issues, but it's certainly unexpected and could lead to manipulating the wrong cpu's state, which could cause a minor performance degradation due to bouncing the cache lines. By maintaining the state per-task those functions can safely be called in preemptible context. Issue 2: add_random_kstack_offset() is called before executing the syscall and expands the stack using a previously chosen random offset. choose_random_kstack_offset() is called after executing the syscall and chooses and stores a new random offset for the next syscall. With per-cpu storage for this offset, an attacker could force cpu migration during the execution of the syscall and prevent the offset from being updated for the original cpu such that it is predictable for the next syscall on that cpu. By maintaining the state per-task, this problem goes away because the per-task random offset is updated after the syscall regardless of which cpu it is executing on. Fixes: 39218ff4c625 ("stack: Optionally randomize kernel stack offset each syscall") Closes: https://lore.kernel.org/all/dd8c37bc-795f-4c7a-9086-69e584d8ab24@arm.com/ Cc: stable@vger.kernel.org Acked-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Link: https://patch.msgid.link/20260303150840.3789438-2-ryan.roberts@arm.com Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 dayssched: Use u64 for bandwidth ratio calculationsJoseph Salisbury3-3/+3
commit c6e80201e057dfb7253385e60bf541121bf5dc33 upstream. to_ratio() computes BW_SHIFT-scaled bandwidth ratios from u64 period and runtime values, but it returns unsigned long. tg_rt_schedulable() also stores the current group limit and the accumulated child sum in unsigned long. On 32-bit builds, large bandwidth ratios can be truncated and the RT group sum can wrap when enough siblings are present. That can let an overcommitted RT hierarchy pass the schedulability check, and it also narrows the helper result for other callers. Return u64 from to_ratio() and use u64 for the RT group totals so bandwidth ratios are preserved and compared at full width on both 32-bit and 64-bit builds. Fixes: b40b2e8eb521 ("sched: rt: multi level group constraints") Assisted-by: Codex:GPT-5 Signed-off-by: Joseph Salisbury <joseph.salisbury@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260403210014.2713404-1-joseph.salisbury@oracle.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 dayspadata: Fix pd UAF once and for allHerbert Xu1-99/+37
[ Upstream commit 71203f68c7749609d7fc8ae6ad054bdedeb24f91 ] There is a race condition/UAF in padata_reorder that goes back to the initial commit. A reference count is taken at the start of the process in padata_do_parallel, and released at the end in padata_serial_worker. This reference count is (and only is) required for padata_replace to function correctly. If padata_replace is never called then there is no issue. In the function padata_reorder which serves as the core of padata, as soon as padata is added to queue->serial.list, and the associated spin lock released, that padata may be processed and the reference count on pd would go away. Fix this by getting the next padata before the squeue->serial lock is released. In order to make this possible, simplify padata_reorder by only calling it once the next padata arrives. Fixes: 16295bec6398 ("padata: Generic parallelization/serialization interface") Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> [ Adjust context of padata_find_next(). Replace cpumask_next_wrap(cpu, pd->cpumask.pcpu) with cpumask_next_wrap(cpu, pd->cpumask.pcpu, -1, false) in padata_reorder() in v6.12 according to dc5bb9b769c9 ("cpumask: deprecate cpumask_next_wrap()") and f954a2d37637 ("padata: switch padata_find_next() to using cpumask_next_wrap()") . ] Signed-off-by: Bin Lan <lanbincn@139.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-04-22kernel: be more careful about dup_mmap() failures and uprobe registeringLiam R. Howlett2-3/+18
[ Upstream commit 64c37e134b120fb462fb4a80694bfb8e7be77b14 ] If a memory allocation fails during dup_mmap(), the maple tree can be left in an unsafe state for other iterators besides the exit path. All the locks are dropped before the exit_mmap() call (in mm/mmap.c), but the incomplete mm_struct can be reached through (at least) the rmap finding the vmas which have a pointer back to the mm_struct. Up to this point, there have been no issues with being able to find an mm_struct that was only partially initialised. Syzbot was able to make the incomplete mm_struct fail with recent forking changes, so it has been proven unsafe to use the mm_struct that hasn't been initialised, as referenced in the link below. Although 8ac662f5da19f ("fork: avoid inappropriate uprobe access to invalid mm") fixed the uprobe access, it does not completely remove the race. This patch sets the MMF_OOM_SKIP to avoid the iteration of the vmas on the oom side (even though this is extremely unlikely to be selected as an oom victim in the race window), and sets MMF_UNSTABLE to avoid other potential users from using a partially initialised mm_struct. When registering vmas for uprobe, skip the vmas in an mm that is marked unstable. Modifying a vma in an unstable mm may cause issues if the mm isn't fully initialised. Link: https://lore.kernel.org/all/6756d273.050a0220.2477f.003d.GAE@google.com/ Link: https://lkml.kernel.org/r/20250127170221.1761366-1-Liam.Howlett@oracle.com Fixes: d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()") Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [resolved conflict from missing header includes: - linux/workqueue.h missing, introduced by commit 2bf8e5aceff8 ("uprobes: allow put_uprobe() from non-sleepable softirq context") - linux/srcu.h missing, introduced by commit dd1a7567784e ("uprobes: SRCU-protect uretprobe lifetime (with timeout)") ] Signed-off-by: Heiko Stuebner <heiko.stuebner@cherry.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-04-22sched/deadline: Use revised wakeup rule for dl_serverPeter Zijlstra1-1/+1
[ Upstream commit 14a857056466be9d3d907a94e92a704ac1be149b ] John noted that commit 115135422562 ("sched/deadline: Fix 'stuck' dl_server") unfixed the issue from commit a3a70caf7906 ("sched/deadline: Fix dl_server behaviour"). The issue in commit 115135422562 was for wakeups of the server after the deadline; in which case you *have* to start a new period. The case for a3a70caf7906 is wakeups before the deadline. Now, because the server is effectively running a least-laxity policy, it means that any wakeup during the runnable phase means dl_entity_overflow() will be true. This means we need to adjust the runtime to allow it to still run until the existing deadline expires. Use the revised wakeup rule for dl_defer entities. Fixes: 115135422562 ("sched/deadline: Fix 'stuck' dl_server") Reported-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Tested-by: John Stultz <jstultz@google.com> Link: https://patch.msgid.link/20260404102244.GB22575@noisy.programming.kicks-ass.net Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-04-22tracing/probe: reject non-closed empty immediate stringsPengpeng Hou1-1/+1
[ Upstream commit 4346be6577aaa04586167402ae87bbdbe32484a4 ] parse_probe_arg() accepts quoted immediate strings and passes the body after the opening quote to __parse_imm_string(). That helper currently computes strlen(str) and immediately dereferences str[len - 1], which underflows when the body is empty and not closed with double-quotation. Reject empty non-closed immediate strings before checking for the closing quote. Link: https://lore.kernel.org/all/20260401160315.88518-1-pengpeng@iscas.ac.cn/ Fixes: a42e3c4de964 ("tracing/probe: Add immediate string parameter support") Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-04-18workqueue: Add pool_workqueue to pending_pwqs list when unplugging multiple ↵Matthew Brost1-1/+13
inactive works commit 703ccb63ae9f7444d6ff876d024e17f628103c69 upstream. In unplug_oldest_pwq(), the first inactive work item on the pool_workqueue is activated correctly. However, if multiple inactive works exist on the same pool_workqueue, subsequent works fail to activate because wq_node_nr_active.pending_pwqs is empty — the list insertion is skipped when the pool_workqueue is plugged. Fix this by checking for additional inactive works in unplug_oldest_pwq() and updating wq_node_nr_active.pending_pwqs accordingly. Fixes: 4c065dbce1e8 ("workqueue: Enable unbound cpumask update on ordered workqueues") Cc: stable@vger.kernel.org Cc: Carlos Santa <carlos.santa@intel.com> Cc: Ryan Neph <ryanneph@google.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Waiman Long <longman@redhat.com> Cc: linux-kernel@vger.kernel.org Signed-off-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-18sched_ext: Fix stale direct dispatch state in ddsp_dsq_idAndrea Righi1-12/+36
[ Upstream commit 7e0ffb72de8aa3b25989c2d980e81b829c577010 ] @p->scx.ddsp_dsq_id can be left set (non-SCX_DSQ_INVALID) triggering a spurious warning in mark_direct_dispatch() when the next wakeup's ops.select_cpu() calls scx_bpf_dsq_insert(), such as: WARNING: kernel/sched/ext.c:1273 at scx_dsq_insert_commit+0xcd/0x140 The root cause is that ddsp_dsq_id was only cleared in dispatch_enqueue(), which is not reached in all paths that consume or cancel a direct dispatch verdict. Fix it by clearing it at the right places: - direct_dispatch(): cache the direct dispatch state in local variables and clear it before dispatch_enqueue() on the synchronous path. For the deferred path, the direct dispatch state must remain set until process_ddsp_deferred_locals() consumes them. - process_ddsp_deferred_locals(): cache the dispatch state in local variables and clear it before calling dispatch_to_local_dsq(), which may migrate the task to another rq. - do_enqueue_task(): clear the dispatch state on the enqueue path (local/global/bypass fallbacks), where the direct dispatch verdict is ignored. - dequeue_task_scx(): clear the dispatch state after dispatch_dequeue() to handle both the deferred dispatch cancellation and the holding_cpu race, covering all cases where a pending direct dispatch is cancelled. - scx_disable_task(): clear the direct dispatch state when transitioning a task out of the current scheduler. Waking tasks may have had the direct dispatch state set by the outgoing scheduler's ops.select_cpu() and then been queued on a wake_list via ttwu_queue_wakelist(), when SCX_OPS_ALLOW_QUEUED_WAKEUP is set. Such tasks are not on the runqueue and are not iterated by scx_bypass(), so their direct dispatch state won't be cleared. Without this clear, any subsequent SCX scheduler that tries to direct dispatch the task will trigger the WARN_ON_ONCE() in mark_direct_dispatch(). Fixes: 5b26f7b920f7 ("sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches") Cc: stable@vger.kernel.org # v6.12+ Cc: Daniel Hodges <hodgesd@meta.com> Cc: Patrick Somaru <patsomaru@meta.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org> [ adapted function signatures and code paths ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-18blktrace: fix __this_cpu_read/write in preemptible contextChaitanya Kulkarni1-2/+2
[ Upstream commit da46b5dfef48658d03347cda21532bcdbb521e67 ] tracing_record_cmdline() internally uses __this_cpu_read() and __this_cpu_write() on the per-CPU variable trace_cmdline_save, and trace_save_cmdline() explicitly asserts preemption is disabled via lockdep_assert_preemption_disabled(). These operations are only safe when preemption is off, as they were designed to be called from the scheduler context (probe_wakeup_sched_switch() / probe_wakeup()). __blk_add_trace() was calling tracing_record_cmdline(current) early in the blk_tracer path, before ring buffer reservation, from process context where preemption is fully enabled. This triggers the following using blktests/blktrace/002: blktrace/002 (blktrace ftrace corruption with sysfs trace) [failed] runtime 0.367s ... 0.437s something found in dmesg: [ 81.211018] run blktests blktrace/002 at 2026-02-25 22:24:33 [ 81.239580] null_blk: disk nullb1 created [ 81.357294] BUG: using __this_cpu_read() in preemptible [00000000] code: dd/2516 [ 81.362842] caller is tracing_record_cmdline+0x10/0x40 [ 81.362872] CPU: 16 UID: 0 PID: 2516 Comm: dd Tainted: G N 7.0.0-rc1lblk+ #84 PREEMPT(full) [ 81.362877] Tainted: [N]=TEST [ 81.362878] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014 [ 81.362881] Call Trace: [ 81.362884] <TASK> [ 81.362886] dump_stack_lvl+0x8d/0xb0 ... (See '/mnt/sda/blktests/results/nodev/blktrace/002.dmesg' for the entire message) [ 81.211018] run blktests blktrace/002 at 2026-02-25 22:24:33 [ 81.239580] null_blk: disk nullb1 created [ 81.357294] BUG: using __this_cpu_read() in preemptible [00000000] code: dd/2516 [ 81.362842] caller is tracing_record_cmdline+0x10/0x40 [ 81.362872] CPU: 16 UID: 0 PID: 2516 Comm: dd Tainted: G N 7.0.0-rc1lblk+ #84 PREEMPT(full) [ 81.362877] Tainted: [N]=TEST [ 81.362878] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014 [ 81.362881] Call Trace: [ 81.362884] <TASK> [ 81.362886] dump_stack_lvl+0x8d/0xb0 [ 81.362895] check_preemption_disabled+0xce/0xe0 [ 81.362902] tracing_record_cmdline+0x10/0x40 [ 81.362923] __blk_add_trace+0x307/0x5d0 [ 81.362934] ? lock_acquire+0xe0/0x300 [ 81.362940] ? iov_iter_extract_pages+0x101/0xa30 [ 81.362959] blk_add_trace_bio+0x106/0x1e0 [ 81.362968] submit_bio_noacct_nocheck+0x24b/0x3a0 [ 81.362979] ? lockdep_init_map_type+0x58/0x260 [ 81.362988] submit_bio_wait+0x56/0x90 [ 81.363009] __blkdev_direct_IO_simple+0x16c/0x250 [ 81.363026] ? __pfx_submit_bio_wait_endio+0x10/0x10 [ 81.363038] ? rcu_read_lock_any_held+0x73/0xa0 [ 81.363051] blkdev_read_iter+0xc1/0x140 [ 81.363059] vfs_read+0x20b/0x330 [ 81.363083] ksys_read+0x67/0xe0 [ 81.363090] do_syscall_64+0xbf/0xf00 [ 81.363102] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 81.363106] RIP: 0033:0x7f281906029d [ 81.363111] Code: 31 c0 e9 c6 fe ff ff 50 48 8d 3d 66 63 0a 00 e8 59 ff 01 00 66 0f 1f 84 00 00 00 00 00 80 3d 41 33 0e 00 00 74 17 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 5b c3 66 2e 0f 1f 84 00 00 00 00 00 48 83 ec [ 81.363113] RSP: 002b:00007ffca127dd48 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 [ 81.363120] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f281906029d [ 81.363122] RDX: 0000000000001000 RSI: 0000559f8bfae000 RDI: 0000000000000000 [ 81.363123] RBP: 0000000000001000 R08: 0000002863a10a81 R09: 00007f281915f000 [ 81.363124] R10: 00007f2818f77b60 R11: 0000000000000246 R12: 0000559f8bfae000 [ 81.363126] R13: 0000000000000000 R14: 0000000000000000 R15: 000000000000000a [ 81.363142] </TASK> The same BUG fires from blk_add_trace_plug(), blk_add_trace_unplug(), and blk_add_trace_rq() paths as well. The purpose of tracing_record_cmdline() is to cache the task->comm for a given PID so that the trace can later resolve it. It is only meaningful when a trace event is actually being recorded. Ring buffer reservation via ring_buffer_lock_reserve() disables preemption, and preemption remains disabled until the event is committed :- __blk_add_trace() __trace_buffer_lock_reserve() __trace_buffer_lock_reserve() ring_buffer_lock_reserve() preempt_disable_notrace(); <--- With this fix blktests for blktrace pass: blktests (master) # ./check blktrace blktrace/001 (blktrace zone management command tracing) [passed] runtime 3.650s ... 3.647s blktrace/002 (blktrace ftrace corruption with sysfs trace) [passed] runtime 0.411s ... 0.384s Fixes: 7ffbd48d5cab ("tracing: Cache comms only after an event occurred") Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Rajani Kantha <681739313@139.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-04-11bpf: Fix u32/s32 bounds when ranges cross min/max boundaryEduard Zingerman1-0/+24
[ Upstream commit fbc7aef517d8765e4c425d2792409bb9bf2e1f13 ] Same as in __reg64_deduce_bounds(), refine s32/u32 ranges in __reg32_deduce_bounds() in the following situations: - s32 range crosses U32_MAX/0 boundary, positive part of the s32 range overlaps with u32 range: 0 U32_MAX | [xxxxxxxxxxxxxx u32 range xxxxxxxxxxxxxx] | |----------------------------|----------------------------| |xxxxx s32 range xxxxxxxxx] [xxxxxxx| 0 S32_MAX S32_MIN -1 - s32 range crosses U32_MAX/0 boundary, negative part of the s32 range overlaps with u32 range: 0 U32_MAX | [xxxxxxxxxxxxxx u32 range xxxxxxxxxxxxxx] | |----------------------------|----------------------------| |xxxxxxxxx] [xxxxxxxxxxxx s32 range | 0 S32_MAX S32_MIN -1 - No refinement if ranges overlap in two intervals. This helps for e.g. consider the following program: call %[bpf_get_prandom_u32]; w0 &= 0xffffffff; if w0 < 0x3 goto 1f; // on fall-through u32 range [3..U32_MAX] if w0 s> 0x1 goto 1f; // on fall-through s32 range [S32_MIN..1] if w0 s< 0x0 goto 1f; // range can be narrowed to [S32_MIN..-1] r10 = 0; 1: ...; The reg_bounds.c selftest is updated to incorporate identical logic, refinement based on non-overflowing range halves: ((x ∩ [0, smax]) ∩ (y ∩ [0, smax])) ∪ ((x ∩ [smin,-1]) ∩ (y ∩ [smin,-1])) Reported-by: Andrea Righi <arighi@nvidia.com> Reported-by: Emil Tsalapatis <emil@etsalapatis.com> Closes: https://lore.kernel.org/bpf/aakqucg4vcujVwif@gpd4/T/ Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260306-bpf-32-bit-range-overflow-v3-1-f7f67e060a6b@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-11bpf: Add third round of bounds deductionPaul Chaignon1-0/+1
[ Upstream commit 5dbb19b16ac498b0b7f3a8a85f9d25d6d8af397d ] Commit d7f008738171 ("bpf: try harder to deduce register bounds from different numeric domains") added a second call to __reg_deduce_bounds in reg_bounds_sync because a single call wasn't enough to converge to a fixed point in terms of register bounds. With patch "bpf: Improve bounds when s64 crosses sign boundary" from this series, Eduard noticed that calling __reg_deduce_bounds twice isn't enough anymore to converge. The first selftest added in "selftests/bpf: Test cross-sign 64bits range refinement" highlights the need for a third call to __reg_deduce_bounds. After instruction 7, reg_bounds_sync performs the following bounds deduction: reg_bounds_sync entry: scalar(smin=-655,smax=0xeffffeee,smin32=-783,smax32=-146) __update_reg_bounds: scalar(smin=-655,smax=0xeffffeee,smin32=-783,smax32=-146) __reg_deduce_bounds: __reg32_deduce_bounds: scalar(smin=-655,smax=0xeffffeee,smin32=-783,smax32=-146,umin32=0xfffffcf1,umax32=0xffffff6e) __reg64_deduce_bounds: scalar(smin=-655,smax=0xeffffeee,smin32=-783,smax32=-146,umin32=0xfffffcf1,umax32=0xffffff6e) __reg_deduce_mixed_bounds: scalar(smin=-655,smax=0xeffffeee,umin=umin32=0xfffffcf1,umax=0xffffffffffffff6e,smin32=-783,smax32=-146,umax32=0xffffff6e) __reg_deduce_bounds: __reg32_deduce_bounds: scalar(smin=-655,smax=0xeffffeee,umin=umin32=0xfffffcf1,umax=0xffffffffffffff6e,smin32=-783,smax32=-146,umax32=0xffffff6e) __reg64_deduce_bounds: scalar(smin=-655,smax=smax32=-146,umin=0xfffffffffffffd71,umax=0xffffffffffffff6e,smin32=-783,umin32=0xfffffcf1,umax32=0xffffff6e) __reg_deduce_mixed_bounds: scalar(smin=-655,smax=smax32=-146,umin=0xfffffffffffffd71,umax=0xffffffffffffff6e,smin32=-783,umin32=0xfffffcf1,umax32=0xffffff6e) __reg_bound_offset: scalar(smin=-655,smax=smax32=-146,umin=0xfffffffffffffd71,umax=0xffffffffffffff6e,smin32=-783,umin32=0xfffffcf1,umax32=0xffffff6e,var_off=(0xfffffffffffffc00; 0x3ff)) __update_reg_bounds: scalar(smin=-655,smax=smax32=-146,umin=0xfffffffffffffd71,umax=0xffffffffffffff6e,smin32=-783,umin32=0xfffffcf1,umax32=0xffffff6e,var_off=(0xfffffffffffffc00; 0x3ff)) In particular, notice how: 1. In the first call to __reg_deduce_bounds, __reg32_deduce_bounds learns new u32 bounds. 2. __reg64_deduce_bounds is unable to improve bounds at this point. 3. __reg_deduce_mixed_bounds derives new u64 bounds from the u32 bounds. 4. In the second call to __reg_deduce_bounds, __reg64_deduce_bounds improves the smax and umin bounds thanks to patch "bpf: Improve bounds when s64 crosses sign boundary" from this series. 5. Subsequent functions are unable to improve the ranges further (only tnums). Yet, a better smin32 bound could be learned from the smin bound. __reg32_deduce_bounds is able to improve smin32 from smin, but for that we need a third call to __reg_deduce_bounds. As discussed in [1], there may be a better way to organize the deduction rules to learn the same information with less calls to the same functions. Such an optimization requires further analysis and is orthogonal to the present patchset. Link: https://lore.kernel.org/bpf/aIKtSK9LjQXB8FLY@mail.gmail.com/ [1] Acked-by: Eduard Zingerman <eddyz87@gmail.com> Co-developed-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Link: https://lore.kernel.org/r/79619d3b42e5525e0e174ed534b75879a5ba15de.1753695655.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-11bpf: Improve bounds when s64 crosses sign boundaryPaul Chaignon1-0/+52
[ Upstream commit 00bf8d0c6c9be0c481fc45a3f7d87c7f8812f229 ] __reg64_deduce_bounds currently improves the s64 range using the u64 range and vice versa, but only if it doesn't cross the sign boundary. This patch improves __reg64_deduce_bounds to cover the case where the s64 range crosses the sign boundary but overlaps with the u64 range on only one end. In that case, we can improve both ranges. Consider the following example, with the s64 range crossing the sign boundary: 0 U64_MAX | [xxxxxxxxxxxxxx u64 range xxxxxxxxxxxxxx] | |----------------------------|----------------------------| |xxxxx s64 range xxxxxxxxx] [xxxxxxx| 0 S64_MAX S64_MIN -1 The u64 range overlaps only with positive portion of the s64 range. We can thus derive the following new s64 and u64 ranges. 0 U64_MAX | [xxxxxx u64 range xxxxx] | |----------------------------|----------------------------| | [xxxxxx s64 range xxxxx] | 0 S64_MAX S64_MIN -1 The same logic can probably apply to the s32/u32 ranges, but this patch doesn't implement that change. In addition to the selftests, the __reg64_deduce_bounds change was also tested with Agni, the formal verification tool for the range analysis [1]. Link: https://github.com/bpfverif/agni [1] Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Link: https://lore.kernel.org/r/933bd9ce1f36ded5559f92fdc09e5dbc823fa245.1753695655.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-11sched/fair: Fix zero_vruntime tracking fixPeter Zijlstra1-7/+3
[ Upstream commit 1319ea57529e131822bab56bf417c8edc2db9ae8 ] John reported that stress-ng-yield could make his machine unhappy and managed to bisect it to commit b3d99f43c72b ("sched/fair: Fix zero_vruntime tracking"). The combination of yield and that commit was specific enough to hypothesize the following scenario: Suppose we have 2 runnable tasks, both doing yield. Then one will be eligible and one will not be, because the average position must be in between these two entities. Therefore, the runnable task will be eligible, and be promoted a full slice (all the tasks do is yield after all). This causes it to jump over the other task and now the other task is eligible and current is no longer. So we schedule. Since we are runnable, there is no {de,en}queue. All we have is the __{en,de}queue_entity() from {put_prev,set_next}_task(). But per the fingered commit, those two no longer move zero_vruntime. All that moves zero_vruntime are tick and full {de,en}queue. This means, that if the two tasks playing leapfrog can reach the critical speed to reach the overflow point inside one tick's worth of time, we're up a creek. Additionally, when multiple cgroups are involved, there is no guarantee the tick will in fact hit every cgroup in a timely manner. Statistically speaking it will, but that same statistics does not rule out the possibility of one cgroup not getting a tick for a significant amount of time -- however unlikely. Therefore, just like with the yield() case, force an update at the end of every slice. This ensures the update is never more than a single slice behind and the whole thing is within 2 lag bounds as per the comment on entity_key(). Fixes: b3d99f43c72b ("sched/fair: Fix zero_vruntime tracking") Reported-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: John Stultz <jstultz@google.com> Link: https://patch.msgid.link/20260401132355.081530332@infradead.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-04-11sched/fair: Use protect_slice() instead of direct comparisonVincent Guittot1-1/+1
[ Upstream commit 9cdb4fe20cd239c848b5c3f5753d83a9443ba329 ] Replace the test by the relevant protect_slice() function. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dhaval Giani (AMD) <dhaval@gianis.ca> Link: https://lkml.kernel.org/r/20250708165630.1948751-2-vincent.guittot@linaro.org Stable-dep-of: 1319ea57529e ("sched/fair: Fix zero_vruntime tracking fix") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-04-11bpf: reject direct access to nullable PTR_TO_BUF pointersQi Tang1-1/+2
[ Upstream commit b0db1accbc7395657c2b79db59fa9fae0d6656f3 ] check_mem_access() matches PTR_TO_BUF via base_type() which strips PTR_MAYBE_NULL, allowing direct dereference without a null check. Map iterator ctx->key and ctx->value are PTR_TO_BUF | PTR_MAYBE_NULL. On stop callbacks these are NULL, causing a kernel NULL dereference. Add a type_may_be_null() guard to the PTR_TO_BUF branch, matching the existing PTR_TO_BTF_ID pattern. Fixes: 20b2aff4bc15 ("bpf: Introduce MEM_RDONLY flag") Signed-off-by: Qi Tang <tpluszz77@gmail.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260402092923.38357-2-tpluszz77@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-04-11bpf: Fix regsafe() for pointers to packetAlexei Starovoitov1-1/+6
[ Upstream commit a8502a79e832b861e99218cbd2d8f4312d62e225 ] In case rold->reg->range == BEYOND_PKT_END && rcur->reg->range == N regsafe() may return true which may lead to current state with valid packet range not being explored. Fix the bug. Fixes: 6d94e741a8ff ("bpf: Support for pointers beyond pkt_end.") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Amery Hung <ameryhung@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20260331204228.26726-1-alexei.starovoitov@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-04-02futex: Require sys_futex_requeue() to have identical flagsPeter Zijlstra1-0/+8
[ Upstream commit 19f94b39058681dec64a10ebeb6f23fe7fc3f77a ] Nicholas reported that his LLM found it was possible to create a UaF when sys_futex_requeue() is used with different flags. The initial motivation for allowing different flags was the variable sized futex, but since that hasn't been merged (yet), simply mandate the flags are identical, as is the case for the old style sys_futex() requeue operations. Fixes: 0f4b5f972216 ("futex: Add sys_futex_requeue()") Reported-by: Nicholas Carlini <npc@anthropic.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-04-02tracing: Fix potential deadlock in cpu hotplug with osnoiseLuo Haiyang1-5/+5
[ Upstream commit 1f9885732248d22f788e4992c739a98c88ab8a55 ] The following sequence may leads deadlock in cpu hotplug: task1 task2 task3 ----- ----- ----- mutex_lock(&interface_lock) [CPU GOING OFFLINE] cpus_write_lock(); osnoise_cpu_die(); kthread_stop(task3); wait_for_completion(); osnoise_sleep(); mutex_lock(&interface_lock); cpus_read_lock(); [DEAD LOCK] Fix by swap the order of cpus_read_lock() and mutex_lock(&interface_lock). Cc: stable@vger.kernel.org Cc: <mathieu.desnoyers@efficios.com> Cc: <zhang.run@zte.com.cn> Cc: <yang.tao172@zte.com.cn> Cc: <ran.xiaokai@zte.com.cn> Fixes: bce29ac9ce0bb ("trace: Add osnoise tracer") Link: https://patch.msgid.link/20260326141953414bVSj33dAYktqp9Oiyizq8@zte.com.cn Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Luo Haiyang <luo.haiyang@zte.com.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-02tracing: Switch trace_osnoise.c code over to use guard() and __free()Steven Rostedt1-27/+13
[ Upstream commit 930d2b32c0af6895ba4c6ca6404e7f7b6dc214ed ] The osnoise_hotplug_workfn() grabs two mutexes and cpu_read_lock(). It has various gotos to handle unlocking them. Switch them over to guard() and let the compiler worry about it. The osnoise_cpus_read() has a temporary mask_str allocated and there's some gotos to make sure it gets freed on error paths. Switch that over to __free() to let the compiler worry about it. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/20241225222931.517329690@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Stable-dep-of: 1f9885732248 ("tracing: Fix potential deadlock in cpu hotplug with osnoise") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-02futex: Clear stale exiting pointer in futex_lock_pi() retry pathDavidlohr Bueso1-1/+2
commit 210d36d892de5195e6766c45519dfb1e65f3eb83 upstream. Fuzzying/stressing futexes triggered: WARNING: kernel/futex/core.c:825 at wait_for_owner_exiting+0x7a/0x80, CPU#11: futex_lock_pi_s/524 When futex_lock_pi_atomic() sees the owner is exiting, it returns -EBUSY and stores a refcounted task pointer in 'exiting'. After wait_for_owner_exiting() consumes that reference, the local pointer is never reset to nil. Upon a retry, if futex_lock_pi_atomic() returns a different error, the bogus pointer is passed to wait_for_owner_exiting(). CPU0 CPU1 CPU2 futex_lock_pi(uaddr) // acquires the PI futex exit() futex_cleanup_begin() futex_state = EXITING; futex_lock_pi(uaddr) futex_lock_pi_atomic() attach_to_pi_owner() // observes EXITING *exiting = owner; // takes ref return -EBUSY wait_for_owner_exiting(-EBUSY, owner) put_task_struct(); // drops ref // exiting still points to owner goto retry; futex_lock_pi_atomic() lock_pi_update_atomic() cmpxchg(uaddr) *uaddr ^= WAITERS // whatever // value changed return -EAGAIN; wait_for_owner_exiting(-EAGAIN, exiting) // stale WARN_ON_ONCE(exiting) Fix this by resetting upon retry, essentially aligning it with requeue_pi. Fixes: 3ef240eaff36 ("futex: Prevent exit livelock") Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260326001759.4129680-1-dave@stgolabs.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-02alarmtimer: Fix argument order in alarm_timer_forward()Zhan Xusheng1-1/+1
commit 5d16467ae56343b9205caedf85e3a131e0914ad8 upstream. alarm_timer_forward() passes arguments to alarm_forward() in the wrong order: alarm_forward(alarm, timr->it_interval, now); However, alarm_forward() is defined as: u64 alarm_forward(struct alarm *alarm, ktime_t now, ktime_t interval); and uses the second argument as the current time: delta = ktime_sub(now, alarm->node.expires); Passing the interval as "now" results in incorrect delta computation, which can lead to missed expirations or incorrect overrun accounting. This issue has been present since the introduction of alarm_timer_forward(). Fix this by swapping the arguments. Fixes: e7561f1633ac ("alarmtimer: Implement forward callback") Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260323061130.29991-1-zhanxusheng@xiaomi.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-02sysctl: fix uninitialized variable in proc_do_large_bitmapMarc Buerg1-1/+1
[ Upstream commit f63a9df7e3f9f842945d292a19d9938924f066f9 ] proc_do_large_bitmap() does not initialize variable c, which is expected to be set to a trailing character by proc_get_long(). However, proc_get_long() only sets c when the input buffer contains a trailing character after the parsed value. If c is not initialized it may happen to contain a '-'. If this is the case proc_do_large_bitmap() expects to be able to parse a second part of the input buffer. If there is no second part an unjustified -EINVAL will be returned. Initialize c to 0 to prevent returning -EINVAL on valid input. Fixes: 9f977fb7ae9d ("sysctl: add proc_do_large_bitmap") Signed-off-by: Marc Buerg <buermarc@googlemail.com> Reviewed-by: Joel Granados <joel.granados@kernel.org> Signed-off-by: Joel Granados <joel.granados@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>