path: root/kernel
Age | Commit message | Author | Files | Lines
35 hours | kallsyms: prevent module removal when printing module name and buildid | Petr Mladek | 1 file | -0/+3
commit 3b07086444f80c844351255fd94c2cb0a7224df2 upstream. kallsyms_lookup_buildid() copies the symbol name into the given buffer so that it can be safely read anytime later. But it just copies pointers to mod->name and mod->build_id, which might get reused after the related struct module gets removed. The lifetime of struct module is synchronized using RCU. Take the RCU read lock for the entire __sprint_symbol(). Link: https://lkml.kernel.org/r/20251128135920.217303-8-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Aaron Tomlin <atomlin@atomlin.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Kees Cook <kees@kernel.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Petr Pavlu <petr.pavlu@suse.com> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
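A minimal sketch of the locking pattern this fix describes; the surrounding __sprint_symbol() context is assumed rather than copied from the patch:

  /* Hold the RCU read lock across both the lookup and every use of the
   * module-owned strings, so the struct module backing modname and
   * buildid cannot be freed before they are copied into the buffer. */
  rcu_read_lock();
  name = kallsyms_lookup_buildid(address, &size, &offset,
                                 &modname, &buildid, namebuf);
  /* ... append modname and buildid to the output buffer ... */
  rcu_read_unlock();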
35 hours | kallsyms: cleanup code for appending the module buildid | Petr Mladek | 1 file | -9/+33
commit 8e81dac4cd5477731169b92cff7c24f8f6635950 upstream. Put the code for appending the optional "buildid" into a helper function. It makes __sprint_symbol() more readable. Also print a warning when the "modname" is set and the "buildid" isn't. It might catch a situation when some lookup function in kallsyms_lookup_buildid() does not handle the "buildid". Use pr_*_once() to avoid an infinite recursion when the function is called from printk(). The recursion is rather theoretical, but better to be on the safe side. Link: https://lkml.kernel.org/r/20251128135920.217303-5-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Cc: Aaron Tomlin <atomlin@atomlin.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Kees Cook <kees@kernel.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Petr Pavlu <petr.pavlu@suse.com> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
35 hours | kallsyms: clean up modname and modbuildid initialization in kallsyms_lookup_buildid() | Petr Mladek | 1 file | -4/+8
commit fda024fb64769e9d6b3916d013c78d6b189129f8 upstream. The @modname and @modbuildid optional return parameters are set only when the symbol is in a module. Always initialize them so that they do not need to be cleared when the symbol is not in a module. It simplifies the logic and makes the code even slightly safer. Note that the bpf_address_lookup() function will get updated in a separate patch. Link: https://lkml.kernel.org/r/20251128135920.217303-3-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Cc: Aaron Tomlin <atomlin@atomlin.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Kees Cook <kees@kernel.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Petr Pavlu <petr.pavlu@suse.com> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
35 hours | kallsyms: clean up @namebuf initialization in kallsyms_lookup_buildid() | Petr Mladek | 1 file | -1/+6
commit 426295ef18c5d5f0b7f75ac89d09022fcfafd25c upstream. Patch series "kallsyms: Prevent invalid access when showing module buildid", v3. We have seen nested crashes in __sprint_symbol(), see below. They seem to be caused by an invalid pointer to "buildid". This patchset cleans up kallsyms code related to module buildid and fixes this invalid access when printing backtraces. I made an audit of __sprint_symbol() and found several situations when the buildid might be wrong:

+ bpf_address_lookup() does not set @modbuildid
+ ftrace_mod_address_lookup() does not set @modbuildid
+ __sprint_symbol() does not take rcu_read_lock and the related struct module might get removed before mod->build_id is printed.

This patchset solves these problems:

+ 1st, 2nd patches are preparatory
+ 3rd, 4th, 6th patches fix the above problems
+ 5th patch cleans up suspicious initialization code.

This is the backtrace we have seen. But it is not really important; the problems fixed by the patchset are obvious:

  crash64> bt [62/2029]
  PID: 136151 TASK: ffff9f6c981d4000 CPU: 367 COMMAND: "btrfs"
  #0 [ffffbdb687635c28] machine_kexec at ffffffffb4c845b3
  #1 [ffffbdb687635c80] __crash_kexec at ffffffffb4d86a6a
  #2 [ffffbdb687635d08] hex_string at ffffffffb51b3b61
  #3 [ffffbdb687635d40] crash_kexec at ffffffffb4d87964
  #4 [ffffbdb687635d50] oops_end at ffffffffb4c41fc8
  #5 [ffffbdb687635d70] do_trap at ffffffffb4c3e49a
  #6 [ffffbdb687635db8] do_error_trap at ffffffffb4c3e6a4
  #7 [ffffbdb687635df8] exc_stack_segment at ffffffffb5666b33
  #8 [ffffbdb687635e20] asm_exc_stack_segment at ffffffffb5800cf9
  ...

This patch (of 7): The function kallsyms_lookup_buildid() initializes the given @namebuf by clearing the first and the last byte. It is not clear why. The 1st byte makes sense because some callers ignore the return code and expect that the buffer contains a valid string, for example:

- function_stat_show()
- kallsyms_lookup()
- kallsyms_lookup_buildid()

The initialization of the last byte does not make much sense because it can later be overwritten. Fortunately, it seems that all called functions behave correctly:

- kallsyms_expand_symbol() explicitly adds the trailing '\0' at the end of the function.
- All *__address_lookup() functions either use the safe strscpy() or they do not touch the buffer at all.

Document the reason for clearing the first byte and remove the useless initialization of the last byte. Link: https://lkml.kernel.org/r/20251128135920.217303-2-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Aaron Tomlin <atomlin@atomlin.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Kees Cook <kees@kernel.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Petr Pavlu <petr.pavlu@suse.com> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Daniel Gomez <da.gomez@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
35 hours | sched_ext: Fix stale direct dispatch state in ddsp_dsq_id | Andrea Righi | 1 file | -14/+35
commit 7e0ffb72de8aa3b25989c2d980e81b829c577010 upstream. @p->scx.ddsp_dsq_id can be left set (non-SCX_DSQ_INVALID) triggering a spurious warning in mark_direct_dispatch() when the next wakeup's ops.select_cpu() calls scx_bpf_dsq_insert(), such as:

  WARNING: kernel/sched/ext.c:1273 at scx_dsq_insert_commit+0xcd/0x140

The root cause is that ddsp_dsq_id was only cleared in dispatch_enqueue(), which is not reached in all paths that consume or cancel a direct dispatch verdict. Fix it by clearing it at the right places:

- direct_dispatch(): cache the direct dispatch state in local variables and clear it before dispatch_enqueue() on the synchronous path. For the deferred path, the direct dispatch state must remain set until process_ddsp_deferred_locals() consumes them.
- process_ddsp_deferred_locals(): cache the dispatch state in local variables and clear it before calling dispatch_to_local_dsq(), which may migrate the task to another rq.
- do_enqueue_task(): clear the dispatch state on the enqueue path (local/global/bypass fallbacks), where the direct dispatch verdict is ignored.
- dequeue_task_scx(): clear the dispatch state after dispatch_dequeue() to handle both the deferred dispatch cancellation and the holding_cpu race, covering all cases where a pending direct dispatch is cancelled.
- scx_disable_task(): clear the direct dispatch state when transitioning a task out of the current scheduler. Waking tasks may have had the direct dispatch state set by the outgoing scheduler's ops.select_cpu() and then been queued on a wake_list via ttwu_queue_wakelist(), when SCX_OPS_ALLOW_QUEUED_WAKEUP is set. Such tasks are not on the runqueue and are not iterated by scx_bypass(), so their direct dispatch state won't be cleared. Without this clear, any subsequent SCX scheduler that tries to direct dispatch the task will trigger the WARN_ON_ONCE() in mark_direct_dispatch().

Fixes: 5b26f7b920f7 ("sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches") Cc: stable@vger.kernel.org # v6.12+ Cc: Daniel Hodges <hodgesd@meta.com> Cc: Patrick Somaru <patsomaru@meta.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
35 hours | sched_ext: Fix is_bpf_migration_disabled() false negative on non-PREEMPT_RCU | Changwoo Min | 1 file | -12/+19
commit 0c4a59df370bea245695c00aaae6ae75747139bd upstream. Since commit 8e4f0b1ebcf2 ("bpf: use rcu_read_lock_dont_migrate() for trampoline.c"), the BPF prolog (__bpf_prog_enter) calls migrate_disable() only when CONFIG_PREEMPT_RCU is enabled, via rcu_read_lock_dont_migrate(). Without CONFIG_PREEMPT_RCU, the prolog never touches migration_disabled, so migration_disabled == 1 always means the task is truly migration-disabled regardless of whether it is the current task. The old unconditional p == current check was a false negative in this case, potentially allowing a migration-disabled task to be dispatched to a remote CPU and triggering scx_error in task_can_run_on_remote_rq(). Only apply the p == current disambiguation when CONFIG_PREEMPT_RCU is enabled, where the ambiguity with the BPF prolog still exists. Fixes: 8e4f0b1ebcf2 ("bpf: use rcu_read_lock_dont_migrate() for trampoline.c") Cc: stable@vger.kernel.org # v6.18+ Link: https://lore.kernel.org/lkml/20250821090609.42508-8-dongml2@chinatelecom.cn/ Signed-off-by: Changwoo Min <changwoo@igalia.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
35 hours | PM: EM: Fix NULL pointer dereference when perf domain ID is not found | Changwoo Min | 1 file | -0/+2
commit 9badc2a84e688be1275bb740942d5f6f51746908 upstream. dev_energymodel_nl_get_perf_domains_doit() calls em_perf_domain_get_by_id() but does not check the return value before passing it to __em_nl_get_pd_size(). When a caller supplies a non-existent perf domain ID, em_perf_domain_get_by_id() returns NULL, and __em_nl_get_pd_size() immediately dereferences pd->cpus (struct offset 0x30), causing a NULL pointer dereference. The sister handler dev_energymodel_nl_get_perf_table_doit() already handles this correctly via __em_nl_get_pd_table_id(), which returns NULL and causes the caller to return -EINVAL. Add the same NULL check in the get-perf-domains do handler. Fixes: 380ff27af25e ("PM: EM: Add dump to get-perf-domains in the EM YNL spec") Reported-by: Yi Lai <yi1.lai@linux.intel.com> Closes: https://lore.kernel.org/lkml/aXiySM79UYfk+ytd@ly-workstation/ Signed-off-by: Changwoo Min <changwoo@igalia.com> Cc: 6.19+ <stable@vger.kernel.org> # 6.19+ [ rjw: Subject and changelog edits ] Link: https://patch.msgid.link/20260329073615.649976-1-changwoo@igalia.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
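A sketch of the added check; the local variable names are assumptions, while the -EINVAL error matches the sister handler's behavior described above:

  /* Bail out before __em_nl_get_pd_size() can dereference pd->cpus
   * on a NULL perf domain. */
  pd = em_perf_domain_get_by_id(pd_id);
  if (!pd)
          return -EINVAL;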
35 hours | sched_ext: Fix SCX_KICK_WAIT deadlock by deferring wait to balance callback | Tejun Heo | 2 files | -25/+73
commit 415cb193bb9736f0e830286c72a6fa8eb2a9cc5c upstream. SCX_KICK_WAIT busy-waits in kick_cpus_irq_workfn() using smp_cond_load_acquire() until the target CPU's kick_sync advances. Because the irq_work runs in hardirq context, the waiting CPU cannot reschedule and its own kick_sync never advances. If multiple CPUs form a wait cycle, all CPUs deadlock. Replace the busy-wait in kick_cpus_irq_workfn() with resched_curr() to force the CPU through do_pick_task_scx(), which queues a balance callback to perform the wait. The balance callback drops the rq lock and enables IRQs following the sched_core_balance() pattern, so the CPU can process IPIs while waiting. The local CPU's kick_sync is advanced on entry to do_pick_task_scx() and continuously during the wait, ensuring any CPU that starts waiting for us sees the advancement and cannot form cyclic dependencies. Fixes: 90e55164dad4 ("sched_ext: Implement SCX_KICK_WAIT") Cc: stable@vger.kernel.org # v6.12+ Reported-by: Christian Loehle <christian.loehle@arm.com> Link: https://lore.kernel.org/r/20260316100249.1651641-1-christian.loehle@arm.com Signed-off-by: Tejun Heo <tj@kernel.org> Tested-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
35 hours | sched_ext: Fix inconsistent NUMA node lookup in scx_select_cpu_dfl() | Cheng-Yang Chou | 1 file | -1/+1
commit db08b1940f4beb25460b4a4e9da3446454f2e8fe upstream. In the WAKE_SYNC path of scx_select_cpu_dfl(), waker_node was computed with cpu_to_node(), while node (for prev_cpu) was computed with scx_cpu_node_if_enabled(). When scx_builtin_idle_per_node is disabled, idle_cpumask(waker_node) is called with a real node ID even though per-node idle tracking is disabled, resulting in undefined behavior. Fix by using scx_cpu_node_if_enabled() for waker_node as well, ensuring both variables are computed consistently. Fixes: 48849271e6611 ("sched_ext: idle: Per-node idle cpumasks") Cc: stable@vger.kernel.org # v6.15+ Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
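The one-line change, as the commit describes it (the waker CPU variable name is an assumption):

  /* before: a real node ID even when per-node idle tracking is off */
  waker_node = cpu_to_node(cpu);
  /* after: consistent with how 'node' is computed for prev_cpu */
  waker_node = scx_cpu_node_if_enabled(cpu);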
35 hours | sched/debug: Fix avg_vruntime() usage | Peter Zijlstra | 1 file | -1/+3
[ Upstream commit e08d007f9d813616ce7093600bc4fdb9c9d81d89 ] John reported that stress-ng-yield could make his machine unhappy and managed to bisect it to commit b3d99f43c72b ("sched/fair: Fix zero_vruntime tracking"). The commit in question changes avg_vruntime() from a function that is a pure reader, to a function that updates variables. This turns an unlocked sched/debug usage of this function from a minor mistake into a data corruptor. Fixes: af4cf40470c2 ("sched/fair: Add cfs_rq::avg_vruntime") Fixes: b3d99f43c72b ("sched/fair: Fix zero_vruntime tracking") Reported-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: John Stultz <jstultz@google.com> Link: https://patch.msgid.link/20260401132355.196370805@infradead.org Signed-off-by: Sasha Levin <sashal@kernel.org>
35 hours | sched/fair: Fix zero_vruntime tracking fix | Peter Zijlstra | 1 file | -7/+3
[ Upstream commit 1319ea57529e131822bab56bf417c8edc2db9ae8 ] John reported that stress-ng-yield could make his machine unhappy and managed to bisect it to commit b3d99f43c72b ("sched/fair: Fix zero_vruntime tracking"). The combination of yield and that commit was specific enough to hypothesize the following scenario: Suppose we have 2 runnable tasks, both doing yield. Then one will be eligible and one will not be, because the average position must be in between these two entities. Therefore, the runnable task will be eligible, and be promoted a full slice (all the tasks do is yield after all). This causes it to jump over the other task and now the other task is eligible and current is no longer. So we schedule. Since we are runnable, there is no {de,en}queue. All we have is the __{en,de}queue_entity() from {put_prev,set_next}_task(). But per the fingered commit, those two no longer move zero_vruntime. All that moves zero_vruntime are tick and full {de,en}queue. This means, that if the two tasks playing leapfrog can reach the critical speed to reach the overflow point inside one tick's worth of time, we're up a creek. Additionally, when multiple cgroups are involved, there is no guarantee the tick will in fact hit every cgroup in a timely manner. Statistically speaking it will, but that same statistics does not rule out the possibility of one cgroup not getting a tick for a significant amount of time -- however unlikely. Therefore, just like with the yield() case, force an update at the end of every slice. This ensures the update is never more than a single slice behind and the whole thing is within 2 lag bounds as per the comment on entity_key(). Fixes: b3d99f43c72b ("sched/fair: Fix zero_vruntime tracking") Reported-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: John Stultz <jstultz@google.com> Link: https://patch.msgid.link/20260401132355.081530332@infradead.org Signed-off-by: Sasha Levin <sashal@kernel.org>
35 hours | bpf: Fix incorrect pruning due to atomic fetch precision tracking | Daniel Borkmann | 1 file | -3/+24
[ Upstream commit 179ee84a89114b854ac2dd1d293633a7f6c8dac1 ] When backtrack_insn encounters a BPF_STX instruction with BPF_ATOMIC and BPF_FETCH, the src register (or r0 for BPF_CMPXCHG) also acts as a destination, thus receiving the old value from the memory location. The current backtracking logic does not account for this. It treats atomic fetch operations the same as regular stores where the src register is only an input. This leads the backtrack_insn to fail to propagate precision to the stack location, which is then not marked as precise! Later, the verifier's path pruning can incorrectly consider two states equivalent when they differ in terms of stack state. Meaning, two branches can be treated as equivalent and thus get pruned when they should not be seen as such. Fix it as follows: Extend the BPF_LDX handling in backtrack_insn to also cover atomic fetch operations via is_atomic_fetch_insn() helper. When the fetch dst register is being tracked for precision, clear it, and propagate precision over to the stack slot. For non-stack memory, the precision walk stops at the atomic instruction, same as regular BPF_LDX. This covers all fetch variants.

Before:

  0: (b7) r1 = 8                        ; R1=8
  1: (7b) *(u64 *)(r10 -8) = r1         ; R1=8 R10=fp0 fp-8=8
  2: (b7) r2 = 0                        ; R2=0
  3: (db) r2 = atomic64_fetch_add((u64 *)(r10 -8), r2)  ; R2=8 R10=fp0 fp-8=mmmmmmmm
  4: (bf) r3 = r10                      ; R3=fp0 R10=fp0
  5: (0f) r3 += r2
  mark_precise: frame0: last_idx 5 first_idx 0 subseq_idx -1
  mark_precise: frame0: regs=r2 stack= before 4: (bf) r3 = r10
  mark_precise: frame0: regs=r2 stack= before 3: (db) r2 = atomic64_fetch_add((u64 *)(r10 -8), r2)
  mark_precise: frame0: regs=r2 stack= before 2: (b7) r2 = 0
  6: R2=8 R3=fp8
  6: (b7) r0 = 0                        ; R0=0
  7: (95) exit

After:

  0: (b7) r1 = 8                        ; R1=8
  1: (7b) *(u64 *)(r10 -8) = r1         ; R1=8 R10=fp0 fp-8=8
  2: (b7) r2 = 0                        ; R2=0
  3: (db) r2 = atomic64_fetch_add((u64 *)(r10 -8), r2)  ; R2=8 R10=fp0 fp-8=mmmmmmmm
  4: (bf) r3 = r10                      ; R3=fp0 R10=fp0
  5: (0f) r3 += r2
  mark_precise: frame0: last_idx 5 first_idx 0 subseq_idx -1
  mark_precise: frame0: regs=r2 stack= before 4: (bf) r3 = r10
  mark_precise: frame0: regs=r2 stack= before 3: (db) r2 = atomic64_fetch_add((u64 *)(r10 -8), r2)
  mark_precise: frame0: regs= stack=-8 before 2: (b7) r2 = 0
  mark_precise: frame0: regs= stack=-8 before 1: (7b) *(u64 *)(r10 -8) = r1
  mark_precise: frame0: regs=r1 stack= before 0: (b7) r1 = 8
  6: R2=8 R3=fp8
  6: (b7) r0 = 0                        ; R0=0
  7: (95) exit

Fixes: 5ffa25502b5a ("bpf: Add instructions for atomic_[cmp]xchg") Fixes: 5ca419f2864a ("bpf: Add BPF_FETCH field / create atomic_fetch_add instruction") Reported-by: STAR Labs SG <info@starlabs.sg> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260331222020.401848-1-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
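A sketch of what a helper like is_atomic_fetch_insn() plausibly tests; the exact upstream definition is not shown in this log, so treat the body as an assumption:

  static bool is_atomic_fetch_insn(const struct bpf_insn *insn)
  {
          /* BPF_XCHG and BPF_CMPXCHG encode BPF_FETCH in their imm
           * values, so testing the fetch bit covers all variants. */
          return BPF_CLASS(insn->code) == BPF_STX &&
                 BPF_MODE(insn->code) == BPF_ATOMIC &&
                 (insn->imm & BPF_FETCH);
  }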
35 hours | bpf: Reject sleepable kprobe_multi programs at attach time | Varun R Mallya | 1 file | -0/+4
[ Upstream commit eb7024bfcc5f68ed11ed9dd4891a3073c15f04a8 ] kprobe.multi programs run in atomic/RCU context and cannot sleep. However, bpf_kprobe_multi_link_attach() did not validate whether the program being attached had the sleepable flag set, allowing sleepable helpers such as bpf_copy_from_user() to be invoked from a non-sleepable context. This causes a "sleeping function called from invalid context" splat:

  BUG: sleeping function called from invalid context at ./include/linux/uaccess.h:169
  in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1787, name: sudo
  preempt_count: 1, expected: 0
  RCU nest depth: 2, expected: 0

Fix this by rejecting sleepable programs early in bpf_kprobe_multi_link_attach(), before any further processing. Fixes: 0dcac2725406 ("bpf: Add multi kprobe link") Signed-off-by: Varun R Mallya <varunrmallya@gmail.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Leon Hwang <leon.hwang@linux.dev> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260401191126.440683-1-varunrmallya@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
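A sketch of the early rejection; the field name is an assumption (recent kernels carry the flag as prog->sleepable, older ones as prog->aux->sleepable):

  /* kprobe.multi handlers run in atomic/RCU context, so a sleepable
   * program can never be safe to attach here. */
  if (prog->sleepable)
          return -EINVAL;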
35 hours | bpf: reject direct access to nullable PTR_TO_BUF pointers | Qi Tang | 1 file | -1/+2
[ Upstream commit b0db1accbc7395657c2b79db59fa9fae0d6656f3 ] check_mem_access() matches PTR_TO_BUF via base_type() which strips PTR_MAYBE_NULL, allowing direct dereference without a null check. Map iterator ctx->key and ctx->value are PTR_TO_BUF | PTR_MAYBE_NULL. On stop callbacks these are NULL, causing a kernel NULL dereference. Add a type_may_be_null() guard to the PTR_TO_BUF branch, matching the existing PTR_TO_BTF_ID pattern. Fixes: 20b2aff4bc15 ("bpf: Introduce MEM_RDONLY flag") Signed-off-by: Qi Tang <tpluszz77@gmail.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260402092923.38357-2-tpluszz77@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
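A sketch of the guard in the PTR_TO_BUF branch of check_mem_access(); the error message and exact placement are assumptions modeled on the PTR_TO_BTF_ID pattern the commit cites:

  if (base_type(reg->type) == PTR_TO_BUF) {
          if (type_may_be_null(reg->type)) {
                  verbose(env, "R%d invalid mem access '%s'\n", regno,
                          reg_type_str(env, reg->type));
                  return -EACCES;
          }
          /* ... existing rdonly/rdwr buffer access checks ... */
  }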
35 hours | bpf: Fix regsafe() for pointers to packet | Alexei Starovoitov | 1 file | -1/+6
[ Upstream commit a8502a79e832b861e99218cbd2d8f4312d62e225 ] In case rold->reg->range == BEYOND_PKT_END && rcur->reg->range == N, regsafe() may return true, which may lead to a current state with a valid packet range not being explored. Fix the bug. Fixes: 6d94e741a8ff ("bpf: Support for pointers beyond pkt_end.") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Amery Hung <ameryhung@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20260331204228.26726-1-alexei.starovoitov@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
35 hours | cgroup: Fix cgroup_drain_dying() testing the wrong condition | Tejun Heo | 1 file | -20/+22
[ Upstream commit 4c56a8ac6869855866de0bb368a4189739e1d24f ] cgroup_drain_dying() was using cgroup_is_populated() to test whether there are dying tasks to wait for. cgroup_is_populated() tests nr_populated_csets, nr_populated_domain_children and nr_populated_threaded_children, but cgroup_drain_dying() only needs to care about this cgroup's own tasks - whether there are children is cgroup_destroy_locked()'s concern. This caused hangs during shutdown. When systemd tried to rmdir a cgroup that had no direct tasks but had a populated child, cgroup_drain_dying() would enter its wait loop because cgroup_is_populated() was true from nr_populated_domain_children. The task iterator found nothing to wait for, yet the populated state never cleared because it was driven by live tasks in the child cgroup. Fix it by using cgroup_has_tasks() which only tests nr_populated_csets. v3: Fix cgroup_is_populated() -> cgroup_has_tasks() (Sebastian). v2: https://lore.kernel.org/r/20260323200205.1063629-1-tj@kernel.org Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Fixes: 1b164b876c36 ("cgroup: Wait for dying tasks to leave on rmdir") Signed-off-by: Tejun Heo <tj@kernel.org> Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
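The substance of the change, sketched; the surrounding control flow of cgroup_drain_dying() is assumed:

  /* before: also true when only a child cgroup is populated */
  if (!cgroup_is_populated(cgrp))
          return;
  /* after: only this cgroup's own csets (nr_populated_csets) matter */
  if (!cgroup_has_tasks(cgrp))
          return;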
35 hours | cgroup: Wait for dying tasks to leave on rmdir | Tejun Heo | 1 file | -3/+83
[ Upstream commit 1b164b876c36c3eb5561dd9b37702b04401b0166 ] a72f73c4dd9b ("cgroup: Don't expose dead tasks in cgroup") hid PF_EXITING tasks from cgroup.procs so that systemd doesn't see tasks that have already been reaped via waitpid(). However, the populated counter (nr_populated_csets) is only decremented when the task later passes through cgroup_task_dead() in finish_task_switch(). This means cgroup.procs can appear empty while the cgroup is still populated, causing rmdir to fail with -EBUSY. Fix this by making cgroup_rmdir() wait for dying tasks to fully leave. If the cgroup is populated but all remaining tasks have PF_EXITING set (the task iterator returns none due to the existing filter), wait for a kick from cgroup_task_dead() and retry. The wait is brief as tasks are removed from the cgroup's css_set between PF_EXITING assertion in do_exit() and cgroup_task_dead() in finish_task_switch(). v2: cgroup_is_populated() true to false transition happens under css_set_lock not cgroup_mutex, so retest under css_set_lock before sleeping to avoid missed wakeups (Sebastian). Fixes: a72f73c4dd9b ("cgroup: Don't expose dead tasks in cgroup") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202603222104.2c81684e-lkp@intel.com Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Bert Karwatzki <spasswolf@web.de> Cc: Michal Koutny <mkoutny@suse.com> Cc: cgroups@vger.kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
11 days | futex: Fix UaF between futex_key_to_node_opt() and vma_replace_policy() | Hao-Yu Yang | 1 file | -1/+1
[ Upstream commit 190a8c48ff623c3d67cb295b4536a660db2012aa ] During futex_key_to_node_opt() execution, vma->vm_policy is read under speculative mmap lock and RCU. Concurrently, mbind() may call vma_replace_policy() which frees the old mempolicy immediately via kmem_cache_free(). This creates a race where __futex_key_to_node() dereferences a freed mempolicy pointer, causing a use-after-free read of mpol->mode.

  [ 151.412631] BUG: KASAN: slab-use-after-free in __futex_key_to_node (kernel/futex/core.c:349)
  [ 151.414046] Read of size 2 at addr ffff888001c49634 by task e/87
  [ 151.415969] Call Trace:
  [ 151.416732]  __asan_load2 (mm/kasan/generic.c:271)
  [ 151.416777]  __futex_key_to_node (kernel/futex/core.c:349)
  [ 151.416822]  get_futex_key (kernel/futex/core.c:374 kernel/futex/core.c:386 kernel/futex/core.c:593)

Fix by adding rcu to __mpol_put(). Fixes: c042c505210d ("futex: Implement FUTEX2_MPOL") Reported-by: Hao-Yu Yang <naup96721@gmail.com> Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Hao-Yu Yang <naup96721@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Link: https://patch.msgid.link/20260324174418.GB1850007@noisy.programming.kicks-ass.net Signed-off-by: Sasha Levin <sashal@kernel.org>
11 days | futex: Require sys_futex_requeue() to have identical flags | Peter Zijlstra | 1 file | -0/+8
[ Upstream commit 19f94b39058681dec64a10ebeb6f23fe7fc3f77a ] Nicholas reported that his LLM found it was possible to create a UaF when sys_futex_requeue() is used with different flags. The initial motivation for allowing different flags was the variable sized futex, but since that hasn't been merged (yet), simply mandate the flags are identical, as is the case for the old style sys_futex() requeue operations. Fixes: 0f4b5f972216 ("futex: Add sys_futex_requeue()") Reported-by: Nicholas Carlini <npc@anthropic.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
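A sketch of the mandated check; the struct and field names are assumptions, as the commit only states that the two requeue arguments' flags must now be identical:

  /* With variable-sized futexes not merged, requeueing between futexes
   * of different flavors has no valid use and enabled a UaF. */
  if (futexes[0].flags != futexes[1].flags)
          return -EINVAL;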
11 days | futex: Clear stale exiting pointer in futex_lock_pi() retry path | Davidlohr Bueso | 1 file | -1/+2
commit 210d36d892de5195e6766c45519dfb1e65f3eb83 upstream. Fuzzing/stressing futexes triggered:

  WARNING: kernel/futex/core.c:825 at wait_for_owner_exiting+0x7a/0x80, CPU#11: futex_lock_pi_s/524

When futex_lock_pi_atomic() sees the owner is exiting, it returns -EBUSY and stores a refcounted task pointer in 'exiting'. After wait_for_owner_exiting() consumes that reference, the local pointer is never reset to nil. Upon a retry, if futex_lock_pi_atomic() returns a different error, the bogus pointer is passed to wait_for_owner_exiting().

  CPU0: futex_lock_pi(uaddr)              // acquires the PI futex
  CPU1: exit()
        futex_cleanup_begin()
        futex_state = EXITING;
  CPU2: futex_lock_pi(uaddr)
        futex_lock_pi_atomic()
        attach_to_pi_owner()              // observes EXITING
        *exiting = owner;                 // takes ref
        return -EBUSY
        wait_for_owner_exiting(-EBUSY, owner)
        put_task_struct();                // drops ref
        // exiting still points to owner
        goto retry;
        futex_lock_pi_atomic()
        lock_pi_update_atomic()
        cmpxchg(uaddr)
              (concurrent) *uaddr ^= WAITERS   // whatever
        // value changed
        return -EAGAIN;
        wait_for_owner_exiting(-EAGAIN, exiting)   // stale
        WARN_ON_ONCE(exiting)

Fix this by resetting upon retry, essentially aligning it with requeue_pi. Fixes: 3ef240eaff36 ("futex: Prevent exit livelock") Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260326001759.4129680-1-dave@stgolabs.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
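A sketch of the fix; the retry label and surrounding context are assumed:

  /* Drop the stale task pointer before retrying, so a later non-EBUSY
   * error cannot pair wait_for_owner_exiting() with a pointer whose
   * reference was already consumed. */
  exiting = NULL;
  goto retry;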
11 days | alarmtimer: Fix argument order in alarm_timer_forward() | Zhan Xusheng | 1 file | -1/+1
commit 5d16467ae56343b9205caedf85e3a131e0914ad8 upstream. alarm_timer_forward() passes arguments to alarm_forward() in the wrong order:

  alarm_forward(alarm, timr->it_interval, now);

However, alarm_forward() is defined as:

  u64 alarm_forward(struct alarm *alarm, ktime_t now, ktime_t interval);

and uses the second argument as the current time:

  delta = ktime_sub(now, alarm->node.expires);

Passing the interval as "now" results in incorrect delta computation, which can lead to missed expirations or incorrect overrun accounting. This issue has been present since the introduction of alarm_timer_forward(). Fix this by swapping the arguments. Fixes: e7561f1633ac ("alarmtimer: Implement forward callback") Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260323061130.29991-1-zhanxusheng@xiaomi.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
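The corrected call then reads as follows, directly implied by the signature quoted above:

  /* arguments now match alarm_forward(alarm, now, interval) */
  alarm_forward(alarm, now, timr->it_interval);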
11 days | tracing: Fix potential deadlock in cpu hotplug with osnoise | Luo Haiyang | 1 file | -5/+5
commit 1f9885732248d22f788e4992c739a98c88ab8a55 upstream. The following sequence may lead to a deadlock in cpu hotplug:

  task1: mutex_lock(&interface_lock)
  task2: [CPU GOING OFFLINE]
         cpus_write_lock();
         osnoise_cpu_die();
         kthread_stop(task3);
         wait_for_completion();
  task3: osnoise_sleep();
         mutex_lock(&interface_lock);
  task1: cpus_read_lock();
  [DEADLOCK]

Fix by swapping the order of cpus_read_lock() and mutex_lock(&interface_lock). Cc: stable@vger.kernel.org Cc: <mathieu.desnoyers@efficios.com> Cc: <zhang.run@zte.com.cn> Cc: <yang.tao172@zte.com.cn> Cc: <ran.xiaokai@zte.com.cn> Fixes: bce29ac9ce0bb ("trace: Add osnoise tracer") Link: https://patch.msgid.link/20260326141953414bVSj33dAYktqp9Oiyizq8@zte.com.cn Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Luo Haiyang <luo.haiyang@zte.com.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
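A sketch of the resulting ordering on the osnoise side; the exact function context is assumed:

  /* Take the hotplug read lock first, so a writer holding
   * cpus_write_lock() can never end up waiting behind interface_lock. */
  cpus_read_lock();
  mutex_lock(&interface_lock);
  /* ... touch the osnoise CPU configuration ... */
  mutex_unlock(&interface_lock);
  cpus_read_unlock();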
11 days | tracing: Drain deferred trigger frees if kthread creation fails | Wesley Atwell | 1 file | -13/+66
commit 250ab25391edeeab8462b68be42e4904506c409c upstream. Boot-time trigger registration can fail before the trigger-data cleanup kthread exists. Deferring those frees until late init is fine, but the post-boot fallback must still drain the deferred list if kthread creation never succeeds. Otherwise, boot-deferred nodes can accumulate on trigger_data_free_list, later frees fall back to synchronously freeing only the current object, and the older queued entries are leaked forever. To trigger this, add the following to the kernel command line:

  trace_event=sched_switch trace_trigger=sched_switch.traceon,sched_switch.traceon

The second traceon trigger will fail and be freed. This triggers a NULL pointer dereference and crashes the kernel. Keep the deferred boot-time behavior, but when kthread creation fails, drain the whole queued list synchronously. Do the same in the late-init drain path so queued entries are not stranded there either. Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260324221326.1395799-3-atwellwea@gmail.com Fixes: 61d445af0a7c ("tracing: Add bulk garbage collection of freeing event_trigger_data") Signed-off-by: Wesley Atwell <atwellwea@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
11 days | sysctl: fix uninitialized variable in proc_do_large_bitmap | Marc Buerg | 1 file | -1/+1
[ Upstream commit f63a9df7e3f9f842945d292a19d9938924f066f9 ] proc_do_large_bitmap() does not initialize variable c, which is expected to be set to a trailing character by proc_get_long(). However, proc_get_long() only sets c when the input buffer contains a trailing character after the parsed value. If c is not initialized it may happen to contain a '-'. If this is the case proc_do_large_bitmap() expects to be able to parse a second part of the input buffer. If there is no second part an unjustified -EINVAL will be returned. Initialize c to 0 to prevent returning -EINVAL on valid input. Fixes: 9f977fb7ae9d ("sysctl: add proc_do_large_bitmap") Signed-off-by: Marc Buerg <buermarc@googlemail.com> Reviewed-by: Joel Granados <joel.granados@kernel.org> Signed-off-by: Joel Granados <joel.granados@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
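The one-line fix, sketched; the declaration context inside proc_do_large_bitmap() is assumed:

  /* proc_get_long() only writes c when a trailing character exists, so
   * it must not start out as stack garbage that may look like '-'. */
  char c = 0;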
11 days | PM: sleep: Drop spurious WARN_ON() from pm_restore_gfp_mask() | Youngjun Park | 1 file | -1/+1
[ Upstream commit a8d51efb5929ae308895455a3e496b5eca2cd143 ] Commit 35e4a69b2003f ("PM: sleep: Allow pm_restrict_gfp_mask() stacking") introduced refcount-based GFP mask management that warns when pm_restore_gfp_mask() is called with saved_gfp_count == 0. Some hibernation paths call pm_restore_gfp_mask() defensively where the GFP mask may or may not be restricted depending on the execution path. For example, the uswsusp interface invokes it in SNAPSHOT_CREATE_IMAGE, SNAPSHOT_UNFREEZE, and snapshot_release(). Before the stacking change this was a silent no-op; it now triggers a spurious WARNING. Remove the WARN_ON() wrapper from the !saved_gfp_count check while retaining the check itself, so that defensive calls remain harmless without producing false warnings. Fixes: 35e4a69b2003f ("PM: sleep: Allow pm_restrict_gfp_mask() stacking") Signed-off-by: Youngjun Park <youngjun.park@lge.com> [ rjw: Subject tweak ] Link: https://patch.msgid.link/20260322120528.750178-1-youngjun.park@lge.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
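The shape of the change, inferred from the description above (the exact statement is an assumption):

  /* before: defensive no-op calls tripped the warning */
  if (WARN_ON(!saved_gfp_count))
          return;
  /* after: the guard stays, the warning goes */
  if (!saved_gfp_count)
          return;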
11 days | PM: hibernate: Drain trailing zero pages on userspace restore | Alberto Garcia | 1 file | -0/+11
[ Upstream commit 734eba62cd32cb9ceffa09e57cdc03d761528525 ] Commit 005e8dddd497 ("PM: hibernate: don't store zero pages in the image file") added an optimization to skip zero-filled pages in the hibernation image. On restore, zero pages are handled internally by snapshot_write_next() in a loop that processes them without returning to the caller. With the userspace restore interface, writing the last non-zero page to /dev/snapshot is followed by the SNAPSHOT_ATOMIC_RESTORE ioctl. At this point there are no more calls to snapshot_write_next() so any trailing zero pages are not processed, snapshot_image_loaded() fails because handle->cur is smaller than expected, the ioctl returns -EPERM and the image is not restored. The in-kernel restore path is not affected by this because the loop in load_image() in swap.c calls snapshot_write_next() until it returns 0. It is this final call that drains any trailing zero pages. Fixed by calling snapshot_write_next() in snapshot_write_finalize(), giving the kernel the chance to drain any trailing zero pages. Fixes: 005e8dddd497 ("PM: hibernate: don't store zero pages in the image file") Signed-off-by: Alberto Garcia <berto@igalia.com> Acked-by: Brian Geffon <bgeffon@google.com> Link: https://patch.msgid.link/ef5a7c5e3e3dbd17dcb20efaa0c53a47a23498bb.1773075892.git.berto@igalia.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
11 days | dma: swiotlb: add KMSAN annotations to swiotlb_bounce() | Shigeru Yoshida | 1 file | -2/+19
[ Upstream commit 6f770b73d0311a5b099277653199bb6421c4fed2 ] When a device performs DMA to a bounce buffer, KMSAN is unaware of the write and does not mark the data as initialized. When swiotlb_bounce() later copies the bounce buffer back to the original buffer, memcpy propagates the uninitialized shadow to the original buffer, causing false positive uninit-value reports. Fix this by calling kmsan_unpoison_memory() on the bounce buffer before copying it back in the DMA_FROM_DEVICE path, so that memcpy naturally propagates initialized shadow to the destination. Suggested-by: Alexander Potapenko <glider@google.com> Link: https://lore.kernel.org/CAG_fn=WUGta-paG1BgsGRoAR+fmuCgh3xo=R3XdzOt_-DqSdHw@mail.gmail.com/ Fixes: 7ade4f10779c ("dma: kmsan: unpoison DMA mappings") Signed-off-by: Shigeru Yoshida <syoshida@redhat.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20260315082750.2375581-1-syoshida@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>
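A sketch of the annotation in the DMA_FROM_DEVICE path of swiotlb_bounce(); local variable names are assumptions:

  /* The device wrote the bounce buffer behind KMSAN's back; mark it
   * initialized so the memcpy propagates clean shadow to the original
   * buffer instead of stale "uninitialized" shadow. */
  kmsan_unpoison_memory(vaddr, sz);
  memcpy(buf, vaddr, sz);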
11 days | sched_ext: Use WRITE_ONCE() for the write side of dsq->seq update | zhidao su | 1 file | -1/+1
[ Upstream commit 7a8464555d2e5f038758bb19e72ab4710b79e9cd ] bpf_iter_scx_dsq_new() reads dsq->seq via READ_ONCE() without holding any lock, making dsq->seq a lock-free concurrently accessed variable. However, dispatch_enqueue(), the sole writer of dsq->seq, uses a plain increment without the matching WRITE_ONCE() on the write side:

  dsq->seq++;
  ^^^^^^^^^^^ plain write -- KCSAN data race

The KCSAN documentation requires that if one accessor uses READ_ONCE() or WRITE_ONCE() on a variable to annotate lock-free access, all other accesses must also use the appropriate accessor. A plain write leaves the pair incomplete and will trigger KCSAN warnings. Fix by using WRITE_ONCE() for the write side of the update:

  WRITE_ONCE(dsq->seq, dsq->seq + 1);

This is consistent with bpf_iter_scx_dsq_new() and makes the concurrent access annotation complete and KCSAN-clean. Signed-off-by: zhidao su <suzhidao@xiaomi.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
11 days | bpf: Fix u32/s32 bounds when ranges cross min/max boundary | Eduard Zingerman | 1 file | -0/+24
[ Upstream commit fbc7aef517d8765e4c425d2792409bb9bf2e1f13 ] Same as in __reg64_deduce_bounds(), refine s32/u32 ranges in __reg32_deduce_bounds() in the following situations:

- s32 range crosses the U32_MAX/0 boundary, positive part of the s32 range overlaps with the u32 range:

  0                                                    U32_MAX
  |  [xxxxxxxxxxxxxx u32 range xxxxxxxxxxxxxx]               |
  |----------------------------|----------------------------|
  |xxxxx s32 range xxxxxxxxx]                        [xxxxxxx|
  0                      S32_MAX S32_MIN                   -1

- s32 range crosses the U32_MAX/0 boundary, negative part of the s32 range overlaps with the u32 range:

  0                                                    U32_MAX
  |  [xxxxxxxxxxxxxx u32 range xxxxxxxxxxxxxx]               |
  |----------------------------|----------------------------|
  |xxxxxxxxx]                     [xxxxxxxxxxxx s32 range    |
  0                      S32_MAX S32_MIN                   -1

- No refinement if the ranges overlap in two intervals.

This helps, e.g., with the following program:

  call %[bpf_get_prandom_u32];
  w0 &= 0xffffffff;
  if w0 < 0x3 goto 1f;   // on fall-through u32 range [3..U32_MAX]
  if w0 s> 0x1 goto 1f;  // on fall-through s32 range [S32_MIN..1]
  if w0 s< 0x0 goto 1f;  // range can be narrowed to [S32_MIN..-1]
  r10 = 0;
1: ...;

The reg_bounds.c selftest is updated to incorporate identical logic, refinement based on non-overflowing range halves:

  ((x ∩ [0, smax]) ∩ (y ∩ [0, smax])) ∪ ((x ∩ [smin,-1]) ∩ (y ∩ [smin,-1]))

Reported-by: Andrea Righi <arighi@nvidia.com> Reported-by: Emil Tsalapatis <emil@etsalapatis.com> Closes: https://lore.kernel.org/bpf/aakqucg4vcujVwif@gpd4/T/ Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260306-bpf-32-bit-range-overflow-v3-1-f7f67e060a6b@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
11 days | module: Fix kernel panic when a symbol st_shndx is out of bounds | Ihor Solodrai | 1 file | -0/+7
[ Upstream commit f9d69d5e7bde2295eb7488a56f094ac8f5383b92 ] The module loader doesn't check for bounds of the ELF section index in simplify_symbols():

  for (i = 1; i < symsec->sh_size / sizeof(Elf_Sym); i++) {
          const char *name = info->strtab + sym[i].st_name;

          switch (sym[i].st_shndx) {
          case SHN_COMMON:
          [...]
          default:
                  /* Divert to percpu allocation if a percpu var. */
                  if (sym[i].st_shndx == info->index.pcpu)
                          secbase = (unsigned long)mod_percpu(mod);
                  else
  /** HERE --> **/        secbase = info->sechdrs[sym[i].st_shndx].sh_addr;
                  sym[i].st_value += secbase;
                  break;
          }
  }

A symbol with an out-of-bounds st_shndx value, for example 0xffff (known as SHN_XINDEX or SHN_HIRESERVE), may cause a kernel panic:

  BUG: unable to handle page fault for address: ...
  RIP: 0010:simplify_symbols+0x2b2/0x480
  ...
  Kernel panic - not syncing: Fatal exception

This can happen when module ELF is legitimately using SHN_XINDEX or when it is corrupted. Add a bounds check in simplify_symbols() to validate that st_shndx is within the valid range before using it. This issue was discovered due to a bug in llvm-objcopy, see relevant discussion for details [1]. [1] https://lore.kernel.org/linux-modules/20251224005752.201911-1-ihor.solodrai@linux.dev/ Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev> Reviewed-by: Daniel Gomez <da.gomez@samsung.com> Reviewed-by: Petr Pavlu <petr.pavlu@suse.com> Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
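A sketch of the added validation in the default case; the message text and error handling are assumptions:

  /* Covers corrupt indices as well as reserved values such as
   * SHN_XINDEX (0xffff) that no earlier switch case handled. */
  if (sym[i].st_shndx >= info->hdr->e_shnum) {
          pr_warn("%s: symbol '%s' has out-of-bounds section index %u\n",
                  mod->name, name, sym[i].st_shndx);
          return -ENOEXEC;
  }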
11 days | bpf: Fix unsound scalar forking in maybe_fork_scalars() for BPF_OR | Daniel Wade | 1 file | -1/+1
[ Upstream commit c845894ebd6fb43226b3118d6b017942550910c5 ] maybe_fork_scalars() is called for both BPF_AND and BPF_OR when the source operand is a constant. When dst has signed range [-1, 0], it forks the verifier state: the pushed path gets dst = 0, the current path gets dst = -1. For BPF_AND this is correct: 0 & K == 0. For BPF_OR this is wrong: 0 | K == K, not 0. The pushed path therefore tracks dst as 0 when the runtime value is K, producing an exploitable verifier/runtime divergence that allows out-of-bounds map access. Fix this by passing env->insn_idx (instead of env->insn_idx + 1) to push_stack(), so the pushed path re-executes the ALU instruction with dst = 0 and naturally computes the correct result for any opcode. Fixes: bffacdb80b93 ("bpf: Recognize special arithmetic shift in the verifier") Signed-off-by: Daniel Wade <danjwade95@gmail.com> Reviewed-by: Amery Hung <ameryhung@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260314021521.128361-2-danjwade95@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
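The shape of the one-line fix; the other push_stack() arguments are assumed unchanged:

  /* before: the forked path resumed after the ALU insn, keeping a
   * dst = 0 result that is only correct for BPF_AND */
  push_stack(env, env->insn_idx + 1, env->prev_insn_idx, false);
  /* after: re-execute the ALU insn, so dst = 0 is combined with the
   * constant by the opcode itself (0 & K == 0, 0 | K == K) */
  push_stack(env, env->insn_idx, env->prev_insn_idx, false);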
11 days | bpf: Fix undefined behavior in interpreter sdiv/smod for INT_MIN | Jenny Guanni Qu | 1 file | -8/+14
[ Upstream commit c77b30bd1dcb61f66c640ff7d2757816210c7cb0 ] The BPF interpreter's signed 32-bit division and modulo handlers use the kernel abs() macro on s32 operands. The abs() macro documentation (include/linux/math.h) explicitly states the result is undefined when the input is the type minimum. When DST contains S32_MIN (0x80000000), abs((s32)DST) triggers undefined behavior and returns S32_MIN unchanged on arm64/x86. This value is then sign-extended to u64 as 0xFFFFFFFF80000000, causing do_div() to compute the wrong result. The verifier's abstract interpretation (scalar32_min_max_sdiv) computes the mathematically correct result for range tracking, creating a verifier/interpreter mismatch that can be exploited for out-of-bounds map value access. Introduce abs_s32() which handles S32_MIN correctly by casting to u32 before negating, avoiding signed overflow entirely. Replace all 8 abs((s32)...) call sites in the interpreter's sdiv32/smod32 handlers. s32 is the only affected case -- the s64 division/modulo handlers do not use abs(). Fixes: ec0e2da95f72 ("bpf: Support new signed div/mod instructions.") Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Jenny Guanni Qu <qguanni@gmail.com> Link: https://lore.kernel.org/r/20260311011116.2108005-2-qguanni@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
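A sketch of the helper; the cast-before-negate approach is specified by the commit text, the exact body is an assumption:

  /* Safe for S32_MIN: the negation happens in unsigned arithmetic, so
   * -(u32)0x80000000 is well-defined and yields 0x80000000. */
  static inline u32 abs_s32(s32 v)
  {
          return v < 0 ? -(u32)v : (u32)v;
  }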
11 days | bpf: Fix exception exit lock checking for subprogs | Ihor Solodrai