summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
9 daysMerge tag 'probes-fixes-v7.0-rc3' of ↵Linus Torvalds1-2/+6
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probes fixes from Masami Hiramatsu: - Avoid crash when rmmod/insmod after ftrace killed This fixes a kernel crash caused by kprobes on the symbol in a module which is unloaded after ftrace_kill() is called. - Remove unneeded warnings from __arm_kprobe_ftrace() Remove unneeded WARN messages which can be triggered if the kprobe is using ftrace and it fails to enable the ftrace. Since kprobes correctly handle such failure, we don't need to warn it. * tag 'probes-fixes-v7.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: kprobes: Remove unneeded warnings from __arm_kprobe_ftrace() kprobes: avoid crash when rmmod/insmod after ftrace killed
9 daysMerge tag 'timers-urgent-2026-03-15' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Ingo Molnar: "Fix function tracer recursion bug by marking jiffies_64_to_clock_t() notrace" * tag 'timers-urgent-2026-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: time/jiffies: Mark jiffies_64_to_clock_t() notrace
9 daysMerge tag 'sched-urgent-2026-03-15' of ↵Linus Torvalds2-53/+29
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "More MM-CID fixes, mostly fixing hangs/races: - Fix CID hangs due to a race between concurrent forks - Fix vfork()/CLONE_VM MMCID bug causing hangs - Remove pointless preemption guard - Fix CID task list walk performance regression on large systems by removing the known-flaky and slow counting logic using for_each_process_thread() in mm_cid_*fixup_tasks_to_cpus(), and implementing a simple sched_mm_cid::node list instead" * tag 'sched-urgent-2026-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/mmcid: Avoid full tasklist walks sched/mmcid: Remove pointless preempt guard sched/mmcid: Handle vfork()/CLONE_VM correctly sched/mmcid: Prevent CID stalls due to concurrent forks
11 daysMerge tag 'wq-for-7.0-rc3-fixes' of ↵Linus Torvalds2-27/+29
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fixes from Tejun Heo: - Improve workqueue stall diagnostics: dump all busy workers (not just running ones), show wall-clock duration of in-flight work items, and add a sample module for reproducing stalls - Fix POOL_BH vs WQ_BH flag namespace mismatch in pr_cont_worker_id() - Rename pool->watchdog_ts to pool->last_progress_ts and related functions for clarity * tag 'wq-for-7.0-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: Rename show_cpu_pool{s,}_hog{s,}() to reflect broadened scope workqueue: Add stall detector sample module workqueue: Show all busy workers in stall diagnostics workqueue: Show in-flight work item duration in stall diagnostics workqueue: Rename pool->watchdog_ts to pool->last_progress_ts workqueue: Use POOL_BH instead of WQ_BH when checking pool flags
11 daysMerge tag 'cgroup-for-7.0-rc3-fixes' of ↵Linus Torvalds2-28/+37
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: - Hide PF_EXITING tasks from cgroup.procs to avoid exposing dead tasks that haven't been removed yet, fixing a systemd timeout issue on PREEMPT_RT - Call rebuild_sched_domains() directly in CPU hotplug instead of deferring to a workqueue, fixing a race where online/offline CPUs could briefly appear in stale sched domains * tag 'cgroup-for-7.0-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: Don't expose dead tasks in cgroup cgroup/cpuset: Call rebuild_sched_domains() directly in hotplug
11 daysMerge tag 'sched_ext-for-7.0-rc3-fixes' of ↵Linus Torvalds2-27/+109
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: - Fix data races flagged by KCSAN: add missing READ_ONCE()/WRITE_ONCE() annotations for lock-free accesses to module parameters and dsq->seq - Fix silent truncation of upper 32 enqueue flags (SCX_ENQ_PREEMPT and above) when passed through the int sched_class interface - Documentation updates: scheduling class precedence, task ownership state machine, example scheduler descriptions, config list cleanup - Selftest fix for format specifier and buffer length in file_write_long() * tag 'sched_ext-for-7.0-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Use WRITE_ONCE() for the write side of scx_enable helper pointer sched_ext: Fix enqueue_task_scx() truncation of upper enqueue flags sched_ext: Documentation: Update sched-ext.rst sched_ext: Use READ_ONCE() for scx_slice_bypass_us in scx_bypass() sched_ext: Documentation: Mention scheduling class precedence sched_ext: Document task ownership state machine sched_ext: Use READ_ONCE() for lock-free reads of module param variables sched_ext/selftests: Fix format specifier and buffer length in file_write_long() sched_ext: Use WRITE_ONCE() for the write side of dsq->seq update
12 dayskprobes: Remove unneeded warnings from __arm_kprobe_ftrace()Masami Hiramatsu (Google)1-2/+2
Remove unneeded warnings for handled errors from __arm_kprobe_ftrace() because all caller handled the error correctly. Link: https://lore.kernel.org/all/177261531182.1312989.8737778408503961141.stgit@mhiramat.tok.corp.google.com/ Reported-by: Zw Tang <shicenci@gmail.com> Closes: https://lore.kernel.org/all/CAPHJ_V+J6YDb_wX2nhXU6kh466Dt_nyDSas-1i_Y8s7tqY-Mzw@mail.gmail.com/ Fixes: 9c89bb8e3272 ("kprobes: treewide: Cleanup the error messages for kprobes") Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
12 dayskprobes: avoid crash when rmmod/insmod after ftrace killedMasami Hiramatsu (Google)1-0/+4
After we hit ftrace is killed by some errors, the kernel crash if we remove modules in which kprobe probes. BUG: unable to handle page fault for address: fffffbfff805000d PGD 817fcc067 P4D 817fcc067 PUD 817fc8067 PMD 101555067 PTE 0 Oops: Oops: 0000 [#1] SMP KASAN PTI CPU: 4 UID: 0 PID: 2012 Comm: rmmod Tainted: G W OE Tainted: [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE RIP: 0010:kprobes_module_callback+0x89/0x790 RSP: 0018:ffff88812e157d30 EFLAGS: 00010a02 RAX: 1ffffffff805000d RBX: dffffc0000000000 RCX: ffffffff86a8de90 RDX: ffffed1025c2af9b RSI: 0000000000000008 RDI: ffffffffc0280068 RBP: 0000000000000000 R08: 0000000000000001 R09: ffffed1025c2af9a R10: ffff88812e157cd7 R11: 205d323130325420 R12: 0000000000000002 R13: ffffffffc0290488 R14: 0000000000000002 R15: ffffffffc0280040 FS: 00007fbc450dd740(0000) GS:ffff888420331000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: fffffbfff805000d CR3: 000000010f624000 CR4: 00000000000006f0 Call Trace: <TASK> notifier_call_chain+0xc6/0x280 blocking_notifier_call_chain+0x60/0x90 __do_sys_delete_module.constprop.0+0x32a/0x4e0 do_syscall_64+0x5d/0xfa0 entry_SYSCALL_64_after_hwframe+0x76/0x7e This is because the kprobe on ftrace does not correctly handles the kprobe_ftrace_disabled flag set by ftrace_kill(). To prevent this error, check kprobe_ftrace_disabled in __disarm_kprobe_ftrace() and skip all ftrace related operations. Link: https://lore.kernel.org/all/176473947565.1727781.13110060700668331950.stgit@mhiramat.tok.corp.google.com/ Reported-by: Ye Bin <yebin10@huawei.com> Closes: https://lore.kernel.org/all/20251125020536.2484381-1-yebin@huaweicloud.com/ Fixes: ae6aa16fdc16 ("kprobes: introduce ftrace based optimization") Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
14 dayssched/mmcid: Avoid full tasklist walksThomas Gleixner2-42/+13
Chasing vfork()'ed tasks on a CID ownership mode switch requires a full task list walk, which is obviously expensive on large systems. Avoid that by keeping a list of tasks using a mm MMCID entity in mm::mm_cid and walk this list instead. This removes the proven to be flaky counting logic and avoids a full task list walk in the case of vfork()'ed tasks. Fixes: fbd0e71dc370 ("sched/mmcid: Provide CID ownership mode fixup functions") Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260310202526.183824481@kernel.org
14 dayssched/mmcid: Remove pointless preempt guardThomas Gleixner1-5/+7
This is a leftover from the early versions of this function where it could be invoked without mm::mm_cid::lock held. Remove it and add lockdep asserts instead. Fixes: 653fda7ae73d ("sched/mmcid: Switch over to the new mechanism") Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260310202526.116363613@kernel.org
14 dayssched/mmcid: Handle vfork()/CLONE_VM correctlyThomas Gleixner1-4/+1
Matthieu and Jiri reported stalls where a task endlessly loops in mm_get_cid() when scheduling in. It turned out that the logic which handles vfork()'ed tasks is broken. It is invoked when the number of tasks associated to a process is smaller than the number of MMCID users. It then walks the task list to find the vfork()'ed task, but accounts all the already processed tasks as well. If that double processing brings the number of to be handled tasks to 0, the walk stops and the vfork()'ed task's CID is not fixed up. As a consequence a subsequent schedule in fails to acquire a (transitional) CID and the machine stalls. Cure this by removing the accounting condition and make the fixup always walk the full task list if it could not find the exact number of users in the process' thread list. Fixes: fbd0e71dc370 ("sched/mmcid: Provide CID ownership mode fixup functions") Closes: https://lore.kernel.org/b24ffcb3-09d5-4e48-9070-0b69bc654281@kernel.org Reported-by: Matthieu Baerts <matttbe@kernel.org> Reported-by: Jiri Slaby <jirislaby@kernel.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260310202526.048657665@kernel.org
14 dayssched/mmcid: Prevent CID stalls due to concurrent forksThomas Gleixner2-9/+15
A newly forked task is accounted as MMCID user before the task is visible in the process' thread list and the global task list. This creates the following problem: CPU1 CPU2 fork() sched_mm_cid_fork(tnew1) tnew1->mm.mm_cid_users++; tnew1->mm_cid.cid = getcid() -> preemption fork() sched_mm_cid_fork(tnew2) tnew2->mm.mm_cid_users++; // Reaches the per CPU threshold mm_cid_fixup_tasks_to_cpus() for_each_other(current, p) .... As tnew1 is not visible yet, this fails to fix up the already allocated CID of tnew1. As a consequence a subsequent schedule in might fail to acquire a (transitional) CID and the machine stalls. Move the invocation of sched_mm_cid_fork() after the new task becomes visible in the thread and the task list to prevent this. This also makes it symmetrical vs. exit() where the task is removed as CID user before the task is removed from the thread and task lists. Fixes: fbd0e71dc370 ("sched/mmcid: Provide CID ownership mode fixup functions") Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260310202525.969061974@kernel.org
14 daystime/jiffies: Mark jiffies_64_to_clock_t() notraceSteven Rostedt1-1/+1
The trace_clock_jiffies() function that handles the "uptime" clock for tracing calls jiffies_64_to_clock_t(). This causes the function tracer to constantly recurse when the tracing clock is set to "uptime". Mark it notrace to prevent unnecessary recursion when using the "uptime" clock. Fixes: 58d4e21e50ff3 ("tracing: Fix wraparound problems in "uptime" trace clock") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260306212403.72270bb2@robin
2026-03-10sched: idle: Make skipping governor callbacks more consistentRafael J. Wysocki1-1/+10
If the cpuidle governor .select() callback is skipped because there is only one idle state in the cpuidle driver, the .reflect() callback should be skipped as well, at least for consistency (if not for correctness), so do it. Fixes: e5c9ffc6ae1b ("cpuidle: Skip governor when only one idle state is available") Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://patch.msgid.link/12857700.O9o76ZdvQC@rafael.j.wysocki
2026-03-09sched_ext: Use WRITE_ONCE() for the write side of scx_enable helper pointerzhidao su1-4/+5
scx_enable() uses double-checked locking to lazily initialize a static kthread_worker pointer. The fast path reads helper locklessly: if (!READ_ONCE(helper)) { // lockless read -- no helper_mutex The write side initializes helper under helper_mutex, but previously used a plain assignment: helper = kthread_run_worker(0, "scx_enable_helper"); ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ plain write -- KCSAN data race with READ_ONCE() above Since READ_ONCE() on the fast path and the plain write on the initialization path access the same variable without a common lock, they constitute a data race. KCSAN requires that all sides of a lock-free access use READ_ONCE()/WRITE_ONCE() consistently. Use a temporary variable to stage the result of kthread_run_worker(), and only WRITE_ONCE() into helper after confirming the pointer is valid. This avoids a window where a concurrent caller on the fast path could observe an ERR pointer via READ_ONCE(helper) before the error check completes. Fixes: b06ccbabe250 ("sched_ext: Fix starvation of scx_enable() under fair-class saturation") Signed-off-by: zhidao su <suzhidao@xiaomi.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-07Merge tag 'timers-urgent-2026-03-08' of ↵Linus Torvalds1-2/+4
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Ingo Molnar: "Make clock_adjtime() syscall timex validation slightly more permissive for auxiliary clocks, to not reject syscalls based on the status field that do not try to modify the status field. This makes the ABI behavior in clock_adjtime() consistent with CLOCK_REALTIME" * tag 'timers-urgent-2026-03-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timekeeping: Fix timex status validation for auxiliary clocks
2026-03-07Merge tag 'sched-urgent-2026-03-08' of ↵Linus Torvalds1-0/+30
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Ingo Molnar: "Fix a DL scheduler bug that may corrupt internal metrics during PI and setscheduler() syscalls, resulting in kernel warnings and misbehavior. Found during stress-testing" * tag 'sched-urgent-2026-03-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/deadline: Fix missing ENQUEUE_REPLENISH during PI de-boosting
2026-03-07Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfLinus Torvalds3-6/+37
Pull bpf fixes from Alexei Starovoitov: - Fix u32/s32 bounds when ranges cross min/max boundary (Eduard Zingerman) - Fix precision backtracking with linked registers (Eduard Zingerman) - Fix linker flags detection for resolve_btfids (Ihor Solodrai) - Fix race in update_ftrace_direct_add/del (Jiri Olsa) - Fix UAF in bpf_trampoline_link_cgroup_shim (Lang Xu) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: resolve_btfids: Fix linker flags detection selftests/bpf: add reproducer for spurious precision propagation through calls bpf: collect only live registers in linked regs Revert "selftests/bpf: Update reg_bound range refinement logic" selftests/bpf: test refining u32/s32 bounds when ranges cross min/max boundary bpf: Fix u32/s32 bounds when ranges cross min/max boundary bpf: Fix a UAF issue in bpf_trampoline_link_cgroup_shim ftrace: Add missing ftrace_lock to update_ftrace_direct_add/del
2026-03-07Merge tag 'trace-v7.0-rc2-2' of ↵Linus Torvalds3-4/+11
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix possible NULL pointer dereference in trace_data_alloc() On the trace_data_alloc() error path, it can call trigger_data_free() with a NULL pointer. This used to be a kfree() but was changed to trigger_data_free() to clean up any partial initialization. The issue is that trigger_data_free() does not expect a NULL pointer. Have trigger_data_free() return safely on NULL pointer. - Fix multiple events on the command line and bootconfig If multiple events are enabled on the command line separately and not grouped, only the last event gets enabled. That is: trace_event=sched_switch trace_event=sched_waking will only enable sched_waking whereas: trace_event=sched_switch,sched_waking will enable both. The bootconfig makes it even worse as the second way is the more common method. The issue is that a temporary buffer is used to store the events to enable later in boot. Each time the cmdline callback is called, it overwrites what was previously there. Have the callback append the next value (delimited by a comma) if the temporary buffer already has content. - Fix command line trace_buffer_size if >= 2G The logic to allocate the trace buffer uses "int" for the size parameter in the command line code causing overflow issues if more that 2G is specified. * tag 'trace-v7.0-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Fix trace_buf_size= cmdline parameter with sizes >= 2G tracing: Fix enabling multiple events on the kernel command line and bootconfig tracing: Add NULL pointer check to trigger_data_free()
2026-03-07sched_ext: Fix enqueue_task_scx() truncation of upper enqueue flagsTejun Heo1-3/+2
enqueue_task_scx() takes int enq_flags from the sched_class interface. SCX enqueue flags starting at bit 32 (SCX_ENQ_PREEMPT and above) are silently truncated when passed through activate_task(). extra_enq_flags was added as a workaround - storing high bits in rq->scx.extra_enq_flags and OR-ing them back in enqueue_task_scx(). However, the OR target is still the int parameter, so the high bits are lost anyway. The current impact is limited as the only affected flag is SCX_ENQ_PREEMPT which is informational to the BPF scheduler - its loss means the scheduler doesn't know about preemption but doesn't cause incorrect behavior. Fix by renaming the int parameter to core_enq_flags and introducing a u64 enq_flags local that merges both sources. All downstream functions already take u64 enq_flags. Fixes: f0e1a0643a59 ("sched_ext: Implement BPF extensible scheduler class") Cc: stable@vger.kernel.org # v6.12+ Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-06bpf: collect only live registers in linked regsEduard Zingerman1-3/+10
Fix an inconsistency between func_states_equal() and collect_linked_regs(): - regsafe() uses check_ids() to verify that cached and current states have identical register id mapping. - func_states_equal() calls regsafe() only for registers computed as live by compute_live_registers(). - clean_live_states() is supposed to remove dead registers from cached states, but it can skip states belonging to an iterator-based loop. - collect_linked_regs() collects all registers sharing the same id, ignoring the marks computed by compute_live_registers(). Linked registers are stored in the state's jump history. - backtrack_insn() marks all linked registers for an instruction as precise whenever one of the linked registers is precise. The above might lead to a scenario: - There is an instruction I with register rY known to be dead at I. - Instruction I is reached via two paths: first A, then B. - On path A: - There is an id link between registers rX and rY. - Checkpoint C is created at I. - Linked register set {rX, rY} is saved to the jump history. - rX is marked as precise at I, causing both rX and rY to be marked precise at C. - On path B: - There is no id link between registers rX and rY, otherwise register states are sub-states of those in C. - Because rY is dead at I, check_ids() returns true. - Current state is considered equal to checkpoint C, propagate_precision() propagates spurious precision mark for register rY along the path B. - Depending on a program, this might hit verifier_bug() in the backtrack_insn(), e.g. if rY ∈ [r1..r5] and backtrack_insn() spots a function call. The reproducer program is in the next patch. This was hit by sched_ext scx_lavd scheduler code. Changes in tests: - verifier_scalar_ids.c selftests need modification to preserve some registers as live for __msg() checks. - exceptions_assert.c adjusted to match changes in the verifier log, R0 is dead after conditional instruction and thus does not get range. - precise.c adjusted to match changes in the verifier log, register r9 is dead after comparison and it's range is not important for test. Reported-by: Emil Tsalapatis <emil@etsalapatis.com> Fixes: 0fb3cf6110a5 ("bpf: use register liveness information for func_states_equal") Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260306-linked-regs-and-propagate-precision-v1-1-18e859be570d@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-06tracing: Fix trace_buf_size= cmdline parameter with sizes >= 2GCalvin Owens1-3/+3
Some of the sizing logic through tracer_alloc_buffers() uses int internally, causing unexpected behavior if the user passes a value that does not fit in an int (on my x86 machine, the result is uselessly tiny buffers). Fix by plumbing the parameter's real type (unsigned long) through to the ring buffer allocation functions, which already use unsigned long. It has always been possible to create larger ring buffers via the sysfs interface: this only affects the cmdline parameter. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/bff42a4288aada08bdf74da3f5b67a2c28b761f8.1772852067.git.calvin@wbinvd.org Fixes: 73c5162aa362 ("tracing: keep ring buffer to minimum size till used") Signed-off-by: Calvin Owens <calvin@wbinvd.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-06bpf: Fix u32/s32 bounds when ranges cross min/max boundaryEduard Zingerman1-0/+24
Same as in __reg64_deduce_bounds(), refine s32/u32 ranges in __reg32_deduce_bounds() in the following situations: - s32 range crosses U32_MAX/0 boundary, positive part of the s32 range overlaps with u32 range: 0 U32_MAX | [xxxxxxxxxxxxxx u32 range xxxxxxxxxxxxxx] | |----------------------------|----------------------------| |xxxxx s32 range xxxxxxxxx] [xxxxxxx| 0 S32_MAX S32_MIN -1 - s32 range crosses U32_MAX/0 boundary, negative part of the s32 range overlaps with u32 range: 0 U32_MAX | [xxxxxxxxxxxxxx u32 range xxxxxxxxxxxxxx] | |----------------------------|----------------------------| |xxxxxxxxx] [xxxxxxxxxxxx s32 range | 0 S32_MAX S32_MIN -1 - No refinement if ranges overlap in two intervals. This helps for e.g. consider the following program: call %[bpf_get_prandom_u32]; w0 &= 0xffffffff; if w0 < 0x3 goto 1f; // on fall-through u32 range [3..U32_MAX] if w0 s> 0x1 goto 1f; // on fall-through s32 range [S32_MIN..1] if w0 s< 0x0 goto 1f; // range can be narrowed to [S32_MIN..-1] r10 = 0; 1: ...; The reg_bounds.c selftest is updated to incorporate identical logic, refinement based on non-overflowing range halves: ((x ∩ [0, smax]) ∩ (y ∩ [0, smax])) ∪ ((x ∩ [smin,-1]) ∩ (y ∩ [smin,-1])) Reported-by: Andrea Righi <arighi@nvidia.com> Reported-by: Emil Tsalapatis <emil@etsalapatis.com> Closes: https://lore.kernel.org/bpf/aakqucg4vcujVwif@gpd4/T/ Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260306-bpf-32-bit-range-overflow-v3-1-f7f67e060a6b@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-06cgroup: Don't expose dead tasks in cgroupSebastian Andrzej Siewior1-0/+6
Once a task exits it has its state set to TASK_DEAD and then it is removed from the cgroup it belonged to. The last step happens on the task gets out of its last schedule() invocation and is delayed on PREEMPT_RT due to locking constraints. As a result it is possible to receive a pid via waitpid() of a task which is still listed in cgroup.procs for the cgroup it belonged to. This is something that systemd does not expect and as a result it waits for its exit until a time out occurs. This can also be reproduced on !PREEMPT_RT kernel with a significant delay in do_exit() after exit_notify(). Hide the task from the output which have PF_EXITING set which is done before the parent is notified. Keeping zombies with live threads shouldn't break anything (suggested by Tejun). Reported-by: Bert Karwatzki <spasswolf@web.de> Closes: https://lore.kernel.org/all/20260219164648.3014-1-spasswolf@web.de/ Tested-by: Bert Karwatzki <spasswolf@web.de> Fixes: 9311e6c29b34 ("cgroup: Fix sleeping from invalid context warning on PREEMPT_RT") Cc: stable@vger.kernel.org # v6.19+ Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-06tracing: Fix enabling multiple events on the kernel command line and bootconfigAndrei-Alexandru Tachici1-1/+5
Multiple events can be enabled on the kernel command line via a comma separator. But if the are specified one at a time, then only the last event is enabled. This is because the event names are saved in a temporary buffer, and each call by the init cmdline code will reset that buffer. This also affects names in the boot config file, as it may call the callback multiple times with an example of: kernel.trace_event = ":mod:rproc_qcom_common", ":mod:qrtr", ":mod:qcom_aoss" Change the cmdline callback function to append a comma and the next value if the temporary buffer already has content. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260302-trace-events-allow-multiple-modules-v1-1-ce4436e37fb8@oss.qualcomm.com Signed-off-by: Andrei-Alexandru Tachici <andrei-alexandru.tachici@oss.qualcomm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-06tracing: Add NULL pointer check to trigger_data_free()Guenter Roeck1-0/+3
If trigger_data_alloc() fails and returns NULL, event_hist_trigger_parse() jumps to the out_free error path. While kfree() safely handles a NULL pointer, trigger_data_free() does not. This causes a NULL pointer dereference in trigger_data_free() when evaluating data->cmd_ops->set_filter. Fix the problem by adding a NULL pointer check to trigger_data_free(). The problem was found by an experimental code review agent based on gemini-3.1-pro while reviewing backports into v6.18.y. Cc: Miaoqian Lin <linmq006@gmail.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://patch.msgid.link/20260305193339.2810953-1-linux@roeck-us.net Fixes: 0550069cc25f ("tracing: Properly process error handling in event_hist_trigger_parse()") Assisted-by: Gemini:gemini-3.1-pro Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-06cgroup/cpuset: Call rebuild_sched_domains() directly in hotplugWaiman Long1-28/+31
Besides deferring the call to housekeeping_update(), commit 6df415aa46ec ("cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue") also defers the rebuild_sched_domains() call to the workqueue. So a new offline CPU may still be in a sched domain or new online CPU not showing up in the sched domains for a short transition period. That could be a problem in some corner cases and can be the cause of a reported test failure[1]. Fix it by calling rebuild_sched_domains_cpuslocked() directly in hotplug as before. If isolated partition invalidation or recreation is being done, the housekeeping_update() call to update the housekeeping cpumasks will still be deferred to a workqueue. In commit 3bfe47967191 ("cgroup/cpuset: Move housekeeping_update()/rebuild_sched_domains() together"), housekeeping_update() is called before rebuild_sched_domains() because it needs to access the HK_TYPE_DOMAIN housekeeping cpumask. That is now changed to use the static HK_TYPE_DOMAIN_BOOT cpumask as HK_TYPE_DOMAIN cpumask is now changeable at run time. As a result, we can move the rebuild_sched_domains() call before housekeeping_update() with the slight advantage that it will be done in the same cpus_read_lock critical section without the possibility of interference by a concurrent cpu hot add/remove operation. As it doesn't make sense to acquire cpuset_mutex/cpuset_top_mutex after calling housekeeping_update() and immediately release them again, move the cpuset_full_unlock() operation inside update_hk_sched_domains() and rename it to cpuset_update_sd_hk_unlock() to signify that it will release the full set of locks. [1] https://lore.kernel.org/lkml/1a89aceb-48db-4edd-a730-b445e41221fe@nvidia.com Fixes: 6df415aa46ec ("cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue") Tested-by: Jon Hunter <jonathanh@nvidia.com> Reviewed-by: Chen Ridong <chenridong@huaweicloud.com> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-06sched_ext: Use READ_ONCE() for scx_slice_bypass_us in scx_bypass()David Carlier1-1/+1
Commit 0927780c90ce ("sched_ext: Use READ_ONCE() for lock-free reads of module param variables") annotated the plain reads of scx_slice_bypass_us and scx_bypass_lb_intv_us in bypass_lb_cpu(), but missed a third site in scx_bypass(): WRITE_ONCE(scx_slice_dfl, scx_slice_bypass_us * NSEC_PER_USEC); scx_slice_bypass_us is a module parameter writable via sysfs in process context through set_slice_us() -> param_set_uint_minmax(), which performs a plain store without holding bypass_lock. scx_bypass() reads the variable under bypass_lock, but since the writer does not take that lock, the two accesses are concurrent. WRITE_ONCE() only applies volatile semantics to the store of scx_slice_dfl -- the val expression containing scx_slice_bypass_us is evaluated as a plain read, providing no protection against concurrent writes. Wrap the read with READ_ONCE() to complete the annotation started by commit 0927780c90ce and make the access KCSAN-clean, consistent with the existing READ_ONCE(scx_slice_bypass_us) in bypass_lb_cpu(). Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-06workqueue: Rename show_cpu_pool{s,}_hog{s,}() to reflect broadened scopeBreno Leitao1-5/+5
show_cpu_pool_hog() and show_cpu_pools_hogs() no longer only dump CPU hogs — since commit 8823eaef45da ("workqueue: Show all busy workers in stall diagnostics"), they dump every in-flight worker in the pool's busy_hash. Rename them to show_cpu_pool_busy_workers() and show_cpu_pools_busy_workers() to accurately describe what they do. Also fix the pr_info() message to say "stalled worker pools" instead of "stalled CPU-bound worker pools", since sleeping/blocked workers are now included. No functional change. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-06Merge tag 'block-7.0-20260305' of ↵Linus Torvalds1-2/+1
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block fixes from Jens Axboe: - NVMe pull request via Keith: - Improve quirk visibility and configurability (Maurizio) - Fix runtime user modification to queue setup (Keith) - Fix multipath leak on try_module_get failure (Keith) - Ignore ambiguous spec definitions for better atomics support (John) - Fix admin queue leak on controller reset (Ming) - Fix large allocation in persistent reservation read keys (Sungwoo Kim) - Fix fcloop callback handling (Justin) - Securely free DHCHAP secrets (Daniel) - Various cleanups and typo fixes (John, Wilfred) - Avoid a circular lock dependency issue in the sysfs nr_requests or scheduler store handling - Fix a circular lock dependency with the pcpu mutex and the queue freeze lock - Cleanup for bio_copy_kern(), using __bio_add_page() rather than the bio_add_page(), as adding a page here cannot fail. The exiting code had broken cleanup for the error condition, so make it clear that the error condition cannot happen - Fix for a __this_cpu_read() in preemptible context splat * tag 'block-7.0-20260305' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: block: use trylock to avoid lockdep circular dependency in sysfs nvme: fix memory allocation in nvme_pr_read_keys() block: use __bio_add_page in bio_copy_kern block: break pcpu_alloc_mutex dependency on freeze_lock blktrace: fix __this_cpu_read/write in preemptible context nvme-multipath: fix leak on try_module_get failure nvmet-fcloop: Check remoteport port_state before calling done callback nvme-pci: do not try to add queue maps at runtime nvme-pci: cap queue creation to used queues nvme-pci: ensure we're polling a polled queue nvme: fix memory leak in quirks_param_set() nvme: correct comment about nvme_ns_remove() nvme: stop setting namespace gendisk device driver data nvme: add support for dynamic quirk configuration via module parameter nvme: fix admin queue leak on controller reset nvme-fabrics: use kfree_sensitive() for DHCHAP secrets nvme: stop using AWUPF nvme: expose active quirks in sysfs nvme/host: fixup some typos
2026-03-06bpf: drop kthread_exit from noreturn_denyChristian Loehle1-1/+0
kthread_exit became a macro to do_exit in commit 28aaa9c39945 ("kthread: consolidate kthread exit paths to prevent use-after-free"), so there is no kthread_exit function BTF ID to resolve. Remove it from noreturn_deny to avoid resolve_btfids unresolved symbol warnings. Signed-off-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-03-05workqueue: Show all busy workers in stall diagnosticsBreno Leitao1-15/+13
show_cpu_pool_hog() only prints workers whose task is currently running on the CPU (task_is_running()). This misses workers that are busy processing a work item but are sleeping or blocked — for example, a worker that clears PF_WQ_WORKER and enters wait_event_idle(). Such a worker still occupies a pool slot and prevents progress, yet produces an empty backtrace section in the watchdog output. This is happening on real arm64 systems, where toggle_allocation_gate() IPIs every single CPU in the machine (which lacks NMI), causing workqueue stalls that show empty backtraces because toggle_allocation_gate() is sleeping in wait_event_idle(). Remove the task_is_running() filter so every in-flight worker in the pool's busy_hash is dumped. The busy_hash is protected by pool->lock, which is already held. Signed-off-by: Breno Leitao <leitao@debian.org> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-05workqueue: Show in-flight work item duration in stall diagnosticsBreno Leitao2-0/+4
When diagnosing workqueue stalls, knowing how long each in-flight work item has been executing is valuable. Add a current_start timestamp (jiffies) to struct worker, set it when a work item begins execution in process_one_work(), and print the elapsed wall-clock time in show_pwq(). Unlike current_at (which tracks CPU runtime and resets on wakeup for CPU-intensive detection), current_start is never reset because the diagnostic cares about total wall-clock time including sleeps. Before: in-flight: 165:stall_work_fn [wq_stall] After: in-flight: 165:stall_work_fn [wq_stall] for 100s Signed-off-by: Breno Leitao <leitao@debian.org> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-05workqueue: Rename pool->watchdog_ts to pool->last_progress_tsBreno Leitao1-7/+7
The watchdog_ts name doesn't convey what the timestamp actually tracks. This field tracks the last time a workqueue got progress. Rename it to last_progress_ts to make it clear that it records when the pool last made forward progress (started processing new work items). No functional change. Signed-off-by: Breno Leitao <leitao@debian.org> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-05workqueue: Use POOL_BH instead of WQ_BH when checking pool flagsBreno Leitao1-1/+1
pr_cont_worker_id() checks pool->flags against WQ_BH, which is a workqueue-level flag (defined in workqueue.h). Pool flags use a separate namespace with POOL_* constants (defined in workqueue.c). The correct constant is POOL_BH. Both WQ_BH and POOL_BH are defined as (1 << 0) so this has no behavioral impact, but it is semantically wrong and inconsistent with every other pool-level BH check in the file. Fixes: 4cb1ef64609f ("workqueue: Implement BH workqueues to eventually replace tasklets") Signed-off-by: Breno Leitao <leitao@debian.org> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-05sched_ext: Document task ownership state machineAndrea Righi1-16/+98
The task ownership state machine in sched_ext is quite hard to follow from the code alone. The interaction of ownership states, memory ordering rules and cross-CPU "lock dancing" makes the overall model subtle. Extend the documentation next to scx_ops_state to provide a more structured and self-contained description of the state transitions and their synchronization rules. The new reference should make the code easier to reason about and maintain and can help future contributors understand the overall task-ownership workflow. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-05sched_ext: Use READ_ONCE() for lock-free reads of module param variableszhidao su1-2/+2
bypass_lb_cpu() reads scx_bypass_lb_intv_us and scx_slice_bypass_us without holding any lock, in timer callback context where module parameter writes via sysfs can happen concurrently: min_delta_us = scx_bypass_lb_intv_us / SCX_BYPASS_LB_MIN_DELTA_DIV; ^^^^^^^^^^^^^^^^^^^^ plain read -- KCSAN data race if (delta < DIV_ROUND_UP(min_delta_us, scx_slice_bypass_us)) ^^^^^^^^^^^^^^^^^ plain read -- KCSAN data race scx_bypass_lb_intv_us already uses READ_ONCE() in scx_bypass_lb_timerfn() and scx_bypass() for its other lock-free read sites, leaving bypass_lb_cpu() inconsistent. scx_slice_bypass_us has the same lock-free access pattern in the same function. Fix both plain reads by using READ_ONCE() to complete the concurrent access annotation and make the code KCSAN-clean. Signed-off-by: zhidao su <suzhidao@xiaomi.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-05Merge tag 'trace-v7.0-rc2' of ↵Linus Torvalds5-21/+88
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix thresh_return of function graph tracer The update to store data on the shadow stack removed the abuse of using the task recursion word as a way to keep track of what functions to ignore. The trace_graph_return() was updated to handle this, but when function_graph tracer is using a threshold (only trace functions that took longer than a specified time), it uses trace_graph_thresh_return() instead. This function was still incorrectly using the task struct recursion word causing the function graph tracer to permanently set all functions to "notrace" - Fix thresh_return nosleep accounting When the calltime was moved to the shadow stack storage instead of being on the fgraph descriptor, the calculations for the amount of sleep time was updated. The calculation was done in the trace_graph_thresh_return() function, which also called the trace_graph_return(), which did the calculation again, causing the time to be doubled. Remove the call to trace_graph_return() as what it needed to do wasn't that much, and just do the work in trace_graph_thresh_return(). - Fix syscall trace event activation on boot up The syscall trace events are pseudo events attached to the raw_syscall tracepoints. When the first syscall event is enabled, it enables the raw_syscall tracepoint and doesn't need to do anything when a second syscall event is also enabled. When events are enabled via the kernel command line, syscall events are partially enabled as the enabling is called before rcu_init. This is due to allow early events to be enabled immediately. Because kernel command line events do not distinguish between different types of events, the syscall events are enabled here but are not fully functioning. After rcu_init, they are disabled and re-enabled so that they can be fully enabled. The problem happened is that this "disable-enable" is done one at a time. If more than one syscall event is specified on the command line, by disabling them one at a time, the counter never gets to zero, and the raw_syscall is not disabled and enabled, keeping the syscall events in their non-fully functional state. Instead, disable all events and re-enabled them all, as that will ensure the raw_syscall event is also disabled and re-enabled. - Disable preemption in ftrace pid filtering The ftrace pid filtering attaches to the fork and exit tracepoints to add or remove pids that should be traced. They access variables protected by RCU (preemption disabled). Now that tracepoint callbacks are called with preemption enabled, this protection needs to be added explicitly, and not depend on the functions being called with preemption disabled. - Disable preemption in event pid filtering The event pid filtering needs the same preemption disabling guards as ftrace pid filtering. - Fix accounting of the memory mapped ring buffer on fork Memory mapping the ftrace ring buffer sets the vm_flags to DONTCOPY. But this does not prevent the application from calling madvise(MADVISE_DOFORK). This causes the mapping to be copied on fork. After the first tasks exits, the mapping is considered unmapped by everyone. But when he second task exits, the counter goes below zero and triggers a WARN_ON. Since nothing prevents two separate tasks from mmapping the ftrace ring buffer (although two mappings may mess each other up), there's no reason to stop the memory from being copied on fork. Update the vm_operations to have an ".open" handler to update the accounting and let the ring buffer know someone else has it mapped. - Add all ftrace headers in MAINTAINERS file The MAINTAINERS file only specifies include/linux/ftrace.h But misses ftrace_irq.h and ftrace_regs.h. Make the file use wildcards to get all *ftrace* files. * tag 'trace-v7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: ftrace: Add MAINTAINERS entries for all ftrace headers tracing: Fix WARN_ON in tracing_buffers_mmap_close tracing: Disable preemption in the tracepoint callbacks handling filtered pids ftrace: Disable preemption in the tracepoint callbacks handling filtered pids tracing: Fix syscall events activation by ensuring refcount hits zero fgraph: Fix thresh_return nosleeptime double-adjust fgraph: Fix thresh_return clear per-task notrace
2026-03-04Merge tag 'modules-7.0-rc3.fixes' of ↵Linus Torvalds2-16/+20
git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux Pull module fixes from Sami Tolvanen: - Fix a potential kernel panic in the module loader by adding a bounds check for the ELF section index. This prevents crashes if attempting to load a module that uses SHN_XINDEX or is corrupted. - Fix the Kconfig menu layout for module versioning, signing, and compression options so they correctly appear as submenus in menuconfig. - Remove a redundant lockdep_free_key_range() call in the load_module() error path. This is already handled by module_deallocate() calling free_mod_mem() since the module_memory rework. * tag 'modules-7.0-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux: module: Fix kernel panic when a symbol st_shndx is out of bounds module: Fix the modversions and signing submenus module: Remove duplicate freeing of lockdep classes
2026-03-04Merge tag 'vfs-7.0-rc3.fixes' of ↵Linus Torvalds4-61/+21
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: - kthread: consolidate kthread exit paths to prevent use-after-free - iomap: - don't mark folio uptodate if read IO has bytes pending - don't report direct-io retries to fserror - reject delalloc mappings during writeback - ns: tighten visibility checks - netfs: Fix unbuffered/DIO writes to dispatch subrequests in strict sequence * tag 'vfs-7.0-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: iomap: reject delalloc mappings during writeback iomap: don't mark folio uptodate if read IO has bytes pending selftests: fix mntns iteration selftests nstree: tighten permission checks for listing nsfs: tighten permission checks for handle opening nsfs: tighten permission checks for ns iteration ioctls netfs: Fix unbuffered/DIO writes to dispatch subrequests in strict sequence kthread: consolidate kthread exit paths to prevent use-after-free iomap: don't report direct-io retries to fserror
2026-03-04timekeeping: Fix timex status validation for auxiliary clocksMiroslav Lichvar1-2/+4
The timekeeping_validate_timex() function validates the timex status of an auxiliary system clock even when the status is not to be changed, which causes unexpected errors for applications that make read-only clock_adjtime() calls, or set some other timex fields, but without clearing the status field. Do the AUX-specific status validation only when the modes field contains ADJ_STATUS, i.e. the application is actually trying to change the status. This makes the AUX-specific clock_adjtime() behavior consistent with CLOCK_REALTIME. Fixes: 4eca49d0b621 ("timekeeping: Prepare do_adtimex() for auxiliary clocks") Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260225085231.276751-1-mlichvar@redhat.com
2026-03-04sched_ext: Use WRITE_ONCE() for the write side of dsq->seq updatezhidao su1-1/+1
bpf_iter_scx_dsq_new() reads dsq->seq via READ_ONCE() without holding any lock, making dsq->seq a lock-free concurrently accessed variable. However, dispatch_enqueue(), the sole writer of dsq->seq, uses a plain increment without the matching WRITE_ONCE() on the write side: dsq->seq++; ^^^^^^^^^^^ plain write -- KCSAN data race The KCSAN documentation requires that if one accessor uses READ_ONCE() or WRITE_ONCE() on a variable to annotate lock-free access, all other accesses must also use the appropriate accessor. A plain write leaves the pair incomplete and will trigger KCSAN warnings. Fix by using WRITE_ONCE() for the write side of the update: WRITE_ONCE(dsq->seq, dsq->seq + 1); This is consistent with bpf_iter_scx_dsq_new() and makes the concurrent access annotation complete and KCSAN-clean. Signed-off-by: zhidao su <suzhidao@xiaomi.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-04Merge tag 'sysctl-7.00-fixes-rc3' of ↵Linus Torvalds1-2/+0
git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl Pull sysctl fix from Joel Granados: - Fix error when reporting jiffies converted values back to user space Return the converted value instead of "Invalid argument" error * tag 'sysctl-7.00-fixes-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl: time/jiffies: Fix sysctl file error on configurations where USER_HZ < HZ
2026-03-04sched/deadline: Fix missing ENQUEUE_REPLENISH during PI de-boostingJuri Lelli1-0/+30
Running stress-ng --schedpolicy 0 on an RT kernel on a big machine might lead to the following WARNINGs (edited). sched: DL de-boosted task PID 22725: REPLENISH flag missing WARNING: CPU: 93 PID: 0 at kernel/sched/deadline.c:239 dequeue_task_dl+0x15c/0x1f8 ... (running_bw underflow) Call trace: dequeue_task_dl+0x15c/0x1f8 (P) dequeue_task+0x80/0x168 deactivate_task+0x24/0x50 push_dl_task+0x264/0x2e0 dl_task_timer+0x1b0/0x228 __hrtimer_run_queues+0x188/0x378 hrtimer_interrupt+0xfc/0x260 ... The problem is that when a SCHED_DEADLINE task (lock holder) is changed to a lower priority class via sched_setscheduler(), it may fail to properly inherit the parameters of potential DEADLINE donors if it didn't already inherit them in the past (shorter deadline than donor's at that time). This might lead to bandwidth accounting corruption, as enqueue_task_dl() won't recognize the lock holder as boosted. The scenario occurs when: 1. A DEADLINE task (donor) blocks on a PI mutex held by another DEADLINE task (holder), but the holder doesn't inherit parameters (e.g., it already has a shorter deadline) 2. sched_setscheduler() changes the holder from DEADLINE to a lower class while still holding the mutex 3. The holder should now inherit DEADLINE parameters from the donor and be enqueued with ENQUEUE_REPLENISH, but this doesn't happen Fix the issue by introducing __setscheduler_dl_pi(), which detects when a DEADLINE (proper or boosted) task gets setscheduled to a lower priority class. In case, the function makes the task inherit DEADLINE parameters of the donoer (pi_se) and sets ENQUEUE_REPLENISH flag to ensure proper bandwidth accounting during the next enqueue operation. Fixes: 2279f540ea7d ("sched/deadline: Fix priority inheritance with multiple scheduling classes") Reported-by: Bruno Goncalves <bgoncalv@redhat.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260302-upstream-fix-deadline-piboost-b4-v3-1-6ba32184a9e0@redhat.com
2026-03-04time/jiffies: Fix sysctl file error on configurations where USER_HZ < HZGerd Rausch1-2/+0
Commit 2dc164a48e6fd ("sysctl: Create converter functions with two new macros") incorrectly returns error to user space when jiffies sysctl converter is used. The old overflow check got replaced with an unconditional one: + if (USER_HZ < HZ) + return -EINVAL; which will always be true on configurations with "USER_HZ < HZ". Remove the check; it is no longer needed as clock_t_to_jiffies() returns ULONG_MAX for the overflow case and proc_int_u2k_conv_uop() checks for "> INT_MAX" after conversion Fixes: 2dc164a48e6fd ("sysctl: Create converter functions with two new macros") Reported-by: Colm Harrington <colm.harrington@oracle.com> Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Signed-off-by: Joel Granados <joel.granados@kernel.org>
2026-03-03tracing: Fix WARN_ON in tracing_buffers_mmap_closeQing Wang2-0/+34
When a process forks, the child process copies the parent's VMAs but the user_mapped reference count is not incremented. As a result, when both the parent and child processes exit, tracing_buffers_mmap_close() is called twice. On the second call, user_mapped is already 0, causing the function to return -ENODEV and triggering a WARN_ON. Normally, this isn't an issue as the memory is mapped with VM_DONTCOPY set. But this is only a hint, and the application can call madvise(MADVISE_DOFORK) which resets the VM_DONTCOPY flag. When the application does that, it can trigger this issue on fork. Fix it by incrementing the user_mapped reference count without re-mapping the pages in the VMA's open callback. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://patch.msgid.link/20260227025842.1085206-1-wangqing7171@gmail.com Fixes: cf9f0f7c4c5bb ("tracing: Allow user-space mapping of the ring-buffer") Reported-by: syzbot+3b5dd2030fe08afdf65d@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=3b5dd2030fe08afdf65d Tested-by: syzbot+3b5dd2030fe08afdf65d@syzkaller.appspotmail.com Signed-off-by: Qing Wang <wangqing7171@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-03tracing: Disable preemption in the tracepoint callbacks handling filtered pidsMasami Hiramatsu (Google)1-0/+2
Filtering PIDs for events triggered the following during selftests: [37] event tracing - restricts events based on pid notrace filtering [ 155.874095] [ 155.874869] ============================= [ 155.876037] WARNING: suspicious RCU usage [ 155.877287] 7.0.0-rc1-00004-g8cd473a19bc7 #7 Not tainted [ 155.879263] ----------------------------- [ 155.882839] kernel/trace/trace_events.c:1057 suspicious rcu_dereference_check() usage! [ 155.889281] [ 155.889281] other info that might help us debug this: [ 155.889281] [ 155.894519] [ 155.894519] rcu_scheduler_active = 2, debug_locks = 1 [ 155.898068] no locks held by ftracetest/4364. [ 155.900524] [ 155.900524] stack backtrace: [ 155.902645] CPU: 1 UID: 0 PID: 4364 Comm: ftracetest Not tainted 7.0.0-rc1-00004-g8cd473a19bc7 #7 PREEMPT(lazy) [ 155.902648] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-debian-1.17.0-1 04/01/2014 [ 155.902651] Call Trace: [ 155.902655] <TASK> [ 155.902659] dump_stack_lvl+0x67/0x90 [ 155.902665] lockdep_rcu_suspicious+0x154/0x1a0 [ 155.902672] event_filter_pid_sched_process_fork+0x9a/0xd0 [ 155.902678] kernel_clone+0x367/0x3a0 [ 155.902689] __x64_sys_clone+0x116/0x140 [ 155.902696] do_syscall_64+0x158/0x460 [ 155.902700] ? entry_SYSCALL_64_after_hwframe+0x77/0x7f [ 155.902702] ? trace_irq_disable+0x1d/0xc0 [ 155.902709] entry_SYSCALL_64_after_hwframe+0x77/0x7f [ 155.902711] RIP: 0033:0x4697c3 [ 155.902716] Code: 1f 84 00 00 00 00 00 64 48 8b 04 25 10 00 00 00 45 31 c0 31 d2 31 f6 bf 11 00 20 01 4c 8d 90 d0 02 00 00 b8 38 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 35 89 c2 85 c0 75 2c 64 48 8b 04 25 10 00 00 [ 155.902718] RSP: 002b:00007ffc41150428 EFLAGS: 00000246 ORIG_RAX: 0000000000000038 [ 155.902721] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00000000004697c3 [ 155.902722] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011 [ 155.902724] RBP: 0000000000000000 R08: 0000000000000000 R09: 000000003fccf990 [ 155.902725] R10: 000000003fccd690 R11: 0000000000000246 R12: 0000000000000001 [ 155.902726] R13: 000000003fce8103 R14: 0000000000000001 R15: 0000000000000000 [ 155.902733] </TASK> [ 155.902747] The tracepoint callbacks recently were changed to allow preemption. The event PID filtering callbacks that were attached to the fork and exit tracepoints expected preemption disabled in order to access the RCU protected PID lists. Add a guard(preempt)() to protect the references to the PID list. Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260303215738.6ab275af@fedora Fixes: a46023d5616e ("tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast") Link: https://patch.msgid.link/20260303131706.96057f61a48a34c43ce1e396@kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-03ftrace: Disable preemption in the tracepoint callbacks handling filtered pidsSteven Rostedt1-0/+2
When function trace PID filtering is enabled, the function tracer will attach a callback to the fork tracepoint as well as the exit tracepoint that will add the forked child PID to the PID filtering list as well as remove the PID that is exiting. Commit a46023d5616e ("tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast") removed the disabling of preemption when calling tracepoint callbacks. The callbacks used for the PID filtering accounting depended on preemption being disabled, and now the trigger a "suspicious RCU usage" warning message. Make them explicitly disable preemption. Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260302213546.156e3e4f@gandalf.local.home Fixes: a46023d5616e ("tracing: Guard __DECLARE_TRACE() use of __DO_TRACE_CALL() with SRCU-fast") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2026-03-03tracing: Fix syscall events activation by ensuring refcount hits zeroHuiwen He1-15/+37
When multiple syscall events are specified in the kernel command line (e.g., trace_event=syscalls:sys_enter_openat,syscalls:sys_enter_close), they are often not captured after boot, even though they appear enabled in the tracing/set_event file. The issue stems from how syscall events are initialized. Syscall tracepoints require the global reference count (sys_tracepoint_refcount) to transition from 0 to 1 to trigger the registration of the syscall work (TIF_SYSCALL_TRACEPOINT) for tasks, including the init process (pid 1). The current implementation of early_enable_events() with disable_first=true used an interleaved sequence of "Disable A -> Enable A -> Disable B -> Enable B". If multiple syscalls are enabled, the refcount never drops to zero, preventing the 0->1 transition that triggers actual registration. Fix this by splitting early_enable_events() into two distinct phases: 1. Disable all events specified in the buffer. 2. Enable all events specified in the buffer. This ensures the refcount hits zero before re-enabling, allowing syscall events to be properly activated during early boot. The code is also refactored to use a helper function to avoid logic duplication between the disable and enable phases. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260224023544.1250787-1-hehuiwen@kylinos.cn Fixes: ce1039bd3a89 ("tracing: Fix enabling of syscall events on the command line") Signed-off-by: Huiwen He <hehuiwen@kylinos.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-03fgraph: Fix thresh_return nosleeptime double-adjustShengming Hu1-4/+10
trace_graph_thresh_return() called handle_nosleeptime() and then delegated to trace_graph_return(), which calls handle_nosleeptime() again. When sleep-time accounting is disabled this double-adjusts calltime and can produce bogus durations (including underflow). Fix this by computing rettime once, applying handle_nosleeptime() only once, using the adjusted calltime for threshold comparison, and writing the return event directly via __trace_graph_return() when the threshold is met. Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260221113314048jE4VRwIyZEALiYByGK0My@zte.com.cn Fixes: 3c9880f3ab52b ("ftrace: Use a running sleeptime instead of saving on shadow stack") Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Shengming Hu <hu.shengming@zte.com.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>