summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
3 daysbpf: Fix use-after-free in arena_vm_close on forkAlexei Starovoitov1-3/+16
commit 4fddde2a732de60bb97e3307d4eb69ac5f1d2b74 upstream. arena_vm_open() only bumps vml->mmap_count but never registers the child VMA in arena->vma_list. The vml->vma always points at the parent VMA, so after parent munmap the pointer dangles. If the child then calls bpf_arena_free_pages(), zap_pages() reads the stale vml->vma triggering use-after-free. Fix this by preventing the arena VMA from being inherited across fork with VM_DONTCOPY, and preventing VMA splits via the may_split callback. Also reject mremap with a .mremap callback returning -EINVAL. A same-size mremap(MREMAP_FIXED) on the full arena VMA reaches copy_vma() through the following path: check_prep_vma() - returns 0 early: new_len == old_len skips VM_DONTEXPAND check prep_move_vma() - vm_start == old_addr and vm_end == old_addr + old_len so may_split is never called move_vma() copy_vma_and_data() copy_vma() vm_area_dup() - copies vm_private_data (vml pointer) vm_ops->open() - bumps vml->mmap_count vm_ops->mremap() - returns -EINVAL, rollback unmaps new VMA The refcount ensures the rollback's arena_vm_close does not free the vml shared with the original VMA. Reported-by: Weiming Shi <bestswngs@gmail.com> Reported-by: Xiang Mei <xmei5@asu.edu> Fixes: 317460317a02 ("bpf: Introduce bpf_arena.") Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Link: https://lore.kernel.org/r/20260413194245.21449-1-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
3 dayssched_ext: Use dsq->first_task instead of list_empty() in dispatch_enqueue() ↵Tejun Heo1-4/+6
FIFO-tail commit 2f2ea77092660b53bfcbc4acc590b57ce9ab5dce upstream. dispatch_enqueue()'s FIFO-tail path used list_empty(&dsq->list) to decide whether to set dsq->first_task on enqueue. dsq->list can contain parked BPF iterator cursors (SCX_DSQ_LNODE_ITER_CURSOR), so list_empty() is not a reliable "no real task" check. If the last real task is unlinked while a cursor is parked, first_task becomes NULL; the next FIFO-tail enqueue then sees list_empty() == false and skips the first_task update, leaving scx_bpf_dsq_peek() returning NULL for a non-empty DSQ. Test dsq->first_task directly, which already tracks only real tasks and is maintained under dsq->lock. Fixes: 44f5c8ec5b9a ("sched_ext: Add lockless peek operation for DSQs") Cc: stable@vger.kernel.org # v6.19+ Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Cc: Ryan Newton <newton@meta.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
3 dayssched_ext: idle: Recheck prev_cpu after narrowing allowed maskDavid Carlier1-6/+6
commit b34c82777a2c0648ee053595f4b290fd5249b093 upstream. scx_select_cpu_dfl() narrows @allowed to @cpus_allowed & @p->cpus_ptr when the BPF caller supplies a @cpus_allowed that differs from @p->cpus_ptr and @p doesn't have full affinity. However, @is_prev_allowed was computed against the original (wider) @cpus_allowed, so the prev_cpu fast paths could pick a @prev_cpu that is in @cpus_allowed but not in @p->cpus_ptr, violating the intended invariant that the returned CPU is always usable by @p. The kernel masks this via the SCX_EV_SELECT_CPU_FALLBACK fallback, but the behavior contradicts the documented contract. Move the @is_prev_allowed evaluation past the narrowing block so it tests against the final @allowed mask. Fixes: ee9a4e92799d ("sched_ext: idle: Properly handle invalid prev_cpu during idle selection") Cc: stable@vger.kernel.org # v6.16+ Assisted-by: Claude <noreply@anthropic.com> Signed-off-by: David Carlier <devnexen@gmail.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
3 daystracing/fprobe: Check the same type fprobe on table as the unregistered oneMasami Hiramatsu (Google)1-17/+65
commit 0ac0058a74ac5765c7ce09ea630f4fdeaf4d80fa upstream. Commit 2c67dc457bc6 ("tracing: fprobe: optimization for entry only case") introduced a different ftrace_ops for entry-only fprobes. However, when unregistering an fprobe, the kernel only checks if another fprobe exists at the same address, without checking which type of fprobe it is. If different fprobes are registered at the same address, the same address will be registered in both fgraph_ops and ftrace_ops, but only one of them will be deleted when unregistering. (the one removed first will not be deleted from the ops). This results in junk entries remaining in either fgraph_ops or ftrace_ops. For example: ======= cd /sys/kernel/tracing # 'Add entry and exit events on the same place' echo 'f:event1 vfs_read' >> dynamic_events echo 'f:event2 vfs_read%return' >> dynamic_events # 'Enable both of them' echo 1 > events/fprobes/enable cat enabled_functions vfs_read (2) ->arch_ftrace_ops_list_func+0x0/0x210 # 'Disable and remove exit event' echo 0 > events/fprobes/event2/enable echo -:event2 >> dynamic_events # 'Disable and remove all events' echo 0 > events/fprobes/enable echo > dynamic_events # 'Add another event' echo 'f:event3 vfs_open%return' > dynamic_events cat dynamic_events f:fprobes/event3 vfs_open%return echo 1 > events/fprobes/enable cat enabled_functions vfs_open (1) tramp: 0xffffffffa0001000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60 subops: {ent:fprobe_fgraph_entry+0x0/0x620 ret:fprobe_return+0x0/0x150} vfs_read (1) tramp: 0xffffffffa0001000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60 subops: {ent:fprobe_fgraph_entry+0x0/0x620 ret:fprobe_return+0x0/0x150} ======= As you can see, an entry for the vfs_read remains. To fix this issue, when unregistering, the kernel should also check if there is the same type of fprobes still exist at the same address, and if not, delete its entry from either fgraph_ops or ftrace_ops. Link: https://lore.kernel.org/all/177669367993.132053.10553046138528674802.stgit@mhiramat.tok.corp.google.com/ Fixes: 2c67dc457bc6 ("tracing: fprobe: optimization for entry only case") Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
3 daystracing/probes: Limit size of event probe to 3KSteven Rostedt2-1/+9
commit b2aa3b4d64e460ac606f386c24e7d8a873ce6f1a upstream. There currently isn't a max limit an event probe can be. One could make an event greater than PAGE_SIZE, which makes the event useless because if it's bigger than the max event that can be recorded into the ring buffer, then it will never be recorded. A event probe should never need to be greater than 3K, so make that the max size. As long as the max is less than the max that can be recorded onto the ring buffer, it should be fine. Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Fixes: 93ccae7a22274 ("tracing/kprobes: Support basic types on dynamic events") Link: https://patch.msgid.link/20260428122302.706610ba@gandalf.local.home Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
3 daystracing/fprobe: Unregister fprobe even if memory allocation failsMasami Hiramatsu (Google)1-10/+15
commit 1aec9e5c3e31ce1e28f914427fb7f90b91d310df upstream. unregister_fprobe() can fail under memory pressure because of memory allocation failure, but this maybe called from module unloading, and usually there is no way to retry it. Moreover. trace_fprobe does not check the return value. To fix this problem, unregister fprobe and fprobe_hash_node even if working memory allocation fails. Anyway, if the last fprobe is removed, the filter will be freed. Link: https://lore.kernel.org/all/177669365629.132053.8433032896213721288.stgit@mhiramat.tok.corp.google.com/ Fixes: 4346ba160409 ("fprobe: Rewrite fprobe on function-graph tracer") Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
3 daystracing/fprobe: Remove fprobe from hash in failure pathMasami Hiramatsu (Google)1-43/+47
commit 845947aca6814f5723ed65e556eb5ee09493f05b upstream. When register_fprobe_ips() fails, it tries to remove a list of fprobe_hash_node from fprobe_ip_table, but it missed to remove fprobe itself from fprobe_table. Moreover, when removing the fprobe_hash_node which is added to rhltable once, it must use kfree_rcu() after removing from rhltable. To fix these issues, this reuses unregister_fprobe() internal code to rollback the half-way registered fprobe. Link: https://lore.kernel.org/all/177669366417.132053.17874946321744910456.stgit@mhiramat.tok.corp.google.com/ Fixes: 4346ba160409 ("fprobe: Rewrite fprobe on function-graph tracer") Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
3 daystracing/fprobe: Avoid kcalloc() in rcu_read_lock sectionMasami Hiramatsu (Google)1-47/+45
commit aa72812b49104bb5a38272fc9541feb62ca6fd32 upstream. fprobe_remove_node_in_module() is called under RCU read locked, but this invokes kcalloc() if there are more than 8 fprobes installed on the module. Sashiko warns it because kcalloc() can sleep [1]. [1] https://sashiko.dev/#/patchset/177552432201.853249.5125045538812833325.stgit%40mhiramat.tok.corp.google.com To fix this issue, expand the batch size to 128 and do not expand the fprobe_addr_list, but just cancel walking on fprobe_ip_table, update fgraph/ftrace_ops and retry the loop again. Link: https://lore.kernel.org/all/177669367206.132053.1493637946869032744.stgit@mhiramat.tok.corp.google.com/ Fixes: 0de4c70d04a4 ("tracing: fprobe: use rhltable for fprobe_ip_table") Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
3 daysrseq: Don't advertise time slice extensions if disabledThomas Gleixner1-4/+5
commit 010b7723c0a3b9ad58f50b715dbe2e7781d29400 upstream. If time slice extensions have been disabled on the kernel command line, then advertising them in RSEQ flags is wrong. Adjust the conditionals to reflect reality, fixup the misleading comments about the gap of these flags and the rseq::flags field. Fixes: d6200245c75e ("rseq: Allow registering RSEQ with slice extension") Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Tested-by: Dmitry Vyukov <dvyukov@google.com> Link: https://patch.msgid.link/20260428224427.437059375%40kernel.org Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
3 daysrseq: Set rseq::cpu_id_start to 0 on unregistrationThomas Gleixner1-11/+9
commit 2cb68e45120dfc66404c7547d95b8ac6ff0b25ce upstream. The RSEQ rework changed that to RSEQ_CPU_UNINITILIZED, which is obviously incompatible. Revert back to the original behavior. Fixes: 0f085b41880e ("rseq: Provide and use rseq_set_ids()") Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Tested-by: Dmitry Vyukov <dvyukov@google.com> Link: https://patch.msgid.link/20260428224427.271566313%40kernel.org Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
3 dayssched_ext: Read scx_root under scx_cgroup_ops_rwsem in cgroup settersTejun Heo1-3/+6
commit 80afd4c84bc8f5e80145ce35279f5ce53f6043db upstream. scx_group_set_{weight,idle,bandwidth}() cache scx_root before acquiring scx_cgroup_ops_rwsem, so the pointer can be stale by the time the op runs. If the loaded scheduler is disabled and freed (via RCU work) and another is enabled between the naked load and the rwsem acquire, the reader sees scx_cgroup_enabled=true (the new scheduler's) but dereferences the freed one - UAF on SCX_HAS_OP(sch, ...) / SCX_CALL_OP(sch, ...). scx_cgroup_enabled is toggled only under scx_cgroup_ops_rwsem write (scx_cgroup_{init,exit}), so reading scx_root inside the rwsem read section correlates @sch with the enabled snapshot. Fixes: a5bd6ba30b33 ("sched_ext: Use cgroup_lock/unlock() to synchronize against cgroup operations") Cc: stable@vger.kernel.org # v6.18+ Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
3 daysexit: prevent preemption of oopsing TASK_DEAD taskJann Horn1-0/+1
commit c1fa0bb633e4a6b11e83ffc57fa5abe8ebb87891 upstream. When an already-exiting task oopses, make_task_dead() currently calls do_task_dead() with preemption enabled. That is forbidden: do_task_dead() calls __schedule(), which has a comment saying "WARNING: must be called with preemption disabled!". If an oopsing task is preempted in do_task_dead(), between becoming TASK_DEAD and entering the scheduler explicitly, bad things happen: finish_task_switch() assumes that once the scheduler has switched away from a TASK_DEAD task, the task can never run again and its stack is no longer needed; but that assumption apparently doesn't hold if the dead task was preempted (the SM_PREEMPT case). This means that the scheduler ends up repeatedly dropping references on the dead task's stack, which can lead to use-after-free or double-free of the entire task stack; in other words, two tasks can end up running on the same stack, resulting in various kinds of memory corruption. (This does not just affect "recursively oopsing" tasks; it is enough to oops once during task exit, for example in a file_operations::release handler) Fixes: 7f80a2fd7db9 ("exit: Stop poorly open coding do_task_dead in make_task_dead") Cc: stable@kernel.org Signed-off-by: Jann Horn <jannh@google.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
3 daystracepoint: balance regfunc() on func_add() failure in tracepoint_add_func()David Carlier1-0/+2
commit fad217e16fded7f3c09f8637b0f6a224d58b5f2e upstream. When a tracepoint goes through the 0 -> 1 transition, tracepoint_add_func() invokes the subsystem's ext->regfunc() before attempting to install the new probe via func_add(). If func_add() then fails (for example, when allocate_probes() cannot allocate a new probe array under memory pressure and returns -ENOMEM), the function returns the error without calling the matching ext->unregfunc(), leaving the side effects of regfunc() behind with no installed probe to justify them. For syscall tracepoints this is particularly unpleasant: syscall_regfunc() bumps sys_tracepoint_refcount and sets SYSCALL_TRACEPOINT on every task. After a leaked failure, the refcount is stuck at a non-zero value with no consumer, and every task continues paying the syscall trace entry/exit overhead until reboot. Other subsystems providing regfunc()/unregfunc() pairs exhibit similarly scoped persistent state. Mirror the existing 1 -> 0 cleanup and call ext->unregfunc() in the func_add() error path, gated on the same condition used there so the unwind is symmetric with the registration. Fixes: 8cf868affdc4 ("tracing: Have the reg function allow to fail") Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260413190601.21993-1-devnexen@gmail.com Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 daysrtmutex: Use waiter::task instead of current in remove_waiter()Keenan Dong1-5/+8
commit 3bfdc63936dd4773109b7b8c280c0f3b5ae7d349 upstream. remove_waiter() is used by the slowlock paths, but it is also used for proxy-lock rollback in rt_mutex_start_proxy_lock() when invoked from futex_requeue(). In the latter case waiter::task is not current, but remove_waiter() operates on current for the dequeue operation. That results in several problems: 1) the rbtree dequeue happens without waiter::task::pi_lock being held 2) the waiter task's pi_blocked_on state is not cleared, which leaves a dangling pointer primed for UAF around. 3) rt_mutex_adjust_prio_chain() operates on the wrong top priority waiter task Use waiter::task instead of current in all related operations in remove_waiter() to cure those problems. [ tglx: Fixup rt_mutex_adjust_prio_chain(), add a comment and amend the changelog ] Fixes: 8161239a8bcc ("rtmutex: Simplify PI algorithm and make highest prio task get lock") Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Signed-off-by: Keenan Dong <keenanat2000@gmail.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 daystaskstats: set version in TGID exit notificationsYiyang Chen1-0/+1
commit 16c4f0211aaa1ec1422b11b59f64f1abe9009fc0 upstream. delay accounting started populating taskstats records with a valid version field via fill_pid() and fill_tgid(). Later, commit ad4ecbcba728 ("[PATCH] delay accounting taskstats interface send tgid once") changed the TGID exit path to send the cached signal->stats aggregate directly instead of building the outgoing record through fill_tgid(). Unlike fill_tgid(), fill_tgid_exit() only accumulates accounting data and never initializes stats->version. As a result, TGID exit notifications can reach userspace with version == 0 even though PID exit notifications and TASKSTATS_CMD_GET replies carry a valid taskstats version. This is easy to reproduce with `tools/accounting/getdelays.c`. I have a small follow-up patch for that tool which: 1. increases the receive buffer/message size so the pid+tgid combined exit notification is not dropped/truncated 2. prints `stats->version`. With that patch, the reproducer is: Terminal 1: ./getdelays -d -v -l -m 0 Terminal 2: taskset -c 0 python3 -c 'import threading,time; t=threading.Thread(target=time.sleep,args=(0.1,)); t.start(); t.join()' That produces both PID and TGID exit notifications for the same process. The PID exit record reports a valid taskstats version, while the TGID exit record reports `version 0`. This patch (of 2): Set stats->version = TASKSTATS_VERSION after copying the cached TGID aggregate into the outgoing netlink payload so all taskstats records are self-describing again. Link: https://lkml.kernel.org/r/ba83d934e59edd431b693607de573eb9ca059309.1774810498.git.cyyzero16@gmail.com Fixes: ad4ecbcba728 ("[PATCH] delay accounting taskstats interface send tgid once") Signed-off-by: Yiyang Chen <cyyzero16@gmail.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Dr. Thomas Orgis <thomas.orgis@uni-hamburg.de> Cc: Fan Yu <fan.yu9@zte.com.cn> Cc: Wang Yaxin <wang.yaxin@zte.com.cn> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 daysring-buffer: Do not double count the reader_pageMasami Hiramatsu (Google)1-6/+7
commit 92d5a606721f759ebebf448b3bd2b7a781d50bd0 upstream. Since the cpu_buffer->reader_page is updated if there are unwound pages. After that update, we should skip the page if it is the original reader_page, because the original reader_page is already checked. Cc: stable@vger.kernel.org Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Ian Rogers <irogers@google.com> Link: https://patch.msgid.link/177701353063.2223789.1471163147644103306.stgit@mhiramat.tok.corp.google.com Fixes: ca296d32ece3 ("tracing: ring_buffer: Rewind persistent ring buffer on reboot") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 daysrandomize_kstack: Maintain kstack_offset per taskRyan Roberts1-0/+2
commit 37beb42560165869838e7d91724f3e629db64129 upstream. kstack_offset was previously maintained per-cpu, but this caused a couple of issues. So let's instead make it per-task. Issue 1: add_random_kstack_offset() and choose_random_kstack_offset() expected and required to be called with interrupts and preemption disabled so that it could manipulate per-cpu state. But arm64, loongarch and risc-v are calling them with interrupts and preemption enabled. I don't _think_ this causes any functional issues, but it's certainly unexpected and could lead to manipulating the wrong cpu's state, which could cause a minor performance degradation due to bouncing the cache lines. By maintaining the state per-task those functions can safely be called in preemptible context. Issue 2: add_random_kstack_offset() is called before executing the syscall and expands the stack using a previously chosen random offset. choose_random_kstack_offset() is called after executing the syscall and chooses and stores a new random offset for the next syscall. With per-cpu storage for this offset, an attacker could force cpu migration during the execution of the syscall and prevent the offset from being updated for the original cpu such that it is predictable for the next syscall on that cpu. By maintaining the state per-task, this problem goes away because the per-task random offset is updated after the syscall regardless of which cpu it is executing on. Fixes: 39218ff4c625 ("stack: Optionally randomize kernel stack offset each syscall") Closes: https://lore.kernel.org/all/dd8c37bc-795f-4c7a-9086-69e584d8ab24@arm.com/ Cc: stable@vger.kernel.org Acked-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Link: https://patch.msgid.link/20260303150840.3789438-2-ryan.roberts@arm.com Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 daystracing/fprobe: Reject registration of a registered fprobe before initMasami Hiramatsu (Google)1-11/+10
commit 6ad51ada17ed80c9a5f205b4c01c424cac8b0d46 upstream. Reject registration of a registered fprobe which is on the fprobe hash table before initializing fprobe. The add_fprobe_hash() checks this re-register fprobe, but since fprobe_init() clears hlist_array field, it is too late to check it. It has to check the re-registration before touncing fprobe. Link: https://lore.kernel.org/all/177669364845.132053.18375367916162315835.stgit@mhiramat.tok.corp.google.com/ Fixes: 4346ba160409 ("fprobe: Rewrite fprobe on function-graph tracer") Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 dayssched: Use u64 for bandwidth ratio calculationsJoseph Salisbury3-3/+3
commit c6e80201e057dfb7253385e60bf541121bf5dc33 upstream. to_ratio() computes BW_SHIFT-scaled bandwidth ratios from u64 period and runtime values, but it returns unsigned long. tg_rt_schedulable() also stores the current group limit and the accumulated child sum in unsigned long. On 32-bit builds, large bandwidth ratios can be truncated and the RT group sum can wrap when enough siblings are present. That can let an overcommitted RT hierarchy pass the schedulability check, and it also narrows the helper result for other callers. Return u64 from to_ratio() and use u64 for the RT group totals so bandwidth ratios are preserved and compared at full width on both 32-bit and 64-bit builds. Fixes: b40b2e8eb521 ("sched: rt: multi level group constraints") Assisted-by: Codex:GPT-5 Signed-off-by: Joseph Salisbury <joseph.salisbury@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260403210014.2713404-1-joseph.salisbury@oracle.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-22clockevents: Add missing resets of the next_event_forced flagThomas Gleixner2-1/+7
commit 4096fd0e8eaea13ebe5206700b33f49635ae18e5 upstream. The prevention mechanism against timer interrupt starvation missed to reset the next_event_forced flag in a couple of places: - When the clock event state changes. That can cause the flag to be stale over a shutdown/startup sequence - When a non-forced event is armed, which then prevents rearming before that event. If that event is far out in the future this will cause missed timer interrupts. - In the suspend wakeup handler. That led to stalls which have been reported by several people. Add the missing resets, which fixes the problems for the reporters. Fixes: d6e152d905bd ("clockevents: Prevent timer interrupt starvation") Reported-by: Hanabishi <i.r.e.c.c.a.k.u.n+kernel.org@gmail.com> Reported-by: Eric Naim <dnaim@cachyos.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Hanabishi <i.r.e.c.c.a.k.u.n+kernel.org@gmail.com> Tested-by: Eric Naim <dnaim@cachyos.org> Cc: stable@vger.kernel.org Closes: https://lore.kernel.org/68d1e9ac-2780-4be3-8ee3-0788062dd3a4@gmail.com Link: https://patch.msgid.link/87340xfeje.ffs@tglx Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-12Merge tag 'wq-for-7.0-rc7-fixes' of ↵Linus Torvalds1-1/+13
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fix from Tejun Heo: "This is a fix for a stall which triggers on ordered workqueues when there are multiple inactive work items during workqueue property changes through sysfs, which doesn't happen that frequently. While really late, the fix is very low risk as it just repeats an operation which is already being performed: - Fix incomplete activation of multiple inactive works when unplugging a pool_workqueue, where the pending_pwqs list wasn't being updated for subsequent works" * tag 'wq-for-7.0-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: Add pool_workqueue to pending_pwqs list when unplugging multiple inactive works
2026-04-12Merge tag 'timers-urgent-2026-04-12' of ↵Linus Torvalds5-10/+30
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Thomas Gleixner: "Two fixes for the time/timers subsystem: - Invert the inverted fastpath decision in check_tick_dependency(), which prevents NOHZ full to stop the tick. That's a regression introduced in the 7.0 merge window. - Prevent a unpriviledged DoS in the clockevents code, where user space can starve the timer interrupt by arming a timerfd or posix interval timer in a tight loop with an absolute expiry time in the past. The fix turned out to be incomplete and was was amended yesterday to make it work on some 20 years old AMD machines as well. All issues with it have been confirmed to be resolved by various reporters" * tag 'timers-urgent-2026-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: clockevents: Prevent timer interrupt starvation tick/nohz: Fix inverted return value in check_tick_dependency() fast path
2026-04-12Merge tag 'sched-urgent-2026-04-12' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Ingo Molnar: "Fix DL server related slowdown to deferred fair tasks" * tag 'sched-urgent-2026-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/deadline: Use revised wakeup rule for dl_server
2026-04-11Merge tag 'probes-fixes-v7.0-rc7' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing probe fix from Masami Hiramatsu: "Reject non-closed empty immediate strings Fix a buffer index underflow bug that occurred when passing an non-closed empty immediate string to the probe event" * tag 'probes-fixes-v7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing/probe: reject non-closed empty immediate strings
2026-04-10Merge tag 'riscv-for-linus-v7.0-rc8' of ↵Linus Torvalds1-13/+17
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull RISC-V updates from Paul Walmsley: "Before v7.0 is released, fix a few issues with the CFI patchset, merged earlier in v7.0-rc, that primarily affect interfaces to non-kernel code: - Improve the prctl() interface for per-task indirect branch landing pad control to expand abbreviations and to resemble the speculation control prctl() interface - Expand the "LP" and "SS" abbreviations in the ptrace uapi header file to "branch landing pad" and "shadow stack", to improve readability - Fix a typo in a CFI-related macro name in the ptrace uapi header file - Ensure that the indirect branch tracking state and shadow stack state are unlocked immediately after an exec() on the new task so that libc subsequently can control it - While working in this area, clean up the kernel-internal, cross-architecture prctl() function names by expanding the abbreviations mentioned above" * tag 'riscv-for-linus-v7.0-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: prctl: cfi: change the branch landing pad prctl()s to be more descriptive riscv: ptrace: cfi: expand "SS" references to "shadow stack" in uapi headers prctl: rename branch landing pad implementation functions to be more explicit riscv: ptrace: expand "LP" references to "branch landing pads" in uapi headers riscv: cfi: clear CFI lock status in start_thread() riscv: ptrace: cfi: fix "PRACE" typo in uapi header
2026-04-10clockevents: Prevent timer interrupt starvationThomas Gleixner5-9/+29
Calvin reported an odd NMI watchdog lockup which claims that the CPU locked up in user space. He provided a reproducer, which sets up a timerfd based timer and then rearms it in a loop with an absolute expiry time of 1ns. As the expiry time is in the past, the timer ends up as the first expiring timer in the per CPU hrtimer base and the clockevent device is programmed with the minimum delta value. If the machine is fast enough, this ends up in a endless loop of programming the delta value to the minimum value defined by the clock event device, before the timer interrupt can fire, which starves the interrupt and consequently triggers the lockup detector because the hrtimer callback of the lockup mechanism is never invoked. As a first step to prevent this, avoid reprogramming the clock event device when: - a forced minimum delta event is pending - the new expiry delta is less then or equal to the minimum delta Thanks to Calvin for providing the reproducer and to Borislav for testing and providing data from his Zen5 machine. The problem is not limited to Zen5, but depending on the underlying clock event device (e.g. TSC deadline timer on Intel) and the CPU speed not necessarily observable. This change serves only as the last resort and further changes will be made to prevent this scenario earlier in the call chain as far as possible. [ tglx: Updated to restore the old behaviour vs. !force and delta <= 0 and fixed up the tick-broadcast handlers as pointed out by Borislav ] Fixes: d316c57ff6bf ("[PATCH] clockevents: add core functionality") Reported-by: Calvin Owens <calvin@wbinvd.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Calvin Owens <calvin@wbinvd.org> Tested-by: Borislav Petkov <bp@alien8.de> Link: https://lore.kernel.org/lkml/acMe-QZUel-bBYUh@mozart.vkv.me/ Link: https://patch.msgid.link/20260407083247.562657657@kernel.org
2026-04-09Merge tag 'dma-mapping-7.0-2026-04-09' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux Pull dma-mapping fix from Marek Szyprowski: "A fix for DMA-mapping subsystem, which hides annoying, false-positive warnings from DMA-API debug on coherent platforms like x86_64 (Mikhail Gavrilov)" * tag 'dma-mapping-7.0-2026-04-09' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux: dma-debug: suppress cacheline overlap warning when arch has no DMA alignment requirement
2026-04-08sched/deadline: Use revised wakeup rule for dl_serverPeter Zijlstra1-1/+1
John noted that commit 115135422562 ("sched/deadline: Fix 'stuck' dl_server") unfixed the issue from commit a3a70caf7906 ("sched/deadline: Fix dl_server behaviour"). The issue in commit 115135422562 was for wakeups of the server after the deadline; in which case you *have* to start a new period. The case for a3a70caf7906 is wakeups before the deadline. Now, because the server is effectively running a least-laxity policy, it means that any wakeup during the runnable phase means dl_entity_overflow() will be true. This means we need to adjust the runtime to allow it to still run until the existing deadline expires. Use the revised wakeup rule for dl_defer entities. Fixes: 115135422562 ("sched/deadline: Fix 'stuck' dl_server") Reported-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Tested-by: John Stultz <jstultz@google.com> Link: https://patch.msgid.link/20260404102244.GB22575@noisy.programming.kicks-ass.net
2026-04-07Merge tag 'mm-hotfixes-stable-2026-04-06-15-27' of ↵Linus Torvalds1-2/+7
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "Eight hotfixes. All are cc:stable and seven are for MM. All are singletons - please see the changelogs for details" * tag 'mm-hotfixes-stable-2026-04-06-15-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: ocfs2: fix out-of-bounds write in ocfs2_write_end_inline mm/damon/stat: deallocate damon_call() failure leaking damon_ctx mm/vma: fix memory leak in __mmap_region() mm/memory_hotplug: maintain N_NORMAL_MEMORY during hotplug mm/damon/sysfs: dealloc repeat_call_control if damon_call() fails mm: reinstate unconditional writeback start in balance_dirty_pages() liveupdate: propagate file deserialization failures mm: filemap: fix nr_pages calculation overflow in filemap_map_pages()
2026-04-07tick/nohz: Fix inverted return value in check_tick_dependency() fast pathJosh Snyder1-1/+1
Commit 56534673cea7f ("tick/nohz: Optimize check_tick_dependency() with early return") added a fast path that returns !val when the tick_stop tracepoint is disabled. This is inverted: the slow path returns true when a dependency IS found (val != 0), but !val returns true when val is zero (no dependency). The result is that can_stop_full_tick() sees "dependency found" when there are none, and the tick never stops on nohz_full CPUs. Fix this by returning !!val instead of !val, matching the slow-path semantics. Fixes: 56534673cea7f ("tick/nohz: Optimize check_tick_dependency() with early return") Signed-off-by: Josh Snyder <josh@code406.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Assisted-by: Claude:claude-opus-4-6 Link: https://patch.msgid.link/20260402-fix-idle-tick2-v1-1-eecb589649d3@code406.com
2026-04-06liveupdate: propagate file deserialization failuresLeo Timmins1-2/+7
luo_session_deserialize() ignored the return value from luo_file_deserialize(). As a result, a session could be left partially restored even though the /dev/liveupdate open path treats deserialization failures as fatal. Propagate the error so a failed file deserialization aborts session deserialization instead of silently continuing. Link: https://lkml.kernel.org/r/20260325044608.8407-1-leotimmins1974@gmail.com Link: https://lkml.kernel.org/r/20260325044608.8407-2-leotimmins1974@gmail.com Fixes: 16cec0d26521 ("liveupdate: luo_session: add ioctls for file preservation") Signed-off-by: Leo Timmins <leotimmins1974@gmail.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-06tracing/probe: reject non-closed empty immediate stringsPengpeng Hou1-1/+1
parse_probe_arg() accepts quoted immediate strings and passes the body after the opening quote to __parse_imm_string(). That helper currently computes strlen(str) and immediately dereferences str[len - 1], which underflows when the body is empty and not closed with double-quotation. Reject empty non-closed immediate strings before checking for the closing quote. Link: https://lore.kernel.org/all/20260401160315.88518-1-pengpeng@iscas.ac.cn/ Fixes: a42e3c4de964 ("tracing/probe: Add immediate string parameter support") Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2026-04-05Merge tag 'sched-urgent-2026-04-05' of ↵Linus Torvalds2-8/+6
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: - Fix zero_vruntime tracking again (Peter Zijlstra) - Fix avg_vruntime() usage in sched_debug (Peter Zijlstra) * tag 'sched-urgent-2026-04-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/debug: Fix avg_vruntime() usage sched/fair: Fix zero_vruntime tracking fix
2026-04-04prctl: cfi: change the branch landing pad prctl()s to be more descriptivePaul Walmsley1-10/+13
Per Linus' comments requesting the replacement of "INDIR_BR_LP" in the indirect branch tracking prctl()s with something more readable, and suggesting the use of the speculation control prctl()s as an exemplar, reimplement the prctl()s and related constants that control per-task forward-edge control flow integrity. This primarily involves two changes. First, the prctls are restructured to resemble the style of the speculative execution workaround control prctls PR_{GET,SET}_SPECULATION_CTRL, to make them easier to extend in the future. Second, the "indir_br_lp" abbrevation is expanded to "branch_landing_pads" to be less telegraphic. The kselftest and documentation is adjusted accordingly. Link: https://lore.kernel.org/linux-riscv/CAHk-=whhSLGZAx3N5jJpb4GLFDqH_QvS07D+6BnkPWmCEzTAgw@mail.gmail.com/ Cc: Deepak Gupta <debug@rivosinc.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Brown <broonie@kernel.org> Signed-off-by: Paul Walmsley <pjw@kernel.org>
2026-04-04prctl: rename branch landing pad implementation functions to be more explicitPaul Walmsley1-7/+8
Per Linus' comments about the unreadability of abbreviations such as "indir_br_lp", rename the three prctl() implementation functions to be more explicit. This involves renaming "indir_br_lp_status" in the function names to "branch_landing_pad_state". While here, add _prctl_ into the function names, following the speculation control prctl implementation functions. Link: https://lore.kernel.org/linux-riscv/CAHk-=whhSLGZAx3N5jJpb4GLFDqH_QvS07D+6BnkPWmCEzTAgw@mail.gmail.com/ Cc: Deepak Gupta <debug@rivosinc.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Brown <broonie@kernel.org> Signed-off-by: Paul Walmsley <pjw@kernel.org>
2026-04-03Merge tag 'sched_ext-for-7.0-rc6-fixes-2' of ↵Linus Torvalds2-26/+54
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: "These are late but both fix subtle yet critical problems and the blast radius is limited strictly to sched_ext. - Fix stale direct dispatch state in ddsp_dsq_id which can cause spurious warnings in mark_direct_dispatch() on task wakeup - Fix is_bpf_migration_disabled() false negative on non-PREEMPT_RCU configs which can lead to incorrectly dispatching migration- disabled tasks to remote CPUs" * tag 'sched_ext-for-7.0-rc6-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Fix stale direct dispatch state in ddsp_dsq_id sched_ext: Fix is_bpf_migration_disabled() false negative on non-PREEMPT_RCU
2026-04-03sched_ext: Fix stale direct dispatch state in ddsp_dsq_idAndrea Righi1-14/+35
@p->scx.ddsp_dsq_id can be left set (non-SCX_DSQ_INVALID) triggering a spurious warning in mark_direct_dispatch() when the next wakeup's ops.select_cpu() calls scx_bpf_dsq_insert(), such as: WARNING: kernel/sched/ext.c:1273 at scx_dsq_insert_commit+0xcd/0x140 The root cause is that ddsp_dsq_id was only cleared in dispatch_enqueue(), which is not reached in all paths that consume or cancel a direct dispatch verdict. Fix it by clearing it at the right places: - direct_dispatch(): cache the direct dispatch state in local variables and clear it before dispatch_enqueue() on the synchronous path. For the deferred path, the direct dispatch state must remain set until process_ddsp_deferred_locals() consumes them. - process_ddsp_deferred_locals(): cache the dispatch state in local variables and clear it before calling dispatch_to_local_dsq(), which may migrate the task to another rq. - do_enqueue_task(): clear the dispatch state on the enqueue path (local/global/bypass fallbacks), where the direct dispatch verdict is ignored. - dequeue_task_scx(): clear the dispatch state after dispatch_dequeue() to handle both the deferred dispatch cancellation and the holding_cpu race, covering all cases where a pending direct dispatch is cancelled. - scx_disable_task(): clear the direct dispatch state when transitioning a task out of the current scheduler. Waking tasks may have had the direct dispatch state set by the outgoing scheduler's ops.select_cpu() and then been queued on a wake_list via ttwu_queue_wakelist(), when SCX_OPS_ALLOW_QUEUED_WAKEUP is set. Such tasks are not on the runqueue and are not iterated by scx_bypass(), so their direct dispatch state won't be cleared. Without this clear, any subsequent SCX scheduler that tries to direct dispatch the task will trigger the WARN_ON_ONCE() in mark_direct_dispatch(). Fixes: 5b26f7b920f7 ("sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches") Cc: stable@vger.kernel.org # v6.12+ Cc: Daniel Hodges <hodgesd@meta.com> Cc: Patrick Somaru <patsomaru@meta.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-03Merge tag 'pm-7.0-rc7' of ↵Linus Torvalds1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: "These fix a potential NULL pointer dereference in the energy model netlink interface and a potential double free in an error path in the common cpufreq governor management code: - Fix a NULL pointer dereference in the energy model netlink interface that may occur if a given perf domain ID is not recognized (Changwoo Min) - Avoid double free in the cpufreq_dbs_governor_init() error path when kobject_init_and_add() fails (Guangshuo Li)" * tag 'pm-7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: cpufreq: governor: fix double free in cpufreq_dbs_governor_init() error path PM: EM: Fix NULL pointer dereference when perf domain ID is not found
2026-04-02Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfLinus Torvalds3-7/+59
Pull bpf fixes from Alexei Starovoitov: - Fix register equivalence for pointers to packet (Alexei Starovoitov) - Fix incorrect pruning due to atomic fetch precision tracking (Daniel Borkmann) - Fix grace period wait for bpf_link-ed tracepoints (Kumar Kartikeya Dwivedi) - Fix use-after-free of sockmap's sk->sk_socket (Kuniyuki Iwashima) - Reject direct access to nullable PTR_TO_BUF pointers (Qi Tang) - Reject sleepable kprobe_multi programs at attach time (Varun R Mallya) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests/bpf: Add more precision tracking tests for atomics bpf: Fix incorrect pruning due to atomic fetch precision tracking bpf: Reject sleepable kprobe_multi programs at attach time bpf: reject direct access to nullable PTR_TO_BUF pointers bpf: sockmap: Fix use-after-free of sk->sk_socket in sk_psock_verdict_data_ready(). bpf: Fix grace period wait for tracepoint bpf_link bpf: Fix regsafe() for pointers to packet
2026-04-02sched_ext: Fix is_bpf_migration_disabled() false negative on non-PREEMPT_RCUChangwoo Min1-12