summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2025-12-07tracing: Fix WARN_ON in tracing_buffers_mmap_close for split VMAsDeepanshu Kartikey1-0/+10
commit b042fdf18e89a347177a49e795d8e5184778b5b6 upstream. When a VMA is split (e.g., by partial munmap or MAP_FIXED), the kernel calls vm_ops->close on each portion. For trace buffer mappings, this results in ring_buffer_unmap() being called multiple times while ring_buffer_map() was only called once. This causes ring_buffer_unmap() to return -ENODEV on subsequent calls because user_mapped is already 0, triggering a WARN_ON. Trace buffer mappings cannot support partial mappings because the ring buffer structure requires the complete buffer including the meta page. Fix this by adding a may_split callback that returns -EINVAL to prevent VMA splits entirely. Cc: stable@vger.kernel.org Fixes: cf9f0f7c4c5bb ("tracing: Allow user-space mapping of the ring-buffer") Link: https://patch.msgid.link/20251119064019.25904-1-kartikey406@gmail.com Closes: https://syzkaller.appspot.com/bug?extid=a72c325b042aae6403c7 Tested-by: syzbot+a72c325b042aae6403c7@syzkaller.appspotmail.com Reported-by: syzbot+a72c325b042aae6403c7@syzkaller.appspotmail.com Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-12-07timekeeping: Fix error code in tk_aux_sysfs_init()Dan Carpenter1-1/+3
commit c7418164b463056bf4327b6a2abe638b78250f13 upstream. If kobject_create_and_add() fails on the first iteration, then the error code is set to -ENOMEM which is correct. But if it fails in subsequent iterations then "ret" is zero, which means success, but it should be -ENOMEM. Set the error code to -ENOMEM correctly. Fixes: 7b5ab04f035f ("timekeeping: Fix resource leak in tk_aux_sysfs_init() error paths") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Malaya Kumar Rout <mrout@redhat.com> Link: https://patch.msgid.link/aSW1R8q5zoY_DgQE@stanley.mountain Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-12-07dma-direct: Fix missing sg_dma_len assignment in P2PDMA bus mappingsPranjal Shrivastava1-0/+1
[ Upstream commit d0d08f4bd7f667dc7a65cd7133c0a94a6f02aca3 ] Prior to commit a25e7962db0d7 ("PCI/P2PDMA: Refactor the p2pdma mapping helpers"), P2P segments were mapped using the pci_p2pdma_map_segment() helper. This helper was responsible for populating sg->dma_address, marking the bus address, and also setting sg_dma_len(sg). The refactor[1] removed this helper and moved the mapping logic directly into the callers. While iommu_dma_map_sg() was correctly updated to set the length in the new flow, it was missed in dma_direct_map_sg(). Thus, in dma_direct_map_sg(), the PCI_P2PDMA_MAP_BUS_ADDR case sets the dma_address and marks the segment, but immediately executes 'continue', which causes the loop to skip the standard assignment logic at the end: sg_dma_len(sg) = sg->length; As a result, when CONFIG_NEED_SG_DMA_LENGTH is enabled, the dma_length field remains uninitialized (zero) for P2P bus address mappings. This breaks upper-layer drivers (for e.g. RDMA/IB) that rely on sg_dma_len() to determine the transfer size. Fix this by explicitly setting the DMA length in the PCI_P2PDMA_MAP_BUS_ADDR case before continuing to the next scatterlist entry. Fixes: a25e7962db0d7 ("PCI/P2PDMA: Refactor the p2pdma mapping helpers") Reported-by: Jacob Moroni <jmoroni@google.com> Signed-off-by: Pranjal Shrivastava <praan@google.com> [1] https://lore.kernel.org/all/ac14a0e94355bf898de65d023ccf8a2ad22a3ece.1746424934.git.leon@kernel.org/ Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Shivaji Kant <shivajikant@google.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20251126114112.3694469-1-praan@google.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-01sched_ext: fix flag check for deferred callbacksEmil Tsalapatis1-1/+1
commit a3c4a0a42e61aad1056a3d33fd603c1ae66d4288 upstream. When scheduling the deferred balance callbacks, check SCX_RQ_BAL_CB_PENDING instead of SCX_RQ_BAL_PENDING. This way schedule_deferred() properly tests whether there is already a pending request for queue_balance_callback() to be invoked at the end of .balance(). Fixes: a8ad873113d3 ("sched_ext: defer queue_balance_callback() until after ops.dispatch") Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-12-01sched_ext: Fix scx_kick_pseqs corruption on concurrent scheduler loadsAndrea Righi1-5/+5
commit 05e63305c85c88141500f0a2fb02afcfba9396e1 upstream. If we load a BPF scheduler while another scheduler is already running, alloc_kick_pseqs() would be called again, overwriting the previously allocated arrays. Fix by moving the alloc_kick_pseqs() call after the scx_enable_state() check, ensuring that the arrays are only allocated when a scheduler can actually be loaded. Fixes: 14c1da3895a11 ("sched_ext: Allocate scx_kick_cpus_pnt_seqs lazily using kvzalloc()") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-12-01sched_ext: defer queue_balance_callback() until after ops.dispatchEmil Tsalapatis2-2/+28
[ Upstream commit a8ad873113d3fe01f9b5d737d4b0570fa36826b0 ] The sched_ext code calls queue_balance_callback() during enqueue_task() to defer operations that drop multiple locks until we can unpin them. The call assumes that the rq lock is held until the callbacks are invoked, and the pending callbacks will not be visible to any other threads. This is enforced by a WARN_ON_ONCE() in rq_pin_lock(). However, balance_one() may actually drop the lock during a BPF dispatch call. Another thread may win the race to get the rq lock and see the pending callback. To avoid this, sched_ext must only queue the callback after the dispatch calls have completed. CPU 0 CPU 1 CPU 2 scx_balance() rq_unpin_lock() scx_balance_one() |= IN_BALANCE scx_enqueue() ops.dispatch() rq_unlock() rq_lock() queue_balance_callback() rq_unlock() [WARN] rq_pin_lock() rq_lock() &= ~IN_BALANCE rq_repin_lock() Changelog v2-> v1 (https://lore.kernel.org/sched-ext/aOgOxtHCeyRT_7jn@gpd4) - Fixed explanation in patch description (Andrea) - Fixed scx_rq mask state updates (Andrea) - Added Reviewed-by tag from Andrea Reported-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Emil Tsalapatis (Meta) <emil@etsalapatis.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-01sched_ext: Allocate scx_kick_cpus_pnt_seqs lazily using kvzalloc()Tejun Heo1-10/+79
[ Upstream commit 14c1da3895a116f4e32c20487046655f26d3999b ] On systems with >4096 CPUs, scx_kick_cpus_pnt_seqs allocation fails during boot because it exceeds the 32,768 byte percpu allocator limit. Restructure to use DEFINE_PER_CPU() for the per-CPU pointers, with each CPU pointing to its own kvzalloc'd array. Move allocation from boot time to scx_enable() and free in scx_disable(), so the O(nr_cpu_ids^2) memory is only consumed when sched_ext is active. Use RCU to guard against racing with free. Arrays are freed via call_rcu() and kick_cpus_irq_workfn() uses rcu_dereference_bh() with a NULL check. While at it, rename to scx_kick_pseqs for brevity and update comments to clarify these are pick_task sequence numbers. v2: RCU protect scx_kick_seqs to manage kick_cpus_irq_workfn() racing against disable as per Andrea. v3: Fix bugs notcied by Andrea. Reported-by: Phil Auld <pauld@redhat.com> Link: http://lkml.kernel.org/r/20251007133523.GA93086@pauld.westford.csb Cc: Andrea Righi <arighi@nvidia.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Reviewed-by: Phil Auld <pauld@redhat.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-01timekeeping: Fix resource leak in tk_aux_sysfs_init() error pathsMalaya Kumar Rout1-9/+12
[ Upstream commit 7b5ab04f035f829ed6008e4685501ec00b3e73c9 ] tk_aux_sysfs_init() returns immediately on error during the auxiliary clock initialization loop without cleaning up previously allocated kobjects and sysfs groups. If kobject_create_and_add() or sysfs_create_group() fails during loop iteration, the parent kobjects (tko and auxo) and any previously created child kobjects are leaked. Fix this by adding proper error handling with goto labels to ensure all allocated resources are cleaned up on failure. kobject_put() on the parent kobjects will handle cleanup of their children. Fixes: 7b95663a3d96 ("timekeeping: Provide interface to control auxiliary clocks") Signed-off-by: Malaya Kumar Rout <mrout@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20251120150213.246777-1-mrout@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-01perf: Fix 0 count issue of cpu-clockDapeng Mi1-1/+1
[ Upstream commit f1f96511b1c4c33e53f05909dd267878e0643a9a ] Currently cpu-clock event always returns 0 count, e.g., perf stat -e cpu-clock -- sleep 1 Performance counter stats for 'sleep 1': 0 cpu-clock # 0.000 CPUs utilized 1.002308394 seconds time elapsed The root cause is the commit 'bc4394e5e79c ("perf: Fix the throttle error of some clock events")' adds PERF_EF_UPDATE flag check before calling cpu_clock_event_update() to update the count, however the PERF_EF_UPDATE flag is never set when the cpu-clock event is stopped in counting mode (pmu->dev() -> cpu_clock_event_del() -> cpu_clock_event_stop()). This leads to the cpu-clock event count is never updated. To fix this issue, force to set PERF_EF_UPDATE flag for cpu-clock event just like what task-clock does. Fixes: bc4394e5e79c ("perf: Fix the throttle error of some clock events") Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ian Rogers <irogers@google.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Link: https://patch.msgid.link/20251112080526.3971392-1-dapeng1.mi@linux.intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-01tick/sched: Fix bogus condition in report_idle_softirq()Wen Yang1-6/+5
[ Upstream commit 807e0d187da4c0b22036b5e34000f7a8c52f6e50 ] In commit 0345691b24c0 ("tick/rcu: Stop allowing RCU_SOFTIRQ in idle") the new function report_idle_softirq() was created by breaking code out of the existing can_stop_idle_tick() for kernels v5.18 and newer. In doing so, the code essentially went from this form: if (A) { static int ratelimit; if (ratelimit < 10 && !C && A&D) { pr_warn("NOHZ tick-stop error: ..."); ratelimit++; } return false; } to a new function: static bool report_idle_softirq(void) { static int ratelimit; if (likely(!A)) return false; if (ratelimit < 10) return false; ... pr_warn("NOHZ tick-stop error: local softirq work is pending, handler #%02x!!!\n", pending); ratelimit++; return true; } commit a7e282c77785 ("tick/rcu: Fix bogus ratelimit condition") realized ratelimit was essentially set to zero instead of ten, and hence *no* softirq pending messages would ever be issued, but "fixed" it as: - if (ratelimit < 10) + if (ratelimit >= 10) return false; However, this fix introduced another issue: When ratelimit is greater than or equal 10, even if A is true, it will directly return false. While ratelimit in the original code was only used to control printing and will not affect the return value. Restore the original logic and restrict ratelimit to control the printk and not the return value. Fixes: 0345691b24c0 ("tick/rcu: Stop allowing RCU_SOFTIRQ in idle") Fixes: a7e282c77785 ("tick/rcu: Fix bogus ratelimit condition") Signed-off-by: Wen Yang <wen.yang@linux.dev> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20251119174525.29470-1-wen.yang@linux.dev Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-01sched_ext: Fix scx_enable() crash on helper kthread creation failureSaket Kumar Bhaskar1-1/+4
commit 7b6216baae751369195fa3c83d434d23bcda406a upstream. A crash was observed when the sched_ext selftests runner was terminated with Ctrl+\ while test 15 was running: NIP [c00000000028fa58] scx_enable.constprop.0+0x358/0x12b0 LR [c00000000028fa2c] scx_enable.constprop.0+0x32c/0x12b0 Call Trace: scx_enable.constprop.0+0x32c/0x12b0 (unreliable) bpf_struct_ops_link_create+0x18c/0x22c __sys_bpf+0x23f8/0x3044 sys_bpf+0x2c/0x6c system_call_exception+0x124/0x320 system_call_vectored_common+0x15c/0x2ec kthread_run_worker() returns an ERR_PTR() on failure rather than NULL, but the current code in scx_alloc_and_add_sched() only checks for a NULL helper. Incase of failure on SIGQUIT, the error is not handled in scx_alloc_and_add_sched() and scx_enable() ends up dereferencing an error pointer. Error handling is fixed in scx_alloc_and_add_sched() to propagate PTR_ERR() into ret, so that scx_enable() jumps to the existing error path, avoiding random dereference on failure. Fixes: bff3b5aec1b7 ("sched_ext: Move disable machinery into scx_sched") Cc: stable@vger.kernel.org # v6.16+ Reported-and-tested-by: Samir Mulani <samir@linux.ibm.com> Signed-off-by: Saket Kumar Bhaskar <skb99@linux.ibm.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Vishal Chourasia <vishalc@linux.ibm.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-12-01timers: Fix NULL function pointer race in timer_shutdown_sync()Yipeng Zou1-3/+4
commit 20739af07383e6eb1ec59dcd70b72ebfa9ac362c upstream. There is a race condition between timer_shutdown_sync() and timer expiration that can lead to hitting a WARN_ON in expire_timers(). The issue occurs when timer_shutdown_sync() clears the timer function to NULL while the timer is still running on another CPU. The race scenario looks like this: CPU0 CPU1 <SOFTIRQ> lock_timer_base() expire_timers() base->running_timer = timer; unlock_timer_base() [call_timer_fn enter] mod_timer() ... timer_shutdown_sync() lock_timer_base() // For now, will not detach the timer but only clear its function to NULL if (base->running_timer != timer) ret = detach_if_pending(timer, base, true); if (shutdown) timer->function = NULL; unlock_timer_base() [call_timer_fn exit] lock_timer_base() base->running_timer = NULL; unlock_timer_base() ... // Now timer is pending while its function set to NULL. // next timer trigger <SOFTIRQ> expire_timers() WARN_ON_ONCE(!fn) // hit ... lock_timer_base() // Now timer will detach if (base->running_timer != timer) ret = detach_if_pending(timer, base, true); if (shutdown) timer->function = NULL; unlock_timer_base() The problem is that timer_shutdown_sync() clears the timer function regardless of whether the timer is currently running. This can leave a pending timer with a NULL function pointer, which triggers the WARN_ON_ONCE(!fn) check in expire_timers(). Fix this by only clearing the timer function when actually detaching the timer. If the timer is running, leave the function pointer intact, which is safe because the timer will be properly detached when it finishes running. Fixes: 0cc04e80458a ("timers: Add shutdown mechanism to the internal functions") Signed-off-by: Yipeng Zou <zouyipeng@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20251122093942.301559-1-zouyipeng@huawei.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-24PM: hibernate: Use atomic64_t for compressed_size variableMario Limonciello (AMD)1-5/+5
commit 66ededc694f1d06a71ca35a3c8e3689e9b85b3ce upstream. `compressed_size` can overflow, showing nonsensical values. Change from `atomic_t` to `atomic64_t` to prevent overflow. Fixes: a06c6f5d3cc9 ("PM: hibernate: Move to crypto APIs for LZO compression") Reported-by: Askar Safin <safinaskar@gmail.com> Closes: https://lore.kernel.org/linux-pm/20251105180506.137448-1-safinaskar@gmail.com/ Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org> Tested-by: Askar Safin <safinaskar@gmail.com> Cc: 6.9+ <stable@vger.kernel.org> # 6.9+ Link: https://patch.msgid.link/20251106045158.3198061-3-superm1@kernel.org Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-24PM: hibernate: Emit an error when image writing failsMario Limonciello (AMD)1-4/+7
commit 62b9ca1706e1bbb60d945a58de7c7b5826f6b2a2 upstream. If image writing fails, a return code is passed up to the caller, but none of the callers log anything to the log and so the only record of it is the return code that userspace gets. Adjust the logging so that the image size and speed of writing is only emitted on success and if there is an error, it's saved to the logs. Fixes: a06c6f5d3cc9 ("PM: hibernate: Move to crypto APIs for LZO compression") Reported-by: Askar Safin <safinaskar@gmail.com> Closes: https://lore.kernel.org/linux-pm/20251105180506.137448-1-safinaskar@gmail.com/ Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org> Tested-by: Askar Safin <safinaskar@gmail.com> Cc: 6.9+ <stable@vger.kernel.org> # 6.9+ [ rjw: Added missing braces after "else", changelog edits ] Link: https://patch.msgid.link/20251106045158.3198061-2-superm1@kernel.org Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-24ftrace: Fix BPF fexit with livepatchSong Liu2-11/+14
commit 56b3c85e153b84f27e6cff39623ba40a1ad299d3 upstream. When livepatch is attached to the same function as bpf trampoline with a fexit program, bpf trampoline code calls register_ftrace_direct() twice. The first time will fail with -EAGAIN, and the second time it will succeed. This requires register_ftrace_direct() to unregister the address on the first attempt. Otherwise, the bpf trampoline cannot attach. Here is an easy way to reproduce this issue: insmod samples/livepatch/livepatch-sample.ko bpftrace -e 'fexit:cmdline_proc_show {}' ERROR: Unable to attach probe: fexit:vmlinux:cmdline_proc_show... Fix this by cleaning up the hash when register_ftrace_function_nolock hits errors. Also, move the code that resets ops->func and ops->trampoline to the error path of register_ftrace_direct(); and add a helper function reset_direct() in register_ftrace_direct() and unregister_ftrace_direct(). Fixes: d05cb470663a ("ftrace: Fix modification of direct_function hash while in use") Cc: stable@vger.kernel.org # v6.6+ Reported-by: Andrey Grodzovsky <andrey.grodzovsky@crowdstrike.com> Closes: https://lore.kernel.org/live-patching/c5058315a39d4615b333e485893345be@crowdstrike.com/ Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-and-tested-by: Andrey Grodzovsky <andrey.grodzovsky@crowdstrike.com> Signed-off-by: Song Liu <song@kernel.org> Reviewed-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20251027175023.1521602-2-song@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-24crash: fix crashkernel resource shrinkSourabh Jain1-1/+1
commit 00fbff75c5acb4755f06f08bd1071879c63940c5 upstream. When crashkernel is configured with a high reservation, shrinking its value below the low crashkernel reservation causes two issues: 1. Invalid crashkernel resource objects 2. Kernel crash if crashkernel shrinking is done twice For example, with crashkernel=200M,high, the kernel reserves 200MB of high memory and some default low memory (say 256MB). The reservation appears as: cat /proc/iomem | grep -i crash af000000-beffffff : Crash kernel 433000000-43f7fffff : Crash kernel If crashkernel is then shrunk to 50MB (echo 52428800 > /sys/kernel/kexec_crash_size), /proc/iomem still shows 256MB reserved: af000000-beffffff : Crash kernel Instead, it should show 50MB: af000000-b21fffff : Crash kernel Further shrinking crashkernel to 40MB causes a kernel crash with the following trace (x86): BUG: kernel NULL pointer dereference, address: 0000000000000038 PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP NOPTI <snip...> Call Trace: <TASK> ? __die_body.cold+0x19/0x27 ? page_fault_oops+0x15a/0x2f0 ? search_module_extables+0x19/0x60 ? search_bpf_extables+0x5f/0x80 ? exc_page_fault+0x7e/0x180 ? asm_exc_page_fault+0x26/0x30 ? __release_resource+0xd/0xb0 release_resource+0x26/0x40 __crash_shrink_memory+0xe5/0x110 crash_shrink_memory+0x12a/0x190 kexec_crash_size_store+0x41/0x80 kernfs_fop_write_iter+0x141/0x1f0 vfs_write+0x294/0x460 ksys_write+0x6d/0xf0 <snip...> This happens because __crash_shrink_memory()/kernel/crash_core.c incorrectly updates the crashk_res resource object even when crashk_low_res should be updated. Fix this by ensuring the correct crashkernel resource object is updated when shrinking crashkernel memory. Link: https://lkml.kernel.org/r/20251101193741.289252-1-sourabhjain@linux.ibm.com Fixes: 16c6006af4d4 ("kexec: enable kexec_crash_size to support two crash kernel regions") Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Zhen Lei <thunder.leizhen@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-24kho: warn and exit when unpreserved page wasn't preservedPratyush Yadav1-4/+4
commit b05addf6f0596edb1f82ab4059438c7ef2d2686d upstream. Calling __kho_unpreserve() on a pair of (pfn, end_pfn) that wasn't preserved is a bug. Currently, if that is done, the physxa or bits can be NULL. This results in a soft lockup since a NULL physxa or bits results in redoing the loop without ever making any progress. Return when physxa or bits are not found, but WARN first to loudly indicate invalid behaviour. Link: https://lkml.kernel.org/r/20251103180235.71409-3-pratyush@kernel.org Fixes: fc33e4b44b27 ("kexec: enable KHO support for memory preservation") Signed-off-by: Pratyush Yadav <pratyush@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Baoquan He <bhe@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-24gcov: add support for GCC 15Peter Oberparleiter1-1/+3
commit ec4d11fc4b2dd4a2fa8c9d801ee9753b74623554 upstream. Using gcov on kernels compiled with GCC 15 results in truncated 16-byte long .gcda files with no usable data. To fix this, update GCOV_COUNTERS to match the value defined by GCC 15. Tested with GCC 14.3.0 and GCC 15.2.0. Link: https://lkml.kernel.org/r/20251028115125.1319410-1-oberpar@linux.ibm.com Signed-off-by: Peter Oberparleiter <oberpar@linux.ibm.com> Reported-by: Matthieu Baerts <matttbe@kernel.org> Closes: https://github.com/linux-test-project/lcov/issues/445 Tested-by: Matthieu Baerts <matttbe@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-24sched_ext: Fix unsafe locking in the scx_dump_state()Zqiang1-2/+2
[ Upstream commit 5f02151c411dda46efcc5dc57b0845efcdcfc26d ] For built with CONFIG_PREEMPT_RT=y kernels, the dump_lock will be converted sleepable spinlock and not disable-irq, so the following scenarios occur: inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage. irq_work/0/27 [HC0[0]:SC0[0]:HE1:SE1] takes: (&rq->__lock){?...}-{2:2}, at: raw_spin_rq_lock_nested+0x2b/0x40 {IN-HARDIRQ-W} state was registered at: lock_acquire+0x1e1/0x510 _raw_spin_lock_nested+0x42/0x80 raw_spin_rq_lock_nested+0x2b/0x40 sched_tick+0xae/0x7b0 update_process_times+0x14c/0x1b0 tick_periodic+0x62/0x1f0 tick_handle_periodic+0x48/0xf0 timer_interrupt+0x55/0x80 __handle_irq_event_percpu+0x20a/0x5c0 handle_irq_event_percpu+0x18/0xc0 handle_irq_event+0xb5/0x150 handle_level_irq+0x220/0x460 __common_interrupt+0xa2/0x1e0 common_interrupt+0xb0/0xd0 asm_common_interrupt+0x2b/0x40 _raw_spin_unlock_irqrestore+0x45/0x80 __setup_irq+0xc34/0x1a30 request_threaded_irq+0x214/0x2f0 hpet_time_init+0x3e/0x60 x86_late_time_init+0x5b/0xb0 start_kernel+0x308/0x410 x86_64_start_reservations+0x1c/0x30 x86_64_start_kernel+0x96/0xa0 common_startup_64+0x13e/0x148 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&rq->__lock); <Interrupt> lock(&rq->__lock); *** DEADLOCK *** stack backtrace: CPU: 0 UID: 0 PID: 27 Comm: irq_work/0 Call Trace: <TASK> dump_stack_lvl+0x8c/0xd0 dump_stack+0x14/0x20 print_usage_bug+0x42e/0x690 mark_lock.part.44+0x867/0xa70 ? __pfx_mark_lock.part.44+0x10/0x10 ? string_nocheck+0x19c/0x310 ? number+0x739/0x9f0 ? __pfx_string_nocheck+0x10/0x10 ? __pfx_check_pointer+0x10/0x10 ? kvm_sched_clock_read+0x15/0x30 ? sched_clock_noinstr+0xd/0x20 ? local_clock_noinstr+0x1c/0xe0 __lock_acquire+0xc4b/0x62b0 ? __pfx_format_decode+0x10/0x10 ? __pfx_string+0x10/0x10 ? __pfx___lock_acquire+0x10/0x10 ? __pfx_vsnprintf+0x10/0x10 lock_acquire+0x1e1/0x510 ? raw_spin_rq_lock_nested+0x2b/0x40 ? __pfx_lock_acquire+0x10/0x10 ? dump_line+0x12e/0x270 ? raw_spin_rq_lock_nested+0x20/0x40 _raw_spin_lock_nested+0x42/0x80 ? raw_spin_rq_lock_nested+0x2b/0x40 raw_spin_rq_lock_nested+0x2b/0x40 scx_dump_state+0x3b3/0x1270 ? finish_task_switch+0x27e/0x840 scx_ops_error_irq_workfn+0x67/0x80 irq_work_single+0x113/0x260 irq_work_run_list.part.3+0x44/0x70 run_irq_workd+0x6b/0x90 ? __pfx_run_irq_workd+0x10/0x10 smpboot_thread_fn+0x529/0x870 ? __pfx_smpboot_thread_fn+0x10/0x10 kthread+0x305/0x3f0 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x40/0x70 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> This commit therefore use rq_lock_irqsave/irqrestore() to replace rq_lock/unlock() in the scx_dump_state(). Fixes: 07814a9439a3 ("sched_ext: Print debug dump after an error exit") Signed-off-by: Zqiang <qiang.zhang@linux.dev> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-11-24posix-timers: Plug potential memory leak in do_timer_create()Eslam Khafagy1-6/+6
[ Upstream commit e0fd4d42e27f761e9cc82801b3f183e658dc749d ] When posix timer creation is set to allocate a given timer ID and the access to the user space value faults, the function terminates without freeing the already allocated posix timer structure. Move the allocation after the user space access to cure that. [ tglx: Massaged change log ] Fixes: ec2d0c04624b3 ("posix-timers: Provide a mechanism to allocate a given timer ID") Reported-by: syzbot+9c47ad18f978d4394986@syzkaller.appspotmail.com Suggested-by: Cyrill Gorcunov <gorcunov@gmail.com> Signed-off-by: Eslam Khafagy <eslam.medhat1993@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://patch.msgid.link/20251114122739.994326-1-eslam.medhat1993@gmail.com Closes: https://lore.kernel.org/all/69155df4.a70a0220.3124cb.0017.GAE@google.com/T/ Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-11-24bpf: account for current allocated stack depth in widen_imprecise_scalars()Eduard Zingerman1-2/+4
[ Upstream commit b0c8e6d3d866b6a7f73877f71968dbffd27b7785 ] The usage pattern for widen_imprecise_scalars() looks as follows: prev_st = find_prev_entry(env, ...); queued_st = push_stack(...); widen_imprecise_scalars(env, prev_st, queued_st); Where prev_st is an ancestor of the queued_st in the explored states tree. This ancestor is not guaranteed to have same allocated stack depth as queued_st. E.g. in the following case: def main(): for i in 1..2: foo(i) // same callsite, differnt param def foo(i): if i == 1: use 128 bytes of stack iterator based loop Here, for a second 'foo' call prev_st->allocated_stack is 128, while queued_st->allocated_stack is much smaller. widen_imprecise_scalars() needs to take this into account and avoid accessing bpf_verifier_state->frame[*]->stack out of bounds. Fixes: 2793a8b015f7 ("bpf: exact states comparison for iterator convergence checks") Reported-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251114025730.772723-1-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-11-24futex: Optimize per-cpu reference countingPeter Zijlstra1-6/+6
[ Upstream commit 4cb5ac2626b5704ed712ac1d46b9d89fdfc12c5d ] Shrikanth noted that the per-cpu reference counter was still some 10% slower than the old immutable option (which removes the reference counting entirely). Further optimize the per-cpu reference counter by: - switching from RCU to preempt; - using __this_cpu_*() since we now have preempt disabled; - switching from smp_load_acquire() to READ_ONCE(). This is all safe because disabling preemption inhibits the RCU grace period exactly like rcu_read_lock(). Having preemption disabled allows using __this_cpu_*() provided the only access to the variable is in task context -- which is the case here. Furthermore, since we know changing fph->state to FR_ATOMIC demands a full RCU grace period we can rely on the implied smp_mb() from that to replace the acquire barrier(). This is very similar to the percpu_down_read_internal() fast-path. The reason this is significant for PowerPC is that it uses the generic this_cpu_*() implementation which relies on local_irq_disable() (the x86 implementation relies on it being a single memop instruction to be IRQ-safe). Switching to preempt_disable() and __this_cpu*() avoids this IRQ state swizzling. Also, PowerPC needs LWSYNC for the ACQUIRE barrier, not having to use explicit barriers safes a bunch. Combined this reduces the performance gap by half, down to some 5%. Fixes: 760e6f7befba ("futex: Remove support for IMMUTABLE") Reported-by: Shrikanth Hegde <sshegde@linux.ibm.com> Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://patch.msgid.link/20251106092929.GR4067720@noisy.programming.kicks-ass.net Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-11-13perf/core: Fix system hang caused by cpu-clock usageDapeng Mi1-5/+15
commit eb3182ef0405ff2f6668fd3e5ff9883f60ce8801 upstream. cpu-clock usage by the async-profiler tool can trigger a system hang, which got bisected back to the following commit by Octavia Togami: 18dbcbfabfff ("perf: Fix the POLL_HUP delivery breakage") causes this issue The root cause of the hang is that cpu-clock is a special type of SW event which relies on hrtimers. The __perf_event_overflow() callback is invoked from the hrtimer handler for cpu-clock events, and __perf_event_overflow() tries to call cpu_clock_event_stop() to stop the event, which calls htimer_cancel() to cancel the hrtimer. But that's a recursion into the hrtimer code from a hrtimer handler, which (unsurprisingly) deadlocks. To fix this bug, use hrtimer_try_to_cancel() instead, and set the PERF_HES_STOPPED flag, which causes perf_swevent_hrtimer() to stop the event once it sees the PERF_HES_STOPPED flag. [ mingo: Fixed the comments and improved the changelog. ] Closes: https://lore.kernel.org/all/CAHPNGSQpXEopYreir+uDDEbtXTBvBvi8c6fYXJvceqtgTPao3Q@mail.gmail.com/ Fixes: 18dbcbfabfff ("perf: Fix the POLL_HUP delivery breakage") Reported-by: Octavia Togami <octavia.togami@gmail.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Octavia Togami <octavia.togami@gmail.com> Cc: stable@vger.kernel.org Link: https://github.com/lucko/spark/issues/530 Link: https://patch.msgid.link/20251015051828.12809-1-dapeng1.mi@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-13tracing: Fix memory leaks in create_field_var()Zilin Guan1-2/+4
[ Upstream commit 80f0d631dcc76ee1b7755bfca1d8417d91d71414 ] The function create_field_var() allocates memory for 'val' through create_hist_field() inside parse_atom(), and for 'var' through create_var(), which in turn allocates var->type and var->var.name internally. Simply calling kfree() to release these structures will result in memory leaks. Use destroy_hist_field() to properly free 'val', and explicitly release the memory of var->type and var->var.name before freeing 'var' itself. Link: https://patch.msgid.link/20251106120132.3639920-1-zilin@seu.edu.cn Fixes: 02205a6752f22 ("tracing: Add support for 'field variables'") Signed-off-by: Zilin Guan <zilin@seu.edu.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-11-13ring-buffer: Do not warn in ring_buffer_map_get_reader() when reader catches upSteven Rostedt1-0/+4
commit aa997d2d2a0b2e76f4df0f1f12829f02acb4fb6b upstream. The function ring_buffer_map_get_reader() is a bit more strict than the other get reader functions, and except for certain situations the rb_get_reader_page() should not return NULL. If it does, it triggers a warning. This warning was triggering but after looking at why, it was because another acceptable situation was happening and it wasn't checked for. If the reader catches up to the writer and there's still data to be read on the reader page, then the rb_get_reader_page() will return NULL as there's no new page to get. In this situation, the reader page should not be updated and no warning should trigger. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Vincent Donnefort <vdonnefort@google.com> Reported-by: syzbot+92a3745cea5ec6360309@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/690babec.050a0220.baf87.0064.GAE@google.com/ Link: https://lore.kernel.org/20251016132848.1b11bb37@gandalf.local.home Fixes: 117c39200d9d7 ("ring-buffer: Introducing ring-buffer mapping functions") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-13tracing: tprobe-events: Fix to put tracepoint_user when disable the tprobeMasami Hiramatsu (Google)1-0/+4
commit c91afa7610235f89a5e8f5686aac23892ab227ed upstream. __unregister_trace_fprobe() checks tf->tuser to put it when removing tprobe. However, disable_trace_fprobe() does not use it and only calls unregister_fprobe(). Thus it forgets to disable tracepoint_user. If the trace_fprobe has tuser, put it for unregistering the tracepoint callbacks when disabling tprobe correctly. Link: https://lore.kernel.org/all/176244794466.155515.3971904050506100243.stgit@devnote2/ Fixes: 2867495dea86 ("tracing: tprobe-events: Register tracepoint when enable tprobe event") Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Tested-by: Beau Belgrave <beaub@linux.microsoft.com> Reviewed-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-13tracing: tprobe-events: Fix to register tracepoint correctlyMasami Hiramatsu (Google)1-1/+2
commit 10d9dda426d684e98b17161f02f77894c6de9b60 upstream. Since __tracepoint_user_init() calls tracepoint_user_register() without initializing tuser->tpoint with given tracpoint, it does not register tracepoint stub function as callback correctly, and tprobe does not work. Initializing tuser->tpoint correctly before tracepoint_user_register() so that it sets up tracepoint callback. I confirmed below example works fine again. echo "t sched_switch preempt prev_pid=prev->pid next_pid=next->pid" > /sys/kernel/tracing/dynamic_events echo 1 > /sys/kernel/tracing/events/tracepoints/sched_switch/enable cat /sys/kernel/tracing/trace_pipe Link: https://lore.kernel.org/all/176244793514.155515.6466348656998627773.stgit@devnote2/ Fixes: 2867495dea86 ("tracing: tprobe-events: Register tracepoint when enable tprobe event") Reported-by: Beau Belgrave <beaub@linux.microsoft.com> Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Tested-by: Beau Belgrave <beaub@linux.microsoft.com> Reviewed-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-13ftrace: Fix softlockup in ftrace_module_enableVladimir Riabchun1-0/+2
[ Upstream commit 4099b98203d6b33d990586542fa5beee408032a3 ] A soft lockup was observed when loading amdgpu module. If a module has a lot of tracable functions, multiple calls to kallsyms_lookup can spend too much time in RCU critical section and with disabled preemption, causing kernel panic. This is the same issue that was fixed in commit d0b24b4e91fc ("ftrace: Prevent RCU stall on PREEMPT_VOLUNTARY kernels") and commit 42ea22e754ba ("ftrace: Add cond_resched() to ftrace_graph_set_hash()"). Fix it the same way by adding cond_resched() in ftrace_module_enable. Link: https://lore.kernel.org/aMQD9_lxYmphT-up@vova-pc Signed-off-by: Vladimir Riabchun <ferr.lambarginio@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-11-13uprobe: Do not emulate/sstep original instruction when ip is changedJiri Olsa1-0/+7
[ Upstream commit 4363264111e1297fa37aa39b0598faa19298ecca ] If uprobe handler changes instruction pointer we still execute single step) or emulate the original instruction and increment the (new) ip with its length. This makes the new instruction pointer bogus and application will likely crash on illegal instruction execution. If user decided to take execution elsewhere, it makes little sense to execute the original instruction, so let's skip it. Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20250916215301.664963-3-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-11-13futex: Don't leak robust_list pointer on exec racePranav Tyagi1-50/+56
[ Upstream commit 6b54082c3ed4dc9821cdf0edb17302355cc5bb45 ] sys_get_robust_list() and compat_get_robust_list() use ptrace_may_access() to check if the calling task is allowed to access another task's robust_list pointer. This check is racy against a concurrent exec() in the target process. During exec(), a task may transition from a non-privileged binary to a privileged one (e.g., setuid binary) and its credentials/memory mappings may change. If get_robust_list() performs ptrace_may_access() before this transition, it may erroneously allow access to sensitive information after the target becomes privileged. A racy access allows an attacker to exploit a window during which ptrace_may_access() passes before a target process transitions to a privileged state via exec(). For example, consider a non-privileged task T that is about to execute a setuid-root binary. An attacker task A calls get_robust_list(T) while T is still unprivileged. Since ptrace_may_access() checks permissions based on current credentials, it succeeds. However, if T begins exec immediately afterwards, it becomes privileged and may change its memory mappings. Because get_robust_list() proceeds to access T->robust_list without synchronizing with exec() it may read user-space pointers from a now-privileged process. This violates the intended post-exec access restrictions and could expose sensitive memory addresses or be used as a primitive in a larger exploit chain. Consequently, the race can lead to unauthorized disclosure of information across privilege boundaries and poses a potential security risk. Take a read lock on signal->exec_update_lock prior to invoking ptrace_may_access() and accessing the robust_list/compat_robust_list. This ensures that the target task's exec state remains stable during the check, allowing for consistent and synchronized validation of credentials. Suggested-by: Jann Horn <jann@thejh.net> Signed-off-by: Pranav Tyagi <pranav.tyagi03@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/linux-fsdevel/1477863998-3298-5-git-send-email-jann@thejh.net/ Link: https://github.com/KSPP/linux/issues/119 Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-11-13bpf: Do not limit bpf_cgroup_from_id to current's namespaceKumar Kartikeya Dwivedi2-5/+21
[ Upstream commit 2c895133950646f45e5cf3900b168c952c8dbee8 ] The bpf_cgroup_from_id kfunc relies on cgroup_get_from_id to obtain the cgroup corresponding to a given cgroup ID. This helper can be called in a lot of contexts where the current thread can be random. A recent example was its use in sched_ext's ops.tick(), to obtain the root cgroup pointer. Since the current task can be whatever random user space task preempted by the timer tick, this makes the behavior of the helper unreliable. Refactor out __cgroup_get_from_id as the non-namespace aware version of cgroup_get_from_id, and change bpf_cgroup_from_id to make use of it. There is no compatibility breakage here, since changing the namespace against which the lookup is being done to the root cgroup namespace only permits a wider set of lookups to succeed now. The cgroup IDs across namespaces are globally unique, and thus don't need to be retranslated. Reported-by: Dan Schatzberg <dschatzberg@meta.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20250915032618.1551762-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-11-13bpf: Use tnums for JEQ/JNE is_branch_taken logicPaul Chaignon2-0/+12
[ Upstream commit f41345f47fb267a9c95ca710c33448f8d0d81d83 ] In the following toy program (reg states minimized for readability), R0 and R1 always have different values at instruction 6. This is obvious when reading the program but cannot be guessed from ranges alone as they overlap (R0 in [0; 0xc0000000], R1 in [1024; 0xc0000400]). 0: call bpf_get_prandom_u32#7 ; R0_w=scalar() 1: w0 = w0 ; R0_w=scalar(var_off=(0x0; 0xffffffff)) 2: r0 >>= 30 ; R0_w=scalar(var_off=(0x0; 0x3)) 3: r0 <<= 30 ; R0_w=scalar(var_off=(0x0; 0xc0000000)) 4: r1 = r0 ; R1_w=scalar(var_off=(0x0; 0xc0000000)) 5: r1 += 1024 ; R1_w=scalar(var_off=(0x400; 0xc0000000)) 6: if r1 != r0 goto pc+1 Looking at tnums however, we can deduce that R1 is always different from R0 because their tnums don't agree on known bits. This patch uses this logic to improve is_scalar_branch_taken in case of BPF_JEQ and BPF_JNE. This change has a tiny impact on complexity, which was measured with the Cilium complexity CI test. That test covers 72 programs with various build and load time configurations for a total of 970 test cases. For 80% of test cases, the patch has no impact. On the other test cases, the patch decreases complexity by only 0.08% on average. In the best case, the verifier needs to walk 3% less instructions and, in the worst case, 1.5% more. Overall, the patch has a small positive impact, especially for our largest programs. Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/be3ee70b6e489c49881cb1646114b1d861b5c334.1755694147.git.paul.chaignon@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-11-13PM: sleep: Allow pm_restrict_gfp_mask() stackingRafael J. Wysocki2-9/+17
[ Upstream commit 35e4a69b2003f20a69e7d19ae96ab1eef1aa8e8d ] Allow pm_restrict_gfp_mask() to be called many times in a row to avoid issues with calling dpm_suspend_start() when the GFP mask has been already restricted. Only the first invocation of pm_restrict_gfp_mask() will actually restrict the GFP mask and the subsequent calls will warn if there is a mismatch between the expected allowed GFP mask and the actual one. Moreover, if pm_restrict_gfp_mask() is called many times in a row, pm_restore_gfp_mask() needs to be called matching number of times in a row to actually restore the GFP mask. Calling it when the GFP mask has not been restricted will cause it to warn. This is necessary for the GFP mask restriction starting in hibernation_snapshot() to continue throughout the entire hibernation flow until it completes or it is aborted (either by a wakeup event or by an error). Fixes: 449c9c02537a1 ("PM: hibernate: Restrict GFP mask in hibernation_snapshot()") Fixes: 469d80a3712c ("PM: hibernate: Fix hybrid-sleep") Reported-by: Askar Safin <safinaskar@gmail.com> Closes: https://lore.kernel.org/linux-pm/20251025050812.421905-1-safinaskar@gmail.com/ Link: https://lore.kernel.org/linux-pm/20251028111730.2261404-1-safinaskar@gmail.com/ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Tested-by: Mario Limonciello (AMD) <superm1@kernel.org> Cc: 6.16+ <stable@vger.kernel.org> # 6.16+ Link: https://patch.msgid.link/5935682.DvuYhMxLoT@rafael.j.wysocki Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-13PM: hibernate: Combine return paths in power_down()Rafael J. Wysocki1-18/+14
[ Upstream commit 1f5bcfe91ffce71bdd1022648b9d501d46d20c09 ] To avoid code duplication and improve clarity, combine the code paths in power_down() leading to a return from that function. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Link: https://patch.msgid.link/3571055.QJadu78ljV@rafael.j.wysocki [ rjw: Changed the new label name to "exit" ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Stable-dep-of: 35e4a69b2003 ("PM: sleep: Allow pm_restrict_gfp_mask() stacking") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-13bpf: Conditionally include dynptr copy kfuncsMalin Jonsson1-0/+2
[ Upstream commit 8ce93aabbf75171470e3d1be56bf1a6937dc5db8 ] Since commit a498ee7576de ("bpf: Implement dynptr copy kfuncs"), if CONFIG_BPF_EVENTS is not enabled, but BPF_SYSCALL and DEBUG_INFO_BTF are, the build will break like so: BTFIDS vmlinux.unstripped WARN: resolve_btfids: unresolved symbol bpf_probe_read_user_str_dynptr WARN: resolve_btfids: unresolved symbol bpf_probe_read_user_dynptr WARN: resolve_btfids: unresolved symbol bpf_probe_read_kernel_str_dynptr WARN: resolve_btfids: unresolved symbol bpf_probe_read_kernel_dynptr WARN: resolve_btfids: unresolved symbol bpf_copy_from_user_task_str_dynptr WARN: resolve_btfids: unresolved symbol bpf_copy_from_user_task_dynptr WARN: resolve_btfids: unresolved symbol bpf_copy_from_user_str_dynptr WARN: resolve_btfids: unresolved symbol bpf_copy_from_user_dynptr make[2]: *** [scripts/Makefile.vmlinux:72: vmlinux.unstripped] Error 255 make[2]: *** Deleting file 'vmlinux.unstripped' make[1]: *** [/repo/malin/upstream/linux/Makefile:1242: vmlinux] Error 2 make: *** [Makefile:248: __sub-make] Error 2 Guard these symbols with #ifdef CONFIG_BPF_EVENTS to resolve the problem. Fixes: a498ee7576de ("bpf: Implement dynptr copy kfuncs") Reported-by: Yong Gu <yong.g.gu@ericsson.com> Acked-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Malin Jonsson <malin.jonsson@est.tech> Link: https://lore.kernel.org/r/20251024151436.139131-1-malin.jonsson@est.tech Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-11-13bpf: Sync pending IRQ work before freeing ring bufferNoorain Eqbal1-0/+2
[ Upstream commit 4e9077638301816a7d73fa1e1b4c1db4a7e3b59c ] Fix a race where irq_work can be queued in bpf_ringbuf_commit() but the ring buffer is freed before the work executes. In the syzbot reproducer, a BPF program attached to sched_switch triggers bpf_ringbuf_commit(), queuing an irq_work. If the ring buffer is freed before this work executes, the irq_work thread may accesses freed memory. Calling `irq_work_sync(&rb->work)` ensures that all pending irq_work complete before freeing the buffer. Fixes: 457f44363a88 ("bpf: Implement BPF ring buffer and verifier support for it") Reported-by: syzbot+2617fc732430968b45d2@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=2617fc732430968b45d2 Tested-by: syzbot+2617fc732430968b45d2@syzkaller.appspotmail.com Signed-off-by: Noorain Eqbal <nooraineqbal@gmail.com> Link: https://lore.kernel.org/r/20251020180301.103366-1-nooraineqbal@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-11-13sched_ext: Mark scx_bpf_dsq_move_set_[slice|vtime]() with KF_RCUTejun Heo1-4/+4
commit 54e96258a6930909b690fd7e8889749231ba8085 upstream. scx_bpf_dsq_move_set_slice() and scx_bpf_dsq_move_set_vtime() take a DSQ iterator argument which has to be valid. Mark them with KF_RCU. Fixes: 4c30f5ce4f7a ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()") Cc: stable@vger.kernel.org # v6.12+ Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-02cpuset: Use new excpus for nocpu error check when enabling root partitionChen Ridong1-5/+1
[ Upstream commit 59d5de3655698679ad8fd2cc82228de4679c4263 ] A previous patch fixed a bug where new_prs should be assigned before checking housekeeping conflicts. This patch addresses another potential issue: the nocpu error check currently uses the xcpus which is not updated. Although no issue has been observed so far, the check should be performed using the new effective exclusive cpus. The comment has been removed because the function returns an error if nocpu checking fails, which is unrelated to the parent. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-11-02sched_ext: Keep bypass on between enable failure and scx_disable_workfn()Tejun Heo1-1/+1
[ Upstream commit 4a1d9d73aabc8f97f48c4f84f936de3b265ffd6f ] scx_enable() turns on the bypass mode while enable is in progress. If enabling fails, it turns off the bypass mode and then triggers scx_error(). scx_error() will trigger scx_disable_workfn() which will turn on the bypass mode again and unload the failed scheduler. This moves the system out of bypass mode between the enable error path and the disable path, which is unnecessary and can be brittle - e.g. the thread running scx_enable() may already be on the failed scheduler and can be switched out before it triggers scx_error() leading to a stall. The watchdog would eventually kick in, so the situation isn't critical but is still suboptimal. There is nothing to be gained by turning off the bypass mode between scx_enable() failure and scx_disable_workfn(). Keep bypass on. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-11-02seccomp: passthrough uprobe systemcall without filteringJiri Olsa1-7/+25
[ Upstream commit 89d1d8434d246c96309a6068dfcf9e36dc61227b ] Adding uprobe as another exception to the seccomp filter alongside with the uretprobe syscall. Same as the uretprobe the uprobe syscall is installed by kernel as replacement for the breakpoint exception and is limited to x86_64 arch and isn't expected to ever be supported in i386. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kees Cook <kees@kernel.org> Link: https://lore.kernel.org/r/20250720112133.244369-21-jolsa@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-11-02perf: Skip user unwind if the task is a kernel threadJosh Poimboeuf1-1/+2
[ Upstream commit 16ed389227651330879e17bd83d43bd234006722 ] If the task is not a user thread, there's no user stack to unwind. Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250820180428.930791978@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-11-02perf: Have get_perf_callchain() return NULL if crosstask and user are setJosh Poimboeuf1-5/+5
[ Upstream commit 153f9e74dec230f2e070e16fa061bc7adfd2c450 ] get_perf_callchain() doesn't support cross-task unwinding for user space stacks, have it return NULL if both the crosstask and user arguments are set. Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250820180428.426423415@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-11-02perf: Use current->flags & PF_KTHREAD|PF_USER_WORKER instead of current->mm ↵Steven Rostedt2-5/+5
== NULL [ Upstream commit 90942f9fac05702065ff82ed0bade0d08168d4ea ] To determine if a task is a kernel thread or not, it is more reliable to use (current->flags & (PF_KTHREAD|PF_USER_WORKERi)) than to rely on current->mm being NULL. That is because some kernel tasks (io_uring helpers) may have a mm field. Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250820180428.592367294@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-11-02genirq/manage: Add buslock back in to enable_irq()Charles Keepax1-1/+1
[ Upstream commit ef3330b99c01bda53f2a189b58bed8f6b7397f28 ] The locking was changed from a buslock to a plain lock, but the patch description states there was no functional change. Assuming this was accidental so reverting to using the buslock. Fixes: bddd10c55407 ("genirq/manage: Rework enable_irq()") Signed-off-by: Charles Keepax <ckeepax@opensource.cirrus.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20251023154901.1333755-4-ckeepax@opensource.cirrus.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-11-02genirq/manage: Add buslock back in to __disable_irq_nosync()Charles Keepax1-1/+1
[ Upstream commit 56363e25f79fe83e63039c5595b8cd9814173d37 ] The locking was changed from a buslock to a plain lock, but the patch description states there was no functional change. Assuming this was accidental so reverting to using the buslock. Fixes: 1b7444446724 ("genirq/manage: Rework __disable_irq_nosync()") Signed-off-by: Charles Keepax <ckeepax@opensource.cirrus.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20251023154901.1333755-3-ckeepax@opensource.cirrus.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-11-02genirq/chip: Add buslock back in to irq_set_handler()Charles Keepax1-1/+1
[ Upstream commit 5d7e45dd670e42df4836afeaa9baf9d41ca4b434 ] The locking was changed from a buslock to a plain lock, but the patch description states there was no functional change. Assuming this was accidental so reverting to using the buslock. Fixes: 5cd05f3e2315 ("genirq/chip: Rework irq_set_handler() variants") Signed-off-by: Charles Keepax <ckeepax@opensource.cirrus.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20251023154901.1333755-2-ckeepax@opensource.cirrus.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-11-02timekeeping: Fix aux clocks sysfs initialization loop boundHaofeng Li1-1/+1
[ Upstream commit 39a9ed0fb6dac58547afdf9b6cb032d326a3698f ] The loop in tk_aux_sysfs_init() uses `i <= MAX_AUX_CLOCKS` as the termination condition, which results in 9 iterations (i=0 to 8) when MAX_AUX_CLOCKS is defined as 8. However, the kernel is designed to support only up to 8 auxiliary clocks. This off-by-one error causes the creation of a 9th sysfs entry that exceeds the intended auxiliary clock range. Fix the loop bound to use `i < MAX_AUX_CLOCKS` to ensure exactly 8 auxiliary clock entries are created, matching the design specification. Fixes: 7b95663a3d96 ("timekeeping: Provide interface to control auxiliary clocks") Signed-off-by: Haofeng Li <lihaofeng@kylinos.cn> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/tencent_2376993D9FC06A3616A4F981B3DE1C599607@qq.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-11-02sched_ext: Sync error_irq_work before freeing scx_schedTejun Heo1-0/+2
[ Upstream commit efeeaac9ae9763f9c953e69633c86bc3031e39b5 ] By the time scx_sched_free_rcu_work() runs, the scx_sched is no longer reachable. However, a previously queued error_irq_work may still be pending or running. Ensure it completes before proceeding with teardown. Fixes: bff3b5aec1b7 ("sched_ext: Move disable machinery into scx_sched") Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-11-02sched_ext: Put event_stats_cpu in struct scx_sched_pcpuTejun Heo2-16/+19
[ Upstream commit bcb7c2305682c77a8bfdbfe37106b314ac10110f ] scx_sched.event_stats_cpu is the percpu counters that are used to track stats. Introduce struct scx_sched_pcpu and move the counters inside. This will ease adding more per-cpu fields. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com> Stable-dep-of: efeeaac9ae97 ("sched_ext: Sync error_irq_work before freeing scx_sched") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-11-02sched_ext: Move internal type and accessor definitions to ext_internal.hTejun Heo4-1057/+1062
[ Upstream commit 0c2b8356e430229efef42b03bd765a2a7ecf73fd ] There currently isn't a place to place SCX-internal types and accessors to be shared between ext.c and ext_idle.c. Create kernel/sched/ext_internal.h and move internal type and accessor definitions there. This trims ext.c a bit and makes future additions easier. Pure code reorganization. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com> Stable-dep-of: efeeaac9ae97 ("sched_ext: Sync error_irq_work before freeing scx_sched") Signed-off-by: Sasha Levin <sashal@kernel.org>