summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
42 hoursfutex: Clear stale exiting pointer in futex_lock_pi() retry pathDavidlohr Bueso1-1/+2
commit 210d36d892de5195e6766c45519dfb1e65f3eb83 upstream. Fuzzying/stressing futexes triggered: WARNING: kernel/futex/core.c:825 at wait_for_owner_exiting+0x7a/0x80, CPU#11: futex_lock_pi_s/524 When futex_lock_pi_atomic() sees the owner is exiting, it returns -EBUSY and stores a refcounted task pointer in 'exiting'. After wait_for_owner_exiting() consumes that reference, the local pointer is never reset to nil. Upon a retry, if futex_lock_pi_atomic() returns a different error, the bogus pointer is passed to wait_for_owner_exiting(). CPU0 CPU1 CPU2 futex_lock_pi(uaddr) // acquires the PI futex exit() futex_cleanup_begin() futex_state = EXITING; futex_lock_pi(uaddr) futex_lock_pi_atomic() attach_to_pi_owner() // observes EXITING *exiting = owner; // takes ref return -EBUSY wait_for_owner_exiting(-EBUSY, owner) put_task_struct(); // drops ref // exiting still points to owner goto retry; futex_lock_pi_atomic() lock_pi_update_atomic() cmpxchg(uaddr) *uaddr ^= WAITERS // whatever // value changed return -EAGAIN; wait_for_owner_exiting(-EAGAIN, exiting) // stale WARN_ON_ONCE(exiting) Fix this by resetting upon retry, essentially aligning it with requeue_pi. Fixes: 3ef240eaff36 ("futex: Prevent exit livelock") Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260326001759.4129680-1-dave@stgolabs.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
42 hourstracing: Fix potential deadlock in cpu hotplug with osnoiseLuo Haiyang1-5/+5
[ Upstream commit 1f9885732248d22f788e4992c739a98c88ab8a55 ] The following sequence may leads deadlock in cpu hotplug: task1 task2 task3 ----- ----- ----- mutex_lock(&interface_lock) [CPU GOING OFFLINE] cpus_write_lock(); osnoise_cpu_die(); kthread_stop(task3); wait_for_completion(); osnoise_sleep(); mutex_lock(&interface_lock); cpus_read_lock(); [DEAD LOCK] Fix by swap the order of cpus_read_lock() and mutex_lock(&interface_lock). Cc: stable@vger.kernel.org Cc: <mathieu.desnoyers@efficios.com> Cc: <zhang.run@zte.com.cn> Cc: <yang.tao172@zte.com.cn> Cc: <ran.xiaokai@zte.com.cn> Fixes: bce29ac9ce0bb ("trace: Add osnoise tracer") Link: https://patch.msgid.link/20260326141953414bVSj33dAYktqp9Oiyizq8@zte.com.cn Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Luo Haiyang <luo.haiyang@zte.com.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
42 hourstracing: Switch trace_osnoise.c code over to use guard() and __free()Steven Rostedt1-27/+13
[ Upstream commit 930d2b32c0af6895ba4c6ca6404e7f7b6dc214ed ] The osnoise_hotplug_workfn() grabs two mutexes and cpu_read_lock(). It has various gotos to handle unlocking them. Switch them over to guard() and let the compiler worry about it. The osnoise_cpus_read() has a temporary mask_str allocated and there's some gotos to make sure it gets freed on error paths. Switch that over to __free() to let the compiler worry about it. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/20241225222931.517329690@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Stable-dep-of: 1f9885732248 ("tracing: Fix potential deadlock in cpu hotplug with osnoise") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
42 hoursalarmtimer: Fix argument order in alarm_timer_forward()Zhan Xusheng1-1/+1
commit 5d16467ae56343b9205caedf85e3a131e0914ad8 upstream. alarm_timer_forward() passes arguments to alarm_forward() in the wrong order: alarm_forward(alarm, timr->it_interval, now); However, alarm_forward() is defined as: u64 alarm_forward(struct alarm *alarm, ktime_t now, ktime_t interval); and uses the second argument as the current time: delta = ktime_sub(now, alarm->node.expires); Passing the interval as "now" results in incorrect delta computation, which can lead to missed expirations or incorrect overrun accounting. This issue has been present since the introduction of alarm_timer_forward(). Fix this by swapping the arguments. Fixes: e7561f1633ac ("alarmtimer: Implement forward callback") Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260323061130.29991-1-zhanxusheng@xiaomi.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
42 hourssysctl: fix uninitialized variable in proc_do_large_bitmapMarc Buerg1-1/+1
[ Upstream commit f63a9df7e3f9f842945d292a19d9938924f066f9 ] proc_do_large_bitmap() does not initialize variable c, which is expected to be set to a trailing character by proc_get_long(). However, proc_get_long() only sets c when the input buffer contains a trailing character after the parsed value. If c is not initialized it may happen to contain a '-'. If this is the case proc_do_large_bitmap() expects to be able to parse a second part of the input buffer. If there is no second part an unjustified -EINVAL will be returned. Initialize c to 0 to prevent returning -EINVAL on valid input. Fixes: 9f977fb7ae9d ("sysctl: add proc_do_large_bitmap") Signed-off-by: Marc Buerg <buermarc@googlemail.com> Reviewed-by: Joel Granados <joel.granados@kernel.org> Signed-off-by: Joel Granados <joel.granados@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
42 hoursPM: hibernate: Drain trailing zero pages on userspace restoreAlberto Garcia1-0/+11
[ Upstream commit 734eba62cd32cb9ceffa09e57cdc03d761528525 ] Commit 005e8dddd497 ("PM: hibernate: don't store zero pages in the image file") added an optimization to skip zero-filled pages in the hibernation image. On restore, zero pages are handled internally by snapshot_write_next() in a loop that processes them without returning to the caller. With the userspace restore interface, writing the last non-zero page to /dev/snapshot is followed by the SNAPSHOT_ATOMIC_RESTORE ioctl. At this point there are no more calls to snapshot_write_next() so any trailing zero pages are not processed, snapshot_image_loaded() fails because handle->cur is smaller than expected, the ioctl returns -EPERM and the image is not restored. The in-kernel restore path is not affected by this because the loop in load_image() in swap.c calls snapshot_write_next() until it returns 0. It is this final call that drains any trailing zero pages. Fixed by calling snapshot_write_next() in snapshot_write_finalize(), giving the kernel the chance to drain any trailing zero pages. Fixes: 005e8dddd497 ("PM: hibernate: don't store zero pages in the image file") Signed-off-by: Alberto Garcia <berto@igalia.com> Acked-by: Brian Geffon <bgeffon@google.com> Link: https://patch.msgid.link/ef5a7c5e3e3dbd17dcb20efaa0c53a47a23498bb.1773075892.git.berto@igalia.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
42 hoursPM: hibernate: Don't ignore return from set_memory_ro()Christophe Leroy4-15/+24
[ Upstream commit f4311756a83fb01c28a9bf841cbb7eb2b318eebf ] set_memory_ro() and set_memory_rw() can fail, leaving memory unprotected. Take the returned value into account and abort in case of failure. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Stable-dep-of: 734eba62cd32 ("PM: hibernate: Drain trailing zero pages on userspace restore") Signed-off-by: Sasha Levin <sashal@kernel.org>
42 hoursdma: swiotlb: add KMSAN annotations to swiotlb_bounce()Shigeru Yoshida1-2/+19
[ Upstream commit 6f770b73d0311a5b099277653199bb6421c4fed2 ] When a device performs DMA to a bounce buffer, KMSAN is unaware of the write and does not mark the data as initialized. When swiotlb_bounce() later copies the bounce buffer back to the original buffer, memcpy propagates the uninitialized shadow to the original buffer, causing false positive uninit-value reports. Fix this by calling kmsan_unpoison_memory() on the bounce buffer before copying it back in the DMA_FROM_DEVICE path, so that memcpy naturally propagates initialized shadow to the destination. Suggested-by: Alexander Potapenko <glider@google.com> Link: https://lore.kernel.org/CAG_fn=WUGta-paG1BgsGRoAR+fmuCgh3xo=R3XdzOt_-DqSdHw@mail.gmail.com/ Fixes: 7ade4f10779c ("dma: kmsan: unpoison DMA mappings") Signed-off-by: Shigeru Yoshida <syoshida@redhat.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20260315082750.2375581-1-syoshida@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>
42 hoursmodule: Fix kernel panic when a symbol st_shndx is out of boundsIhor Solodrai1-0/+7
[ Upstream commit f9d69d5e7bde2295eb7488a56f094ac8f5383b92 ] The module loader doesn't check for bounds of the ELF section index in simplify_symbols(): for (i = 1; i < symsec->sh_size / sizeof(Elf_Sym); i++) { const char *name = info->strtab + sym[i].st_name; switch (sym[i].st_shndx) { case SHN_COMMON: [...] default: /* Divert to percpu allocation if a percpu var. */ if (sym[i].st_shndx == info->index.pcpu) secbase = (unsigned long)mod_percpu(mod); else /** HERE --> **/ secbase = info->sechdrs[sym[i].st_shndx].sh_addr; sym[i].st_value += secbase; break; } } A symbol with an out-of-bounds st_shndx value, for example 0xffff (known as SHN_XINDEX or SHN_HIRESERVE), may cause a kernel panic: BUG: unable to handle page fault for address: ... RIP: 0010:simplify_symbols+0x2b2/0x480 ... Kernel panic - not syncing: Fatal exception This can happen when module ELF is legitimately using SHN_XINDEX or when it is corrupted. Add a bounds check in simplify_symbols() to validate that st_shndx is within the valid range before using it. This issue was discovered due to a bug in llvm-objcopy, see relevant discussion for details [1]. [1] https://lore.kernel.org/linux-modules/20251224005752.201911-1-ihor.solodrai@linux.dev/ Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev> Reviewed-by: Daniel Gomez <da.gomez@samsung.com> Reviewed-by: Petr Pavlu <petr.pavlu@suse.com> Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
42 hoursbpf: Fix undefined behavior in interpreter sdiv/smod for INT_MINJenny Guanni Qu1-8/+14
[ Upstream commit c77b30bd1dcb61f66c640ff7d2757816210c7cb0 ] The BPF interpreter's signed 32-bit division and modulo handlers use the kernel abs() macro on s32 operands. The abs() macro documentation (include/linux/math.h) explicitly states the result is undefined when the input is the type minimum. When DST contains S32_MIN (0x80000000), abs((s32)DST) triggers undefined behavior and returns S32_MIN unchanged on arm64/x86. This value is then sign-extended to u64 as 0xFFFFFFFF80000000, causing do_div() to compute the wrong result. The verifier's abstract interpretation (scalar32_min_max_sdiv) computes the mathematically correct result for range tracking, creating a verifier/interpreter mismatch that can be exploited for out-of-bounds map value access. Introduce abs_s32() which handles S32_MIN correctly by casting to u32 before negating, avoiding signed overflow entirely. Replace all 8 abs((s32)...) call sites in the interpreter's sdiv32/smod32 handlers. s32 is the only affected case -- the s64 division/modulo handlers do not use abs(). Fixes: ec0e2da95f72 ("bpf: Support new signed div/mod instructions.") Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Jenny Guanni Qu <qguanni@gmail.com> Link: https://lore.kernel.org/r/20260311011116.2108005-2-qguanni@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
42 hoursbpf: Release module BTF IDR before module unloadKumar Kartikeya Dwivedi1-4/+20
[ Upstream commit 146bd2a87a65aa407bb17fac70d8d583d19aba06 ] Gregory reported in [0] that the global_map_resize test when run in repeatedly ends up failing during program load. This stems from the fact that BTF reference has not dropped to zero after the previous run's module is unloaded, and the older module's BTF is still discoverable and visible. Later, in libbpf, load_module_btfs() will find the ID for this stale BTF, open its fd, and then it will be used during program load where later steps taking module reference using btf_try_get_module() fail since the underlying module for the BTF is gone. Logically, once a module is unloaded, it's associated BTF artifacts should become hidden. The BTF object inside the kernel may still remain alive as long its reference counts are alive, but it should no longer be discoverable. To fix this, let us call btf_free_id() from the MODULE_STATE_GOING case for the module unload to free the BTF associated IDR entry, and disable its discovery once module unload returns to user space. If a race happens during unload, the outcome is non-deterministic anyway. However, user space should be able to rely on the guarantee that once it has synchronously established a successful module unload, no more stale artifacts associated with this module can be obtained subsequently. Note that we must be careful to not invoke btf_free_id() in btf_put() when btf_is_module() is true now. There could be a window where the module unload drops a non-terminal reference, frees the IDR, but the same ID gets reused and the second unconditional btf_free_id() ends up releasing an unrelated entry. To avoid a special case for btf_is_module() case, set btf->id to zero to make btf_free_id() idempotent, such that we can unconditionally invoke it from btf_put(), and also from the MODULE_STATE_GOING case. Since zero is an invalid IDR, the idr_remove() should be a noop. Note that we can be sure that by the time we reach final btf_put() for btf_is_module() case, the btf_free_id() is already done, since the module itself holds the BTF reference, and it will call this function for the BTF before dropping its own reference. [0]: https://lore.kernel.org/bpf/cover.1773170190.git.grbell@redhat.com Fixes: 36e68442d1af ("bpf: Load and verify kernel module BTFs") Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Suggested-by: Martin KaFai Lau <martin.lau@kernel.org> Reported-by: Gregory Bell <grbell@redhat.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260312205307.1346991-1-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
42 hoursperf: Make sure to use pmu_ctx->pmu for groupsPeter Zijlstra1-11/+8
[ Upstream commit 4b9ce671960627b2505b3f64742544ae9801df97 ] Oliver reported that x86_pmu_del() ended up doing an out-of-bound memory access when group_sched_in() fails and needs to roll back. This *should* be handled by the transaction callbacks, but he found that when the group leader is a software event, the transaction handlers of the wrong PMU are used. Despite the move_group case in perf_event_open() and group_sched_in() using pmu_ctx->pmu. Turns out, inherit uses event->pmu to clone the events, effectively undoing the move_group case for all inherited contexts. Fix this by also making inherit use pmu_ctx->pmu, ensuring all inherited counters end up in the same pmu context. Similarly, __perf_event_read() should use equally use pmu_ctx->pmu for the group case. Fixes: bd2756811766 ("perf: Rewrite core context handling") Reported-by: Oliver Rosenberg <olrose55@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ian Rogers <irogers@google.com> Link: https://patch.msgid.link/20260309133713.GB606826@noisy.programming.kicks-ass.net Signed-off-by: Sasha Levin <sashal@kernel.org>
42 hoursperf: Extract a few helpersPeter Zijlstra1-17/+22
[ Upstream commit 9a32bd9901fe5b1dcf544389dbf04f3b0a2fbab4 ] The context time update code is repeated verbatim a few times. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Reviewed-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240807115550.031212518@infradead.org Stable-dep-of: 4b9ce6719606 ("perf: Make sure to use pmu_ctx->pmu for groups") Signed-off-by: Sasha Levin <sashal@kernel.org>
10 dayssched: idle: Consolidate the handling of two special casesRafael J. Wysocki1-9/+21
[ Upstream commit f4c31b07b136839e0fb3026f8a5b6543e3b14d2f ] There are two special cases in the idle loop that are handled inconsistently even though they are analogous. The first one is when a cpuidle driver is absent and the default CPU idle time power management implemented by the architecture code is used. In that case, the scheduler tick is stopped every time before invoking default_idle_call(). The second one is when a cpuidle driver is present, but there is only one idle state in its table. In that case, the scheduler tick is never stopped at all. Since each of these approaches has its drawbacks, reconcile them with the help of one simple heuristic. Namely, stop the tick if the CPU has been woken up by it in the previous iteration of the idle loop, or let it tick otherwise. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Qais Yousef <qyousef@layalina.io> Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com> Fixes: ed98c3491998 ("sched: idle: Do not stop the tick before cpuidle_idle_call()") [ rjw: Added Fixes tag, changelog edits ] Link: https://patch.msgid.link/4741364.LvFx2qVVIh@rafael.j.wysocki Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
10 dayskprobes: Remove unneeded warnings from __arm_kprobe_ftrace()Masami Hiramatsu (Google)1-2/+2
[ Upstream commit 5ef268cb7a0aac55521fd9881f1939fa94a8988e ] Remove unneeded warnings for handled errors from __arm_kprobe_ftrace() because all caller handled the error correctly. Link: https://lore.kernel.org/all/177261531182.1312989.8737778408503961141.stgit@mhiramat.tok.corp.google.com/ Reported-by: Zw Tang <shicenci@gmail.com> Closes: https://lore.kernel.org/all/CAPHJ_V+J6YDb_wX2nhXU6kh466Dt_nyDSas-1i_Y8s7tqY-Mzw@mail.gmail.com/ Fixes: 9c89bb8e3272 ("kprobes: treewide: Cleanup the error messages for kprobes") Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 dayskprobes: Remove unneeded gotoMasami Hiramatsu (Google)1-24/+21
[ Upstream commit 5e5b8b49335971b68b54afeb0e7ded004945af07 ] Remove unneeded gotos. Since the labels referred by these gotos have only one reference for each, we can replace those gotos with the referred code. Link: https://lore.kernel.org/all/173371211203.480397.13988907319659165160.stgit@devnote2/ Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Stable-dep-of: 5ef268cb7a0a ("kprobes: Remove unneeded warnings from __arm_kprobe_ftrace()") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 dayssched/fair: Fix pelt clock sync when entering idleVincent Guittot2-6/+6
[ Upstream commit 98c88dc8a1ace642d9021b103b28cba7b51e3abc ] Samuel and Alex reported regressions of the util_avg of RT rq with commit 17e3e88ed0b6 ("sched/fair: Fix pelt lost idle time detection"). It happens that fair is updating and syncing the pelt clock with task one when pick_next_task_fair() fails to pick a task but before the prev scheduling class got a chance to update its pelt signals. Move update_idle_rq_clock_pelt() in set_next_task_idle() which is called after prev class has been called. Fixes: 17e3e88ed0b6 ("sched/fair: Fix pelt lost idle time detection") Reported-by: Samuel Wu <wusamuel@google.com> Closes: https://lore.kernel.org/all/CAG2KctpO6VKS6GN4QWDji0t92_gNBJ7HjjXrE+6H+RwRXt=iLg@mail.gmail.com/ Reported-by: Alex Hoh <Alex.Hoh@mediatek.com> Closes: https://lore.kernel.org/all/8cf19bf0e0054dcfed70e9935029201694f1bb5a.camel@mediatek.com/ Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Samuel Wu <wusamuel@google.com> Tested-by: Alex Hoh <Alex.Hoh@mediatek.com> Link: https://patch.msgid.link/20260121163317.505635-1-vincent.guittot@linaro.org (cherry picked from commit 98c88dc8a1ace642d9021b103b28cba7b51e3abc) [ wusamuel: Did not include line 'exec_start = rq_clock_task()', which is not present in 6.6.y but found in mainline ] Signed-off-by: Samuel Wu <wusamuel@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 daysrcu/nocb: Fix possible invalid rdp's->nocb_cb_kthread pointer accessZqiang1-3/+2
[ Upstream commit 1bba3900ca18bdae28d1b9fa10f16a8f8cb2ada1 ] In the preparation stage of CPU online, if the corresponding the rdp's->nocb_cb_kthread does not exist, will be created, there is a situation where the rdp's rcuop kthreads creation fails, and then de-offload this CPU's rdp, does not assign this CPU's rdp->nocb_cb_kthread pointer, but this rdp's->nocb_gp_rdp and rdp's->rdp_gp->nocb_gp_kthread is still valid. This will cause the subsequent re-offload operation of this offline CPU, which will pass the conditional check and the kthread_unpark() will access invalid rdp's->nocb_cb_kthread pointer. This commit therefore use rdp's->nocb_gp_kthread instead of rdp_gp's->nocb_gp_kthread for safety check. Signed-off-by: Zqiang <qiang.zhang1211@gmail.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org> [ Minor conflict resolved. ] Signed-off-by: Jianqiang kang <jianqkang@sina.cn> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 daysx86/uprobes: Fix XOL allocation failure for 32-bit tasksOleg Nesterov1-3/+7
[ Upstream commit d55c571e4333fac71826e8db3b9753fadfbead6a ] This script #!/usr/bin/bash echo 0 > /proc/sys/kernel/randomize_va_space echo 'void main(void) {}' > TEST.c # -fcf-protection to ensure that the 1st endbr32 insn can't be emulated gcc -m32 -fcf-protection=branch TEST.c -o test bpftrace -e 'uprobe:./test:main {}' -c ./test "hangs", the probed ./test task enters an endless loop. The problem is that with randomize_va_space == 0 get_unmapped_area(TASK_SIZE - PAGE_SIZE) called by xol_add_vma() can not just return the "addr == TASK_SIZE - PAGE_SIZE" hint, this addr is used by the stack vma. arch_get_unmapped_area_topdown() doesn't take TIF_ADDR32 into account and in_32bit_syscall() is false, this leads to info.high_limit > TASK_SIZE. vm_unmapped_area() happily returns the high address > TASK_SIZE and then get_unmapped_area() returns -ENOMEM after the "if (addr > TASK_SIZE - len)" check. handle_swbp() doesn't report this failure (probably it should) and silently restarts the probed insn. Endless loop. I think that the right fix should change the x86 get_unmapped_area() paths to rely on TIF_ADDR32 rather than in_32bit_syscall(). Note also that if CONFIG_X86_X32_ABI=y, in_x32_syscall() falsely returns true in this case because ->orig_ax = -1. But we need a simple fix for -stable, so this patch just sets TS_COMPAT if the probed task is 32-bit to make in_ia32_syscall() true. Fixes: 1b028f784e8c ("x86/mm: Introduce mmap_compat_base() for 32-bit mmap()") Reported-by: Paulo Andrade <pandrade@redhat.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/aV5uldEvV7pb4RA8@redhat.com/ Cc: stable@vger.kernel.org Link: https://patch.msgid.link/aWO7Fdxn39piQnxu@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 daysbpf: Forget ranges when refining tnum after JSETPaul Chaignon1-0/+4
commit 6279846b9b2532e1b04559ef8bd0dec049f29383 upstream. Syzbot reported a kernel warning due to a range invariant violation on the following BPF program. 0: call bpf_get_netns_cookie 1: if r0 == 0 goto <exit> 2: if r0 & Oxffffffff goto <exit> The issue is on the path where we fall through both jumps. That path is unreachable at runtime: after insn 1, we know r0 != 0, but with the sign extension on the jset, we would only fallthrough insn 2 if r0 == 0. Unfortunately, is_branch_taken() isn't currently able to figure this out, so the verifier walks all branches. The verifier then refines the register bounds using the second condition and we end up with inconsistent bounds on this unreachable path: 1: if r0 == 0 goto <exit> r0: u64=[0x1, 0xffffffffffffffff] var_off=(0, 0xffffffffffffffff) 2: if r0 & 0xffffffff goto <exit> r0 before reg_bounds_sync: u64=[0x1, 0xffffffffffffffff] var_off=(0, 0) r0 after reg_bounds_sync: u64=[0x1, 0] var_off=(0, 0) Improving the range refinement for JSET to cover all cases is tricky. We also don't expect many users to rely on JSET given LLVM doesn't generate those instructions. So instead of improving the range refinement for JSETs, Eduard suggested we forget the ranges whenever we're narrowing tnums after a JSET. This patch implements that approach. Reported-by: syzbot+c711ce17dd78e5d4fdcf@syzkaller.appspotmail.com Suggested-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Link: https://lore.kernel.org/r/9d4fd6432a095d281f815770608fdcd16028ce0b.1752171365.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> [ shung-hsi.yu: no detection or kernel warning for invariant violation before 6.8, but the same umin=1,umax=0 state can occur when jset is preceed by r0 < 1. Changes were made to adapt to older range refinement logic before commit 67420501e868 ("bpf: generalize reg_set_min_max() to handle non-const register comparisons"). ] Signed-off-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 daystracing: Add recursion protection in kernel stack trace recordingSteven Rostedt1-0/+6
[ Upstream commit 5f1ef0dfcb5b7f4a91a9b0e0ba533efd9f7e2cdb ] A bug was reported about an infinite recursion caused by tracing the rcu events with the kernel stack trace trigger enabled. The stack trace code called back into RCU which then called the stack trace again. Expand the ftrace recursion protection to add a set of bits to protect events from recursion. Each bit represents the context that the event is in (normal, softirq, interrupt and NMI). Have the stack trace code use the interrupt context to protect against recursion. Note, the bug showed an issue in both the RCU code as well as the tracing stacktrace code. This only handles the tracing stack trace side of the bug. The RCU fix will be handled separately. Link: https://lore.kernel.org/all/20260102122807.7025fc87@gandalf.local.home/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Boqun Feng <boqun.feng@gmail.com> Link: https://patch.msgid.link/20260105203141.515cd49f@gandalf.local.home Reported-by: Yao Kai <yaokai34@huawei.com> Tested-by: Yao Kai <yaokai34@huawei.com> Fixes: 5f5fa7ea89dc ("rcu: Don't use negative nesting depth in __rcu_read_unlock()") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Leon Chen <leonchen.oss@139.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 daystracing: Fix trace_buf_size= cmdline parameter with sizes >= 2GCalvin Owens1-3/+3
commit d008ba8be8984760e36d7dcd4adbd5a41a645708 upstream. Some of the sizing logic through tracer_alloc_buffers() uses int internally, causing unexpected behavior if the user passes a value that does not fit in an int (on my x86 machine, the result is uselessly tiny buffers). Fix by plumbing the parameter's real type (unsigned long) through to the ring buffer allocation functions, which already use unsigned long. It has always been possible to create larger ring buffers via the sysfs interface: this only affects the cmdline parameter. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/bff42a4288aada08bdf74da3f5b67a2c28b761f8.1772852067.git.calvin@wbinvd.org Fixes: 73c5162aa362 ("tracing: keep ring buffer to minimum size till used") Signed-off-by: Calvin Owens <calvin@wbinvd.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 daystracing: Fix syscall events activation by ensuring refcount hits zeroHuiwen He1-15/+37
commit 0a663b764dbdf135a126284f454c9f01f95a87d4 upstream. When multiple syscall events are specified in the kernel command line (e.g., trace_event=syscalls:sys_enter_openat,syscalls:sys_enter_close), they are often not captured after boot, even though they appear enabled in the tracing/set_event file. The issue stems from how syscall events are initialized. Syscall tracepoints require the global reference count (sys_tracepoint_refcount) to transition from 0 to 1 to trigger the registration of the syscall work (TIF_SYSCALL_TRACEPOINT) for tasks, including the init process (pid 1). The current implementation of early_enable_events() with disable_first=true used an interleaved sequence of "Disable A -> Enable A -> Disable B -> Enable B". If multiple syscalls are enabled, the refcount never drops to zero, preventing the 0->1 transition that triggers actual registration. Fix this by splitting early_enable_events() into two distinct phases: 1. Disable all events specified in the buffer. 2. Enable all events specified in the buffer. This ensures the refcount hits zero before re-enabling, allowing syscall events to be properly activated during early boot. The code is also refactored to use a helper function to avoid logic duplication between the disable and enable phases. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260224023544.1250787-1-hehuiwen@kylinos.cn Fixes: ce1039bd3a89 ("tracing: Fix enabling of syscall events on the command line") Signed-off-by: Huiwen He <hehuiwen@kylinos.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 daystime/jiffies: Mark jiffies_64_to_clock_t() notraceSteven Rostedt1-1/+1
[ Upstream commit 755a648e78f12574482d4698d877375793867fa1 ] The trace_clock_jiffies() function that handles the "uptime" clock for tracing calls jiffies_64_to_clock_t(). This causes the function tracer to constantly recurse when the tracing clock is set to "uptime". Mark it notrace to prevent unnecessary recursion when using the "uptime" clock. Fixes: 58d4e21e50ff3 ("tracing: Fix wraparound problems in "uptime" trace clock") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260306212403.72270bb2@robin Signed-off-by: Sasha Levin <sashal@kernel.org>
10 dayskprobes: avoid crash when rmmod/insmod after ftrace killedMasami Hiramatsu (Google)1-0/+4
commit e113f0b46d19626ec15388bcb91432c9a4fd6261 upstream. After we hit ftrace is killed by some errors, the kernel crash if we remove modules in which kprobe probes. BUG: unable to handle page fault for address: fffffbfff805000d PGD 817fcc067 P4D 817fcc067 PUD 817fc8067 PMD 101555067 PTE 0 Oops: Oops: 0000 [#1] SMP KASAN PTI CPU: 4 UID: 0 PID: 2012 Comm: rmmod Tainted: G W OE Tainted: [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE RIP: 0010:kprobes_module_callback+0x89/0x790 RSP: 0018:ffff88812e157d30 EFLAGS: 00010a02 RAX: 1ffffffff805000d RBX: dffffc0000000000 RCX: ffffffff86a8de90 RDX: ffffed1025c2af9b RSI: 0000000000000008 RDI: ffffffffc0280068 RBP: 0000000000000000 R08: 0000000000000001 R09: ffffed1025c2af9a R10: ffff88812e157cd7 R11: 205d323130325420 R12: 0000000000000002 R13: ffffffffc0290488 R14: 0000000000000002 R15: ffffffffc0280040 FS: 00007fbc450dd740(0000) GS:ffff888420331000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: fffffbfff805000d CR3: 000000010f624000 CR4: 00000000000006f0 Call Trace: <TASK> notifier_call_chain+0xc6/0x280 blocking_notifier_call_chain+0x60/0x90 __do_sys_delete_module.constprop.0+0x32a/0x4e0 do_syscall_64+0x5d/0xfa0 entry_SYSCALL_64_after_hwframe+0x76/0x7e This is because the kprobe on ftrace does not correctly handles the kprobe_ftrace_disabled flag set by ftrace_kill(). To prevent this error, check kprobe_ftrace_disabled in __disarm_kprobe_ftrace() and skip all ftrace related operations. Link: https://lore.kernel.org/all/176473947565.1727781.13110060700668331950.stgit@mhiramat.tok.corp.google.com/ Reported-by: Ye Bin <yebin10@huawei.com> Closes: https://lore.kernel.org/all/20251125020536.2484381-1-yebin@huaweicloud.com/ Fixes: ae6aa16fdc16 ("kprobes: introduce ftrace based optimization") Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 dayscgroup: fix race between task migration and iterationQingye Zhao1-0/+1
commit 5ee01f1a7343d6a3547b6802ca2d4cdce0edacb1 upstream. When a task is migrated out of a css_set, cgroup_migrate_add_task() first moves it from cset->tasks to cset->mg_tasks via: list_move_tail(&task->cg_list, &cset->mg_tasks); If a css_task_iter currently has it->task_pos pointing to this task, css_set_move_task() calls css_task_iter_skip() to keep the iterator valid. However, since the task has already been moved to ->mg_tasks, the iterator is advanced relative to the mg_tasks list instead of the original tasks list. As a result, remaining tasks on cset->tasks, as well as tasks queued on cset->mg_tasks, can be skipped by iteration. Fix this by calling css_set_skip_task_iters() before unlinking task->cg_list from cset->tasks. This advances all active iterators to the next task on cset->tasks, so iteration continues correctly even when a task is concurrently being migrated. This race is hard to hit in practice without instrumentation, but it can be reproduced by artificially slowing down cgroup_procs_show(). For example, on an Android device a temporary /sys/kernel/cgroup/cgroup_test knob can be added to inject a delay into cgroup_procs_show(), and then: 1) Spawn three long-running tasks (PIDs 101, 102, 103). 2) Create a test cgroup and move the tasks into it. 3) Enable a large delay via /sys/kernel/cgroup/cgroup_test. 4) In one shell, read cgroup.procs from the test cgroup. 5) Within the delay window, in another shell migrate PID 102 by writing it to a different cgroup.procs file. Under this setup, cgroup.procs can intermittently show only PID 101 while skipping PID 103. Once the migration completes, reading the file again shows all tasks as expected. Note that this change does not allow removing the existing css_set_skip_task_iters() call in css_set_move_task(). The new call in cgroup_migrate_add_task() only handles iterators that are racing with migration while the task is still on cset->tasks. Iterators may also start after the task has been moved to cset->mg_tasks. If we dropped css_set_skip_task_iters() from css_set_move_task(), such iterators could keep task_pos pointing to a migrating task, causing css_task_iter_advance() to malfunction on the destination css_set, up to and including crashes or infinite loops. The race window between migration and iteration is very small, and css_task_iter is not on a hot path. In the worst case, when an iterator is positioned on the first thread of the migrating process, cgroup_migrate_add_task() may have to skip multiple tasks via css_set_skip_task_iters(). However, this only happens when migration and iteration actually race, so the performance impact is negligible compared to the correctness fix provided here. Fixes: b636fd38dc40 ("cgroup: Implement css_task_iter_skip()") Cc: stable@vger.kernel.org # v5.2+ Signed-off-by: Qingye Zhao <zhaoqingye@honor.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 dayssched: idle: Make skipping governor callbacks more consistentRafael J. Wysocki1-1/+10
[ Upstream commit d557640e4ce589a24dca5ca7ce3b9680f471325f ] If the cpuidle governor .select() callback is skipped because there is only one idle state in the cpuidle driver, the .reflect() callback should be skipped as well, at least for consistency (if not for correctness), so do it. Fixes: e5c9ffc6ae1b ("cpuidle: Skip governor when only one idle state is available") Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://patch.msgid.link/12857700.O9o76ZdvQC@rafael.j.wysocki Signed-off-by: Sasha Levin <sashal@kernel.org>
10 daysunshare: fix unshare_fs() handlingAl Viro1-1/+1
[ Upstream commit 6c4b2243cb6c0755159bd567130d5e12e7b10d9f ] There's an unpleasant corner case in unshare(2), when we have a CLONE_NEWNS in flags and current->fs hadn't been shared at all; in that case copy_mnt_ns() gets passed current->fs instead of a private copy, which causes interesting warts in proof of correctness] > I guess if private means fs->users == 1, the condition could still be true. Unfortunately, it's worse than just a convoluted proof of correctness. Consider the case when we have CLONE_NEWCGROUP in addition to CLONE_NEWNS (and current->fs->users == 1). We pass current->fs to copy_mnt_ns(), all right. Suppose it succeeds and flips current->fs->{pwd,root} to corresponding locations in the new namespace. Now we proceed to copy_cgroup_ns(), which fails (e.g. with -ENOMEM). We call put_mnt_ns() on the namespace created by copy_mnt_ns(), it's destroyed and its mount tree is dissolved, but... current->fs->root and current->fs->pwd are both left pointing to now detached mounts. They are pinning those, so it's not a UAF, but it leaves the calling process with unshare(2) failing with -ENOMEM _and_ leaving it with pwd and root on detached isolated mounts. The last part is clearly a bug. There is other fun related to that mess (races with pivot_root(), including the one between pivot_root() and fork(), of all things), but this one is easy to isolate and fix - treat CLONE_NEWNS as "allocate a new fs_struct even if it hadn't been shared in the first place". Sure, we could go for something like "if both CLONE_NEWNS *and* one of the things that might end up failing after copy_mnt_ns() call in create_new_namespaces() are set, force allocation of new fs_struct", but let's keep it simple - the cost of copy_fs_struct() is trivial. Another benefit is that copy_mnt_ns() with CLONE_NEWNS *always* gets a freshly allocated fs_struct, yet to be attached to anything. That seriously simplifies the analysis... FWIW, that bug had been there since the introduction of unshare(2) ;-/ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://patch.msgid.link/20260207082524.GE3183987@ZenIV Tested-by: Waiman Long <longman@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
10 daystracing: Add NULL pointer check to trigger_data_free()Guenter Roeck1-0/+3
[ Upstream commit 457965c13f0837a289c9164b842d0860133f6274 ] If trigger_data_alloc() fails and returns NULL, event_hist_trigger_parse() jumps to the out_free error path. While kfree() safely handles a NULL pointer, trigger_data_free() does not. This causes a NULL pointer dereference in trigger_data_free() when evaluating data->cmd_ops->set_filter. Fix the problem by adding a NULL pointer check to trigger_data_free(). The problem was found by an experimental code review agent based on gemini-3.1-pro while reviewing backports into v6.18.y. Cc: Miaoqian Lin <linmq006@gmail.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://patch.msgid.link/20260305193339.2810953-1-linux@roeck-us.net Fixes: 0550069cc25f ("tracing: Properly process error handling in event_hist_trigger_parse()") Assisted-by: Gemini:gemini-3.1-pro Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
10 daysbpf: Fix a UAF issue in bpf_trampoline_link_cgroup_shimLang Xu1-3/+1
[ Upstream commit 56145d237385ca0e7ca9ff7b226aaf2eb8ef368b ] The root cause of this bug is that when 'bpf_link_put' reduces the refcount of 'shim_link->link.link' to zero, the resource is considered released but may still be referenced via 'tr->progs_hlist' in 'cgroup_shim_find'. The actual cleanup of 'tr->progs_hlist' in 'bpf_shim_tramp_link_release' is deferred. During this window, another process can cause a use-after-free via 'bpf_trampoline_link_cgroup_shim'. Based on Martin KaFai Lau's suggestions, I have created a simple patch. To fix this: Add an atomic non-zero check in 'bpf_trampoline_link_cgroup_shim'. Only increment the refcount if it is not already zero. Testing: I verified the fix by adding a delay in 'bpf_shim_tramp_link_release' to make the bug easier to trigger: static void bpf_shim_tramp_link_release(struct bpf_link *link) { /* ... */ if (!shim_link->trampoline) return; + msleep(100); WARN_ON_ONCE(bpf_trampoline_unlink_prog(&shim_link->link, shim_link->trampoline, NULL)); bpf_trampoline_put(shim_link->trampoline); } Before the patch, running a PoC easily reproduced the crash(almost 100%) with a call trace similar to KaiyanM's report. After the patch, the bug no longer occurs even after millions of iterations. Fixes: 69fd337a975c ("bpf: per-cgroup lsm flavor") Reported-by: Kaiyan Mei <M202472210@hust.edu.cn> Closes: https://lore.kernel.org/bpf/3c4ebb0b.46ff8.19abab8abe2.Coremail.kaiyanm@hust.edu.cn/ Signed-off-by: Lang Xu <xulang@uniontech.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/279EEE1BA1DDB49D+20260303095217.34436-1-xulang@uniontech.com Signed-off-by: Sasha Levin <sashal@kernel.org>
10 daysbpf: export bpf_link_inc_not_zero.Kui-Feng Lee1-1/+2
[ Upstream commit 67c3e8353f45c27800eecc46e00e8272f063f7d1 ] bpf_link_inc_not_zero() will be used by kernel modules. We will use it in bpf_testmod.c later. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/20240530065946.979330-5-thinker.li@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Stable-dep-of: 56145d237385 ("bpf: Fix a UAF issue in bpf_trampoline_link_cgroup_shim") Signed-off-by: Sasha Levin <sashal@kernel.org>
10 daysbpf: Fix stack-out-of-bounds write in devmapKohei Enju1-5/+17
[ Upstream commit b7bf516c3ecd9a2aae2dc2635178ab87b734fef1 ] get_upper_ifindexes() iterates over all upper devices and writes their indices into an array without checking bounds. Also the callers assume that the max number of upper devices is MAX_NEST_DEV and allocate excluded_devices[1+MAX_NEST_DEV] on the stack, but that assumption is not correct and the number of upper devices could be larger than MAX_NEST_DEV (e.g., many macvlans), causing a stack-out-of-bounds write. Add a max parameter to get_upper_ifindexes() to avoid the issue. When there are too many upper devices, return -EOVERFLOW and abort the redirect. To reproduce, create more than MAX_NEST_DEV(8) macvlans on a device with an XDP program attached using BPF_F_BROADCAST | BPF_F_EXCLUDE_INGRESS. Then send a packet to the device to trigger the XDP redirect path. Reported-by: syzbot+10cc7f13760b31bd2e61@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/698c4ce3.050a0220.340abe.000b.GAE@google.com/T/ Fixes: aeea1b86f936 ("bpf, devmap: Exclude XDP broadcast to master device") Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Kohei Enju <kohei@enjuk.jp> Link: https://lore.kernel.org/r/20260225053506.4738-1-kohei@enjuk.jp Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
10 daysperf: Fix __perf_event_overflow() vs perf_remove_from_context() racePeter Zijlstra1-1/+41
[ Upstream commit c9bc1753b3cc41d0e01fbca7f035258b5f4db0ae ] Make sure that __perf_event_overflow() runs with IRQs disabled for all possible callchains. Specifically the software events can end up running it with only preemption disabled. This opens up a race vs perf_event_exit_event() and friends that will go and free various things the overflow path expects to be present, like the BPF program. Fixes: 592903cdcbf6 ("perf_counter: add an event_list") Reported-by: Simond Hu <cmdhh1767@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Simond Hu <cmdhh1767@gmail.com> Link: https://patch.msgid.link/20260224122909.GV1395416@noisy.programming.kicks-ass.net Signed-off-by: Sasha Levin <sashal@kernel.org>
10 daysrseq: Clarify rseq registration rseq_size bound check commentMathieu Desnoyers1-2/+3
[ Upstream commit 26d43a90be81fc90e26688a51d3ec83188602731 ] The rseq registration validates that the rseq_size argument is greater or equal to 32 (the original rseq size), but the com