path: root/kernel/events
Age | Commit message | Author | Files | Lines
10 days | perf/core: Prevent VMA split of buffer mappings | Thomas Gleixner | 1 | -0/+10
commit b024d7b56c77191cde544f838debb7f8451cd0d6 upstream. The perf mmap code is careful about mmap()'ing the user page with the ringbuffer and additionally the auxiliary buffer, when the event supports it. Once the first mapping is established, subsequent mappings have to use the same offset and the same size in both cases. The reference counting for the ringbuffer and the auxiliary buffer depends on this being correct. However, perf does not prevent a related mapping from being split via mmap(2), munmap(2) or mremap(2). A split of a VMA results in perf_mmap_open() calls, which take reference counts, but then the subsequent perf_mmap_close() calls no longer fulfil the offset and size checks. This leads to reference count leaks. As perf already has the requirement for subsequent mappings to match the initial mapping, the obvious consequence is that VMA splits, caused by resizing of a mapping or partial unmapping, have to be prevented. Implement the vm_operations_struct::may_split() callback and unconditionally return -EINVAL. That ensures that the mapping offsets and sizes cannot be changed after the fact. Remapping to a different fixed address with the same size is still possible as it takes the references for the new mapping and drops those of the old mapping. Fixes: 45bfb2e50471 ("perf/core: Add AUX area to ring buffer for raw data streams") Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-27504 Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
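The user-visible effect of the new may_split() callback can be illustrated from user space: any operation that would have to split the perf mapping should now fail with EINVAL. A minimal sketch, assuming an unprivileged software event and an 8-page ring buffer (this program is an illustration, not part of the patch):

#include <linux/perf_event.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <errno.h>

int main(void)
{
	struct perf_event_attr attr;
	long page = sysconf(_SC_PAGESIZE);
	size_t len = 9 * page;		/* user page + 8 ring-buffer pages */
	void *rb;
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_SOFTWARE;
	attr.config = PERF_COUNT_SW_CPU_CLOCK;

	fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}

	rb = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (rb == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/*
	 * Unmapping only part of the buffer would require splitting the
	 * VMA; with may_split() returning -EINVAL the kernel rejects it
	 * instead of leaking ring-buffer references.
	 */
	if (munmap((char *)rb + page, page) == -1)
		printf("partial munmap rejected: %s\n", strerror(errno));

	return 0;
}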
10 days | perf/core: Handle buffer mapping fail correctly in perf_mmap() | Thomas Gleixner | 1 | -3/+11
commit f74b9f4ba63ffdf597aaaa6cad7e284cb8e04820 upstream. After successful allocation of a buffer or a successful attachment to an existing buffer, perf_mmap() tries to map the buffer read-only into the page table. If that fails, the already set up page table entries are zapped, but the other perf specific side effects of that failure are not handled. The calling code just cleans up the VMA and does not invoke perf_mmap_close(). This leaks reference counts, corrupts user->vm accounting and also results in an unbalanced invocation of event::event_mapped(). Cure this by moving the event::event_mapped() invocation before the map_range() call so that on map_range() failure perf_mmap_close() can be invoked without causing an unbalanced event::event_unmapped() call. perf_mmap_close() undoes the reference counts and eventually frees buffers. Fixes: b709eb872e19 ("perf/core: map pages in advance") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 days | perf/core: Exit early on perf_mmap() fail | Thomas Gleixner | 1 | -2/+4
commit 07091aade394f690e7b655578140ef84d0e8d7b0 upstream. When perf_mmap() fails to allocate a buffer, it still invokes the event_mapped() callback of the related event. On X86 this might increase the perf_rdpmc_allowed reference counter. But nothing undoes this as perf_mmap_close() is never called in this case, which causes another reference count leak. Return early on failure to prevent that. Fixes: 1e0fb9ec679c ("perf/core: Add pmu callbacks to track event mapping and unmapping") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 days | perf/core: Don't leak AUX buffer refcount on allocation failure | Thomas Gleixner | 1 | -3/+4
commit 5468c0fbccbb9d156522c50832244a8b722374fb upstream. Failure of the AUX buffer allocation leaks the reference count. Set the reference count to 1 only when the allocation succeeds. Fixes: 45bfb2e50471 ("perf/core: Add AUX area to ring buffer for raw data streams") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
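The bug class here is generic: taking a reference before the allocation outcome is known leaks that reference on the error path. A stand-alone C sketch of the pattern with made-up names, not the kernel code itself:

#include <stdlib.h>

struct aux_buf {
	void *pages;
	int   refcount;
};

/* Buggy shape: the reference is taken before the allocation outcome is
 * known, so the error path returns with a count that nobody drops. */
static int attach_aux_buggy(struct aux_buf *b, size_t size)
{
	b->refcount++;
	b->pages = malloc(size);
	if (!b->pages)
		return -1;	/* leaked reference */
	return 0;
}

/* Fixed shape: take the reference only once the allocation succeeded. */
static int attach_aux_fixed(struct aux_buf *b, size_t size)
{
	b->pages = malloc(size);
	if (!b->pages)
		return -1;
	b->refcount = 1;
	return 0;
}

int main(void)
{
	struct aux_buf a = { 0 }, b = { 0 };

	(void)attach_aux_buggy(&a, 64);
	if (attach_aux_fixed(&b, 64))
		return 1;
	free(a.pages);
	free(b.pages);
	return 0;
}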
10 days | perf/core: Preserve AUX buffer allocation failure result | Thomas Gleixner | 1 | -2/+1
commit 54473e0ef849f44e5ee43e6d6746c27030c3825b upstream. A recent overhaul sets the return value to 0 unconditionally after the allocations, which causes reference count leaks and corrupts the user->vm accounting. Preserve the AUX buffer allocation failure return value, so that the subsequent code works correctly. Fixes: 0983593f32c4 ("perf/core: Lift event->mmap_mutex in perf_mmap()") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-07-17 | perf: Revert to requiring CAP_SYS_ADMIN for uprobes | Peter Zijlstra | 1 | -1/+1
[ Upstream commit ba677dbe77af5ffe6204e0f3f547f3ba059c6302 ] Jann reports that uprobes can be used destructively when used in the middle of an instruction. The kernel only verifies there is a valid instruction at the requested offset, but due to variable instruction length cannot determine if this is an instruction as seen by the intended execution stream. Additionally, Mark Rutland notes that on architectures that mix data in the text segment (like arm64), a similar thing can be done if the data word is 'mistaken' for an instruction. As such, require CAP_SYS_ADMIN for uprobes. Fixes: c9e0924e5c2b ("perf/core: open access to probes for CAP_PERFMON privileged process") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/CAG48ez1n4520sq0XrWYDHKiKxE_+WCfAK+qt9qkY4ZiBGmL-5g@mail.gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-07-17 | perf/core: Fix the WARN_ON_ONCE is out of lock protected region | Luo Gengkun | 1 | -2/+2
[ Upstream commit 7b4c5a37544ba22c6ebe72c0d4ea56c953459fa5 ] commit 3172fb986666 ("perf/core: Fix WARN in perf_cgroup_switch()") tried to fix a concurrency problem between perf_cgroup_switch and perf_cgroup_event_disable. But it did not move the WARN_ON_ONCE() into the lock-protected region, so the warning can still be triggered. Fixes: 3172fb986666 ("perf/core: Fix WARN in perf_cgroup_switch()") Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250626135403.2454105-1-luogengkun@huaweicloud.com Signed-off-by: Sasha Levin <sashal@kernel.org>
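The general pattern behind the fix is that an assertion about lock-protected state is only meaningful while the lock is held. A self-contained pthread illustration of that pattern (not the perf code, which uses ctx->lock and WARN_ON_ONCE(); names here are illustrative):

#include <pthread.h>
#include <assert.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int nr_cgroups;

/* Racy: another CPU may change nr_cgroups between the unlocked check
 * and the locked update, so the assertion can fire spuriously. */
static void event_disable_racy(void)
{
	assert(nr_cgroups > 0);
	pthread_mutex_lock(&lock);
	nr_cgroups--;
	pthread_mutex_unlock(&lock);
}

/* Fixed: check the state under the same lock that protects it. */
static void event_disable_fixed(void)
{
	pthread_mutex_lock(&lock);
	assert(nr_cgroups > 0);
	nr_cgroups--;
	pthread_mutex_unlock(&lock);
}

int main(void)
{
	nr_cgroups = 2;
	event_disable_racy();
	event_disable_fixed();
	return 0;
}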
2025-06-27 | perf/core: Fix WARN in perf_cgroup_switch() | Luo Gengkun | 1 | -2/+20
[ Upstream commit 3172fb986666dfb71bf483b6d3539e1e587fa197 ] There may be concurrency between perf_cgroup_switch and perf_cgroup_event_disable. Consider the following scenario: after a new perf cgroup event is created on CPU0, the new event may not trigger a reprogramming, causing ctx->is_active to be 0. In this case, when CPU1 disables this perf event, it executes __perf_remove_from_context->list_del_event->perf_cgroup_event_disable on CPU1, which causes a race with perf_cgroup_switch running on CPU0. The following describes the details of this concurrency scenario:

  CPU0                                          CPU1

  perf_cgroup_switch:
     ...
     # cpuctx->cgrp is not NULL here
     if (READ_ONCE(cpuctx->cgrp) == NULL)
        return;

                                                perf_remove_from_context:
                                                   ...
                                                   raw_spin_lock_irq(&ctx->lock);
                                                   ...
                                                   # ctx->is_active == 0 because reprogramming is
                                                   # not triggered, so CPU1 can do
                                                   # __perf_remove_from_context for CPU0
                                                   __perf_remove_from_context:
                                                     perf_cgroup_event_disable:
                                                        ...
                                                        if (--ctx->nr_cgroups)
                                                        ...

     # this warning will happen because CPU1 changed
     # ctx.nr_cgroups to 0.
     WARN_ON_ONCE(cpuctx->ctx.nr_cgroups == 0);

[peterz: use guard instead of goto unlock] Fixes: db4a835601b7 ("perf/core: Set cgroup in CPU contexts for new cgroup events") Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250604033924.3914647-3-luogengkun@huaweicloud.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-27 | perf: Fix cgroup state vs ERROR | Peter Zijlstra | 1 | -21/+30
[ Upstream commit 61988e36dc5457cdff7ae7927e8d9ad1419ee998 ] While chasing down a missing perf_cgroup_event_disable() elsewhere, Leo Yan found that both perf_put_aux_event() and perf_remove_sibling_event() were also missing one. Specifically, the rule is that events that switch to OFF,ERROR need to call perf_cgroup_event_disable(). Unify the disable paths to ensure this. Fixes: ab43762ef010 ("perf: Allow normal events to output AUX data") Fixes: 9f0c4fa111dc ("perf/core: Add a new PERF_EV_CAP_SIBLING event capability") Reported-by: Leo Yan <leo.yan@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250605123343.GD35970@noisy.programming.kicks-ass.net Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-27 | perf: Fix sample vs do_exit() | Peter Zijlstra | 1 | -0/+7
[ Upstream commit 4f6fc782128355931527cefe3eb45338abd8ab39 ] Baisheng Gao reported an ARM64 crash, which Mark decoded as being a synchronous external abort -- most likely due to trying to access MMIO in bad ways. The crash further shows perf trying to do a user stack sample while in exit_mmap()'s tlb_finish_mmu() -- i.e. while tearing down the address space it is trying to access. It turns out that we stop perf after we tear down the userspace mm; a recipe for disaster, since perf likes to access userspace for various reasons. Flip this order by moving up where we stop perf in do_exit(). Additionally, harden PERF_SAMPLE_CALLCHAIN and PERF_SAMPLE_STACK_USER to abort when the current task does not have an mm (exit_mm() makes sure to set current->mm = NULL; before commencing with the actual teardown), so that CPU-wide events don't trip on this same problem. Fixes: c5ebcedb566e ("perf: Add ability to attach user stack dump to sample") Reported-by: Baisheng Gao <baisheng.gao@unisoc.com> Suggested-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250605110815.GQ39944@noisy.programming.kicks-ass.net Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-19 | perf: Ensure bpf_perf_link path is properly serialized | Peter Zijlstra | 1 | -4/+30
[ Upstream commit 7ed9138a72829d2035ecbd8dbd35b1bc3c137c40 ] Ravi reported that the bpf_perf_link_attach() usage of perf_event_set_bpf_prog() is not serialized by ctx->mutex, unlike the PERF_EVENT_IOC_SET_BPF case. Reported-by: Ravi Bangoria <ravi.bangoria@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lkml.kernel.org/r/20250307193305.486326750@infradead.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-19 | perf/core: Fix broken throttling when max_samples_per_tick=1 | Qing Wang | 1 | -8/+8
[ Upstream commit f51972e6f8b9a737b2b3eb588069acb538fa72de ] According to the throttling mechanism, the number of PMU interrupts cannot exceed max_samples_per_tick within one tick. But this mechanism is ineffective when max_samples_per_tick=1, because the throttling check is skipped during the first interrupt and only performed when the second interrupt arrives. This bug may have little influence within one tick, but on a larger time scale the problem cannot be underestimated. When max_samples_per_tick = 1:

  Allowed-interrupts-per-second  max-samples-per-second  default-HZ  ARCH
  200                            100                     100         X86
  500                            250                     250         ARM64
  ...

Obviously, the PMU interrupt rate far exceeds the user's expectation. Fixes: e050e3f0a71b ("perf: Fix broken interrupt rate throttling") Signed-off-by: Qing Wang <wangqing7171@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250405141635.243786-3-wangqing7171@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
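Reading the table: because the first interrupt in each tick bypasses the throttling check, up to two interrupts per tick are admitted when max_samples_per_tick = 1, i.e. roughly 2 x HZ interrupts per second (2 x 100 = 200 on x86, 2 x 250 = 500 on arm64), double the configured rate.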
2025-04-23 | perf/core: Change to POLLERR for pinned events with error | Namhyung Kim | 1 | -2/+2
Commit: f4b07fd62d4d11d5 ("perf/core: Use POLLHUP for pinned events in error") started to emit POLLHUP for pinned events in an error state. But POLLHUP is also used to signal that the attached task has terminated. To distinguish a pinned per-task event in the error state, tooling would need to check whether the task is still alive. Change it to POLLERR to make the distinction clear. Suggested-by: Gabriel Marin <gmx@google.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250422223318.180343-1-namhyung@kernel.org
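From the tooling side the distinction shows up in revents. A hedged sketch of how a poll loop might tell the two conditions apart after this change (perf_fd is assumed to be an already-opened event descriptor; the main() stub exists only so the sketch compiles and runs):

#include <poll.h>
#include <stdio.h>

/* Returns 0 to keep polling the event, -1 to drop it from the pollfd set. */
static int check_perf_event(int perf_fd)
{
	struct pollfd pfd = { .fd = perf_fd, .events = POLLIN };

	if (poll(&pfd, 1, 0) <= 0)
		return 0;

	if (pfd.revents & POLLERR) {
		/* Pinned event went into an error state; it can be
		 * re-enabled with PERF_EVENT_IOC_ENABLE or dropped. */
		fprintf(stderr, "event in error state\n");
		return -1;
	}
	if (pfd.revents & POLLHUP) {
		/* The monitored task terminated. */
		fprintf(stderr, "target task exited\n");
		return -1;
	}
	return 0;	/* POLLIN: ring-buffer data is available */
}

int main(void)
{
	return check_perf_event(0) ? 1 : 0;	/* fd 0 only to make the sketch runnable */
}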
2025-04-08 | perf: Fix hang while freeing sigtrap event | Frederic Weisbecker | 1 | -46/+18
Perf can hang while freeing a sigtrap event if a related deferred signal hadn't managed to be sent before the file got closed:

  perf_event_overflow()
     task_work_add(perf_pending_task)

  fput()
     task_work_add(____fput())

  task_work_run()
      ____fput()
          perf_release()
              perf_event_release_kernel()
                  _free_event()
                      perf_pending_task_sync()
                          task_work_cancel() -> FAILED
                          rcuwait_wait_event()

Once task_work_run() is running, the list of pending callbacks is removed from the task_struct and from this point on task_work_cancel() can't remove any pending and not yet started work items, hence the task_work_cancel() failure and the hang on rcuwait_wait_event(). Task work could be changed to remove one work at a time, so a work running on the current task can always cancel a pending one, however the wait / wake design is still subject to inverted dependencies when remote targets are involved, as pictured by Oleg:

  T1                                     T2
  fd = perf_event_open(pid => T2->pid);  fd = perf_event_open(pid => T1->pid);
  close(fd)                              close(fd)
    <IRQ>                                  <IRQ>
    perf_event_overflow()                  perf_event_overflow()
    task_work_add(perf_pending_task)       task_work_add(perf_pending_task)
    </IRQ>                                 </IRQ>
    fput()                                 fput()
      task_work_add(____fput())              task_work_add(____fput())
    task_work_run()                        task_work_run()
      ____fput()                             ____fput()
        perf_release()                         perf_release()
          perf_event_release_kernel()            perf_event_release_kernel()
            _free_event()                          _free_event()
              perf_pending_task_sync()               perf_pending_task_sync()
                rcuwait_wait_event()                   rcuwait_wait_event()

Therefore the only option left is to acquire the event reference count upon queueing the perf task work and release it from the task work, just like it was done before 3a5465418f5f ("perf: Fix event leak upon exec and file release") but without the leaks it fixed. Some adjustments are necessary to make it work:

* A child event might dereference its parent upon freeing. Care must be taken to release the parent last.

* Some places assuming the event doesn't have any reference held and therefore can be freed right away must instead put the reference and let the reference counting do its job.

Reported-by: "Yi Lai" <yi1.lai@linux.intel.com> Closes: https://lore.kernel.org/all/Zx9Losv4YcJowaP%2F@ly-workstation/ Reported-by: syzbot+3c4321e10eea460eb606@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/673adf75.050a0220.87769.0024.GAE@google.com/ Fixes: 3a5465418f5f ("perf: Fix event leak upon exec and file release") Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250304135446.18905-1-frederic@kernel.org
2025-04-07 | uprobes: Avoid false-positive lockdep splat on CONFIG_PREEMPT_RT=y in the ri_timer() uprobe timer callback, use raw_write_seqcount_*() | Andrii Nakryiko | 1 | -2/+13
Avoid a false-positive lockdep warning in the CONFIG_PREEMPT_RT=y configuration when using write_seqcount_begin() in the uprobe timer callback by using raw_write_* APIs. Uprobe's use of timer callback is guaranteed to not race with itself for a given uprobe_task, and as such seqcount's insistence on having preemption disabled on the writer side is irrelevant. So switch to raw_ variants of seqcount API instead of disabling preemption unnecessarily. Also, point out in the comments more explicitly why we use seqcount despite our reader side being rather simple and never retrying. We favor a well-maintained kernel primitive over open-coding our own memory barriers. Fixes: 8622e45b5da1 ("uprobes: Reuse return_instances between multiple uretprobes within task") Reported-by: Alexei Starovoitov <ast@kernel.org> Suggested-by: Sebastian Siewior <bigeasy@linutronix.de> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: stable@kernel.org Link: https://lore.kernel.org/r/20250404194848.2109539-1-andrii@kernel.org
2025-04-06 | perf/core: Fix WARN_ON(!ctx) in __free_event() for partial init | Gabriel Shahrouzi | 1 | -3/+3
Move the get_ctx(child_ctx) call and the child_event->ctx assignment to occur immediately after the child event is allocated. Ensure that child_event->ctx is non-NULL before any subsequent error path within inherit_event calls free_event(), satisfying the assumptions of the cleanup code. Details: There's no clear Fixes tag, because this bug is a side-effect of multiple interacting commits over time (up to 15 years old), not a single regression. The code initially incremented refcount then assigned context immediately after the child_event was created. Later, an early validity check for child_event was added before the refcount/assignment. Even later, a WARN_ON_ONCE() cleanup check was added, assuming event->ctx is valid if the pmu_ctx is valid. The problem is that the WARN_ON_ONCE() could trigger after the initial check passed but before child_event->ctx was assigned, violating its precondition. The solution is to assign child_event->ctx right after its initial validation. This ensures the context exists for any subsequent checks or cleanup routines, resolving the WARN_ON_ONCE(). To resolve it, defer the refcount update and child_event->ctx assignment directly after child_event->pmu_ctx is set but before checking if the parent event is orphaned. The cleanup routine depends on event->pmu_ctx being non-NULL before it verifies event->ctx is non-NULL. This also maintains the author's original intent of passing in child_ctx to find_get_pmu_context before its refcount/assignment. [ mingo: Expanded the changelog from another email by Gabriel Shahrouzi. ] Reported-by: syzbot+ff3aa851d46ab82953a3@syzkaller.appspotmail.com Signed-off-by: Gabriel Shahrouzi <gshahrouzi@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ravi Bangoria <ravi.bangoria@amd.com> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Link: https://lore.kernel.org/r/20250405203036.582721-1-gshahrouzi@gmail.com Closes: https://syzkaller.appspot.com/bug?extid=ff3aa851d46ab82953a3
2025-04-06 | Merge tag 'perf-urgent-2025-04-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds | 1 | -9/+9
Pull perf event fix from Ingo Molnar: "Fix a perf events time accounting bug" * tag 'perf-urgent-2025-04-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/core: Fix child_total_time_enabled accounting bug at task exit
2025-04-01 | mseal sysmap: uprobe mapping | Jeff Xu | 1 | -1/+2
Provide support to mseal the uprobe mapping. Unlike other system mappings, the uprobe mapping is not established during program startup. However, its lifetime is the same as the process's lifetime. It could be sealed from creation. Test was done with perf tool, and observe the uprobe mapping is sealed. Link: https://lkml.kernel.org/r/20250305021711.3867874-6-jeffxu@google.com Signed-off-by: Jeff Xu <jeffxu@chromium.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Kees Cook <kees@kernel.org> Cc: Adhemerval Zanella <adhemerval.zanella@linaro.org> Cc: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrei Vagin <avagin@gmail.com> Cc: Anna-Maria Behnsen <anna-maria@linutronix.de> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Benjamin Berg <benjamin@sipsolutions.net> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Elliot Hughes <enh@google.com> Cc: Florian Faineli <f.fainelli@gmail.com> Cc: Greg Ungerer <gerg@kernel.org> Cc: Guenter Roeck <groeck@chromium.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Jason A. Donenfeld <jason@zx2c4.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Jorge Lucangeli Obes <jorgelo@chromium.org> Cc: Linus Waleij <linus.walleij@linaro.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Mike Rapoport <mike.rapoport@gmail.com> Cc: Pedro Falcato <pedro.falcato@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Stephen Röttger <sroettger@google.com> Cc: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01 | Merge tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm | Linus Torvalds | 1 | -1/+12
Pull MM updates from Andrew Morton: - The series "Enable strict percpu address space checks" from Uros Bizjak uses x86 named address space qualifiers to provide compile-time checking of percpu area accesses. This has caused a small amount of fallout - two or three issues were reported. In all cases the calling code was found to be incorrect. - The series "Some cleanup for memcg" from Chen Ridong implements some relatively minor cleanups for the memcontrol code. - The series "mm: fixes for device-exclusive entries (hmm)" from David Hildenbrand fixes a boatload of issues which David found when using device-exclusive PTE entries when THP is enabled. More work is needed, but this makes things better - our own HMM selftests now succeed. - The series "mm: zswap: remove z3fold and zbud" from Yosry Ahmed removes the z3fold and zbud implementations. They have been deprecated for half a year and nobody has complained. - The series "mm: further simplify VMA merge operation" from Lorenzo Stoakes implements numerous simplifications in this area. No runtime effects are anticipated. - The series "mm/madvise: remove redundant mmap_lock operations from process_madvise()" from SeongJae Park rationalizes the locking in the madvise() implementation. Performance gains of 20-25% were observed in one MADV_DONTNEED microbenchmark. - The series "Tiny cleanup and improvements about SWAP code" from Baoquan He contains a number of touchups to issues which Baoquan noticed when working on the swap code. - The series "mm: kmemleak: Usability improvements" from Catalin Marinas implements a couple of improvements to the kmemleak user-visible output. - The series "mm/damon/paddr: fix large folios access and schemes handling" from Usama Arif provides a couple of fixes for DAMON's handling of large folios. - The series "mm/damon/core: fix wrong and/or useless damos_walk() behaviors" from SeongJae Park fixes a few issues with the accuracy of kdamond's walking of DAMON regions. - The series "expose mapping wrprotect, fix fb_defio use" from Lorenzo Stoakes changes the interaction between framebuffer deferred-io and core MM. No functional changes are anticipated - this is preparatory work for the future removal of page structure fields. - The series "mm/damon: add support for hugepage_size DAMOS filter" from Usama Arif adds a DAMOS filter which permits the filtering by huge page sizes. - The series "mm: permit guard regions for file-backed/shmem mappings" from Lorenzo Stoakes extends the guard region feature from its present "anon mappings only" state. The feature now covers shmem and file-backed mappings. - The series "mm: batched unmap lazyfree large folios during reclamation" from Barry Song cleans up and speeds up the unmapping for pte-mapped large folios. - The series "reimplement per-vma lock as a refcount" from Suren Baghdasaryan puts the vm_lock back into the vma. Our reasons for pulling it out were largely bogus and that change made the code more messy. This patchset provides small (0-10%) improvements on one microbenchmark. - The series "Docs/mm/damon: misc DAMOS filters documentation fixes and improves" from SeongJae Park does some maintenance work on the DAMON docs. - The series "hugetlb/CMA improvements for large systems" from Frank van der Linden addresses a pile of issues which have been observed when using CMA on large machines.
- The series "mm/damon: introduce DAMOS filter type for unmapped pages" from SeongJae Park enables users of DMAON/DAMOS to filter my the page's mapped/unmapped status. - The series "zsmalloc/zram: there be preemption" from Sergey Senozhatsky teaches zram to run its compression and decompression operations preemptibly. - The series "selftests/mm: Some cleanups from trying to run them" from Brendan Jackman fixes a pile of unrelated issues which Brendan encountered while runnimg our selftests. - The series "fs/proc/task_mmu: add guard region bit to pagemap" from Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to determine whether a particular page is a guard page. - The series "mm, swap: remove swap slot cache" from Kairui Song removes the swap slot cache from the allocation path - it simply wasn't being effective. - The series "mm: cleanups for device-exclusive entries (hmm)" from David Hildenbrand implements a number of unrelated cleanups in this code. - The series "mm: Rework generic PTDUMP configs" from Anshuman Khandual implements a number of preparatoty cleanups to the GENERIC_PTDUMP Kconfig logic. - The series "mm/damon: auto-tune aggregation interval" from SeongJae Park implements a feedback-driven automatic tuning feature for DAMON's aggregation interval tuning. - The series "Fix lazy mmu mode" from Ryan Roberts fixes some issues in powerpc, sparc and x86 lazy MMU implementations. Ryan did this in preparation for implementing lazy mmu mode for arm64 to optimize vmalloc. - The series "mm/page_alloc: Some clarifications for migratetype fallback" from Brendan Jackman reworks some commentary to make the code easier to follow. - The series "page_counter cleanup and size reduction" from Shakeel Butt cleans up the page_counter code and fixes a size increase which we accidentally added late last year. - The series "Add a command line option that enables control of how many threads should be used to allocate huge pages" from Thomas Prescher does that. It allows the careful operator to significantly reduce boot time by tuning the parallalization of huge page initialization. - The series "Fix calculations in trace_balance_dirty_pages() for cgwb" from Tang Yizhou fixes the tracing output from the dirty page balancing code. - The series "mm/damon: make allow filters after reject filters useful and intuitive" from SeongJae Park improves the handling of allow and reject filters. Behaviour is made more consistent and the documention is updated accordingly. - The series "Switch zswap to object read/write APIs" from Yosry Ahmed updates zswap to the new object read/write APIs and thus permits the removal of some legacy code from zpool and zsmalloc. - The series "Some trivial cleanups for shmem" from Baolin Wang does as it claims. - The series "fs/dax: Fix ZONE_DEVICE page reference counts" from Alistair Popple regularizes the weird ZONE_DEVICE page refcount handling in DAX, permittig the removal of a number of special-case checks. - The series "refactor mremap and fix bug" from Lorenzo Stoakes is a preparatoty refactoring and cleanup of the mremap() code. - The series "mm: MM owner tracking for large folios (!hugetlb) + CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in which we determine whether a large folio is known to be mapped exclusively into a single MM. - The series "mm/damon: add sysfs dirs for managing DAMOS filters based on handling layers" from SeongJae Park adds a couple of new sysfs directories to ease the management of DAMON/DAMOS filters. 
- The series "arch, mm: reduce code duplication in mem_init()" from Mike Rapoport consolidates many per-arch implementations of mem_init() into code generic code, where that is practical. - The series "mm/damon/sysfs: commit parameters online via damon_call()" from SeongJae Park continues the cleaning up of sysfs access to DAMON internal data. - The series "mm: page_ext: Introduce new iteration API" from Luiz Capitulino reworks the page_ext initialization to fix a boot-time crash which was observed with an unusual combination of compile and cmdline options. - The series "Buddy allocator like (or non-uniform) folio split" from Zi Yan reworks the code to split a folio into smaller folios. The main benefit is lessened memory consumption: fewer post-split folios are generated. - The series "Minimize xa_node allocation during xarry split" from Zi Yan reduces the number of xarray xa_nodes which are generated during an xarray split. - The series "drivers/base/memory: Two cleanups" from Gavin Shan performs some maintenance work on the drivers/base/memory code. - The series "Add tracepoints for lowmem reserves, watermarks and totalreserve_pages" from Martin Liu adds some more tracepoints to the page allocator code. - The series "mm/madvise: cleanup requests validations and classifications" from SeongJae Park cleans up some warts which SeongJae observed during his earlier madvise work. - The series "mm/hwpoison: Fix regressions in memory failure handling" from Shuai Xue addresses two quite serious regressions which Shuai has observed in the memory-failure implementation. - The series "mm: reliable huge page allocator" from Johannes Weiner makes huge page allocations cheaper and more reliable by reducing fragmentation. - The series "Minor memcg cleanups & prep for memdescs" from Matthew Wilcox is preparatory work for the future implementation of memdescs. - The series "track memory used by balloon drivers" from Nico Pache introduces a way to track memory used by our various balloon drivers. - The series "mm/damon: introduce DAMOS filter type for active pages" from Nhat Pham permits users to filter for active/inactive pages, separately for file and anon pages. - The series "Adding Proactive Memory Reclaim Statistics" from Hao Jia separates the proactive reclaim statistics from the direct reclaim statistics. - The series "mm/vmscan: don't try to reclaim hwpoison folio" from Jinjiang Tu fixes our handling of hwpoisoned pages within the reclaim code. 
* tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (431 commits) mm/page_alloc: remove unnecessary __maybe_unused in order_to_pindex() x86/mm: restore early initialization of high_memory for 32-bits mm/vmscan: don't try to reclaim hwpoison folio mm/hwpoison: introduce folio_contain_hwpoisoned_page() helper cgroup: docs: add pswpin and pswpout items in cgroup v2 doc mm: vmscan: split proactive reclaim statistics from direct reclaim statistics selftests/mm: speed up split_huge_page_test selftests/mm: uffd-unit-tests support for hugepages > 2M docs/mm/damon/design: document active DAMOS filter type mm/damon: implement a new DAMOS filter type for active pages fs/dax: don't disassociate zero page entries MM documentation: add "Unaccepted" meminfo entry selftests/mm: add commentary about 9pfs bugs fork: use __vmalloc_node() for stack allocation docs/mm: Physical Memory: Populate the "Zones" section xen: balloon: update the NR_BALLOON_PAGES state hv_balloon: update the NR_BALLOON_PAGES state balloon_compaction: update the NR_BALLOON_PAGES state meminfo: add a per node counter for balloon drivers mm: remove references to folio in __memcg_kmem_uncharge_page() ...
2025-03-31 | perf/core: Fix child_total_time_enabled accounting bug at task exit | Yeoreum Yun | 1 | -9/+9
The perf events code fails to account for total_time_enabled of inactive events. Here is a failure case for accounting total_time_enabled for CPU PMU events:

  sudo ./perf stat -vvv -e armv8_pmuv3_0/event=0x08/ -e armv8_pmuv3_1/event=0x08/ -- stress-ng --pthread=2 -t 2s
  ...
  armv8_pmuv3_0/event=0x08/: 1138698008 2289429840 2174835740
  armv8_pmuv3_1/event=0x08/: 1826791390 1950025700 847648440
                             `          `          `> total_time_running with child
                             `          `> total_time_enabled with child
                             `> count with child

  Performance counter stats for 'stress-ng --pthread=2 -t 2s':

     1,138,698,008  armv8_pmuv3_0/event=0x08/  (94.99%)
     1,826,791,390  armv8_pmuv3_1/event=0x08/  (43.47%)

The two events above are opened on two different CPU PMUs, for example, each event is opened for a cluster in an Arm big.LITTLE system, so they will never run on the same CPU. In theory, the total enabled time should be the same for both events, as the two events are opened and closed together. As the result shows, the two events' total enabled time including child events is different (2289429840 vs 1950025700). This is because child events are not accounted properly if an event is in INACTIVE state when the task exits:

  perf_event_exit_event()
    `> perf_remove_from_context()
       `> __perf_remove_from_context()
          `> perf_child_detach()  -> Accumulate child_total_time_enabled
          `> list_del_event()     -> Update child event's time

The problem is that the time accumulation happens prior to the child event's time update. Thus, it misses accounting the last period's time when the event exits. The perf core layer follows the rule that timekeeping is tied to state change. To address the issue, make __perf_remove_from_context() handle the task exit case by passing 'DETACH_EXIT' to it and invoke perf_event_state() for the state change alongside accounting the time. Then, perf_child_detach() populates the time into the parent's time metrics. After this patch, the bug is fixed:

  sudo ./perf stat -vvv -e armv8_pmuv3_0/event=0x08/ -e armv8_pmuv3_1/event=0x08/ -- stress-ng --pthread=2 -t 10s
  ...
  armv8_pmuv3_0/event=0x08/: 15396770398 32157963940 21898169000
  armv8_pmuv3_1/event=0x08/: 22428964974 32157963940 10259794940

  Performance counter stats for 'stress-ng --pthread=2 -t 10s':

     15,396,770,398  armv8_pmuv3_0/event=0x08/  (68.10%)
     22,428,964,974  armv8_pmuv3_1/event=0x08/  (31.90%)

[ mingo: Clarified the changelog. ] Fixes: ef54c1a476aef ("perf: Rework perf_event_exit_event()") Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Leo Yan <leo.yan@arm.com> Link: https://lore.kernel.org/r/20250326082003.1630986-1-yeoreum.yun@arm.com
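The three values per event quoted above (count, total_time_enabled and total_time_running, each including children) are what read() returns on the event fd when the matching read_format bits are set. A hedged user-space sketch, using a software event instead of the Arm PMU events from the example:

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>

struct read_format {
	uint64_t value;		/* count, including children (inherit=1) */
	uint64_t time_enabled;	/* total_time_enabled, including children */
	uint64_t time_running;	/* total_time_running, including children */
};

int main(void)
{
	struct perf_event_attr attr;
	struct read_format rf;
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_SOFTWARE;
	attr.config = PERF_COUNT_SW_TASK_CLOCK;
	attr.inherit = 1;		/* account child tasks, as perf stat does */
	attr.read_format = PERF_FORMAT_TOTAL_TIME_ENABLED |
			   PERF_FORMAT_TOTAL_TIME_RUNNING;

	fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}

	usleep(100000);			/* let the counter run briefly */

	if (read(fd, &rf, sizeof(rf)) != sizeof(rf)) {
		perror("read");
		return 1;
	}
	printf("count=%llu enabled=%llu running=%llu\n",
	       (unsigned long long)rf.value,
	       (unsigned long long)rf.time_enabled,
	       (unsigned long long)rf.time_running);
	return 0;
}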
2025-03-25 | Merge tag 'lsm-pr-20250323' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm | Linus Torvalds | 1 | -7/+7
Pull lsm updates from Paul Moore: - Various minor updates to the LSM Rust bindings Changes include marking trivial Rust bindings as inlines and comment tweaks to better reflect the LSM hooks. - Add LSM/SELinux access controls to io_uring_allowed() Similar to the io_uring_disabled sysctl, add a LSM hook to io_uring_allowed() to enable LSMs a simple way to enforce security policy on the use of io_uring. This pull request includes SELinux support for this new control using the io_uring/allowed permission. - Remove an unused parameter from the security_perf_event_open() hook The perf_event_attr struct parameter was not used by any currently supported LSMs, remove it from the hook. - Add an explicit MAINTAINERS entry for the credentials code We've seen problems in the past where patches to the credentials code sent by non-maintainers would often languish on the lists for multiple months as there was no one explicitly tasked with the responsibility of reviewing and/or merging credentials related code. Considering that most of the code under security/ has a vested interest in ensuring that the credentials code is well maintained, I'm volunteering to look after the credentials code and Serge Hallyn has also volunteered to step up as an official reviewer. I posted the MAINTAINERS update as a RFC to LKML in hopes that someone else would jump up with an "I'll do it!", but beyond Serge it was all crickets. - Update Stephen Smalley's old email address to prevent confusion This includes a corresponding update to the mailmap file. * tag 'lsm-pr-20250323' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm: mailmap: map Stephen Smalley's old email addresses lsm: remove old email address for Stephen Smalley MAINTAINERS: add Serge Hallyn as a credentials reviewer MAINTAINERS: add an explicit credentials entry cred,rust: mark Credential methods inline lsm,rust: reword "destroy" -> "release" in SecurityCtx lsm,rust: mark SecurityCtx methods inline perf: Remove unnecessary parameter of security check lsm: fix a missing security_uring_allowed() prototype io_uring,lsm,selinux: add LSM hooks for io_uring_setup() io_uring: refactor io_uring_allowed()
2025-03-25 | Merge tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds | 1 | -4/+3
Pull timer cleanups from Thomas Gleixner: "A treewide hrtimer timer cleanup hrtimers are initialized with hrtimer_init() and a subsequent store to the callback pointer. This turned out to be suboptimal for the upcoming Rust integration and is obviously a silly implementation to begin with. This cleanup replaces the hrtimer_init(T); T->function = cb; sequence with hrtimer_setup(T, cb); The conversion was done with Coccinelle and a few manual fixups. Once the conversion has completely landed in mainline, hrtimer_init() will be removed and the hrtimer::function becomes a private member" * tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (100 commits) wifi: rt2x00: Switch to use hrtimer_update_function() io_uring: Use helper function hrtimer_update_function() serial: xilinx_uartps: Use helper function hrtimer_update_function() ASoC: fsl: imx-pcm-fiq: Switch to use hrtimer_setup() RDMA: Switch to use hrtimer_setup() virtio: mem: Switch to use hrtimer_setup() drm/vmwgfx: Switch to use hrtimer_setup() drm/xe/oa: Switch to use hrtimer_setup() drm/vkms: Switch to use hrtimer_setup() drm/msm: Switch to use hrtimer_setup() drm/i915/request: Switch to use hrtimer_setup() drm/i915/uncore: Switch to use hrtimer_setup() drm/i915/pmu: Switch to use hrtimer_setup() drm/i915/perf: Switch to use hrtimer_setup() drm/i915/gvt: Switch to use hrtimer_setup() drm/i915/huc: Switch to use hrtimer_setup() drm/amdgpu: Switch to use hrtimer_setup() stm class: heartbeat: Switch to use hrtimer_setup() i2c: Switch to use hrtimer_setup() iio: Switch to use hrtimer_setup() ...
2025-03-17 | perf: Clean up pmu specific data | Kan Liang | 1 | -73/+3
The pmu specific data is saved in task_struct now. Remove it from event context structure. Remove swap_task_ctx() as well. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250314172700.438923-7-kan.liang@linux.intel.com
2025-03-17 | perf: Supply task information to sched_task() | Kan Liang | 1 | -9/+11
To save/restore LBR call stack data in system-wide mode, the task_struct information is required. Extend the parameters of sched_task() to supply task_struct information. When scheduling in, the LBR call stack data for the new task will be restored. When scheduling out, the LBR call stack data for the old task will be saved. Only the required task_struct information needs to be passed. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250314172700.438923-4-kan.liang@linux.intel.com
2025-03-17 | perf: attach/detach PMU specific data | Kan Liang | 1 | -0/+289
The LBR call stack data has to be saved/restored during context switch to fix the shorter LBR call stacks issue in system-wide mode. Allocate PMU specific data and attach it to the corresponding task_struct during LBR call stack monitoring. When an LBR call stack event is accounted, the perf_ctx_data for the related tasks will be allocated/attached by attach_perf_ctx_data(). When an LBR call stack event is unaccounted, the perf_ctx_data for related tasks will be detached/freed by detach_perf_ctx_data(). The LBR call stack event could be a per-task event or a system-wide event.

- For a per-task event, perf only allocates the perf_ctx_data for the current task. If the allocation fails, perf will error out.

- For a system-wide event, perf has to allocate the perf_ctx_data for both the existing tasks and the upcoming tasks. The allocation for the existing tasks is done in perf_event_alloc(). If any allocation fails, perf will error out. The allocation for the new tasks will be done in perf_event_fork(). A global reader/writer semaphore, global_ctx_data_rwsem, is added to address the global race.

- The perf_ctx_data is only freed by the last LBR call stack event. The number of the per-task events is tracked by the refcount of each task. Since the system-wide events impact all tasks, it's not practical to go through the whole task list to update the refcount for each system-wide event. The number of system-wide events is tracked by a global variable global_ctx_data_ref.

Suggested-by: "Peter Zijlstra (Intel)" <peterz@infradead.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250314172700.438923-3-kan.liang@linux.intel.com
2025-03-17 | perf: Save PMU specific data in task_struct | Kan Liang | 1 | -0/+1
Some PMU specific data has to be saved/restored during context switch, e.g. LBR call stack data. Currently, the data is saved in the event context structure, but only for per-process events. For system-wide events, because the LBR call stack data is missing after a context switch, LBR callstacks are always shorter in comparison to per-process mode. For example:

  Per-process mode:
  $perf record --call-graph lbr -- taskset -c 0 ./tchain_edit
  -  99.90%  99.86%  tchain_edit  tchain_edit  [.] f3
       99.86% _start
          __libc_start_main
          generic_start_main
          main
          f1
        - f2
             f3

  System-wide mode:
  $perf record --call-graph lbr -a -- taskset -c 0 ./tchain_edit
  -  99.88%  99.82%  tchain_edit  tchain_edit  [.] f3
     - 62.02% main
          f1
          f2
          f3
     - 28.83% f1
        - f2
             f3
     - 8.88% generic_start_main
          main
          f1
          f2
          f3

It isn't practical to simply allocate the data for system-wide events in the CPU context structure for all tasks. We have no idea which CPU a task will be scheduled to. The duplicated LBR data would have to be maintained on every CPU context structure. That's a huge waste. Otherwise, the LBR data is still lost if the task is scheduled to another CPU. Save the pmu specific data in task_struct. The size of pmu specific data is 788 bytes for LBR call stack. Usually, the overall amount of threads doesn't exceed a few thousands. For 10K threads, keeping LBR data would consume an additional ~8MB. The additional space will only be allocated during LBR call stack monitoring. It will be released when the monitoring is finished. Furthermore, moving task_ctx_data from perf_event_context to task_struct can reduce complexity and make things clearer. E.g. perf doesn't need to swap task_ctx_data on the optimized context switch path. This patch set is just the first step. There could be other optimizations/extensions on top of this patch set. E.g. for cgroup profiling, perf just needs to save/store the LBR call stack information for tasks in a specific cgroup. That could reduce the additional space. Also, the LBR call stack can be available for software events, or allow even debugging use cases, like LBRs on crash later. Because of the alignment requirement of Intel Arch LBR, the Kmem cache is used to allocate the PMU specific data. It's required when a child task allocates the space. Save it in struct perf_ctx_data. The refcount in struct perf_ctx_data is used to track the users of pmu specific data. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Alexey Budankov <alexey.budankov@linux.intel.com> Link: https://lore.kernel.org/r/20250314172700.438923-1-kan.liang@linux.intel.com
2025-03-17 | perf/ring_buffer: Allow the EPOLLRDNORM flag for poll | Tao Chen | 1 | -1/+1
The poll man page says POLLRDNORM is equivalent to POLLIN. For poll(), if the user sets pollfd with POLLRDNORM in userspace, perf_poll will not return until the timeout even if perf_output_wakeup() is called, whereas with POLLIN it returns. Fixes: 76369139ceb9 ("perf: Split up buffer handling from core code") Signed-off-by: Tao Chen <chen.dylane@linux.dev> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250314030036.2543180-1-chen.dylane@linux.dev
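A small fragment showing the poll() usage this commit is about; with the fix, asking for POLLRDNORM should behave the same as asking for POLLIN (perf_fd is assumed to be an mmap'ed perf event descriptor, and this is a sketch rather than code from the patch):

#define _XOPEN_SOURCE 700
#include <poll.h>

/* Before this fix, a pollfd that asked only for POLLRDNORM could sleep
 * until the timeout even though perf_output_wakeup() had signalled that
 * new ring-buffer data was available. */
static int wait_for_samples(int perf_fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = perf_fd, .events = POLLIN | POLLRDNORM };

	return poll(&pfd, 1, timeout_ms);
}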
2025-03-17 | perf/core: Use POLLHUP for pinned events in error | Namhyung Kim | 1 | -0/+9
Pinned performance events can enter an error state when they fail to be scheduled in the context due to a failed constraint or some other conflict or condition. In the error state these events won't generate any samples anymore and are silently ignored until they are recovered by PERF_EVENT_IOC_ENABLE, or the condition changes so that they can be scheduled in again. Tooling should be allowed to know about the state change, but currently there's no mechanism to notify tooling when events enter an error state. One way to do this is to issue a POLLHUP event to poll(2). Reading events in an error state would return 0 (EOF), which matches the behavior of POLLHUP according to the man page. Tooling should remove the fd of the event from pollfd after getting POLLHUP, otherwise it'll be returned repeatedly. [ mingo: Clarified the changelog ] Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250317061745.1777584-1-namhyung@kernel.org
2025-03-16 | kernel/events/uprobes: handle device-exclusive entries correctly in ...