summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
9 daystracing: Avoid NULL return from hist_field_name() on truncationDavid Carlier1-4/+2
[ Upstream commit 576ec047d20b368b43c4d5db98c4f2e0f3c101ec ] hist_field_name() returns "" everywhere except the fully-qualified VAR_REF/EXPR case, where snprintf() truncation returns NULL early and bypasses the bottom NULL->"" guard. Callers don't expect NULL: strcat(expr, hist_field_name(field, 0)) at trace_events_hist.c:1758 and the strcmp() in the sort-key match loop at :4804 both deref it. system and event_name are bounded by MAX_EVENT_NAME_LEN, but the field name on a VAR_REF is kstrdup'd from a histogram variable name parsed out of the trigger string and has no length cap, so a long enough var name in a fully qualified reference can reach the truncation path. Keep the length check but leave field_name as "" on overflow. Link: https://patch.msgid.link/20260508195747.25492-1-devnexen@gmail.com Fixes: 5ec1d1e97de1 ("tracing: Rebuild full_name on each hist_field_name() call") Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
9 daystracing: Do not call map->ops->elt_free() if elt_alloc() failsMasami Hiramatsu (Google)1-4/+13
commit 8f0f5c4fb9df0e19a341e0c6ed8dc4fda9124f03 upstream. In paths where tracing_map_elt_alloc() failed to allocate objects, the map->ops->elt_alloc() call was never successful. In this case, map->ops->elt_free() should not be called. Link: https://sashiko.dev/#/patchset/20260520223101.34710-1-rosenp%40gmail.com Cc: stable@vger.kernel.org Cc: Tom Zanussi <tom.zanussi@linux.intel.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Rosen Penev <rosenp@gmail.com> Reported-by: Sashiko <sashiko-bot@kernel.org> Fixes: 2734b629525a ("tracing: Add per-element variable support to tracing_map") Link: https://patch.msgid.link/177933895460.108746.5396070821443932634.stgit@devnote2 Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 daysring-buffer: Fix reporting of missed events in iteratorSteven Rostedt1-5/+3
commit a254b6d13b0edd6272926674d2afc46d46e496b7 upstream. When tracing is active while reading the trace file, if the iterator reading the buffer detects that the writer has passed the iterator head, it will reset and set a "missed events" flag. This flag is passed to the output processing to show the user that events were missed: CPU:4 [LOST EVENTS] The problem is that the flag is reset after it is checked in ring_buffer_iter_dropped(). But the "trace" file iterates over all the CPU ring buffers and it will check if they are dropped when figuring out which buffer to print next. This prematurely clears the missed_events flag if the CPU buffer with the missed events is not the one that is printed next. On the iteration where the CPU buffer with the missed events is printed, the check if it had missed events would return false and the output does not show that events were missed. Do not reset the missed_events flag when checking if there were missed events, but instead clear it when moving the iterator head to the next event. Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260520220801.4fd09d13@fedora Fixes: c9b7a4a72ff64 ("ring-buffer/tracing: Have iterator acknowledge dropped events") Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 daysaudit: enforce AUDIT_LOCKED for AUDIT_TRIM and AUDIT_MAKE_EQUIVSergio Correia1-0/+4
commit f9e1c1324b4d98d591a6f7568fdebf5cf456dfc2 upstream. AUDIT_ADD_RULE and AUDIT_DEL_RULE correctly check for AUDIT_LOCKED and return -EPERM, but AUDIT_TRIM and AUDIT_MAKE_EQUIV do not. This allows a process with CAP_AUDIT_CONTROL to modify directory tree watches and equivalence mappings even when the audit configuration has been locked, undermining the purpose of the lock. Add AUDIT_LOCKED checks to both commands. Cc: stable@vger.kernel.org Reviewed-by: Ricardo Robaina <rrobaina@redhat.com> Assisted-by: Claude:claude-opus-4-6 Signed-off-by: Sergio Correia <scorreia@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 daysaudit: fix incorrect inheritable capability in CAPSET recordsSergio Correia1-1/+1
commit e4a640475e43f406fdfd56d370b1f34b0cbbc18d upstream. __audit_log_capset() records the effective capability set into the inheritable field due to a copy-paste error. Every CAPSET audit record therefore reports cap_pi (process inheritable) with the value of cap_effective instead of cap_inheritable. This silently corrupts audit data used for compliance and forensic analysis: an attacker who modifies inheritable capabilities to prepare for a privilege-escalating exec would have the change masked in the audit trail. The bug has been present since the original introduction of CAPSET audit records in 2008. Cc: stable@vger.kernel.org Fixes: e68b75a027bb ("When the capset syscall is used it is not possible for audit to record the actual capbilities being added/removed. This patch adds a new record type which emits the target pid and the eff, inh, and perm cap sets.") Reviewed-by: Ricardo Robaina <rrobaina@redhat.com> Assisted-by: Claude:claude-opus-4-6 Signed-off-by: Sergio Correia <scorreia@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 daysalarmtimer: Check RTC features instead of opsAlexandre Belloni1-1/+1
[ Upstream commit e09784a8a751e539dffc94d43bc917b0ac1e934a ] RTC drivers used to leave .set_alarm() NULL in order to signal the RTC device doesn't support alarms. The drivers are now clearing the RTC_FEATURE_ALARM bit for that purpose in order to keep the rtc_class_ops structure const. So now, .set_alarm() is set unconditionally and this possibly causes the alarmtimer code to select an RTC device that doesn't support alarms. Test RTC_FEATURE_ALARM instead of relying on ops->set_alarm to determine whether alarms are available. Fixes: 7ae41220ef58 ("rtc: introduce features bitfield") Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20210511014516.563031-1-alexandre.belloni@bootlin.com Signed-off-by: Sasha Levin <sashal@kernel.org>
9 daystracing: branch: Fix inverted check on stat tracer registrationBreno Leitao1-4/+4
[ Upstream commit 3b75dd76e64a04771861bb5647951c264919e563 ] init_annotated_branch_stats() and all_annotated_branch_stats() check the return value of register_stat_tracer() with "if (!ret)", but register_stat_tracer() returns 0 on success and a negative errno on failure. The inverted check causes the warning to be printed on every successful registration, e.g.: Warning: could not register annotated branches stats while leaving real failures silent. The initcall also returned a hard-coded 1 instead of the actual error. Invert the check and propagate ret so that the warning fires on real errors and the initcall reports the correct status. Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Frederic Weisbecker <fweisbec@gmail.com> Link: https://patch.msgid.link/20260420-tracing-v1-1-d8f4cd0d6af1@debian.org Fixes: 002bb86d8d42 ("tracing/ftrace: separate events tracing and stats tracing engine") Signed-off-by: Breno Leitao <leitao@debian.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
9 dayscgroup/rdma: fix integer overflow in rdmacg_try_charge()cuitao1-1/+1
[ Upstream commit c802f460dd485c1332b5a35e7adcfb2bc22536a2 ] The expression `rpool->resources[index].usage + 1` is computed in int arithmetic before being assigned to s64 variable `new`. When usage equals INT_MAX (the default "max" value), the addition overflows to INT_MIN. This negative value then passes the `new > max` check incorrectly, allowing a charge that should be rejected and corrupting usage to negative. Fix by casting usage to s64 before the addition so the arithmetic is done in 64-bit. Fixes: 39d3e7584a68 ("rdmacg: Added rdma cgroup controller") Signed-off-by: cuitao <cuitao@kylinos.cn> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
9 daystracing: Rebuild full_name on each hist_field_name() callPengpeng Hou1-5/+7
[ Upstream commit 5ec1d1e97de134beed3a5b08235a60fc1c51af96 ] hist_field_name() uses a static MAX_FILTER_STR_VAL buffer for fully qualified variable-reference names, but it currently appends into that buffer with strcat() without rebuilding it first. As a result, repeated calls append a new "system.event.field" name onto the previous one, which can eventually run past the end of full_name. Build the name with snprintf() on each call and return NULL if the fully qualified name does not fit in MAX_FILTER_STR_VAL. Link: https://patch.msgid.link/20260401112224.85582-1-pengpeng@iscas.ac.cn Fixes: 067fe038e70f ("tracing: Add variable reference handling to hist triggers") Reviewed-by: Tom Zanussi <zanussi@kernel.org> Tested-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
9 daysbpf: fix end-of-list detection in cgroup_storage_get_next_key()Weiming Shi1-1/+1
[ Upstream commit 5828b9e5b272ecff7cf5d345128d3de7324117f7 ] list_next_entry() never returns NULL -- when the current element is the last entry it wraps to the list head via container_of(). The subsequent NULL check is therefore dead code and get_next_key() never returns -ENOENT for the last element, instead reading storage->key from a bogus pointer that aliases internal map fields and copying the result to userspace. Replace it with list_entry_is_head() so the function correctly returns -ENOENT when there are no more entries. Fixes: de9cbbaadba5 ("bpf: introduce cgroup storage maps") Reported-by: Xiang Mei <xmei5@asu.edu> Signed-off-by: Weiming Shi <bestswngs@gmail.com> Reviewed-by: Sun Jian <sun.jian.kdev@gmail.com> Acked-by: Paul Chaignon <paul.chaignon@gmail.com> Link: https://lore.kernel.org/r/20260403132951.43533-2-bestswngs@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
9 daystaskstats: set version in TGID exit notificationsYiyang Chen1-0/+1
commit 16c4f0211aaa1ec1422b11b59f64f1abe9009fc0 upstream. delay accounting started populating taskstats records with a valid version field via fill_pid() and fill_tgid(). Later, commit ad4ecbcba728 ("[PATCH] delay accounting taskstats interface send tgid once") changed the TGID exit path to send the cached signal->stats aggregate directly instead of building the outgoing record through fill_tgid(). Unlike fill_tgid(), fill_tgid_exit() only accumulates accounting data and never initializes stats->version. As a result, TGID exit notifications can reach userspace with version == 0 even though PID exit notifications and TASKSTATS_CMD_GET replies carry a valid taskstats version. This is easy to reproduce with `tools/accounting/getdelays.c`. I have a small follow-up patch for that tool which: 1. increases the receive buffer/message size so the pid+tgid combined exit notification is not dropped/truncated 2. prints `stats->version`. With that patch, the reproducer is: Terminal 1: ./getdelays -d -v -l -m 0 Terminal 2: taskset -c 0 python3 -c 'import threading,time; t=threading.Thread(target=time.sleep,args=(0.1,)); t.start(); t.join()' That produces both PID and TGID exit notifications for the same process. The PID exit record reports a valid taskstats version, while the TGID exit record reports `version 0`. This patch (of 2): Set stats->version = TASKSTATS_VERSION after copying the cached TGID aggregate into the outgoing netlink payload so all taskstats records are self-describing again. Link: https://lkml.kernel.org/r/ba83d934e59edd431b693607de573eb9ca059309.1774810498.git.cyyzero16@gmail.com Fixes: ad4ecbcba728 ("[PATCH] delay accounting taskstats interface send tgid once") Signed-off-by: Yiyang Chen <cyyzero16@gmail.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Dr. Thomas Orgis <thomas.orgis@uni-hamburg.de> Cc: Fan Yu <fan.yu9@zte.com.cn> Cc: Wang Yaxin <wang.yaxin@zte.com.cn> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 dayspadata: Fix pd UAF once and for allHerbert Xu1-99/+37
[ Upstream commit 71203f68c7749609d7fc8ae6ad054bdedeb24f91 ] There is a race condition/UAF in padata_reorder that goes back to the initial commit. A reference count is taken at the start of the process in padata_do_parallel, and released at the end in padata_serial_worker. This reference count is (and only is) required for padata_replace to function correctly. If padata_replace is never called then there is no issue. In the function padata_reorder which serves as the core of padata, as soon as padata is added to queue->serial.list, and the associated spin lock released, that padata may be processed and the reference count on pd would go away. Fix this by getting the next padata before the squeue->serial lock is released. In order to make this possible, simplify padata_reorder by only calling it once the next padata arrives. Fixes: 16295bec6398 ("padata: Generic parallelization/serialization interface") Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> [ Adjust context of padata_find_next(). Replace cpumask_next_wrap(cpu, pd->cpumask.pcpu) with cpumask_next_wrap(cpu, pd->cpumask.pcpu, -1, false) in padata_reorder() in v5.10 according to dc5bb9b769c9 ("cpumask: deprecate cpumask_next_wrap()") and f954a2d37637 ("padata: switch padata_find_next() to using cpumask_next_wrap()") . ] Signed-off-by: Bin Lan <lanbincn@139.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
9 daysx86/uprobes: Fix XOL allocation failure for 32-bit tasksOleg Nesterov1-3/+7
[ Upstream commit d55c571e4333fac71826e8db3b9753fadfbead6a ] This script #!/usr/bin/bash echo 0 > /proc/sys/kernel/randomize_va_space echo 'void main(void) {}' > TEST.c # -fcf-protection to ensure that the 1st endbr32 insn can't be emulated gcc -m32 -fcf-protection=branch TEST.c -o test bpftrace -e 'uprobe:./test:main {}' -c ./test "hangs", the probed ./test task enters an endless loop. The problem is that with randomize_va_space == 0 get_unmapped_area(TASK_SIZE - PAGE_SIZE) called by xol_add_vma() can not just return the "addr == TASK_SIZE - PAGE_SIZE" hint, this addr is used by the stack vma. arch_get_unmapped_area_topdown() doesn't take TIF_ADDR32 into account and in_32bit_syscall() is false, this leads to info.high_limit > TASK_SIZE. vm_unmapped_area() happily returns the high address > TASK_SIZE and then get_unmapped_area() returns -ENOMEM after the "if (addr > TASK_SIZE - len)" check. handle_swbp() doesn't report this failure (probably it should) and silently restarts the probed insn. Endless loop. I think that the right fix should change the x86 get_unmapped_area() paths to rely on TIF_ADDR32 rather than in_32bit_syscall(). Note also that if CONFIG_X86_X32_ABI=y, in_x32_syscall() falsely returns true in this case because ->orig_ax = -1. But we need a simple fix for -stable, so this patch just sets TS_COMPAT if the probed task is 32-bit to make in_ia32_syscall() true. Fixes: 1b028f784e8c ("x86/mm: Introduce mmap_compat_base() for 32-bit mmap()") Reported-by: Paulo Andrade <pandrade@redhat.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/aV5uldEvV7pb4RA8@redhat.com/ Cc: stable@vger.kernel.org Link: https://patch.msgid.link/aWO7Fdxn39piQnxu@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 daystracing/probe: reject non-closed empty immediate stringsPengpeng Hou1-1/+1
[ Upstream commit 4346be6577aaa04586167402ae87bbdbe32484a4 ] parse_probe_arg() accepts quoted immediate strings and passes the body after the opening quote to __parse_imm_string(). That helper currently computes strlen(str) and immediately dereferences str[len - 1], which underflows when the body is empty and not closed with double-quotation. Reject empty non-closed immediate strings before checking for the closing quote. Link: https://lore.kernel.org/all/20260401160315.88518-1-pengpeng@iscas.ac.cn/ Fixes: a42e3c4de964 ("tracing/probe: Add immediate string parameter support") Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-05-15ptrace: slightly saner 'get_dumpable()' logicLinus Torvalds2-6/+17
commit 31e62c2ebbfdc3fe3dbdf5e02c92a9dc67087a3a upstream. The 'dumpability' of a task is fundamentally about the memory image of the task - the concept comes from whether it can core dump or not - and makes no sense when you don't have an associated mm. And almost all users do in fact use it only for the case where the task has a mm pointer. But we have one odd special case: ptrace_may_access() uses 'dumpable' to check various other things entirely independently of the MM (typically explicitly using flags like PTRACE_MODE_READ_FSCREDS). Including for threads that no longer have a VM (and maybe never did, like most kernel threads). It's not what this flag was designed for, but it is what it is. The ptrace code does check that the uid/gid matches, so you do have to be uid-0 to see kernel thread details, but this means that the traditional "drop capabilities" model doesn't make any difference for this all. Make it all make a *bit* more sense by saying that if you don't have a MM pointer, we'll use a cached "last dumpability" flag if the thread ever had a MM (it will be zero for kernel threads since it is never set), and require a proper CAP_SYS_PTRACE capability to override. Reported-by: Qualys Security Advisory <qsa@qualys.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kees Cook <kees@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-18bpf: Fix regsafe() for pointers to packetAlexei Starovoitov1-1/+6
[ Upstream commit a8502a79e832b861e99218cbd2d8f4312d62e225 ] In case rold->reg->range == BEYOND_PKT_END && rcur->reg->range == N regsafe() may return true which may lead to current state with valid packet range not being explored. Fix the bug. Fixes: 6d94e741a8ff ("bpf: Support for pointers beyond pkt_end.") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Amery Hung <ameryhung@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20260331204228.26726-1-alexei.starovoitov@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-04-18futex: Clear stale exiting pointer in futex_lock_pi() retry pathDavidlohr Bueso1-1/+2
commit 210d36d892de5195e6766c45519dfb1e65f3eb83 upstream. Fuzzying/stressing futexes triggered: WARNING: kernel/futex/core.c:825 at wait_for_owner_exiting+0x7a/0x80, CPU#11: futex_lock_pi_s/524 When futex_lock_pi_atomic() sees the owner is exiting, it returns -EBUSY and stores a refcounted task pointer in 'exiting'. After wait_for_owner_exiting() consumes that reference, the local pointer is never reset to nil. Upon a retry, if futex_lock_pi_atomic() returns a different error, the bogus pointer is passed to wait_for_owner_exiting(). CPU0 CPU1 CPU2 futex_lock_pi(uaddr) // acquires the PI futex exit() futex_cleanup_begin() futex_state = EXITING; futex_lock_pi(uaddr) futex_lock_pi_atomic() attach_to_pi_owner() // observes EXITING *exiting = owner; // takes ref return -EBUSY wait_for_owner_exiting(-EBUSY, owner) put_task_struct(); // drops ref // exiting still points to owner goto retry; futex_lock_pi_atomic() lock_pi_update_atomic() cmpxchg(uaddr) *uaddr ^= WAITERS // whatever // value changed return -EAGAIN; wait_for_owner_exiting(-EAGAIN, exiting) // stale WARN_ON_ONCE(exiting) Fix this by resetting upon retry, essentially aligning it with requeue_pi. Fixes: 3ef240eaff36 ("futex: Prevent exit livelock") Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260326001759.4129680-1-dave@stgolabs.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-18alarmtimer: Fix argument order in alarm_timer_forward()Zhan Xusheng1-1/+1
commit 5d16467ae56343b9205caedf85e3a131e0914ad8 upstream. alarm_timer_forward() passes arguments to alarm_forward() in the wrong order: alarm_forward(alarm, timr->it_interval, now); However, alarm_forward() is defined as: u64 alarm_forward(struct alarm *alarm, ktime_t now, ktime_t interval); and uses the second argument as the current time: delta = ktime_sub(now, alarm->node.expires); Passing the interval as "now" results in incorrect delta computation, which can lead to missed expirations or incorrect overrun accounting. This issue has been present since the introduction of alarm_timer_forward(). Fix this by swapping the arguments. Fixes: e7561f1633ac ("alarmtimer: Implement forward callback") Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260323061130.29991-1-zhanxusheng@xiaomi.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-18sysctl: fix uninitialized variable in proc_do_large_bitmapMarc Buerg1-1/+1
[ Upstream commit f63a9df7e3f9f842945d292a19d9938924f066f9 ] proc_do_large_bitmap() does not initialize variable c, which is expected to be set to a trailing character by proc_get_long(). However, proc_get_long() only sets c when the input buffer contains a trailing character after the parsed value. If c is not initialized it may happen to contain a '-'. If this is the case proc_do_large_bitmap() expects to be able to parse a second part of the input buffer. If there is no second part an unjustified -EINVAL will be returned. Initialize c to 0 to prevent returning -EINVAL on valid input. Fixes: 9f977fb7ae9d ("sysctl: add proc_do_large_bitmap") Signed-off-by: Marc Buerg <buermarc@googlemail.com> Reviewed-by: Joel Granados <joel.granados@kernel.org> Signed-off-by: Joel Granados <joel.granados@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-04-18sched: idle: Consolidate the handling of two special casesRafael J. Wysocki1-9/+21
[ Upstream commit f4c31b07b136839e0fb3026f8a5b6543e3b14d2f ] There are two special cases in the idle loop that are handled inconsistently even though they are analogous. The first one is when a cpuidle driver is absent and the default CPU idle time power management implemented by the architecture code is used. In that case, the scheduler tick is stopped every time before invoking default_idle_call(). The second one is when a cpuidle driver is present, but there is only one idle state in its table. In that case, the scheduler tick is never stopped at all. Since each of these approaches has its drawbacks, reconcile them with the help of one simple heuristic. Namely, stop the tick if the CPU has been woken up by it in the previous iteration of the idle loop, or let it tick otherwise. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Qais Yousef <qyousef@layalina.io> Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com> Fixes: ed98c3491998 ("sched: idle: Do not stop the tick before cpuidle_idle_call()") [ rjw: Added Fixes tag, changelog edits ] Link: https://patch.msgid.link/4741364.LvFx2qVVIh@rafael.j.wysocki Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-04-18tracing: Fix syscall events activation by ensuring refcount hits zeroHuiwen He1-15/+36
[ Upstream commit 0a663b764dbdf135a126284f454c9f01f95a87d4 ] When multiple syscall events are specified in the kernel command line (e.g., trace_event=syscalls:sys_enter_openat,syscalls:sys_enter_close), they are often not captured after boot, even though they appear enabled in the tracing/set_event file. The issue stems from how syscall events are initialized. Syscall tracepoints require the global reference count (sys_tracepoint_refcount) to transition from 0 to 1 to trigger the registration of the syscall work (TIF_SYSCALL_TRACEPOINT) for tasks, including the init process (pid 1). The current implementation of early_enable_events() with disable_first=true used an interleaved sequence of "Disable A -> Enable A -> Disable B -> Enable B". If multiple syscalls are enabled, the refcount never drops to zero, preventing the 0->1 transition that triggers actual registration. Fix this by splitting early_enable_events() into two distinct phases: 1. Disable all events specified in the buffer. 2. Enable all events specified in the buffer. This ensures the refcount hits zero before re-enabling, allowing syscall events to be properly activated during early boot. The code is also refactored to use a helper function to avoid logic duplication between the disable and enable phases. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260224023544.1250787-1-hehuiwen@kylinos.cn Fixes: ce1039bd3a89 ("tracing: Fix enabling of syscall events on the command line") Signed-off-by: Huiwen He <hehuiwen@kylinos.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-18bpf: Forget ranges when refining tnum after JSETPaul Chaignon1-0/+4
commit 6279846b9b2532e1b04559ef8bd0dec049f29383 upstream. Syzbot reported a kernel warning due to a range invariant violation on the following BPF program. 0: call bpf_get_netns_cookie 1: if r0 == 0 goto <exit> 2: if r0 & Oxffffffff goto <exit> The issue is on the path where we fall through both jumps. That path is unreachable at runtime: after insn 1, we know r0 != 0, but with the sign extension on the jset, we would only fallthrough insn 2 if r0 == 0. Unfortunately, is_branch_taken() isn't currently able to figure this out, so the verifier walks all branches. The verifier then refines the register bounds using the second condition and we end up with inconsistent bounds on this unreachable path: 1: if r0 == 0 goto <exit> r0: u64=[0x1, 0xffffffffffffffff] var_off=(0, 0xffffffffffffffff) 2: if r0 & 0xffffffff goto <exit> r0 before reg_bounds_sync: u64=[0x1, 0xffffffffffffffff] var_off=(0, 0) r0 after reg_bounds_sync: u64=[0x1, 0] var_off=(0, 0) Improving the range refinement for JSET to cover all cases is tricky. We also don't expect many users to rely on JSET given LLVM doesn't generate those instructions. So instead of improving the range refinement for JSETs, Eduard suggested we forget the ranges whenever we're narrowing tnums after a JSET. This patch implements that approach. Reported-by: syzbot+c711ce17dd78e5d4fdcf@syzkaller.appspotmail.com Suggested-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Link: https://lore.kernel.org/r/9d4fd6432a095d281f815770608fdcd16028ce0b.1752171365.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> [ shung-hsi.yu: no detection or kernel warning for invariant violation before 6.8, but the same umin=1,umax=0 state can occur when jset is preceed by r0 < 1. Changes were made to adapt to older range refinement logic before commit 67420501e868 ("bpf: generalize reg_set_min_max() to handle non-const register comparisons"). ] Signed-off-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-18tracing: Fix trace_buf_size= cmdline parameter with sizes >= 2GCalvin Owens1-3/+3
commit d008ba8be8984760e36d7dcd4adbd5a41a645708 upstream. Some of the sizing logic through tracer_alloc_buffers() uses int internally, causing unexpected behavior if the user passes a value that does not fit in an int (on my x86 machine, the result is uselessly tiny buffers). Fix by plumbing the parameter's real type (unsigned long) through to the ring buffer allocation functions, which already use unsigned long. It has always been possible to create larger ring buffers via the sysfs interface: this only affects the cmdline parameter. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/bff42a4288aada08bdf74da3f5b67a2c28b761f8.1772852067.git.calvin@wbinvd.org Fixes: 73c5162aa362 ("tracing: keep ring buffer to minimum size till used") Signed-off-by: Calvin Owens <calvin@wbinvd.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-18time/jiffies: Mark jiffies_64_to_clock_t() notraceSteven Rostedt1-1/+1
[ Upstream commit 755a648e78f12574482d4698d877375793867fa1 ] The trace_clock_jiffies() function that handles the "uptime" clock for tracing calls jiffies_64_to_clock_t(). This causes the function tracer to constantly recurse when the tracing clock is set to "uptime". Mark it notrace to prevent unnecessary recursion when using the "uptime" clock. Fixes: 58d4e21e50ff3 ("tracing: Fix wraparound problems in "uptime" trace clock") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260306212403.72270bb2@robin Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-04-18time: add kernel-doc in time.cRandy Dunlap1-11/+158
[ Upstream commit 67b3f564cb1e769ef8e45835129a4866152fcfdb ] Add kernel-doc for all APIs that do not already have it. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: John Stultz <jstultz@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Stephen Boyd <sboyd@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: linux-doc@vger.kernel.org Acked-by: John Stultz <jstultz@google.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net> Link: https://lore.kernel.org/r/20230704052405.5089-3-rdunlap@infradead.org Stable-dep-of: 755a648e78f1 ("time/jiffies: Mark jiffies_64_to_clock_t() notrace") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-04-18cgroup: fix race between task migration and iterationQingye Zhao1-0/+1
commit 5ee01f1a7343d6a3547b6802ca2d4cdce0edacb1 upstream. When a task is migrated out of a css_set, cgroup_migrate_add_task() first moves it from cset->tasks to cset->mg_tasks via: list_move_tail(&task->cg_list, &cset->mg_tasks); If a css_task_iter currently has it->task_pos pointing to this task, css_set_move_task() calls css_task_iter_skip() to keep the iterator valid. However, since the task has already been moved to ->mg_tasks, the iterator is advanced relative to the mg_tasks list instead of the original tasks list. As a result, remaining tasks on cset->tasks, as well as tasks queued on cset->mg_tasks, can be skipped by iteration. Fix this by calling css_set_skip_task_iters() before unlinking task->cg_list from cset->tasks. This advances all active iterators to the next task on cset->tasks, so iteration continues correctly even when a task is concurrently being migrated. This race is hard to hit in practice without instrumentation, but it can be reproduced by artificially slowing down cgroup_procs_show(). For example, on an Android device a temporary /sys/kernel/cgroup/cgroup_test knob can be added to inject a delay into cgroup_procs_show(), and then: 1) Spawn three long-running tasks (PIDs 101, 102, 103). 2) Create a test cgroup and move the tasks into it. 3) Enable a large delay via /sys/kernel/cgroup/cgroup_test. 4) In one shell, read cgroup.procs from the test cgroup. 5) Within the delay window, in another shell migrate PID 102 by writing it to a different cgroup.procs file. Under this setup, cgroup.procs can intermittently show only PID 101 while skipping PID 103. Once the migration completes, reading the file again shows all tasks as expected. Note that this change does not allow removing the existing css_set_skip_task_iters() call in css_set_move_task(). The new call in cgroup_migrate_add_task() only handles iterators that are racing with migration while the task is still on cset->tasks. Iterators may also start after the task has been moved to cset->mg_tasks. If we dropped css_set_skip_task_iters() from css_set_move_task(), such iterators could keep task_pos pointing to a migrating task, causing css_task_iter_advance() to malfunction on the destination css_set, up to and including crashes or infinite loops. The race window between migration and iteration is very small, and css_task_iter is not on a hot path. In the worst case, when an iterator is positioned on the first thread of the migrating process, cgroup_migrate_add_task() may have to skip multiple tasks via css_set_skip_task_iters(). However, this only happens when migration and iteration actually race, so the performance impact is negligible compared to the correctness fix provided here. Fixes: b636fd38dc40 ("cgroup: Implement css_task_iter_skip()") Cc: stable@vger.kernel.org # v5.2+ Signed-off-by: Qingye Zhao <zhaoqingye@honor.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-18sched: idle: Make skipping governor callbacks more consistentRafael J. Wysocki1-1/+10
[ Upstream commit d557640e4ce589a24dca5ca7ce3b9680f471325f ] If the cpuidle governor .select() callback is skipped because there is only one idle state in the cpuidle driver, the .reflect() callback should be skipped as well, at least for consistency (if not for correctness), so do it. Fixes: e5c9ffc6ae1b ("cpuidle: Skip governor when only one idle state is available") Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://patch.msgid.link/12857700.O9o76ZdvQC@rafael.j.wysocki Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-04-18unshare: fix unshare_fs() handlingAl Viro1-1/+1
[ Upstream commit 6c4b2243cb6c0755159bd567130d5e12e7b10d9f ] There's an unpleasant corner case in unshare(2), when we have a CLONE_NEWNS in flags and current->fs hadn't been shared at all; in that case copy_mnt_ns() gets passed current->fs instead of a private copy, which causes interesting warts in proof of correctness] > I guess if private means fs->users == 1, the condition could still be true. Unfortunately, it's worse than just a convoluted proof of correctness. Consider the case when we have CLONE_NEWCGROUP in addition to CLONE_NEWNS (and current->fs->users == 1). We pass current->fs to copy_mnt_ns(), all right. Suppose it succeeds and flips current->fs->{pwd,root} to corresponding locations in the new namespace. Now we proceed to copy_cgroup_ns(), which fails (e.g. with -ENOMEM). We call put_mnt_ns() on the namespace created by copy_mnt_ns(), it's destroyed and its mount tree is dissolved, but... current->fs->root and current->fs->pwd are both left pointing to now detached mounts. They are pinning those, so it's not a UAF, but it leaves the calling process with unshare(2) failing with -ENOMEM _and_ leaving it with pwd and root on detached isolated mounts. The last part is clearly a bug. There is other fun related to that mess (races with pivot_root(), including the one between pivot_root() and fork(), of all things), but this one is easy to isolate and fix - treat CLONE_NEWNS as "allocate a new fs_struct even if it hadn't been shared in the first place". Sure, we could go for something like "if both CLONE_NEWNS *and* one of the things that might end up failing after copy_mnt_ns() call in create_new_namespaces() are set, force allocation of new fs_struct", but let's keep it simple - the cost of copy_fs_struct() is trivial. Another benefit is that copy_mnt_ns() with CLONE_NEWNS *always* gets a freshly allocated fs_struct, yet to be attached to anything. That seriously simplifies the analysis... FWIW, that bug had been there since the introduction of unshare(2) ;-/ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://patch.msgid.link/20260207082524.GE3183987@ZenIV Tested-by: Waiman Long <longman@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04kexec: derive purgatory entry from symbolLi Chen1-57/+74
[ Upstream commit 480e1d5c64bb14441f79f2eb9421d5e26f91ea3d ] kexec_load_purgatory() derives image->start by locating e_entry inside an SHF_EXECINSTR section. If the purgatory object contains multiple executable sections with overlapping sh_addr, the entrypoint check can match more than once and trigger a WARN. Derive the entry section from the purgatory_start symbol when present and compute image->start from its final placement. Keep the existing e_entry fallback for purgatories that do not expose the symbol. WARNING: kernel/kexec_file.c:1009 at kexec_load_purgatory+0x395/0x3c0, CPU#10: kexec/1784 Call Trace: <TASK> bzImage64_load+0x133/0xa00 __do_sys_kexec_file_load+0x2b3/0x5c0 do_syscall_64+0x81/0x610 entry_SYSCALL_64_after_hwframe+0x76/0x7e [me@linux.beauty: move helper to avoid forward declaration, per Baoquan] Link: https://lkml.kernel.org/r/20260128043511.316860-1-me@linux.beauty Link: https://lkml.kernel.org/r/20260120124005.148381-1-me@linux.beauty Fixes: 8652d44f466a ("kexec: support purgatories with .text.hot sections") Signed-off-by: Li Chen <me@linux.beauty> Acked-by: Baoquan He <bhe@redhat.com> Cc: Alexander Graf <graf@amazon.com> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Li Chen <me@linux.beauty> Cc: Philipp Rudo <prudo@redhat.com> Cc: Ricardo Ribalda Delgado <ribalda@chromium.org> Cc: Ross Zwisler <zwisler@google.com> Cc: Sourabh Jain <sourabhjain@linux.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04bpf: verifier improvement in 32bit shift sign extension patternCupertino Miranda1-11/+7
[ Upstream commit d18dec4b8990048ce75f0ece32bb96b3fbd3f422 ] This patch improves the verifier to correctly compute bounds for sign extension compiler pattern composed of left shift by 32bits followed by a sign right shift by 32bits. Pattern in the verifier was limitted to positive value bounds and would reset bound computation for negative values. New code allows both positive and negative values for sign extension without compromising bound computation and verifier to pass. This change is required by GCC which generate such pattern, and was detected in the context of systemd, as described in the following GCC bugzilla: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119731 Three new tests were added in verifier_subreg.c. Signed-off-by: Cupertino Miranda <cupertino.miranda@oracle.com> Signed-off-by: Andrew Pinski <andrew.pinski@oss.qualcomm.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Cc: David Faust <david.faust@oracle.com> Cc: Jose Marchesi <jose.marchesi@oracle.com> Cc: Elena Zannoni <elena.zannoni@oracle.com> Link: https://lore.kernel.org/r/20251202180220.11128-2-cupertino.miranda@oracle.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04tracing: Remove duplicate ENABLE_EVENT_STR and DISABLE_EVENT_STR macrosSteven Rostedt1-5/+0
[ Upstream commit 9df0e49c5b9b8d051529be9994e4f92f2d20be6f ] The macros ENABLE_EVENT_STR and DISABLE_EVENT_STR were added to trace.h so that more than one file can have access to them, but was never removed from their original location. Remove the duplicates. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Tom Zanussi <zanussi@kernel.org> Link: https://patch.msgid.link/20260126130037.4ba201f9@gandalf.local.home Fixes: d0bad49bb0a09 ("tracing: Add enable_hist/disable_hist triggers") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04ucount: check for CAP_SYS_RESOURCE using ns_capable_noaudit()Ondrej Mosnacek1-1/+1
[ Upstream commit 0895a000e4fff9e950a7894210db45973e485c35 ] The user.* sysctls implement the ctl_table_root::permissions hook and they override the file access mode based on the CAP_SYS_RESOURCE capability (at most rwx if capable, at most r-- if not). The capability is being checked unconditionally, so if an LSM denies the capability, an audit record may be logged even when access is in fact granted. Given the logic in the set_permissions() function in kernel/ucount.c and the unfortunate way the permission checking is implemented, it doesn't seem viable to avoid false positive denials by deferring the capability check. Thus, do the same as in net_ctl_permissions() (net/sysctl_net.c) - switch from ns_capable() to ns_capable_noaudit(), so that the check never logs an audit record. Link: https://lkml.kernel.org/r/20260122140745.239428-1-omosnace@redhat.com Fixes: dbec28460a89 ("userns: Add per user namespace sysctls.") Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com> Reviewed-by: Paul Moore <paul@paul-moore.com> Acked-by: Serge Hallyn <serge@hallyn.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Alexey Gladkov <legion@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04sched/rt: Skip currently executing CPU in rto_next_cpu()Chen Jinghuang1-0/+5
[ Upstream commit 94894c9c477e53bcea052e075c53f89df3d2a33e ] CPU0 becomes overloaded when hosting a CPU-bound RT task, a non-CPU-bound RT task, and a CFS task stuck in kernel space. When other CPUs switch from RT to non-RT tasks, RT load balancing (LB) is triggered; with HAVE_RT_PUSH_IPI enabled, they send IPIs to CPU0 to drive the execution of rto_push_irq_work_func. During push_rt_task on CPU0, if next_task->prio < rq->donor->prio, resched_curr() sets NEED_RESCHED and after the push operation completes, CPU0 calls rto_next_cpu(). Since only CPU0 is overloaded in this scenario, rto_next_cpu() should ideally return -1 (no further IPI needed). However, multiple CPUs invoking tell_cpu_to_push() during LB increments rd->rto_loop_next. Even when rd->rto_cpu is set to -1, the mismatch between rd->rto_loop and rd->rto_loop_next forces rto_next_cpu() to restart its search from -1. With CPU0 remaining overloaded (satisfying rt_nr_migratory && rt_nr_total > 1), it gets reselected, causing CPU0 to queue irq_work to itself and send self-IPIs repeatedly. As long as CPU0 stays overloaded and other CPUs run pull_rt_tasks(), it falls into an infinite self-IPI loop, which triggers a CPU hardlockup due to continuous self-interrupts. The trigging scenario is as follows: cpu0 cpu1 cpu2 pull_rt_task tell_cpu_to_push <------------irq_work_queue_on rto_push_irq_work_func push_rt_task resched_curr(rq) pull_rt_task rto_next_cpu tell_cpu_to_push <-------------------------- atomic_inc(rto_loop_next) rd->rto_loop != next rto_next_cpu irq_work_queue_on rto_push_irq_work_func Fix redundant self-IPI by filtering the initiating CPU in rto_next_cpu(). This solution has been verified to effectively eliminate spurious self-IPIs and prevent CPU hardlockup scenarios. Fixes: 4bdced5c9a29 ("sched/rt: Simplify the IPI based RT balancing logic") Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org> Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://patch.msgid.link/20260122012533.673768-1-chenjinghuang2@huawei.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04hrtimer: Fix trace oddityThomas Gleixner1-1/+1
[ Upstream commit 5d6446f409da00e5a389125ddb5ce09f5bc404c9 ] It turns out that __run_hrtimer() will trace like: <idle>-0 [032] d.h2. 20705.474563: hrtimer_cancel: hrtimer=0xff2db8f77f8226e8 <idle>-0 [032] d.h1. 20705.474563: hrtimer_expire_entry: hrtimer=0xff2db8f77f8226e8 now=20699452001850 function=tick_nohz_handler/0x0 Which is a bit nonsensical, the timer doesn't get canceled on expiration. The cause is the use of the incorrect debug helper. Fixes: c6a2a1770245 ("hrtimer: Add tracepoint for hrtimers") Reported-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260121143208.219595606@infradead.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-02-11