summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2024-11-10Merge tag 'mm-hotfixes-stable-2024-11-09-22-40' of ↵Linus Torvalds2-5/+7
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "20 hotfixes, 14 of which are cc:stable. Three affect DAMON. Lorenzo's five-patch series to address the mmap_region error handling is here also. Apart from that, various singletons" * tag 'mm-hotfixes-stable-2024-11-09-22-40' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mailmap: add entry for Thorsten Blum ocfs2: remove entry once instead of null-ptr-dereference in ocfs2_xa_remove() signal: restore the override_rlimit logic fs/proc: fix compile warning about variable 'vmcore_mmap_ops' ucounts: fix counter leak in inc_rlimit_get_ucounts() selftests: hugetlb_dio: check for initial conditions to skip in the start mm: fix docs for the kernel parameter ``thp_anon=`` mm/damon/core: avoid overflow in damon_feed_loop_next_input() mm/damon/core: handle zero schemes apply interval mm/damon/core: handle zero {aggregation,ops_update} intervals mm/mlock: set the correct prev on failure objpool: fix to make percpu slot allocation more robust mm/page_alloc: keep track of free highatomic mm: resolve faulty mmap_region() error path behaviour mm: refactor arch_calc_vm_flag_bits() and arm64 MTE handling mm: refactor map_deny_write_exec() mm: unconditionally close VMAs on error mm: avoid unsafe VMA hook invocation when error arises on mmap hook mm/thp: fix deferred split unqueue naming and locking mm/thp: fix deferred split queue not partially_mapped
2024-11-07signal: restore the override_rlimit logicRoman Gushchin2-3/+6
Prior to commit d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts") UCOUNT_RLIMIT_SIGPENDING rlimit was not enforced for a class of signals. However now it's enforced unconditionally, even if override_rlimit is set. This behavior change caused production issues. For example, if the limit is reached and a process receives a SIGSEGV signal, sigqueue_alloc fails to allocate the necessary resources for the signal delivery, preventing the signal from being delivered with siginfo. This prevents the process from correctly identifying the fault address and handling the error. From the user-space perspective, applications are unaware that the limit has been reached and that the siginfo is effectively 'corrupted'. This can lead to unpredictable behavior and crashes, as we observed with java applications. Fix this by passing override_rlimit into inc_rlimit_get_ucounts() and skip the comparison to max there if override_rlimit is set. This effectively restores the old behavior. Link: https://lkml.kernel.org/r/20241104195419.3962584-1-roman.gushchin@linux.dev Fixes: d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts") Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Co-developed-by: Andrei Vagin <avagin@google.com> Signed-off-by: Andrei Vagin <avagin@google.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Alexey Gladkov <legion@kernel.org> Cc: Kees Cook <kees@kernel.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-07ucounts: fix counter leak in inc_rlimit_get_ucounts()Andrei Vagin1-2/+1
The inc_rlimit_get_ucounts() increments the specified rlimit counter and then checks its limit. If the value exceeds the limit, the function returns an error without decrementing the counter. Link: https://lkml.kernel.org/r/20241101191940.3211128-1-roman.gushchin@linux.dev Fixes: 15bc01effefe ("ucounts: Fix signal ucount refcounting") Signed-off-by: Andrei Vagin <avagin@google.com> Co-developed-by: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Tested-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Alexey Gladkov <legion@kernel.org> Cc: Kees Cook <kees@kernel.org> Cc: Andrei Vagin <avagin@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Alexey Gladkov <legion@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06Merge tag 'tracefs-v6.12-rc6' of ↵Linus Torvalds1-0/+4
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracefs fixes from Steven Rostedt: "Fix tracefs mount options. Commit 78ff64081949 ("vfs: Convert tracefs to use the new mount API") broke the gid setting when set by fstab or other mount utility. It is ignored when it is set. Fix the code so that it recognises the option again and will honor the settings on mount at boot up. Update the internal documentation and create a selftest to make sure it doesn't break again in the future" * tag 'tracefs-v6.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing/selftests: Add tracefs mount options test tracing: Document tracefs gid mount option tracing: Fix tracefs mount options
2024-11-03Merge tag 'timers-urgent-2024-11-03' of ↵Linus Torvalds1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Thomas Gleixner: "A single fix for posix CPU timers. When a thread is cloned, the posix CPU timers are not inherited. If the parent has a CPU timer armed the corresponding tick dependency in the tasks tick_dep_mask is set and copied to the new thread, which means the new thread and all decendants will prevent the system to go into full NOHZ operation. Clear the tick dependency mask in copy_process() to fix this" * tag 'timers-urgent-2024-11-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: posix-cpu-timers: Clear TICK_DEP_BIT_POSIX_TIMER on clone
2024-11-03Merge tag 'sched-urgent-2024-11-03' of ↵Linus Torvalds6-22/+69
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Thomas Gleixner: - Plug a race between pick_next_task_fair() and try_to_wake_up() where both try to write to the same task, even though both paths hold a runqueue lock, but obviously from different runqueues. The problem is that the store to task::on_rq in __block_task() is visible to try_to_wake_up() which assumes that the task is not queued. Both sides then operate on the same task. Cure it by rearranging __block_task() so the the store to task::on_rq is the last operation on the task. - Prevent a potential NULL pointer dereference in task_numa_work() task_numa_work() iterates the VMAs of a process. A concurrent unmap of the address space can result in a NULL pointer return from vma_next() which is unchecked. Add the missing NULL pointer check to prevent this. - Operate on the correct scheduler policy in task_should_scx() task_should_scx() returns true when a task should be handled by sched EXT. It checks the tasks scheduling policy. This fails when the check is done before a policy has been set. Cure it by handing the policy into task_should_scx() so it operates on the requested value. - Add the missing handling of sched EXT in the delayed dequeue mechanism. This was simply forgotten. * tag 'sched-urgent-2024-11-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/ext: Fix scx vs sched_delayed sched: Pass correct scheduling policy to __setscheduler_class sched/numa: Fix the potential null pointer dereference in task_numa_work() sched: Fix pick_next_task_fair() vs try_to_wake_up() race
2024-11-03Merge tag 'perf-urgent-2024-11-03' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fix from Thomas Gleixner: "perf_event_clear_cpumask() uses list_for_each_entry_rcu() without being in a RCU read side critical section, which triggers a 'suspicious RCU usage' warning. It turns out that the list walk does not be RCU protected because the write side lock is held in this context. Change it to a regular list walk" * tag 'perf-urgent-2024-11-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf: Fix missing RCU reader protection in perf_event_clear_cpumask()
2024-11-03Merge tag 'irq-urgent-2024-11-03' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Thomas Gleixner: - Fix an off-by-one error in the failure path of msi_domain_alloc(), which causes the cleanup loop to terminate early and leaking the first allocated interrupt. - Handle a corner case in GIC-V4 versus a lazily mapped Virtual Processing Element (VPE). If the VPE has not been mapped because the guest has not yet emitted a mapping command, then the set_affinity() callback returns an error code, which causes the vCPU management to fail. Return success in this case without touching the hardware. This will be done later when the guest issues the mapping command. * tag 'irq-urgent-2024-11-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip/gic-v4: Correctly deal with set_affinity on lazily-mapped VPEs genirq/msi: Fix off-by-one error in msi_domain_alloc()
2024-11-01tracing: Document tracefs gid mount optionKalesh Singh1-0/+4
Commit ee7f3666995d ("tracefs: Have new files inherit the ownership of their parent") and commit 48b27b6b5191 ("tracefs: Set all files to the same group ownership as the mount option") introduced a new gid mount option that allows specifying a group to apply to all entries in tracefs. Document this in the tracing readme. Cc: Eric Sandeen <sandeen@redhat.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Ali Zahraee <ahzahraee@gmail.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Howells <dhowells@redhat.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Link: https://lore.kernel.org/20241030171928.4168869-3-kaleshsingh@google.com Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-10-31Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfLinus Torvalds4-27/+66
Pull bpf fixes from Daniel Borkmann: - Fix BPF verifier to force a checkpoint when the program's jump history becomes too long (Eduard Zingerman) - Add several fixes to the BPF bits iterator addressing issues like memory leaks and overflow problems (Hou Tao) - Fix an out-of-bounds write in trie_get_next_key (Byeonguk Jeong) - Fix BPF test infra's LIVE_FRAME frame update after a page has been recycled (Toke Høiland-Jørgensen) - Fix BPF verifier and undo the 40-bytes extra stack space for bpf_fastcall patterns due to various bugs (Eduard Zingerman) - Fix a BPF sockmap race condition which could trigger a NULL pointer dereference in sock_map_link_update_prog (Cong Wang) - Fix tcp_bpf_recvmsg_parser to retrieve seq_copied from tcp_sk under the socket lock (Jiayuan Chen) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf, test_run: Fix LIVE_FRAME frame update after a page has been recycled selftests/bpf: Add three test cases for bits_iter bpf: Use __u64 to save the bits in bits iterator bpf: Check the validity of nr_words in bpf_iter_bits_new() bpf: Add bpf_mem_alloc_check_size() helper bpf: Free dynamically allocated bits in bpf_iter_bits_destroy() bpf: disallow 40-bytes extra stack for bpf_fastcall patterns selftests/bpf: Add test for trie_get_next_key() bpf: Fix out-of-bounds write in trie_get_next_key() selftests/bpf: Test with a very short loop bpf: Force checkpoint when jmp history is too long bpf: fix filed access without lock sock_map: fix a NULL pointer dereference in sock_map_link_update_prog()
2024-10-30sched/ext: Fix scx vs sched_delayedPeter Zijlstra1-2/+12
Commit 98442f0ccd82 ("sched: Fix delayed_dequeue vs switched_from_fair()") forgot about scx :/ Fixes: 98442f0ccd82 ("sched: Fix delayed_dequeue vs switched_from_fair()") Reported-by: Tejun Heo <tj@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lkml.kernel.org/r/20241030104934.GK14555@noisy.programming.kicks-ass.net
2024-10-30bpf: Use __u64 to save the bits in bits iteratorHou Tao1-3/+30
On 32-bit hosts (e.g., arm32), when a bpf program passes a u64 to bpf_iter_bits_new(), bpf_iter_bits_new() will use bits_copy to store the content of the u64. However, bits_copy is only 4 bytes, leading to stack corruption. The straightforward solution would be to replace u64 with unsigned long in bpf_iter_bits_new(). However, this introduces confusion and problems for 32-bit hosts because the size of ulong in bpf program is 8 bytes, but it is treated as 4-bytes after passed to bpf_iter_bits_new(). Fix it by changing the type of both bits and bit_count from unsigned long to u64. However, the change is not enough. The main reason is that bpf_iter_bits_next() uses find_next_bit() to find the next bit and the pointer passed to find_next_bit() is an unsigned long pointer instead of a u64 pointer. For 32-bit little-endian host, it is fine but it is not the case for 32-bit big-endian host. Because under 32-bit big-endian host, the first iterated unsigned long will be the bits 32-63 of the u64 instead of the expected bits 0-31. Therefore, in addition to changing the type, swap the two unsigned longs within the u64 for 32-bit big-endian host. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241030100516.3633640-5-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-30bpf: Check the validity of nr_words in bpf_iter_bits_new()Hou Tao1-1/+9
Check the validity of nr_words in bpf_iter_bits_new(). Without this check, when multiplication overflow occurs for nr_bits (e.g., when nr_words = 0x0400-0001, nr_bits becomes 64), stack corruption may occur due to bpf_probe_read_kernel_common(..., nr_bytes = 0x2000-0008). Fix it by limiting the maximum value of nr_words to 511. The value is derived from the current implementation of BPF memory allocator. To ensure compatibility if the BPF memory allocator's size limitation changes in the future, use the helper bpf_mem_alloc_check_size() to check whether nr_bytes is too larger. And return -E2BIG instead of -ENOMEM for oversized nr_bytes. Fixes: 4665415975b0 ("bpf: Add bits iterator") Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241030100516.3633640-4-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-30bpf: Add bpf_mem_alloc_check_size() helperHou Tao1-1/+13
Introduce bpf_mem_alloc_check_size() to check whether the allocation size exceeds the limitation for the kmalloc-equivalent allocator. The upper limit for percpu allocation is LLIST_NODE_SZ bytes larger than non-percpu allocation, so a percpu argument is added to the helper. The helper will be used in the following patch to check whether the size parameter passed to bpf_mem_alloc() is too big. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241030100516.3633640-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-30bpf: Free dynamically allocated bits in bpf_iter_bits_destroy()Hou Tao1-6/+5
bpf_iter_bits_destroy() uses "kit->nr_bits <= 64" to check whether the bits are dynamically allocated. However, the check is incorrect and may cause a kmemleak as shown below: unreferenced object 0xffff88812628c8c0 (size 32): comm "swapper/0", pid 1, jiffies 4294727320 hex dump (first 32 bytes): b0 c1 55 f5 81 88 ff ff f0 f0 f0 f0 f0 f0 f0 f0 ..U........... f0 f0 f0 f0 f0 f0 f0 f0 00 00 00 00 00 00 00 00 .............. backtrace (crc 781e32cc): [<00000000c452b4ab>] kmemleak_alloc+0x4b/0x80 [<0000000004e09f80>] __kmalloc_node_noprof+0x480/0x5c0 [<00000000597124d6>] __alloc.isra.0+0x89/0xb0 [<000000004ebfffcd>] alloc_bulk+0x2af/0x720 [<00000000d9c10145>] prefill_mem_cache+0x7f/0xb0 [<00000000ff9738ff>] bpf_mem_alloc_init+0x3e2/0x610 [<000000008b616eac>] bpf_global_ma_init+0x19/0x30 [<00000000fc473efc>] do_one_initcall+0xd3/0x3c0 [<00000000ec81498c>] kernel_init_freeable+0x66a/0x940 [<00000000b119f72f>] kernel_init+0x20/0x160 [<00000000f11ac9a7>] ret_from_fork+0x3c/0x70 [<0000000004671da4>] ret_from_fork_asm+0x1a/0x30 That is because nr_bits will be set as zero in bpf_iter_bits_next() after all bits have been iterated. Fix the issue by setting kit->bit to kit->nr_bits instead of setting kit->nr_bits to zero when the iteration completes in bpf_iter_bits_next(). In addition, use "!nr_bits || bits >= nr_bits" to check whether the iteration is complete and still use "nr_bits > 64" to indicate whether bits are dynamically allocated. The "!nr_bits" check is necessary because bpf_iter_bits_new() may fail before setting kit->nr_bits, and this condition will stop the iteration early instead of accessing the zeroed or freed kit->bits. Considering the initial value of kit->bits is -1 and the type of kit->nr_bits is unsigned int, change the type of kit->nr_bits to int. The potential overflow problem will be handled in the following patch. Fixes: 4665415975b0 ("bpf: Add bits iterator") Acked-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241030100516.3633640-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-29bpf: disallow 40-bytes extra stack for bpf_fastcall patternsEduard Zingerman1-12/+2
Hou Tao reported an issue with bpf_fastcall patterns allowing extra stack space above MAX_BPF_STACK limit. This extra stack allowance is not integrated properly with the following verifier parts: - backtracking logic still assumes that stack can't exceed MAX_BPF_STACK; - bpf_verifier_env->scratched_stack_slots assumes only 64 slots are available. Here is an example of an issue with precision tracking (note stack slot -8 tracked as precise instead of -520): 0: (b7) r1 = 42 ; R1_w=42 1: (b7) r2 = 42 ; R2_w=42 2: (7b) *(u64 *)(r10 -512) = r1 ; R1_w=42 R10=fp0 fp-512_w=42 3: (7b) *(u64 *)(r10 -520) = r2 ; R2_w=42 R10=fp0 fp-520_w=42 4: (85) call bpf_get_smp_processor_id#8 ; R0_w=scalar(...) 5: (79) r2 = *(u64 *)(r10 -520) ; R2_w=42 R10=fp0 fp-520_w=42 6: (79) r1 = *(u64 *)(r10 -512) ; R1_w=42 R10=fp0 fp-512_w=42 7: (bf) r3 = r10 ; R3_w=fp0 R10=fp0 8: (0f) r3 += r2 mark_precise: frame0: last_idx 8 first_idx 0 subseq_idx -1 mark_precise: frame0: regs=r2 stack= before 7: (bf) r3 = r10 mark_precise: frame0: regs=r2 stack= before 6: (79) r1 = *(u64 *)(r10 -512) mark_precise: frame0: regs=r2 stack= before 5: (79) r2 = *(u64 *)(r10 -520) mark_precise: frame0: regs= stack=-8 before 4: (85) call bpf_get_smp_processor_id#8 mark_precise: frame0: regs= stack=-8 before 3: (7b) *(u64 *)(r10 -520) = r2 mark_precise: frame0: regs=r2 stack= before 2: (7b) *(u64 *)(r10 -512) = r1 mark_precise: frame0: regs=r2 stack= before 1: (b7) r2 = 42 9: R2_w=42 R3_w=fp42 9: (95) exit This patch disables the additional allowance for the moment. Also, two test cases are removed: - bpf_fastcall_max_stack_ok: it fails w/o additional stack allowance; - bpf_fastcall_max_stack_fail: this test is no longer necessary, stack size follows regular rules, pattern invalidation is checked by other test cases. Reported-by: Hou Tao <houtao@huaweicloud.com> Closes: https://lore.kernel.org/bpf/20241023022752.172005-1-houtao@huaweicloud.com/ Fixes: 5b5f51bff1b6 ("bpf: no_caller_saved_registers attribute for helper calls") Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Tested-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241029193911.1575719-1-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-29Merge tag 'cgroup-for-6.12-rc5-fixes' of ↵Linus Torvalds2-3/+20
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: - cgroup_bpf_release_fn() could saturate system_wq with cgrp->bpf.release_work which can then form a circular dependency leading to deadlocks. Fix by using a dedicated workqueue. The system_wq's max concurrency limit is being increased separately. - Fix theoretical off-by-one bug when enforcing max cgroup hierarchy depth * tag 'cgroup-for-6.12-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: Fix potential overflow issue when checking max_depth cgroup/bpf: use a dedicated workqueue for cgroup bpf destruction
2024-10-29Merge tag 'sched_ext-for-6.12-rc5-fixes' of ↵Linus Torvalds1-12/+17
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: - Instances of scx_ops_bypass() could race each other leading to misbehavior. Fix by protecting the operation with a spinlock. - selftest and userspace header fixes * tag 'sched_ext-for-6.12-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Fix enq_last_no_enq_fails selftest sched_ext: Make cast_mask() inline scx: Fix raciness in scx_ops_bypass() scx: Fix exit selftest to use custom DSQ sched_ext: Fix function pointer type mismatches in BPF selftests selftests/sched_ext: add order-only dependency of runner.o on BPFOBJ
2024-10-29bpf: Fix out-of-bounds write in trie_get_next_key()Byeonguk Jeong1-1/+1
trie_get_next_key() allocates a node stack with size trie->max_prefixlen, while it writes (trie->max_prefixlen + 1) nodes to the stack when it has full paths from the root to leaves. For example, consider a trie with max_prefixlen is 8, and the nodes with key 0x00/0, 0x00/1, 0x00/2, ... 0x00/8 inserted. Subsequent calls to trie_get_next_key with _key with .prefixlen = 8 make 9 nodes be written on the node stack with size 8. Fixes: b471f2f1de8b ("bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE map") Signed-off-by: Byeonguk Jeong <jungbu2855@gmail.com> Reviewed-by: Toke Høiland-Jørgensen <toke@kernel.org> Tested-by: Hou Tao <houtao1@huawei.com> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/Zxx384ZfdlFYnz6J@localhost.localdomain Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-29bpf: Force checkpoint when jmp history is too longEduard Zingerman1-3/+6
A specifically crafted program might trick verifier into growing very long jump history within a single bpf_verifier_state instance. Very long jump history makes mark_chain_precision() unreasonably slow, especially in case if verifier processes a loop. Mitigate this by forcing new state in is_state_visited() in case if current state's jump history is too long. Use same constant as in `skip_inf_loop_check`, but multiply it by arbitrarily chosen value 2 to account for jump history containing not only information about jumps, but also information about stack access. For an example of problematic program consider the code below, w/o this patch the example is processed by verifier for ~15 minutes, before failing to allocate big-enough chunk for jmp_history. 0: r7 = *(u16 *)(r1 +0);" 1: r7 += 0x1ab064b9;" 2: if r7 & 0x702000 goto 1b; 3: r7 &= 0x1ee60e;" 4: r7 += r1;" 5: if r7 s> 0x37d2 goto +0;" 6: r0 = 0;" 7: exit;" Perf profiling shows that most of the time is spent in mark_chain_precision() ~95%. The easiest way to explain why this program causes problems is to apply the following patch: diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 0c216e71cec7..4b4823961abe 100644 \--- a/include/linux/bpf.h \+++ b/include/linux/bpf.h \@@ -1926,7 +1926,7 @@ struct bpf_array { }; }; -#define BPF_COMPLEXITY_LIMIT_INSNS 1000000 /* yes. 1M insns */ +#define BPF_COMPLEXITY_LIMIT_INSNS 256 /* yes. 1M insns */ #define MAX_TAIL_CALL_CNT 33 /* Maximum number of loops for bpf_loop and bpf_iter_num. diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index f514247ba8ba..75e88be3bb3e 100644 \--- a/kernel/bpf/verifier.c \+++ b/kernel/bpf/verifier.c \@@ -18024,8 +18024,13 @@ static int is_state_visited(struct bpf_verifier_env *env, int insn_idx) skip_inf_loop_check: if (!force_new_state && env->jmps_processed - env->prev_jmps_processed < 20 && - env->insn_processed - env->prev_insn_processed < 100) + env->insn_processed - env->prev_insn_processed < 100) { + verbose(env, "is_state_visited: suppressing checkpoint at %d, %d jmps processed, cur->jmp_history_cnt is %d\n", + env->insn_idx, + env->jmps_processed - env->prev_jmps_processed, + cur->jmp_history_cnt); add_new_state = false; + } goto miss; } /* If sl->state is a part of a loop and this loop's entry is a part of \@@ -18142,6 +18147,9 @@ static int is_state_visited(struct bpf_verifier_env *env, int insn_idx) if (!add_new_state) return 0; + verbose(env, "is_state_visited: new checkpoint at %d, resetting env->jmps_processed\n", + env->insn_idx); + /* There were no equivalent states, remember the current one. * Technically the current state is not proven to be safe yet, * but it will either reach outer most bpf_exit (which means it's safe) And observe verification log: ... is_state_visited: new checkpoint at 5, resetting env->jmps_processed 5: R1=ctx() R7=ctx(...) 5: (65) if r7 s> 0x37d2 goto pc+0 ; R7=ctx(...) 6: (b7) r0 = 0 ; R0_w=0 7: (95) exit from 5 to 6: R1=ctx() R7=ctx(...) R10=fp0 6: R1=ctx() R7=ctx(...) R10=fp0 6: (b7) r0 = 0 ; R0_w=0 7: (95) exit is_state_visited: suppressing checkpoint at 1, 3 jmps processed, cur->jmp_history_cnt is 74 from 2 to 1: R1=ctx() R7_w=scalar(...) R10=fp0 1: R1=ctx() R7_w=scalar(...) R10=fp0 1: (07) r7 += 447767737 is_state_visited: suppressing checkpoint at 2, 3 jmps processed, cur->jmp_history_cnt is 75 2: R7_w=scalar(...) 2: (45) if r7 & 0x702000 goto pc-2 ... mark_precise 152 steps for r7 ... 2: R7_w=scalar(...) is_state_visited: suppressing checkpoint at 1, 4 jmps processed, cur->jmp_history_cnt is 75 1: (07) r7 += 447767737 is_state_visited: suppressing checkpoint at 2, 4 jmps processed, cur->jmp_history_cnt is 76 2: R7_w=scalar(...) 2: (45) if r7 & 0x702000 goto pc-2 ... BPF program is too large. Processed 257 insn The log output shows that checkpoint at label (1) is never created, because it is suppressed by `skip_inf_loop_check` logic: a. When 'if' at (2) is processed it pushes a state with insn_idx (1) onto stack and proceeds to (3); b. At (5) checkpoint is created, and this resets env->{jmps,insns}_processed. c. Verification proceeds and reaches `exit`; d. State saved at step (a) is popped from stack and is_state_visited() considers if checkpoint needs to be added, but because env->{jmps,insns}_processed had been just reset at step (b) the `skip_inf_loop_check` logic forces `add_new_state` to false. e. Verifier proceeds with current state, which slowly accumulates more and more entries in the jump history. The accumulation of entries in the jump history is a problem because of two factors: - it eventually exhausts memory available for kmalloc() allocation; - mark_chain_precision() traverses the jump history of a state, meaning that if `r7` is marked precise, verifier would iterate ever growing jump history until parent state boundary is reached. (note: the log also shows a REG INVARIANTS VIOLATION warning upon jset processing, but that's another bug to fix). With this patch applied, the example above is rejected by verifier under 1s of time, reaching 1M instructions limit. The program is a simplified reproducer from syzbot report. Previous discussion could be found at [1]. The patch does not cause any changes in verification performance, when tested on selftests from veristat.cfg and cilium programs taken from [2]. [1] https://lore.kernel.org/bpf/20241009021254.2805446-1-eddyz87@gmail.com/ [2] https://github.com/anakryiko/cilium Changelog: - v1 -> v2: - moved patch to bpf tree; - moved force_new_state variable initialization after declaration and shortened the comment. v1: https://lore.kernel.org/bpf/20241018020307.1766906-1-eddyz87@gmail.com/ Fixes: 2589726d12a1 ("bpf: introduce bounded loops") Reported-by: syzbot+7e46cdef14bf496a3ab4@syzkaller.appspotmail.com Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20241029172641.1042523-1-eddyz87@gmail.com Closes: https://lore.kernel.org/bpf/670429f6.050a0220.49194.0517.GAE@google.com/
2024-10-29sched: Pass correct scheduling policy to __setscheduler_classAboorva Devarajan5-11/+11
Commit 98442f0ccd82 ("sched: Fix delayed_dequeue vs switched_from_fair()") overlooked that __setscheduler_prio(), now __setscheduler_class() relies on p->policy for task_should_scx(), and moved the call before __setscheduler_params() updates it, causing it to be using the old p->policy value. Resolve this by changing task_should_scx() to take the policy itself instead of a task pointer, such that __sched_setscheduler() can pass in the updated policy. Fixes: 98442f0ccd82 ("sched: Fix delayed_dequeue vs switched_from_fair()") Signed-off-by: Aboorva Devarajan <aboorvad@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tejun Heo <tj@kernel.org>
2024-10-28resource,kexec: walk_system_ram_res_rev must retain resource flagsGregory Price1-3/+1
walk_system_ram_res_rev() erroneously discards resource flags when passing the information to the callback. This causes systems with IORESOURCE_SYSRAM_DRIVER_MANAGED memory to have these resources selected during kexec to store kexec buffers if that memory happens to be at placed above normal system ram. This leads to undefined behavior after reboot. If the kexec buffer is never touched, nothing happens. If the kexec buffer is touched, it could lead to a crash (like below) or undefined behavior. Tested on a system with CXL memory expanders with driver managed memory, TPM enabled, and CONFIG_IMA_KEXEC=y. Adding printk's showed the flags were being discarded and as a result the check for IORESOURCE_SYSRAM_DRIVER_MANAGED passes. find_next_iomem_res: name(System RAM (kmem)) start(10000000000) end(1034fffffff) flags(83000200) locate_mem_hole_top_down: start(10000000000) end(1034fffffff) flags(0) [.] BUG: unable to handle page fault for address: ffff89834ffff000 [.] #PF: supervisor read access in kernel mode [.] #PF: error_code(0x0000) - not-present page [.] PGD c04c8bf067 P4D c04c8bf067 PUD c04c8be067 PMD 0 [.] Oops: 0000 [#1] SMP [.] RIP: 0010:ima_restore_measurement_list+0x95/0x4b0 [.] RSP: 0018:ffffc900000d3a80 EFLAGS: 00010286 [.] RAX: 0000000000001000 RBX: 0000000000000000 RCX: ffff89834ffff000 [.] RDX: 0000000000000018 RSI: ffff89834ffff000 RDI: ffff89834ffff018 [.] RBP: ffffc900000d3ba0 R08: 0000000000000020 R09: ffff888132b8a900 [.] R10: 4000000000000000 R11: 000000003a616d69 R12: 0000000000000000 [.] R13: ffffffff8404ac28 R14: 0000000000000000 R15: ffff89834ffff000 [.] FS: 0000000000000000(0000) GS:ffff893d44640000(0000) knlGS:0000000000000000 [.] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [.] ata5: SATA link down (SStatus 0 SControl 300) [.] CR2: ffff89834ffff000 CR3: 000001034d00f001 CR4: 0000000000770ef0 [.] PKRU: 55555554 [.] Call Trace: [.] <TASK> [.] ? __die+0x78/0xc0 [.] ? page_fault_oops+0x2a8/0x3a0 [.] ? exc_page_fault+0x84/0x130 [.] ? asm_exc_page_fault+0x22/0x30 [.] ? ima_restore_measurement_list+0x95/0x4b0 [.] ? template_desc_init_fields+0x317/0x410 [.] ? crypto_alloc_tfm_node+0x9c/0xc0 [.] ? init_ima_lsm+0x30/0x30 [.] ima_load_kexec_buffer+0x72/0xa0 [.] ima_init+0x44/0xa0 [.] __initstub__kmod_ima__373_1201_init_ima7+0x1e/0xb0 [.] ? init_ima_lsm+0x30/0x30 [.] do_one_initcall+0xad/0x200 [.] ? idr_alloc_cyclic+0xaa/0x110 [.] ? new_slab+0x12c/0x420 [.] ? new_slab+0x12c/0x420 [.] ? number+0x12a/0x430 [.] ? sysvec_apic_timer_interrupt+0xa/0x80 [.] ? asm_sysvec_apic_timer_interrupt+0x16/0x20 [.] ? parse_args+0xd4/0x380 [.] ? parse_args+0x14b/0x380 [.] kernel_init_freeable+0x1c1/0x2b0 [.] ? rest_init+0xb0/0xb0 [.] kernel_init+0x16/0x1a0 [.] ret_from_fork+0x2f/0x40 [.] ? rest_init+0xb0/0xb0 [.] ret_from_fork_asm+0x11/0x20 [.] </TASK> Link: https://lore.kernel.org/all/20231114091658.228030-1-bhe@redhat.com/ Link: https://lkml.kernel.org/r/20241017190347.5578-1-gourry@gourry.net Fixes: 7acf164b259d ("resource: add walk_system_ram_res_rev()") Signed-off-by: Gregory Price <gourry@gourry.net> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: AKASHI Takahiro <takahiro.akashi@linaro.org> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Cc: Mika Westerberg <mika.westerberg@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-28fork: only invoke khugepaged, ksm hooks if no errorLorenzo Stoakes1-5/+2
There is no reason to invoke these hooks early against an mm that is in an incomplete state. The change in commit d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()") makes this more pertinent as we may be in a state where entries in the maple tree are not yet consistent. Their placement early in dup_mmap() only appears to have been meaningful for early error checking, and since functionally it'd require a very small allocation to fail (in practice 'too small to fail') that'd only occur in the most dire circumstances, meaning the fork would fail or be OOM'd in any case. Since both khugepaged and KSM tracking are there to provide optimisations to memory performance rather than critical functionality, it doesn't really matter all that much if, under such dire memory pressure, we fail to register an mm with these. As a result, we follow the example of commit d2081b2bf819 ("mm: khugepaged: make khugepaged_enter() void function") and make ksm_fork() a void function also. We only expose the mm to these functions once we are done with them and only if no error occurred in the fork operation. Link: https://lkml.kernel.org/r/e0cb8b840c9d1d5a6e84d4f8eff5f3f2022aa10c.1729014377.git.lorenzo.stoakes@oracle.com Fixes: d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reported-by: Jann Horn <jannh@google.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Jann Horn <jannh@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Linus Torvalds <torvalds@linuxfoundation.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-28fork: do not invoke uffd on fork if error occursLorenzo Stoakes1-1/+4
Patch series "fork: do not expose incomplete mm on fork". During fork we may place the virtual memory address space into an inconsistent state before the fork operation is complete. In addition, we may encounter an error during the fork operation that indicates that the virtual memory address space is invalidated. As a result, we should not be exposing it in any way to external machinery that might interact with the mm or VMAs, machinery that is not designed to deal with incomplete state. We specifically update the fork logic to defer khugepaged and ksm to the end of the operation and only to be invoked if no error arose, and disallow uffd from observing fork events should an error have occurred. This patch (of 2): Currently on fork we expose the virtual address space of a process to userland unconditionally if uffd is registered in VMAs, regardless of whether an error arose in the fork. This is performed in dup_userfaultfd_complete() which is invoked unconditionally, and performs two duties - invoking registered handlers for the UFFD_EVENT_FORK event via dup_fctx(), and clearing down userfaultfd_fork_ctx objects established in dup_userfaultfd(). This is problematic, because the virtual address space may not yet be correctly initialised if an error arose. The change in commit d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()") makes this more pertinent as we may be in a state where entries in the maple tree are not yet consistent. We address this by, on fork error, ensuring that we roll back state that we would otherwise expect to clean up through the event being handled by userland and perform the memory freeing duty otherwise performed by dup_userfaultfd_complete(). We do this by implementing a new function, dup_userfaultfd_fail(), which performs the same loop, only decrementing reference counts. Note that we perform mmgrab() on the parent and child mm's, however userfaultfd_ctx_put() will mmdrop() this once the reference count drops to zero, so we will avoid memory leaks correctly here. Link: https://lkml.kernel.org/r/cover.1729014377.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/d3691d58bb58712b6fb3df2be441d175bd3cdf07.1729014377.git.lorenzo.stoakes@oracle.com Fixes: d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reported-by: Jann Horn <jannh@google.com> Reviewed-by: Jann Horn <jannh@google.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Linus Torvalds <torvalds@linuxfoundation.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-27Merge tag 'ftrace-v6.12-rc4' of ↵Linus Torvalds1-8/+4
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull ftrace fixes from Steven Rostedt: - Fix missing mutex unlock in error path of register_ftrace_graph() A previous fix added a return on an error path and forgot to unlock the mutex. Instead of dealing with error paths, use guard(mutex) as the mutex is just released at the exit of the function anyway. Other functions in this file should be updated with this, but that's a cleanup and not a fix. - Change cpuhp setup name to be consistent with other cpuhp states The same fix that the above patch fixes added a cpuhp_setup_state() call with the name of "fgraph_idle_init". I was informed that it should instead be something like: "fgraph:online". Update that too. * tag 'ftrace-v6.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: fgraph: Change the name of cpuhp state to "fgraph:online" fgraph: Fix missing unlock in register_ftrace_graph()
2024-10-27genirq/msi: Fix off-by-one error in msi_domain_alloc()Jinjie Ruan1-1/+1
The error path in msi_domain_alloc(), frees the already allocated MSI interrupts in a loop, but the loop condition terminates when the index reaches zero, which fails to free the first allocated MSI interrupt at index zero. Check for >= 0 so that msi[0] is freed as well. Fixes: f3cf8bb0d6c3 ("genirq: Add generic msi irq domain support") Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20241026063639.10711-1-ruanjinjie@huawei.com
2024-10-27posix-cpu-timers: Clear TICK_DEP_BIT_POSIX_TIMER on cloneBenjamin Segall1-0/+2
When cloning a new thread, its posix_cputimers are not inherited, and are cleared by posix_cputimers_init(). However, this does not clear the tick dependency it creates in tsk->tick_dep_mask, and the handler does not reach the code to clear the dependency if there were no timers to begin with. Thus if a thread has a cputimer running before clone/fork, all descendants will prevent nohz_full unless they create a cputimer of their own. Fix this by entirely clearing the tick_dep_mask in copy_process(). (There is currently no inherited state that needs a tick dependency) Process-wide timers do not have this problem because fork does not copy signal_struct as a baseline, it creates one from scratch. Fixes: b78783000d5c ("posix-cpu-timers: Migrate to use new tick dependency mask model") Signed-off-by: Ben Segall <bsegall@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/xm26o737bq8o.fsf@google.com
2024-10-26sched/numa: Fix the potential null pointer dereference in task_numa_work()Shawn Wang1-2/+2
When running stress-ng-vm-segv test, we found a null pointer dereference error in task_numa_work(). Here is the backtrace: [323676.066985] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020 ...... [323676.067108] CPU: 35 PID: 2694524 Comm: stress-ng-vm-se ...... [323676.067113] pstate: 23401009 (nzCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--) [323676.067115] pc : vma_migratable+0x1c/0xd0 [323676.067122] lr : task_numa_work+0x1ec/0x4e0 [323676.067127] sp : ffff8000ada73d20 [323676.067128] x29: ffff8000ada73d20 x28: 0000000000000000 x27: 000000003e89f010 [323676.067130] x26: 0000000000080000 x25: ffff800081b5c0d8 x24: ffff800081b27000 [323676.067133] x23: 0000000000010000 x22: 0000000104d18cc0 x21: ffff0009f7158000 [323676.067135] x20: 0000000000000000 x19: 0000000000000000 x18: ffff8000ada73db8 [323676.067138] x17: 0001400000000000 x16: ffff800080df40b0 x15: 0000000000000035 [323676.067140] x14: ffff8000ada73cc8 x13: 1fffe0017cc72001 x12: ffff8000ada73cc8 [323676.067142] x11: ffff80008001160c x10: ffff000be639000c x9 : ffff8000800f4ba4 [323676.067145] x8 : ffff000810375000 x7 : ffff8000ada73974 x6 : 0000000000000001 [323676.067147] x5 : 0068000b33e26707 x4 : 0000000000000001 x3 : ffff0009f7158000 [323676.067149] x2 : 0000000000000041 x1 : 0000000000004400 x0 : 0000000000000000 [323676.067152] Call trace: [323676.067153] vma_migratable+0x1c/0xd0 [323676.067155] task_numa_work+0x1ec/0x4e0 [323676.067157] task_work_run+0x78/0xd8 [323676.067161] do_notify_resume+0x1ec/0x290 [323676.067163] el0_svc+0x150/0x160 [323676.067167] el0t_64_sync_handler+0xf8/0x128 [323676.067170] el0t_64_sync+0x17c/0x180 [323676.067173] Code: d2888001 910003fd f9000bf3 aa0003f3 (f9401000) [323676.067177] SMP: stopping secondary CPUs [323676.070184] Starting crashdump kernel... stress-ng-vm-segv in stress-ng is used to stress test the SIGSEGV error handling function of the system, which tries to cause a SIGSEGV error on return from unmapping the whole address space of the child process. Normally this program will not cause kernel crashes. But before the munmap system call returns to user mode, a potential task_numa_work() for numa balancing could be added and executed. In this scenario, since the child process has no vma after munmap, the vma_next() in task_numa_work() will return a null pointer even if the vma iterator restarts from 0. Recheck the vma pointer before dereferencing it in task_numa_work(). Fixes: 214dbc428137 ("sched: convert to vma iterator") Signed-off-by: Shawn Wang <shawnwang@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org # v6.2+ Link: https://lkml.kernel.org/r/20241025022208.125527-1-shawnwang@linux.alibaba.com
2024-10-25scx: Fix raciness in scx_ops_bypass()David Vernet1-12/+17
scx_ops_bypass() can currently race on the ops enable / disable path as follows: 1. scx_ops_bypass(true) called on enable path, bypass depth is set to 1 2. An op on the init path exits, which schedules scx_ops_disable_workfn() 3. scx_ops_bypass(false) is called on the disable path, and bypass depth is decremented to 0 4. kthread is scheduled to execute scx_ops_disable_workfn() 5. scx_ops_bypass(true) called, bypass depth set to 1 6. scx_ops_bypass() races when iterating over CPUs While it's not safe to take any blocking locks on the bypass path, it is safe to take a raw spinlock which cannot be preempted. This patch therefore updates scx_ops_bypass() to use a raw spinlock to synchronize, and changes scx_ops_bypass_depth to be a regular int. Without this change, we observe the following warnings when running the 'exit' sched_ext selftest (sometimes requires a couple of runs): .[root@virtme-ng sched_ext]# ./runner -t exit ===== START ===== TEST: exit ... [ 14.935078] WARNING: CPU: 2 PID: 360 at kernel/sched/ext.c:4332 scx_ops_bypass+0x1ca/0x280 [ 14.935126] Modules linked in: [ 14.935150] CPU: 2 UID: 0 PID: 360 Comm: sched_ext_ops_h Not tainted 6.11.0-virtme #24 [ 14.935192] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014 [ 14.935242] Sched_ext: exit (enabling+all) [ 14.935244] RIP: 0010:scx_ops_bypass+0x1ca/0x280 [ 14.935300] Code: ff ff ff e8 48 96 10 00 fb e9 08 ff ff ff c6 05 7b 34 e8 01 01 90 48 c7 c7 89 86 88 87 e8 be 1d f8 ff 90 0f 0b 90 90 eb 95 90 <0f> 0b 90 41 8b 84 24 24 0a 00 00 eb 97 90 0f 0b 90 41 8b 84 24 24 [ 14.935394] RSP: 0018:ffffb706c0957ce0 EFLAGS: 00010002 [ 14.935424] RAX: 0000000000000009 RBX: 0000000000000001 RCX: 00000000e3fb8b2a [ 14.935465] RDX: 0000000000000001 RSI: 0000000000000004 RDI: ffffffff88a4c080 [ 14.935512] RBP: 0000000000009b56 R08: 0000000000000004 R09: 00000003f12e520a [ 14.935555] R10: ffffffff863a9795 R11: 0000000000000000 R12: ffff8fc5fec31300 [ 14.935598] R13: ffff8fc5fec31318 R14: 0000000000000286 R15: 0000000000000018 [ 14.935642] FS: 0000000000000000(0000) GS:ffff8fc5fe680000(0000) knlGS:0000000000000000 [ 14.935684] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 14.935721] CR2: 0000557d92890b88 CR3: 000000002464a000 CR4: 0000000000750ef0 [ 14.935765] PKRU: 55555554 [ 14.935782] Call Trace: [ 14.935802] <TASK> [ 14.935823] ? __warn+0xce/0x220 [ 14.935850] ? scx_ops_bypass+0x1ca/0x280 [ 14.935881] ? report_bug+0xc1/0x160 [ 14.935909] ? handle_bug+0x61/0x90 [ 14.935934] ? exc_invalid_op+0x1a/0x50 [ 14.935959] ? asm_exc_invalid_op+0x1a/0x20 [ 14.935984] ? raw_spin_rq_lock_nested+0x15/0x30 [ 14.936019] ? scx_ops_bypass+0x1ca/0x280 [ 14.936046] ? srso_alias_return_thunk+0x5/0xfbef5 [ 14.936081] ? __pfx_scx_ops_disable_workfn+0x10/0x10 [ 14.936111] scx_ops_disable_workfn+0x146/0xac0 [ 14.936142] ? finish_task_switch+0xa9/0x2c0 [ 14.936172] ? srso_alias_return_thunk+0x5/0xfbef5 [ 14.936211] ? __pfx_scx_ops_disable_workfn+0x10/0x10 [ 14.936244] kthread_worker_fn+0x101/0x2c0 [ 14.936268] ? __pfx_kthread_worker_fn+0x10/0x10 [ 14.936299] kthread+0xec/0x110 [ 14.936327] ? __pfx_kthread+0x10/0x10 [ 14.936351] ret_from_fork+0x37/0x50 [ 14.936374] ? __pfx_kthread+0x10/0x10 [ 14.936400] ret_from_fork_asm+0x1a/0x30 [ 14.936427] </TASK> [ 14.936443] irq event stamp: 21002 [ 14.936467] hardirqs last enabled at (21001): [<ffffffff863aa35f>] resched_cpu+0x9f/0xd0 [ 14.936521] hardirqs last disabled at (21002): [<ffffffff863dd0ba>] scx_ops_bypass+0x11a/0x280 [ 14.936571] softirqs last enabled at (20642): [<ffffffff863683d7>] __irq_exit_rcu+0x67/0xd0 [ 14.936622] softirqs last disabled at (20637): [<ffffffff863683d7>] __irq_exit_rcu+0x67/0xd0 [ 14.936672] ---[ end trace 0000000000000000 ]--- [ 14.953282] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF) [ 14.953352] ------------[ cut here ]------------ [ 14.953383] WARNING: CPU: 2 PID: 360 at kernel/sched/ext.c:4335 scx_ops_bypass+0x1d8/0x280 [ 14.953428] Modules linked in: [ 14.953453] CPU: 2 UID: 0 PID: 360 Comm: sched_ext_ops_h Tainted: G W 6.11.0-virtme #24 [ 14.953505] Tainted: [W]=WARN [ 14.953527] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014 [ 14.953574] RIP: 0010:scx_ops_bypass+0x1d8/0x280 [ 14.953603] Code: c6 05 7b 34 e8 01 01 90 48 c7 c7 89 86 88 87 e8 be 1d f8 ff 90 0f 0b 90 90 eb 95 90 0f 0b 90 41 8b 84 24 24 0a 00 00 eb 97 90 <0f> 0b 90 41 8b 84 24 24 0a 00 00 eb 92 f3 0f 1e fa 49 8d 84 24 f0 [ 14.953693] RSP: 0018:ffffb706c0957ce0 EFLAGS: 00010046 [ 14.953722] RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000001 [ 14.953763] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8fc5fec31318 [ 14.953804] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000 [ 14.953845] R10: ffffffff863a9795 R11: 0000000000000000 R12: ffff8fc5fec31300 [ 14.953888] R13: ffff8fc5fec31318 R14: 0000000000000286 R15: 0000000000000018 [ 14.953934] FS: 0000000000000000(0000) GS:ffff8fc5fe680000(0000) knlGS:0000000000000000 [ 14.953974] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 14.954009] CR2: 0000557d92890b88 CR3: 000000002464a000 CR4: 0000000000750ef0 [ 14.954052] PKRU: 55555554 [ 14.954068] Call Trace: [ 14.954085] <TASK> [ 14.954102] ? __warn+0xce/0x220 [ 14.954126] ? scx_ops_bypass+0x1d8/0x280 [ 14.954150] ? report_bug+0xc1/0x160 [ 14.954178] ? handle_bug+0x61/0x90 [ 14.954203] ? exc_invalid_op+0x1a/0x50 [ 14.954226] ? asm_exc_invalid_op+0x1a/0x20 [ 14.954250] ? raw_spin_rq_lock_nested+0x15/0x30 [ 14.954285] ? scx_ops_bypass+0x1d8/0x280 [ 14.954311] ? __mutex_unlock_slowpath+0x3a/0x260 [ 14.954343] scx_ops_disable_workfn+0xa3e/0xac0 [ 14.954381] ? __pfx_scx_ops_disable_workfn+0x10/0x10 [ 14.954413] kthread_w