summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2023-12-08Revert "workqueue: remove unused cancel_work()"Andrey Grodzovsky1-0/+9
[ Upstream commit 73b4b53276a1d6290cd4f47dbbc885b6e6e59ac6 ] This reverts commit 6417250d3f894e66a68ba1cd93676143f2376a6f. amdpgu need this function in order to prematurly stop pending reset works when another reset work already in progress. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Lai Jiangshan<jiangshanlai@gmail.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Stable-dep-of: 91d3d149978b ("r8169: prevent potential deadlock in rtl8169_close") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-12-08rcu: Avoid tracing a few functions executed in stop machinePatrick Wang1-4/+4
commit 48f8070f5dd8e13148ae4647780a452d53c457a2 upstream. Stop-machine recently started calling additional functions while waiting: ---------------------------------------------------------------- Former stop machine wait loop: do { cpu_relax(); => macro ... } while (curstate != STOPMACHINE_EXIT); ----------------------------------------------------------------- Current stop machine wait loop: do { stop_machine_yield(cpumask); => function (notraced) ... touch_nmi_watchdog(); => function (notraced, inside calls also notraced) ... rcu_momentary_dyntick_idle(); => function (notraced, inside calls traced) } while (curstate != MULTI_STOP_EXIT); ------------------------------------------------------------------ These functions (and the functions that they call) must be marked notrace to prevent them from being updated while they are executing. The consequences of failing to mark these functions can be severe: rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: rcu: 1-...!: (0 ticks this GP) idle=14f/1/0x4000000000000000 softirq=3397/3397 fqs=0 rcu: 3-...!: (0 ticks this GP) idle=ee9/1/0x4000000000000000 softirq=5168/5168 fqs=0 (detected by 0, t=8137 jiffies, g=5889, q=2 ncpus=4) Task dump for CPU 1: task:migration/1 state:R running task stack: 0 pid: 19 ppid: 2 flags:0x00000000 Stopper: multi_cpu_stop+0x0/0x18c <- stop_machine_cpuslocked+0x128/0x174 Call Trace: Task dump for CPU 3: task:migration/3 state:R running task stack: 0 pid: 29 ppid: 2 flags:0x00000000 Stopper: multi_cpu_stop+0x0/0x18c <- stop_machine_cpuslocked+0x128/0x174 Call Trace: rcu: rcu_preempt kthread timer wakeup didn't happen for 8136 jiffies! g5889 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 rcu: Possible timer handling issue on cpu=2 timer-softirq=594 rcu: rcu_preempt kthread starved for 8137 jiffies! g5889 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=2 rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior. rcu: RCU grace-period kthread stack dump: task:rcu_preempt state:I stack: 0 pid: 14 ppid: 2 flags:0x00000000 Call Trace: schedule+0x56/0xc2 schedule_timeout+0x82/0x184 rcu_gp_fqs_loop+0x19a/0x318 rcu_gp_kthread+0x11a/0x140 kthread+0xee/0x118 ret_from_exception+0x0/0x14 rcu: Stack dump where RCU GP kthread last ran: Task dump for CPU 2: task:migration/2 state:R running task stack: 0 pid: 24 ppid: 2 flags:0x00000000 Stopper: multi_cpu_stop+0x0/0x18c <- stop_machine_cpuslocked+0x128/0x174 Call Trace: This commit therefore marks these functions notrace: rcu_preempt_deferred_qs() rcu_preempt_need_deferred_qs() rcu_preempt_deferred_qs_irqrestore() [ paulmck: Apply feedback from Neeraj Upadhyay. ] Signed-off-by: Patrick Wang <patrick.wang.shcn@gmail.com> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Ronald Monthero <debug.penguin32@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-12-03lockdep: Fix block chain corruptionPeter Zijlstra1-1/+2
[ Upstream commit bca4104b00fec60be330cd32818dd5c70db3d469 ] Kent reported an occasional KASAN splat in lockdep. Mark then noted: > I suspect the dodgy access is to chain_block_buckets[-1], which hits the last 4 > bytes of the redzone and gets (incorrectly/misleadingly) attributed to > nr_large_chain_blocks. That would mean @size == 0, at which point size_to_bucket() returns -1 and the above happens. alloc_chain_hlocks() has 'size - req', for the first with the precondition 'size >= rq', which allows the 0. This code is trying to split a block, del_chain_block() takes what we need, and add_chain_block() puts back the remainder, except in the above case the remainder is 0 sized and things go sideways. Fixes: 810507fe6fd5 ("locking/lockdep: Reuse freed chain_hlocks entries") Reported-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Kent Overstreet <kent.overstreet@linux.dev> Link: https://lkml.kernel.org/r/20231121114126.GH8262@noisy.programming.kicks-ass.net Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-28tracing: Have trace_event_file have ref countersSteven Rostedt (Google)4-15/+49
commit bb32500fb9b78215e4ef6ee8b4345c5f5d7eafb4 upstream. The following can crash the kernel: # cd /sys/kernel/tracing # echo 'p:sched schedule' > kprobe_events # exec 5>>events/kprobes/sched/enable # > kprobe_events # exec 5>&- The above commands: 1. Change directory to the tracefs directory 2. Create a kprobe event (doesn't matter what one) 3. Open bash file descriptor 5 on the enable file of the kprobe event 4. Delete the kprobe event (removes the files too) 5. Close the bash file descriptor 5 The above causes a crash! BUG: kernel NULL pointer dereference, address: 0000000000000028 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP PTI CPU: 6 PID: 877 Comm: bash Not tainted 6.5.0-rc4-test-00008-g2c6b6b1029d4-dirty #186 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 RIP: 0010:tracing_release_file_tr+0xc/0x50 What happens here is that the kprobe event creates a trace_event_file "file" descriptor that represents the file in tracefs to the event. It maintains state of the event (is it enabled for the given instance?). Opening the "enable" file gets a reference to the event "file" descriptor via the open file descriptor. When the kprobe event is deleted, the file is also deleted from the tracefs system which also frees the event "file" descriptor. But as the tracefs file is still opened by user space, it will not be totally removed until the final dput() is called on it. But this is not true with the event "file" descriptor that is already freed. If the user does a write to or simply closes the file descriptor it will reference the event "file" descriptor that was just freed, causing a use-after-free bug. To solve this, add a ref count to the event "file" descriptor as well as a new flag called "FREED". The "file" will not be freed until the last reference is released. But the FREE flag will be set when the event is removed to prevent any more modifications to that event from happening, even if there's still a reference to the event "file" descriptor. Link: https://lore.kernel.org/linux-trace-kernel/20231031000031.1e705592@gandalf.local.home/ Link: https://lore.kernel.org/linux-trace-kernel/20231031122453.7a48b923@gandalf.local.home Cc: stable@vger.kernel.org Cc: Mark Rutland <mark.rutland@arm.com> Fixes: f5ca233e2e66d ("tracing: Increase trace array ref count on enable and filter files") Reported-by: Beau Belgrave <beaub@linux.microsoft.com> Tested-by: Beau Belgrave <beaub@linux.microsoft.com> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-11-28tracing: Have the user copy of synthetic event address use correct contextSteven Rostedt (Google)1-1/+1
commit 4f7969bcd6d33042d62e249b41b5578161e4c868 upstream. A synthetic event is created by the synthetic event interface that can read both user or kernel address memory. In reality, it reads any arbitrary memory location from within the kernel. If the address space is in USER (where CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE is set) then it uses strncpy_from_user_nofault() to copy strings otherwise it uses strncpy_from_kernel_nofault(). But since both functions use the same variable there's no annotation to what that variable is (ie. __user). This makes sparse complain. Quiet sparse by typecasting the strncpy_from_user_nofault() variable to a __user pointer. Link: https://lore.kernel.org/linux-trace-kernel/20231031151033.73c42e23@gandalf.local.home Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Fixes: 0934ae9977c2 ("tracing: Fix reading strings from synthetic events"); Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202311010013.fm8WTxa5-lkp@intel.com/ Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-11-28kernel/reboot: emergency_restart: Set correct system_stateBenjamin Bara1-0/+1
commit 60466c067927abbcaff299845abd4b7069963139 upstream. As the emergency restart does not call kernel_restart_prepare(), the system_state stays in SYSTEM_RUNNING. Since bae1d3a05a8b, this hinders i2c_in_atomic_xfer_mode() from becoming active, and therefore might lead to avoidable warnings in the restart handlers, e.g.: [ 12.667612] WARNING: CPU: 1 PID: 1 at kernel/rcu/tree_plugin.h:318 rcu_note_context_switch+0x33c/0x6b0 [ 12.676926] Voluntary context switch within RCU read-side critical section! ... [ 12.742376] schedule_timeout from wait_for_completion_timeout+0x90/0x114 [ 12.749179] wait_for_completion_timeout from tegra_i2c_wait_completion+0x40/0x70 ... [ 12.994527] atomic_notifier_call_chain from machine_restart+0x34/0x58 [ 13.001050] machine_restart from panic+0x2a8/0x32c Avoid these by setting the correct system_state. Fixes: bae1d3a05a8b ("i2c: core: remove use of in_atomic()") Cc: stable@vger.kernel.org # v5.2+ Reviewed-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Tested-by: Nishanth Menon <nm@ti.com> Signed-off-by: Benjamin Bara <benjamin.bara@skidata.com> Link: https://lore.kernel.org/r/20230327-tegra-pmic-reboot-v7-1-18699d5dcd76@skidata.com Signed-off-by: Lee Jones <lee@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-11-28genirq/generic_chip: Make irq_remove_generic_chip() irqdomain awareHerve Codina1-6/+19
commit 5e7afb2eb7b2a7c81e9f608cbdf74a07606fd1b5 upstream. irq_remove_generic_chip() calculates the Linux interrupt number for removing the handler and interrupt chip based on gc::irq_base as a linear function of the bit positions of set bits in the @msk argument. When the generic chip is present in an irq domain, i.e. created with a call to irq_alloc_domain_generic_chips(), gc::irq_base contains not the base Linux interrupt number. It contains the base hardware interrupt for this chip. It is set to 0 for the first chip in the domain, 0 + N for the next chip, where $N is the number of hardware interrupts per chip. That means the Linux interrupt number cannot be calculated based on gc::irq_base for irqdomain based chips without a domain map lookup, which is currently missing. Rework the code to take the irqdomain case into account and calculate the Linux interrupt number by a irqdomain lookup of the domain specific hardware interrupt number. [ tglx: Massage changelog. Reshuffle the logic and add a proper comment. ] Fixes: cfefd21e693d ("genirq: Add chip suspend and resume callbacks") Signed-off-by: Herve Codina <herve.codina@bootlin.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20231024150335.322282-1-herve.codina@bootlin.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-11-28rcu: kmemleak: Ignore kmemleak false positives when RCU-freeing objectsCatalin Marinas1-0/+9
commit 5f98fd034ca6fd1ab8c91a3488968a0e9caaabf6 upstream. Since the actual slab freeing is deferred when calling kvfree_rcu(), so is the kmemleak_free() callback informing kmemleak of the object deletion. From the perspective of the kvfree_rcu() caller, the object is freed and it may remove any references to it. Since kmemleak does not scan RCU internal data storing the pointer, it will report such objects as leaks during the grace period. Tell kmemleak to ignore such objects on the kvfree_call_rcu() path. Note that the tiny RCU implementation does not have such issue since the objects can be tracked from the rcu_ctrlblk structure. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reported-by: Christoph Paasch <cpaasch@apple.com> Closes: https://lore.kernel.org/all/F903A825-F05F-4B77-A2B5-7356282FBA2C@apple.com/ Cc: <stable@vger.kernel.org> Tested-by: Christoph Paasch <cpaasch@apple.com> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-11-28PM: hibernate: Clean up sync_read handling in snapshot_write_next()Brian Geffon1-5/+1
commit d08970df1980476f27936e24d452550f3e9e92e1 upstream. In snapshot_write_next(), sync_read is set and unset in three different spots unnecessiarly. As a result there is a subtle bug where the first page after the meta data has been loaded unconditionally sets sync_read to 0. If this first PFN was actually a highmem page, then the returned buffer will be the global "buffer," and the page needs to be loaded synchronously. That is, I'm not sure we can always assume the following to be safe: handle->buffer = get_buffer(&orig_bm, &ca); handle->sync_read = 0; Because get_buffer() can call get_highmem_page_buffer() which can return 'buffer'. The easiest way to address this is just set sync_read before snapshot_write_next() returns if handle->buffer == buffer. Signed-off-by: Brian Geffon <bgeffon@google.com> Fixes: 8357376d3df2 ("[PATCH] swsusp: Improve handling of highmem") Cc: All applicable <stable@vger.kernel.org> [ rjw: Subject and changelog edits ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-11-28PM: hibernate: Use __get_safe_page() rather than touching the listBrian Geffon1-4/+6
commit f0c7183008b41e92fa676406d87f18773724b48b upstream. We found at least one situation where the safe pages list was empty and get_buffer() would gladly try to use a NULL pointer. Signed-off-by: Brian Geffon <bgeffon@google.com> Fixes: 8357376d3df2 ("[PATCH] swsusp: Improve handling of highmem") Cc: All applicable <stable@vger.kernel.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-11-28rcu/tree: Defer setting of jiffies during stall resetJoel Fernandes (Google)3-2/+34
commit b96e7a5fa0ba9cda32888e04f8f4bac42d49a7f8 upstream. There are instances where rcu_cpu_stall_reset() is called when jiffies did not get a chance to update for a long time. Before jiffies is updated, the CPU stall detector can go off triggering false-positives where a just-started grace period appears to be ages old. In the past, we disabled stall detection in rcu_cpu_stall_reset() however this got changed [1]. This is resulting in false-positives in KGDB usecase [2]. Fix this by deferring the update of jiffies to the third run of the FQS loop. This is more robust, as, even if rcu_cpu_stall_reset() is called just before jiffies is read, we would end up pushing out the jiffies read by 3 more FQS loops. Meanwhile the CPU stall detection will be delayed and we will not get any false positives. [1] https://lore.kernel.org/all/20210521155624.174524-2-senozhatsky@chromium.org/ [2] https://lore.kernel.org/all/20230814020045.51950-2-chenhuacai@loongson.cn/ Tested with rcutorture.cpu_stall option as well to verify stall behavior with/without patch. Tested-by: Huacai Chen <chenhuacai@loongson.cn> Reported-by: Binbin Zhou <zhoubinbin@loongson.cn> Closes: https://lore.kernel.org/all/20230814020045.51950-2-chenhuacai@loongson.cn/ Suggested-by: Paul McKenney <paulmck@kernel.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Fixes: a80be428fbc1 ("rcu: Do not disable GP stall detection in rcu_cpu_stall_reset()") Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-11-28watchdog: move softlockup_panic back to early_paramKrister Johansen1-0/+7
commit 8b793bcda61f6c3ed4f5b2ded7530ef6749580cb upstream. Setting softlockup_panic from do_sysctl_args() causes it to take effect later in boot. The lockup detector is enabled before SMP is brought online, but do_sysctl_args runs afterwards. If a user wants to set softlockup_panic on boot and have it trigger should a softlockup occur during onlining of the non-boot processors, they could do this prior to commit f117955a2255 ("kernel/watchdog.c: convert {soft/hard}lockup boot parameters to sysctl aliases"). However, after this commit the value of softlockup_panic is set too late to be of help for this type of problem. Restore the prior behavior. Signed-off-by: Krister Johansen <kjlx@templeofstupid.com> Cc: stable@vger.kernel.org Fixes: f117955a2255 ("kernel/watchdog.c: convert {soft/hard}lockup boot parameters to sysctl aliases") Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-11-28audit: don't WARN_ON_ONCE(!current->mm) in audit_exe_compare()Paul Moore1-1/+1
commit 969d90ec212bae4b45bf9d21d7daa30aa6cf055e upstream. eBPF can end up calling into the audit code from some odd places, and some of these places don't have @current set properly so we end up tripping the `WARN_ON_ONCE(!current->mm)` near the top of `audit_exe_compare()`. While the basic `!current->mm` check is good, the `WARN_ON_ONCE()` results in some scary console messages so let's drop that and just do the regular `!current->mm` check to avoid problems. Cc: <stable@vger.kernel.org> Fixes: 47846d51348d ("audit: don't take task_lock() in audit_exe_compare() code path") Reported-by: Artem Savkov <asavkov@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-11-28audit: don't take task_lock() in audit_exe_compare() code pathPaul Moore1-1/+8
commit 47846d51348dd62e5231a83be040981b17c955fa upstream. The get_task_exe_file() function locks the given task with task_lock() which when used inside audit_exe_compare() can cause deadlocks on systems that generate audit records when the task_lock() is held. We resolve this problem with two changes: ignoring those cases where the task being audited is not the current task, and changing our approach to obtaining the executable file struct to not require task_lock(). With the intent of the audit exe filter being to filter on audit events generated by processes started by the specified executable, it makes sense that we would only want to use the exe filter on audit records associated with the currently executing process, e.g. @current. If we are asked to filter records using a non-@current task_struct we can safely ignore the exe filter without negatively impacting the admin's expectations for the exe filter. Knowing that we only have to worry about filtering the currently executing task in audit_exe_compare() we can do away with the task_lock() and call get_mm_exe_file() with @current->mm directly. Cc: <stable@vger.kernel.org> Fixes: 5efc244346f9 ("audit: fix exe_file access in audit_exe_compare") Reported-by: Andreas Steinmetz <anstein99@googlemail.com> Reviewed-by: John Johansen <john.johanse@canonical.com> Reviewed-by: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-11-28bpf: Fix precision tracking for BPF_ALU | BPF_TO_BE | BPF_ENDShung-Hsi Yu1-1/+6
commit 291d044fd51f8484066300ee42afecf8c8db7b3a upstream. BPF_END and BPF_NEG has a different specification for the source bit in the opcode compared to other ALU/ALU64 instructions, and is either reserved or use to specify the byte swap endianness. In both cases the source bit does not encode source operand location, and src_reg is a reserved field. backtrack_insn() currently does not differentiate BPF_END and BPF_NEG from other ALU/ALU64 instructions, which leads to r0 being incorrectly marked as precise when processing BPF_ALU | BPF_TO_BE | BPF_END instructions. This commit teaches backtrack_insn() to correctly mark precision for such case. While precise tracking of BPF_NEG and other BPF_END instructions are correct and does not need fixing, this commit opt to process all BPF_NEG and BPF_END instructions within the same if-clause to better align with current convention used in the verifier (e.g. check_alu_op). Fixes: b5dc0163d8fd ("bpf: precise scalar_value tracking") Cc: stable@vger.kernel.org Reported-by: Mohamed Mahmoud <mmahmoud@redhat.com> Closes: https://lore.kernel.org/r/87jzrrwptf.fsf@toke.dk Tested-by: Toke Høiland-Jørgensen <toke@redhat.com> Tested-by: Tao Lyu <tao.lyu@epfl.ch> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Link: https://lore.kernel.org/r/20231102053913.12004-2-shung-hsi.yu@suse.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-11-28bpf: Fix check_stack_write_fixed_off() to correctly spill immHao Sun1-1/+1
commit 811c363645b33e6e22658634329e95f383dfc705 upstream. In check_stack_write_fixed_off(), imm value is cast to u32 before being spilled to the stack. Therefore, the sign information is lost, and the range information is incorrect when load from the stack again. For the following prog: 0: r2 = r10 1: *(u64*)(r2 -40) = -44 2: r0 = *(u64*)(r2 - 40) 3: if r0 s<= 0xa goto +2 4: r0 = 1 5: exit 6: r0 = 0 7: exit The verifier gives: func#0 @0 0: R1=ctx(off=0,imm=0) R10=fp0 0: (bf) r2 = r10 ; R2_w=fp0 R10=fp0 1: (7a) *(u64 *)(r2 -40) = -44 ; R2_w=fp0 fp-40_w=4294967252 2: (79) r0 = *(u64 *)(r2 -40) ; R0_w=4294967252 R2_w=fp0 fp-40_w=4294967252 3: (c5) if r0 s< 0xa goto pc+2 mark_precise: frame0: last_idx 3 first_idx 0 subseq_idx -1 mark_precise: frame0: regs=r0 stack= before 2: (79) r0 = *(u64 *)(r2 -40) 3: R0_w=4294967252 4: (b7) r0 = 1 ; R0_w=1 5: (95) exit verification time 7971 usec stack depth 40 processed 6 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0 So remove the incorrect cast, since imm field is declared as s32, and __mark_reg_known() takes u64, so imm would be correctly sign extended by compiler. Fixes: ecdf985d7615 ("bpf: track immediate values written to stack by BPF_ST instruction") Cc: stable@vger.kernel.org Signed-off-by: Hao Sun <sunhao.th@gmail.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20231101-fix-check-stack-write-v3-1-f05c2b1473d5@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-11-28tracing/perf: Add interrupt_context_level() helperSteven Rostedt (VMware)2-12/+2
[ Upstream commit 91ebe8bcbff9d2ff21303e73bf7434f39a98b255 ] Now that there are three different instances of doing the addition trick to the preempt_count() and NMI_MASK, HARDIRQ_MASK and SOFTIRQ_OFFSET macros, it deserves a helper function defined in the preempt.h header. Add the interrupt_context_level() helper and replace the three instances that do that logic with it. Link: https://lore.kernel.org/all/20211015142541.4badd8a9@gandalf.local.home/ Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Stable-dep-of: 87c3a5893e86 ("sched/core: Optimize in_task() and in_interrupt() a bit") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-28tracing: Reuse logic from perf's get_recursion_context()Steven Rostedt (VMware)1-6/+6
[ Upstream commit 9b84fadc444de5456ab5f5487e2108311c724c3f ] Instead of having branches that adds noise to the branch prediction, use the addition logic to set the bit for the level of interrupt context that the state is currently in. This copies the logic from perf's get_recursion_context() function. Link: https://lore.kernel.org/all/20211015161702.GF174703@worktop.programming.kicks-ass.net/ Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Stable-dep-of: 87c3a5893e86 ("sched/core: Optimize in_task() and in_interrupt() a bit") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-28kgdb: Flush console before entering kgdb on panicDouglas Anderson1-0/+3
[ Upstream commit dd712d3d45807db9fcae28a522deee85c1f2fde6 ] When entering kdb/kgdb on a kernel panic, it was be observed that the console isn't flushed before the `kdb` prompt came up. Specifically, when using the buddy lockup detector on arm64 and running: echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT I could see: [ 26.161099] lkdtm: Performing direct entry HARDLOCKUP [ 32.499881] watchdog: Watchdog detected hard LOCKUP on cpu 6 [ 32.552865] Sending NMI from CPU 5 to CPUs 6: [ 32.557359] NMI backtrace for cpu 6 ... [backtrace for cpu 6] ... [ 32.558353] NMI backtrace for cpu 5 ... [backtrace for cpu 5] ... [ 32.867471] Sending NMI from CPU 5 to CPUs 0-4,7: [ 32.872321] NMI backtrace forP cpuANC: Hard LOCKUP Entering kdb (current=..., pid 0) on processor 5 due to Keyboard Entry [5]kdb> As you can see, backtraces for the other CPUs start printing and get interleaved with the kdb PANIC print. Let's replicate the commands to flush the console in the kdb panic entry point to avoid this. Signed-off-by: Douglas Anderson <dianders@chromium.org> Link: https://lore.kernel.org/r/20230822131945.1.I5b460ae8f954e4c4f628a373d6e74713c06dd26f@changeid Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-28crypto: pcrypt - Fix hungtask for PADATA_RESETLu Jialin1-1/+1
[ Upstream commit 8f4f68e788c3a7a696546291258bfa5fdb215523 ] We found a hungtask bug in test_aead_vec_cfg as follows: INFO: task cryptomgr_test:391009 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Call trace: __switch_to+0x98/0xe0 __schedule+0x6c4/0xf40 schedule+0xd8/0x1b4 schedule_timeout+0x474/0x560 wait_for_common+0x368/0x4e0 wait_for_completion+0x20/0x30 wait_for_completion+0x20/0x30 test_aead_vec_cfg+0xab4/0xd50 test_aead+0x144/0x1f0 alg_test_aead+0xd8/0x1e0 alg_test+0x634/0x890 cryptomgr_test+0x40/0x70 kthread+0x1e0/0x220 ret_from_fork+0x10/0x18 Kernel panic - not syncing: hung_task: blocked tasks For padata_do_parallel, when the return err is 0 or -EBUSY, it will call wait_for_completion(&wait->completion) in test_aead_vec_cfg. In normal case, aead_request_complete() will be called in pcrypt_aead_serial and the return err is 0 for padata_do_parallel. But, when pinst->flags is PADATA_RESET, the return err is -EBUSY for padata_do_parallel, and it won't call aead_request_complete(). Therefore, test_aead_vec_cfg will hung at wait_for_completion(&wait->completion), which will cause hungtask. The problem comes as following: (padata_do_parallel) | rcu_read_lock_bh(); | err = -EINVAL; | (padata_replace) | pinst->flags |= PADATA_RESET; err = -EBUSY | if (pinst->flags & PADATA_RESET) | rcu_read_unlock_bh() | return err In order to resolve the problem, we replace the return err -EBUSY with -EAGAIN, which means parallel_data is changing, and the caller should call it again. v3: remove retry and just change the return err. v2: introduce padata_try_do_parallel() in pcrypt_aead_encrypt and pcrypt_aead_decrypt to solve the hungtask. Signed-off-by: Lu Jialin <lujialin4@huawei.com> Signed-off-by: Guo Zihua <guozihua@huawei.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-28kernel: watch_queue: copy user-array safelyPhilipp Stanner1-1/+1
[ Upstream commit ca0776571d3163bd03b3e8c9e3da936abfaecbf6 ] Currently, there is no overflow-check with memdup_user(). Use the new function memdup_array_user() instead of memdup_user() for duplicating the user-space array safely. Suggested-by: David Airlie <airlied@redhat.com> Signed-off-by: Philipp Stanner <pstanner@redhat.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Zack Rusin <zackr@vmware.com> Signed-off-by: Dave Airlie <airlied@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230920123612.16914-5-pstanner@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-28kernel: kexec: copy user-array safelyPhilipp Stanner1-1/+1
[ Upstream commit 569c8d82f95eb5993c84fb61a649a9c4ddd208b3 ] Currently, there is no overflow-check with memdup_user(). Use the new function memdup_array_user() instead of memdup_user() for duplicating the user-space array safely. Suggested-by: David Airlie <airlied@redhat.com> Signed-off-by: Philipp Stanner <pstanner@redhat.com> Acked-by: Baoquan He <bhe@redhat.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Zack Rusin <zackr@vmware.com> Signed-off-by: Dave Airlie <airlied@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230920123612.16914-4-pstanner@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-28bpf: Detect IP == ksym.end as part of BPF programKumar Kartikeya Dwivedi1-1/+5
[ Upstream commit 66d9111f3517f85ef2af0337ece02683ce0faf21 ] Now that bpf_throw kfunc is the first such call instruction that has noreturn semantics within the verifier, this also kicks in dead code elimination in unprecedented ways. For one, any instruction following a bpf_throw call will never be marked as seen. Moreover, if a callchain ends up throwing, any instructions after the call instruction to the eventually throwing subprog in callers will also never be marked as seen. The tempting way to fix this would be to emit extra 'int3' instructions which bump the jited_len of a program, and ensure that during runtime when a program throws, we can discover its boundaries even if the call instruction to bpf_throw (or to subprogs that always throw) is emitted as the final instruction in the program. An example of such a program would be this: do_something(): ... r0 = 0 exit foo(): r1 = 0 call bpf_throw r0 = 0 exit bar(cond): if r1 != 0 goto pc+2 call do_something exit call foo r0 = 0 // Never seen by verifier exit // main(ctx): r1 = ... call bar r0 = 0 exit Here, if we do end up throwing, the stacktrace would be the following: bpf_throw foo bar main In bar, the final instruction emitted will be the call to foo, as such, the return address will be the subsequent instruction (which the JIT emits as int3 on x86). This will end up lying outside the jited_len of the program, thus, when unwinding, we will fail to discover the return address as belonging to any program and end up in a panic due to the unreliable stack unwinding of BPF programs that we never expect. To remedy this case, make bpf_prog_ksym_find treat IP == ksym.end as part of the BPF program, so that is_bpf_text_address returns true when such a case occurs, and we are able to unwind reliably when the final instruction ends up being a call instruction. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230912233214.1518551-12-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-28workqueue: Provide one lock class key per work_on_cpu() callsiteFrederic Weisbecker1-8/+12
[ Upstream commit 265f3ed077036f053981f5eea0b5b43e7c5b39ff ] All callers of work_on_cpu() share the same lock class key for all the functions queued. As a result the workqueue related locking scenario for a function A may be spuriously accounted as an inversion against the locking scenario of function B such as in the following model: long A(void *arg) { mutex_lock(&mutex); mutex_unlock(&mutex); } long B(void *arg) { } void launchA(void) { work_on_cpu(0, A, NULL); } void launchB(void) { mutex_lock(&mutex); work_on_cpu(1, B, NULL); mutex_unlock(&mutex); } launchA and launchB running concurrently have no chance to deadlock. However the above can be reported by lockdep as a possible locking inversion because the works containing A() and B() are treated as belonging to the same locking class. The following shows an existing example of such a spurious lockdep splat: ====================================================== WARNING: possible circular locking dependency detected 6.6.0-rc1-00065-g934ebd6e5359 #35409 Not tainted ------------------------------------------------------ kworker/0:1/9 is trying to acquire lock: ffffffff9bc72f30 (cpu_hotplug_lock){++++}-{0:0}, at: _cpu_down+0x57/0x2b0 but task is already holding lock: ffff9e3bc0057e60 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: process_scheduled_works+0x216/0x500 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 ((work_completion)(&wfc.work)){+.+.}-{0:0}: __flush_work+0x83/0x4e0 work_on_cpu+0x97/0xc0 rcu_nocb_cpu_offload+0x62/0xb0 rcu_nocb_toggle+0xd0/0x1d0 kthread+0xe6/0x120 ret_from_fork+0x2f/0x40 ret_from_fork_asm+0x1b/0x30 -> #1 (rcu_state.barrier_mutex){+.+.}-{3:3}: __mutex_lock+0x81/0xc80 rcu_nocb_cpu_deoffload+0x38/0xb0 rcu_nocb_toggle+0x144/0x1d0 kthread+0xe6/0x120 ret_from_fork+0x2f/0x40 ret_from_fork_asm+0x1b/0x30 -> #0 (cpu_hotplug_lock){++++}-{0:0}: __lock_acquire+0x1538/0x2500 lock_acquire+0xbf/0x2a0 percpu_down_write+0x31/0x200 _cpu_down+0x57/0x2b0 __cpu_down_maps_locked+0x10/0x20 work_for_cpu_fn+0x15/0x20 process_scheduled_works+0x2a7/0x500 worker_thread+0x173/0x330 kthread+0xe6/0x120 ret_from_fork+0x2f/0x40 ret_from_fork_asm+0x1b/0x30 other info that might help us debug this: Chain exists of: cpu_hotplug_lock --> rcu_state.barrier_mutex --> (work_completion)(&wfc.work) Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock((work_completion)(&wfc.work)); lock(rcu_state.barrier_mutex); lock((work_completion)(&wfc.work)); lock(cpu_hotplug_lock); *** DEADLOCK *** 2 locks held by kworker/0:1/9: #0: ffff900481068b38 ((wq_completion)events){+.+.}-{0:0}, at: process_scheduled_works+0x212/0x500 #1: ffff9e3bc0057e60 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: process_scheduled_works+0x216/0x500 stack backtrace: CPU: 0 PID: 9 Comm: kworker/0:1 Not tainted 6.6.0-rc1-00065-g934ebd6e5359 #35409 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 Workqueue: events work_for_cpu_fn Call Trace: rcu-torture: rcu_torture_read_exit: Start of episode <TASK> dump_stack_lvl+0x4a/0x80 check_noncircular+0x132/0x150 __lock_acquire+0x1538/0x2500 lock_acquire+0xbf/0x2a0 ? _cpu_down+0x57/0x2b0 percpu_down_write+0x31/0x200 ? _cpu_down+0x57/0x2b0 _cpu_down+0x57/0x2b0 __cpu_down_maps_locked+0x10/0x20 work_for_cpu_fn+0x15/0x20 process_scheduled_works+0x2a7/0x500 worker_thread+0x173/0x330 ? __pfx_worker_thread+0x10/0x10 kthread+0xe6/0x120 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x2f/0x40 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1b/0x30 </TASK Fix this with providing one lock class key per work_on_cpu() caller. Reported-and-tested-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-28perf/core: Bail out early if the request AUX area is out of boundShuai Xue1-0/+6
[ Upstream commit 54aee5f15b83437f23b2b2469bcf21bdd9823916 ] When perf-record with a large AUX area, e.g 4GB, it fails with: #perf record -C 0 -m ,4G -e arm_spe_0// -- sleep 1 failed to mmap with 12 (Cannot allocate memory) and it reveals a WARNING with __alloc_pages(): ------------[ cut here ]------------ WARNING: CPU: 44 PID: 17573 at mm/page_alloc.c:5568 __alloc_pages+0x1ec/0x248 Call trace: __alloc_pages+0x1ec/0x248 __kmalloc_large_node+0xc0/0x1f8 __kmalloc_node+0x134/0x1e8 rb_alloc_aux+0xe0/0x298 perf_mmap+0x440/0x660 mmap_region+0x308/0x8a8 do_mmap+0x3c0/0x528 vm_mmap_pgoff+0xf4/0x1b8 ksys_mmap_pgoff+0x18c/0x218 __arm64_sys_mmap+0x38/0x58 invoke_syscall+0x50/0x128 el0_svc_common.constprop.0+0x58/0x188 do_el0_svc+0x34/0x50 el0_svc+0x34/0x108 el0t_64_sync_handler+0xb8/0xc0 el0t_64_sync+0x1a4/0x1a8 'rb->aux_pages' allocated by kcalloc() is a pointer array which is used to maintains AUX trace pages. The allocated page for this array is physically contiguous (and virtually contiguous) with an order of 0..MAX_ORDER. If the size of pointer array crosses the limitation set by MAX_ORDER, it reveals a WARNING. So bail out early with -ENOMEM if the request AUX area is out of bound, e.g.: #perf record -C 0 -m ,4G -e arm_spe_0// -- sleep 1 failed to mmap with 12 (Cannot allocate memory) Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-28locking/ww_mutex/test: Fix potential workqueue corruptionJohn Stultz1-8/+12
[ Upstream commit bccdd808902f8c677317cec47c306e42b93b849e ] In some cases running with the test-ww_mutex code, I was seeing odd behavior where sometimes it seemed flush_workqueue was returning before all the work threads were finished. Often this would cause strange crashes as the mutexes would be freed while they were being used. Looking at the code, there is a lifetime problem as the controlling thread that spawns the work allocates the "struct stress" structures that are passed to the workqueue threads. Then when the workqueue threads are finished, they free the stress struct that was passed to them. Unfortunately the workqueue work_struct node is in the stress struct. Which means the work_struct is freed before the work thread returns and while flush_workqueue is waiting. It seems like a better idea to have the controlling thread both allocate and free the stress structures, so that we can be sure we don't corrupt the workqueue by freeing the structure prematurely. So this patch reworks the test to do so, and with this change I no longer see the early flush_workqueue returns. Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230922043616.19282-3-jstultz@google.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-20tracing/kprobes: Fix the order of argument descriptionsYujie Liu1-1/+1
[ Upstream commit f032c53bea6d2057c14553832d846be2f151cfb2 ] The order of descriptions should be consistent with the argument list of the function, so "kretprobe" should be the second one. int __kprobe_event_gen_cmd_start(struct dynevent_cmd *cmd, bool kretprobe, const char *name, const char *loc, ...) Link: https://lore.kernel.org/all/20231031041305.3363712-1-yujie.liu@intel.com/ Fixes: 2a588dd1d5d6 ("tracing: Add kprobe event command generation functions") Suggested-by: Mukesh Ojha <quic_mojha@quicinc.com> Signed-off-by: Yujie Liu <yujie.liu@intel.com> Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-20bpf: Check map->usercnt after timer->timer is assignedHou Tao1-9/+16
[ Upstream commit fd381ce60a2d79cc967506208085336d3d268ae0 ] When there are concurrent uref release and bpf timer init operations, the following sequence diagram is possible. It will break the guarantee provided by bpf_timer: bpf_timer will still be alive after userspace application releases or unpins the map. It also will lead to kmemleak for old kernel version which doesn't release bpf_timer when map is released. bpf program X: bpf_timer_init() lock timer->lock read timer->timer as NULL read map->usercnt != 0 process Y: close(map_fd) // put last uref bpf_map_put_uref() atomic_dec_and_test(map->usercnt) array_map_free_timers() bpf_timer_cancel_and_free() // just return read timer->timer is NULL t = bpf_map_kmalloc_node() timer->timer = t unlock timer->lock Fix the problem by checking map->usercnt after timer->timer is assigned, so when there are concurrent uref release and bpf timer init, either bpf_timer_cancel_and_free() from uref release reads a no-NULL timer or the newly-added atomic64_read() returns a zero usercnt. Because atomic_dec_and_test(map->usercnt) and READ_ONCE(timer->timer) in bpf_timer_cancel_and_free() are not protected by a lock, so add a memory barrier to guarantee the order between map->usercnt and timer->timer. Also use WRITE_ONCE(timer->timer, x) to match the lockless read of timer->timer in bpf_timer_cancel_and_free(). Reported-by: Hsin-Wei Hung <hsinweih@uci.edu> Closes: https://lore.kernel.org/bpf/CABcoxUaT2k9hWsS1tNgXyoU3E-=PuOgMn737qK984fbFmfYixQ@mail.gmail.com Fixes: b00628b1c7d5 ("bpf: Introduce bpf timers.") Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231030063616.1653024-1-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-20livepatch: Fix missing newline character in klp_resolve_symbols()Zheng Yejian1-1/+1
[ Upstream commit 67e18e132f0fd738f8c8cac3aa1420312073f795 ] Without the newline character, the log may not be printed immediately after the error occurs. Fixes: ca376a937486 ("livepatch: Prevent module-specific KLP rela sections from referencing vmlinux symbols") Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20230914072644.4098857-1-zhengyejian1@huawei.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-20padata: Fix refcnt handling in padata_free_shell()WangJinchao1-1/+5
[ Upstream commit 7ddc21e317b360c3444de3023bcc83b85fabae2f ] In a high-load arm64 environment, the pcrypt_aead01 test in LTP can lead to system UAF (Use-After-Free) issues. Due to the lengthy analysis of the pcrypt_aead01 function call, I'll describe the problem scenario using a simplified model: Suppose there's a user of padata named `user_function` that adheres to the padata requirement of calling `padata_free_shell` after `serial()` has been invoked, as demonstrated in the following code: ```c struct request { struct padata_priv padata; struct completion *done; }; void parallel(struct padata_priv *padata) { do_something(); } void serial(struct padata_priv *padata) { struct request *request = container_of(padata, struct request, padata); complete(request->done); } void user_function() { DECLARE_COMPLETION(done) padata->parallel = parallel; padata->serial = serial; padata_do_parallel(); wait_for_completion(&done); padata_free_shell(); } ``` In the corresponding padata.c file, there's the following code: ```c static void padata_serial_worker(struct work_struct *serial_work) { ... cnt = 0; while (!list_empty(&local_list)) { ... padata->serial(padata); cnt++; } local_bh_enable(); if (refcount_sub_and_test(cnt, &pd->refcnt)) padata_free_pd(pd); } ``` Because of the high system load and the accumulation of unexecuted softirq at this moment, `local_bh_enable()` in padata takes longer to execute than usual. Subsequently, when accessing `pd->refcnt`, `pd` has already been released by `padata_free_shell()`, resulting in a UAF issue with `pd->refcnt`. The fix is straightforward: add `refcount_dec_and_test` before calling `padata_free_pd` in `padata_free_shell`. Fixes: 07928d9bfc81 ("padata: Remove broken queue flushing") Signed-off-by: WangJinchao <wangjinchao@xfusion.com> Acked-by: Daniel Jordan <daniel.m.jordan@oracle.com> Acked-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-20genirq/matrix: Exclude managed interrupts in irq_matrix_allocated()Chen Yu1-3/+3
[ Upstream commit a0b0bad10587ae2948a7c36ca4ffc206007fbcf3 ] When a CPU is about to be offlined, x86 validates that all active interrupts which are targeted to this CPU can be migrated to the remaining online CPUs. If not, the offline operation is aborted. The validation uses irq_matrix_allocated() to retrieve the number of vectors which are allocated on the outgoing CPU. The returned number of allocated vectors includes also vectors which are associated to managed interrupts. That's overaccounting because managed interrupts are: - not migrated when the affinity mask of the interrupt targets only the outgoing CPU - migrated to another CPU, but in that case the vector is already pre-allocated on the potential target CPUs and must not be taken into account. As a consequence the check whether the remaining online CPUs have enough capacity for migrating the allocated vectors from the outgoing CPU might fail incorrectly. Let irq_matrix_allocated() return only the number of allocated non-managed interrupts to make this validation check correct. [ tglx: Amend changelog and fixup kernel-doc comment ] Fixes: 2f75d9e1c905 ("genirq: Implement bitmap matrix allocator") Reported-by: Wendy Wang <wendy.wang@intel.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20231020072522.557846-1-yu.c.chen@intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-20futex: Don't include process MM in futex key on no-MMUBen Wolsieffer1-1/+11
[ Upstream commit c73801ae4f22b390228ebf471d55668e824198b6 ] On no-MMU, all futexes are treated as private because there is no need to map a virtual address to physical to match the futex across processes. This doesn't quite work though, because private futexes include the current process's mm_struct as part of their key. This makes it impossible for one process to wake up a shared futex being waited on in another process. Fix this bug by excluding the mm_struct from the key.