summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2023-05-30x86/pci/xen: populate MSI sysfs entriesMaximilian Heyne1-2/+2
commit 335b4223466dd75f9f3ea4918187afbadd22e5c8 upstream. Commit bf5e758f02fc ("genirq/msi: Simplify sysfs handling") reworked the creation of sysfs entries for MSI IRQs. The creation used to be in msi_domain_alloc_irqs_descs_locked after calling ops->domain_alloc_irqs. Then it moved into __msi_domain_alloc_irqs which is an implementation of domain_alloc_irqs. However, Xen comes with the only other implementation of domain_alloc_irqs and hence doesn't run the sysfs population code anymore. Commit 6c796996ee70 ("x86/pci/xen: Fixup fallout from the PCI/MSI overhaul") set the flag MSI_FLAG_DEV_SYSFS for the xen msi_domain_info but that doesn't actually have an effect because Xen uses it's own domain_alloc_irqs implementation. Fix this by making use of the fallback functions for sysfs population. Fixes: bf5e758f02fc ("genirq/msi: Simplify sysfs handling") Signed-off-by: Maximilian Heyne <mheyne@amazon.de> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/20230503131656.15928-1-mheyne@amazon.de Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-30bpf: fix a memory leak in the LRU and LRU_PERCPU hash mapsAnton Protopopov1-2/+4
commit b34ffb0c6d23583830f9327864b9c1f486003305 upstream. The LRU and LRU_PERCPU maps allocate a new element on update before locking the target hash table bucket. Right after that the maps try to lock the bucket. If this fails, then maps return -EBUSY to the caller without releasing the allocated element. This makes the element untracked: it doesn't belong to either of free lists, and it doesn't belong to the hash table, so can't be re-used; this eventually leads to the permanent -ENOMEM on LRU map updates, which is unexpected. Fix this by returning the element to the local free list if bucket locking fails. Fixes: 20b6cc34ea74 ("bpf: Avoid hashtab deadlock with map_locked") Signed-off-by: Anton Protopopov <aspsk@isovalent.com> Link: https://lore.kernel.org/r/20230522154558.2166815-1-aspsk@isovalent.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-30bpf: Fix mask generation for 32-bit narrow loads of 64-bit fieldsWill Deacon1-1/+1
commit 0613d8ca9ab382caabe9ed2dceb429e9781e443f upstream. A narrow load from a 64-bit context field results in a 64-bit load followed potentially by a 64-bit right-shift and then a bitwise AND operation to extract the relevant data. In the case of a 32-bit access, an immediate mask of 0xffffffff is used to construct a 64-bit BPP_AND operation which then sign-extends the mask value and effectively acts as a glorified no-op. For example: 0: 61 10 00 00 00 00 00 00 r0 = *(u32 *)(r1 + 0) results in the following code generation for a 64-bit field: ldr x7, [x7] // 64-bit load mov x10, #0xffffffffffffffff and x7, x7, x10 Fix the mask generation so that narrow loads always perform a 32-bit AND operation: ldr x7, [x7] // 64-bit load mov w10, #0xffffffff and w7, w7, w10 Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Krzesimir Nowak <krzesimir@kinvolk.io> Cc: Andrey Ignatov <rdna@fb.com> Acked-by: Yonghong Song <yhs@fb.com> Fixes: 31fd85816dbe ("bpf: permits narrower load from bpf program context fields") Signed-off-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20230518102528.1341-1-will@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-24rethook: use preempt_{disable, enable}_notrace in rethook_trampoline_handlerZe Gao1-2/+2
commit be243bacfb25f5219f2396d787408e8cf1301dd1 upstream. This patch replaces preempt_{disable, enable} with its corresponding notrace version in rethook_trampoline_handler so no worries about stack recursion or overflow introduced by preempt_count_{add, sub} under fprobe + rethook context. Link: https://lore.kernel.org/all/20230517034510.15639-2-zegao@tencent.com/ Fixes: 54ecbe6f1ed5 ("rethook: Add a generic return hook") Signed-off-by: Ze Gao <zegao@tencent.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-24bpf: Add preempt_count_{sub,add} into btf id deny listYafang1-0/+4
[ Upstream commit c11bd046485d7bf1ca200db0e7d0bdc4bafdd395 ] The recursion check in __bpf_prog_enter* and __bpf_prog_exit* leave preempt_count_{sub,add} unprotected. When attaching trampoline to them we get panic as follows, [ 867.843050] BUG: TASK stack guard page was hit at 0000000009d325cf (stack is 0000000046a46a15..00000000537e7b28) [ 867.843064] stack guard page: 0000 [#1] PREEMPT SMP NOPTI [ 867.843067] CPU: 8 PID: 11009 Comm: trace Kdump: loaded Not tainted 6.2.0+ #4 [ 867.843100] Call Trace: [ 867.843101] <TASK> [ 867.843104] asm_exc_int3+0x3a/0x40 [ 867.843108] RIP: 0010:preempt_count_sub+0x1/0xa0 [ 867.843135] __bpf_prog_enter_recur+0x17/0x90 [ 867.843148] bpf_trampoline_6442468108_0+0x2e/0x1000 [ 867.843154] ? preempt_count_sub+0x1/0xa0 [ 867.843157] preempt_count_sub+0x5/0xa0 [ 867.843159] ? migrate_enable+0xac/0xf0 [ 867.843164] __bpf_prog_exit_recur+0x2d/0x40 [ 867.843168] bpf_trampoline_6442468108_0+0x55/0x1000 ... [ 867.843788] preempt_count_sub+0x5/0xa0 [ 867.843793] ? migrate_enable+0xac/0xf0 [ 867.843829] __bpf_prog_exit_recur+0x2d/0x40 [ 867.843837] BUG: IRQ stack guard page was hit at 0000000099bd8228 (stack is 00000000b23e2bc4..000000006d95af35) [ 867.843841] BUG: IRQ stack guard page was hit at 000000005ae07924 (stack is 00000000ffd69623..0000000014eb594c) [ 867.843843] BUG: IRQ stack guard page was hit at 00000000028320f0 (stack is 00000000034b6438..0000000078d1bcec) [ 867.843842] bpf_trampoline_6442468108_0+0x55/0x1000 ... That is because in __bpf_prog_exit_recur, the preempt_count_{sub,add} are called after prog->active is decreased. Fixing this by adding these two functions into btf ids deny list. Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Yafang <laoar.shao@gmail.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Jiri Olsa <olsajiri@gmail.com> Acked-by: Hao Luo <haoluo@google.com> Link: https://lore.kernel.org/r/20230413025248.79764-1-laoar.shao@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-24bpf: Annotate data races in bpf_local_storageKumar Kartikeya Dwivedi1-3/+13
[ Upstream commit 0a09a2f933c73dc76ab0b72da6855f44342a8903 ] There are a few cases where hlist_node is checked to be unhashed without holding the lock protecting its modification. In this case, one must use hlist_unhashed_lockless to avoid load tearing and KCSAN reports. Fix this by using lockless variant in places not protected by the lock. Since this is not prompted by any actual KCSAN reports but only from code review, I have not included a fixes tag. Cc: Martin KaFai Lau <martin.lau@kernel.org> Cc: KP Singh <kpsingh@kernel.org> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230221200646.2500777-4-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-24rcu: Protect rcu_print_task_exp_stall() ->exp_tasks accessZqiang1-2/+4
[ Upstream commit 3c1566bca3f8349f12b75d0a2d5e4a20ad6262ec ] For kernels built with CONFIG_PREEMPT_RCU=y, the following scenario can result in a NULL-pointer dereference: CPU1 CPU2 rcu_preempt_deferred_qs_irqrestore rcu_print_task_exp_stall if (special.b.blocked) READ_ONCE(rnp->exp_tasks) != NULL raw_spin_lock_rcu_node np = rcu_next_node_entry(t, rnp) if (&t->rcu_node_entry == rnp->exp_tasks) WRITE_ONCE(rnp->exp_tasks, np) .... raw_spin_unlock_irqrestore_rcu_node raw_spin_lock_irqsave_rcu_node t = list_entry(rnp->exp_tasks->prev, struct task_struct, rcu_node_entry) (if rnp->exp_tasks is NULL, this will dereference a NULL pointer) The problem is that CPU2 accesses the rcu_node structure's->exp_tasks field without holding the rcu_node structure's ->lock and CPU2 did not observe CPU1's change to rcu_node structure's ->exp_tasks in time. Therefore, if CPU1 sets rcu_node structure's->exp_tasks pointer to NULL, then CPU2 might dereference that NULL pointer. This commit therefore holds the rcu_node structure's ->lock while accessing that structure's->exp_tasks field. [ paulmck: Apply Frederic Weisbecker feedback. ] Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Zqiang <qiang1.zhang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-24refscale: Move shutdown from wait_event() to wait_event_idle()Paul E. McKenney1-1/+1
[ Upstream commit 6bc6e6b27524304aadb9c04611ddb1c84dd7617a ] The ref_scale_shutdown() kthread/function uses wait_event() to wait for the refscale test to complete. However, although the read-side tests are normally extremely fast, there is no law against specifying a very large value for the refscale.loops module parameter or against having a slow read-side primitive. Either way, this might well trigger the hung-task timeout. This commit therefore replaces those wait_event() calls with calls to wait_event_idle(), which do not trigger the hung-task timeout. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-24tick/broadcast: Make broadcast device replacement work correctlyThomas Gleixner1-32/+88
[ Upstream commit f9d36cf445ffff0b913ba187a3eff78028f9b1fb ] When a tick broadcast clockevent device is initialized for one shot mode then tick_broadcast_setup_oneshot() OR's the periodic broadcast mode cpumask into the oneshot broadcast cpumask. This is required when switching from periodic broadcast mode to oneshot broadcast mode to ensure that CPUs which are waiting for periodic broadcast are woken up on the next tick. But it is subtly broken, when an active broadcast device is replaced and the system is already in oneshot (NOHZ/HIGHRES) mode. Victor observed this and debugged the issue. Then the OR of the periodic broadcast CPU mask is wrong as the periodic cpumask bits are sticky after tick_broadcast_enable() set it for a CPU unless explicitly cleared via tick_broadcast_disable(). That means that this sets all other CPUs which have tick broadcasting enabled at that point unconditionally in the oneshot broadcast mask. If the affected CPUs were already idle and had their bits set in the oneshot broadcast mask then this does no harm. But for non idle CPUs which were not set this corrupts their state. On their next invocation of tick_broadcast_enable() they observe the bit set, which indicates that the broadcast for the CPU is already set up. As a consequence they fail to update the broadcast event even if their earliest expiring timer is before the actually programmed broadcast event. If the programmed broadcast event is far in the future, then this can cause stalls or trigger the hung task detector. Avoid this by telling tick_broadcast_setup_oneshot() explicitly whether this is the initial switch over from periodic to oneshot broadcast which must take the periodic broadcast mask into account. In the case of initialization of a replacement device this prevents that the broadcast oneshot mask is modified. There is a second problem with broadcast device replacement in this function. The broadcast device is only armed when the previous state of the device was periodic. That is correct for the switch from periodic broadcast mode to oneshot broadcast mode as the underlying broadcast device could operate in oneshot state already due to lack of periodic state in hardware. In that case it is already armed to expire at the next tick. For the replacement case this is wrong as the device is in shutdown state. That means that any already pending broadcast event will not be armed. This went unnoticed because any CPU which goes idle will observe that the broadcast device has an expiry time of KTIME_MAX and therefore any CPUs next timer event will be earlier and cause a reprogramming of the broadcast device. But that does not guarantee that the events of the CPUs which were already in idle are delivered on time. Fix this by arming the newly installed device for an immediate event which will reevaluate the per CPU expiry times and reprogram the broadcast device accordingly. This is simpler than caching the last expiry time in yet another place or saving it before the device exchange and handing it down to the setup function. Replacement of broadcast devices is not a frequent operation and usually happens once somewhere late in the boot process. Fixes: 9c336c9935cf ("tick/broadcast: Allow late registered device to enter oneshot mode") Reported-by: Victor Hassan <victor@allwinnertech.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/87pm7d2z1i.ffs@tglx Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-17locking/rwsem: Add __always_inline annotation to __down_read_common() and ↵John Stultz1-4/+4
inlined callers commit 92cc5d00a431e96e5a49c0b97e5ad4fa7536bd4b upstream. Apparently despite it being marked inline, the compiler may not inline __down_read_common() which makes it difficult to identify the cause of lock contention, as the blocked function in traceevents will always be listed as __down_read_common(). So this patch adds __always_inline annotation to the common function (as well as the inlined helper callers) to force it to be inlined so the blocking function will be listed (via Wchan) in traceevents. Fixes: c995e638ccbb ("locking/rwsem: Fold __down_{read,write}*()") Reported-by: Tim Murray <timmurray@google.com> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Waiman Long <longman@redhat.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20230503023351.2832796-1-jstultz@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-11kcsan: Avoid READ_ONCE() in read_instrumented_memory()Marco Elver1-4/+13
commit 8dec88070d964bfeb4198f34cb5956d89dd1f557 upstream. Haibo Li reported: | Unable to handle kernel paging request at virtual address | ffffff802a0d8d7171 | Mem abort info:o: | ESR = 0x9600002121 | EC = 0x25: DABT (current EL), IL = 32 bitsts | SET = 0, FnV = 0 0 | EA = 0, S1PTW = 0 0 | FSC = 0x21: alignment fault | Data abort info:o: | ISV = 0, ISS = 0x0000002121 | CM = 0, WnR = 0 0 | swapper pgtable: 4k pages, 39-bit VAs, pgdp=000000002835200000 | [ffffff802a0d8d71] pgd=180000005fbf9003, p4d=180000005fbf9003, | pud=180000005fbf9003, pmd=180000005fbe8003, pte=006800002a0d8707 | Internal error: Oops: 96000021 [#1] PREEMPT SMP | Modules linked in: | CPU: 2 PID: 45 Comm: kworker/u8:2 Not tainted | 5.15.78-android13-8-g63561175bbda-dirty #1 | ... | pc : kcsan_setup_watchpoint+0x26c/0x6bc | lr : kcsan_setup_watchpoint+0x88/0x6bc | sp : ffffffc00ab4b7f0 | x29: ffffffc00ab4b800 x28: ffffff80294fe588 x27: 0000000000000001 | x26: 0000000000000019 x25: 0000000000000001 x24: ffffff80294fdb80 | x23: 0000000000000000 x22: ffffffc00a70fb68 x21: ffffff802a0d8d71 | x20: 0000000000000002 x19: 0000000000000000 x18: ffffffc00a9bd060 | x17: 0000000000000001 x16: 0000000000000000 x15: ffffffc00a59f000 | x14: 0000000000000001 x13: 0000000000000000 x12: ffffffc00a70faa0 | x11: 00000000aaaaaaab x10: 0000000000000054 x9 : ffffffc00839adf8 | x8 : ffffffc009b4cf00 x7 : 0000000000000000 x6 : 0000000000000007 | x5 : 0000000000000000 x4 : 0000000000000000 x3 : ffffffc00a70fb70 | x2 : 0005ff802a0d8d71 x1 : 0000000000000000 x0 : 0000000000000000 | Call trace: | kcsan_setup_watchpoint+0x26c/0x6bc | __tsan_read2+0x1f0/0x234 | inflate_fast+0x498/0x750 | zlib_inflate+0x1304/0x2384 | __gunzip+0x3a0/0x45c | gunzip+0x20/0x30 | unpack_to_rootfs+0x2a8/0x3fc | do_populate_rootfs+0xe8/0x11c | async_run_entry_fn+0x58/0x1bc | process_one_work+0x3ec/0x738 | worker_thread+0x4c4/0x838 | kthread+0x20c/0x258 | ret_from_fork+0x10/0x20 | Code: b8bfc2a8 2a0803f7 14000007 d503249f (78bfc2a8) ) | ---[ end trace 613a943cb0a572b6 ]----- The reason for this is that on certain arm64 configuration since e35123d83ee3 ("arm64: lto: Strengthen READ_ONCE() to acquire when CONFIG_LTO=y"), READ_ONCE() may be promoted to a full atomic acquire instruction which cannot be used on unaligned addresses. Fix it by avoiding READ_ONCE() in read_instrumented_memory(), and simply forcing the compiler to do the required access by casting to the appropriate volatile type. In terms of generated code this currently only affects architectures that do not use the default READ_ONCE() implementation. The only downside is that we are not guaranteed atomicity of the access itself, although on most architectures a plain load up to machine word size should still be atomic (a fact the default READ_ONCE() still relies on itself). Reported-by: Haibo Li <haibo.li@mediatek.com> Tested-by: Haibo Li <haibo.li@mediatek.com> Cc: <stable@vger.kernel.org> # 5.17+ Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-11PM: hibernate: Do not get block device exclusively in test_resume modeChen Yu2-4/+12
[ Upstream commit 5904de0d735bbb3b4afe9375c5b4f9748f882945 ] The system refused to do a test_resume because it found that the swap device has already been taken by someone else. Specifically, the swsusp_check()->blkdev_get_by_dev(FMODE_EXCL) is supposed to do this check. Steps to reproduce: dd if=/dev/zero of=/swapfile bs=$(cat /proc/meminfo | awk '/MemTotal/ {print $2}') count=1024 conv=notrunc mkswap /swapfile swapon /swapfile swap-offset /swapfile echo 34816 > /sys/power/resume_offset echo test_resume > /sys/power/disk echo disk > /sys/power/state PM: Using 3 thread(s) for compression PM: Compressing and saving image data (293150 pages)... PM: Image saving progress: 0% PM: Image saving progress: 10% ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300) ata1.00: configured for UDMA/100 ata2: SATA link down (SStatus 0 SControl 300) ata5: SATA link down (SStatus 0 SControl 300) ata6: SATA link down (SStatus 0 SControl 300) ata3: SATA link down (SStatus 0 SControl 300) ata4: SATA link down (SStatus 0 SControl 300) PM: Image saving progress: 20% PM: Image saving progress: 30% PM: Image saving progress: 40% PM: Image saving progress: 50% pcieport 0000:00:02.5: pciehp: Slot(0-5): No device found PM: Image saving progress: 60% PM: Image saving progress: 70% PM: Image saving progress: 80% PM: Image saving progress: 90% PM: Image saving done PM: hibernation: Wrote 1172600 kbytes in 2.70 seconds (434.29 MB/s) PM: S| PM: hibernation: Basic memory bitmaps freed PM: Image not found (code -16) This is because when using the swapfile as the hibernation storage, the block device where the swapfile is located has already been mounted by the OS distribution(usually mounted as the rootfs). This is not an issue for normal hibernation, because software_resume()->swsusp_check() happens before the block device(rootfs) mount. But it is a problem for the test_resume mode. Because when test_resume happens, the block device has been mounted already. Thus remove the FMODE_EXCL for test_resume mode. This would not be a problem because in test_resume stage, the processes have already been frozen, and the race condition described in Commit 39fbef4b0f77 ("PM: hibernate: Get block device exclusively in swsusp_check()") is unlikely to happen. Fixes: 39fbef4b0f77 ("PM: hibernate: Get block device exclusively in swsusp_check()") Reported-by: Yifan Li <yifan2.li@intel.com> Suggested-by: Pavankumar Kondeti <quic_pkondeti@quicinc.com> Tested-by: Pavankumar Kondeti <quic_pkondeti@quicinc.com> Tested-by: Wendy Wang <wendy.wang@intel.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11PM: hibernate: Turn snapshot_test into global variableChen Yu2-1/+7
[ Upstream commit 08169a162f97819d3e5b4a342bb9cf5137787154 ] There is need to check snapshot_test and open block device in different mode, so as to avoid the race condition. No functional changes intended. Suggested-by: Pavankumar Kondeti <quic_pkondeti@quicinc.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Stable-dep-of: 5904de0d735b ("PM: hibernate: Do not get block device exclusively in test_resume mode") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11timekeeping: Fix references to nonexistent ktime_get_fast_ns()Geert Uytterhoeven1-2/+2
[ Upstream commit 158009f1b4a33bc0f354b994eea361362bd83226 ] There was never a function named ktime_get_fast_ns(). Presumably these should refer to ktime_get_mono_fast_ns() instead. Fixes: c1ce406e80fb15fa ("timekeeping: Fix up function documentation for the NMI safe accessors") Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/r/06df7b3cbd94f016403bbf6cd2b38e4368e7468f.1682516546.git.geert+renesas@glider.be Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11swiotlb: fix debugfs reporting of reserved memory poolsMichael Kelley1-2/+4
[ Upstream commit 5499d01c029069044a3b3e50501c77b474c96178 ] For io_tlb_nslabs, the debugfs code reports the correct value for a specific reserved memory pool. But for io_tlb_used, the value reported is always for the default pool, not the specific reserved pool. Fix this. Fixes: 5c850d31880e ("swiotlb: fix passing local variable to debugfs_create_ulong()") Signed-off-by: Michael Kelley <mikelley@microsoft.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11swiotlb: relocate PageHighMem test away from rmem_swiotlb_setupDoug Berger1-5/+5
[ Upstream commit a90922fa25370902322e9de6640e58737d459a50 ] The reservedmem_of_init_fn's are invoked very early at boot before the memory zones have even been defined. This makes it inappropriate to test whether the page corresponding to a PFN is in ZONE_HIGHMEM from within one. Removing the check allows an ARM 32-bit kernel with SPARSEMEM enabled to boot properly since otherwise we would be de-referencing an uninitialized sparsemem map to perform pfn_to_page() check. The arm64 architecture happens to work (and also has no high memory) but other 32-bit architectures could also be having similar issues. While it would be nice to provide early feedback about a reserved DMA pool residing in highmem, it is not possible to do that until the first time we try to use it, which is where the check is moved to. Fixes: 0b84e4f8b793 ("swiotlb: Add restricted DMA pool initialization") Signed-off-by: Doug Berger <opendmb@gmail.com> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11workqueue: Fix hung time report of worker poolsPetr Mladek1-3/+7
[ Upstream commit 335a42ebb0ca8ee9997a1731aaaae6dcd704c113 ] The workqueue watchdog prints a warning when there is no progress in a worker pool. Where the progress means that the pool started processing a pending work item. Note that it is perfectly fine to process work items much longer. The progress should be guaranteed by waking up or creating idle workers. show_one_worker_pool() prints state of non-idle worker pool. It shows a delay since the last pool->watchdog_ts. The timestamp is updated when a first pending work is queued in __queue_work(). Also it is updated when a work is dequeued for processing in worker_thread() and rescuer_thread(). The delay is misleading when there is no pending work item. In this case it shows how long the last work item is being proceed. Show zero instead. There is no stall if there is no pending work. Fixes: 82607adcf9cdf40fb7b ("workqueue: implement lockup detector") Signed-off-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11tracing/user_events: Ensure write index cannot be negativeBeau Belgrave1-0/+3
[ Upstream commit cd98c93286a30cc4588dfd02453bec63c2f4acf4 ] The write index indicates which event the data is for and accesses a per-file array. The index is passed by user processes during write() calls as the first 4 bytes. Ensure that it cannot be negative by returning -EINVAL to prevent out of bounds accesses. Update ftrace self-test to ensure this occurs properly. Link: https://lkml.kernel.org/r/20230425225107.8525-2-beaub@linux.microsoft.com Fixes: 7f5a08c79df3 ("user_events: Add minimal support for trace_event into ftrace") Reported-by: Doug Cook <dcook@linux.microsoft.com> Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11sched/rt: Fix bad task migration for rt tasksSchspa Shi2-0/+5
[ Upstream commit feffe5bb274dd3442080ef0e4053746091878799 ] Commit 95158a89dd50 ("sched,rt: Use the full cpumask for balancing") allows find_lock_lowest_rq() to pick a task with migration disabled. The purpose of the commit is to push the current running task on the CPU that has the migrate_disable() task away. However, there is a race which allows a migrate_disable() task to be migrated. Consider: CPU0 CPU1 push_rt_task check is_migration_disabled(next_task) task not running and migration_disabled == 0 find_lock_lowest_rq(next_task, rq); _double_lock_balance(this_rq, busiest); raw_spin_rq_unlock(this_rq); double_rq_lock(this_rq, busiest); <<wait for busiest rq>> <wakeup> task become running migrate_disable(); <context out> deactivate_task(rq, next_task, 0); set_task_cpu(next_task, lowest_rq->cpu); WARN_ON_ONCE(is_migration_disabled(p)); Fixes: 95158a89dd50 ("sched,rt: Use the full cpumask for balancing") Signed-off-by: Schspa Shi <schspa@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Dwaine Gonyier <dgonyier@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11perf/core: Fix hardlockup failure caused by perf throttleYang Jihong1-2/+2
[ Upstream commit 15def34e2635ab7e0e96f1bc32e1b69609f14942 ] commit e050e3f0a71bf ("perf: Fix broken interrupt rate throttling") introduces a change in throttling threshold judgment. Before this, compare hwc->interrupts and max_samples_per_tick, then increase hwc->interrupts by 1, but this commit reverses order of these two behaviors, causing the semantics of max_samples_per_tick to change. In literal sense of "max_samples_per_tick", if hwc->interrupts == max_samples_per_tick, it should not be throttled, therefore, the judgment condition should be changed to "hwc->interrupts > max_samples_per_tick". In fact, this may cause the hardlockup to fail, The minimum value of max_samples_per_tick may be 1, in this case, the return value of __perf_event_account_interrupt function is 1. As a result, nmi_watchdog gets throttled, which would stop PMU (Use x86 architecture as an example, see x86_pmu_handle_irq). Fixes: e050e3f0a71b ("perf: Fix broken interrupt rate throttling") Signed-off-by: Yang Jihong <yangjihong1@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20230227023508.102230-1-yangjihong1@huawei.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11sched/fair: Fix inaccurate tally of ttwu_move_affineLibo Chen1-1/+1
[ Upstream commit 39afe5d6fc59237ff7738bf3ede5a8856822d59d ] There are scenarios where non-affine wakeups are incorrectly counted as affine wakeups by schedstats. When wake_affine_idle() returns prev_cpu which doesn't equal to nr_cpumask_bits, it will slip through the check: target == nr_cpumask_bits in wake_affine() and be counted as if target == this_cpu in schedstats. Replace target == nr_cpumask_bits with target != this_cpu to make sure affine wakeups are accurately tallied. Fixes: 806486c377e33 (sched/fair: Do not migrate if the prev_cpu is idle) Suggested-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Libo Chen <libo.chen@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com> Link: https://lore.kernel.org/r/20220810223313.386614-1-libo.chen@oracle.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11bpf: Don't EFAULT for getsockopt with optval=NULLStanislav Fomichev1-3/+6
[ Upstream commit 00e74ae0863827d944e36e56a4ce1e77e50edb91 ] Some socket options do getsockopt with optval=NULL to estimate the size of the final buffer (which is returned via optlen). This breaks BPF getsockopt assumptions about permitted optval buffer size. Let's enforce these assumptions only when non-NULL optval is provided. Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks") Reported-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/ZD7Js4fj5YyI2oLd@google.com/T/#mb68daf700f87a9244a15d01d00c3f0e5b08f49f7 Link: https://lore.kernel.org/bpf/20230418225343.553806-2-sdf@google.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11bpf: Fix race between btf_put and btf_idr walk.Alexei Starovoitov1-5/+3
[ Upstream commit acf1c3d68e9a31f10d92bc67ad4673cdae5e8d92 ] Florian and Eduard reported hard dead lock: [ 58.433327] _raw_spin_lock_irqsave+0x40/0x50 [ 58.433334] btf_put+0x43/0x90 [ 58.433338] bpf_find_btf_id+0x157/0x240 [ 58.433353] btf_parse_fields+0x921/0x11c0 This happens since btf->refcount can be 1 at the time of btf_put() and btf_put() will call btf_free_id() which will try to grab btf_idr_lock and will dead lock. Avoid the issue by doing btf_put() without locking. Fixes: 3d78417b60fb ("bpf: Add bpf_btf_find_by_name_kind() helper.") Fixes: 1e89106da253 ("bpf: Add bpf_core_add_cands() and wire it into bpf_core_apply_relo_insn().") Reported-by: Florian Westphal <fw@strlen.de> Reported-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20230421014901.70908-1-alexei.starovoitov@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11bpf/btf: Fix is_int_ptr()Feng Zhou1-6/+2
[ Upstream commit 91f2dc6838c19342f7f2993627c622835cc24890 ] When tracing a kernel function with arg type is u32*, btf_ctx_access() would report error: arg2 type INT is not a struct. The commit bb6728d75611 ("bpf: Allow access to int pointer arguments in tracing programs") added support for int pointer, but did not skip modifiers before checking it's type. This patch fixes it. Fixes: bb6728d75611 ("bpf: Allow access to int pointer arguments in tracing programs") Co-developed-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Feng Zhou <zhoufeng.zf@bytedance.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20230410085908.98493-2-zhoufeng.zf@bytedance.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11bpf: Fix __reg_bound_offset 64->32 var_off subreg propagationDaniel Borkmann1-3/+3
[ Upstream commit 7be14c1c9030f73cc18b4ff23b78a0a081f16188 ] Xu reports that after commit 3f50f132d840 ("bpf: Verifier, do explicit ALU32 bounds tracking"), the following BPF program is rejected by the verifier: 0: (61) r2 = *(u32 *)(r1 +0) ; R2_w=pkt(off=0,r=0,imm=0) 1: (61) r3 = *(u32 *)(r1 +4) ; R3_w=pkt_end(off=0,imm=0) 2: (bf) r1 = r2 3: (07) r1 += 1 4: (2d) if r1 > r3 goto pc+8 5: (71) r1 = *(u8 *)(r2 +0) ; R1_w=scalar(umax=255,var_off=(0x0; 0xff)) 6: (18) r0 = 0x7fffffffffffff10 8: (0f) r1 += r0 ; R1_w=scalar(umin=0x7fffffffffffff10,umax=0x800000000000000f) 9: (18) r0 = 0x8000000000000000 11: (07) r0 += 1 12: (ad) if r0 < r1 goto pc-2 13: (b7) r0 = 0 14: (95) exit And the verifier log says: func#0 @0 0: R1=ctx(off=0,imm=0) R10=fp0 0: (61) r2 = *(u32 *)(r1 +0) ; R1=ctx(off=0,imm=0) R2_w=pkt(off=0,r=0,imm=0) 1: (61) r3 = *(u32 *)(r1 +4) ; R1=ctx(off=0,imm=0) R3_w=pkt_end(off=0,imm=0) 2: (bf) r1 = r2 ; R1_w=pkt(off=0,r=0,imm=0) R2_w=pkt(off=0,r=0,imm=0) 3: (07) r1 += 1 ; R1_w=pkt(off=1,r=0,imm=0) 4: (2d) if r1 > r3 goto pc+8 ; R1_w=pkt(off=1,r=1,imm=0) R3_w=pkt_end(off=0,imm=0) 5: (71) r1 = *(u8 *)(r2 +0) ; R1_w=scalar(umax=255,var_off=(0x0; 0xff)) R2_w=pkt(off=0,r=1,imm=0) 6: (18) r0 = 0x7fffffffffffff10 ; R0_w=9223372036854775568 8: (0f) r1 += r0 ; R0_w=9223372036854775568 R1_w=scalar(umin=9223372036854775568,umax=9223372036854775823,s32_min=-240,s32_max=15) 9: (18) r0 = 0x8000000000000000 ; R0_w=-9223372036854775808 11: (07) r0 += 1 ; R0_w=-9223372036854775807 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775807 R1_w=scalar(umin=9223372036854775568,umax=9223372036854775809) 13: (b7) r0 = 0 ; R0_w=0 14: (95) exit from 12 to 11: R0_w=-9223372036854775807 R1_w=scalar(umin=9223372036854775810,umax=9223372036854775823,var_off=(0x8000000000000000; 0xffffffff)) R2_w=pkt(off=0,r=1,imm=0) R3_w=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775806 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775806 R1_w=scalar(umin=9223372036854775810,umax=9223372036854775810,var_off=(0x8000000000000000; 0xffffffff)) 13: safe [...] from 12 to 11: R0_w=-9223372036854775795 R1=scalar(umin=9223372036854775822,umax=9223372036854775823,var_off=(0x8000000000000000; 0xffffffff)) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775794 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775794 R1=scalar(umin=9223372036854775822,umax=9223372036854775822,var_off=(0x8000000000000000; 0xffffffff)) 13: safe from 12 to 11: R0_w=-9223372036854775794 R1=scalar(umin=9223372036854775823,umax=9223372036854775823,var_off=(0x8000000000000000; 0xffffffff)) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775793 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775793 R1=scalar(umin=9223372036854775823,umax=9223372036854775823,var_off=(0x8000000000000000; 0xffffffff)) 13: safe from 12 to 11: R0_w=-9223372036854775793 R1=scalar(umin=9223372036854775824,umax=9223372036854775823,var_off=(0x8000000000000000; 0xffffffff)) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775792 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775792 R1=scalar(umin=9223372036854775824,umax=9223372036854775823,var_off=(0x8000000000000000; 0xffffffff)) 13: safe [...] The 64bit umin=9223372036854775810 bound continuously bumps by +1 while umax=9223372036854775823 stays as-is until the verifier complexity limit is reached and the program gets finally rejected. During this simulation, the umin also eventually surpasses umax. Looking at the first 'from 12 to 11' output line from the loop, R1 has the following state: R1_w=scalar(umin=0x8000000000000002 (9223372036854775810), umax=0x800000000000000f (9223372036854775823), var_off=(0x8000000000000000; 0xffffffff)) The var_off has technically not an inconsistent state but it's very imprecise and far off surpassing 64bit umax bounds whereas the expected output with refined known bits in var_off should have been like: R1_w=scalar(umin=0x8000000000000002 (9223372036854775810), umax=0x800000000000000f (9223372036854775823), var_off=(0x8000000000000000; 0xf)) In the above log, var_off stays as var_off=(0x8000000000000000; 0xffffffff) and does not converge into a narrower mask where more bits become known, eventually transforming R1 into a constant upon umin=9223372036854775823, umax=9223372036854775823 case where the verifier would have terminated and let the program pass. The __reg_combine_64_into_32() marks the subregister unknown and propagates 64bit {s,u}min/{s,u}max bounds to their 32bit equivalents iff they are within the 32bit universe. The question came up whether __reg_combine_64_into_32() should special case the situation that when 64bit {s,u}min bounds have the same value as 64bit {s,u}max bounds to then assign the latter as well to the 32bit reg->{s,u}32_{min,max}_value. As can be seen from the above example however, that is just /one/ special case and not a /generic/ solution given above example would still not be addressed this way and remain at an imprecise var_off=(0x8000000000000000; 0xffffffff). The improvement is needed in __reg_bound_offset() to refine var32_off with the updated var64_off instead of the prior reg->var_off. The reg_bounds_sync() code first refines information about the register's min/max bounds via __update_reg_bounds() from the current var_off, then in __reg_deduce_bounds() from sign bit and with the potentially learned bits from bounds it'll update the var_off tnum in __reg_bound_offset(). For example, intersecting with the old var_off might have improved bounds slightly, e.g. if umax was 0x7f...f and var_off was (0; 0xf...fc), then new var_off will then result in (0; 0x7f...fc). The intersected var64_off holds then the universe which is a superset of var32_off. The point for the latter is not to broaden, but to further refine known bits based on the intersection of var_off with 32 bit bounds, so that we later construct the final var_off from upper and lower 32 bits. The final __update_reg_bounds() can then potentially still slightly refine bounds if more bits became known from the new var_off. After the improvement, we can see R1 converging successively: func#0 @0 0: R1=ctx(off=0,imm=0) R10=fp0 0: (61) r2 = *(u32 *)(r1 +0) ; R1=ctx(off=0,imm=0) R2_w=pkt(off=0,r=0,imm=0) 1: (61) r3 = *(u32 *)(r1 +4) ; R1=ctx(off=0,imm=0) R3_w=pkt_end(off=0,imm=0) 2: (bf) r1 = r2 ; R1_w=pkt(off=0,r=0,imm=0) R2_w=pkt(off=0,r=0,imm=0) 3: (07) r1 += 1 ; R1_w=pkt(off=1,r=0,imm=0) 4: (2d) if r1 > r3 goto pc+8 ; R1_w=pkt(off=1,r=1,imm=0) R3_w=pkt_end(off=0,imm=0) 5: (71) r1 = *(u8 *)(r2 +0) ; R1_w=scalar(umax=255,var_off=(0x0; 0xff)) R2_w=pkt(off=0,r=1,imm=0) 6: (18) r0 = 0x7fffffffffffff10 ; R0_w=9223372036854775568 8: (0f) r1 += r0 ; R0_w=9223372036854775568 R1_w=scalar(umin=9223372036854775568,umax=9223372036854775823,s32_min=-240,s32_max=15) 9: (18) r0 = 0x8000000000000000 ; R0_w=-9223372036854775808 11: (07) r0 += 1 ; R0_w=-9223372036854775807 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775807 R1_w=scalar(umin=9223372036854775568,umax=9223372036854775809) 13: (b7) r0 = 0 ; R0_w=0 14: (95) exit from 12 to 11: R0_w=-9223372036854775807 R1_w=scalar(umin=9223372036854775810,umax=9223372036854775823,var_off=(0x8000000000000000; 0xf),s32_min=0,s32_max=15,u32_max=15) R2_w=pkt(off=0,r=1,imm=0) R3_w=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775806 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775806 R1_w=-9223372036854775806 13: safe from 12 to 11: R0_w=-9223372036854775806 R1_w=scalar(umin=9223372036854775811,umax=9223372036854775823,var_off=(0x8000000000000000; 0xf),s32_min=0,s32_max=15,u32_max=15) R2_w=pkt(off=0,r=1,imm=0) R3_w=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775805 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775805 R1_w=-9223372036854775805 13: safe [...] from 12 to 11: R0_w=-9223372036854775798 R1=scalar(umin=9223372036854775819,umax=9223372036854775823,var_off=(0x8000000000000008; 0x7),s32_min=8,s32_max=15,u32_min=8,u32_max=15) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775797 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775797 R1=-9223372036854775797 13: safe from 12 to 11: R0_w=-9223372036854775797 R1=scalar(umin=9223372036854775820,umax=9223372036854775823,var_off=(0x800000000000000c; 0x3),s32_min=12,s32_max=15,u32_min=12,u32_max=15) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775796 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775796 R1=-9223372036854775796 13: safe from 12 to 11: R0_w=-9223372036854775796 R1=scalar(umin=9223372036854775821,umax=9223372036854775823,var_off=(0x800000000000000c; 0x3),s32_min=12,s32_max=15,u32_min=12,u32_max=15) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775795 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775795 R1=-9223372036854775795 13: safe from 12 to 11: R0_w=-9223372036854775795 R1=scalar(umin=9223372036854775822,umax=9223372036854775823,var_off=(0x800000000000000e; 0x1),s32_min=14,s32_max=15,u32_min=14,u32_max=15) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775794 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775794 R1=-9223372036854775794 13: safe from 12 to 11: R0_w=-9223372036854775794 R1=-9223372036854775793 R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775793 12: (ad) if r0 < r1 goto pc-2 last_idx 12 first_idx 12 parent didn't have regs=1 stack=0 marks: R0_rw=P-9223372036854775801 R1_r=scalar(umin=9223372036854775815,umax=9223372036854775823,var_off=(0x8000000000000000; 0xf),s32_min=0,s32_max=15,u32_max=15) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0 last_idx 11 first_idx 11 regs=1 stack=0 before 11: (07) r0 += 1 parent didn't have regs=1 stack=0 marks: R0_rw=P-9223372036854775805 R1_rw=scalar(umin=9223372036854775812,umax=9223372036854775823,var_off=(0x8000000000000000; 0xf),s32_min=0,s32_max=15,u32_max=15) R2_w=pkt(off=0,r=1,imm=0) R3_w=pkt_end(off=0,imm=0) R10=fp0 last_idx 12 first_idx 0 regs=1 stack=0 before 12: (ad) if r0 < r1 goto pc-2 regs=1 stack=0 before 11: (07) r0 += 1 regs=1 stack=0 before 12: (ad) if r0 < r1 goto pc-2 regs=1 stack=0 before 11: (07) r0 += 1 regs=1 stack=0 before 12: (ad) if r0 < r1 goto pc-2 regs=1 stack=0 before 11: (07) r0 += 1 regs=1 stack=0 before 9: (18) r0 = 0x8000000000000000 last_idx 12 first_idx 12 parent didn't have regs=2 stack=0 marks: R0_rw=P-9223372036854775801 R1_r=Pscalar(umin=9223372036854775815,umax=9223372036854775823,var_off=(0x8000000000000000; 0xf),s32_min=0,s32_max=15,u32_max=15) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0 last_idx 11 first_idx 11 regs=2 stack=0 before 11: (07) r0 += 1 parent didn't have regs=2 stack=0 marks: R0_rw=P-9223372036854775805 R1_rw=Pscalar(umin=9223372036854775812,umax=9223372036854775823,var_off=(0x8000000000000000; 0xf),s32_min=0,s32_max=15,u32_max=15) R2_w=pkt(off=0,r=1,imm=0) R3_w=pkt_end(off=0,imm=0) R10=fp0 last_idx 12 first_idx 0 regs=2 stack=0 before 12: (ad) if r0 < r1 goto pc-2 regs=2 stack=0 before 11: (07) r0 += 1 regs=2 stack=0 before 12: (ad) if r0 < r1 goto pc-2 regs=2 stack=0 before 11: (07) r0 += 1 regs=2 stack=0 before 12: (ad) if r0 < r1 goto pc-2 regs=2 stack=0 before 11: (07) r0 += 1 regs=2 stack=0 before 9: (18) r0 = 0x8000000000000000 regs=2 stack=0 before 8: (0f) r1 += r0 regs=3 stack=0 before 6: (18) r0 = 0x7fffffffffffff10 regs=2 stack=0 before 5: (71) r1 = *(u8 *)(r2 +0) 13: safe from 4 to 13: safe verification time 322 usec stack depth 0 processed 56 insns (limit 1000000) max_states_per_insn 1 total_states 3 peak_states 3 mark_read 1 This also fixes up a test case along with this improvement where we match on the verifier log. The updated log now has a refined var_off, too. Fixes: 3f50f132d840 ("bpf: Verifier, do explicit ALU32 bounds tracking") Reported-by: Xu Kuohai <xukuohai@huaweicloud.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20230314203424.4015351-2-xukuohai@huaweicloud.com Link: https://lore.kernel.org/bpf/20230322213056.2470-1-daniel@iogearbox.net Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11bpf: Remove misleading spec_v1 check on var-offset stack readLuis Gerhorst1-10/+6
[ Upstream commit 082cdc69a4651dd2a77539d69416a359ed1214f5 ] For every BPF_ADD/SUB involving a pointer, adjust_ptr_min_max_vals() ensures that the resulting pointer has a constant offset if bypass_spec_v1 is false. This is ensured by calling sanitize_check_bounds() which in turn calls check_stack_access_for_ptr_arithmetic(). There, -EACCESS is returned if the register's offset is not constant, thereby rejecting the program. In summary, an unprivileged user must never be able to create stack pointers with a variable offset. That is also the case, because a respective check in check_stack_write() is missing. If they were able to create a variable-offset pointer, users could still use it in a stack-write operation to trigger unsafe speculative behavior [1]. Because unprivileged users must already be prevented from creating variable-offset stack pointers, viable options are to either remove this check (replacing it with a clarifying comment), or to turn it into a "verifier BUG"-message, also adding a similar check in check_stack_write() (for consistency, as a second-level defense). This patch implements the first option to reduce verifier bloat. This check was introduced by commit 01f810ace9ed ("bpf: Allow variable-offset stack access") which correctly notes that "variable-offset reads and writes are disallowed (they were already disallowed for the indirect access case) because the speculative execution checking code doesn't support them". However, it does not further discuss why the check in check_stack_read() is necessary. The code which made this check obsolete was also introduced in this commit. I have compiled ~650 programs from the Linux selftests, Linux samples, Cilium, and libbpf/examples projects and confirmed that none of these trigger the check in check_stack_read() [2]. Instead, all of these programs are, as expected, already rejected when constructing the variable-offset pointers. Note that the check in check_stack_access_for_ptr_arithmetic() also prints "off=%d" while the code removed by this patch does not (the error removed does not appear in the "verification_error" values). For reproducibility, the repository linked includes the raw data and scripts used to create the plot. [1] https://arxiv.org/pdf/1807.03757.pdf [2] https://gitlab.cs.fau.de/un65esoq/bpf-spectre/-/raw/53dc19fcf459c186613b1156a81504b39c8d49db/data/plots/23-02-26_23-56_bpftool/bpftool/0004-errors.pdf?inline=false Fixes: 01f810ace9ed ("bpf: Allow variable-offset stack access") Signed-off-by: Luis Gerhorst <gerhorst@cs.fau.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20230315165358.23701-1-gerhorst@cs.fau.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11bpf: fix precision propagation verbose loggingAndrii Nakryiko1-2/+2
[ Upstream commit 34f0677e7afd3a292bc1aadda7ce8e35faedb204 ] Fix wrong order of frame index vs register/slot index in precision propagation verbose (level 2) output. It's wrong and very confusing as is. Fixes: 529409ea92d5 ("bpf: propagate precision across all frames, not jus