summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2020-04-23bpf: fix buggy r0 retval refinement for tracing helpersDaniel Borkmann1-11/+34
[ no upstream commit ] See the glory details in 100605035e15 ("bpf: Verifier, do_refine_retval_range may clamp umin to 0 incorrectly") for why 849fa50662fb ("bpf/verifier: refine retval R0 state for bpf_get_stack helper") is buggy. The whole series however is not suitable for stable since it adds significant amount [0] of verifier complexity in order to add 32bit subreg tracking. Something simpler is needed. Unfortunately, reverting 849fa50662fb ("bpf/verifier: refine retval R0 state for bpf_get_stack helper") or just cherry-picking 100605035e15 ("bpf: Verifier, do_refine_retval_range may clamp umin to 0 incorrectly") is not an option since it will break existing tracing programs badly (at least those that are using bpf_get_stack() and bpf_probe_read_str() helpers). Not fixing it in stable is also not an option since on 4.19 kernels an error will cause a soft-lockup due to hitting dead-code sanitized branch since we don't hard-wire such branches in old kernels yet. But even then for 5.x 849fa50662fb ("bpf/verifier: refine retval R0 state for bpf_get_stack helper") would cause wrong bounds on the verifier simluation when an error is hit. In one of the earlier iterations of mentioned patch series for upstream there was the concern that just using smax_value in do_refine_retval_range() would nuke bounds by subsequent <<32 >>32 shifts before the comparison against 0 [1] which eventually led to the 32bit subreg tracking in the first place. While I initially went for implementing the idea [1] to pattern match the two shift operations, it turned out to be more complex than actually needed, meaning, we could simply treat do_refine_retval_range() similarly to how we branch off verification for conditionals or under speculation, that is, pushing a new reg state to the stack for later verification. This means, instead of verifying the current path with the ret_reg in [S32MIN, msize_max_value] interval where later bounds would get nuked, we split this into two: i) for the success case where ret_reg can be in [0, msize_max_value], and ii) for the error case with ret_reg known to be in interval [S32MIN, -1]. Latter will preserve the bounds during these shift patterns and can match reg < 0 test. test_progs also succeed with this approach. [0] https://lore.kernel.org/bpf/158507130343.15666.8018068546764556975.stgit@john-Precision-5820-Tower/ [1] https://lore.kernel.org/bpf/158015334199.28573.4940395881683556537.stgit@john-XPS-13-9370/T/#m2e0ad1d5949131014748b6daa48a3495e7f0456d Fixes: 849fa50662fb ("bpf/verifier: refine retval R0 state for bpf_get_stack helper") Reported-by: Lorenzo Fontana <fontanalorenz@gmail.com> Reported-by: Leonardo Di Donato <leodidonato@gmail.com> Reported-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Tested-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-23locktorture: Print ratio of acquisitions, not failuresPaul E. McKenney1-4/+4
commit 80c503e0e68fbe271680ab48f0fe29bc034b01b7 upstream. The __torture_print_stats() function in locktorture.c carefully initializes local variable "min" to statp[0].n_lock_acquired, but then compares it to statp[i].n_lock_fail. Given that the .n_lock_fail field should normally be zero, and given the initialization, it seems reasonable to display the maximum and minimum number acquisitions instead of miscomputing the maximum and minimum number of failures. This commit therefore switches from failures to acquisitions. And this turns out to be not only a day-zero bug, but entirely my own fault. I hate it when that happens! Fixes: 0af3fe1efa53 ("locktorture: Add a lock-torture kernel module") Reported-by: Will Deacon <will@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Will Deacon <will@kernel.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-23dma-debug: fix displaying of dma allocation typeGrygorii Strashko1-3/+6
commit 9bb50ed7470944238ec8e30a94ef096caf9056ee upstream. The commit 2e05ea5cdc1a ("dma-mapping: implement dma_map_single_attrs using dma_map_page_attrs") removed "dma_debug_page" enum, but missed to update type2name string table. This causes incorrect displaying of dma allocation type. Fix it by removing "page" string from type2name string table and switch to use named initializers. Before (dma_alloc_coherent()): k3-ringacc 4b800000.ringacc: scather-gather idx 2208 P=d1140000 N=d114 D=d1140000 L=40 DMA_BIDIRECTIONAL dma map error check not applicable k3-ringacc 4b800000.ringacc: scather-gather idx 2216 P=d1150000 N=d115 D=d1150000 L=40 DMA_BIDIRECTIONAL dma map error check not applicable After: k3-ringacc 4b800000.ringacc: coherent idx 2208 P=d1140000 N=d114 D=d1140000 L=40 DMA_BIDIRECTIONAL dma map error check not applicable k3-ringacc 4b800000.ringacc: coherent idx 2216 P=d1150000 N=d115 D=d1150000 L=40 DMA_BIDIRECTIONAL dma map error check not applicable Fixes: 2e05ea5cdc1a ("dma-mapping: implement dma_map_single_attrs using dma_map_page_attrs") Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-23dma-coherent: fix integer overflow in the reserved-memory dma allocationKevin Grandemange1-6/+7
[ Upstream commit 286c21de32b904131f8cf6a36ce40b8b0c9c5da3 ] pageno is an int and the PAGE_SHIFT shift is done on an int, overflowing if the memory is bigger than 2G This can be reproduced using for example a reserved-memory of 4G reserved-memory { #address-cells = <2>; #size-cells = <2>; ranges; reserved_dma: buffer@0 { compatible = "shared-dma-pool"; no-map; reg = <0x5 0x00000000 0x1 0x0>; }; }; Signed-off-by: Kevin Grandemange <kevin.grandemange@allegrodvt.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-04-21tracing: Fix the race between registering 'snapshot' event trigger and ↵Xiao Yang1-7/+3
triggering 'snapshot' operation commit 0bbe7f719985efd9adb3454679ecef0984cb6800 upstream. Traced event can trigger 'snapshot' operation(i.e. calls snapshot_trigger() or snapshot_count_trigger()) when register_snapshot_trigger() has completed registration but doesn't allocate buffer for 'snapshot' event trigger. In the rare case, 'snapshot' operation always detects the lack of allocated buffer so make register_snapshot_trigger() allocate buffer first. trigger-snapshot.tc in kselftest reproduces the issue on slow vm: ----------------------------------------------------------- cat trace ... ftracetest-3028 [002] .... 236.784290: sched_process_fork: comm=ftracetest pid=3028 child_comm=ftracetest child_pid=3036 <...>-2875 [003] .... 240.460335: tracing_snapshot_instance_cond: *** SNAPSHOT NOT ALLOCATED *** <...>-2875 [003] .... 240.460338: tracing_snapshot_instance_cond: *** stopping trace here! *** ----------------------------------------------------------- Link: http://lkml.kernel.org/r/20200414015145.66236-1-yangx.jy@cn.fujitsu.com Cc: stable@vger.kernel.org Fixes: 93e31ffbf417a ("tracing: Add 'snapshot' event trigger command") Signed-off-by: Xiao Yang <yangx.jy@cn.fujitsu.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-17bpf: Fix tnum constraints for 32-bit comparisonsJann Horn1-36/+72
[ Upstream commit 604dca5e3af1db98bd123b7bfc02b017af99e3a0 ] The BPF verifier tried to track values based on 32-bit comparisons by (ab)using the tnum state via 581738a681b6 ("bpf: Provide better register bounds after jmp32 instructions"). The idea is that after a check like this: if ((u32)r0 > 3) exit We can't meaningfully constrain the arithmetic-range-based tracking, but we can update the tnum state to (value=0,mask=0xffff'ffff'0000'0003). However, the implementation from 581738a681b6 didn't compute the tnum constraint based on the fixed operand, but instead derives it from the arithmetic-range-based tracking. This means that after the following sequence of operations: if (r0 >= 0x1'0000'0001) exit if ((u32)r0 > 7) exit The verifier assumed that the lower half of r0 is in the range (0, 0) and apply the tnum constraint (value=0,mask=0xffff'ffff'0000'0000) thus causing the overall tnum to be (value=0,mask=0x1'0000'0000), which was incorrect. Provide a fixed implementation. Fixes: 581738a681b6 ("bpf: Provide better register bounds after jmp32 instructions") Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200330160324.15259-3-daniel@iogearbox.net Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-04-17kmod: make request_module() return an error when autoloading is disabledEric Biggers1-2/+2
commit d7d27cfc5cf0766a26a8f56868c5ad5434735126 upstream. Patch series "module autoloading fixes and cleanups", v5. This series fixes a bug where request_module() was reporting success to kernel code when module autoloading had been completely disabled via 'echo > /proc/sys/kernel/modprobe'. It also addresses the issues raised on the original thread (https://lkml.kernel.org/lkml/20200310223731.126894-1-ebiggers@kernel.org/T/#u) bydocumenting the modprobe sysctl, adding a self-test for the empty path case, and downgrading a user-reachable WARN_ONCE(). This patch (of 4): It's long been possible to disable kernel module autoloading completely (while still allowing manual module insertion) by setting /proc/sys/kernel/modprobe to the empty string. This can be preferable to setting it to a nonexistent file since it avoids the overhead of an attempted execve(), avoids potential deadlocks, and avoids the call to security_kernel_module_request() and thus on SELinux-based systems eliminates the need to write SELinux rules to dontaudit module_request. However, when module autoloading is disabled in this way, request_module() returns 0. This is broken because callers expect 0 to mean that the module was successfully loaded. Apparently this was never noticed because this method of disabling module autoloading isn't used much, and also most callers don't use the return value of request_module() since it's always necessary to check whether the module registered its functionality or not anyway. But improperly returning 0 can indeed confuse a few callers, for example get_fs_type() in fs/filesystems.c where it causes a WARNING to be hit: if (!fs && (request_module("fs-%.*s", len, name) == 0)) { fs = __get_fs_type(name, len); WARN_ONCE(!fs, "request_module fs-%.*s succeeded, but still no fs?\n", len, name); } This is easily reproduced with: echo > /proc/sys/kernel/modprobe mount -t NONEXISTENT none / It causes: request_module fs-NONEXISTENT succeeded, but still no fs? WARNING: CPU: 1 PID: 1106 at fs/filesystems.c:275 get_fs_type+0xd6/0xf0 [...] This should actually use pr_warn_once() rather than WARN_ONCE(), since it's also user-reachable if userspace immediately unloads the module. Regardless, request_module() should correctly return an error when it fails. So let's make it return -ENOENT, which matches the error when the modprobe binary doesn't exist. I've also sent patches to document and test this case. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Jessica Yu <jeyu@kernel.org> Acked-by: Luis Chamberlain <mcgrof@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jeff Vander Stoep <jeffv@google.com> Cc: Ben Hutchings <benh@debian.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200310223731.126894-1-ebiggers@kernel.org Link: http://lkml.kernel.org/r/20200312202552.241885-1-ebiggers@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-17ftrace/kprobe: Show the maxactive number on kprobe_eventsMasami Hiramatsu1-0/+2
commit 6a13a0d7b4d1171ef9b80ad69abc37e1daa941b3 upstream. Show maxactive parameter on kprobe_events. This allows user to save the current configuration and restore it without losing maxactive parameter. Link: http://lkml.kernel.org/r/4762764a-6df7-bc93-ed60-e336146dce1f@gmail.com Link: http://lkml.kernel.org/r/158503528846.22706.5549974121212526020.stgit@devnote2 Cc: stable@vger.kernel.org Fixes: 696ced4fb1d76 ("tracing/kprobes: expose maxactive for kretprobe in kprobe_events") Reported-by: Taeung Song <treeze.taeung@gmail.com> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-17sched/core: Remove duplicate assignment in sched_tick_remote()Scott Wood1-1/+0
commit 82e0516ce3a147365a5dd2a9bedd5ba43a18663d upstream. A redundant "curr = rq->curr" was added; remove it. Fixes: ebc0f83c78a2 ("timers/nohz: Update NOHZ load in remote tick") Signed-off-by: Scott Wood <swood@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/1580776558-12882-1-git-send-email-swood@redhat.com Cc: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-17signal: Extend exec_id to 64bitsEric W. Biederman1-1/+1
commit d1e7fd6462ca9fc76650fbe6ca800e35b24267da upstream. Replace the 32bit exec_id with a 64bit exec_id to make it impossible to wrap the exec_id counter. With care an attacker can cause exec_id wrap and send arbitrary signals to a newly exec'd parent. This bypasses the signal sending checks if the parent changes their credentials during exec. The severity of this problem can been seen that in my limited testing of a 32bit exec_id it can take as little as 19s to exec 65536 times. Which means that it can take as little as 14 days to wrap a 32bit exec_id. Adam Zabrocki has succeeded wrapping the self_exe_id in 7 days. Even my slower timing is in the uptime of a typical server. Which means self_exec_id is simply a speed bump today, and if exec gets noticably faster self_exec_id won't even be a speed bump. Extending self_exec_id to 64bits introduces a problem on 32bit architectures where reading self_exec_id is no longer atomic and can take two read instructions. Which means that is is possible to hit a window where the read value of exec_id does not match the written value. So with very lucky timing after this change this still remains expoiltable. I have updated the update of exec_id on exec to use WRITE_ONCE and the read of exec_id in do_notify_parent to use READ_ONCE to make it clear that there is no locking between these two locations. Link: https://lore.kernel.org/kernel-hardening/20200324215049.GA3710@pi3.com.pl Fixes: 2.3.23pre2 Cc: stable@vger.kernel.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-17genirq/debugfs: Add missing sanity checks to interrupt injectionThomas Gleixner1-2/+9
commit a740a423c36932695b01a3e920f697bc55b05fec upstream. Interrupts cannot be injected when the interrupt is not activated and when a replay is already in progress. Fixes: 536e2e34bd00 ("genirq/debugfs: Triggering of interrupts from userspace") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Marc Zyngier <maz@kernel.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20200306130623.500019114@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-17cpu/hotplug: Ignore pm_wakeup_pending() for disable_nonboot_cpus()Thomas Gleixner1-2/+2
commit e98eac6ff1b45e4e73f2e6031b37c256ccb5d36b upstream. A recent change to freeze_secondary_cpus() which added an early abort if a wakeup is pending missed the fact that the function is also invoked for shutdown, reboot and kexec via disable_nonboot_cpus(). In case of disable_nonboot_cpus() the wakeup event needs to be ignored as the purpose is to terminate the currently running kernel. Add a 'suspend' argument which is only set when the freeze is in context of a suspend operation. If not set then an eventually pending wakeup event is ignored. Fixes: a66d955e910a ("cpu/hotplug: Abort disabling secondary CPUs if wakeup is pending") Reported-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Pavankumar Kondeti <pkondeti@codeaurora.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/874kuaxdiz.fsf@nanos.tec.linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-17sched/fair: Fix enqueue_task_fair warningVincent Guittot1-4/+22
commit fe61468b2cbc2b7ce5f8d3bf32ae5001d4c434e9 upstream. When a cfs rq is throttled, the latter and its child are removed from the leaf list but their nr_running is not changed which includes staying higher than 1. When a task is enqueued in this throttled branch, the cfs rqs must be added back in order to ensure correct ordering in the list but this can only happens if nr_running == 1. When cfs bandwidth is used, we call unconditionnaly list_add_leaf_cfs_rq() when enqueuing an entity to make sure that the complete branch will be added. Similarly unthrottle_cfs_rq() can stop adding cfs in the list when a parent is throttled. Iterate the remaining entity to ensure that the complete branch will be added in the list. Reported-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Christian Borntraeger <borntraeger@de.ibm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: stable@vger.kernel.org Cc: stable@vger.kernel.org #v5.1+ Link: https://lkml.kernel.org/r/20200306135257.25044-1-vincent.guittot@linaro.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-17seccomp: Add missing compat_ioctl for notifySven Schnelle1-0/+1
commit 3db81afd99494a33f1c3839103f0429c8f30cb9d upstream. Executing the seccomp_bpf testsuite under a 64-bit kernel with 32-bit userland (both s390 and x86) doesn't work because there's no compat_ioctl handler defined. Add the handler. Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Fixes: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20200310123332.42255-1-svens@linux.ibm.com Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-17locking/lockdep: Avoid recursion in lockdep_count_{for,back}ward_deps()Boqun Feng1-0/+4
[ Upstream commit 25016bd7f4caf5fc983bbab7403d08e64cba3004 ] Qian Cai reported a bug when PROVE_RCU_LIST=y, and read on /proc/lockdep triggered a warning: [ ] DEBUG_LOCKS_WARN_ON(current->hardirqs_enabled) ... [ ] Call Trace: [ ] lock_is_held_type+0x5d/0x150 [ ] ? rcu_lockdep_current_cpu_online+0x64/0x80 [ ] rcu_read_lock_any_held+0xac/0x100 [ ] ? rcu_read_lock_held+0xc0/0xc0 [ ] ? __slab_free+0x421/0x540 [ ] ? kasan_kmalloc+0x9/0x10 [ ] ? __kmalloc_node+0x1d7/0x320 [ ] ? kvmalloc_node+0x6f/0x80 [ ] __bfs+0x28a/0x3c0 [ ] ? class_equal+0x30/0x30 [ ] lockdep_count_forward_deps+0x11a/0x1a0 The warning got triggered because lockdep_count_forward_deps() call __bfs() without current->lockdep_recursion being set, as a result a lockdep internal function (__bfs()) is checked by lockdep, which is unexpected, and the inconsistency between the irq-off state and the state traced by lockdep caused the warning. Apart from this warning, lockdep internal functions like __bfs() should always be protected by current->lockdep_recursion to avoid potential deadlocks and data inconsistency, therefore add the current->lockdep_recursion on-and-off section to protect __bfs() in both lockdep_count_forward_deps() and lockdep_count_backward_deps() Reported-by: Qian Cai <cai@lca.pw> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200312151258.128036-1-boqun.feng@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-04-17genirq/irqdomain: Check pointer in irq_domain_alloc_irqs_hierarchy()Alexander Sverdlin1-5/+5
[ Upstream commit 87f2d1c662fa1761359fdf558246f97e484d177a ] irq_domain_alloc_irqs_hierarchy() has 3 call sites in the compilation unit but only one of them checks for the pointer which is being dereferenced inside the called function. Move the check into the function. This allows for catching the error instead of the following crash: Unable to handle kernel NULL pointer dereference at virtual address 00000000 PC is at 0x0 LR is at gpiochip_hierarchy_irq_domain_alloc+0x11f/0x140 ... [<c06c23ff>] (gpiochip_hierarchy_irq_domain_alloc) [<c0462a89>] (__irq_domain_alloc_irqs) [<c0462dad>] (irq_create_fwspec_mapping) [<c06c2251>] (gpiochip_to_irq) [<c06c1c9b>] (gpiod_to_irq) [<bf973073>] (gpio_irqs_init [gpio_irqs]) [<bf974048>] (gpio_irqs_exit+0xecc/0xe84 [gpio_irqs]) Code: bad PC value Signed-off-by: Alexander Sverdlin <alexander.sverdlin@nokia.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200306174720.82604-1-alexander.sverdlin@nokia.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-04-17sched: Avoid scale real weight down to zeroMichael Wang1-1/+7
[ Upstream commit 26cf52229efc87e2effa9d788f9b33c40fb3358a ] During our testing, we found a case that shares no longer working correctly, the cgroup topology is like: /sys/fs/cgroup/cpu/A (shares=102400) /sys/fs/cgroup/cpu/A/B (shares=2) /sys/fs/cgroup/cpu/A/B/C (shares=1024) /sys/fs/cgroup/cpu/D (shares=1024) /sys/fs/cgroup/cpu/D/E (shares=1024) /sys/fs/cgroup/cpu/D/E/F (shares=1024) The same benchmark is running in group C & F, no other tasks are running, the benchmark is capable to consumed all the CPUs. We suppose the group C will win more CPU resources since it could enjoy all the shares of group A, but it's F who wins much more. The reason is because we have group B with shares as 2, since A->cfs_rq.load.weight == B->se.load.weight == B->shares/nr_cpus, so A->cfs_rq.load.weight become very small. And in calc_group_shares() we calculate shares as: load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg); shares = (tg_shares * load) / tg_weight; Since the 'cfs_rq->load.weight' is too small, the load become 0 after scale down, although 'tg_shares' is 102400, shares of the se which stand for group A on root cfs_rq become 2. While the se of D on root cfs_rq is far more bigger than 2, so it wins the battle. Thus when scale_load_down() scale real weight down to 0, it's no longer telling the real story, the caller will have the wrong information and the calculation will be buggy. This patch add check in scale_load_down(), so the real weight will be >= MIN_SHARES after scale, after applied the group C wins as expected. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/38e8e212-59a1-64b2-b247-b6d0b52d8dc1@linux.alibaba.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-04-17time/sched_clock: Expire timer in hardirq contextAhmed S. Darwish1-4/+5
[ Upstream commit 2c8bd58812ee3dbf0d78b566822f7eacd34bdd7b ] To minimize latency, PREEMPT_RT kernels expires hrtimers in preemptible softirq context by default. This can be overriden by marking the timer's expiry with HRTIMER_MODE_HARD. sched_clock_timer is missing this annotation: if its callback is preempted and the duration of the preemption exceeds the wrap around time of the underlying clocksource, sched clock will get out of sync. Mark the sched_clock_timer for expiry in hard interrupt context. Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200309181529.26558-1-a.darwish@linutronix.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-04-17dma-mapping: Fix dma_pgprot() for unencrypted coherent pagesThomas Hellstrom1-0/+2
[ Upstream commit 17c4a2ae15a7aaefe84bdb271952678c5c9cd8e1 ] When dma_mmap_coherent() sets up a mapping to unencrypted coherent memory under SEV encryption and sometimes under SME encryption, it will actually set up an encrypted mapping rather than an unencrypted, causing devices that DMAs from that memory to read encrypted contents. Fix this. When force_dma_unencrypted() returns true, the linear kernel map of the coherent pages have had the encryption bit explicitly cleared and the page content is unencrypted. Make sure that any additional PTEs we set up to these pages also have the encryption bit cleared by having dma_pgprot() return a protection with the encryption bit cleared in this case. Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lkml.kernel.org/r/20200304114527.3636-3-thomas_os@shipmail.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-04-17bpf: Fix deadlock with rq_lock in bpf_send_signal()Yonghong Song1-1/+1
[ Upstream commit 1bc7896e9ef44fd77858b3ef0b8a6840be3a4494 ] When experimenting with bpf_send_signal() helper in our production environment (5.2 based), we experienced a deadlock in NMI mode: #5 [ffffc9002219f770] queued_spin_lock_slowpath at ffffffff8110be24 #6 [ffffc9002219f770] _raw_spin_lock_irqsave at ffffffff81a43012 #7 [ffffc9002219f780] try_to_wake_up at ffffffff810e7ecd #8 [ffffc9002219f7e0] signal_wake_up_state at ffffffff810c7b55 #9 [ffffc9002219f7f0] __send_signal at ffffffff810c8602 #10 [ffffc9002219f830] do_send_sig_info at ffffffff810ca31a #11 [ffffc9002219f868] bpf_send_signal at ffffffff8119d227 #12 [ffffc9002219f988] bpf_overflow_handler at ffffffff811d4140 #13 [ffffc9002219f9e0] __perf_event_overflow at ffffffff811d68cf #14 [ffffc9002219fa10] perf_swevent_overflow at ffffffff811d6a09 #15 [ffffc9002219fa38] ___perf_sw_event at ffffffff811e0f47 #16 [ffffc9002219fc30] __schedule at ffffffff81a3e04d #17 [ffffc9002219fc90] schedule at ffffffff81a3e219 #18 [ffffc9002219fca0] futex_wait_queue_me at ffffffff8113d1b9 #19 [ffffc9002219fcd8] futex_wait at ffffffff8113e529 #20 [ffffc9002219fdf0] do_futex at ffffffff8113ffbc #21 [ffffc9002219fec0] __x64_sys_futex at ffffffff81140d1c #22 [ffffc9002219ff38] do_syscall_64 at ffffffff81002602 #23 [ffffc9002219ff50] entry_SYSCALL_64_after_hwframe at ffffffff81c00068 The above call stack is actually very similar to an issue reported by Commit eac9153f2b58 ("bpf/stackmap: Fix deadlock with rq_lock in bpf_get_stack()") by Song Liu. The only difference is bpf_send_signal() helper instead of bpf_get_stack() helper. The above deadlock is triggered with a perf_sw_event. Similar to Commit eac9153f2b58, the below almost identical reproducer used tracepoint point sched/sched_switch so the issue can be easily caught. /* stress_test.c */ #include <stdio.h> #include <stdlib.h> #include <sys/mman.h> #include <pthread.h> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #define THREAD_COUNT 1000 char *filename; void *worker(void *p) { void *ptr; int fd; char *pptr; fd = open(filename, O_RDONLY); if (fd < 0) return NULL; while (1) { struct timespec ts = {0, 1000 + rand() % 2000}; ptr = mmap(NULL, 4096 * 64, PROT_READ, MAP_PRIVATE, fd, 0); usleep(1); if (ptr == MAP_FAILED) { printf("failed to mmap\n"); break; } munmap(ptr, 4096 * 64); usleep(1); pptr = malloc(1); usleep(1); pptr[0] = 1; usleep(1); free(pptr); usleep(1); nanosleep(&ts, NULL); } close(fd); return NULL; } int main(int argc, char *argv[]) { void *ptr; int i; pthread_t threads[THREAD_COUNT]; if (argc < 2) return 0; filename = argv[1]; for (i = 0; i < THREAD_COUNT; i++) { if (pthread_create(threads + i, NULL, worker, NULL)) { fprintf(stderr, "Error creating thread\n"); return 0; } } for (i = 0; i < THREAD_COUNT; i++) pthread_join(threads[i], NULL); return 0; } and the following command: 1. run `stress_test /bin/ls` in one windown 2. hack bcc trace.py with the following change: # --- a/tools/trace.py # +++ b/tools/trace.py @@ -513,6 +513,7 @@ BPF_PERF_OUTPUT(%s); __data.tgid = __tgid; __data.pid = __pid; bpf_get_current_comm(&__data.comm, sizeof(__data.comm)); + bpf_send_signal(10); %s %s %s.perf_submit(%s, &__data, sizeof(__data)); 3. in a different window run ./trace.py -p $(pidof stress_test) t:sched:sched_switch The deadlock can be reproduced in our production system. Similar to Song's fix, the fix is to delay sending signal if irqs is disabled to avoid deadlocks involving with rq_lock. With this change, my above stress-test in our production system won't cause deadlock any more. I also implemented a scale-down version of reproducer in the selftest (a subsequent commit). With latest bpf-next, it complains for the following potential deadlock. [ 32.832450] -> #1 (&p->pi_lock){-.-.}: [ 32.833100] _raw_spin_lock_irqsave+0x44/0x80 [ 32.833696] task_rq_lock+0x2c/0xa0 [ 32.834182] task_sched_runtime+0x59/0xd0 [ 32.834721] thread_group_cputime+0x250/0x270 [ 32.835304] thread_group_cputime_adjusted+0x2e/0x70 [ 32.835959] do_task_stat+0x8a7/0xb80 [ 32.836461] proc_single_show+0x51/0xb0 ... [ 32.839512] -> #0 (&(&sighand->siglock)->rlock){....}: [ 32.840275] __lock_acquire+0x1358/0x1a20 [ 32.840826] lock_acquire+0xc7/0x1d0 [ 32.841309] _raw_spin_lock_irqsave+0x44/0x80 [ 32.841916] __lock_task_sighand+0x79/0x160 [ 32.842465] do_send_sig_info+0x35/0x90 [ 32.842977] bpf_send_signal+0xa/0x10 [ 32.843464] bpf_prog_bc13ed9e4d3163e3_send_signal_tp_sched+0x465/0x1000 [ 32.844301] trace_call_bpf+0x115/0x270 [ 32.844809] perf_trace_run_bpf_submit+0x4a/0xc0 [ 32.845411] perf_trace_sched_switch+0x10f/0x180 [ 32.846014] __schedule+0x45d/0x880 [ 32.846483] schedule+0x5f/0xd0 ... [ 32.853148] Chain exists of: [ 32.853148] &(&sighand->siglock)->rlock --> &p->pi_lock --> &rq->lock [ 32.853148] [ 32.854451] Possible unsafe locking scenario: [ 32.854451] [ 32.855173] CPU0 CPU1 [ 32.855745] ---- ---- [ 32.856278] lock(&rq->lock); [ 32.856671] lock(&p->pi_lock); [ 32.857332] lock(&rq->lock); [ 32.857999] lock(&(&sighand->siglock)->rlock); Deadlock happens on CPU0 when it tries to acquire &sighand->siglock but it has been held by CPU1 and CPU1 tries to grab &rq->lock and cannot get it. This is not exactly the callstack in our production environment, but sympotom is similar and both locks are using spin_lock_irqsave() to acquire the lock, and both involves rq_lock. The fix to delay sending signal when irq is disabled also fixed this issue. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Cc: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20200304191104.2796501-1-yhs@fb.com Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-08padata: always acquire cpu_hotplug_lock before pinst->lockDaniel Jordan1-2/+2
commit 38228e8848cd7dd86ccb90406af32de0cad24be3 upstream. lockdep complains when padata's paths to update cpumasks via CPU hotplug and sysfs are both taken: # echo 0 > /sys/devices/system/cpu/cpu1/online # echo ff > /sys/kernel/pcrypt/pencrypt/parallel_cpumask ====================================================== WARNING: possible circular locking dependency detected 5.4.0-rc8-padata-cpuhp-v3+ #1 Not tainted ------------------------------------------------------ bash/205 is trying to acquire lock: ffffffff8286bcd0 (cpu_hotplug_lock.rw_sem){++++}, at: padata_set_cpumask+0x2b/0x120 but task is already holding lock: ffff8880001abfa0 (&pinst->lock){+.+.}, at: padata_set_cpumask+0x26/0x120 which lock already depends on the new lock. padata doesn't take cpu_hotplug_lock and pinst->lock in a consistent order. Which should be first? CPU hotplug calls into padata with cpu_hotplug_lock already held, so it should have priority. Fixes: 6751fb3c0e0c ("padata: Use get_online_cpus/put_online_cpus") Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: linux-crypto@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-08padata: fix uninitialized return value in padata_replace()Daniel Jordan1-1/+1
[ Upstream commit 41ccdbfd5427bbbf3ed58b16750113b38fad1780 ] According to Geert's report[0], kernel/padata.c: warning: 'err' may be used uninitialized in this function [-Wuninitialized]: => 539:2 Warning is seen only with older compilers on certain archs. The runtime effect is potentially returning garbage down the stack when padata's cpumasks are modified before any pcrypt requests have run. Simplest fix is to initialize err to the success value. [0] http://lkml.kernel.org/r/20200210135506.11536-1-geert@linux-m68k.org Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Fixes: bbefa1dd6a6d ("crypto: pcrypt - Avoid deadlock by using per-instance padata queues") Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: linux-crypto@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-04-02bpf: Explicitly memset some bpf info structures declared on the stackGreg Kroah-Hartman2-3/+6
commit 5c6f25887963f15492b604dd25cb149c501bbabf upstream. Trying to initialize a structure with "= {};" will not always clean out all padding locations in a structure. So be explicit and call memset to initialize everything for a number of bpf information structures that are then copied from userspace, sometimes from smaller memory locations than the size of the structure. Reported-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20200320162258.GA794295@kroah.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-02bpf: Explicitly memset the bpf_attr structureGreg Kroah-Hartman1-1/+2
commit 8096f229421f7b22433775e928d506f0342e5907 upstream. For the bpf syscall, we are relying on the compiler to properly zero out the bpf_attr union that we copy userspace data into. Unfortunately that doesn't always work properly, padding and other oddities might not be correctly zeroed, and in some tests odd things have been found when the stack is pre-initialized to other values. Fix this by explicitly memsetting the structure to 0 before using it. Reported-by: Maciej Żenczykowski <maze@google.com> Reported-by: John Stultz <john.stultz@linaro.org> Reported-by: Alexander Potapenko <glider@google.com> Reported-by: Alistair Delva <adelva@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://android-review.googlesource.com/c/kernel/common/+/1235490 Link: https://lore.kernel.org/bpf/20200320094813.GA421650@kroah.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-01bpf: Undo incorrect __reg_bound_offset32 handlingDaniel Borkmann1-19/+0
commit f2d67fec0b43edce8c416101cdc52e71145b5fef upstream. Anatoly has been fuzzing with kBdysch harness and reported a hang in one of the outcomes: 0: (b7) r0 = 808464432 1: (7f) r0 >>= r0 2: (14) w0 -= 808464432 3: (07) r0 += 808464432 4: (b7) r1 = 808464432 5: (de) if w1 s<= w0 goto pc+0 R0_w=invP(id=0,umin_value=808464432,umax_value=5103431727,var_off=(0x30303020;0x10000001f)) R1_w=invP808464432 R10=fp0 6: (07) r0 += -2144337872 7: (14) w0 -= -1607454672 8: (25) if r0 > 0x30303030 goto pc+0 R0_w=invP(id=0,umin_value=271581184,umax_value=271581311,var_off=(0x10300000;0x7f)) R1_w=invP808464432 R10=fp0 9: (76) if w0 s>= 0x303030 goto pc+2 12: (95) exit from 8 to 9: safe from 5 to 6: R0_w=invP(id=0,umin_value=808464432,umax_value=5103431727,var_off=(0x30303020;0x10000001f)) R1_w=invP808464432 R10=fp0 6: (07) r0 += -2144337872 7: (14) w0 -= -1607454672 8: (25) if r0 > 0x30303030 goto pc+0 R0_w=invP(id=0,umin_value=271581184,umax_value=271581311,var_off=(0x10300000;0x7f)) R1_w=invP808464432 R10=fp0 9: safe from 8 to 9: safe verification time 589 usec stack depth 0 processed 17 insns (limit 1000000) [...] The underlying program was xlated as follows: # bpftool p d x i 9 0: (b7) r0 = 808464432 1: (7f) r0 >>= r0 2: (14) w0 -= 808464432 3: (07) r0 += 808464432 4: (b7) r1 = 808464432 5: (de) if w1 s<= w0 goto pc+0 6: (07) r0 += -2144337872 7: (14) w0 -= -1607454672 8: (25) if r0 > 0x30303030 goto pc+0 9: (76) if w0 s>= 0x303030 goto pc+2 10: (05) goto pc-1 11: (05) goto pc-1 12: (95) exit The verifier rewrote original instructions it recognized as dead code with 'goto pc-1', but reality differs from verifier simulation in that we're actually able to trigger a hang due to hitting the 'goto pc-1' instructions. Taking different examples to make the issue more obvious: in this example we're probing bounds on a completely unknown scalar variable in r1: [...] 5: R0_w=inv1 R1_w=inv(id=0) R10=fp0 5: (18) r2 = 0x4000000000 7: R0_w=inv1 R1_w=inv(id=0) R2_w=inv274877906944 R10=fp0 7: (18) r3 = 0x2000000000 9: R0_w=inv1 R1_w=inv(id=0) R2_w=inv274877906944 R3_w=inv137438953472 R10=fp0 9: (18) r4 = 0x400 11: R0_w=inv1 R1_w=inv(id=0) R2_w=inv274877906944 R3_w=inv137438953472 R4_w=inv1024 R10=fp0 11: (18) r5 = 0x200 13: R0_w=inv1 R1_w=inv(id=0) R2_w=inv274877906944 R3_w=inv137438953472 R4_w=inv1024 R5_w=inv512 R10=fp0 13: (2d) if r1 > r2 goto pc+4 R0_w=inv1 R1_w=inv(id=0,umax_value=274877906944,var_off=(0x0; 0x7fffffffff)) R2_w=inv274877906944 R3_w=inv137438953472 R4_w=inv1024 R5_w=inv512 R10=fp0 14: R0_w=inv1 R1_w=inv(id=0,umax_value=274877906944,var_off=(0x0; 0x7fffffffff)) R2_w=inv274877906944 R3_w=inv137438953472 R4_w=inv1024 R5_w=inv512 R10=fp0 14: (ad) if r1 < r3 goto pc+3 R0_w=inv1 R1_w=inv(id=0,umin_value=137438953472,umax_value=274877906944,var_off=(0x0; 0x7fffffffff)) R2_w=inv274877906944 R3_w=inv137438953472 R4_w=inv1024 R5_w=inv512 R10=fp0 15: R0=inv1 R1=inv(id=0,umin_value=137438953472,umax_value=274877906944,var_off=(0x0; 0x7fffffffff)) R2=inv274877906944 R3=inv137438953472 R4=inv1024 R5=inv512 R10=fp0 15: (2e) if w1 > w4 goto pc+2 R0=inv1 R1=inv(id=0,umin_value=137438953472,umax_value=274877906944,var_off=(0x0; 0x7f00000000)) R2=inv274877906944 R3=inv137438953472 R4=inv1024 R5=inv512 R10=fp0 16: R0=inv1 R1=inv(id=0,umin_value=137438953472,umax_value=274877906944,var_off=(0x0; 0x7f00000000)) R2=inv274877906944 R3=inv137438953472 R4=inv1024 R5=inv512 R10=fp0 16: (ae) if w1 < w5 goto pc+1 R0=inv1 R1=inv(id=0,umin_value=137438953472,umax_value=274877906944,var_off=(0x0; 0x7f00000000)) R2=inv274877906944 R3=inv137438953472 R4=inv1024 R5=inv512 R10=fp0 [...] We're first probing lower/upper bounds via jmp64, later we do a similar check via jmp32 and examine the resulting var_off there. After fall-through in insn 14, we get the following bounded r1 with 0x7fffffffff unknown marked bits in the variable section. Thus, after knowing r1 <= 0x4000000000 and r1 >= 0x2000000000: max: 0b100000000000000000000000000000000000000 / 0x4000000000 var: 0b111111111111111111111111111111111111111 / 0x7fffffffff min: 0b010000000000000000000000000000000000000 / 0x2000000000 Now, in insn 15 and 16, we perform a similar probe with lower/upper bounds in jmp32. Thus, after knowing r1 <= 0x4000000000 and r1 >= 0x2000000000 and w1 <= 0x400 and w1 >= 0x200: max: 0b100000000000000000000000000000000000000 / 0x4000000000 var: 0b111111100000000000000000000000000000000 / 0x7f00000000 min: 0b010000000000000000000000000000000000000 / 0x2000000000 The lower/upper bounds haven't changed since they have high bits set in u64 space and the jmp32 tests can only refine bounds in the low bits. However, for the var part the expectation would have been 0x7f000007ff or something less precise up to 0x7fffffffff. A outcome of 0x7f00000000 is not correct since it would contradict the earlier probed bounds where we know that the result should have been in [0x200,0x400] in u32 space. Therefore, tests with such info will lead to wrong verifier assumptions later on like falsely predicting conditional jumps to be always taken, etc. The issue here is that __reg_bound_offset32()'s implementation from commit 581738a681b6 ("bpf: Provide better register bounds after jmp32 instructions") makes an incorrect range assumption: static void __reg_bound_offset32(struct bpf_reg_state *reg) { u64 mask = 0xffffFFFF; struct tnum range = tnum_range(reg->umin_value & mask, reg->umax_value & mask); struct tnum lo32 = tnum_cast(reg->var_off, 4); struct tnum hi32 = tnum_lshift(tnum_rshift(reg->var_off, 32), 32); reg->var_off = tnum_or(hi32, tnum_intersect(lo32, range)); } In the above walk-through example, __reg_bound_offset32() as-is chose a range after masking with 0xffffffff of [0x0,0x0] since umin:0x2000000000 and umax:0x4000000000 and therefore the lo32 part was clamped to 0x0 as well. However, in the umin:0x2000000000 and umax:0x4000000000 range above we'd end up with an actual possible interval of [0x0,0xffffffff] for u32 space instead. In case of the original reproducer, the situation looked as follows at insn 5 for r0: [...] 5: R0_w=invP(id=0,umin_value=808464432,umax_value=5103431727,var_off=(0x0; 0x1ffffffff)) R1_w=invP808464432 R10=fp0 0x30303030 0x13030302f 5: (de) if w1 s<= w0 goto pc+0 R0_w=invP(id=0,umin_value=808464432,umax_value=5103431727,var_off=(0x30303020; 0x10000001f)) R1_w=invP808464432 R10=fp0 0x30303030 0x13030302f [...] After the fall-through, we similarly forced the var_off result into the wrong range [0x30303030,0x3030302f] suggesting later on that fixed bits must only be of 0x30303020 with 0x10000001f unknowns whereas such assumption can only be made when both bounds in hi32 range match. Originally, I was thinking to fix this by moving reg into a temp reg and use proper coerce_reg_to_size() helper on the temp reg where we can then based on that define the range tnum for later intersection: static void __reg_bound_offset32(struct bpf_reg_state *reg) { struct bpf_reg_state tmp = *reg; struct tnum lo32, hi32, range; coerce_reg_to_size(&tmp, 4); range = tnum_range(tmp.umin_value, tmp.umax_value); lo32 = tnum_cast(reg->var_off, 4); hi32 = tnum_lshift(tnum_rshift(reg->var_off, 32), 32); reg->var_off = tnum_or(hi32, tnum_intersect(lo32, range)); } In the case of the concrete example, this gives us a more conservative unknown section. Thus, after knowing r1 <= 0x4000000000 and r1 >= 0x2000000000 and w1 <= 0x400 and w1 >= 0x200: max: 0b100000000000000000000000000000000000000 / 0x4000000000 var: 0b111111111111111111111111111111111111111 / 0x7fffffffff min: 0b010000000000000000000000000000000000000 / 0x2000000000 However, above new __reg_bound_offset32() has no effect on refining the knowledge of the register contents. Meaning, if the bounds in hi32 range mismatch we'll get the identity function given the range reg spans [0x0,0xffffffff] and we cast var_off into lo32 only to later on binary or it again with the hi32. Likewise, if the bounds in hi32 range match, then we mask both bounds with 0xffffffff, use the resulting umin/umax for the range to later intersect the lo32 with it. However, _prior_ called __reg_bound_offset() did already such intersection on the full reg and we therefore would only repeat the same operation on the lo32 part twice. Given this has no effect and the original commit had false assumptions, this patch reverts the code entirely which is also more straight forward for stable trees: apparently 581738a681b6 got auto-selected by Sasha's ML system and misclassified as a fix, so it got sucked into v5.4 where it should never have landed. A revert is low-risk also from a user PoV since it requires a recent kernel and llc to opt-into -mcpu=v3 BPF CPU to generate jmp32 instructions. A proper bounds refinement would need a significantly more complex approach which is currently being worked, but no stable material [0]. Hence revert is best option for stable. After the revert, the original reported program gets rejected as follows: 1: (7f) r0 >>= r0 2: (14) w0 -= 808464432 3: (07) r0 += 808464432 4: (b7) r1 = 808464432 5: (de) if w1 s<= w0 goto pc+0 R0_w=invP(id=0,umin_value=808464432,umax_value=5103431727,var_off=(0x0; 0x1ffffffff)) R1_w=invP808464432 R10=fp0 6: (07) r0 += -2144337872 7: (14) w0 -= -1607454672 8: (25) if r0 > 0x30303030 goto pc+0 R0_w=invP(id=0,umax_value=808464432,var_off=(0x0; 0x3fffffff)) R1_w=invP808464432 R10=fp0 9: (76) if w0 s>= 0x303030 goto pc+2 R0=invP(id=0,umax_value=3158063,var_off=(0x0; 0x3fffff)) R1=invP808464432 R10=fp0 10: (30) r0 = *(u8 *)skb[808464432] BPF_LD_[ABS|IND] uses reserved fields processed 11 insns (limit 1000000) [...] [0] https://lore.kernel.org/bpf/158507130343.15666.8018068546764556975.stgit@john-Precision-5820-Tower/T/ Fixes: 581738a681b6 ("bpf: Provide better register bounds after jmp32 instructions") Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200330160324.15259-2-daniel@iogearbox.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-01bpf/btf: Fix BTF verification of enum members in struct/unionYoshiki Komachi1-1/+1
commit da6c7faeb103c493e505e87643272f70be586635 upstream. btf_enum_check_member() was currently sure to recognize the size of "enum" type members in struct/union as the size of "int" even if its size was packed. This patch fixes BTF enum verification to use the correct size of member in BPF programs. Fixes: 179cde8cef7e ("bpf: btf: Check members of struct/union") Signed-off-by: Yoshiki Komachi <komachi.yoshiki@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/1583825550-18606-2-git-send-email-komachi.yoshiki@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-01bpf: Initialize storage pointers to NULL to prevent freeing garbage pointerAndrii Nakryiko1-2/+2
commit 62039c30c19dcab96621e074aeeb90da7100def7 upstream. Local storage array isn't initialized, so if cgroup storage allocation fails for BPF_CGROUP_STORAGE_SHARED, error handling code will attempt to free uninitialized pointer for BPF_CGROUP_STORAGE_PERCPU storage type. Avoid this by always initializing storage pointers to NULLs. Fixes: 8bad74f9840f ("bpf: extend cgroup bpf core to allow multiple cgroup storage types") Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-of