summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2017-11-30sched/rt: Simplify the IPI based RT balancing logicSteven Rostedt (Red Hat)3-127/+138
commit 4bdced5c9a2922521e325896a7bbbf0132c94e56 upstream. When a CPU lowers its priority (schedules out a high priority task for a lower priority one), a check is made to see if any other CPU has overloaded RT tasks (more than one). It checks the rto_mask to determine this and if so it will request to pull one of those tasks to itself if the non running RT task is of higher priority than the new priority of the next task to run on the current CPU. When we deal with large number of CPUs, the original pull logic suffered from large lock contention on a single CPU run queue, which caused a huge latency across all CPUs. This was caused by only having one CPU having overloaded RT tasks and a bunch of other CPUs lowering their priority. To solve this issue, commit: b6366f048e0c ("sched/rt: Use IPI to trigger RT task push migration instead of pulling") changed the way to request a pull. Instead of grabbing the lock of the overloaded CPU's runqueue, it simply sent an IPI to that CPU to do the work. Although the IPI logic worked very well in removing the large latency build up, it still could suffer from a large number of IPIs being sent to a single CPU. On a 80 CPU box, I measured over 200us of processing IPIs. Worse yet, when I tested this on a 120 CPU box, with a stress test that had lots of RT tasks scheduling on all CPUs, it actually triggered the hard lockup detector! One CPU had so many IPIs sent to it, and due to the restart mechanism that is triggered when the source run queue has a priority status change, the CPU spent minutes! processing the IPIs. Thinking about this further, I realized there's no reason for each run queue to send its own IPI. As all CPUs with overloaded tasks must be scanned regardless if there's one or many CPUs lowering their priority, because there's no current way to find the CPU with the highest priority task that can schedule to one of these CPUs, there really only needs to be one IPI being sent around at a time. This greatly simplifies the code! The new approach is to have each root domain have its own irq work, as the rto_mask is per root domain. The root domain has the following fields attached to it: rto_push_work - the irq work to process each CPU set in rto_mask rto_lock - the lock to protect some of the other rto fields rto_loop_start - an atomic that keeps contention down on rto_lock the first CPU scheduling in a lower priority task is the one to kick off the process. rto_loop_next - an atomic that gets incremented for each CPU that schedules in a lower priority task. rto_loop - a variable protected by rto_lock that is used to compare against rto_loop_next rto_cpu - The cpu to send the next IPI to, also protected by the rto_lock. When a CPU schedules in a lower priority task and wants to make sure overloaded CPUs know about it. It increments the rto_loop_next. Then it atomically sets rto_loop_start with a cmpxchg. If the old value is not "0", then it is done, as another CPU is kicking off the IPI loop. If the old value is "0", then it will take the rto_lock to synchronize with a possible IPI being sent around to the overloaded CPUs. If rto_cpu is greater than or equal to nr_cpu_ids, then there's either no IPI being sent around, or one is about to finish. Then rto_cpu is set to the first CPU in rto_mask and an IPI is sent to that CPU. If there's no CPUs set in rto_mask, then there's nothing to be done. When the CPU receives the IPI, it will first try to push any RT tasks that is queued on the CPU but can't run because a higher priority RT task is currently running on that CPU. Then it takes the rto_lock and looks for the next CPU in the rto_mask. If it finds one, it simply sends an IPI to that CPU and the process continues. If there's no more CPUs in the rto_mask, then rto_loop is compared with rto_loop_next. If they match, everything is done and the process is over. If they do not match, then a CPU scheduled in a lower priority task as the IPI was being passed around, and the process needs to start again. The first CPU in rto_mask is sent the IPI. This change removes this duplication of work in the IPI logic, and greatly lowers the latency caused by the IPIs. This removed the lockup happening on the 120 CPU machine. It also simplifies the code tremendously. What else could anyone ask for? Thanks to Peter Zijlstra for simplifying the rto_loop_start atomic logic and supplying me with the rto_start_trylock() and rto_start_unlock() helper functions. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Clark Williams <williams@redhat.com> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: John Kacur <jkacur@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Scott Wood <swood@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170424114732.1aac6dc4@gandalf.local.home Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-30sched: Make resched_cpu() unconditionalPaul E. McKenney1-2/+1
commit 7c2102e56a3f7d85b5d8f33efbd7aecc1f36fdd8 upstream. The current implementation of synchronize_sched_expedited() incorrectly assumes that resched_cpu() is unconditional, which it is not. This means that synchronize_sched_expedited() can hang when resched_cpu()'s trylock fails as follows (analysis by Neeraj Upadhyay): o CPU1 is waiting for expedited wait to complete: sync_rcu_exp_select_cpus rdp->exp_dynticks_snap & 0x1 // returns 1 for CPU5 IPI sent to CPU5 synchronize_sched_expedited_wait ret = swait_event_timeout(rsp->expedited_wq, sync_rcu_preempt_exp_done(rnp_root), jiffies_stall); expmask = 0x20, CPU 5 in idle path (in cpuidle_enter()) o CPU5 handles IPI and fails to acquire rq lock. Handles IPI sync_sched_exp_handler resched_cpu returns while failing to try lock acquire rq->lock need_resched is not set o CPU5 calls rcu_idle_enter() and as need_resched is not set, goes to idle (schedule() is not called). o CPU 1 reports RCU stall. Given that resched_cpu() is now used only by RCU, this commit fixes the assumption by making resched_cpu() unconditional. Reported-by: Neeraj Upadhyay <neeraju@codeaurora.org> Suggested-by: Neeraj Upadhyay <neeraju@codeaurora.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-15workqueue: Fix NULL pointer dereferenceLi Bin1-1/+2
commit cef572ad9bd7f85035ba8272e5352040e8be0152 upstream. When queue_work() is used in irq (not in task context), there is a potential case that trigger NULL pointer dereference. ---------------------------------------------------------------- worker_thread() |-spin_lock_irq() |-process_one_work() |-worker->current_pwq = pwq |-spin_unlock_irq() |-worker->current_func(work) |-spin_lock_irq() |-worker->current_pwq = NULL |-spin_unlock_irq() //interrupt here |-irq_handler |-__queue_work() //assuming that the wq is draining |-is_chained_work(wq) |-current_wq_worker() //Here, 'current' is the interrupted worker! |-current->current_pwq is NULL here! |-schedule() ---------------------------------------------------------------- Avoid it by checking for task context in current_wq_worker(), and if not in task context, we shouldn't use the 'current' to check the condition. Reported-by: Xiaofei Tan <tanxiaofei@huawei.com> Signed-off-by: Li Bin <huawei.libin@huawei.com> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 8d03ecfe4718 ("workqueue: reimplement is_chained_work() using current_wq_worker()") Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-15sched/core: Add missing update_rq_clock() call in sched_move_task()Peter Zijlstra1-0/+1
[ Upstream commit 1b1d62254df0fe42a711eb71948f915918987790 ] Bug was noticed via this warning: WARNING: CPU: 6 PID: 1 at kernel/sched/sched.h:804 detach_task_cfs_rq+0x8e8/0xb80 rq->clock_update_flags < RQCF_ACT_SKIP Modules linked in: CPU: 6 PID: 1 Comm: systemd Not tainted 4.10.0-rc5-00140-g0874170baf55-dirty #1 Hardware name: Supermicro SYS-4048B-TRFT/X10QBi, BIOS 1.0 04/11/2014 Call Trace: dump_stack+0x4d/0x65 __warn+0xcb/0xf0 warn_slowpath_fmt+0x5f/0x80 detach_task_cfs_rq+0x8e8/0xb80 ? allocate_cgrp_cset_links+0x59/0x80 task_change_group_fair+0x27/0x150 sched_change_group+0x48/0xf0 sched_move_task+0x53/0x150 cpu_cgroup_attach+0x36/0x70 cgroup_taskset_migrate+0x175/0x300 cgroup_migrate+0xab/0xd0 cgroup_attach_task+0xf0/0x190 __cgroup_procs_write+0x1ed/0x2f0 cgroup_procs_write+0x14/0x20 cgroup_file_write+0x3f/0x100 kernfs_fop_write+0x104/0x180 __vfs_write+0x37/0x140 vfs_write+0xb8/0x1b0 SyS_write+0x55/0xc0 do_syscall_64+0x61/0x170 entry_SYSCALL64_slow_path+0x25/0x25 Reported-by: Ingo Molnar <mingo@kernel.org> Reported-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-02workqueue: replace pool->manager_arb mutex with a flagTejun Heo1-22/+15
commit 692b48258dda7c302e777d7d5f4217244478f1f6 upstream. Josef reported a HARDIRQ-safe -> HARDIRQ-unsafe lock order detected by lockdep: [ 1270.472259] WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected [ 1270.472783] 4.14.0-rc1-xfstests-12888-g76833e8 #110 Not tainted [ 1270.473240] ----------------------------------------------------- [ 1270.473710] kworker/u5:2/5157 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire: [ 1270.474239] (&(&lock->wait_lock)->rlock){+.+.}, at: [<ffffffff8da253d2>] __mutex_unlock_slowpath+0xa2/0x280 [ 1270.474994] [ 1270.474994] and this task is already holding: [ 1270.475440] (&pool->lock/1){-.-.}, at: [<ffffffff8d2992f6>] worker_thread+0x366/0x3c0 [ 1270.476046] which would create a new lock dependency: [ 1270.476436] (&pool->lock/1){-.-.} -> (&(&lock->wait_lock)->rlock){+.+.} [ 1270.476949] [ 1270.476949] but this new dependency connects a HARDIRQ-irq-safe lock: [ 1270.477553] (&pool->lock/1){-.-.} ... [ 1270.488900] to a HARDIRQ-irq-unsafe lock: [ 1270.489327] (&(&lock->wait_lock)->rlock){+.+.} ... [ 1270.494735] Possible interrupt unsafe locking scenario: [ 1270.494735] [ 1270.495250] CPU0 CPU1 [ 1270.495600] ---- ---- [ 1270.495947] lock(&(&lock->wait_lock)->rlock); [ 1270.496295] local_irq_disable(); [ 1270.496753] lock(&pool->lock/1); [ 1270.497205] lock(&(&lock->wait_lock)->rlock); [ 1270.497744] <Interrupt> [ 1270.497948] lock(&pool->lock/1); , which will cause a irq inversion deadlock if the above lock scenario happens. The root cause of this safe -> unsafe lock order is the mutex_unlock(pool->manager_arb) in manage_workers() with pool->lock held. Unlocking mutex while holding an irq spinlock was never safe and this problem has been around forever but it never got noticed because the only time the mutex is usually trylocked while holding irqlock making actual failures very unlikely and lockdep annotation missed the condition until the recent b9c16a0e1f73 ("locking/mutex: Fix lockdep_assert_held() fail"). Using mutex for pool->manager_arb has always been a bit of stretch. It primarily is an mechanism to arbitrate managership between workers which can easily be done with a pool flag. The only reason it became a mutex is that pool destruction path wants to exclude parallel managing operations. This patch replaces the mutex with a new pool flag POOL_MANAGER_ACTIVE and make the destruction path wait for the current manager on a wait queue. v2: Drop unnecessary flag clearing before pool destruction as suggested by Boqun. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-21hrtimer: Catch invalid clockids againMarc Zyngier1-5/+15
[ Upstream commit 336a9cde10d641e70bac67d90ae91b3190c3edca ] commit 82e88ff1ea94 ("hrtimer: Revert CLOCK_MONOTONIC_RAW support") removed unfortunately a sanity check in the hrtimer code which was part of that MONOTONIC_RAW patch series. It would have caught the bogus usage of CLOCK_MONOTONIC_RAW in the wireless code. So bring it back. It is way too easy to take any random clockid and feed it to the hrtimer subsystem. At best, it gets mapped to a monotonic base, but it would be better to just catch illegal values as early as possible. Detect invalid clockids, map them to CLOCK_MONOTONIC and emit a warning. [ tglx: Replaced the BUG by a WARN and gracefully map to CLOCK_MONOTONIC ] Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Cc: Tomasz Nowicki <tn@semihalf.com> Cc: Christoffer Dall <christoffer.dall@linaro.org> Link: http://lkml.kernel.org/r/1452879670-16133-3-git-send-email-marc.zyngier@arm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-21sched/fair: Update rq clock before changing a task's CPU affinityWanpeng Li1-0/+1
[ Upstream commit a499c3ead88ccf147fc50689e85a530ad923ce36 ] This is triggered during boot when CONFIG_SCHED_DEBUG is enabled: ------------[ cut here ]------------ WARNING: CPU: 6 PID: 81 at kernel/sched/sched.h:812 set_next_entity+0x11d/0x380 rq->clock_update_flags < RQCF_ACT_SKIP CPU: 6 PID: 81 Comm: torture_shuffle Not tainted 4.10.0+ #1 Hardware name: LENOVO ThinkCentre M8500t-N000/SHARKBAY, BIOS FBKTC1AUS 02/16/2016 Call Trace: dump_stack+0x85/0xc2 __warn+0xcb/0xf0 warn_slowpath_fmt+0x5f/0x80 set_next_entity+0x11d/0x380 set_curr_task_fair+0x2b/0x60 do_set_cpus_allowed+0x139/0x180 __set_cpus_allowed_ptr+0x113/0x260 set_cpus_allowed_ptr+0x10/0x20 torture_shuffle+0xfd/0x180 kthread+0x10f/0x150 ? torture_shutdown_init+0x60/0x60 ? kthread_create_on_node+0x60/0x60 ret_from_fork+0x31/0x40 ---[ end trace dd94d92344cea9c6 ]--- The task is running && !queued, so there is no rq clock update before calling set_curr_task(). This patch fixes it by updating rq clock after holding rq->lock/pi_lock just as what other dequeue + put_prev + enqueue + set_curr story does. Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1487749975-5994-1-git-send-email-wanpeng.li@hotmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-21locking/lockdep: Add nest_lock integrity testPeter Zijlstra1-2/+9
[ Upstream commit 7fb4a2cea6b18dab56d609530d077f168169ed6b ] Boqun reported that hlock->references can overflow. Add a debug test for that to generate a clear error when this happens. Without this, lockdep is likely to report a mysterious failure on unlock. Reported-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nicolai Hähnle <Nicolai.Haehnle@amd.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-18rcu: Allow for page faults in NMI handlersPaul E. McKenney1-2/+12
commit 28585a832602747cbfa88ad8934013177a3aae38 upstream. A number of architecture invoke rcu_irq_enter() on exception entry in order to allow RCU read-side critical sections in the exception handler when the exception is from an idle or nohz_full CPU. This works, at least unless the exception happens in an NMI handler. In that case, rcu_nmi_enter() would already have exited the extended quiescent state, which would mean that rcu_irq_enter() would (incorrectly) cause RCU to think that it is again in an extended quiescent state. This will in turn result in lockdep splats in response to later RCU read-side critical sections. This commit therefore causes rcu_irq_enter() and rcu_irq_exit() to take no action if there is an rcu_nmi_enter() in effect, thus avoiding the unscheduled return to RCU quiescent state. This in turn should make the kernel safe for on-demand RCU voyeurism. Link: http://lkml.kernel.org/r/20170922211022.GA18084@linux.vnet.ibm.com Fixes: 0be964be0 ("module: Sanitize RCU usage and locking") Reported-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-12sched/cpuset/pm: Fix cpuset vs. suspend-resume bugsPeter Zijlstra3-6/+22
commit 50e76632339d4655859523a39249dd95ee5e93e7 upstream. Cpusets vs. suspend-resume is _completely_ broken. And it got noticed because it now resulted in non-cpuset usage breaking too. On suspend cpuset_cpu_inactive() doesn't call into cpuset_update_active_cpus() because it doesn't want to move tasks about, there is no need, all tasks are frozen and won't run again until after we've resumed everything. But this means that when we finally do call into cpuset_update_active_cpus() after resuming the last frozen cpu in cpuset_cpu_active(), the top_cpuset will not have any difference with the cpu_active_mask and this it will not in fact do _anything_. So the cpuset configuration will not be restored. This was largely hidden because we would unconditionally create identity domains and mobile users would not in fact use cpusets much. And servers what do use cpusets tend to not suspend-resume much. An addition problem is that we'd not in fact wait for the cpuset work to finish before resuming the tasks, allowing spurious migrations outside of the specified domains. Fix the rebuild by introducing cpuset_force_rebuild() and fix the ordering with cpuset_wait_for_hotplug(). Reported-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rjw@rjwysocki.net> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: deb7aa308ea2 ("cpuset: reorganize CPU / memory hotplug handling") Link: http://lkml.kernel.org/r/20170907091338.orwxrqkbfkki3c24@hirez.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-12ftrace: Fix kmemleak in unregister_ftrace_graphShu Wang1-14/+0
commit 2b0b8499ae75df91455bbeb7491d45affc384fb0 upstream. The trampoline allocated by function tracer was overwriten by function_graph tracer, and caused a memory leak. The save_global_trampoline should have saved the previous trampoline in register_ftrace_graph() and restored it in unregister_ftrace_graph(). But as it is implemented, save_global_trampoline was only used in unregister_ftrace_graph as default value 0, and it overwrote the previous trampoline's value. Causing the previous allocated trampoline to be lost. kmmeleak backtrace: kmemleak_vmalloc+0x77/0xc0 __vmalloc_node_range+0x1b5/0x2c0 module_alloc+0x7c/0xd0 arch_ftrace_update_trampoline+0xb5/0x290 ftrace_startup+0x78/0x210 register_ftrace_function+0x8b/0xd0 function_trace_init+0x4f/0x80 tracing_set_tracer+0xe6/0x170 tracing_set_trace_write+0x90/0xd0 __vfs_write+0x37/0x170 vfs_write+0xb2/0x1b0 SyS_write+0x55/0xc0 do_syscall_64+0x67/0x180 return_from_SYSCALL_64+0x0/0x6a [ Looking further into this, I found that this was left over from when the function and function graph tracers shared the same ftrace_ops. But in commit 5f151b2401 ("ftrace: Fix function_profiler and function tracer together"), the two were separated, and the save_global_trampoline no longer was necessary (and it may have been broken back then too). -- Steven Rostedt ] Link: http://lkml.kernel.org/r/20170912021454.5976-1-shuwang@redhat.com Fixes: 5f151b2401 ("ftrace: Fix function_profiler and function tracer together") Signed-off-by: Shu Wang <shuwang@redhat.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-12bpf: one perf event close won't free bpf program attached by another perf eventYonghong Song1-1/+2
[ Upstream commit ec9dd352d591f0c90402ec67a317c1ed4fb2e638 ] This patch fixes a bug exhibited by the following scenario: 1. fd1 = perf_event_open with attr.config = ID1 2. attach bpf program prog1 to fd1 3. fd2 = perf_event_open with attr.config = ID1 <this will be successful> 4. user program closes fd2 and prog1 is detached from the tracepoint. 5. user program with fd1 does not work properly as tracepoint no output any more. The issue happens at step 4. Multiple perf_event_open can be called successfully, but only one bpf prog pointer in the tp_event. In the current logic, any fd release for the same tp_event will free the tp_event->prog. The fix is to free tp_event->prog only when the closing fd corresponds to the one which registered the program. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-12bpf/verifier: reject BPF_ALU64|BPF_ENDEdward Cree1-1/+2
[ Upstream commit e67b8a685c7c984e834e3181ef4619cd7025a136 ] Neither ___bpf_prog_run nor the JITs accept it. Also adds a new test case. Fixes: 17a5267067f3 ("bpf: verifier (add verifier core)") Signed-off-by: Edward Cree <ecree@solarflare.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-05timer/sysclt: Restrict timer migration sysctl values to 0 and 1Myungho Jung2-1/+3
commit b94bf594cf8ed67cdd0439e70fa939783471597a upstream. timer_migration sysctl acts as a boolean switch, so the allowed values should be restricted to 0 and 1. Add the necessary extra fields to the sysctl table entry to enforce that. [ tglx: Rewrote changelog ] Signed-off-by: Myungho Jung <mhjungk@gmail.com> Link: http://lkml.kernel.org/r/1492640690-3550-1-git-send-email-mhjungk@gmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Kazuhiro Hayashi <kazuhiro3.hayashi@toshiba.co.jp> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-05seccomp: fix the usage of get/put_seccomp_filter() in seccomp_get_filter()Oleg Nesterov1-7/+16
commit 66a733ea6b611aecf0119514d2dddab5f9d6c01e upstream. As Chris explains, get_seccomp_filter() and put_seccomp_filter() can end up using different filters. Once we drop ->siglock it is possible for task->seccomp.filter to have been replaced by SECCOMP_FILTER_FLAG_TSYNC. Fixes: f8e529ed941b ("seccomp, ptrace: add support for dumping seccomp filters") Reported-by: Chris Salls <chrissalls5@gmail.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> [tycho: add __get_seccomp_filter vs. open coding refcount_inc()] Signed-off-by: Tycho Andersen <tycho@docker.com> [kees: tweak commit log] Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-05tracing: Erase irqsoff trace with empty writeBo Yan1-2/+8
commit 8dd33bcb7050dd6f8c1432732f930932c9d3a33e upstream. One convenient way to erase trace is "echo > trace". However, this is currently broken if the current tracer is irqsoff tracer. This is because irqsoff tracer use max_buffer as the default trace buffer. Set the max_buffer as the one to be cleared when it's the trace buffer currently in use. Link: http://lkml.kernel.org/r/1505754215-29411-1-git-send-email-byan@nvidia.com Cc: <mingo@redhat.com> Fixes: 4acd4d00f ("tracing: give easy way to clear trace buffer") Signed-off-by: Bo Yan <byan@nvidia.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-05tracing: Fix trace_pipe behavior for instance tracesTahsin Erdogan1-1/+1
commit 75df6e688ccd517e339a7c422ef7ad73045b18a2 upstream. When reading data from trace_pipe, tracing_wait_pipe() performs a check to see if tracing has been turned off after some data was read. Currently, this check always looks at global trace state, but it should be checking the trace instance where trace_pipe is located at. Because of this bug, cat instances/i1/trace_pipe in the following script will immediately exit instead of waiting for data: cd /sys/kernel/debug/tracing echo 0 > tracing_on mkdir -p instances/i1 echo 1 > instances/i1/tracing_on echo 1 > instances/i1/events/sched/sched_process_exec/enable cat instances/i1/trace_pipe Link: http://lkml.kernel.org/r/20170917102348.1615-1-tahsin@google.com Fixes: 10246fa35d4f ("tracing: give easy way to clear trace buffer") Signed-off-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-05genirq: Make sparse_irq_lock protect what it should protectThomas Gleixner1-17/+7
commit 12ac1d0f6c3e95732d144ffa65c8b20fbd9aa462 upstream. for_each_active_irq() iterates the sparse irq allocation bitmap. The caller must hold sparse_irq_lock. Several code pathes expect that an active bit in the sparse bitmap also has a valid interrupt descriptor. Unfortunately that's not true. The (de)allocation is a two step process, which holds the sparse_irq_lock only across the queue/remove from the radix tree and the set/clear in the allocation bitmap. If a iteration locks sparse_irq_lock between the two steps, then it might see an active bit but the corresponding irq descriptor is NULL. If that is dereferenced unconditionally, then the kernel oopses. Of course, all iterator sites could be audited and fixed, but.... There is no reason why the sparse_irq_lock needs to be dropped between the two steps, in fact the code becomes simpler when the mutex is held across both and the semantics become more straight forward, so future problems of missing NULL pointer checks in the iteration are avoided and all existing sites are fixed in one go. Expand the lock held sections so both operations are covered and the bitmap and the radixtree are in sync. Fixes: a05a900a51c7 ("genirq: Make sparse_lock a mutex") Reported-and-tested-by: Huang Ying <ying.huang@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-09-27tracing: Apply trace_clock changes to instance max bufferBaohong Liu1-1/+1
commit 170b3b1050e28d1ba0700e262f0899ffa4fccc52 upstream. Currently trace_clock timestamps are applied to both regular and max buffers only for global trace. For instance trace, trace_clock timestamps are applied only to regular buffer. But, regular and max buffers can be swapped, for example, following a snapshot. So, for instance trace, bad timestamps can be seen following a snapshot. Let's apply trace_clock timestamps to instance max buffer as well. Link: http://lkml.kernel.org/r/ebdb168d0be042dcdf51f81e696b17fabe3609c1.1504642143.git.tom.zanussi@linux.intel.com Fixes: 277ba0446 ("tracing: Add interface to allow multiple trace buffers") Signed-off-by: Baohong Liu <baohong.liu@intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-09-27tracing: Add barrier to trace_printk() buffer nesting modificationSteven Rostedt (VMware)1-1/+7
commit 3d9622c12c8873911f4cc0ccdabd0362c2fca06b upstream. trace_printk() uses 4 buffers, one for each context (normal, softirq, irq and NMI), such that it does not need to worry about one context preempting the other. There's a nesting counter that gets incremented to figure out which buffer to use. If the context gets preempted by another context which calls trace_printk() it will increment the counter and use the next buffer, and restore the counter when it is finished. The problem is that gcc may optimize the modification of the buffer nesting counter and it may not be incremented in memory before the buffer is used. If this happens, and the context gets interrupted by another context, it could pick the same buffer and corrupt the one that is being used. Compiler barriers need to be added after the nesting variable is incremented and before it is decremented to prevent usage of the context buffers by more than one context at the same time. Cc: Andy Lutomirski <luto@kernel.org> Fixes: e2ace00117 ("tracing: Choose static tp_printk buffer by explicit nesting count") Hat-tip-to: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-09-27ftrace: Fix memleak when unregistering dynamic ops when tracing disabledSteven Rostedt (VMware)1-4/+6
commit edb096e00724f02db5f6ec7900f3bbd465c6c76f upstream. If function tracing is disabled by the user via the function-trace option or the proc sysctl file, and a ftrace_ops that was allocated on the heap is unregistered, then the shutdown code exits out without doing the proper clean up. This was found via kmemleak and running the ftrace selftests, as one of the tests unregisters with function tracing disabled. # cat kmemleak unreferenced object 0xffffffffa0020000 (size 4096): comm "swapper/0", pid 1, jiffies 4294668889 (age 569.209s) hex dump (first 32 bytes): 55 ff 74 24 10 55 48 89 e5 ff 74 24 18 55 48 89 U.t$.UH...t$.UH. e5 48 81 ec a8 00 00 00 48 89 44 24 50 48 89 4c .H......H.D$PH.L backtrace: [<ffffffff81d64665>] kmemleak_vmalloc+0x85/0xf0 [<ffffffff81355631>] __vmalloc_node_range+0x281/0x3e0 [<ffffffff8109697f>] module_alloc+0x4f/0x90 [<ffffffff81091170>] arch_ftrace_update_trampoline+0x160/0x420 [<ffffffff81249947>] ftrace_startup+0xe7/0x300 [<ffffffff81249bd2>] register_ftrace_function+0x72/0x90 [<ffffffff81263786>] trace_selftest_ops+0x204/0x397 [<ffffffff82bb8971>] trace_selftest_startup_function+0x394/0x624 [<ffffffff81263a75>] run_tracer_selftest+0x15c/0x1d7 [<ffffffff82bb83f1>] init_trace_selftests+0x75/0x192 [<ffffffff81002230>] do_one_initcall+0x90/0x1e2 [<ffffffff82b7d620>] kernel_init_freeable+0x350/0x3fe [<ffffffff81d61ec3>] kernel_init+0x13/0x122 [<ffffffff81d72c6a>] ret_from_fork+0x2a/0x40 [<ffffffffffffffff>] 0xffffffffffffffff Fixes: 12cce594fa ("ftrace/x86: Allow !CONFIG_PREEMPT dynamic ops to use allocated trampolines") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-09-27ftrace: Fix selftest goto location on errorSteven Rostedt (VMware)1-1/+1
commit 46320a6acc4fb58f04bcf78c4c942cc43b20f986 upstream. In the second iteration of trace_selftest_ops(), the error goto label is wrong in the case where trace_selftest_test_global_cnt is off. In the case of error, it leaks the dynamic ops that was allocated. Fixes: 95950c2e ("ftrace: Add self-tests for multiple function trace users") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-09-13locktorture: Fix potential memory leak with rw lock testYang Shi1-0/+6
commit f4dbba591945dc301c302672adefba9e2ec08dc5 upstream. When running locktorture module with the below commands with kmemleak enabled: $ modprobe locktorture torture_type=rw_lock_irq $ rmmod locktorture The below kmemleak got caught: root@10:~# echo scan > /sys/kernel/debug/kmemleak [ 323.197029] kmemleak: 2 new suspected memory leaks (see /sys/kernel/debug/kmemleak) root@10:~# cat /sys/kernel/debug/kmemleak unreferenced object 0xffffffc07592d500 (size 128): comm "modprobe", pid 368, jiffies 4294924118 (age 205.824s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 c3 7b 02 00 00 00 00 00 .........{...... 00 00 00 00 00 00 00 00 d7 9b 02 00 00 00 00 00 ................ backtrace: [<ffffff80081e5a88>] create_object+0x110/0x288 [<ffffff80086c6078>] kmemleak_alloc+0x58/0xa0 [<ffffff80081d5acc>] __kmalloc+0x234/0x318 [<ffffff80006fa130>] 0xffffff80006fa130 [<ffffff8008083ae4>] do_one_initcall+0x44/0x138 [<ffffff800817e28c>] do_init_module+0x68/0x1cc [<ffffff800811c848>] load_module+0x1a68/0x22e0 [<ffffff800811d340>] SyS_finit_module+0xe0/0xf0 [<ffffff80080836f0>] el0_svc_naked+0x24/0x28 [<ffffffffffffffff>] 0xffffffffffffffff unreferenced object 0xffffffc07592d480 (size 128): comm "modprobe", pid 368, jiffies 4294924118 (age 205.824s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 3b 6f 01 00 00 00 00 00 ........;o...... 00 00 00 00 00 00 00 00 23 6a 01 00 00 00 00 00 ........#j...... backtrace: [<ffffff80081e5a88>] create_object+0x110/0x288 [<ffffff80086c6078>] kmemleak_alloc+0x58/0xa0 [<ffffff80081d5acc>] __kmalloc+0x234/0x318 [<ffffff80006fa22c>] 0xffffff80006fa22c [<ffffff8008083ae4>] do_one_initcall+0x44/0x138 [<ffffff800817e28c>] do_init_module+0x68/0x1cc [<ffffff800811c848>] load_module+0x1a68/0x22e0 [<ffffff800811d340>] SyS_finit_module+0xe0/0xf0 [<ffffff80080836f0>] el0_svc_naked+0x24/0x28 [<ffffffffffffffff>] 0xffffffffffffffff It is because cxt.lwsa and cxt.lrsa don't get freed in module_exit, so free them in lock_torture_cleanup() and free writer_tasks if reader_tasks is failed at memory allocation. Signed-off-by: Yang Shi <yang.shi@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Cc: 石洋 <yang.s@alibaba-inc.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-09-07cpuset: Fix incorrect memory_pressure control file mappingWaiman Long1-0/+1
commit 1c08c22c874ac88799cab1f78c40f46110274915 upstream. The memory_pressure control file was incorrectly set up without a private value (0, by default). As a result, this control file was treated like memory_migrate on read. By adding back the FILE_MEMORY_PRESSURE private value, the correct memory pressure value will be returned. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 7dbdb199d3bf ("cgroup: replace cftype->mode with CFTYPE_WORLD_WRITABLE") Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-09-07mm, uprobes: fix multiple free of ->uprobes_state.xol_areaEric Biggers2-2/+8
commit 355627f518978b5167256d27492fe0b343aaf2f2 upstream. Commit 7c051267931a ("mm, fork: make dup_mmap wait for mmap_sem for write killable") made it possible to kill a forking task while it is waiting to acquire its ->mmap_sem for write, in dup_mmap(). However, it was overlooked that this introduced an new error path before the new mm_struct's ->uprobes_state.xol_area has been set to NULL after being copied from the old mm_struct by the memcpy in dup_mm(). For a task that has previously hit a uprobe tracepoint, this resulted in the 'struct xol_area' being freed multiple times if the task was killed at just the right time while forking. Fix it by setting ->uprobes_state.xol_area to NULL in mm_init() rather than in uprobe_dup_mmap(). With CONFIG_UPROBE_EVENTS=y, the bug can be reproduced by the same C program given by commit 2b7e8665b4ff ("fork: fix incorrect fput of ->exe_file causing use-after-free"), provided that a uprobe tracepoint has been set on the fork_thread() function. For example: $ gcc reproducer.c -o reproducer -lpthread $ nm reproducer | grep fork_thread 0000000000400719 t fork_thread $ echo "p $PWD/reproducer:0x719" > /sys/kernel/debug/tracing/uprobe_events $ echo 1 > /sys/kernel/debug/tracing/events/uprobes/enable $ ./reproducer Here is the use-after-free reported by KASAN: BUG: KASAN: use-after-free in uprobe_clear_state+0x1c4/0x200 Read of size 8 at addr ffff8800320a8b88 by task reproducer/198 CPU: 1 PID: 198 Comm: reproducer Not tainted 4.13.0-rc7-00015-g36fde05f3fb5 #255 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014 Call Trace: dump_stack+0xdb/0x185 print_address_description+0x7e/0x290 kasan_report+0x23b/0x350 __asan_report_load8_noabort+0x19/0x20 uprobe_clear_state+0x1c4/0x200 mmput+0xd6/0x360 do_exit+0x740/0x1670 do_group_exit+0x13f/0x380 get_signal+0x597/0x17d0 do_signal+0x99/0x1df0 exit_to_usermode_loop+0x166/0x1e0 syscall_return_slowpath+0x258/0x2c0 entry_SYSCALL_64_fastpath+0xbc/0xbe ... Allocated by task 199: save_stack_trace+0x1b/0x20 kasan_kmalloc+0xfc/0x180 kmem_cache_alloc_trace+0xf3/0x330 __create_xol_area+0x10f/0x780 uprobe_notify_resume+0x1674/0x2210 exit_to_usermode_loop+0x150/0x1e0 prepare_exit_to_usermode+0x14b/0x180 retint_user+0x8/0x20 Freed by task 199: save_stack_trace+0x1b/0x20 kasan_slab_free+0xa8/0x1a0 kfree+0xba/0x210 uprobe_clear_state+0x151/0x200 mmput+0xd6/0x360 copy_process.part.8+0x605f/0x65d0 _do_fork+0x1a5/0xbd0 SyS_clone+0x19/0x20 do_syscall_64+0x22f/0x660 return_from_SYSCALL_64+0x0/0x7a Note: without KASAN, you may instead see a "Bad page state" message, or simply a general protection fault. Link: http://lkml.kernel.org/r/20170830033303.17927-1-ebiggers3@gmail.com Fixes: 7c051267931a ("mm, fork: make dup_mmap wait for mmap_sem for write killable") Signed-off-by: Eric Biggers <ebiggers@google.com> Reported-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-09-02locking/spinlock/debug: Remove spinlock lockup detection codeWaiman Long1-81/+5
commit bc88c10d7e6900916f5e1ba3829d66a9de92b633 upstream. The current spinlock lockup detection code can sometimes produce false positives because of the unfairness of the locking algorithm itself. So the lockup detection code is now removed. Instead, we are relying on the NMI watchdog to detect potential lockup. We won't have lockup detection if the watchdog isn't running. The commented-out read-write lock lockup detection code are also removed. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1486583208-11038-1-git-send-email-longman@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-09-02gcov: support GCC 7.1Martin Liska2-1/+9
commit 05384213436ab690c46d9dfec706b80ef8d671ab upstream. Starting from GCC 7.1, __gcov_exit is a new symbol expected to be implemented in a profiling runtime. [akpm@linux-foundation.org: coding-style fixes] [mliska@suse.cz: v2] Link: http://lkml.kernel.org/r/e63a3c59-0149-c97e-4084-20ca8f146b26@suse.cz Link: http://lkml.kernel.org/r/8c4084fa-3885-29fe-5fc4-0d4ca199c785@suse.cz Signed-off-by: Martin Liska <mliska@suse.cz> Acked-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-30timers: Fix excessive granularity of new timers after a nohz idleNicholas Piggin1-9/+41
commit 2fe59f507a65dbd734b990a11ebc7488f6f87a24 upstream. When a timer base is idle, it is forwarded when a new timer is added to ensure that granularity does not become excessive. When not idle, the timer tick is expected to increment the base. However there are several problems: - If an existing timer is modified, the base is forwarded only after the index is calculated. - The base is not forwarded by add_timer_on. - There is a window after a timer is restarted from a nohz idle, after it is marked not-idle and before the timer tick on this CPU, where a timer may be added but the ancient base does not get forwarded. These result in excessive granularity (a 1 jiffy timeout can blow out to 100s of jiffies), which cause the rcu lockup detector to trigger, among other things. Fix this by keeping track of whether the timer base has been idle since it was last run or forwarded, and if so then forward it before adding a new timer. There is still a case where mod_timer optimises the case of a pending timer mod with the same expiry time, where the timer can see excessive granularity relative to the new, shorter interval. A comment is added, but it's not changed because it is an important fastpath for networking. This has been tested and found to fix the RCU softlockup messages. Testing was also done with tracing to measure requested versus achieved wakeup latencies for all non-deferrable timers in an idle system (with no lockup watchdogs running). Wakeup latency relative to absolute latency is calculated (note this suffers from round-up skew at low absolute times) and analysed: max avg std upstream 506.0 1.20 4.68 patched 2.0 1.08 0.15 The bug was noticed due to the lockup detector Kconfig changes dropping it out of people's .configs and resulting in larger base clk skew When the lockup detectors are enabled, no CPU can go idle for longer than 4 seconds, which limits the granularity errors. Sub-optimal timer behaviour is observable on a smaller scale in that case: max avg std upstream 9.0 1.05 0.19 patched 2.0 1.04 0.11 Fixes: Fixes: a683f390b93f ("timers: Forward the wheel clock whenever possible") Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Tested-by: David Miller <davem@davemloft.net> Cc: dzickus@redhat.com Cc: sfr@canb.auug.org.au Cc: mpe@ellerman.id.au Cc: Stephen Boyd <sboyd@codeaurora.org> Cc: linuxarm@huawei.com Cc: abdhalee@linux.vnet.ibm.com Cc: John Stultz <john.stultz@linaro.org> Cc: akpm@linux-foundation.org Cc: paulmck@linux.vnet.ibm.com Cc: torvalds@linux-foundation.org Link: http://lkml.kernel.org/r/20170822084348.21436-1-npiggin@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-30perf/core: Fix group {cpu,task} validationMark Rutland1-20/+19
commit 64aee2a965cf2954a038b5522f11d2cd2f0f8f3e upstream. Regardless of which events form a group, it does not make sense for the events to target different tasks and/or CPUs, as this leaves the group inconsistent and impossible to schedule. The core perf code assumes that these are consistent across (successfully intialised) groups. Core perf code only verifies this when moving SW events into a HW context. Thus, we can violate this requirement for pure SW groups and pure HW groups, unless the relevant PMU driver happens to perform this verification itself. These mismatched groups subsequently wreak havoc elsewhere. For example, we handle watchpoints as SW events, and reserve watchpoint HW on a per-CPU basis at pmu::event_init() time to ensure that any event that is initialised is guaranteed to have a slot at pmu::add() time. However, the core code only checks the group leader's cpu filter (via event_filter_match()), and can thus install follower events onto CPUs violating thier (mismatched) CPU filters, potentially installing them into a CPU without sufficient reserved slots. This can be triggered with the below test case, resulting in warnings from arch backends. #define _GNU_SOURCE #include <linux/hw_breakpoint.h> #include <linux/perf_event.h> #include <sched.h> #include <stdio.h> #include <sys/prctl.h> #include <sys/syscall.h> #include <unistd.h> static int perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu, int group_fd, unsigned long flags) { return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags); } char watched_char; struct perf_event_attr wp_attr = { .type = PERF_TYPE_BREAKPOINT, .bp_type = HW_BREAKPOINT_RW, .bp_addr = (unsigned long)&watched_char, .bp_len = 1, .size = sizeof(wp_attr), }; int main(int argc, char *argv[]) { int leader, ret; cpu_set_t cpus; /* * Force use of CPU0 to ensure our CPU0-bound events get scheduled. */ CPU_ZERO(&cpus); CPU_SET(0, &cpus); ret = sched_setaffinity(0, sizeof(cpus), &cpus); if (ret) { printf("Unable to set cpu affinity\n"); return 1; } /* open leader event, bound to this task, CPU0 only */ leader = perf_event_open(&wp_attr, 0, 0, -1, 0); if (leader < 0) { printf("Couldn't open leader: %d\n", leader); return 1; } /* * Open a follower event that is bound to the same task, but a * different CPU. This means that the group should never be possible to * schedule. */ ret = perf_event_open(&wp_attr, 0, 1, leader, 0); if (ret < 0) { printf("Couldn't open mismatched follower: %d\n", ret); return 1; } else { printf("Opened leader/follower with mismastched CPUs\n"); } /* * Open as many independent events as we can, all bound to the same * task, CPU0 only. */ do { ret = perf_event_open(&wp_attr, 0, 0, -1, 0); } while (ret >= 0); /* * Force enable/disble all events to trigger the erronoeous * installation of the follower event. */ printf("Opened all events. Toggling..\n"); for (;;) { prctl(PR_TASK_PERF_EVENTS_DISABLE, 0, 0, 0, 0); prctl(PR_TASK_PERF_EVENTS_ENABLE, 0, 0, 0, 0); } return 0; } Fix this by validating this requirement regardless of whether we're moving events. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutr