path: root/kernel
Age | Commit message | Author | Files | Lines
2013-12-04 | ftrace: Fix function graph with loading of modules | Steven Rostedt (Red Hat) | 1 | -34/+34
commit 8a56d7761d2d041ae5e8215d20b4167d8aa93f51 upstream. Commit 8c4f3c3fa9681 "ftrace: Check module functions being traced on reload" fixed module loading and unloading with respect to function tracing, but it missed the function graph tracer. If you perform the following # cd /sys/kernel/debug/tracing # echo function_graph > current_tracer # modprobe nfsd # echo nop > current_tracer You'll get the following oops message: ------------[ cut here ]------------ WARNING: CPU: 2 PID: 2910 at /linux.git/kernel/trace/ftrace.c:1640 __ftrace_hash_rec_update.part.35+0x168/0x1b9() Modules linked in: nfsd exportfs nfs_acl lockd ipt_MASQUERADE sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables uinput snd_hda_codec_idt CPU: 2 PID: 2910 Comm: bash Not tainted 3.13.0-rc1-test #7 Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007 0000000000000668 ffff8800787efcf8 ffffffff814fe193 ffff88007d500000 0000000000000000 ffff8800787efd38 ffffffff8103b80a 0000000000000668 ffffffff810b2b9a ffffffff81a48370 0000000000000001 ffff880037aea000 Call Trace: [<ffffffff814fe193>] dump_stack+0x4f/0x7c [<ffffffff8103b80a>] warn_slowpath_common+0x81/0x9b [<ffffffff810b2b9a>] ? __ftrace_hash_rec_update.part.35+0x168/0x1b9 [<ffffffff8103b83e>] warn_slowpath_null+0x1a/0x1c [<ffffffff810b2b9a>] __ftrace_hash_rec_update.part.35+0x168/0x1b9 [<ffffffff81502f89>] ? __mutex_lock_slowpath+0x364/0x364 [<ffffffff810b2cc2>] ftrace_shutdown+0xd7/0x12b [<ffffffff810b47f0>] unregister_ftrace_graph+0x49/0x78 [<ffffffff810c4b30>] graph_trace_reset+0xe/0x10 [<ffffffff810bf393>] tracing_set_tracer+0xa7/0x26a [<ffffffff810bf5e1>] tracing_set_trace_write+0x8b/0xbd [<ffffffff810c501c>] ? ftrace_return_to_handler+0xb2/0xde [<ffffffff811240a8>] ? __sb_end_write+0x5e/0x5e [<ffffffff81122aed>] vfs_write+0xab/0xf6 [<ffffffff8150a185>] ftrace_graph_caller+0x85/0x85 [<ffffffff81122dbd>] SyS_write+0x59/0x82 [<ffffffff8150a185>] ftrace_graph_caller+0x85/0x85 [<ffffffff8150a2d2>] system_call_fastpath+0x16/0x1b ---[ end trace 940358030751eafb ]--- The above mentioned commit didn't go far enough. Well, it covered the function tracer by adding checks in __register_ftrace_function(). The problem is that the function graph tracer circumvents that (for a slight efficiency gain when function graph trace is running with a function tracer. The gain was not worth this). The problem came with ftrace_startup() which should always be called after __register_ftrace_function(), if you want this bug to be completely fixed. Anyway, this solution moves __register_ftrace_function() inside of ftrace_startup() and removes the need to call them both. Reported-by: Dave Wysochanski <dwysocha@redhat.com> Fixes: ed926f9b35cd ("ftrace: Use counters to enable functions to trace") Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-04 | cpuset: Fix memory allocator deadlock | Peter Zijlstra | 1 | -2/+6
commit 0fc0287c9ed1ffd3706f8b4d9b314aa102ef1245 upstream. Juri hit the below lockdep report: [ 4.303391] ====================================================== [ 4.303392] [ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ] [ 4.303394] 3.12.0-dl-peterz+ #144 Not tainted [ 4.303395] ------------------------------------------------------ [ 4.303397] kworker/u4:3/689 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire: [ 4.303399] (&p->mems_allowed_seq){+.+...}, at: [<ffffffff8114e63c>] new_slab+0x6c/0x290 [ 4.303417] [ 4.303417] and this task is already holding: [ 4.303418] (&(&q->__queue_lock)->rlock){..-...}, at: [<ffffffff812d2dfb>] blk_execute_rq_nowait+0x5b/0x100 [ 4.303431] which would create a new lock dependency: [ 4.303432] (&(&q->__queue_lock)->rlock){..-...} -> (&p->mems_allowed_seq){+.+...} [ 4.303436] [ 4.303898] the dependencies between the lock to be acquired and SOFTIRQ-irq-unsafe lock: [ 4.303918] -> (&p->mems_allowed_seq){+.+...} ops: 2762 { [ 4.303922] HARDIRQ-ON-W at: [ 4.303923] [<ffffffff8108ab9a>] __lock_acquire+0x65a/0x1ff0 [ 4.303926] [<ffffffff8108cbe3>] lock_acquire+0x93/0x140 [ 4.303929] [<ffffffff81063dd6>] kthreadd+0x86/0x180 [ 4.303931] [<ffffffff816ded6c>] ret_from_fork+0x7c/0xb0 [ 4.303933] SOFTIRQ-ON-W at: [ 4.303933] [<ffffffff8108abcc>] __lock_acquire+0x68c/0x1ff0 [ 4.303935] [<ffffffff8108cbe3>] lock_acquire+0x93/0x140 [ 4.303940] [<ffffffff81063dd6>] kthreadd+0x86/0x180 [ 4.303955] [<ffffffff816ded6c>] ret_from_fork+0x7c/0xb0 [ 4.303959] INITIAL USE at: [ 4.303960] [<ffffffff8108a884>] __lock_acquire+0x344/0x1ff0 [ 4.303963] [<ffffffff8108cbe3>] lock_acquire+0x93/0x140 [ 4.303966] [<ffffffff81063dd6>] kthreadd+0x86/0x180 [ 4.303969] [<ffffffff816ded6c>] ret_from_fork+0x7c/0xb0 [ 4.303972] } Which reports that we take mems_allowed_seq with interrupts enabled. A little digging found that this can only be from cpuset_change_task_nodemask(). This is an actual deadlock because an interrupt doing an allocation will hit get_mems_allowed()->...->__read_seqcount_begin(), which will spin forever waiting for the write side to complete. Cc: John Stultz <john.stultz@linaro.org> Cc: Mel Gorman <mgorman@suse.de> Reported-by: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Tested-by: Juri Lelli <juri.lelli@gmail.com> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
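A sketch of the shape of the fix (assuming the write side lives in cpuset_change_task_nodemask(); names taken from the report above, not the verbatim patch): the seqcount write section now runs with interrupts disabled, so an allocation from interrupt context can no longer observe a writer it would spin on forever.

    unsigned long flags;

    /* mems_allowed_seq write side, now irq-safe: an interrupt doing
     * get_mems_allowed()->__read_seqcount_begin() can no longer
     * interrupt a half-finished writer and spin on it forever. */
    local_irq_save(flags);
    write_seqcount_begin(&tsk->mems_allowed_seq);
    tsk->mems_allowed = *newmems;
    write_seqcount_end(&tsk->mems_allowed_seq);
    local_irq_restore(flags);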
2013-12-04 | audit: fix info leak in AUDIT_GET requests | Mathias Krause | 1 | -0/+1
commit 64fbff9ae0a0a843365d922e0057fc785f23f0e3 upstream.

We leak 4 bytes of kernel stack in response to an AUDIT_GET request because we fail to initialize the mask member of status_set. Fix that.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
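To illustrate the pattern (the struct below is a cut-down stand-in, not the real UAPI audit_status; the actual fix is a single line initializing status_set.mask):

    #include <string.h>

    struct audit_status_reply {   /* stand-in for struct audit_status */
        unsigned int mask;        /* the member the old code left uninitialized */
        unsigned int enabled;
        unsigned int failure;
        unsigned int pid;
    };

    static void fill_status(struct audit_status_reply *s)
    {
        /* Zero the whole reply first so any member the code below does
         * not set explicitly cannot leak stale kernel stack bytes. */
        memset(s, 0, sizeof(*s));
        s->enabled = 1;
        s->failure = 1;
    }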
2013-12-04 | audit: use nlmsg_len() to get message payload length | Mathias Krause | 1 | -1/+1
commit 4d8fe7376a12bf4524783dd95cbc00f1fece6232 upstream.

Using the nlmsg_len member of the netlink header to test if the message is valid is wrong, as it includes the size of the netlink header itself, thereby allowing short netlink messages to pass those checks. Use nlmsg_len() instead to test for the right message length. The result of nlmsg_len() is guaranteed to be non-negative as the netlink message has already passed the checks of nlmsg_ok().

Also switch to min_t() to please checkpatch.pl.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
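A small user-space demonstration of the difference, using only the UAPI netlink macros (the 4-byte payload is arbitrary):

    #include <stdio.h>
    #include <linux/netlink.h>

    int main(void)
    {
        /* A message carrying 4 payload bytes: NLMSG_LENGTH() adds the
         * aligned header size on top. */
        struct nlmsghdr nlh = { .nlmsg_len = NLMSG_LENGTH(4) };

        /* Validating payload sizes against nlmsg_len lets short
         * messages through, since the header is counted too. */
        printf("nlmsg_len field: %u\n", nlh.nlmsg_len);

        /* What the kernel's nlmsg_len() helper computes: payload only. */
        printf("payload length : %u\n",
               (unsigned int)(nlh.nlmsg_len - NLMSG_HDRLEN));
        return 0;
    }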
2013-12-04 | audit: printk USER_AVC messages when audit isn't enabled | Tyler Hicks | 1 | -1/+1
commit 0868a5e150bc4c47e7a003367cd755811eb41e0b upstream. When the audit=1 kernel parameter is absent and auditd is not running, AUDIT_USER_AVC messages are being silently discarded. AUDIT_USER_AVC messages should be sent to userspace using printk(), as mentioned in the commit message of 4a4cd633 ("AUDIT: Optimise the audit-disabled case for discarding user messages"). When audit_enabled is 0, audit_receive_msg() discards all user messages except for AUDIT_USER_AVC messages. However, audit_log_common_recv_msg() refuses to allocate an audit_buffer if audit_enabled is 0. The fix is to special case AUDIT_USER_AVC messages in both functions. It looks like commit 50397bd1 ("[AUDIT] clean up audit_receive_msg()") introduced this bug. Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric Paris <eparis@redhat.com> Cc: linux-audit@redhat.com Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-04 | PM / hibernate: Avoid overflow in hibernate_preallocate_memory() | Aaron Lu | 1 | -1/+5
commit fd432b9f8c7c88428a4635b9f5a9c6e174df6e36 upstream.

When the system has a lot of highmem (e.g. 16 GiB using a 32-bit kernel), the code that calculates how much memory we need to preallocate in the normal zone may overflow. As Leon analysed, during computation of the 'alloc' variable there is an overflow:

  alloc = (3943404 - 1970542) - 1978280 = -5418 (signed)

and the function goes to err_out. Fix this by avoiding that overflow.

References: https://bugzilla.kernel.org/show_bug.cgi?id=60817
Reported-and-tested-by: Leon Drugi <eyak@wp.pl>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
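For illustration, a stand-alone model of the failing arithmetic with the page counts from Leon's report (variable names and the guarded form are mine; the in-kernel fix is shaped around snapshot.c's page accounting):

    #include <stdio.h>

    int main(void)
    {
        unsigned long saveable = 3943404;  /* pages we want to keep */
        unsigned long highmem  = 1970542;  /* pages coming from highmem */
        unsigned long max_size = 1978280;  /* pages already accounted for */

        /* Naive form: goes negative and the caller bails out to err_out. */
        long alloc = (long)(saveable - highmem) - (long)max_size;
        printf("naive alloc   = %ld (negative => spurious failure)\n", alloc);

        /* Overflow-safe form: subtract only when the result stays >= 0,
         * otherwise there is simply nothing left to preallocate. */
        unsigned long need = saveable - highmem;
        unsigned long safe = (need > max_size) ? need - max_size : 0;
        printf("guarded alloc = %lu\n", safe);
        return 0;
    }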
2013-12-04 | alarmtimer: return EINVAL instead of ENOTSUPP if rtcdev doesn't exist | KOSAKI Motohiro | 1 | -2/+2
commit 98d6f4dd84a134d942827584a3c5f67ffd8ec35f upstream.

The Fedora Ruby maintainer reported that the latest Ruby doesn't work on Fedora Rawhide on ARM (http://bugs.ruby-lang.org/issues/9008).

This is because commit 1c6b39ad3f (alarmtimers: Return -ENOTSUPP if no RTC device is present) introduced returning ENOTSUPP when clock_get{time,res} can't find an RTC device. However, this is incorrect.

First, ENOTSUPP isn't exported to userland (ENOTSUP or EOPNOTSUPP are the closest userland equivalents). Second, POSIX and the Linux man pages agree that clock_gettime and clock_getres should return EINVAL if the clk_id argument is invalid. While the argument could be made that the clockid is valid but just not supported on this hardware, that is a technicality that doesn't help userspace applications and only complicates error handling.

Thus, this patch changes the code to use EINVAL.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Reported-by: Vit Ondruch <v.ondruch@tiscali.cz>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
[jstultz: Tweaks to commit message to include full rationale]
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
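The change amounts to swapping the error code where the alarmtimer handlers bail out without an RTC device; roughly (a sketch, not the verbatim diff):

    if (!alarmtimer_get_rtcdev())
        return -EINVAL;    /* was: return -ENOTSUPP, which userspace cannot decode */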
2013-11-29 | exec/ptrace: fix get_dumpable() incorrect tests | Kees Cook | 1 | -1/+2
commit d049f74f2dbe71354d43d393ac3a188947811348 upstream.

The get_dumpable() return value is not boolean. Most users of the function actually want to be testing for non-SUID_DUMP_USER(1) rather than SUID_DUMP_DISABLE(0). The SUID_DUMP_ROOT(2) is also considered a protected state. Almost all places did this correctly, excepting the two places fixed in this patch.

Wrong logic:
  if (dumpable == SUID_DUMP_DISABLE) { /* be protective */ }
or
  if (dumpable == 0) { /* be protective */ }
or
  if (!dumpable) { /* be protective */ }

Correct logic:
  if (dumpable != SUID_DUMP_USER) { /* be protective */ }
or
  if (dumpable != 1) { /* be protective */ }

Without this patch, if the system had set the sysctl fs/suid_dumpable=2, a user was able to ptrace attach to processes that had dropped privileges to that user. (This may have been partially mitigated if Yama was enabled.)

The macros have been moved into the file that declares get/set_dumpable(), which means things like the ia64 code can see them too.

CVE-2013-2929

Reported-by: Vasily Kulikov <segoon@openwall.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
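The wrong/correct tests above, spelled out as a compilable toy (the constants mirror the kernel's SUID_DUMP_* values; the harness is mine):

    #include <stdio.h>

    #define SUID_DUMP_DISABLE 0  /* no setuid dumping */
    #define SUID_DUMP_USER    1  /* dump as user of process */
    #define SUID_DUMP_ROOT    2  /* dump as root */

    /* Wrong: treats SUID_DUMP_ROOT (2) as an unprotected state. */
    static int protect_buggy(int dumpable)
    {
        return dumpable == SUID_DUMP_DISABLE;
    }

    /* Correct: anything other than SUID_DUMP_USER is protected. */
    static int protect_fixed(int dumpable)
    {
        return dumpable != SUID_DUMP_USER;
    }

    int main(void)
    {
        for (int d = SUID_DUMP_DISABLE; d <= SUID_DUMP_ROOT; d++)
            printf("dumpable=%d buggy=%d fixed=%d\n",
                   d, protect_buggy(d), protect_fixed(d));
        return 0;
    }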
2013-11-29 | perf/ftrace: Fix paranoid level for enabling function tracer | Steven Rostedt | 1 | -1/+1
commit 12ae030d54ef250706da5642fc7697cc60ad0df7 upstream.

The current default perf paranoid level is "1", which has "perf_paranoid_kernel()" return false and gives normal users access to any operations that use it. Unfortunately, this includes function tracing, and normal users should not be allowed to enable function tracing by default.

The proper level is "-1" (full perf access), which is the only level "perf_paranoid_tracepoint_raw()" grants access at. Use that check instead for enabling function tracing.

Reported-by: Dave Jones <davej@redhat.com>
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Tested-by: Vince Weaver <vincent.weaver@maine.edu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
CVE: CVE-2013-2930
Fixes: ced39002f5ea ("ftrace, perf: Add support to use function tracepoint in perf")
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-11-20 | tracing: Fix potential out-of-bounds in trace_get_user() | Steven Rostedt | 1 | -1/+4
commit 057db8488b53d5e4faa0cedb2f39d4ae75dfbdbb upstream. Andrey reported the following report: ERROR: AddressSanitizer: heap-buffer-overflow on address ffff8800359c99f3 ffff8800359c99f3 is located 0 bytes to the right of 243-byte region [ffff8800359c9900, ffff8800359c99f3) Accessed by thread T13003: #0 ffffffff810dd2da (asan_report_error+0x32a/0x440) #1 ffffffff810dc6b0 (asan_check_region+0x30/0x40) #2 ffffffff810dd4d3 (__tsan_write1+0x13/0x20) #3 ffffffff811cd19e (ftrace_regex_release+0x1be/0x260) #4 ffffffff812a1065 (__fput+0x155/0x360) #5 ffffffff812a12de (____fput+0x1e/0x30) #6 ffffffff8111708d (task_work_run+0x10d/0x140) #7 ffffffff810ea043 (do_exit+0x433/0x11f0) #8 ffffffff810eaee4 (do_group_exit+0x84/0x130) #9 ffffffff810eafb1 (SyS_exit_group+0x21/0x30) #10 ffffffff81928782 (system_call_fastpath+0x16/0x1b) Allocated by thread T5167: #0 ffffffff810dc778 (asan_slab_alloc+0x48/0xc0) #1 ffffffff8128337c (__kmalloc+0xbc/0x500) #2 ffffffff811d9d54 (trace_parser_get_init+0x34/0x90) #3 ffffffff811cd7b3 (ftrace_regex_open+0x83/0x2e0) #4 ffffffff811cda7d (ftrace_filter_open+0x2d/0x40) #5 ffffffff8129b4ff (do_dentry_open+0x32f/0x430) #6 ffffffff8129b668 (finish_open+0x68/0xa0) #7 ffffffff812b66ac (do_last+0xb8c/0x1710) #8 ffffffff812b7350 (path_openat+0x120/0xb50) #9 ffffffff812b8884 (do_filp_open+0x54/0xb0) #10 ffffffff8129d36c (do_sys_open+0x1ac/0x2c0) #11 ffffffff8129d4b7 (SyS_open+0x37/0x50) #12 ffffffff81928782 (system_call_fastpath+0x16/0x1b) Shadow bytes around the buggy address: ffff8800359c9700: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd ffff8800359c9780: fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa fa ffff8800359c9800: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa ffff8800359c9880: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa ffff8800359c9900: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 =>ffff8800359c9980: 00 00 00 00 00 00 00 00 00 00 00 00 00 00[03]fb ffff8800359c9a00: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa ffff8800359c9a80: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa ffff8800359c9b00: fa fa fa fa fa fa fa fa 00 00 00 00 00 00 00 00 ffff8800359c9b80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff8800359c9c00: 00 00 00 00 00 00 00 00 fa fa fa fa fa fa fa fa Shadow byte legend (one shadow byte represents 8 application bytes): Addressable: 00 Partially addressable: 01 02 03 04 05 06 07 Heap redzone: fa Heap kmalloc redzone: fb Freed heap region: fd Shadow gap: fe The out-of-bounds access happens on 'parser->buffer[parser->idx] = 0;' Although the crash happened in ftrace_regex_open() the real bug occurred in trace_get_user() where there's an incrementation to parser->idx without a check against the size. The way it is triggered is if userspace sends in 128 characters (EVENT_BUF_SIZE + 1), the loop that reads the last character stores it and then breaks out because there is no more characters. Then the last character is read to determine what to do next, and the index is incremented without checking size. Then the caller of trace_get_user() usually nulls out the last character with a zero, but since the index is equal to the size, it writes a nul character after the allocated space, which can corrupt memory. Luckily, only root user has write access to this file. Link: http://lkml.kernel.org/r/20131009222323.04fd1a0d@gandalf.local.home Reported-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
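A stand-alone model of the off-by-one and of the added bound check (buffer size and names are illustrative, not the exact kernel ones):

    #include <stdio.h>

    #define BUF_SIZE 128   /* stands in for the 128-byte parser buffer */

    struct parser {
        char buffer[BUF_SIZE];
        unsigned int idx;
        unsigned int size;
    };

    /* Fixed pattern: only advance idx while the caller's trailing NUL
     * still fits, mirroring the size check added to trace_get_user(). */
    static int put_char(struct parser *p, char ch)
    {
        if (p->idx >= p->size - 1)
            return -1;     /* out of room: reject instead of overflowing */
        p->buffer[p->idx++] = ch;
        return 0;
    }

    int main(void)
    {
        struct parser p = { .idx = 0, .size = BUF_SIZE };
        int dropped = 0;

        /* Feed one character more than fits, as the reproducer does. */
        for (int i = 0; i < BUF_SIZE + 1; i++)
            if (put_char(&p, 'a'))
                dropped++;

        p.buffer[p.idx] = 0;   /* the caller's NUL is now always in bounds */
        printf("stored %u, dropped %d\n", p.idx, dropped);
        return 0;
    }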
2013-11-13 | clockevents: Sanitize ticks to nsec conversion | Thomas Gleixner | 1 | -15/+50
commit 97b9410643475d6557d2517c2aff9fd2221141a9 upstream.

Marc Kleine-Budde pointed out that commit 77cc982 "clocksource: use clockevents_config_and_register() where possible" caused a regression for some of the converted subarchs.

The reason is that the clockevents core code converts the minimal hardware tick delta to a nanosecond value for core internal usage. This conversion is affected by integer math rounding loss, so the backwards conversion to hardware ticks will likely result in a value which is less than the configured hardware limitation. The affected subarchs used their own workaround (SIGH!) which got lost in the conversion.

The solution for the issue at hand is simple: adding evt->mult - 1 to the shifted value before the integer division in the core conversion function takes care of it. But this only works for the case where for the scaled math mult/shift pair "mult <= 1 << shift" is true. For the case where "mult > 1 << shift" we can apply the rounding add only for the minimum delta value to make sure that the backward conversion is not less than the given hardware limit. For the upper bound we need to omit the rounding add, because the backwards conversion is always larger than the original latch value. That would violate the upper bound of the hardware device.

Though looking closer at the details of that function reveals another bogosity: the upper bounds check is broken as well. Checking for a resulting "clc" value greater than KTIME_MAX after the conversion is pointless. The conversion does:

  u64 clc = (latch << evt->shift) / evt->mult;

So there is no sanity check for (latch << evt->shift) exceeding the 64bit boundary. The latch argument is "unsigned long", so on a 64bit arch the handed in argument could easily lead to an unnoticed shift overflow. With the above rounding fix applied, the calculation before the division is:

  u64 clc = (latch << evt->shift) + evt->mult - 1;

So we need to make sure that neither the shift nor the rounding add is overflowing the u64 boundary.

[ukl: move assignment to rnd after eventually changing mult, fix build issue and correct comment with the right math]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Russell King - ARM Linux <linux@arm.linux.org.uk>
Cc: Marc Kleine-Budde <mkl@pengutronix.de>
Cc: nicolas.ferre@atmel.com
Cc: Marc Pignat <marc.pignat@hevs.ch>
Cc: john.stultz@linaro.org
Cc: kernel@pengutronix.de
Cc: Ronald Wahl <ronald.wahl@raritan.com>
Cc: LAK <linux-arm-kernel@lists.infradead.org>
Cc: Ludovic Desroches <ludovic.desroches@atmel.com>
Link: http://lkml.kernel.org/r/1380052223-24139-1-git-send-email-u.kleine-koenig@pengutronix.de
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
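The two conversions side by side as a runnable toy; the mult/shift/latch values are invented to exercise the "mult > 1 << shift" case described above:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint64_t mult = 183, shift = 5, latch = 2;  /* device-specific in reality */

        uint64_t ns_trunc = (latch << shift) / mult;              /* old: truncates */
        uint64_t ns_round = ((latch << shift) + mult - 1) / mult; /* new: rounds up */

        /* Converting back to ticks (ns * mult >> shift) must not fall
         * below the hardware minimum that 'latch' encodes. */
        printf("truncated: ns=%llu back=%llu (latch was %llu)\n",
               (unsigned long long)ns_trunc,
               (unsigned long long)(ns_trunc * mult >> shift),
               (unsigned long long)latch);
        printf("rounded  : ns=%llu back=%llu\n",
               (unsigned long long)ns_round,
               (unsigned long long)(ns_round * mult >> shift));
        return 0;
    }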
2013-10-01 | perf: Fix perf_cgroup_switch for sw-events | Peter Zijlstra | 1 | -3/+6
commit 95cf59ea72331d0093010543b8951bb43f262cac upstream. Jiri reported that he could trigger the WARN_ON_ONCE() in perf_cgroup_switch() using sw-events. This is because sw-events share a cpuctx with multiple PMUs. Use the ->unique_pmu pointer to limit the pmu iteration to unique cpuctx instances. Reported-and-Tested-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-so7wi2zf3jjzrwcutm2mkz0j@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-10-01 | perf: Clarify perf_cpu_context::active_pmu usage by renaming it to ::unique_pmu | Peter Zijlstra | 1 | -6/+6
commit 3f1f33206c16c7b3839d71372bc2ac3f305aa802 upstream.

Stephane thought the perf_cpu_context::active_pmu name confusing and suggested using 'unique_pmu' instead.

This pointer is a pointer to a 'random' pmu sharing the cpuctx instance; therefore, limiting a for_each_pmu loop to those where cpuctx->unique_pmu matches the pmu gives us a loop over unique cpuctx instances.

Suggested-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-kxyjqpfj2fn9gt7kwu5ag9ks@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-10-01 | cgroup: fail if monitored file and event_control are in different cgroup | Li Zefan | 1 | -0/+11
commit f169007b2773f285e098cb84c74aac0154d65ff7 upstream. If we pass fd of memory.usage_in_bytes of cgroup A to cgroup.event_control of cgroup B, then we won't get memory usage notification from A but B! What's worse, if A and B are in different mount hierarchy, we'll end up accessing NULL pointer! Disallow this kind of invalid usage. Signed-off-by: Li Zefan <lizefan@huawei.com> Acked-by: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Weng Meiling <wengmeiling.weng@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-10-01 | sched/fair: Fix small race where child->se.parent,cfs_rq might point to invalid ones | Daisuke Nishimura | 1 | -5/+9
commit 6c9a27f5da9609fca46cb2b183724531b48f71ad upstream.

There is a small race between copy_process() and cgroup_attach_task() where child->se.parent,cfs_rq points to invalid (old) ones.

  parent doing fork()          | someone moving the parent to another cgroup
  -----------------------------+---------------------------------------------
  copy_process()
  + dup_task_struct()
    -> parent->se is copied to child->se.
       se.parent,cfs_rq of them point to old ones.
                               | cgroup_attach_task()
                               | + cgroup_task_migrate()
                               |   -> parent->cgroup is updated.
                               | + cpu_cgroup_attach()
                               |   + sched_move_task()
                               |     + task_move_group_fair()
                               |       +- set_task_rq()
                               |          -> se.parent,cfs_rq of parent
                               |             are updated.
  + cgroup_fork()
    -> parent->cgroup is copied to child->cgroup. (*1)
  + sched_fork()
    + task_fork_fair()
      -> se.parent,cfs_rq of child are accessed
         while they point to old ones. (*2)

In the worst case, this bug can lead to "use-after-free" and cause a panic, because it's the new cgroup's refcount that is incremented at (*1), so the old cgroup (and related data) can be freed before (*2).

In fact, a panic caused by this bug was originally caught in RHEL6.4:

  BUG: unable to handle kernel NULL pointer dereference at (null)
  IP: [<ffffffff81051e3e>] sched_slice+0x6e/0xa0
  [...]
  Call Trace:
   [<ffffffff81051f25>] place_entity+0x75/0xa0
   [<ffffffff81056a3a>] task_fork_fair+0xaa/0x160
   [<ffffffff81063c0b>] sched_fork+0x6b/0x140
   [<ffffffff8106c3c2>] copy_process+0x5b2/0x1450
   [<ffffffff81063b49>] ? wake_up_new_task+0xd9/0x130
   [<ffffffff8106d2f4>] do_fork+0x94/0x460
   [<ffffffff81072a9e>] ? sys_wait4+0xae/0x100
   [<ffffffff81009598>] sys_clone+0x28/0x30
   [<ffffffff8100b393>] stub_clone+0x13/0x20
   [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/039601ceae06$733d3130$59b79390$@mxp.nes.nec.co.jp
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-29 | workqueue: consider work function when searching for busy work items | Tejun Heo | 1 | -11/+31
commit a2c1c57be8d9fd5b716113c8991d3d702eeacf77 upstream.

To avoid executing the same work item concurrently, workqueue hashes currently busy workers according to their current work items and looks up the table when it wants to execute a new work item. If there already is a worker which is executing the new work item, the new item is queued to the found worker so that it gets executed only after the current execution finishes.

Unfortunately, a work item may be freed while being executed and thus recycled for different purposes. If it gets recycled for a different work item and queued while the previous execution is still in progress, workqueue may make the new work item wait for the old one although the two aren't really related in any way. In extreme cases, this false dependency may lead to deadlock although it's extremely unlikely given that there aren't too many self-freeing work item users and they usually don't wait for other work items.

To alleviate the problem, record the current work function in each busy worker and match it together with the work item address in find_worker_executing_work(). While this isn't complete, it ensures that unrelated work items don't interact with each other and in the very unlikely case where a twisted wq user triggers it, it's always onto itself making the culprit easy to spot.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Andrey Isakov <andy51@gmx.ru>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=51701
[lizf: Backported to 3.4:
 - Adjust context
 - Incorporate earlier logging cleanup in process_one_work() from 044c782ce3a9 ('workqueue: fix checkpatch issues')]
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
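A minimal sketch of the strengthened match (types simplified; the real find_worker_executing_work() walks a hash keyed by the work address and additionally compares the recorded function):

    typedef void (*work_func_t)(void *);

    struct worker {
        void *current_work;        /* work item currently being executed */
        work_func_t current_func;  /* its callback, recorded at dispatch time */
    };

    /* A submission counts as "already running" only when both the
     * address and the function match, so a freed-and-recycled
     * allocation carrying a different callback creates no false
     * dependency. */
    static int already_executing(const struct worker *w,
                                 const void *work, work_func_t func)
    {
        return w->current_work == work && w->current_func == func;
    }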
2013-08-29 | workqueue: fix possible stall on try_to_grab_pending() of a delayed work item | Lai Jiangshan | 1 | -3/+22
commit 3aa62497594430ea522050b75c033f71f2c60ee6 upstream.

Currently, when try_to_grab_pending() grabs a delayed work item, it leaves its linked work items alone on the delayed_works. The linked work items are always NO_COLOR and will cause future cwq_activate_first_delayed() to increase cwq->nr_active incorrectly, and may cause the whole cwq to stall. For example:

  state: cwq->max_active = 1, cwq->nr_active = 1
         one work in cwq->pool, many in cwq->delayed_works.

  step1: try_to_grab_pending() removes a work item from delayed_works
         but leaves its NO_COLOR linked work items on it.

  step2: Later on, cwq_activate_first_delayed() activates the linked
         work item, increasing ->nr_active.

  step3: cwq->nr_active = 1, but all activated work items of the cwq
         are NO_COLOR. When they finish, cwq->nr_active will not be
         decreased due to NO_COLOR, and no further work items will be
         activated from cwq->delayed_works. The cwq stalls.

Fix it by ensuring the target work item is activated before stealing PENDING in try_to_grab_pending(). This ensures that all the linked work items are activated without incorrectly bumping cwq->nr_active.

tj: Updated comment and description.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
[lizf: backported to 3.4: adjust context]
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-20 | futex: Take hugepages into account when generating futex_key | Zhang Yi | 1 | -1/+2
commit 13d60f4b6ab5b702dc8d2ee20999f98a93728aec upstream.

The futex_keys of process-shared futexes are generated from the page offset, the mapping host and the mapping index of the futex user space address. This should result in a unique identifier for each futex.

Though this is not true when futexes are located in different subpages of a hugepage. The reason is that the mapping index for all those futexes evaluates to the index of the base page of the hugetlbfs mapping. So a futex at offset 0 of the hugepage mapping and another one at offset PAGE_SIZE of the same hugepage mapping have identical futex_keys. This happens because the futex code blindly uses page->index.

Steps to reproduce the bug:

1. Map a file from hugetlbfs. Initialize pthread_mutex1 at offset 0 and pthread_mutex2 at offset PAGE_SIZE of the hugetlbfs mapping.

   The mutexes must be initialized as PTHREAD_PROCESS_SHARED because PTHREAD_PROCESS_PRIVATE mutexes are not affected by this issue as their keys solely depend on the user space address.

2. Lock mutex1 and mutex2.

3. Create thread1 and in the thread function lock mutex1, which results in thread1 blocking on the locked mutex1.

4. Create thread2 and in the thread function lock mutex2, which results in thread2 blocking on the locked mutex2.

5. Unlock mutex2. Despite the fact that mutex2 got unlocked, thread2 still blocks on mutex2 because the futex_key points to mutex1.

To solve this issue we need to take the normal page index of the page which contains the futex into account, if the futex is in a hugetlbfs mapping. In other words, we calculate the normal page mapping index of the subpage in the hugetlbfs mapping.

Mappings which are not based on hugetlbfs are not affected and still use page->index.

Thanks to Mel Gorman who provided a patch for adding proper evaluation functions to the hugetlbfs code to avoid exposing hugetlbfs specific details to the futex code.

[ tglx: Massaged changelog ]

Signed-off-by: Zhang Yi <zhang.yi20@zte.com.cn>
Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn>
Tested-by: Ma Chenggong <ma.chenggong@zte.com.cn>
Reviewed-by: 'Mel Gorman' <mgorman@suse.de>
Acked-by: 'Darren Hart' <dvhart@linux.intel.com>
Cc: 'Peter Zijlstra' <peterz@infradead.org>
Link: http://lkml.kernel.org/r/000101ce71a6%24a83c5880%24f8b50980%24@com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mike Galbraith <mgalbraith@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
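A user-space model of the index problem (shifts and offsets are illustrative; the real fix adds hugetlbfs helper functions rather than open-coding this in futex.c):

    #include <stdio.h>

    int main(void)
    {
        unsigned long base_shift = 12;  /* 4 KiB base pages */
        unsigned long huge_shift = 21;  /* 2 MiB huge pages */
        unsigned long per_huge = 1UL << (huge_shift - base_shift);

        unsigned long head_index = 0;        /* hugetlbfs index of the huge page */
        unsigned long off1 = 0, off2 = 4096; /* mutex1 / mutex2 offsets */

        /* buggy: page->index of the head page, identical for both */
        printf("buggy indices: %lu %lu\n", head_index, head_index);

        /* fixed: normal-page index of each subpage within the mapping */
        printf("fixed indices: %lu %lu\n",
               head_index * per_huge + (off1 >> base_shift),
               head_index * per_huge + (off2 >> base_shift));
        return 0;
    }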
2013-08-14 | tracing: Fix fields of struct trace_iterator that are zeroed by mistake | Andrew Vagin | 1 | -0/+1
commit ed5467da0e369e65b247b99eb6403cb79172bcda upstream.

tracing_read_pipe() zeros all fields below "seq". The declaration contains a comment about that, but it doesn't help.

The first field is "snapshot"; it's true when the current open file is a snapshot. It looks obvious that it should not be zeroed.

The second field is "started". It was converted from cpumask_t to cpumask_var_t (v2.6.28-4983-g4462344); in other words, it was converted from a cpumask to a pointer to a cpumask. Currently the reference to the "started" memory is lost after the first read from tracing_read_pipe and the object is never properly freed.

The "started" field is never dereferenced for trace_pipe, because trace_pipe can't have the TRACE_FILE_ANNOTATE option.

Link: http://lkml.kernel.org/r/1375463803-3085183-1-git-send-email-avagin@openvz.org
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-11 | perf: Use css_tryget() to avoid propping up css refcount | Salman Qazi | 1 | -3/+7
commit 9c5da09d266ca9b32eb16cf940f8161d949c2fe5 upstream.

An rmdir pushes css's ref count to zero. However, if the associated directory is open at the time, the dentry ref count is non-zero. If the fd for this directory is then passed into perf_event_open, it does a css_get(). This bounces the ref count back up from zero. This is a problem by itself. But what makes it turn into a crash is the fact that we end up doing an extra dput, since we perform a dput when css_put sees the ref count go down to zero.

css_tryget() does not fall into that trap. So, we use that instead.

Reproduction test-case for the bug:

  #include <unistd.h>
  #include <sys/types.h>
  #include <sys/stat.h>
  #include <fcntl.h>
  #include <linux/unistd.h>
  #include <linux/perf_event.h>
  #include <string.h>
  #include <errno.h>
  #include <stdio.h>

  #define PERF_FLAG_PID_CGROUP    (1U << 2)

  int perf_event_open(struct perf_event_attr *hw_event_uptr,
                      pid_t pid, int cpu, int group_fd, unsigned long flags)
  {
          return syscall(__NR_perf_event_open, hw_event_uptr, pid, cpu,
                         group_fd, flags);
  }

  /*
   * Directly poke at the perf_event bug, since it's proving hard to repro
   * depending on where in the kernel tree. what moved?
   */
  int main(int argc, char **argv)
  {
          int fd;
          struct perf_event_attr attr;

          memset(&attr, 0, sizeof(attr));
          attr.exclude_kernel = 1;
          attr.size = sizeof(attr);
          mkdir("/dev/cgroup/perf_event/blah", 0777);
          fd = open("/dev/cgroup/perf_event/blah", O_RDONLY);
          perror("open");
          rmdir("/dev/cgroup/perf_event/blah");
          sleep(2);
          perf_event_open(&attr, fd, 0, -1, PERF_FLAG_PID_CGROUP);
          perror("perf_event_open");
          close(fd);
          return 0;
  }

Signed-off-by: Salman Qazi <sqazi@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/20120614223108.1025.2503.stgit@dungbeetle.mtv.corp.google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-11 | perf: Fix event group context move | Jiri Olsa | 1 | -2/+18
commit 0231bb5336758426b44ccd798ccd3c5419c95d58 upstream.

When we have a group with mixed events (hw/sw) we want to end up with the group leader being in the hw context. So if the group leader is initially a sw event, we move all the events under the hw context.

The move is done for each event by removing it from its context and adding it back into the proper one. As a part of the removal the event is automatically disabled, which is not what we want at this stage of creating groups.

The fix is to initialize the event state after removal from the sw context.

This fix resulted from the following discussion: http://thread.gmane.org/gmane.linux.kernel.perf.user/1144

Reported-by: Andreas Hollmann <hollmann@in.tum.de>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Cc: Vince Weaver <vince@deater.net>
Link: http://lkml.kernel.org/r/1359714225-4231-1-git-send-email-jolsa@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
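A sketch of the fix's shape in the group-move path (hedged: the actual patch factors the state initialization into a small helper shared with event creation):

    /* Removing the leader from its (sw) context auto-disables it; reset
     * it to the state a freshly created event would have before adding
     * it to the hw context, or it stays disabled forever. */
    perf_remove_from_context(group_leader);
    group_leader->state = PERF_EVENT_STATE_INACTIVE;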
2013-08-11 | sched: Fix the broken sched_rr_get_interval() | Zhu Yanhai | 1 | -1/+1
commit a59f4e079d19464eebb9b06513a1d4f55fdae5ba upstream.

The caller of sched_slice() should pass se.cfs_rq and se as the arguments; however, in sched_rr_get_interval() we gave it rq.cfs_rq and se, which made the following computation obviously wrong.

The change was introduced by commit: 77034937dc45 sched: fix crash in sys_sched_rr_get_interval() ... 5 years ago, while it had been the correct 'cfs_rq_of' before the commit. The change seems to be irrelevant to the commit msg, which was to return a 0 timeslice for tasks that are on an idle runqueue. So I believe that was just a plain typo.

Signed-off-by: Zhu Yanhai <gaoyang.zyh@taobao.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Turner <pjt@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1357621012-15039-1-git-send-email-gaoyang.zyh@taobao.com
[ Since this is an ABI and an old bug, we'll test this via a slow upstream route, to hopefully discover any app breakage. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
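Approximately, the one-line change this describes (from the fair-class rr_interval path; a sketch, not the verbatim diff):

    /* before: slice computed against the rq's root cfs_rq */
    rr_interval = NS_TO_JIFFIES(sched_slice(&rq->cfs, se));

    /* after: use the cfs_rq the sched entity is actually queued on */
    rr_interval = NS_TO_JIFFIES(sched_slice(cfs_rq_of(se), se));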
2013-08-04 | tracing: Fix irqs-off tag display in syscall tracing | zhangwei(Jovi) | 1 | -0/+3
commit 11034ae9c20f4057a6127fc965906417978e69b2 upstream

Initialization of the variables irq_flags and pc was missed when backporting 11034ae9c to linux-3.0.y and linux-3.4.y; my fault.

Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-07-28 | hrtimers: Move SMP function call to thread context | Thomas Gleixner | 1 | -15/+13
commit 5ec2481b7b47a4005bb446d176e5d0257400c77d upstream. smp_call_function_* must not be called from softirq context. But clock_was_set() which calls on_each_cpu() is called from softirq context to implement a delayed clock_was_set() for the timer interrupt handler. Though that almost never gets invoked. A recent change in the resume code uses the softirq based delayed clock_was_set to support Xens resume mechanism. linux-next contains a new warning which warns if smp_call_function_* is called from softirq context which gets triggered by that Xen change. Fix this by moving the delayed clock_was_set() call to a work context. Reported-and-tested-by: Artem Savkov <artem.savkov@gmail.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com>, Cc: Konrad Wilk <konrad.wilk@oracle.com> Cc: John Stultz <john.stultz@linaro.org> Cc: xen-devel@lists.xen.org Cc: stable@vger.kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-07-28 | tracing: Fix irqs-off tag display in syscall tracing | zhangwei(Jovi) | 1 | -4/+14
commit 11034ae9c20f4057a6127fc965906417978e69b2 upstream.

All syscall tracing irqs-off tags are wrong; the syscall enter entry doesn't disable irqs.

 [root@jovi tracing]# echo "syscalls:sys_enter_open" > set_event
 [root@jovi tracing]# cat trace
 # tracer: nop
 #
 # entries-in-buffer/entries-written: 13/13   #P:2
 #
 #                              _-----=> irqs-off
 #                             / _----=> need-resched
 #                            | / _---=> hardirq/softirq
 #                            || / _--=> preempt-depth
 #                            ||| /     delay
 #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
 #              | |       |   ||||       |         |
       irqbalance-513   [000] d...  56115.496766: sys_open(filename: 804e1a6, flags: 0, mode: 1b6)
       irqbalance-513   [000] d...  56115.497008: sys_open(filename: 804e1bb, flags: 0, mode: 1b6)
         sendmail-771   [000] d...  56115.827982: sys_open(filename: b770e6d1, flags: 0, mode: 1b6)

The reason is that syscall tracing doesn't record irq_flags into the buffer. The proper display is:

 [root@jovi tracing]# echo "syscalls:sys_enter_open" > set_event
 [root@jovi tracing]# cat trace
 # tracer: nop
 #
 # entries-in-buffer/entries-written: 14/14   #P:2
 #
 #                              _-----=> irqs-off
 #                             / _----=> need-resched
 #                            | / _---=> hardirq/softirq
 #                            || / _--=> preempt-depth
 #                            ||| /     delay
 #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
 #              | |       |   ||||       |         |
       irqbalance-514   [001] ....     46.213921: sys_open(filename: 804e1a6, flags: 0, mode: 1b6)
       irqbalance-514   [001] ....     46.214160: sys_open(filename: 804e1bb, flags: 0, mode: 1b6)
            <...>-920   [001] ....     47.307260: sys_open(filename: 4e82a0c5, flags: 80000, mode: 0)

Link: http://lkml.kernel.org/r/1365564393-10972-3-git-send-email-jovi.zhangwei@huawei.com
Cc: stable@vger.kernel.org # 2.6.35
Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
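Sketch of the bookkeeping the syscall-entry probe was missing (the two values are then passed along when reserving the ring-buffer event; not the verbatim diff):

    unsigned long irq_flags;
    int pc;

    /* Record the real interrupt state and preempt depth at syscall
     * entry; recording zeroes here is what made every event show up
     * with a bogus 'd' (irqs-off) tag. */
    local_save_flags(irq_flags);
    pc = preempt_count();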
2013-07-28 | perf: Fix perf_lock_task_context() vs RCU | Peter Zijlstra | 1 | -1/+14
commit 058ebd0eba3aff16b144eabf4510ed9510e1416e upstream. Jiri managed to trigger this warning: [] ====================================================== [] [ INFO: possible circular locking dependency detected ] [] 3.10.0+ #228 Tainted: G W [] ------------------------------------------------------- [] p/6613 is trying to acquire lock: [] (rcu_node_0){..-...}, at: [<ffffffff810ca797>] rcu_read_unlock_special+0xa7/0x250 [] [] but task is already holding lock: [] (&ctx->lock){-.-...}, at: [<ffffffff810f2879>] perf_lock_task_context+0xd9/0x2c0 [] [] which lock already depends on the new lock. [] [] the existing dependency chain (in reverse order) is: [] [] -> #4 (&ctx->lock){-.-...}: [] -> #3 (&rq->lock){-.-.-.}: [] -> #2 (&p->pi_lock){-.-.-.}: [] -> #1 (&rnp->nocb_gp_wq[1]){......}: [] -> #0 (rcu_node_0){..-...}: Paul was quick to explain that due to preemptible RCU we cannot call rcu_read_unlock() while holding scheduler (or nested) locks when part of the read side critical section was preemptible. Therefore solve it by making the entire RCU read side non-preemptible. Also pull out the retry from under the non-preempt to play nice with RT. Reported-by: Jiri Olsa <jolsa@redhat.com> Helped-out-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-07-28 | perf: Remove WARN_ON_ONCE() check in __perf_event_enable() for valid scenario | Jiri Olsa | 1 | -1/+10
commit 06f417968beac6e6b614e17b37d347aa6a6b1d30 upstream. The '!ctx->is_active' check has a valid scenario, so there's no need for the warning. The reason is that there's a time window between the 'ctx->is_active' check in the perf_event_enable() function and the __perf_event_enable() function having: - IRQs on - ctx->lock unlocked where the task could be killed and 'ctx' deactivated by perf_event_exit_task(), ending up with the warning below. So remove the WARN_ON_ONCE() check and add comments to explain it all. This addresses the following warning reported by Vince Weaver: [ 324.983534] ------------[ cut here ]------------ [ 324.984420] WARNING: at kernel/events/core.c:1953 __perf_event_enable+0x187/0x190() [ 324.984420] Modules linked in: [ 324.984420] CPU: 19 PID: 2715 Comm: nmi_bug_snb Not tainted 3.10.0+ #246 [ 324.984420] Hardware name: Supermicro X8DTN/X8DTN, BIOS 4.6.3 01/08/2010 [ 324.984420] 0000000000000009 ffff88043fce3ec8 ffffffff8160ea0b ffff88043fce3f00 [ 324.984420] ffffffff81080ff0 ffff8802314fdc00 ffff880231a8f800 ffff88043fcf7860 [ 324.984420] 0000000000000286 ffff880231a8f800 ffff88043fce3f10 ffffffff8108103a [ 324.984420] Call Trace: [ 324.984420] <IRQ> [<ffffffff8160ea0b>] dump_stack+0x19/0x1b [ 324.984420] [<ffffffff81080ff0>] warn_slowpath_common+0x70/0xa0 [ 324.984420] [<ffffffff8108103a>] warn_slowpath_null+0x1a/0x20 [ 324.984420] [<ffffffff81134437>] __perf_event_enable+0x187/0x190 [ 324.984420] [<ffffffff81130030>] remote_function+0x40/0x50 [ 324.984420] [<ffffffff810e51de>] generic_smp_call_function_single_interrupt+0xbe/0x130 [ 324.984420] [<ffffffff81066a47>] smp_call_function_single_interrupt+0x27/0x40 [ 324.984420] [<ffffffff8161fd2f>] call_function_single_interrupt+0x6f/0x80 [ 324.984420] <EOI> [<ffffffff816161a1>] ? _raw_spin_unlock_irqrestore+0x41/0x70 [ 324.984420] [<ffffffff8113799d>] perf_event_exit_task+0x14d/0x210 [ 324.984420] [<ffffffff810acd04>] ? switch_task_namespaces+0x24/0x60 [ 324.984420] [<ffffffff81086946>] do_exit+0x2b6/0xa40 [ 324.984420] [<ffffffff8161615c>] ? _raw_spin_unlock_irq+0x2c/0x30 [ 324.984420] [<ffffffff81087279>] do_group_exit+0x49/0xc0 [ 324.984420] [<ffffffff81096854>] get_signal_to_deliver+0x254/0x620 [ 324.984420] [<ffffffff81043057>] do_signal+0x57/0x5a0 [ 324.984420] [<ffffffff8161a164>] ? __do_page_fault+0x2a4/0x4e0 [ 324.984420] [<ffffffff8161665c>] ? retint_restore_args+0xe/0xe [ 324.984420] [<ffffffff816166cd>] ? retint_signal+0x11/0x84 [ 324.984420] [<ffffffff81043605>] do_notify_resume+0x65/0x80 [ 324.984420] [<ffffffff81616702>] retint_signal+0x46/0x84 [ 324.984420] ---[ end trace 442ec2f04db3771a ]--- Reported-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Jiri Olsa <jolsa@redhat.com> Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1373384651-6109-2-git-send-email-jolsa@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
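The resulting shape of the check, as a sketch (comment condensed from the explanation above):

    /* The context may have been deactivated by perf_event_exit_task()
     * in the window before this cross-call ran; that is a legal race,
     * so bail out quietly instead of warning. */
    if (!ctx->is_active)
        return -EINVAL;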
2013-07-28 | perf: Clone child context from parent context pmu | Jiri Olsa | 1 | -1/+1
commit 734df5ab549ca44f40de0f07af1c8803856dfb18 upstream.

Currently, when the child context for inherited events is created, it's based on the pmu object of the first event of the parent context. This is wrong for the following scenario:

- HW context having a HW and a SW event
- HW event got removed (closed)
- SW event stays in the HW context as the only event and its pmu is used to clone the child context

The issue starts when the cpu context object is touched based on the pmu context object (__get_cpu_context). In this case the HW context will work with the SW cpu context, ending up with the WARN below.

Fix this by using the parent context's pmu object to clone the child context from.

Addresses the following warning reported by Vince Weaver:

[ 2716.472065] ------------[ cut here ]------------
[ 2716.476035] WARNING: at kernel/events/core.c:2122 task_ctx_sched_out+0x3c/0x)
[ 2716.476035] Modules linked in: nfsd auth_rpcgss oid_registry nfs_acl nfs locn
[ 2716.476035] CPU: 0 PID: 3164 Comm: perf_fuzzer Not tainted 3.10.0-rc4 #2
[ 2716.476035] Hardware name: AOpen DE7000/nMCP7ALPx-DE R1.06 Oct.19.2012, BI2
[ 2716.476035] 0000000000000000 ffffffff8102e215 0000000000000000 ffff88011fc18
[ 2716.476035] ffff8801175557f0 0000000000000000 ffff880119fda88c ffffffff810ad
[ 2716.476035] ffff880119fda880 ffffffff810af02a 0000000000000009 ffff880117550
[ 2716.476035] Call Trace:
[ 2716.476035] [<ffffffff8102e215>] ? warn_slowpath_common+0x5b/0x70
[ 2716.476035] [<ffffffff810ab2bd>] ? task_ctx_sched_out+0x3c/0x5f
[ 2716.476035] [<ffffffff810af02a>] ? perf_event_exit_task+0xbf/0x194
[ 2716.476035] [<ffffffff81032a37>] ? do_exit+0x3e7/0x90c
[ 2716.476035] [<ffffffff810cd5ab>] ? __do_fault+0x359/0x394
[ 2716.476035] [<ffffffff81032fe6>] ? do_group_exit+0x66/0x98
[ 2716.476035] [<ffffffff8103dbcd>] ? get_signal_to_deliver+0x479/0x4ad
[ 2716.476035] [<ffffffff810ac05c>] ? __perf_event_task_sched_out+0x230/0x2d1
[ 2716.476035] [<ffffffff8100205d>] ? do_signal+0x3c/0x432
[ 2716.476035] [<ffffffff810abbf9>] ? ctx_sched_in+0x43/0x141
[ 2716.476035] [<ffffffff810ac2ca>] ? perf_event_context_sched_in+0x7a/0x90
[ 2716.476035] [<ffffffff810ac311>] ? __perf_event_task_sched_in+0x31/0x118
[ 2716.476035] [<ffffffff81050dd9>] ? mmdrop+0xd/0x1c
[ 2716.476035] [<ffffffff81051a39>] ? finish_task_switch+0x7d/0xa6
[ 2716.476035] [<ffffffff81002473>] ? do_notify_resume+0x20/0x5d
[ 2716.476035] [<ffffffff813654f5>] ? retint_signal+0x3d/0x78
[ 2716.476035] ---[ end trace 827178d8a5966c3d ]---

Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1373384651-6109-1-git-send-email-jolsa@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>