| Age | Commit message (Collapse) | Author | Files | Lines |
|
commit f36d1be2930ede0a1947686e1126ffda5d5ee1bb upstream.
When we do cat /sys/kernel/debug/tracing/printk_formats, we hit kernel
panic at t_show.
general protection fault: 0000 [#1] PREEMPT SMP
CPU: 0 PID: 2957 Comm: sh Tainted: G W O 3.14.55-x86_64-01062-gd4acdc7 #2
RIP: 0010:[<ffffffff811375b2>]
[<ffffffff811375b2>] t_show+0x22/0xe0
RSP: 0000:ffff88002b4ebe80 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000004
RDX: 0000000000000004 RSI: ffffffff81fd26a6 RDI: ffff880032f9f7b1
RBP: ffff88002b4ebe98 R08: 0000000000001000 R09: 000000000000ffec
R10: 0000000000000000 R11: 000000000000000f R12: ffff880004d9b6c0
R13: 7365725f6d706400 R14: ffff880004d9b6c0 R15: ffffffff82020570
FS: 0000000000000000(0000) GS:ffff88003aa00000(0063) knlGS:00000000f776bc40
CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
CR2: 00000000f6c02ff0 CR3: 000000002c2b3000 CR4: 00000000001007f0
Call Trace:
[<ffffffff811dc076>] seq_read+0x2f6/0x3e0
[<ffffffff811b749b>] vfs_read+0x9b/0x160
[<ffffffff811b7f69>] SyS_read+0x49/0xb0
[<ffffffff81a3a4b9>] ia32_do_call+0x13/0x13
---[ end trace 5bd9eb630614861e ]---
Kernel panic - not syncing: Fatal exception
When the first time find_next calls find_next_mod_format, it should
iterate the trace_bprintk_fmt_list to find the first print format of
the module. However in current code, start_index is smaller than *pos
at first, and code will not iterate the list. Latter container_of will
get the wrong address with former v, which will cause mod_fmt be a
meaningless object and so is the returned mod_fmt->fmt.
This patch will fix it by correcting the start_index. After fixed,
when the first time calls find_next_mod_format, start_index will be
equal to *pos, and code will iterate the trace_bprintk_fmt_list to
get the right module printk format, so is the returned mod_fmt->fmt.
Link: http://lkml.kernel.org/r/5684B900.9000309@intel.com
Fixes: 102c9323c35a8 "tracing: Add __tracepoint_string() to export string pointers"
Signed-off-by: Qiu Peiyang <peiyangx.qiu@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 6224beb12e190ff11f3c7d4bf50cb2922878f600 upstream.
Fengguang Wu's tests triggered a bug in the branch tracer's start up
test when CONFIG_DEBUG_PREEMPT set. This was because that config
adds some debug logic in the per cpu field, which calls back into
the branch tracer.
The branch tracer has its own recursive checks, but uses a per cpu
variable to implement it. If retrieving the per cpu variable calls
back into the branch tracer, you can see how things will break.
Instead of using a per cpu variable, use the trace_recursion field
of the current task struct. Simply set a bit when entering the
branch tracing and clear it when leaving. If the bit is set on
entry, just don't do the tracing.
There's also the case with lockdep, as the local_irq_save() called
before the recursion can also trigger code that can call back into
the function. Changing that to a raw_local_irq_save() will protect
that as well.
This prevents the recursion and the inevitable crash that follows.
Link: http://lkml.kernel.org/r/20150630141803.GA28071@wfg-t540p.sh.intel.com
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Tested-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 6b88f44e161b9ee2a803e5b2b1fbcf4e20e8b980 upstream.
While debugging a WARN_ON() for filtering, I found that it is possible
for the filter string to be referenced after its end. With the filter:
# echo '>' > /sys/kernel/debug/events/ext4/ext4_truncate_exit/filter
The filter_parse() function can call infix_get_op() which calls
infix_advance() that updates the infix filter pointers for the cnt
and tail without checking if the filter is already at the end, which
will put the cnt to zero and the tail beyond the end. The loop then calls
infix_next() that has
ps->infix.cnt--;
return ps->infix.string[ps->infix.tail++];
The cnt will now be below zero, and the tail that is returned is
already passed the end of the filter string. So far the allocation
of the filter string usually has some buffer that is zeroed out, but
if the filter string is of the exact size of the allocated buffer
there's no guarantee that the charater after the nul terminating
character will be zero.
Luckily, only root can write to the filter.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit b4875bbe7e68f139bd3383828ae8e994a0df6d28 upstream.
When testing the fix for the trace filter, I could not come up with
a scenario where the operand count goes below zero, so I added a
WARN_ON_ONCE(cnt < 0) to the logic. But there is legitimate case
that it can happen (although the filter would be wrong).
# echo '>' > /sys/kernel/debug/events/ext4/ext4_truncate_exit/filter
That is, a single operation without any operands will hit the path
where the WARN_ON_ONCE() can trigger. Although this is harmless,
and the filter is reported as a error. But instead of spitting out
a warning to the kernel dmesg, just fail nicely and report it via
the proper channels.
Link: http://lkml.kernel.org/r/558C6082.90608@oracle.com
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 2cf30dc180cea808077f003c5116388183e54f9e upstream.
When the following filter is used it causes a warning to trigger:
# cd /sys/kernel/debug/tracing
# echo "((dev==1)blocks==2)" > events/ext4/ext4_truncate_exit/filter
-bash: echo: write error: Invalid argument
# cat events/ext4/ext4_truncate_exit/filter
((dev==1)blocks==2)
^
parse_error: No error
------------[ cut here ]------------
WARNING: CPU: 2 PID: 1223 at kernel/trace/trace_events_filter.c:1640 replace_preds+0x3c5/0x990()
Modules linked in: bnep lockd grace bluetooth ...
CPU: 3 PID: 1223 Comm: bash Tainted: G W 4.1.0-rc3-test+ #450
Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012
0000000000000668 ffff8800c106bc98 ffffffff816ed4f9 ffff88011ead0cf0
0000000000000000 ffff8800c106bcd8 ffffffff8107fb07 ffffffff8136b46c
ffff8800c7d81d48 ffff8800d4c2bc00 ffff8800d4d4f920 00000000ffffffea
Call Trace:
[<ffffffff816ed4f9>] dump_stack+0x4c/0x6e
[<ffffffff8107fb07>] warn_slowpath_common+0x97/0xe0
[<ffffffff8136b46c>] ? _kstrtoull+0x2c/0x80
[<ffffffff8107fb6a>] warn_slowpath_null+0x1a/0x20
[<ffffffff81159065>] replace_preds+0x3c5/0x990
[<ffffffff811596b2>] create_filter+0x82/0xb0
[<ffffffff81159944>] apply_event_filter+0xd4/0x180
[<ffffffff81152bbf>] event_filter_write+0x8f/0x120
[<ffffffff811db2a8>] __vfs_write+0x28/0xe0
[<ffffffff811dda43>] ? __sb_start_write+0x53/0xf0
[<ffffffff812e51e0>] ? security_file_permission+0x30/0xc0
[<ffffffff811dc408>] vfs_write+0xb8/0x1b0
[<ffffffff811dc72f>] SyS_write+0x4f/0xb0
[<ffffffff816f5217>] system_call_fastpath+0x12/0x6a
---[ end trace e11028bd95818dcd ]---
Worse yet, reading the error message (the filter again) it says that
there was no error, when there clearly was. The issue is that the
code that checks the input does not check for balanced ops. That is,
having an op between a closed parenthesis and the next token.
This would only cause a warning, and fail out before doing any real
harm, but it should still not caues a warning, and the error reported
should work:
# cd /sys/kernel/debug/tracing
# echo "((dev==1)blocks==2)" > events/ext4/ext4_truncate_exit/filter
-bash: echo: write error: Invalid argument
# cat events/ext4/ext4_truncate_exit/filter
((dev==1)blocks==2)
^
parse_error: Meaningless filter expression
And give no kernel warning.
Link: http://lkml.kernel.org/r/20150615175025.7e809215@gandalf.local.home
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Tested-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
[ luis: backported to 3.16:
- unconditionally decrement cnt as the OP_NOT logic was introduced only
by e12c09cf3087 ("tracing: Add NOT to filtering logic") ]
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 108029323910c5dd1ef8fa2d10da1ce5fbce6e12 upstream.
The producer should be used producer_fifo as its sched_priority,
so correct it.
Link: http://lkml.kernel.org/r/1433923957-67842-1-git-send-email-long.wanglong@huawei.com
Signed-off-by: Wang Long <long.wanglong@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 80a9b64e2c156b6523e7a01f2ba6e5d86e722814 upstream.
It has come to my attention that this_cpu_read/write are horrible on
architectures other than x86. Worse yet, they actually disable
preemption or interrupts! This caused some unexpected tracing results
on ARM.
101.356868: preempt_count_add <-ring_buffer_lock_reserve
101.356870: preempt_count_sub <-ring_buffer_lock_reserve
The ring_buffer_lock_reserve has recursion protection that requires
accessing a per cpu variable. But since preempt_disable() is traced, it
too got traced while accessing the variable that is suppose to prevent
recursion like this.
The generic version of this_cpu_read() and write() are:
#define this_cpu_generic_read(pcp) \
({ typeof(pcp) ret__; \
preempt_disable(); \
ret__ = *this_cpu_ptr(&(pcp)); \
preempt_enable(); \
ret__; \
})
#define this_cpu_generic_to_op(pcp, val, op) \
do { \
unsigned long flags; \
raw_local_irq_save(flags); \
*__this_cpu_ptr(&(pcp)) op val; \
raw_local_irq_restore(flags); \
} while (0)
Which is unacceptable for locations that know they are within preempt
disabled or interrupt disabled locations.
Paul McKenney stated that __this_cpu_() versions produce much better code on
other architectures than this_cpu_() does, if we know that the call is done in
a preempt disabled location.
I also changed the recursive_unlock() to use two local variables instead
of accessing the per_cpu variable twice.
Link: http://lkml.kernel.org/r/20150317114411.GE3589@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/20150317104038.312e73d1@gandalf.local.home
Acked-by: Christoph Lameter <cl@linux.com>
Reported-by: Uwe Kleine-Koenig <u.kleine-koenig@pengutronix.de>
Tested-by: Uwe Kleine-Koenig <u.kleine-koenig@pengutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 946e51f2bf37f1656916eb75bd0742ba33983c28 upstream.
move d_rcu from overlapping d_child to overlapping d_alias
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ben Hutchings <ben@decadent.org.uk>
[hujianyang: Backported to 3.14 refer to the work of Ben Hutchings in 3.2:
- Apply name changes in all the different places we use d_alias and d_child
- Move the WARN_ON() in __d_free() to d_free() as we don't have dentry_free()]
Signed-off-by: hujianyang <hujianyang@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 7215853e985a4bef1a6c14e00e89dfec84f1e457 upstream.
Commit 6edb2a8a385f0cdef51dae37ff23e74d76d8a6ce introduced
an array map_pages that contains the addresses returned by
kmap_atomic. However, when unmapping those pages, map_pages[0]
is unmapped before map_pages[1], breaking the nesting requirement
as specified in the documentation for kmap_atomic/kunmap_atomic.
This was caught by the highmem debug code present in kunmap_atomic.
Fix the loop to do the unmapping properly.
Link: http://lkml.kernel.org/r/1418871056-6614-1-git-send-email-markivx@codeaurora.org
Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
Reported-by: Lime Yang <limey@codeaurora.org>
Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 086ba77a6db00ed858ff07451bedee197df868c9 upstream.
ARM has some private syscalls (for example, set_tls(2)) which lie
outside the range of NR_syscalls. If any of these are called while
syscall tracing is being performed, out-of-bounds array access will
occur in the ftrace and perf sys_{enter,exit} handlers.
# trace-cmd record -e raw_syscalls:* true && trace-cmd report
...
true-653 [000] 384.675777: sys_enter: NR 192 (0, 1000, 3, 4000022, ffffffff, 0)
true-653 [000] 384.675812: sys_exit: NR 192 = 1995915264
true-653 [000] 384.675971: sys_enter: NR 983045 (76f74480, 76f74000, 76f74b28, 76f74480, 76f76f74, 1)
true-653 [000] 384.675988: sys_exit: NR 983045 = 0
...
# trace-cmd record -e syscalls:* true
[ 17.289329] Unable to handle kernel paging request at virtual address aaaaaace
[ 17.289590] pgd = 9e71c000
[ 17.289696] [aaaaaace] *pgd=00000000
[ 17.289985] Internal error: Oops: 5 [#1] PREEMPT SMP ARM
[ 17.290169] Modules linked in:
[ 17.290391] CPU: 0 PID: 704 Comm: true Not tainted 3.18.0-rc2+ #21
[ 17.290585] task: 9f4dab00 ti: 9e710000 task.ti: 9e710000
[ 17.290747] PC is at ftrace_syscall_enter+0x48/0x1f8
[ 17.290866] LR is at syscall_trace_enter+0x124/0x184
Fix this by ignoring out-of-NR_syscalls-bounds syscall numbers.
Commit cd0980fc8add "tracing: Check invalid syscall nr while tracing syscalls"
added the check for less than zero, but it should have also checked
for greater than NR_syscalls.
Link: http://lkml.kernel.org/p/1414620418-29472-1-git-send-email-rabin@rab.in
Fixes: cd0980fc8add "tracing: Check invalid syscall nr while tracing syscalls"
Signed-off-by: Rabin Vincent <rabin@rab.in>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 24607f114fd14f2f37e3e0cb3d47bce96e81e848 upstream.
Commit 651e22f2701b "ring-buffer: Always reset iterator to reader page"
fixed one bug but in the process caused another one. The reset is to
update the header page, but that fix also changed the way the cached
reads were updated. The cache reads are used to test if an iterator
needs to be updated or not.
A ring buffer iterator, when created, disables writes to the ring buffer
but does not stop other readers or consuming reads from happening.
Although all readers are synchronized via a lock, they are only
synchronized when in the ring buffer functions. Those functions may
be called by any number of readers. The iterator continues down when
its not interrupted by a consuming reader. If a consuming read
occurs, the iterator starts from the beginning of the buffer.
The way the iterator sees that a consuming read has happened since
its last read is by checking the reader "cache". The cache holds the
last counts of the read and the reader page itself.
Commit 651e22f2701b changed what was saved by the cache_read when
the rb_iter_reset() occurred, making the iterator never match the cache.
Then if the iterator calls rb_iter_reset(), it will go into an
infinite loop by checking if the cache doesn't match, doing the reset
and retrying, just to see that the cache still doesn't match! Which
should never happen as the reset is suppose to set the cache to the
current value and there's locks that keep a consuming reader from
having access to the data.
Fixes: 651e22f2701b "ring-buffer: Always reset iterator to reader page"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 4ce97dbf50245227add17c83d87dc838e7ca79d0 upstream.
Epoll on trace_pipe can sometimes hang in a weird case. If the ring buffer is
empty when we set waiters_pending but an event shows up exactly at that moment
we can miss being woken up by the ring buffers irq work. Since
ring_buffer_empty() is inherently racey we will sometimes think that the buffer
is not empty. So we don't get woken up and we don't think there are any events
even though there were some ready when we added the watch, which makes us hang.
This patch fixes this by making sure that we are actually on the wait list
before we set waiters_pending, and add a memory barrier to make sure
ring_buffer_empty() is going to be correct.
Link: http://lkml.kernel.org/p/1408989581-23727-1-git-send-email-jbacik@fb.com
Cc: Martin Lau <kafai@fb.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 021de3d904b88b1771a3a2cfc5b75023c391e646 upstream.
After writting a test to try to trigger the bug that caused the
ring buffer iterator to become corrupted, I hit another bug:
WARNING: CPU: 1 PID: 5281 at kernel/trace/ring_buffer.c:3766 rb_iter_peek+0x113/0x238()
Modules linked in: ipt_MASQUERADE sunrpc [...]
CPU: 1 PID: 5281 Comm: grep Tainted: G W 3.16.0-rc3-test+ #143
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
0000000000000000 ffffffff81809a80 ffffffff81503fb0 0000000000000000
ffffffff81040ca1 ffff8800796d6010 ffffffff810c138d ffff8800796d6010
ffff880077438c80 ffff8800796d6010 ffff88007abbe600 0000000000000003
Call Trace:
[<ffffffff81503fb0>] ? dump_stack+0x4a/0x75
[<ffffffff81040ca1>] ? warn_slowpath_common+0x7e/0x97
[<ffffffff810c138d>] ? rb_iter_peek+0x113/0x238
[<ffffffff810c138d>] ? rb_iter_peek+0x113/0x238
[<ffffffff810c14df>] ? ring_buffer_iter_peek+0x2d/0x5c
[<ffffffff810c6f73>] ? tracing_iter_reset+0x6e/0x96
[<ffffffff810c74a3>] ? s_start+0xd7/0x17b
[<ffffffff8112b13e>] ? kmem_cache_alloc_trace+0xda/0xea
[<ffffffff8114cf94>] ? seq_read+0x148/0x361
[<ffffffff81132d98>] ? vfs_read+0x93/0xf1
[<ffffffff81132f1b>] ? SyS_read+0x60/0x8e
[<ffffffff8150bf9f>] ? tracesys+0xdd/0xe2
Debugging this bug, which triggers when the rb_iter_peek() loops too
many times (more than 2 times), I discovered there's a case that can
cause that function to legitimately loop 3 times!
rb_iter_peek() is different than rb_buffer_peek() as the rb_buffer_peek()
only deals with the reader page (it's for consuming reads). The
rb_iter_peek() is for traversing the buffer without consuming it, and as
such, it can loop for one more reason. That is, if we hit the end of
the reader page or any page, it will go to the next page and try again.
That is, we have this:
1. iter->head > iter->head_page->page->commit
(rb_inc_iter() which moves the iter to the next page)
try again
2. event = rb_iter_head_event()
event->type_len == RINGBUF_TYPE_TIME_EXTEND
rb_advance_iter()
try again
3. read the event.
But we never get to 3, because the count is greater than 2 and we
cause the WARNING and return NULL.
Up the counter to 3.
Fixes: 69d1b839f7ee "ring-buffer: Bind time extend and data events together"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 651e22f2701b4113989237c3048d17337dd2185c upstream.
When performing a consuming read, the ring buffer swaps out a
page from the ring buffer with a empty page and this page that
was swapped out becomes the new reader page. The reader page
is owned by the reader and since it was swapped out of the ring
buffer, writers do not have access to it (there's an exception
to that rule, but it's out of scope for this commit).
When reading the "trace" file, it is a non consuming read, which
means that the data in the ring buffer will not be modified.
When the trace file is opened, a ring buffer iterator is allocated
and writes to the ring buffer are disabled, such that the iterator
will not have issues iterating over the data.
Although the ring buffer disabled writes, it does not disable other
reads, or even consuming reads. If a consuming read happens, then
the iterator is reset and starts reading from the beginning again.
My tests would sometimes trigger this bug on my i386 box:
WARNING: CPU: 0 PID: 5175 at kernel/trace/trace.c:1527 __trace_find_cmdline+0x66/0xaa()
Modules linked in:
CPU: 0 PID: 5175 Comm: grep Not tainted 3.16.0-rc3-test+ #8
Hardware name: /DG965MQ, BIOS MQ96510J.86A.0372.2006.0605.1717 06/05/2006
00000000 00000000 f09c9e1c c18796b3 c1b5d74c f09c9e4c c103a0e3 c1b5154b
f09c9e78 00001437 c1b5d74c 000005f7 c10bd85a c10bd85a c1cac57c f09c9eb0
ed0e0000 f09c9e64 c103a185 00000009 f09c9e5c c1b5154b f09c9e78 f09c9e80^M
Call Trace:
[<c18796b3>] dump_stack+0x4b/0x75
[<c103a0e3>] warn_slowpath_common+0x7e/0x95
[<c10bd85a>] ? __trace_find_cmdline+0x66/0xaa
[<c10bd85a>] ? __trace_find_cmdline+0x66/0xaa
[<c103a185>] warn_slowpath_fmt+0x33/0x35
[<c10bd85a>] __trace_find_cmdline+0x66/0xaa^M
[<c10bed04>] trace_find_cmdline+0x40/0x64
[<c10c3c16>] trace_print_context+0x27/0xec
[<c10c4360>] ? trace_seq_printf+0x37/0x5b
[<c10c0b15>] print_trace_line+0x319/0x39b
[<c10ba3fb>] ? ring_buffer_read+0x47/0x50
[<c10c13b1>] s_show+0x192/0x1ab
[<c10bfd9a>] ? s_next+0x5a/0x7c
[<c112e76e>] seq_read+0x267/0x34c
[<c1115a25>] vfs_read+0x8c/0xef
[<c112e507>] ? seq_lseek+0x154/0x154
[<c1115ba2>] SyS_read+0x54/0x7f
[<c188488e>] syscall_call+0x7/0xb
---[ end trace 3f507febd6b4cc83 ]---
>>>> ##### CPU 1 buffer started ####
Which was the __trace_find_cmdline() function complaining about the pid
in the event record being negative.
After adding more test cases, this would trigger more often. Strangely
enough, it would never trigger on a single test, but instead would trigger
only when running all the tests. I believe that was the case because it
required one of the tests to be shutting down via delayed instances while
a new test started up.
After spending several days debugging this, I found that it was caused by
the iterator becoming corrupted. Debugging further, I found out why
the iterator became corrupted. It happened with the rb_iter_reset().
As consuming reads may not read the full reader page, and only part
of it, there's a "read" field to know where the last read took place.
The iterator, must also start at the read position. In the rb_iter_reset()
code, if the reader page was disconnected from the ring buffer, the iterator
would start at the head page within the ring buffer (where writes still
happen). But the mistake there was that it still used the "read" field
to start the iterator on the head page, where it should always start
at zero because readers never read from within the ring buffer where
writes occur.
I originally wrote a patch to have it set the iter->head to 0 instead
of iter->head_page->read, but then I questioned why it wasn't always
setting the iter to point to the reader page, as the reader page is
still valid. The list_empty(reader_page->list) just means that it was
successful in swapping out. But the reader_page may still have data.
There was a bug report a long time ago that was not reproducible that
had something about trace_pipe (consuming read) not matching trace
(iterator read). This may explain why that happened.
Anyway, the correct answer to this bug is to always use the reader page
an not reset the iterator to inside the writable ring buffer.
Fixes: d769041f8653 "ring_buffer: implement new locking"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 58d4e21e50ff3cc57910a8abc20d7e14375d2f61 upstream.
The "uptime" trace clock added in:
commit 8aacf017b065a805d27467843490c976835eb4a5
tracing: Add "uptime" trace clock that uses jiffies
has wraparound problems when the system has been up more
than 1 hour 11 minutes and 34 seconds. It converts jiffies
to nanoseconds using:
(u64)jiffies_to_usecs(jiffy) * 1000ULL
but since jiffies_to_usecs() only returns a 32-bit value, it
truncates at 2^32 microseconds. An additional problem on 32-bit
systems is that the argument is "unsigned long", so fixing the
return value only helps until 2^32 jiffies (49.7 days on a HZ=1000
system).
Avoid these problems by using jiffies_64 as our basis, and
not converting to nanoseconds (we do convert to clock_t because
user facing API must not be dependent on internal kernel
HZ values).
Link: http://lkml.kernel.org/p/99d63c5bfe9b320a3b428d773825a37095bf6a51.1405708254.git.tony.luck@intel.com
Fixes: 8aacf017b065 "tracing: Add "uptime" trace clock that uses jiffies"
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 97b8ee845393701edc06e27ccec2876ff9596019 upstream.
ring_buffer_poll_wait() should always put the poll_table to its wait_queue
even there is immediate data available. Otherwise, the following epoll and
read sequence will eventually hang forever:
1. Put some data to make the trace_pipe ring_buffer read ready first
2. epoll_ctl(efd, EPOLL_CTL_ADD, trace_pipe_fd, ee)
3. epoll_wait()
4. read(trace_pipe_fd) till EAGAIN
5. Add some more data to the trace_pipe ring_buffer
6. epoll_wait() -> this epoll_wait() will block forever
~ During the epoll_ctl(efd, EPOLL_CTL_ADD,...) call in step 2,
ring_buffer_poll_wait() returns immediately without adding poll_table,
which has poll_table->_qproc pointing to ep_poll_callback(), to its
wait_queue.
~ During the epoll_wait() call in step 3 and step 6,
ring_buffer_poll_wait() cannot add ep_poll_callback() to its wait_queue
because the poll_table->_qproc is NULL and it is how epoll works.
~ When there is new data available in step 6, ring_buffer does not know
it has to call ep_poll_callback() because it is not in its wait queue.
Hence, block forever.
Other poll implementation seems to call poll_wait() unconditionally as the very
first thing to do. For example, tcp_poll() in tcp.c.
Link: http://lkml.kernel.org/p/20140610060637.GA14045@devbig242.prn2.facebook.com
Fixes: 2a2cc8f7c4d0 "ftrace: allow the event pipe to be polled"
Reviewed-by: Chris Mason <clm@fb.com>
Signed-off-by: Martin Lau <kafai@fb.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit f0160a5a2912267c02cfe692eac955c360de5fdf upstream.
The TRACE_ITER_PRINTK check in __trace_puts/__trace_bputs is missing,
so add it, to be consistent with __trace_printk/__trace_bprintk.
Those functions are all called by the same function: trace_printk().
Link: http://lkml.kernel.org/p/51E7A7D6.8090900@huawei.com
Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 8abfb8727f4a724d31f9ccfd8013fbd16d539445 upstream.
Currently trace option stacktrace is not applicable for
trace_printk with constant string argument, the reason is
in __trace_puts/__trace_bputs ftrace_trace_stack is missing.
In contrast, when using trace_printk with non constant string
argument(will call into __trace_printk/__trace_bprintk), then
trace option stacktrace is workable, this inconstant result
will confuses users a lot.
Link: http://lkml.kernel.org/p/51E7A7C9.9040401@huawei.com
Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 5f8bf2d263a20b986225ae1ed7d6759dc4b93af9 upstream.
Running my ftrace tests on PowerPC, it failed the test that checks
if function_graph tracer is affected by the stack tracer. It was.
Looking into this, I found that the update_function_graph_func()
must be called even if the trampoline function is not changed.
This is because archs like PowerPC do not support ftrace_ops being
passed by assembly and instead uses a helper function (what the
trampoline function points to). Since this function is not changed
even when multiple ftrace_ops are added to the code, the test that
falls out before calling update_function_graph_func() will miss that
the update must still be done.
Call update_function_graph_function() for all calls to
update_ftrace_function()
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 2448e3493cb3874baa90725c87869455ebf11cd2 upstream.
instance_rmdir() path destroys the event files but forgets to free
file->filter. Change remove_event_file_dir() to free_event_filter().
Link: http://lkml.kernel.org/p/20140711190638.GA19517@redhat.com
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: "zhangwei(Jovi)" <jovi.zhangwei@huawei.com>
Fixes: f6a84bdc75b5 "tracing: Introduce remove_event_file_dir()"
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 8b8b36834d0fff67fc8668093f4312dd04dcf21d upstream.
The per_cpu buffers are created one per possible CPU. But these do
not mean that those CPUs are online, nor do they even exist.
With the addition of the ring buffer polling, it assumes that the
caller polls on an existing buffer. But this is not the case if
the user reads trace_pipe from a CPU that does not exist, and this
causes the kernel to crash.
Simple fix is to check the cpu against buffer bitmask against to see
if the buffer was allocated or not and return -ENODEV if it is
not.
More updates were done to pass the -ENODEV back up to userspace.
Link: http://lkml.kernel.org/r/5393DB61.6060707@oracle.com
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 099ed151675cd1d2dbeae1dac697975f6a68716d upstream.
Disabling reading and writing to the trace file should not be able to
disable all function tracing callbacks. There's other users today
(like kprobes and perf). Reading a trace file should not stop those
from happening.
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 379cfdac37923653c9d4242d10052378b7563005 upstream.
In order to prevent the saved cmdline cache from being filled when
tracing is not active, the comms are only recorded after a trace event
is recorded.
The problem is, a comm can fail to be recorded if the trace_cmdline_lock
is held. That lock is taken via a trylock to allow it to happen from
any context (including NMI). If the lock fails to be taken, the comm
is skipped. No big deal, as we will try again later.
But! Because of the code that was added to only record after an event,
we may not try again later as the recording is made as a oneshot per
event per CPU.
Only disable the recording of the comm if the comm is actually recorded.
Fixes: 7ffbd48d5cab "tracing: Cache comms only after an event occurred"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit a949ae560a511fe4e3adf48fa44fefded93e5c2b upstream.
A race exists between module loading and enabling of function tracer.
CPU 1 CPU 2
----- -----
load_module()
module->state = MODULE_STATE_COMING
register_ftrace_function()
mutex_lock(&ftrace_lock);
ftrace_startup()
update_ftrace_function();
ftrace_arch_code_modify_prepare()
set_all_module_text_rw();
<enables-ftrace>
ftrace_arch_code_modify_post_process()
set_all_module_text_ro();
[ here all module text is set to RO,
including the module that is
loading!! ]
blocking_notifier_call_chain(MODULE_STATE_COMING);
ftrace_init_module()
[ tries to modify code, but it's RO, and fails!
ftrace_bug() is called]
When this race happens, ftrace_bug() will produces a nasty warning and
all of the function tracing features will be disabled until reboot.
The simple solution is to treate module load the same way the core
kernel is treated at boot. To hardcode the ftrace function modification
of converting calls to mcount into nops. This is done in init/main.c
there's no reason it could not be done in load_module(). This gives
a better control of the changes and doesn't tie the state of the
module to its notifiers as much. Ftrace is special, it needs to be
treated as such.
The reason this would work, is that the ftrace_module_init() would be
called while the module is in MODULE_STATE_UNFORMED, which is ignored
by the set_all_module_text_ro() call.
Link: http://lkml.kernel.org/r/1395637826-3312-1-git-send-email-indou.takao@jp.fujitsu.com
Reported-by: Takao Indoh <indou.takao@jp.fujitsu.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 561a4fe851ccab9dd0d14989ab566f9392d9f8b5 upstream.
As trace event triggers are now part of the mainline kernel, I added
my trace event trigger tests to my test suite I run on all my kernels.
Now these tests get run under different config options, and one of
those options is CONFIG_PROVE_RCU, which checks under lockdep that
the rcu locking primitives are being used correctly. This triggered
the following splat:
===============================
[ INFO: suspicious RCU usage. ]
3.15.0-rc2-test+ #11 Not tainted
-------------------------------
kernel/trace/trace_events_trigger.c:80 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 1, debug_locks = 0
4 locks held by swapper/1/0:
#0: ((&(&j_cdbs->work)->timer)){..-...}, at: [<ffffffff8104d2cc>] call_timer_fn+0x5/0x1be
#1: (&(&pool->lock)->rlock){-.-...}, at: [<ffffffff81059856>] __queue_work+0x140/0x283
#2: (&p->pi_lock){-.-.-.}, at: [<ffffffff8106e961>] try_to_wake_up+0x2e/0x1e8
#3: (&rq->lock){-.-.-.}, at: [<ffffffff8106ead3>] try_to_wake_up+0x1a0/0x1e8
stack backtrace:
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.15.0-rc2-test+ #11
Hardware name: /DG965MQ, BIOS MQ96510J.86A.0372.2006.0605.1717 06/05/2006
0000000000000001 ffff88007e083b98 ffffffff819f53a5 0000000000000006
ffff88007b0942c0 ffff88007e083bc8 ffffffff81081307 ffff88007ad96d20
0000000000000000 ffff88007af2d840 ffff88007b2e701c ffff88007e083c18
Call Trace:
<IRQ> [<ffffffff819f53a5>] dump_stack+0x4f/0x7c
[<ffffffff81081307>] lockdep_rcu_suspicious+0x107/0x110
[<ffffffff810ee51c>] event_triggers_call+0x99/0x108
[<ffffffff810e8174>] ftrace_event_buffer_commit+0x42/0xa4
[<ffffffff8106aadc>] ftrace_raw_event_sched_wakeup_template+0x71/0x7c
[<ffffffff8106bcbf>] ttwu_do_wakeup+0x7f/0xff
[<ffffffff8106bd9b>] ttwu_do_activate.constprop.126+0x5c/0x61
[<ffffffff8106eadf>] try_to_wake_up+0x1ac/0x1e8
[<ffffffff8106eb77>] wake_up_process+0x36/0x3b
[<ffffffff810575cc>] wake_up_worker+0x24/0x26
[<ffffffff810578bc>] insert_work+0x5c/0x65
[<ffffffff81059982>] __queue_work+0x26c/0x283
[<ffffffff81059999>] ? __queue_work+0x283/0x283
[<ffffffff810599b7>] delayed_work_timer_fn+0x1e/0x20
[<ffffffff8104d3a6>] call_timer_fn+0xdf/0x1be^M
[<ffffffff8104d2cc>] ? call_timer_fn+0x5/0x1be
[<ffffffff81059999>] ? __queue_work+0x283/0x283
[<ffffffff8104d823>] run_timer_softirq+0x1a4/0x22f^M
[<ffffffff8104696d>] __do_softirq+0x17b/0x31b^M
[<ffffffff81046d03>] irq_exit+0x42/0x97
[<ffffffff81a08db6>] smp_apic_timer_interrupt+0x37/0x44
[<ffffffff81a07a2f>] apic_timer_interrupt+0x6f/0x80
<EOI> [<ffffffff8100a5d8>] ? default_idle+0x21/0x32
[<ffffffff8100a5d6>] ? default_idle+0x1f/0x32
[<ffffffff8100ac10>] arch_cpu_idle+0xf/0x11
[<ffffffff8107b3a4>] cpu_startup_entry+0x1a3/0x213
[<ffffffff8102a23c>] start_secondary+0x212/0x219
The cause is that the triggers are protected by rcu_read_lock_sched() but
the data is dereferenced with rcu_dereference() which expects it to
be protected with rcu_read_lock(). The proper reference should be
rcu_dereference_sched().
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 6ea6215fe394e320468589d9bba464a48f6d823a upstream.
Forgot to free uprobe_cpu_buffer percpu page in uprobe_buffer_disable().
Link: http://lkml.kernel.org/p/534F8B3F.1090407@huawei.com
Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit af5040da01ef980670b3741b3e10733ee3e33566 upstream.
trace_block_rq_complete does not take into account that request can
be partially completed, so we can get the following incorrect output
of blkparser:
C R 232 + 240 [0]
C R 240 + 232 [0]
C R 248 + 224 [0]
C R 256 + 216 [0]
but should be:
C R 232 + 8 [0]
C R 240 + 8 [0]
C R 248 + 8 [0]
C R 256 + 8 [0]
Also, the whole output summary statistics of completed requests and
final throughput will be incorrect.
This patch takes into account real completion size of the request and
fixes wrong completion accounting.
Signed-off-by: Roman Pen <r.peniaev@gmail.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: linux-kernel@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
While working on my tutorial for 2014 Linux Collaboration Summit
I found that the traceon trigger did not work when conditions were
used. The other triggers worked fine though. Looking into it, it
is because of the way the triggers use the ring buffer to store
the fields it will use for the condition. But if tracing is off, nothing
is stored in the buffer, and the tracepoint exits before calling the
trigger to test the condition. This is fine for all the triggers that
only work when tracing is on, but for traceon trigger that is to
work when tracing is off, nothing happens.
The fix is simple, just use a temp ring buffer to record the event
if tracing is off and the event has a trace event conditional trigger
enabled. The rest of the tracepoint code will work just fine, but
the tracepoint wont be recorded in the other buffers.
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
|
|
In event format strings, the array size is reported in two locations.
One in array subscript and then via the "size:" attribute. The values
reported there have a mismatch.
For e.g., in sched:sched_switch the prev_comm and next_comm character
arrays have subscript values as [32] where as the actual field size is
16.
name: sched_switch
ID: 301
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1;signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:char prev_comm[32]; offset:8; size:16; signed:1;
field:pid_t prev_pid; offset:24; size:4; signed:1;
field:int prev_prio; offset:28; size:4; signed:1;
field:long prev_state; offset:32; size:8; signed:1;
field:char next_comm[32]; offset:40; size:16; signed:1;
field:pid_t next_pid; offset:56; size:4; signed:1;
field:int next_prio; offset:60; size:4; signed:1;
After bisection, the following commit was blamed:
92edca0 tracing: Use direct field, type and system names
This commit removes the duplication of strings for field->name and
field->type assuming that all the strings passed in
__trace_define_field() are immutable. This is not true for arrays, where
the type string is created in event_storage variable and field->type for
all array fields points to event_storage.
Use __stringify() to create a string constant for the type string.
Also, get rid of event_storage and event_storage_mutex that are not
needed anymore.
also, an added benefit is that this reduces the overhead of events a bit more:
text data bss dec hex filename
8424787 2036472 1302528 11763787 b3804b vmlinux
8420814 2036408 1302528 11759750 b37086 vmlinux.patched
Link: http://lkml.kernel.org/r/1392349908-29685-1-git-send-email-vnagarnaik@google.com
Cc: Laurent Chavey <chavey@google.com>
Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
|
|
If a module fails to add its tracepoints due to module tainting, do not
create the module event infrastructure in the debugfs directory. As the events
will not work and worse yet, they will silently fail, making the user wonder
why the events they enable do not display anything.
Having a warning on module load and the events not visible to the users
will make the cause of the problem much clearer.
Link: http://lkml.kernel.org/r/20140227154923.265882695@goodmis.org
Fixes: 6d723736e472 "tracing/events: add support for modules to TRACE_EVENT"
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: stable@vger.kernel.org # 2.6.31+
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
|
|
Each sub-buffer (buffer page) has a full 64 bit timestamp. The events on
that page use a 27 bit delta against that timestamp in order to save on
bits written to the ring buffer. If the time between events is larger than
what the 27 bits can hold, a "time extend" event is added to hold the
entire 64 bit timestamp again and the events after that hold a delta from
that timestamp.
As a "time extend" is always paired with an event, it is logical to just
allocate the event with the time extend, to make things a bit more efficient.
Unfortunately, when the pairing code was written, it removed the "delta = 0"
from the first commit on a page, causing the events on the page to be
slightly skewed.
Fixes: 69d1b839f7ee "ring-buffer: Bind time extend and data events together"
Cc: stable@vger.kernel.org # 2.6.37+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
|
|
Pull core block IO changes from Jens Axboe:
"The major piece in here is the immutable bio_ve series from Kent, the
rest is fairly minor. It was supposed to go in last round, but
various issues pushed it to this release instead. The pull request
contains:
- Various smaller blk-mq fixes from different folks. Nothing major
here, just minor fixes and cleanups.
- Fix for a memory leak in the error path in the block ioctl code
from Christian Engelmayer.
- Header export fix from CaiZhiyong.
- Finally the immutable biovec changes from Kent Overstreet. This
enables some nice future work on making arbitrarily sized bios
possible, and splitting more efficient. Related fixes to immutable
bio_vecs:
- dm-cache immutable fixup from Mike Snitzer.
- btrfs immutable fixup from Muthu Kumar.
- bio-integrity fix from Nic Bellinger, which is also going to stable"
* 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits)
xtensa: fixup simdisk driver to work with immutable bio_vecs
block/blk-mq-cpu.c: use hotcpu_notifier()
blk-mq: for_each_* macro correctness
block: Fix memory leak in rw_copy_check_uvector() handling
bio-integrity: Fix bio_integrity_verify segment start bug
block: remove unrelated header files and export symbol
blk-mq: uses page->list incorrectly
blk-mq: use __smp_call_function_single directly
btrfs: fix missing increment of bi_remaining
Revert "block: Warn and free bio if bi_end_io is not set"
block: Warn and free bio if bi_end_io is not set
blk-mq: fix initializing request's start time
block: blk-mq: don't export blk_mq_free_queue()
block: blk-mq: make blk |