|
The following commits added new fields/flags to the branch stack field
list:
commit 1f48989cdc7d ("perf script: Output branch sample type")
commit 6ade6c646035 ("perf script: Show branch speculation info")
commit 1e66dcff7b9b ("perf script: Add not taken event for branch stack")
Update brstack syntax documentation to be consistent with the latest
branch stack field list. Improve the descriptions to help users
interpret the fields accurately.
Signed-off-by: Yujie Liu <yujie.liu@intel.com>
Reviewed-by: Leo Yan <leo.yan@arm.com>
Reviewed-by: Sandipan Das <sandipan.das@amd.com>
Link: https://lore.kernel.org/r/20250312072329.419020-1-yujie.liu@intel.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
|
|
This option is to show data type info in the regular (code) annotation.
It tries to find data type for each (memory) instruction in the
function. It'd be useful to see function-level memory access pattern
and also to debug the data type profiling result.
The output would be added at the end of the line and have "# data-type:"
prefix.
For now, it only works with --stdio mode for simplicity. I can work on
enabling it for TUI later.
$ perf annotate --stdio --code-with-type
Percent | Source code & Disassembly of vmlinux for cpu/mem-loads/ppk (253 samples, percent: local period)
---------------------------------------------------------------------------------------------------------------
: 0 0xffffffff81baa000 <check_preemption_disabled>:
0.00 : ffffffff81baa000: pushq %r12 # data-type: (stack operation)
0.00 : ffffffff81baa002: pushq %rbp # data-type: (stack operation)
0.00 : ffffffff81baa003: pushq %rbx # data-type: (stack operation)
0.00 : ffffffff81baa004: subq $0x8, %rsp
18.00 : ffffffff81baa008: movl %gs:0x7e48893d(%rip), %ebx # 0x3294c <pcpu_hot+0xc> # data-type: struct pcpu_hot +0xc (cpu_number)
12.58 : ffffffff81baa00f: movl %gs:0x7e488932(%rip), %eax # 0x32948 <pcpu_hot+0x8> # data-type: struct pcpu_hot +0x8 (preempt_count)
0.00 : ffffffff81baa016: testl $0x7fffffff, %eax
0.00 : ffffffff81baa01b: je 0xffffffff81baa02c <check_preemption_disabled+0x2c>
0.00 : ffffffff81baa01d: addq $0x8, %rsp
0.00 : ffffffff81baa021: movl %ebx, %eax
14.19 : ffffffff81baa023: popq %rbx # data-type: (stack operation)
18.86 : ffffffff81baa024: popq %rbp # data-type: (stack operation)
12.10 : ffffffff81baa025: popq %r12 # data-type: (stack operation)
17.78 : ffffffff81baa027: jmp 0xffffffff81bc1170 <__x86_return_thunk>
6.49 : ffffffff81baa02c: callq *0xc9139e(%rip) # 0xffffffff8283b3d0 <pv_ops+0xf0> # data-type: (stack operation)
0.00 : ffffffff81baa032: testb $0x2, %ah
0.00 : ffffffff81baa035: je 0xffffffff81baa01d <check_preemption_disabled+0x1d>
0.00 : ffffffff81baa037: movq %rdi, %rbp
0.00 : ffffffff81baa03a: movq %gs:0x32940, %rax # data-type: struct pcpu_hot +0 (current_task)
0.00 : ffffffff81baa043: testb $0x4, 0x2f(%rax) # data-type: struct task_struct +0x2f (flags)
0.00 : ffffffff81baa047: je 0xffffffff81baa052 <check_preemption_disabled+0x52>
0.00 : ffffffff81baa049: cmpl $0x1, 0x3d0(%rax) # data-type: struct task_struct +0x3d0 (nr_cpus_allowed)
0.00 : ffffffff81baa050: je 0xffffffff81baa01d <check_preemption_disabled+0x1d>
0.00 : ffffffff81baa052: movq %gs:0x32940, %r12 # data-type: struct pcpu_hot +0 (current_task)
0.00 : ffffffff81baa05b: cmpw $0x0, 0x7f0(%r12) # data-type: struct task_struct +0x7f0 (migration_disabled)
0.00 : ffffffff81baa065: movq %rsi, (%rsp)
0.00 : ffffffff81baa069: jne 0xffffffff81baa01d <check_preemption_disabled+0x1d>
0.00 : ffffffff81baa06b: movl 0xe8dd13(%rip), %eax # 0xffffffff82a37d84 <system_state> # data-type: enum system_states +0
0.00 : ffffffff81baa071: testl %eax, %eax
0.00 : ffffffff81baa073: je 0xffffffff81baa01d <check_preemption_disabled+0x1d>
0.00 : ffffffff81baa075: incl %gs:0x7e4888cc(%rip) # 0x32948 <pcpu_hot+0x8> # data-type: struct pcpu_hot +0x8 (preempt_count)
0.00 : ffffffff81baa07c: movq $-0x7e14a100, %rdi
0.00 : ffffffff81baa083: callq 0xffffffff81148c40 <__printk_ratelimit> # data-type: (stack operation)
0.00 : ffffffff81baa088: testl %eax, %eax
0.00 : ffffffff81baa08a: je 0xffffffff81baa0d5 <check_preemption_disabled+0xd5>
0.00 : ffffffff81baa08c: movl 0x958(%r12), %r9d # data-type: struct task_struct +0x958 (pid)
0.00 : ffffffff81baa094: movq (%rsp), %rdx # data-type: char* +0
0.00 : ffffffff81baa098: movq %rbp, %rsi
0.00 : ffffffff81baa09b: leaq 0xb88(%r12), %r8 # data-type: struct task_struct +0xb88 (comm)
0.00 : ffffffff81baa0a3: movl %gs:0x7e48889e(%rip), %ecx # 0x32948 <pcpu_hot+0x8> # data-type: struct pcpu_hot +0x8 (preempt_count)
0.00 : ffffffff81baa0aa: andl $0x7fffffff, %ecx
0.00 : ffffffff81baa0b0: movq $-0x7dd3cdf0, %rdi
0.00 : ffffffff81baa0b7: subl $0x1, %ecx
0.00 : ffffffff81baa0ba: callq 0xffffffff81149340 <_printk> # data-type: (stack operation)
0.00 : ffffffff81baa0bf: movq 0x20(%rsp), %rsi
0.00 : ffffffff81baa0c4: movq $-0x7ddb8c7e, %rdi
0.00 : ffffffff81baa0cb: callq 0xffffffff81149340 <_printk> # data-type: (stack operation)
0.00 : ffffffff81baa0d0: callq 0xffffffff81b7ab60 <dump_stack> # data-type: (stack operation)
0.00 : ffffffff81baa0d5: decl %gs:0x7e48886c(%rip) # 0x32948 <pcpu_hot+0x8> # data-type: struct pcpu_hot +0x8 (preempt_count)
0.00 : ffffffff81baa0dc: jmp 0xffffffff81baa01d <check_preemption_disabled+0x1d>
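For post-processing, the "# data-type:" suffix is easy to pick up with a script. The following is only a sketch: the line layout is inferred from the sample output above, and the helper name `percent_by_data_type` is made up for illustration. It sums the Percent column per data type.

```python
import re
from collections import defaultdict

# Matches "<percent> : <address>: <insn> ... # data-type: <type>".
# The layout is inferred from the sample output, not from perf's source.
LINE_RE = re.compile(r"^\s*(\d+\.\d+)\s*:\s*[0-9a-f]+:.*# data-type: (.+?)\s*$")


def percent_by_data_type(lines):
    """Sum the Percent column for every line that carries a data-type suffix."""
    totals = defaultdict(float)
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            # group(1) is the percent, group(2) is everything after the prefix.
            totals[match.group(2)] += float(match.group(1))
    return dict(totals)
```

Lines without the suffix (for example plain register moves) are skipped, so the totals only cover instructions for which a type was reported.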
Reviewed-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250310224925.799005-8-namhyung@kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
|
|
This patch parses `owner_lock_stat` into an RB tree, enabling ordered
reporting of owner lock statistics with stack traces. It also updates
the documentation for the `-o` option in contention mode, decouples `-o`
from `-t`, and issues a warning to inform users about the new behavior
of `-ov`.
Example output:
$ sudo ~/linux/tools/perf/perf lock con -abvo -Y mutex-spin -E3 perf bench sched pipe
...
contended total wait max wait avg wait type caller
171 1.55 ms 20.26 us 9.06 us mutex pipe_read+0x57
0xffffffffac6318e7 pipe_read+0x57
0xffffffffac623862 vfs_read+0x332
0xffffffffac62434b ksys_read+0xbb
0xfffffffface604b2 do_syscall_64+0x82
0xffffffffad00012f entry_SYSCALL_64_after_hwframe+0x76
36 193.71 us 15.27 us 5.38 us mutex pipe_write+0x50
0xffffffffac631ee0 pipe_write+0x50
0xffffffffac6241db vfs_write+0x3bb
0xffffffffac6244ab ksys_write+0xbb
0xfffffffface604b2 do_syscall_64+0x82
0xffffffffad00012f entry_SYSCALL_64_after_hwframe+0x76
4 51.22 us 16.47 us 12.80 us mutex do_epoll_wait+0x24d
0xffffffffac691f0d do_epoll_wait+0x24d
0xffffffffac69249b do_epoll_pwait.part.0+0xb
0xffffffffac693ba5 __x64_sys_epoll_pwait+0x95
0xfffffffface604b2 do_syscall_64+0x82
0xffffffffad00012f entry_SYSCALL_64_after_hwframe+0x76
=== owner stack trace ===
3 31.24 us 15.27 us 10.41 us mutex pipe_read+0x348
0xffffffffac631bd8 pipe_read+0x348
0xffffffffac623862 vfs_read+0x332
0xffffffffac62434b ksys_read+0xbb
0xfffffffface604b2 do_syscall_64+0x82
0xffffffffad00012f entry_SYSCALL_64_after_hwframe+0x76
...
Signed-off-by: Chun-Tse Shao <ctshao@google.com>
Tested-by: Athira Rajeev <atrajeev@linux.ibm.com>
Link: https://lore.kernel.org/r/20250227003359.732948-5-ctshao@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
|
|
-v disables deduplication of similarly suffixed PMUs so add it to the
help and doc strings.
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: James Clark <james.clark@linaro.org>
Link: https://lore.kernel.org/r/20250226104111.564443-4-james.clark@linaro.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
|
|
In sysfs, the perf events are all located in
/sys/bus/event_source/devices/ but some places ended up hard-coding the
location to be at the root of /sys/devices/ which could be very risky as
you do not exactly know what type of device you are accessing in sysfs
at that location.
So fix this all up by properly pointing everything at the bus device
list instead of the root of the sysfs devices/ tree.
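As an illustration of the intended lookup (not code from this patch), a script that wants the list of event sources can read the bus device list directly. The helper name and the `sysfs_root` parameter are assumptions made for the sketch:

```python
import os


def list_event_sources(sysfs_root="/sys"):
    """Return the names found under <sysfs_root>/bus/event_source/devices."""
    path = os.path.join(sysfs_root, "bus", "event_source", "devices")
    try:
        return sorted(os.listdir(path))
    except FileNotFoundError:
        # No event_source bus at this location (e.g. sysfs not mounted here).
        return []
```

Every entry found this way is known to be a perf event source, which is not true for an entry picked up from the root of the devices/ tree.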
Cc: stable <stable@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Link: https://lore.kernel.org/r/2025021955-implant-excavator-179d@gregkh
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
|
|
Describe latency and parallelism profiling, related flags, and differences
from CPU-consumption-centric profiling, currently the only supported mode.
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Link: https://lore.kernel.org/r/a13f270ed33cedb03ce9ebf9ddbd064854ca0f19.1739437531.git.dvyukov@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
|
|
Add a --latency flag to record/report that allows capturing and showing
latency-centric profiles rather than the default CPU-consumption-centric
profiles. For latency profiles, record captures context switch events,
and report shows Latency as the first column.
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Link: https://lore.kernel.org/r/e9640464bcbc47dde2cb557003f421052ebc9eec.1739437531.git.dvyukov@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
|
|
The --summary-mode option will select how to show the syscall summary at
the end. By default, it'll show the summary for each thread and it's
the same as if --summary-mode=thread is passed.
The other option is to show total summary, which is --summary-mode=total.
I'd like to have this instead of a separate option like --total-summary
because we may want to add a new summary mode (by cgroup) later.
$ sudo ./perf trace -as --summary-mode=total sleep 1
Summary of events:
total, 21580 events
syscall calls errors total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- ------ -------- --------- --------- --------- ------
epoll_wait 1305 0 14716.712 0.000 11.277 551.529 8.87%
futex 1256 89 13331.197 0.000 10.614 733.722 15.49%
poll 669 0 6806.618 0.000 10.174 459.316 11.77%
ppoll 220 0 3968.797 0.000 18.040 516.775 25.35%
clock_nanosleep 1 0 1000.027 1000.027 1000.027 1000.027 0.00%
epoll_pwait 21 0 592.783 0.000 28.228 522.293 88.29%
nanosleep 16 0 60.515 0.000 3.782 10.123 33.33%
ioctl 510 0 4.284 0.001 0.008 0.182 8.84%
recvmsg 1434 775 3.497 0.001 0.002 0.174 6.37%
write 1393 0 2.854 0.001 0.002 0.017 1.79%
read 1063 100 2.236 0.000 0.002 0.083 5.11%
...
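The difference between the two modes comes down to the aggregation key. A minimal sketch, assuming events arrive as (tid, syscall, duration) tuples (this tuple layout and the function name are made up for illustration and are not how perf trace stores its data):

```python
from collections import defaultdict


def summarize(events, mode="thread"):
    """Aggregate (tid, syscall, msec) tuples per thread or in total."""
    if mode not in ("thread", "total"):
        raise ValueError(f"unknown summary mode: {mode!r}")
    stats = defaultdict(lambda: {"calls": 0, "total": 0.0})
    for tid, syscall, msec in events:
        # "thread" keeps one row per (thread, syscall); "total" merges threads.
        key = (tid, syscall) if mode == "thread" else syscall
        stats[key]["calls"] += 1
        stats[key]["total"] += msec
    return dict(stats)
```

A later mode such as per-cgroup would only need another choice of key, which is the reason for preferring a single --summary-mode option.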
Reviewed-by: Howard Chu <howardchu95@gmail.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: https://lore.kernel.org/r/20250205205443.1986408-5-namhyung@kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools
Pull perf-tools updates from Namhyung Kim:
"There are a lot of changes in the perf tools in this cycle.
build:
- Use generic syscall table to generate syscall numbers on supported
archs
- This also makes it possible to get rid of libaudit, which was used
for syscall numbers
- Remove python2 support as it has been deprecated for years
- Fix issues on static build with libzstd
perf record:
- Intel-PT supports "aux-action" config term to pause or resume
tracing in the aux-buffer. Users can start the intel_pt event as
"start-paused" and configure other events to control the Intel-PT
tracing:
# perf record --kcore -e intel_pt/aux-action=start-paused/ \
-e syscalls:sys_enter_newuname/aux-action=resume/ \
-e syscalls:sys_exit_newuname/aux-action=pause/ -- uname
This requires kernel support (which was added in v6.13)
perf lock:
- 'perf lock contention' command has an ability to symbolize locks in
dynamically allocated objects using slab cache name when it runs
with BPF. Those dynamic locks would have "&" prefix in the name to
distinguish them from ordinary (static) locks
# perf lock con -abl -E 5 sleep 1
contended total wait max wait avg wait address symbol
2 1.95 us 1.77 us 975 ns ffff9d5e852d3498 &task_struct (mutex)
1 1.18 us 1.18 us 1.18 us ffff9d5e852d3538 &task_struct (mutex)
4 1.12 us 354 ns 279 ns ffff9d5e841ca800 &kmalloc-cg-512 (mutex)
2 859 ns 617 ns 429 ns ffffffffa41c3620 delayed_uprobe_lock (mutex)
3 691 ns 388 ns 230 ns ffffffffa41c0940 pack_mutex (mutex)
This also requires kernel/BPF support (which was added in v6.13)
perf ftrace:
- 'perf ftrace latency' command gets a couple of options to support
linear buckets instead of exponential. Also it's possible to
specify max and min latency for the linear buckets:
# perf ftrace latency -abn -T switch_mm_irqs_off --bucket-range=100 \
--min-latency=200 --max-latency=800 -- sleep 1
# DURATION | COUNT | GRAPH |
0 - 200 ns | 186 | ### |
200 - 300 ns | 256 | ##### |
300 - 400 ns | 364 | ####### |
400 - 500 ns | 223 | #### |
500 - 600 ns | 111 | ## |
600 - 700 ns | 41 | |
700 - 800 ns | 141 | ## |
800 - ... ns | 169 | ### |
# statistics (in nsec)
total time: 2162212
avg time: 967
max time: 16817
min time: 132
count: 2236
- As you can see in the above example, it now shows the statistics
at the end so that users can see the avg/max/min latencies easily
- 'perf ftrace profile' command has --graph-opts option like 'perf
ftrace trace' so that it can control the tracing behaviors in the
same way. For example, it can limit the function call depth or
threshold
perf script:
- Improve physical memory resolution in 'mem-phys-addr' script by
parsing /proc/iomem file
# perf script mem-phys-addr -- find /
...
Event: mem_inst_retired.all_loads:P
Memory type count percentage
---------------------------------------- ---------- ----------
100000000-85f7fffff : System RAM 8929 69.7
547600000-54785d23f : Kernel data 1240 9.7
546a00000-5474bdfff : Kernel rodata 490 3.8
5480ce000-5485fffff : Kernel bss 121 0.9
0-fff : Reserved 3860 30.1
100000-89c01fff : System RAM 18 0.1
8a22c000-8df6efff : System RAM 5 0.0
Others:
- 'perf test' gets --runs-per-test option to run the test cases
repeatedly. This would be helpful to see if it's flaky
- Add 'parse_events' method to Python perf extension module, so that
users can use the same event parsing logic in the python code. One
more step towards implementing perf tools in Python. :)
- Support opening tracepoint events without libtraceevent. This will
be helpful if it won't use the tracing data like in 'perf stat'
- Update ARM Neoverse N2/V2 JSON events and metrics"
* tag 'perf-tools-for-v6.14-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools: (176 commits)
perf test: Update event_groups test to use instructions
perf bench: Fix undefined behavior in cmpworker()
perf annotate: Prefer passing evsel to evsel->core.idx
perf lock: Rename fields in lock_type_table
perf lock: Add percpu-rwsem for type filter
perf lock: Fix parse_lock_type which only retrieve one lock flag
perf lock: Fix return code for functions in __cmd_contention
perf hist: Fix width calculation in hpp__fmt()
perf hist: Fix bogus profiles when filters are enabled
perf hist: Deduplicate cmp/sort/collapse code
perf test: Improve verbose documentation
perf test: Add a runs-per-test flag
perf test: Fix parallel/sequential option documentation
perf test: Send list output to stdout rather than stderr
perf test: Rename functions and variables for better clarity
perf tools: Expose quiet/verbose variables in Makefile.perf
perf config: Add a function to set one variable in .perfconfig
perf test perftool_testsuite: Return correct value for skipping
perf test perftool_testsuite: Add missing description
perf test record+probe_libc_inet_pton: Make test resilient
...
|
|
percpu-rwsem was missing from the man page. For backward compatibility,
replace `pcpu-sem` with `percpu-rwsem` before parsing the lock name.
Tested `./perf lock con -ab -Y pcpu-sem` and `./perf lock con -ab -Y
percpu-rwsem`
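The backward-compatibility step amounts to an alias lookup before the type names are parsed. A sketch in Python, assuming the -Y argument is a comma-separated list; the table and function names are hypothetical and this is not the C code from the patch:

```python
# Old spelling -> current spelling; only the alias named above is listed.
LOCK_TYPE_ALIASES = {"pcpu-sem": "percpu-rwsem"}


def normalize_lock_types(arg):
    """Split a comma-separated type filter and map old names to new ones."""
    names = (part.strip() for part in arg.split(","))
    return [LOCK_TYPE_ALIASES.get(name, name) for name in names if name]
```

With this, both spellings used in the test commands above resolve to the same lock type.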
Fixes: 4f701063bfa2 ("perf lock contention: Show lock type with address")
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Chun-Tse Shao <ctshao@google.com>
Cc: nick.forrington@arm.com
Link: https://lore.kernel.org/r/20250116235838.2769691-2-ctshao@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
|
|
Add a little more detail on the output expectations for each verbose
level.
Signed-off-by: Ian Rogers <irogers@google.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Cc: James Clark <james.clark@linaro.org>
Link: https://lore.kernel.org/r/20250110045736.598281-6-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
|
|
To detect flakes it is useful to run tests more than once. Add a
runs-per-test flag that will run each test multiple times. Example
output:
```
$ perf test -r 3 lbr -v
122: perf record LBR tests : Ok
122: perf record LBR tests : Ok
122: perf record LBR tests : Ok
```
Update the documentation for the runs-per-test option.
Signed-off-by: Ian Rogers <irogers@google.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Cc: James Clark <james.clark@linaro.org>
Link: https://lore.kernel.org/r/20250110045736.598281-5-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
|
|
The parallel option was removed in commit 94d1a913bdc4 ("perf test:
Make parallel testing the default"). Update the sequential
documentation to reflect it isn't the default except for "exclusive"
tests.
Fixes: 94d1a913bdc4 ("perf test: Make parallel testing the default")
Signed-off-by: Ian Rogers <irogers@google.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Cc: James Clark <james.clark@linaro.org>
Link: https://lore.kernel.org/r/20250110045736.598281-4-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
|
|
Document the flag along with PMU events to hint what it's used for and
give an example with other useful options to get minimal output.
Reviewed-by: Yeoreum Yun <yeoreum.yun@arm.com>
Signed-off-by: James Clark <james.clark@linaro.org>
Link: https://lore.kernel.org/r/20250108142904.401139-3-james.clark@linaro.org
Signed-off-by: Will Deacon <will@kernel.org>
|
|
All architectures now support HAVE_SYSCALL_TABLE_SUPPORT, so the flag is
no longer needed. With the removal of the flag, the related
GENERIC_SYSCALL_TABLE can also be removed.
libaudit was only used as a fallback for when HAVE_SYSCALL_TABLE_SUPPORT
was not defined, so libaudit is also no longer needed for any
architecture.
Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Günther Noack <gnoack@google.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Garry <john.g.garry@oracle.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leo Yan <leo.yan@linux.dev>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mickaël Salaün <mic@digikod.net>
Cc: Mike Leach <mike.leach@linaro.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250108-perf_syscalltbl-v6-16-7543b5293098@rivosinc.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
Like the trace subcommand, 'perf ftrace profile' should be able to pass
some options to control the tracing behavior of the function graph tracer.
Some options are limited, however, in order to maintain the internal
behavior.
For example, it can limit the function call depth like below:
# perf ftrace profile --graph-opts depth=5 -- myprog
Committer testing:
root@number:~# perf ftrace profile --graph-opts thresh=1000 -- sleep 1
# Total (us) Avg (us) Max (us) Count Function
1001419.301 500709.650 1000032.000 2 x64_sys_call
1000032.000 1000032.000 1000032.000 1 __x64_sys_clock_nanosleep
1000032.000 1000032.000 1000032.000 1 common_nsleep
1000031.000 1000031.000 1000031.000 1 do_nanosleep
1000031.000 1000031.000 1000031.000 1 hrtimer_nanosleep
1000024.000 1000024.000 1000024.000 1 schedule
1387.208 1387.208 1387.208 1 __x64_sys_execve
1386.691 1386.691 1386.691 1 do_execveat_common.isra.0
1334.170 1334.170 1334.170 1 bprm_execve
1258.413 1258.413 1258.413 1 load_elf_binary
1123.068 1123.068 1123.068 1 begin_new_exec
1113.550 1113.550 1113.550 1 mmput
1109.237 1109.237 1109.237 1 exit_mmap
root@number:~# perf ftrace profile --graph-opts thresh=1200 -- sleep 1
# Total (us) Avg (us) Max (us) Count Function
1001448.204 500724.102 1000018.000 2 x64_sys_call
1000017.000 1000017.000 1000017.000 1 __x64_sys_clock_nanosleep
1000017.000 1000017.000 1000017.000 1 common_nsleep
1000017.000 1000017.000 1000017.000 1 hrtimer_nanosleep
1000016.000 1000016.000 1000016.000 1 do_nanosleep
1000012.000 1000012.000 1000012.000 1 schedule
1430.112 1430.112 1430.112 1 __x64_sys_execve
1429.581 1429.581 1429.581 1 do_execveat_common.isra.0
1376.289 1376.289 1376.289 1 bprm_execve
1301.743 1301.743 1301.743 1 load_elf_binary
root@number:~#
Reviewed-by: James Clark <james.clark@linaro.org>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250107224352.1128669-2-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
The --force-btf option is intended for debugging purposes and is
currently undocumented. Add documentation for it.
Committer notes:
We need a follow up patch expanding on what can be done via BTF and what
isn't possible and thus needs further work to convert kernel C source
code into tables that can then be associated with syscall integer args
and struct members, as discussed in:
https://lore.kernel.org/all/20241215190712.787847-3-howardchu95@gmail.com/T/#mcfbba653200775c59c730705229a49b34a153db7
Signed-off-by: Howard Chu <howardchu95@gmail.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/20241215190712.787847-3-howardchu95@gmail.com
Link: https://lore.kernel.org/all/20241215190712.787847-3-howardchu95@gmail.com/T/#mcfbba653200775c59c730705229a49b34a153db7
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
Document the use of aux-action config term and provide a simple example.
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Acked-by: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Leo Yan <leo.yan@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20241216070244.14450-7-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
Improve format of config terms and section references.
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Acked-by: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Leo Yan <leo.yan@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20241216070244.14450-6-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
Add parsing for aux-action to accept "pause", "resume" or "start-paused"
values.
"start-paused" is valid only for AUX area events.
"pause" and "resume" are valid only for events grouped with an AUX area
event as the group leader. However, like with aux-output, the events
will be automatically grouped if they are not currently in a group, and
the AUX area event precedes the other events.
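The validity rules above can be summarized in a few lines. This is a sketch of the rules as stated in this message only; the function name and the boolean parameters are made up, and automatic grouping is assumed to have already happened before the check:

```python
VALID_AUX_ACTIONS = ("pause", "resume", "start-paused")


def check_aux_action(value, is_aux_event, leader_is_aux):
    """Return None if the term is acceptable, else a short error string."""
    if value not in VALID_AUX_ACTIONS:
        return f"invalid aux-action value: {value!r}"
    if value == "start-paused":
        # Only the AUX area event itself may start paused.
        if not is_aux_event:
            return "start-paused is valid only for AUX area events"
        return None
    # "pause" and "resume" need an AUX area event as the group leader.
    if not leader_is_aux:
        return f"{value} requires an AUX area event as the group leader"
    return None
```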
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Acked-by: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Leo Yan <leo.yan@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20241216070244.14450-4-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
This patch adds a max-latency option, as discussed. In case the number
of buckets would be more than 22, we don't observe the setting (for now,
let's say).
By default or if 0 is passed, the value is automatically determined
based on the number of buckets, range and minimum, so that we fill all
available buffers (equivalent to the behaviour before this patch).
We now get something like this:
# perf ftrace latency --bucket-range=20 \
--min-latency 10 \
--max-latency=100 \
-T switch_mm_irqs_off -a sleep 2
# DURATION | COUNT | GRAPH |
0 - 10 us | 1731 | ################ |
10 - 30 us | 1 | |
30 - 50 us | 0 | |
50 - 70 us | 0 | |
70 - 90 us | 0 | |
90 - 100 us | 0 | |
100 - ... us | 0 | |
Note that the maximum is observed even if it doesn't completely cover a
full range (the second-to-last range is 10us long so that the last one
starts at 100 sharp). This looks more sensible to me and eases the
computations, since we don't need to account for the range while filling
the buckets.
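The resulting bucket layout can be modelled with a short sketch. It is not the code from this patch: the helper name is made up, it assumes min_latency > 0, and it assumes an unset (0) maximum means 20 full-width ranges after the minimum.

```python
def linear_buckets(bucket_range, min_latency, max_latency=0, inner_buckets=20):
    """Return (low, high) pairs; high is None for the open-ended last bucket."""
    if bucket_range <= 0:
        raise ValueError("bucket_range must be positive")
    if max_latency == 0:
        # Assumed default: fill all inner buckets with full-width ranges.
        max_latency = min_latency + inner_buckets * bucket_range
    buckets = [(0, min_latency)]  # outliers below the minimum
    low = min_latency
    while low < max_latency:
        # Clamp the last inner range so the final bucket starts at the maximum.
        high = min(low + bucket_range, max_latency)
        buckets.append((low, high))
        low = high
    buckets.append((max_latency, None))  # outliers at or above the maximum
    return buckets
```

Calling `linear_buckets(20, 10, 100)` reproduces the ranges of the example above, including the shortened 90 - 100 range.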
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20241112181214.1171244-5-acme@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
Values below and above the bucketed range go into the first and last
(outlier) buckets.
Without it:
# perf ftrace latency --use-nsec --use-bpf \
--bucket-range=200 \
-T switch_mm_irqs_off -a sleep 2
# DURATION | COUNT | GRAPH |
0 - 200 ns | 0 | |
200 - 400 ns | 44 | |
400 - 600 ns | 291 | # |
600 - 800 ns | 506 | ## |
800 - 1000 ns | 148 | |
1.00 - 1.20 us | 581 | ## |
1.20 - 1.40 us | 2199 | ########## |
1.40 - 1.60 us | 1048 | #### |
1.60 - 1.80 us | 1448 | ###### |
1.80 - 2.00 us | 1091 | ##### |
2.00 - 2.20 us | 517 | ## |
2.20 - 2.40 us | 318 | # |
2.40 - 2.60 us | 370 | # |
2.60 - 2.80 us | 271 | # |
2.80 - 3.00 us | 150 | |
3.00 - 3.20 us | 85 | |
3.20 - 3.40 us | 48 | |
3.40 - 3.60 us | 40 | |
3.60 - 3.80 us | 22 | |
3.80 - 4.00 us | 13 | |
4.00 - 4.20 us | 14 | |
4.20 - ... us | 626 | ## |
#
# perf ftrace latency --use-nsec --use-bpf \
--bucket-range=20 --min-latency=1200 \
-T switch_mm_irqs_off -a sleep 2
# DURATION | COUNT | GRAPH |
0 - 1200 ns | 1243 | ##### |
1.20 - 1.22 us | 141 | |
1.22 - 1.24 us | 202 | |
1.24 - 1.26 us | 209 | |
1.26 - 1.28 us | 219 | |
1.28 - 1.30 us | 208 | |
1.30 - 1.32 us | 245 | # |
1.32 - 1.34 us | 246 | # |
1.34 - 1.36 us | 224 | # |
1.36 - 1.38 us | 219 | |
1.38 - 1.40 us | 206 | |
1.40 - 1.42 us | 190 | |
1.42 - 1.44 us | 190 | |
1.44 - 1.46 us | 146 | |
1.46 - 1.48 us | 140 | |
1.48 - 1.50 us | 125 | |
1.50 - 1.52 us | 115 | |
1.52 - 1.54 us | 102 | |
1.54 - 1.56 us | 87 | |
1.56 - 1.58 us | 90 | |
1.58 - 1.60 us | 85 | |
1.60 - ... us | 5487 | ######################## |
#
The second run above focuses on the latencies starting at 1.2us, with a
finer-grained range of 20ns.
This is all on a live system, so statistically interesting, but not
narrowing down on the same numbers, so a 'perf ftrace latency record'
seems interesting to then use all on the same snapshot of latencies.
A --max-latency counterpart should come next, at first limiting the
max-latency to 20 * bucket-size, as we have a fixed buckets array with
20 + 2 entries (+ 2 for the outliers) and thus would need to make it
larger for higher latencies.
We also may need a way to ask for not considering the out of range
values (first and last buckets) when drawing the buckets bars.
Co-developed-by: Gabriele Monaco <gmonaco@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20241112181214.1171244-4-acme@kernel.org
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
In addition to showing it exponentially, using log2() to figure out the
histogram index, allow for showing it linearly:
The preexisting mode, the default:
# perf ftrace latency --use-nsec --use-bpf \
-T switch_mm_irqs_off -a sleep 2
# DURATION | COUNT | GRAPH |
0 - 1 ns | 0 | |
1 - 2 ns | 0 | |
2 - 4 ns | 0 | |
4 - 8 ns | 0 | |
8 - 16 ns | 0 | |
16 - 32 ns | 0 | |
32 - 64 ns | 0 | |
64 - 128 ns | 238 | # |
128 - 256 ns | 1704 | ########## |
256 - 512 ns | 672 | ### |
512 - 1024 ns | 4458 | ########################## |
1 - 2 us | 677 | #### |
2 - 4 us | 5 | |
4 - 8 us | 0 | |
8 - 16 us | 0 | |
16 - 32 us | 0 | |
32 - 64 us | 0 | |
64 - 128 us | 0 | |
128 - 256 us | 0 | |
256 - 512 us | 0 | |
512 - 1024 us | 0 | |
1 - ... ms | 0 | |
#
The new histogram mode:
# perf ftrace latency --bucket-range=150 --use-nsec --use-bpf \
-T switch_mm_irqs_off -a sleep 2
# DURATION | COUNT | GRAPH |
0 - 1 ns | 0 | |
1 - 151 ns | 265 | # |
151 - 301 ns | 1797 | ########### |
301 - 451 ns | 258 | # |
451 - 601 ns | 289 | # |
601 - 751 ns | 2049 | ############# |
751 - 901 ns | 967 | ###### |
901 - 1051 ns | 513 | ### |
1.05 - 1.20 us | 114 | |
1.20 - 1.35 us | 559 | ### |
1.35 - 1.50 us | 189 | # |
1.50 - 1.65 us | 137 | |
1.65 - 1.80 us | 32 | |
1.80 - 1.95 us | 2 | |
1.95 - 2.10 us | 0 | |
2.10 - 2.25 us | 1 | |
2.25 - 2.40 us | 1 | |
2.40 - 2.55 us | 0 | |
2.55 - 2.70 us | 0 | |
2.70 - 2.85 us | 0 | |
2.85 - 3.00 us | 1 | |
3.00 - ... us | 4 | |
#
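For reference, the index computation of the exponential (default) mode can be sketched as follows. This models the row boundaries of the default histogram shown earlier in this message (0 - 1, 1 - 2, 2 - 4, ...); it is not the C or BPF code, and the names are made up:

```python
NUM_BUCKETS = 22  # number of rows in the histograms above


def log2_bucket_index(delta_ns):
    """Map a latency in nanoseconds to its row in the exponential histogram."""
    if delta_ns < 0:
        raise ValueError("latency must not be negative")
    # int.bit_length() is floor(log2(n)) + 1 for n >= 1 and 0 for n == 0,
    # so 0 lands in row 0, 1 in row 1, 2..3 in row 2, 4..7 in row 3, etc.
    return min(int(delta_ns).bit_length(), NUM_BUCKETS - 1)
```

Everything from 2^20 ns (about 1 ms) upwards is clamped into the last row, matching the open-ended "1 - ... ms" bucket.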
Co-developed-by: Gabriele Monaco <gmonaco@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20241112181214.1171244-3-acme@kernel.org
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
Just a trivial typo, should be 'can', did a spell check on the rest of
the file just in case, nothing more stood out.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
For a long time the perf tools annotation code relied on parsing the
output of binutils's objdump (or its reimplementations, like llvm's) and
then augmenting it with samples, allowing navigation, etc.
More recently disassemblers from the capstone and llvm (libraries, not
parsing the output of tools using those libraries to mimic binutils's
objdump output) were introduced.
So when all those methods are available, there is a static preference
for a series of attempts of disassembling a binary, with the 'llvm,
capstone, objdump' sequence being hard coded.
This patch allows users to change that sequence, specifying via a 'perf
config' 'annotate.disassemblers' entry which and in what order
disassemblers should be attempted.
As alluded to in the comments in the source code of this series, this
flexibility is useful for users and developers alike, eliminating the
requirement to rebuild the tool with some specific set of libraries to
see how the output of disassembling would be for one of these methods.
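The selection logic can be pictured with a small sketch. It is written in Python for brevity and is not the C code in this series; the function names and the `backends` mapping are hypothetical, and only the three disassembler names and the default order come from the text above.

```python
# Default order, as hard coded before this patch.
KNOWN_DISASSEMBLERS = ("llvm", "capstone", "objdump")


def parse_disassemblers(value=None):
    """Turn a comma-separated config value into an ordered list of names."""
    if not value:
        return list(KNOWN_DISASSEMBLERS)
    order = []
    for name in value.split(","):
        name = name.strip()
        if name not in KNOWN_DISASSEMBLERS:
            raise ValueError(f"unknown disassembler: {name!r}")
        if name not in order:  # ignore duplicates, keep first position
            order.append(name)
    return order


def disassemble(order, backends):
    """Try each backend in order; backends maps name -> callable or is missing."""
    for name in order:
        backend = backends.get(name)
        if backend is None:
            continue  # not built in, try the next one
        output = backend()
        if output is not None:
            return name, output
    raise RuntimeError("no disassembler succeeded")
```

A backend that is missing or fails is skipped, so "objdump,capstone" falls back to capstone only when objdump produces nothing.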
root@x1:~# rm -f ~/.perfconfig
root@x1:~# perf annotate -v --stdio2 update_load_avg
<SNIP>
symbol__disassemble:
filename=/usr/lib/debug/lib/modules/6.11.4-201.fc40.x86_64/vmlinux,
sym=update_load_avg, start=0xffffffffb6148fe0, en>
annotating [0x6ff7170]
/usr/lib/debug/lib/modules/6.11.4-201.fc40.x86_64/vmlinux :
[0x7407ca0] update_load_avg
Disassembled with llvm
annotate.disassemblers=llvm,capstone,objdump
Samples: 66 of event 'cpu_atom/cycles/P', 10000 Hz,
Event count (approx.): 5185444, [percent: local period]
update_load_avg()
/usr/lib/debug/lib/modules/6.11.4-201.fc40.x86_64/vmlinux
Percent 0xffffffff81148fe0 <update_load_avg>:
1.61 pushq %r15
pushq %r14
1.00 pushq %r13
movl %edx,%r13d
1.90 pushq %r12
pushq %rbp
movq %rsi,%rbp
pushq %rbx
movq %rdi,%rbx
subq $0x18,%rsp
15.14 movl 0x1a4(%rdi),%eax
root@x1:~# perf config annotate.disassemblers=capstone
root@x1:~# cat ~/.perfconfig
# this file is auto-generated.
[annotate]
disassemblers = capstone
root@x1:~#
root@x1:~# perf annotate -v --stdio2 update_load_avg
<SNIP>
Disassembled with capstone
annotate.disassemblers=capstone
Samples: 66 of event 'cpu_atom/cycles/P', 10000 Hz,
Event count (approx.): 5185444, [percent: local period]
update_load_avg()
/usr/lib/debug/lib/modules/6.11.4-201.fc40.x86_64/vmlinux
Percent 0xffffffff81148fe0 <update_load_avg>:
1.61 pushq %r15
pushq %r14
1.00 pushq %r13
movl %edx,%r13d
1.90 pushq %r12
pushq %rbp
movq %rsi,%rbp
pushq %rbx
movq %rdi,%rbx
subq $0x18,%rsp
15.14 movl 0x1a4(%rdi),%eax
root@x1:~# perf config annotate.disassemblers=objdump,capstone
root@x1:~# perf config annotate.disassemblers
annotate.disassemblers=objdump,capstone
root@x1:~# cat ~/.perfconfig
# this file is auto-generated.
[annotate]
disassemblers = objdump,capstone
root@x1:~# perf annotate -v --stdio2 update_load_avg
Executing: objdump --start-address=0xffffffff81148fe0 \
--stop-address=0xffffffff811497aa \
-d --no-show-raw-insn -S -C "$1"
Disassembled with objdump
annotate.disassemblers=objdump,capstone
Samples: 66 of event 'cpu_atom/cycles/P', 10000 Hz,
Event count (approx.): 5185444, [percent: local period]
update_load_avg()
/usr/lib/debug/lib/modules/6.11.4-201.fc40.x86_64/vmlinux
Percent
Disassembly of section .text:
ffffffff81148fe0 <update_load_avg>:
#define DO_ATTACH 0x4
ffffffff81148fe0 <update_load_avg>:
#define DO_ATTACH 0x4
#define DO_DETACH 0x8
/* Update task and its cfs_rq load average */
static inline void update_load_avg(struct cfs_rq *cfs_rq,
struct sched_entity *se,
int flags)
{
1.61 push %r15
push %r14
1.00 push %r13
mov %edx,%r13d
1.90 push %r12
push %rbp
mov %rsi,%rbp
push %rbx
mov %rdi,%rbx
sub $0x18,%rsp
}
/* rq->task_clock normalized against any time
this cfs_rq has spent throttled */
static inline u64 cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
{
if (unlikely(cfs_rq->throttle_count))
15.14 mov 0x1a4(%rdi),%eax
root@x1:~#
After adding a way to select the disassembler from the command line, a
'perf test' comparing the output of the various disassemblers should be
introduced, to test these codebases.
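The ordered fallback that an 'annotate.disassemblers' entry describes
can be sketched as follows. This is a minimal illustration in Python,
not the perf implementation (which is written in C): the function
names, the backend table and the rejection of unknown names are all
assumptions made for the example.

```python
# Hypothetical sketch; none of these names come from the perf sources.

# Default order, as stated in the commit message above.
DEFAULT_ORDER = ("llvm", "capstone", "objdump")


def parse_disassemblers(value):
    """Split a comma-separated config value into an ordered list of names.

    Rejecting unknown names is an assumption of this sketch.
    """
    names = [name.strip() for name in value.split(",") if name.strip()]
    for name in names:
        if name not in DEFAULT_ORDER:
            raise ValueError("unknown disassembler: " + name)
    return names


def disassemble(symbol, order, backends):
    """Try each backend in 'order'; return (name, output) of the first success.

    'backends' maps a name to a callable that returns the disassembly,
    or None when that method cannot handle the symbol. A name missing
    from 'backends' stands for a method that was not built in.
    """
    for name in order:
        backend = backends.get(name)
        if backend is None:
            continue
        output = backend(symbol)
        if output is not None:
            return name, output
    raise RuntimeError("no disassembler could handle " + symbol)
```

With an order of 'capstone,objdump' and a capstone backend that fails,
the sketch falls through to objdump, which matches the idea of trying
the configured methods in sequence until one works.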
Acked-by: Ian Rogers <irogers@google.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Steinar H. Gunderson <sesse@google.com>
Link: https://lore.kernel.org/r/20241111151734.1018476-4-acme@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
|
Add a few paragraphs on tool and hwmon events.
Signed-off-by: Ian Rogers <irogers@google.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Cc: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: Yoshihiro Furudera <fj5100bi@fujitsu.com>
Cc: Howard Chu <howardchu95@gmail.com>
Cc: Ze Gao <zegao2021@gmail.com>
Cc: Changbin Du <changbin.du@huawei.com>
Cc: Junhao He <hejunhao3@huawei.com>
Cc: Weilin Wang <weilin.wang@intel.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Link: https://lore.kernel.org/r/20241109003759.473460-8-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
|
|
The --itrace help now needs updating to reflect that
the --itrace=b argument synthesises branches as well
as branch misses.
Signed-off-by: Graham Woodward <graham.woodward@arm.com>
Reviewed-by: James Clark <james.clark@linaro.org>
Tested-by: Leo Yan <leo.yan@arm.com>
Cc: nd@arm.com
Cc: mike.leach@linaro.org
Cc: linux-arm-kernel@lists.infradead.org
Link: https://lore.kernel.org/r/20241025143009.25419-5-graham.woodward@arm.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
|