diff options
| author | Linus Torvalds <torvalds@linux-foundation.org> | 2025-03-27 16:03:52 -0700 |
|---|---|---|
| committer | Linus Torvalds <torvalds@linux-foundation.org> | 2025-03-27 16:03:52 -0700 |
| commit | 88221ac0d560700b50493aedc768f728aa585141 (patch) | |
| tree | 10a8d0e13f2f5f33c04aa8b9f566e2e66bb94c5b | |
| parent | 31eb415bf6f06c90fdd9b635caf3a6c5110a38b6 (diff) | |
| parent | 4ffef9579ffc51647c5eb55869fb310f3c1e2db2 (diff) | |
| download | linux-88221ac0d560700b50493aedc768f728aa585141.tar.gz linux-88221ac0d560700b50493aedc768f728aa585141.tar.bz2 linux-88221ac0d560700b50493aedc768f728aa585141.zip | |
Merge tag 'trace-latency-v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull latency tracing updates from Steven Rostedt:
- Add some trace events to osnoise and timerlat sample generation
This adds more information to the osnoise and timerlat tracers as
well as allows BPF programs to be attached to these locations to
extract even more data.
- Fix to DECLARE_TRACE_CONDITION() macro
It wasn't used but now will be and it happened to be broken causing
the build to fail.
- Add scheduler specification monitors to runtime verifier (RV)
This is a continuation of Daniel Bristot's work.
RV allows monitors to run and react concurrently. Running the
cumulative model is equivalent to running single components using the
same reactors, with the advantage that it's easier to point out which
specification failed in case of error.
This update introduces nested monitors to RV, in short, the sysfs
monitor folder will contain a monitor named sched, which is nothing
but an empty container for other monitors. Controlling the sched
monitor (enable, disable, set reactors) controls all nested monitors.
The following scheduling monitors are added:
- sco: scheduling context operations
Monitor to ensure sched_set_state happens only in thread context
- tss: task switch while scheduling
Monitor to ensure sched_switch happens only in scheduling context
- snroc: set non runnable on its own context
Monitor to ensure set_state happens only in the respective task's context
- scpd: schedule called with preemption disabled
Monitor to ensure schedule is called with preemption disabled
- snep: schedule does not enable preempt
Monitor to ensure schedule does not enable preempt
- sncid: schedule not called with interrupt disabled
Monitor to ensure schedule is not called with interrupt disabled
* tag 'trace-latency-v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tools/rv: Allow rv list to filter for container
Documentation/rv: Add docs for the sched monitors
verification/dot2k: Add support for nested monitors
tools/rv: Add support for nested monitors
rv: Add scpd, snep and sncid per-cpu monitors
rv: Add snroc per-task monitor
rv: Add sco and tss per-cpu monitors
rv: Add option for nested monitors and include sched
sched: Add sched tracepoints for RV task model
rv: Add license identifiers to monitor files
tracing: Fix DECLARE_TRACE_CONDITION
trace/osnoise: Add trace events for samples
64 files changed, 2135 insertions, 166 deletions
diff --git a/Documentation/tools/rv/rv-mon-sched.rst b/Documentation/tools/rv/rv-mon-sched.rst new file mode 100644 index 000000000000..da0fe4c79ae5 --- /dev/null +++ b/Documentation/tools/rv/rv-mon-sched.rst @@ -0,0 +1,69 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============ +rv-mon-sched +============ +----------------------------- +Scheduler monitors collection +----------------------------- + +:Manual section: 1 + +SYNOPSIS +======== + +**rv mon sched** [*OPTIONS*] + +**rv mon <NESTED_MONITOR>** [*OPTIONS*] + +**rv mon sched:<NESTED_MONITOR>** [*OPTIONS*] + +DESCRIPTION +=========== + +The scheduler monitor collection is a container for several monitors to model +the behaviour of the scheduler. Each monitor describes a specification that +the scheduler should follow. + +As a monitor container, it will enable all nested monitors and set them +according to OPTIONS. +Nevertheless nested monitors can also be activated independently both by name +and by specifying sched: , e.g. to enable only monitor tss you can do any of: + + # rv mon sched:tss + + # rv mon tss + +See kernel documentation for further information about this monitor: +<https://docs.kernel.org/trace/rv/monitor_sched.html> + +OPTIONS +======= + +.. include:: common_ikm.rst + +NESTED MONITOR +============== + +The available nested monitors are: + * scpd: schedule called with preemption disabled + * snep: schedule does not enable preempt + * sncid: schedule not called with interrupt disabled + * snroc: set non runnable on its own context + * sco: scheduling context operations + * tss: task switch while scheduling + +SEE ALSO +======== + +**rv**\(1), **rv-mon**\(1) + +Linux kernel *RV* documentation: +<https://www.kernel.org/doc/html/latest/trace/rv/index.html> + +AUTHOR +====== + +Written by Gabriele Monaco <gmonaco@redhat.com> + +.. include:: common_appendix.rst diff --git a/Documentation/trace/rv/monitor_sched.rst b/Documentation/trace/rv/monitor_sched.rst new file mode 100644 index 000000000000..24b2c62a3bc2 --- /dev/null +++ b/Documentation/trace/rv/monitor_sched.rst @@ -0,0 +1,171 @@ +Scheduler monitors +================== + +- Name: sched +- Type: container for multiple monitors +- Author: Gabriele Monaco <gmonaco@redhat.com>, Daniel Bristot de Oliveira <bristot@kernel.org> + +Description +----------- + +Monitors describing complex systems, such as the scheduler, can easily grow to +the point where they are just hard to understand because of the many possible +state transitions. +Often it is possible to break such descriptions into smaller monitors, +sharing some or all events. Enabling those smaller monitors concurrently is, +in fact, testing the system as if we had one single larger monitor. +Splitting models into multiple specification is not only easier to +understand, but gives some more clues when we see errors. + +The sched monitor is a set of specifications to describe the scheduler behaviour. +It includes several per-cpu and per-task monitors that work independently to verify +different specifications the scheduler should follow. + +To make this system as straightforward as possible, sched specifications are *nested* +monitors, whereas sched itself is a *container*. +From the interface perspective, sched includes other monitors as sub-directories, +enabling/disabling or setting reactors to sched, propagates the change to all monitors, +however single monitors can be used independently as well. + +It is important that future modules are built after their container (sched, in +this case), otherwise the linker would not respect the order and the nesting +wouldn't work as expected. +To do so, simply add them after sched in the Makefile. + +Specifications +-------------- + +The specifications included in sched are currently a work in progress, adapting the ones +defined in by Daniel Bristot in [1]. + +Currently we included the following: + +Monitor tss +~~~~~~~~~~~ + +The task switch while scheduling (tss) monitor ensures a task switch happens +only in scheduling context, that is inside a call to `__schedule`:: + + | + | + v + +-----------------+ + | thread | <+ + +-----------------+ | + | | + | schedule_entry | schedule_exit + v | + sched_switch | + +--------------- | + | sched | + +--------------> -+ + +Monitor sco +~~~~~~~~~~~ + +The scheduling context operations (sco) monitor ensures changes in a task state +happen only in thread context:: + + + | + | + v + sched_set_state +------------------+ + +------------------ | | + | | thread_context | + +-----------------> | | <+ + +------------------+ | + | | + | schedule_entry | schedule_exit + v | + | + scheduling_context -+ + +Monitor snroc +~~~~~~~~~~~~~ + +The set non runnable on its own context (snroc) monitor ensures changes in a +task state happens only in the respective task's context. This is a per-task +monitor:: + + | + | + v + +------------------+ + | other_context | <+ + +------------------+ | + | | + | sched_switch_in | sched_switch_out + v | + sched_set_state | + +------------------ | + | own_context | + +-----------------> -+ + +Monitor scpd +~~~~~~~~~~~~ + +The schedule called with preemption disabled (scpd) monitor ensures schedule is +called with preemption disabled:: + + | + | + v + +------------------+ + | cant_sched | <+ + +------------------+ | + | | + | preempt_disable | preempt_enable + v | + schedule_entry | + schedule_exit | + +----------------- can_sched | + | | + +----------------> -+ + +Monitor snep +~~~~~~~~~~~~ + +The schedule does not enable preempt (snep) monitor ensures a schedule call +does not enable preemption:: + + | + | + v + preempt_disable +------------------------+ + preempt_enable | | + +------------------ | non_scheduling_context | + | | | + +-----------------> | | <+ + +------------------------+ | + | | + | schedule_entry | schedule_exit + v | + | + scheduling_contex -+ + +Monitor sncid +~~~~~~~~~~~~~ + +The schedule not called with interrupt disabled (sncid) monitor ensures +schedule is not called with interrupt disabled:: + + | + | + v + schedule_entry +--------------+ + schedule_exit | | + +----------------- | can_sched | + | | | + +----------------> | | <+ + +--------------+ | + | | + | irq_disable | irq_enable + v | + | + cant_sched -+ + +References +---------- + +[1] - https://bristot.me/linux-task-model diff --git a/include/linux/rv.h b/include/linux/rv.h index 8883b41d88ec..3452b5e4b29e 100644 --- a/include/linux/rv.h +++ b/include/linux/rv.h @@ -7,7 +7,7 @@ #ifndef _LINUX_RV_H #define _LINUX_RV_H -#define MAX_DA_NAME_LEN 24 +#define MAX_DA_NAME_LEN 32 #ifdef CONFIG_RV /* @@ -56,7 +56,7 @@ struct rv_monitor { bool rv_monitoring_on(void); int rv_unregister_monitor(struct rv_monitor *monitor); -int rv_register_monitor(struct rv_monitor *monitor); +int rv_register_monitor(struct rv_monitor *monitor, struct rv_monitor *parent); int rv_get_task_monitor_slot(void); void rv_put_task_monitor_slot(int slot); diff --git a/include/linux/sched.h b/include/linux/sched.h index 6e5c38718ff5..56ddeb37b5cd 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -46,6 +46,7 @@ #include <linux/rv.h> #include <linux/livepatch_sched.h> #include <linux/uidgid_types.h> +#include <linux/tracepoint-defs.h> #include <asm/kmap_size.h> /* task_struct member predeclarations (sorted alphabetically): */ @@ -187,6 +188,12 @@ struct user_event_mm; # define debug_rtlock_wait_restore_state() do { } while (0) #endif +#define trace_set_current_state(state_value) \ + do { \ + if (tracepoint_enabled(sched_set_state_tp)) \ + __trace_set_current_state(state_value); \ + } while (0) + /* * set_current_state() includes a barrier so that the write of current->__state * is correctly serialised wrt the caller's subsequent test of whether to @@ -227,12 +234,14 @@ struct user_event_mm; #define __set_current_state(state_value) \ do { \ debug_normal_state_change((state_value)); \ + trace_set_current_state(state_value); \ WRITE_ONCE(current->__state, (state_value)); \ } while (0) #define set_current_state(state_value) \ do { \ debug_normal_state_change((state_value)); \ + trace_set_current_state(state_value); \ smp_store_mb(current->__state, (state_value)); \ } while (0) @@ -248,6 +257,7 @@ struct user_event_mm; \ raw_spin_lock_irqsave(¤t->pi_lock, flags); \ debug_special_state_change((state_value)); \ + trace_set_current_state(state_value); \ WRITE_ONCE(current->__state, (state_value)); \ raw_spin_unlock_irqrestore(¤t->pi_lock, flags); \ } while (0) @@ -283,6 +293,7 @@ struct user_event_mm; raw_spin_lock(¤t->pi_lock); \ current->saved_state = current->__state; \ debug_rtlock_wait_set_state(); \ + trace_set_current_state(TASK_RTLOCK_WAIT); \ WRITE_ONCE(current->__state, TASK_RTLOCK_WAIT); \ raw_spin_unlock(¤t->pi_lock); \ } while (0); @@ -292,6 +303,7 @@ struct user_event_mm; lockdep_assert_irqs_disabled(); \ raw_spin_lock(¤t->pi_lock); \ debug_rtlock_wait_restore_state(); \ + trace_set_current_state(current->saved_state); \ WRITE_ONCE(current->__state, current->saved_state); \ current->saved_state = TASK_RUNNING; \ raw_spin_unlock(¤t->pi_lock); \ @@ -328,6 +340,10 @@ extern void io_schedule_finish(int token); extern long io_schedule_timeout(long timeout); extern void io_schedule(void); +/* wrapper function to trace from this header file */ +DECLARE_TRACEPOINT(sched_set_state_tp); +extern void __trace_set_current_state(int state_value); + /** * struct prev_cputime - snapshot of system and user cputime * @utime: time spent in user mode diff --git a/include/trace/define_trace.h b/include/trace/define_trace.h index e1c1079f8c8d..ed52d0506c69 100644 --- a/include/trace/define_trace.h +++ b/include/trace/define_trace.h @@ -76,6 +76,10 @@ #define DECLARE_TRACE(name, proto, args) \ DEFINE_TRACE(name, PARAMS(proto), PARAMS(args)) +#undef DECLARE_TRACE_CONDITION +#define DECLARE_TRACE_CONDITION(name, proto, args, cond) \ + DEFINE_TRACE(name, PARAMS(proto), PARAMS(args)) + /* If requested, create helpers for calling these tracepoints from Rust. */ #ifdef CREATE_RUST_TRACE_POINTS #undef DEFINE_RUST_DO_TRACE @@ -108,6 +112,8 @@ /* Make all open coded DECLARE_TRACE nops */ #undef DECLARE_TRACE #define DECLARE_TRACE(name, proto, args) +#undef DECLARE_TRACE_CONDITION +#define DECLARE_TRACE_CONDITION(name, proto, args, cond) #ifdef TRACEPOINTS_ENABLED #include <trace/trace_events.h> @@ -129,6 +135,7 @@ #undef DEFINE_EVENT_CONDITION #undef TRACE_HEADER_MULTI_READ #undef DECLARE_TRACE +#undef DECLARE_TRACE_CONDITION /* Only undef what we defined in this file */ #ifdef UNDEF_TRACE_INCLUDE_FILE diff --git a/include/trace/events/osnoise.h b/include/trace/events/osnoise.h index a2379a4f0684..3f4273623801 100644 --- a/include/trace/events/osnoise.h +++ b/include/trace/events/osnoise.h @@ -3,9 +3,105 @@ #define TRACE_SYSTEM osnoise #if !defined(_OSNOISE_TRACE_H) || defined(TRACE_HEADER_MULTI_READ) + +#ifndef _OSNOISE_TRACE_H #define _OSNOISE_TRACE_H +/* + * osnoise sample structure definition. Used to store the statistics of a + * sample run. + */ +struct osnoise_sample { + u64 runtime; /* runtime */ + u64 noise; /* noise */ + u64 max_sample; /* max single noise sample */ + int hw_count; /* # HW (incl. hypervisor) interference */ + int nmi_count; /* # NMIs during this sample */ + int irq_count; /* # IRQs during this sample */ + int softirq_count; /* # softirqs during this sample */ + int thread_count; /* # threads during this sample */ +}; + +#ifdef CONFIG_TIMERLAT_TRACER +/* + * timerlat sample structure definition. Used to store the statistics of + * a sample run. + */ +struct timerlat_sample { + u64 timer_latency; /* timer_latency */ + unsigned int seqnum; /* unique sequence */ + int context; /* timer context */ +}; +#endif // CONFIG_TIMERLAT_TRACER +#endif // _OSNOISE_TRACE_H #include <linux/tracepoint.h> +TRACE_EVENT(osnoise_sample, + + TP_PROTO(struct osnoise_sample *s), + + TP_ARGS(s), + + TP_STRUCT__entry( + __field( u64, runtime ) + __field( u64, noise ) + __field( u64, max_sample ) + __field( int, hw_count ) + __field( int, irq_count ) + __field( int, nmi_count ) + __field( int, softirq_count ) + __field( int, thread_count ) + ), + + TP_fast_assign( + __entry->runtime = s->runtime; + __entry->noise = s->noise; + __entry->max_sample = s->max_sample; + __entry->hw_count = s->hw_count; + __entry->irq_count = s->irq_count; + __entry->nmi_count = s->nmi_count; + __entry->softirq_count = s->softirq_count; + __entry->thread_count = s->thread_count; + ), + + TP_printk("runtime=%llu noise=%llu max_sample=%llu hw_count=%d" + " irq_count=%d nmi_count=%d softirq_count=%d" + " thread_count=%d", + __entry->runtime, + __entry->noise, + __entry->max_sample, + __entry->hw_count, + __entry->irq_count, + __entry->nmi_count, + __entry->softirq_count, + __entry->thread_count) +); + +#ifdef CONFIG_TIMERLAT_TRACER +TRACE_EVENT(timerlat_sample, + + TP_PROTO(struct timerlat_sample *s), + + TP_ARGS(s), + + TP_STRUCT__entry( + __field( u64, timer_latency ) + __field( unsigned int, seqnum ) + __field( int, context ) + ), + + TP_fast_assign( + __entry->timer_latency = s->timer_latency; + __entry->seqnum = s->seqnum; + __entry->context = s->context; + ), + + TP_printk("timer_latency=%llu seqnum=%u context=%d", + __entry->timer_latency, + __entry->seqnum, + __entry->context) +); +#endif // CONFIG_TIMERLAT_TRACER + TRACE_EVENT(thread_noise, TP_PROTO(struct task_struct *t, u64 start, u64 duration), |
