-rw-r--r--  Documentation/admin-guide/kernel-parameters.txt |  45
-rw-r--r--  Documentation/trace/debugging.rst               | 159
-rw-r--r--  Documentation/trace/ftrace.rst                  |  12
-rw-r--r--  include/linux/ring_buffer.h                     |  20
-rw-r--r--  kernel/trace/ring_buffer.c                      | 949
-rw-r--r--  kernel/trace/trace.c                            | 372
-rw-r--r--  kernel/trace/trace.h                            |  14
-rw-r--r--  kernel/trace/trace_functions_graph.c            |  23
-rw-r--r--  kernel/trace/trace_output.c                     |  17
-rw-r--r--  tools/testing/selftests/ring-buffer/map_test.c  |  24
10 files changed, 1483 insertions, 152 deletions
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 19b71ff1168e..bb48ae24ae69 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -6808,6 +6808,51 @@
 			the same thing would happen if it was left off). The
 			irq_handler_entry event, and all events under the
 			"initcall" system.
+
+			Flags can be added to the instance to modify its
+			behavior when it is created. The flags are separated
+			by '^'.
+
+			The available flags are:
+
+			 traceoff    - Have tracing in the instance disabled
+			               after it is created.
+			 traceprintk - Have trace_printk() write into this
+			               trace instance (note, "printk" and
+			               "trace_printk" can also be used)
+
+				trace_instance=foo^traceoff^traceprintk,sched,irq
+
+			The flags must come before the defined events.
+
+			If memory has been reserved (see memmap for x86), the
+			instance can use that memory:
+
+				memmap=12M$0x284500000 trace_instance=boot_map@0x284500000:12M
+
+			The above will create a "boot_map" instance that uses
+			the physical memory at 0x284500000 and is 12 megabytes
+			in size. The per CPU buffers of that instance will be
+			split up accordingly.
+
+			Alternatively, the memory can be reserved by the
+			reserve_mem option:
+
+				reserve_mem=12M:4096:trace trace_instance=boot_map@trace
+
+			This will reserve 12 megabytes at boot up with a 4096
+			byte alignment and place the ring buffer in this
+			memory. Note that due to KASLR, the memory may not be
+			at the same location each time, in which case the
+			buffer content will not be preserved.
+
+			Also note that the layout of the ring buffer data may
+			change between kernel versions, in which case the
+			validator will fail and reset the ring buffer if the
+			layout does not match that of the previous kernel.
+
+			If the ring buffer is used for persistent bootups and
+			has events enabled, it is recommended to disable
+			tracing so that events from a previous boot do not mix
+			with events of the current boot (unless you are
+			debugging a random crash at boot up):
+
+				reserve_mem=12M:4096:trace trace_instance=boot_map^traceoff^traceprintk@trace,sched,irq
+
+			See also Documentation/trace/debugging.rst
+
 	trace_options=[option-list]
 			[FTRACE] Enable or disable tracer options at boot.
 			The option-list is a comma delimited list of options
diff --git a/Documentation/trace/debugging.rst b/Documentation/trace/debugging.rst
new file mode 100644
index 000000000000..54fb16239d70
--- /dev/null
+++ b/Documentation/trace/debugging.rst
@@ -0,0 +1,159 @@
+==============================
+Using the tracer for debugging
+==============================
+
+Copyright 2024 Google LLC.
+
+:Author:   Steven Rostedt <rostedt@goodmis.org>
+:License:  The GNU Free Documentation License, Version 1.2
+           (dual licensed under the GPL v2)
+
+- Written for: 6.12
+
+Introduction
+------------
+The tracing infrastructure can be very useful for debugging the Linux
+kernel. This document is a place to add various methods of using the tracer
+for debugging.
+
+First, make sure that the tracefs file system is mounted::
+
+ $ sudo mount -t tracefs tracefs /sys/kernel/tracing
+
+
+Using trace_printk()
+--------------------
+
+trace_printk() is a very lightweight utility that can be used in any context
+inside the kernel, with the exception of "noinstr" sections. It can be used
+in normal, softirq, interrupt and even NMI context. The trace data is
+written to the tracing ring buffer in a lockless way. To make it even
+lighter weight, when possible, it will only record the pointer to the format
+string, and save the raw arguments into the buffer. The format and the
+arguments will be post-processed when the ring buffer is read. This way the
+trace_printk() format conversions are not done in the hot path, where the
+trace is being recorded.
+
+trace_printk() is meant only for debugging, and should never be added into
+a subsystem of the kernel. If you need debugging traces, add trace events
+instead. If a trace_printk() is found in the kernel, the following will
+appear in dmesg::
+
+  **********************************************************
+  **   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **
+  **                                                      **
+  ** trace_printk() being used. Allocating extra memory.  **
+  **                                                      **
+  ** This means that this is a DEBUG kernel and it is     **
+  ** unsafe for production use.                           **
+  **                                                      **
+  ** If you see this message and you are not debugging    **
+  ** the kernel, report this immediately to your vendor!  **
+  **                                                      **
+  **   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **
+  **********************************************************
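[Editor's illustration, not part of the patch: a minimal sketch of dropping
trace_printk() into kernel code, here as a throwaway module. The module name
and values are hypothetical.]

        #include <linux/module.h>
        #include <linux/kernel.h>

        static int __init tp_demo_init(void)
        {
                int value = 42;

                /* Only the format pointer and raw args are recorded here;
                 * the string is formatted when the buffer is read. */
                trace_printk("entered demo init, value=%d\n", value);
                return 0;
        }

        static void __exit tp_demo_exit(void)
        {
                trace_printk("demo exit\n");
        }

        module_init(tp_demo_init);
        module_exit(tp_demo_exit);
        MODULE_LICENSE("GPL");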
+
+Debugging kernel crashes
+------------------------
+There are various methods of capturing the state of the system when a kernel
+crash occurs. This could be from the oops message in printk, or one could
+use kexec/kdump. But these just show what happened at the time of the crash.
+It can also be very useful to know what happened leading up to the crash.
+The tracing ring buffer, by default, is a circular buffer that will
+overwrite older events with newer ones. When a crash happens, the content of
+the ring buffer will be all the events that led up to the crash.
+
+There are several kernel command line parameters that can be used to help in
+this. The first is "ftrace_dump_on_oops". This will dump the tracing ring
+buffer to the console when an oops occurs. This can be useful if the console
+is being logged somewhere. If a serial console is used, it may be prudent to
+make sure the ring buffer is relatively small, otherwise the dumping of the
+ring buffer may take several minutes to hours to finish. Here's an example
+of the kernel command line::
+
+  ftrace_dump_on_oops trace_buf_size=50K
+
+Note, the tracing buffer is made up of per CPU buffers where each of these
+buffers is broken up into sub-buffers that are by default PAGE_SIZE. The
+trace_buf_size option above sets each of the per CPU buffers to 50K, so, on
+a machine with 8 CPUs, that's actually 400K total.
+
+Persistent buffers across boots
+-------------------------------
+If the system memory allows it, the tracing ring buffer can be specified at
+a specific location in memory. If the location is the same across boots and
+the memory is not modified, the tracing buffer can be retrieved from the
+following boot. There are two ways to reserve memory for the use of the ring
+buffer.
+
+The more reliable way (on x86) is to reserve memory with the "memmap" kernel
+command line option and then use that memory for the trace_instance. This
+requires a bit of knowledge of the physical memory layout of the system. The
+advantage of using this method is that the memory for the ring buffer will
+always be at the same location::
+
+  memmap=12M$0x284500000 trace_instance=boot_map@0x284500000:12M
+
+The memmap above reserves 12 megabytes of memory at the physical memory
+location 0x284500000. Then the trace_instance option will create a trace
+instance "boot_map" at that same location with the same amount of memory
+reserved. As the ring buffer is broken up into per CPU buffers, the 12
+megabytes will be split evenly between those CPUs. If you have 8 CPUs, each
+per CPU ring buffer will be 1.5 megabytes in size. Note that this also
+includes metadata, so the amount of memory actually used by the ring buffer
+will be slightly smaller.
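[Editor's illustration: the per-CPU split arithmetic above, as a trivial
userspace sketch; the 8-CPU count is just the example from the text.]

        #include <stdio.h>

        int main(void)
        {
                unsigned long total = 12UL * 1024 * 1024;  /* reserved size */
                int cpus = 8;                              /* example count */

                /* Each chunk also holds the meta structure and sub-buffer
                 * headers, so usable space is slightly below this figure. */
                printf("per-CPU chunk: %lu KB\n", (total / cpus) / 1024);
                return 0;
        }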
+
+Another more generic but less robust way to allocate a ring buffer mapping
+at boot is with the "reserve_mem" option::
+
+  reserve_mem=12M:4096:trace trace_instance=boot_map@trace
+
+The reserve_mem option above will find 12 megabytes that are available at
+boot up and align the region to 4096 bytes. It labels this memory "trace"
+so that it can be referenced by later command line options.
+
+The trace_instance option creates a "boot_map" instance and will use the
+memory reserved by reserve_mem that was labeled as "trace". This method is
+more generic but may not be as reliable. Due to KASLR, the memory reserved
+by reserve_mem may not be at the same location on each boot. If this
+happens, the ring buffer will not contain the data from the previous boot
+and will be reset.
+
+Sometimes a larger alignment can keep KASLR from moving things around in a
+way that changes the location of the reserve_mem region. With a larger
+alignment, you may find that the buffer is placed more consistently::
+
+  reserve_mem=12M:0x2000000:trace trace_instance=boot_map@trace
+
+On boot up, the memory reserved for the ring buffer is validated. It will go
+through a series of tests to make sure that the ring buffer contains valid
+data. If it does, the ring buffer is set up to be available to read from the
+instance. If it fails any of the tests, the entire ring buffer is cleared
+and initialized as new.
+
+The layout of this mapped memory may not be consistent from kernel to
+kernel, so only the same kernel is guaranteed to work if the mapping is
+preserved. Switching to a different kernel version may find a different
+layout and mark the buffer as invalid.
+
+Using trace_printk() in the boot instance
+-----------------------------------------
+By default, the content of trace_printk() goes into the top level tracing
+instance. But this instance is never preserved across boots. To have the
+trace_printk() content, as well as some other internal tracing (such as
+stack dumps), go to the preserved buffer, either set the instance to be the
+trace_printk() destination from the kernel command line, or set it after
+boot up via the trace_printk_dest option.
+
+After boot up::
+
+  echo 1 > /sys/kernel/tracing/instances/boot_map/options/trace_printk_dest
+
+From the kernel command line::
+
+  reserve_mem=12M:4096:trace trace_instance=boot_map^traceprintk^traceoff@trace
+
+If setting it from the kernel command line, it is recommended to also
+disable tracing with the "traceoff" flag, and enable tracing after boot up.
+Otherwise the trace from the most recent boot will be mixed with the trace
+from the previous boot, which can make it confusing to read.
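[Editor's illustration: a minimal userspace program that dumps the preserved
instance after a reboot, using the "boot_map" path from the examples above.]

        #include <stdio.h>
        #include <stdlib.h>

        int main(void)
        {
                const char *path = "/sys/kernel/tracing/instances/boot_map/trace";
                char line[4096];
                FILE *fp = fopen(path, "r");

                if (!fp) {
                        perror(path);
                        return EXIT_FAILURE;
                }

                /* Print the events recorded by the previous boot */
                while (fgets(line, sizeof(line), fp))
                        fputs(line, stdout);

                fclose(fp);
                return EXIT_SUCCESS;
        }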
diff --git a/Documentation/trace/ftrace.rst b/Documentation/trace/ftrace.rst
index 5aba74872ba7..4073ca48af4a 100644
--- a/Documentation/trace/ftrace.rst
+++ b/Documentation/trace/ftrace.rst
@@ -1186,6 +1186,18 @@ Here are the available options:
 	trace_printk
 		Can disable trace_printk() from writing into the buffer.
 
+	trace_printk_dest
+		Set to have trace_printk() and similar internal tracing
+		functions write into this instance. Note, only one trace
+		instance can have this set. Setting this flag clears the
+		trace_printk_dest flag of the instance that previously had
+		it set. By default, the top level trace has this set, and
+		it will get set again if another instance that set it later
+		clears it.
+
+		This flag cannot be cleared by the top level instance, as it
+		is the default instance. The only way the top level instance
+		has this flag cleared is by it being set in another instance.
+
 	annotate
 		It is sometimes confusing when the CPU buffers are full
 		and one CPU buffer had a lot of events recently, thus
diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index fd35d4ec12e1..17fbb7855295 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -89,6 +89,14 @@ void ring_buffer_discard_commit(struct trace_buffer *buffer,
 struct trace_buffer *
 __ring_buffer_alloc(unsigned long size, unsigned flags, struct lock_class_key *key);
 
+struct trace_buffer *__ring_buffer_alloc_range(unsigned long size, unsigned flags,
+					       int order, unsigned long start,
+					       unsigned long range_size,
+					       struct lock_class_key *key);
+
+bool ring_buffer_last_boot_delta(struct trace_buffer *buffer, long *text,
+				 long *data);
+
 /*
  * Because the ring buffer is generic, if other users of the ring buffer get
  * traced by ftrace, it can produce lockdep warnings. We need to keep each
@@ -100,6 +108,18 @@ __ring_buffer_alloc(unsigned long size, unsigned flags, struct lock_class_key *k
 	__ring_buffer_alloc((size), (flags), &__key);	\
 })
 
+/*
+ * Because the ring buffer is generic, if other users of the ring buffer get
+ * traced by ftrace, it can produce lockdep warnings. We need to keep each
+ * ring buffer's lock class separate.
+ */
+#define ring_buffer_alloc_range(size, flags, order, start, range_size)	\
+({									\
+	static struct lock_class_key __key;				\
+	__ring_buffer_alloc_range((size), (flags), (order), (start),	\
+				  (range_size), &__key);		\
+})
+
 typedef bool (*ring_buffer_cond_fn)(void *data);
 int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full,
 		     ring_buffer_cond_fn cond, void *data);
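[Editor's sketch of how a caller might use the new interface. The patch's
real caller is in kernel/trace/trace.c (not shown in this excerpt); the
function name here is hypothetical, and the return convention of
ring_buffer_last_boot_delta() is assumed from its prototype.]

        /* "start" is assumed to be the kernel virtual address of a reserved
         * region of "range_size" bytes, e.g. from reserve_mem handling. */
        static struct trace_buffer *alloc_boot_mapped_buffer(unsigned long start,
                                                             unsigned long range_size)
        {
                struct trace_buffer *buf;
                long text_delta, data_delta;

                /* order 0 selects PAGE_SIZE sub-buffers */
                buf = ring_buffer_alloc_range(range_size, RB_FL_OVERWRITE, 0,
                                              start, range_size);
                if (!buf)
                        return NULL;

                /* A valid previous boot leaves deltas for pointer fix-up */
                if (ring_buffer_last_boot_delta(buf, &text_delta, &data_delta))
                        pr_info("boot buffer deltas: text=%ld data=%ld\n",
                                text_delta, data_delta);

                return buf;
        }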
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index cebd879a30cb..77dc0b25140e 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -32,6 +32,8 @@
 #include <asm/local64.h>
 #include <asm/local.h>
 
+#include "trace.h"
+
 /*
  * The "absolute" timestamp in the buffer is only 59 bits.
  * If a clock has the 5 MSBs set, it needs to be saved and
@@ -42,6 +44,21 @@
 
 static void update_pages_handler(struct work_struct *work);
 
+#define RING_BUFFER_META_MAGIC	0xBADFEED
+
+struct ring_buffer_meta {
+	int		magic;
+	int		struct_size;
+	unsigned long	text_addr;
+	unsigned long	data_addr;
+	unsigned long	first_buffer;
+	unsigned long	head_buffer;
+	unsigned long	commit_buffer;
+	__u32		subbuf_size;
+	__u32		nr_subbufs;
+	int		buffers[];
+};
+
 /*
  * The ring buffer header is special. We must manually upkeep it.
  */
@@ -342,7 +359,8 @@ struct buffer_page {
 	local_t		 entries;	/* entries on this page */
 	unsigned long	 real_end;	/* real end of data */
 	unsigned	 order;		/* order of the page */
-	u32		 id;		/* ID for external mapping */
+	u32		 id:30;		/* ID for external mapping */
+	u32		 range:1;	/* Mapped via a range */
 	struct buffer_data_page *page;	/* Actual data page */
 };
 
@@ -373,7 +391,9 @@ static __always_inline unsigned int rb_page_commit(struct buffer_page *bpage)
 
 static void free_buffer_page(struct buffer_page *bpage)
 {
-	free_pages((unsigned long)bpage->page, bpage->order);
+	/* Range pages are not to be freed */
+	if (!bpage->range)
+		free_pages((unsigned long)bpage->page, bpage->order);
 	kfree(bpage);
 }
 
@@ -491,9 +511,11 @@ struct ring_buffer_per_cpu {
 	unsigned long			pages_removed;
 
 	unsigned int			mapped;
+	unsigned int			user_mapped;	/* user space mapping */
 	struct mutex			mapping_lock;
 	unsigned long			*subbuf_ids;	/* ID to subbuf VA */
 	struct trace_buffer_meta	*meta_page;
+	struct ring_buffer_meta		*ring_meta;
 
 	/* ring buffer pages to update, > 0 to add, < 0 to remove */
 	long				nr_pages_to_update;
@@ -523,6 +545,12 @@ struct trace_buffer {
 	struct rb_irq_work		irq_work;
 	bool				time_stamp_abs;
 
+	unsigned long			range_addr_start;
+	unsigned long			range_addr_end;
+
+	long				last_text_delta;
+	long				last_data_delta;
+
 	unsigned int			subbuf_size;
 	unsigned int			subbuf_order;
 	unsigned int			max_data_size;
@@ -1239,6 +1267,11 @@ static void rb_head_page_activate(struct ring_buffer_per_cpu *cpu_buffer)
 	 * Set the previous list pointer to have the HEAD flag.
 	 */
 	rb_set_list_to_head(head->list.prev);
+
+	if (cpu_buffer->ring_meta) {
+		struct ring_buffer_meta *meta = cpu_buffer->ring_meta;
+		meta->head_buffer = (unsigned long)head->page;
+	}
 }
 
 static void rb_list_head_clear(struct list_head *list)
@@ -1478,9 +1511,484 @@ static void rb_check_pages(struct ring_buffer_per_cpu *cpu_buffer)
 	}
 }
 
+/*
+ * Take an address, add the meta data size as well as the array of
+ * subbuffer indexes, then align it to a subbuffer size.
+ *
+ * This is used to help find the next per cpu subbuffer within a mapped range.
+ */
+static unsigned long
+rb_range_align_subbuf(unsigned long addr, int subbuf_size, int nr_subbufs)
+{
+	addr += sizeof(struct ring_buffer_meta) +
+		sizeof(int) * nr_subbufs;
+	return ALIGN(addr, subbuf_size);
+}
+
+/*
+ * Return the ring_buffer_meta for a given @cpu.
+ */
+static void *rb_range_meta(struct trace_buffer *buffer, int nr_pages, int cpu)
+{
+	int subbuf_size = buffer->subbuf_size + BUF_PAGE_HDR_SIZE;
+	unsigned long ptr = buffer->range_addr_start;
+	struct ring_buffer_meta *meta;
+	int nr_subbufs;
+
+	if (!ptr)
+		return NULL;
+
+	/* When nr_pages passed in is zero, the first meta has already been initialized */
+	if (!nr_pages) {
+		meta = (struct ring_buffer_meta *)ptr;
+		nr_subbufs = meta->nr_subbufs;
+	} else {
+		meta = NULL;
+		/* Include the reader page */
+		nr_subbufs = nr_pages + 1;
+	}
+
+	/*
+	 * The first chunk may not be subbuffer aligned, whereas
+	 * the rest of the chunks are.
+	 */
+	if (cpu) {
+		ptr = rb_range_align_subbuf(ptr, subbuf_size, nr_subbufs);
+		ptr += subbuf_size * nr_subbufs;
+
+		/* We can use multiplication to find chunks greater than 1 */
+		if (cpu > 1) {
+			unsigned long size;
+			unsigned long p;
+
+			/* Save the beginning of this CPU chunk */
+			p = ptr;
+			ptr = rb_range_align_subbuf(ptr, subbuf_size, nr_subbufs);
+			ptr += subbuf_size * nr_subbufs;
+
+			/* Now all chunks after this are the same size */
+			size = ptr - p;
+			ptr += size * (cpu - 2);
+		}
+	}
+	return (void *)ptr;
+}
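[Editor's illustration of the address arithmetic above, as a standalone
userspace program. The meta struct size (64) and subbuffer size (4096) are
assumptions for the example; the real values come from the kernel.]

        #include <stdio.h>

        #define ALIGN(x, a)  (((x) + (a) - 1) & ~((unsigned long)(a) - 1))

        static unsigned long align_subbuf(unsigned long addr, unsigned long meta_size,
                                          int subbuf_size, int nr_subbufs)
        {
                addr += meta_size + sizeof(int) * nr_subbufs;
                return ALIGN(addr, subbuf_size);
        }

        int main(void)
        {
                unsigned long ptr = 0x284500000UL; /* memmap example address */
                unsigned long meta_size = 64;      /* assumed struct size */
                int subbuf_size = 4096;
                int nr_subbufs = 9;                /* 8 pages + reader page */

                /* Each chunk: meta + index array, aligned, then subbufs;
                 * the next CPU's meta starts right after. */
                for (int cpu = 0; cpu < 4; cpu++) {
                        printf("cpu %d meta at 0x%lx\n", cpu, ptr);
                        ptr = align_subbuf(ptr, meta_size, subbuf_size, nr_subbufs);
                        ptr += (unsigned long)subbuf_size * nr_subbufs;
                }
                return 0;
        }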
+
+/* Return the start of subbufs given the meta pointer */
+static void *rb_subbufs_from_meta(struct ring_buffer_meta *meta)
+{
+	int subbuf_size = meta->subbuf_size;
+	unsigned long ptr;
+
+	ptr = (unsigned long)meta;
+	ptr = rb_range_align_subbuf(ptr, subbuf_size, meta->nr_subbufs);
+
+	return (void *)ptr;
+}
+
+/*
+ * Return a specific sub-buffer for a given @cpu defined by @idx.
+ */
+static void *rb_range_buffer(struct ring_buffer_per_cpu *cpu_buffer, int idx)
+{
+	struct ring_buffer_meta *meta;
+	unsigned long ptr;
+	int subbuf_size;
+
+	meta = rb_range_meta(cpu_buffer->buffer, 0, cpu_buffer->cpu);
+	if (!meta)
+		return NULL;
+
+	if (WARN_ON_ONCE(idx >= meta->nr_subbufs))
+		return NULL;
+
+	subbuf_size = meta->subbuf_size;
+
+	/* Map this buffer to the order that's in meta->buffers[] */
+	idx = meta->buffers[idx];
+
+	ptr = (unsigned long)rb_subbufs_from_meta(meta);
+
+	ptr += subbuf_size * idx;
+	if (ptr + subbuf_size > cpu_buffer->buffer->range_addr_end)
+		return NULL;
+
+	return (void *)ptr;
+}
+
+/*
+ * See if the existing memory contains valid ring buffer data.
+ * As the previous kernel must be the same as this kernel, all
+ * the calculations (size of buffers and number of buffers)
+ * must be the same.
+ */
+static bool rb_meta_valid(struct ring_buffer_meta *meta, int cpu,
+			  struct trace_buffer *buffer, int nr_pages)
+{
+	int subbuf_size = PAGE_SIZE;
+	struct buffer_data_page *subbuf;
+	unsigned long buffers_start;
+	unsigned long buffers_end;
+	int i;
+
+	/* Check the meta magic and meta struct size */
+	if (meta->magic != RING_BUFFER_META_MAGIC ||
+	    meta->struct_size != sizeof(*meta)) {
+		pr_info("Ring buffer boot meta[%d] mismatch of magic or struct size\n", cpu);
+		return false;
+	}
+
+	/* The subbuffer's size and number of subbuffers must match */
+	if (meta->subbuf_size != subbuf_size ||
+	    meta->nr_subbufs != nr_pages + 1) {
+		pr_info("Ring buffer boot meta [%d] mismatch of subbuf_size/nr_pages\n", cpu);
+		return false;
+	}
+
+	buffers_start = meta->first_buffer;
+	buffers_end = meta->first_buffer + (subbuf_size * meta->nr_subbufs);
+
+	/* Are the head and commit buffers within the range of buffers? */
+	if (meta->head_buffer < buffers_start ||
+	    meta->head_buffer >= buffers_end) {
+		pr_info("Ring buffer boot meta [%d] head buffer out of range\n", cpu);
+		return false;
+	}
+
+	if (meta->commit_buffer < buffers_start ||
+	    meta->commit_buffer >= buffers_end) {
+		pr_info("Ring buffer boot meta [%d] commit buffer out of range\n", cpu);
+		return false;
+	}
+
+	subbuf = rb_subbufs_from_meta(meta);
+
+	/* Do the meta buffers and the subbufs themselves have correct data? */
+	for (i = 0; i < meta->nr_subbufs; i++) {
+		if (meta->buffers[i] < 0 ||
+		    meta->buffers[i] >= meta->nr_subbufs) {
+			pr_info("Ring buffer boot meta [%d] array out of range\n", cpu);
+			return false;
+		}
+
+		if ((unsigned)local_read(&subbuf->commit) > subbuf_size) {
+			pr_info("Ring buffer boot meta [%d] buffer invalid commit\n", cpu);
+			return false;
+		}
+
+		subbuf = (void *)subbuf + subbuf_size;
+	}
+
+	return true;
+}
+
+static int rb_meta_subbuf_idx(struct ring_buffer_meta *meta, void *subbuf);
+
+static int rb_read_data_buffer(struct buffer_data_page *dpage, int tail, int cpu,
+			       unsigned long long *timestamp, u64 *delta_ptr)
+{
+	struct ring_buffer_event *event;
+	u64 ts, delta;
+	int events = 0;
+	int e;
+
+	*delta_ptr = 0;
+	*timestamp = 0;
+
+	ts = dpage->time_stamp;
+
+	for (e = 0; e < tail; e += rb_event_length(event)) {
+
+		event = (struct ring_buffer_event *)(dpage->data + e);
+
+		switch (event->type_len) {
+
+		case RINGBUF_TYPE_TIME_EXTEND:
+			delta = rb_event_time_stamp(event);
+			ts += delta;
+			break;
+
+		case RINGBUF_TYPE_TIME_STAMP:
+			delta = rb_event_time_stamp(event);
+			delta = rb_fix_abs_ts(delta, ts);
+			if (delta < ts) {
+				*delta_ptr = delta;
+				*timestamp = ts;
+				return -1;
+			}
+			ts = delta;
+			break;
+
+		case RINGBUF_TYPE_PADDING:
+			if (event->time_delta == 1)
+				break;
+			fallthrough;
+		case RINGBUF_TYPE_DATA:
+			events++;
+			ts += event->time_delta;
+			break;
+
+		default:
+			return -1;
+		}
+	}
+	*timestamp = ts;
+	return events;
+}
+
+static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu)
+{
+	unsigned long long ts;
+	u64 delta;
+	int tail;
+
+	tail = local_read(&dpage->commit);
+	return rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
+}
+
+/* If the meta data has been validated, now validate the events */
+static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	struct ring_buffer_meta *meta = cpu_buffer->ring_meta;
+	struct buffer_page *head_page;
+	unsigned long entry_bytes = 0;
+	unsigned long entries = 0;
+	int ret;
+	int i;
+
+	if (!meta || !meta->head_buffer)
+		return;
+
+	/* Do the reader page first */
+	ret = rb_validate_buffer(cpu_buffer->reader_page->page, cpu_buffer->cpu);
+	if (ret < 0) {
+		pr_info("Ring buffer reader page is invalid\n");
+		goto invalid;
+	}
+	entries += ret;
+	entry_bytes += local_read(&cpu_buffer->reader_page->page->commit);
+	local_set(&cpu_buffer->reader_page->entries, ret);
+
+	head_page = cpu_buffer->head_page;
+
+	/* If both the head and commit are on the reader_page then we are done. */
+	if (head_page == cpu_buffer->reader_page &&
+	    head_page == cpu_buffer->commit_page)
+		goto done;
+
+	/* Iterate until finding the commit page */
+	for (i = 0; i < meta->nr_subbufs + 1; i++, rb_inc_page(&head_page)) {
+
+		/* Reader page has already been done */
+		if (head_page == cpu_buffer->reader_page)
+			continue;
+
+		ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu);
+		if (ret < 0) {
+			pr_info("Ring buffer meta [%d] invalid buffer page\n",
+				cpu_buffer->cpu);
+			goto invalid;
+		}
+		entries += ret;
+		entry_bytes += local_read(&head_page->page->commit);
+		local_set(&head_page->entries, ret);
+
+		if (head_page == cpu_buffer->commit_page)
+			break;
+	}
+
+	if (head_page != cpu_buffer->commit_page) {
+		pr_info("Ring buffer meta [%d] commit page not found\n",
+			cpu_buffer->cpu);
+		goto invalid;
+	}
+ done:
+	local_set(&cpu_buffer->entries, entries);
+	local_set(&cpu_buffer->entries_bytes, entry_bytes);
+
+	pr_info("Ring buffer meta [%d] is from previous boot!\n", cpu_buffer->cpu);
+	return;
+
+ invalid:
+	/* The content of the buffers are invalid, reset the meta data */
+	meta->head_buffer = 0;
+	meta->commit_buffer = 0;
+
+	/* Reset the reader page */
+	local_set(&cpu_buffer->reader_page->entries, 0);
+	local_set(&cpu_buffer->reader_page->page->commit, 0);
+
+	/* Reset all the subbuffers */
+	for (i = 0; i < meta->nr_subbufs - 1; i++, rb_inc_page(&head_page)) {
+		local_set(&head_page->entries, 0);
+		local_set(&head_page->page->commit, 0);
+	}
+}
+
+/* Used to calculate data delta */
+static char rb_data_ptr[] = "";
+
+#define THIS_TEXT_PTR		((unsigned long)rb_meta_init_text_addr)
+#define THIS_DATA_PTR		((unsigned long)rb_data_ptr)
+
+static void rb_meta_init_text_addr(struct ring_buffer_meta *meta)
+{
+	meta->text_addr = THIS_TEXT_PTR;
+	meta->data_addr = THIS_DATA_PTR;
+}
+
+static void rb_range_meta_init(struct trace_buffer *buffer, int nr_pages)
+{
+	struct ring_buffer_meta *meta;
+	unsigned long delta;
+	void *subbuf;
+	int cpu;
+	int i;
+
+	for (cpu = 0; cpu < nr_cpu_ids; cpu++) {
+		void *next_meta;
+
+		meta = rb_range_meta(buffer, nr_pages, cpu);
+
+		if (rb_meta_valid(meta, cpu, buffer, nr_pages)) {
+			/* Make the mappings match the current address */
+			subbuf = rb_subbufs_from_meta(meta);
+			delta = (unsigned long)subbuf - meta->first_buffer;
+			meta->first_buffer += delta;
+			meta->head_buffer += delta;
+			meta->commit_buffer += delta;
+			buffer->last_text_delta = THIS_TEXT_PTR - meta->text_addr;
+			buffer->last_data_delta = THIS_DATA_PTR - meta->data_addr;
+			continue;
+		}
+
+		if (cpu < nr_cpu_ids - 1)
+			next_meta = rb_range_meta(buffer, nr_pages, cpu + 1);
+		else
+			next_meta = (void *)buffer->range_addr_end;
+
+		memset(meta, 0, next_meta - (void *)meta);
+
+		meta->magic = RING_BUFFER_META_MAGIC;
+		meta->struct_size = sizeof(*meta);
+
+		meta->nr_subbufs = nr_pages + 1;
+		meta->subbuf_size = PAGE_SIZE;
+
+		subbuf = rb_subbufs_from_meta(meta);
+
+		meta->first_buffer = (unsigned long)subbuf;
+		rb_meta_init_text_addr(meta);
+
+		/*
+		 * The buffers[] array holds the order of the sub-buffers
+		 * that are after the meta data. The sub-buffers may
+		 * be swapped out when read and inserted into a different
+		 * location of the ring buffer. Although their addresses
+		 * remain the same, the buffers[] array contains the
+		 * index into the sub-buffers holding their actual order.
+		 */
+		for (i = 0; i < meta->nr_subbufs; i++) {
+			meta->buffers[i] = i;
+			rb_init_page(subbuf);
+			subbuf += meta->subbuf_size;
+		}
+	}
+}
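[Editor's illustration of what the text delta computed above buys a later
boot: adding it to an address recorded by the previous kernel yields the
address in this boot's (KASLR-shifted) layout. All addresses here are made
up for the example.]

        #include <stdio.h>

        int main(void)
        {
                unsigned long old_anchor = 0xffffffff81000000UL; /* stored in meta */
                unsigned long new_anchor = 0xffffffff83000000UL; /* this boot */
                long last_text_delta = (long)(new_anchor - old_anchor);
                unsigned long old_ptr = 0xffffffff81123456UL;    /* recorded */

                /* The shifted pointer can then be resolved with %pS */
                printf("resolves in this boot at: 0x%lx\n",
                       old_ptr + last_text_delta);
                return 0;
        }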
+
+static void *rbm_start(struct seq_file *m, loff_t *pos)
+{
+	struct ring_buffer_per_cpu *cpu_buffer = m->private;
+	struct ring_buffer_meta *meta = cpu_buffer->ring_meta;
+	unsigned long val;
+
+	if (!meta)
+		return NULL;
+
+	if (*pos > meta->nr_subbufs)
+		return NULL;
+
+	val = *pos;
+	val++;
+
+	return (void *)val;
+}
+
+static void *rbm_next(struct seq_file *m, void *v, loff_t *pos)
+{
+	(*pos)++;
+
+	return rbm_start(m, pos);
+}
+
+static int rbm_show(struct seq_file *m, void *v)
+{
+	struct ring_buffer_per_cpu *cpu_buffer = m->private;
+	struct ring_buffer_meta *meta = cpu_buffer->ring_meta;
+	unsigned long val = (unsigned long)v;
+
+	if (val == 1) {
+		seq_printf(m, "head_buffer:   %d\n",
+			   rb_meta_subbuf_idx(meta, (void *)meta->head_buffer));
+		seq_printf(m, "commit_buffer: %d\n",
+			   rb_meta_subbuf_idx(meta, (void *)meta->commit_buffer));
+		seq_printf(m, "subbuf_size:   %d\n", meta->subbuf_size);
+		seq_printf(m, "nr_subbufs:    %d\n", meta->nr_subbufs);
+		return 0;
+	}
+
+	val -= 2;
+	seq_printf(m, "buffer[%ld]:    %d\n", val, meta->buffers[val]);
+
+	return 0;
+}
+
+static void rbm_stop(struct seq_file *m, void *p)
+{
+}
+
+static const struct seq_operations rb_meta_seq_ops = {
+	.start		= rbm_start,
+	.next		= rbm_next,
+	.show		= rbm_show,
+	.stop		= rbm_stop,
+};
+
+int ring_buffer_meta_seq_init(struct file *file, struct trace_buffer *buffer, int cpu)
+{
+	struct seq_file *m;
+	int ret;
+
+	ret = seq_open(file, &rb_meta_seq_ops);
+	if (ret)
+		return ret;
+
+	m = file->private_data;
+	m->private = buffer->buffers[cpu];
+
+	return 0;
+}
+
+/* Map the buffer_pages to the previous head and commit pages */
+static void rb_meta_buffer_update(struct ring_buffer_per_cpu *cpu_buffer,
+				  struct buffer_page *bpage)
+{
+	struct ring_buffer_meta *meta = cpu_buffer->ring_meta;
+
+	if (meta->head_buffer == (unsigned long)bpage->page)
+		cpu_buffer->head_page = bpage;
+
+	if (meta->commit_buffer == (unsigned long)bpage->page) {
+		cpu_buffer->commit_page = bpage;
+		cpu_buffer->tail_page = bpage;
+	}
+}
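[Editor's sketch of how ring_buffer_meta_seq_init() above might be wired to a
tracefs file. The callback name and the way "buffer" and "cpu" are obtained
are hypothetical; the real hookup lives in kernel/trace/trace.c, which this
excerpt does not show.]

        /* Hypothetical tracefs open callback for a per-CPU meta file */
        static int buffer_meta_open(struct inode *inode, struct file *filp)
        {
                struct trace_buffer *buffer = inode->i_private; /* assumed */
                int cpu = 0;                                    /* assumed */

                /* Attaches rb_meta_seq_ops and points the seq_file at the
                 * requested CPU's ring_buffer_per_cpu */
                return ring_buffer_meta_seq_init(filp, buffer, cpu);
        }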
+
 static int __rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer,
 			       long nr_pages, struct list_head *pages)
 {
+	struct trace_buffer *buffer = cpu_buffer->buffer;
+	struct ring_buffer_meta *meta = NULL;
 	struct buffer_page *bpage, *tmp;
 	bool user_thread = current->mm != NULL;
 	gfp_t mflags;
@@ -1515,6 +2023,10 @@ static int __rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer,
 	 */
 	if (user_thread)
 		set_current_oom_origin();
+
+	if (buffer->range_addr_start)
+		meta = rb_range_meta(buffer, nr_pages, cpu_buffer->cpu);
+
 	for (i = 0; i < nr_pages; i++) {
 		struct page *page;
 
@@ -1525,16 +2037,32 @@ static int __rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer,
 
 		rb_check_bpage(cpu_buffer, bpage);
 
-		list_add(&bpage->list, pages);
-
-		page = alloc_pages_node(cpu_to_node(cpu_buffer->cpu),
-					mflags | __GFP_COMP | __GFP_ZERO,
-					cpu_buffer->buffer->subbuf_order);
-		if (!page)
-			goto free_pages;
-		bpage->page = page_address(page);
+		/*
+		 * Append the pages, as for mapped buffers we want to keep
+		 * the order
+		 */
+		list_add_tail(&bpage->list, pages);
+
+		if (meta) {
+			/* A range was given. Use that for the buffer page */
+			bpage->page = rb_range_buffer(cpu_buffer, i + 1);
+			if (!bpage->page)
+				goto free_pages;
+			/* If this is valid from a previous boot */
+			if (meta->head_buffer)
+				rb_meta_buffer_update(cpu_buffer, bpage);
+			bpage->range = 1;
+			bpage->id = i + 1;
+		} else {
+			page = alloc_pages_node(cpu_to_node(cpu_buffer->cpu),
+						mflags | __GFP_COMP | __GFP_ZERO,
+						cpu_buffer->buffer->subbuf_order);
+			if (!page)
+				goto free_pages;
+			bpage->page = page_address(page);
+			rb_init_page(bpage->page);
+		}
 		bpage->order = cpu_buffer->buffer->subbuf_order;
-		rb_init_page(bpage->page);
 
 		if (user_thread && fatal_signal_pending(current))
 			goto free_pages;
@@ -1584,6 +2112,7 @@ static struct ring_buffer_per_cpu *
 rb_allocate_cpu_buffer(struct trace_buffer *buffer, long nr_pages, int cpu)
 {
 	struct ring_buffer_per_cpu *cpu_buffer;
+	struct ring_buffer_meta *meta;
 	struct buffer_page *bpage;
 	struct page *page;
 	int ret;
@@ -1614,12 +2143,28 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long nr_pages, int cpu)
 
 	cpu_buffer->reader_page = bpage;
 
-	page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_COMP | __GFP_ZERO,
-				cpu_buffer->buffer->subbuf_order);
-	if (!page)
-		goto fail_free_reader;
-	bpage->page = page_address(page);
-	rb_init_page(bpage->page);
+	if (buffer->range_addr_start) {
+		/*
+		 * Range mapped buffers have the same restrictions as memory
+		 * mapped ones do.
+		 */
+		cpu_buffer->mapped = 1;
+		cpu_buffer->ring_meta = rb_range_meta(buffer, nr_pages, cpu);
+		bpage->page = rb_range_buffer(cpu_buffer, 0);
+		if (!bpage->page)
+			goto fail_free_reader;
+		if (cpu_buffer->ring_meta->head_buffer)
+			rb_meta_buffer_update(cpu_buffer, bpage);
+		bpage->range = 1;
+	} else {
+		page = alloc_pages_node(cpu_to_node(cpu),
+					GFP_KERNEL | __GFP_COMP | __GFP_ZERO,
+					cpu_buffer->buffer->subbuf_order);
+		if (!page)
+			goto fail_free_reader;
+		bpage->page = page_address(page);
+		rb_init_page(bpage->page);
+	}
 
 	INIT_LIST_HEAD(&cpu_buffer->reader_page->list);
 	INIT_LIST_HEAD(&cpu_buffer->new_pages);
@@ -1628,11 +2173,35 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long nr_pages, int cpu)
 	if (ret < 0)
 		goto fail_free_reader;
 
-	cpu_buffer->head_page
-		= list_entry(cpu_buffer->pages, struct buffer_page, list);
-	cpu_buffer->tail_page = cpu_buffer->commit_page = cpu_buffer->head_page;
+	rb_meta_validate_events(cpu_buffer);
+
+	/* If the boot meta was valid then this has already been updated */
+	meta = cpu_buffer->ring_meta;
+	if (!meta || !meta->head_buffer ||
+	    !cpu_buffer->head_page || !cpu_buffer->commit_page || !cpu_buffer->tail_page) {
+		if (meta && meta->head_buffer &&
+		    (cpu_buffer->head_page || cpu_buffer->commit_page || cpu_buffer->tail_page)) {
+			pr_warn("Ring buffer meta buffers not all mapped\n");
+			if (!cpu_buffer->head_page)
+				pr_warn(" Missing head_page\n");
+			if (!cpu_buffer->commit_page)
+				pr_warn(" Missing commit_page\n");
+			if (!cpu_buffer->tail_page)
+				pr_warn(" Missing tail_page\n");
+		}
 
-	rb_head_page_activate(cpu_buffer);
+		cpu_buffer->head_page
+			= list_entry(cpu_buffer->pages, struct buffer_page, list);
+		cpu_buffer->tail_page = cpu_buffer->commit_page = cpu_buffer->head_page;
+
+		rb_head_page_activate(cpu_buffer);
+
+		if (cpu_buffer->ring_meta)
+			meta->commit_buffer = meta->head_buffer;