summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2024-03-10Merge tag 'trace-ring-buffer-v6.8-rc7' of ↵Linus Torvalds3-93/+113
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Do not allow large strings (> 4096) as single write to trace_marker The size of a string written into trace_marker was determined by the size of the sub-buffer in the ring buffer. That size is dependent on the PAGE_SIZE of the architecture as it can be mapped into user space. But on PowerPC, where PAGE_SIZE is 64K, that made the limit of the string of writing into trace_marker 64K. One of the selftests looks at the size of the ring buffer sub-buffers and writes that plus more into the trace_marker. The write will take what it can and report back what it consumed so that the user space application (like echo) will write the rest of the string. The string is stored in the ring buffer and can be read via the "trace" or "trace_pipe" files. The reading of the ring buffer uses vsnprintf(), which uses a precision "%.*s" to make sure it only reads what is stored in the buffer, as a bug could cause the string to be non terminated. With the combination of the precision change and the PAGE_SIZE of 64K allowing huge strings to be added into the ring buffer, plus the test that would actually stress that limit, a bug was reported that the precision used was too big for "%.*s" as the string was close to 64K in size and the max precision of vsnprintf is 32K. Linus suggested not to have that precision as it could hide a bug if the string was again stored without a nul byte. Another issue that was brought up is that the trace_seq buffer is also based on PAGE_SIZE even though it is not tied to the architecture limit like the ring buffer sub-buffer is. Having it be 64K * 2 is simply just too big and wasting memory on systems with 64K page sizes. It is now hardcoded to 8K which is what all other architectures with 4K PAGE_SIZE has. Finally, the write to trace_marker is now limited to 4K as there is no reason to write larger strings into trace_marker. - ring_buffer_wait() should not loop. The ring_buffer_wait() does not have the full context (yet) on if it should loop or not. Just exit the loop as soon as its woken up and let the callers decide to loop or not (they already do, so it's a bit redundant). - Fix shortest_full field to be the smallest amount in the ring buffer that a waiter is waiting for. The "shortest_full" field is updated when a new waiter comes in and wants to wait for a smaller amount of data in the ring buffer than other waiters. But after all waiters are woken up, it's not reset, so if another waiter comes in wanting to wait for more data, it will be woken up when the ring buffer has a smaller amount from what the previous waiters were waiting for. - The wake up all waiters on close is incorrectly called frome .release() and not from .flush() so it will never wake up any waiters as the .release() will not get called until all .read() calls are finished. And the wakeup is for the waiters in those .read() calls. * tag 'trace-ring-buffer-v6.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Use .flush() call to wake up readers ring-buffer: Fix resetting of shortest_full ring-buffer: Fix waking up ring buffer readers tracing: Limit trace_marker writes to just 4K tracing: Limit trace_seq size to just 8K and not depend on architecture PAGE_SIZE tracing: Remove precision vsnprintf() check from print event
2024-03-10tracing: Use .flush() call to wake up readersSteven Rostedt (Google)1-6/+15
The .release() function does not get called until all readers of a file descriptor are finished. If a thread is blocked on reading a file descriptor in ring_buffer_wait(), and another thread closes the file descriptor, it will not wake up the other thread as ring_buffer_wake_waiters() is called by .release(), and that will not get called until the .read() is finished. The issue originally showed up in trace-cmd, but the readers are actually other processes with their own file descriptors. So calling close() would wake up the other tasks because they are blocked on another descriptor then the one that was closed(). But there's other wake ups that solve that issue. When a thread is blocked on a read, it can still hang even when another thread closed its descriptor. This is what the .flush() callback is for. Have the .flush() wake up the readers. Link: https://lore.kernel.org/linux-trace-kernel/20240308202432.107909457@goodmis.org Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linke li <lilinke99@qq.com> Cc: Rabin Vincent <rabin@rab.in> Fixes: f3ddb74ad0790 ("tracing: Wake up ring buffer waiters on closing of the file") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-10ring-buffer: Fix resetting of shortest_fullSteven Rostedt (Google)1-7/+23
The "shortest_full" variable is used to keep track of the waiter that is waiting for the smallest amount on the ring buffer before being woken up. When a tasks waits on the ring buffer, it passes in a "full" value that is a percentage. 0 means wake up on any data. 1-100 means wake up from 1% to 100% full buffer. As all waiters are on the same wait queue, the wake up happens for the waiter with the smallest percentage. The problem is that the smallest_full on the cpu_buffer that stores the smallest amount doesn't get reset when all the waiters are woken up. It does get reset when the ring buffer is reset (echo > /sys/kernel/tracing/trace). This means that tasks may be woken up more often then when they want to be. Instead, have the shortest_full field get reset just before waking up all the tasks. If the tasks wait again, they will update the shortest_full before sleeping. Also add locking around setting of shortest_full in the poll logic, and change "work" to "rbwork" to match the variable name for rb_irq_work structures that are used in other places. Link: https://lore.kernel.org/linux-trace-kernel/20240308202431.948914369@goodmis.org Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linke li <lilinke99@qq.com> Cc: Rabin Vincent <rabin@rab.in> Fixes: 2c2b0a78b3739 ("ring-buffer: Add percentage of ring buffer full to wake up reader") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-10ring-buffer: Fix waking up ring buffer readersSteven Rostedt (Google)1-71/+68
A task can wait on a ring buffer for when it fills up to a specific watermark. The writer will check the minimum watermark that waiters are waiting for and if the ring buffer is past that, it will wake up all the waiters. The waiters are in a wait loop, and will first check if a signal is pending and then check if the ring buffer is at the desired level where it should break out of the loop. If a file that uses a ring buffer closes, and there's threads waiting on the ring buffer, it needs to wake up those threads. To do this, a "wait_index" was used. Before entering the wait loop, the waiter will read the wait_index. On wakeup, it will check if the wait_index is different than when it entered the loop, and will exit the loop if it is. The waker will only need to update the wait_index before waking up the waiters. This had a couple of bugs. One trivial one and one broken by design. The trivial bug was that the waiter checked the wait_index after the schedule() call. It had to be checked between the prepare_to_wait() and the schedule() which it was not. The main bug is that the first check to set the default wait_index will always be outside the prepare_to_wait() and the schedule(). That's because the ring_buffer_wait() doesn't have enough context to know if it should break out of the loop. The loop itself is not needed, because all the callers to the ring_buffer_wait() also has their own loop, as the callers have a better sense of what the context is to decide whether to break out of the loop or not. Just have the ring_buffer_wait() block once, and if it gets woken up, exit the function and let the callers decide what to do next. Link: https://lore.kernel.org/all/CAHk-=whs5MdtNjzFkTyaUy=vHi=qwWgPi0JgTe6OYUYMNSRZfg@mail.gmail.com/ Link: https://lore.kernel.org/linux-trace-kernel/20240308202431.792933613@goodmis.org Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linke li <lilinke99@qq.com> Cc: Rabin Vincent <rabin@rab.in> Fixes: e30f53aad2202 ("tracing: Do not busy wait in buffer splice") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-07Merge tag 'net-6.8-rc8' of ↵Linus Torvalds2-1/+4
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from bpf, ipsec and netfilter. No solution yet for the stmmac issue mentioned in the last PR, but it proved to be a lockdep false positive, not a blocker. Current release - regressions: - dpll: move all dpll<>netdev helpers to dpll code, fix build regression with old compilers Current release - new code bugs: - page_pool: fix netlink dump stop/resume Previous releases - regressions: - bpf: fix verifier to check bpf_func_state->callback_depth when pruning states as otherwise unsafe programs could get accepted - ipv6: avoid possible UAF in ip6_route_mpath_notify() - ice: reconfig host after changing MSI-X on VF - mlx5: - e-switch, change flow rule destination checking - add a memory barrier to prevent a possible null-ptr-deref - switch to using _bh variant of of spinlock where needed Previous releases - always broken: - netfilter: nf_conntrack_h323: add protection for bmp length out of range - bpf: fix to zero-initialise xdp_rxq_info struct before running XDP program in CPU map which led to random xdp_md fields - xfrm: fix UDP encapsulation in TX packet offload - netrom: fix data-races around sysctls - ice: - fix potential NULL pointer dereference in ice_bridge_setlink() - fix uninitialized dplls mutex usage - igc: avoid returning frame twice in XDP_REDIRECT - i40e: disable NAPI right after disabling irqs when handling xsk_pool - geneve: make sure to pull inner header in geneve_rx() - sparx5: fix use after free inside sparx5_del_mact_entry - dsa: microchip: fix register write order in ksz8_ind_write8() Misc: - selftests: mptcp: fixes for diag.sh" * tag 'net-6.8-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (63 commits) net: pds_core: Fix possible double free in error handling path netrom: Fix data-races around sysctl_net_busy_read netrom: Fix a data-race around sysctl_netrom_link_fails_count netrom: Fix a data-race around sysctl_netrom_routing_control netrom: Fix a data-race around sysctl_netrom_transport_no_activity_timeout netrom: Fix a data-race around sysctl_netrom_transport_requested_window_size netrom: Fix a data-race around sysctl_netrom_transport_busy_delay netrom: Fix a data-race around sysctl_netrom_transport_acknowledge_delay netrom: Fix a data-race around sysctl_netrom_transport_maximum_tries netrom: Fix a data-race around sysctl_netrom_transport_timeout netrom: Fix data-races around sysctl_netrom_network_ttl_initialiser netrom: Fix a data-race around sysctl_netrom_obsolescence_count_initialiser netrom: Fix a data-race around sysctl_netrom_default_path_quality netfilter: nf_conntrack_h323: Add protection for bmp length out of range netfilter: nf_tables: mark set as dead when unbinding anonymous set with timeout netfilter: nft_ct: fix l3num expectations with inet pseudo family netfilter: nf_tables: reject constant set with timeout netfilter: nf_tables: disallow anonymous set with timeout flag net/rds: fix WARNING in rds_conn_connect_if_down net: dsa: microchip: fix register write order in ksz8_ind_write8() ...
2024-03-06tracing: Limit trace_marker writes to just 4KSteven Rostedt (Google)1-5/+5
Limit the max print event of trace_marker to just 4K string size. This must also be less than the amount that can be held by a trace_seq along with the text that is before the output (like the task name, PID, CPU, state, etc). As trace_seq is made to handle large events (some greater than 4K). Make the max size of a trace_marker write event be 4K which is guaranteed to fit in the trace_seq buffer. Link: https://lore.kernel.org/linux-trace-kernel/20240304223433.4ba47dff@gandalf.local.home Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-06tracing: Remove precision vsnprintf() check from print eventSteven Rostedt (Google)1-4/+2
This reverts 60be76eeabb3d ("tracing: Add size check when printing trace_marker output"). The only reason the precision check was added was because of a bug that miscalculated the write size of the string into the ring buffer and it truncated it removing the terminating nul byte. On reading the trace it crashed the kernel. But this was due to the bug in the code that happened during development and should never happen in practice. If anything, the precision can hide bugs where the string in the ring buffer isn't nul terminated and it will not be checked. Link: https://lore.kernel.org/all/C7E7AF1A-D30F-4D18-B8E5-AF1EF58004F5@linux.ibm.com/ Link: https://lore.kernel.org/linux-trace-kernel/20240227125706.04279ac2@gandalf.local.home Link: https://lore.kernel.org/all/20240302111244.3a1674be@gandalf.local.home/ Link: https://lore.kernel.org/linux-trace-kernel/20240304174341.2a561d9f@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Fixes: 60be76eeabb3d ("tracing: Add size check when printing trace_marker output") Reported-by: Sachin Sant <sachinp@linux.ibm.com> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-05cpumap: Zero-initialise xdp_rxq_info struct before running XDP programToke Høiland-Jørgensen1-1/+1
When running an XDP program that is attached to a cpumap entry, we don't initialise the xdp_rxq_info data structure being used in the xdp_buff that backs the XDP program invocation. Tobias noticed that this leads to random values being returned as the xdp_md->rx_queue_index value for XDP programs running in a cpumap. This means we're basically returning the contents of the uninitialised memory, which is bad. Fix this by zero-initialising the rxq data structure before running the XDP program. Fixes: 9216477449f3 ("bpf: cpumap: Add the possibility to attach an eBPF program to cpumap") Reported-by: Tobias Böhm <tobias@aibor.de> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/r/20240305213132.11955-1-toke@redhat.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-03-05bpf: check bpf_func_state->callback_depth when pruning statesEduard Zingerman1-0/+3
When comparing current and cached states verifier should consider bpf_func_state->callback_depth. Current state cannot be pruned against cached state, when current states has more iterations left compared to cached state. Current state has more iterations left when it's callback_depth is smaller. Below is an example illustrating this bug, minimized from mailing list discussion [0] (assume that BPF_F_TEST_STATE_FREQ is set). The example is not a safe program: if loop_cb point (1) is followed by loop_cb point (2), then division by zero is possible at point (4). struct ctx { __u64 a; __u64 b; __u64 c; }; static void loop_cb(int i, struct ctx *ctx) { /* assume that generated code is "fallthrough-first": * if ... == 1 goto * if ... == 2 goto * <default> */ switch (bpf_get_prandom_u32()) { case 1: /* 1 */ ctx->a = 42; return 0; break; case 2: /* 2 */ ctx->b = 42; return 0; break; default: /* 3 */ ctx->c = 42; return 0; break; } } SEC("tc") __failure __flag(BPF_F_TEST_STATE_FREQ) int test(struct __sk_buff *skb) { struct ctx ctx = { 7, 7, 7 }; bpf_loop(2, loop_cb, &ctx, 0); /* 0 */ /* assume generated checks are in-order: .a first */ if (ctx.a == 42 && ctx.b == 42 && ctx.c == 7) asm volatile("r0 /= 0;":::"r0"); /* 4 */ return 0; } Prior to this commit verifier built the following checkpoint tree for this example: .------------------------------------- Checkpoint / State name | .-------------------------------- Code point number | | .---------------------------- Stack state {ctx.a,ctx.b,ctx.c} | | | .------------------- Callback depth in frame #0 v v v v - (0) {7P,7P,7},depth=0 - (3) {7P,7P,7},depth=1 - (0) {7P,7P,42},depth=1 - (3) {7P,7,42},depth=2 - (0) {7P,7,42},depth=2 loop terminates because of depth limit - (4) {7P,7,42},depth=0 predicted false, ctx.a marked precise - (6) exit (a) - (2) {7P,7,42},depth=2 - (0) {7P,42,42},depth=2 loop terminates because of depth limit - (4) {7P,42,42},depth=0 predicted false, ctx.a marked precise - (6) exit (b) - (1) {7P,7P,42},depth=2 - (0) {42P,7P,42},depth=2 loop terminates because of depth limit - (4) {42P,7P,42},depth=0 predicted false, ctx.{a,b} marked precise - (6) exit - (2) {7P,7,7},depth=1 considered safe, pruned using checkpoint (a) (c) - (1) {7P,7P,7},depth=1 considered safe, pruned using checkpoint (b) Here checkpoint (b) has callback_depth of 2, meaning that it would never reach state {42,42,7}. While checkpoint (c) has callback_depth of 1, and thus could yet explore the state {42,42,7} if not pruned prematurely. This commit makes forbids such premature pruning, allowing verifier to explore states sub-tree starting at (c): (c) - (1) {7,7,7P},depth=1 - (0) {42P,7,7P},depth=1 ... - (2) {42,7,7},depth=2 - (0) {42,42,7},depth=2 loop terminates because of depth limit - (4) {42,42,7},depth=0 predicted true, ctx.{a,b,c} marked precise - (5) division by zero [0] https://lore.kernel.org/bpf/9b251840-7cb8-4d17-bd23-1fc8071d8eef@linux.dev/ Fixes: bb124da69c47 ("bpf: keep track of max number of bpf_loop callback iterations") Suggested-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20240222154121.6991-2-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-03-05Merge tag 'cgroup-for-6.8-rc7-fixes' of ↵Linus Torvalds1-4/+4
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: "Two cpuset fixes. Both are for bugs in error handling paths and low risk" * tag 'cgroup-for-6.8-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup/cpuset: Fix retval in update_cpumask() cgroup/cpuset: Fix a memory leak in update_exclusive_cpumask()
2024-03-01Merge tag 'probes-fixes-v6.8-rc5' of ↵Linus Torvalds1-8/+6
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull fprobe fix from Masami Hiramatsu: - allocate entry_data_size buffer for each rethook instance. This fixes a buffer overrun bug (which leads a kernel crash) when fprobe user uses its entry_data in the entry_handler. * tag 'probes-fixes-v6.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: fprobe: Fix to allocate entry_data_size buffer with rethook instances
2024-03-01fprobe: Fix to allocate entry_data_size buffer with rethook instancesMasami Hiramatsu (Google)1-8/+6
Fix to allocate fprobe::entry_data_size buffer with rethook instances. If fprobe doesn't allocate entry_data_size buffer for each rethook instance, fprobe entry handler can cause a buffer overrun when storing entry data in entry handler. Link: https://lore.kernel.org/all/170920576727.107552.638161246679734051.stgit@devnote2/ Reported-by: Jiri Olsa <olsajiri@gmail.com> Closes: https://lore.kernel.org/all/Zd9eBn2FTQzYyg7L@krava/ Fixes: 4bbd93455659 ("kprobes: kretprobe scalability improvement") Cc: stable@vger.kernel.org Tested-by: Jiri Olsa <olsajiri@gmail.com> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-02-29cgroup/cpuset: Fix retval in update_cpumask()Kamalesh Babulal1-1/+1
The update_cpumask(), checks for newly requested cpumask by calling validate_change(), which returns an error on passing an invalid set of cpu(s). Independent of the error returned, update_cpumask() always returns zero, suppressing the error and returning success to the user on writing an invalid cpu range for a cpuset. Fix it by returning retval instead, which is returned by validate_change(). Fixes: 99fe36ba6fc1 ("cgroup/cpuset: Improve temporary cpumasks handling") Signed-off-by: Kamalesh Babulal <kamalesh.babulal@oracle.com> Reviewed-by: Waiman Long <longman@redhat.com> Cc: stable@vger.kernel.org # v6.6+ Signed-off-by: Tejun Heo <tj@kernel.org>
2024-02-28cgroup/cpuset: Fix a memory leak in update_exclusive_cpumask()Waiman Long1-3/+3
Fix a possible memory leak in update_exclusive_cpumask() by moving the alloc_cpumasks() down after the validate_change() check which can fail and still before the temporary cpumasks are needed. Fixes: e2ffe502ba45 ("cgroup/cpuset: Add cpuset.cpus.exclusive for v2") Reported-and-tested-by: Mirsad Todorovac <mirsad.todorovac@alu.hr> Closes: https://lore.kernel.org/lkml/14915689-27a3-4cd8-80d2-9c30d0c768b6@alu.unizg.hr Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org # v6.7+
2024-02-22Merge tag 'net-6.8.0-rc6' of ↵Linus Torvalds3-1/+8
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from bpf and netfilter. Current release - regressions: - af_unix: fix another unix GC hangup Previous releases - regressions: - core: fix a possible AF_UNIX deadlock - bpf: fix NULL pointer dereference in sk_psock_verdict_data_ready() - netfilter: nft_flow_offload: release dst in case direct xmit path is used - bridge: switchdev: ensure MDB events are delivered exactly once - l2tp: pass correct message length to ip6_append_data - dccp/tcp: unhash sk from ehash for tb2 alloc failure after check_estalblished() - tls: fixes for record type handling with PEEK - devlink: fix possible use-after-free and memory leaks in devlink_init() Previous releases - always broken: - bpf: fix an oops when attempting to read the vsyscall page through bpf_probe_read_kernel - sched: act_mirred: use the backlog for mirred ingress - netfilter: nft_flow_offload: fix dst refcount underflow - ipv6: sr: fix possible use-after-free and null-ptr-deref - mptcp: fix several data races - phonet: take correct lock to peek at the RX queue Misc: - handful of fixes and reliability improvements for selftests" * tag 'net-6.8.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (72 commits) l2tp: pass correct message length to ip6_append_data net: phy: realtek: Fix rtl8211f_config_init() for RTL8211F(D)(I)-VD-CG PHY selftests: ioam: refactoring to align with the fix Fix write to cloned skb in ipv6_hop_ioam() phonet/pep: fix racy skb_queue_empty() use phonet: take correct lock to peek at the RX queue net: sparx5: Add spinlock for frame transmission from CPU net/sched: flower: Add lock protection when remove filter handle devlink: fix port dump cmd type net: stmmac: Fix EST offset for dwmac 5.10 tools: ynl: don't leak mcast_groups on init error tools: ynl: make sure we always pass yarg to mnl_cb_run net: mctp: put sock on tag allocation failure netfilter: nf_tables: use kzalloc for hook allocation netfilter: nf_tables: register hooks last when adding new chain/flowtable netfilter: nft_flow_offload: release dst in case direct xmit path is used netfilter: nft_flow_offload: reset dst in route object after setting up flow netfilter: nf_tables: set dormant flag on hook register failure selftests: tls: add test for peeking past a record of a different type selftests: tls: add test for merging of same-type control messages ...
2024-02-22Merge tag 'trace-v6.8-rc5' of ↵Linus Torvalds1-0/+4
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fix from Steven Rostedt: - While working on the ring buffer I noticed that the counter used for knowing where the end of the data is on a sub-buffer was not a full "int" but just 20 bits. It was masked out to 0xfffff. With the new code that allows the user to change the size of the sub-buffer, it is theoretically possible to ask for a size bigger than 2^20. If that happens, unexpected results may occur as there's no code checking if the counter overflowed the 20 bits of the write mask. There are other checks to make sure events fit in the sub-buffer, but if the sub-buffer itself is too big, that is not checked. Add a check in the resize of the sub-buffer to make sure that it never goes beyond the size of the counter that holds how much data is on it. * tag 'trace-v6.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: ring-buffer: Do not let subbuf be bigger than write mask
2024-02-22Merge tag 'for-netdev' of ↵Paolo Abeni3-1/+8
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== pull-request: bpf 2024-02-22 The following pull-request contains BPF updates for your *net* tree. We've added 11 non-merge commits during the last 24 day(s) which contain a total of 15 files changed, 217 insertions(+), 17 deletions(-). The main changes are: 1) Fix a syzkaller-triggered oops when attempting to read the vsyscall page through bpf_probe_read_kernel and friends, from Hou Tao. 2) Fix a kernel panic due to uninitialized iter position pointer in bpf_iter_task, from Yafang Shao. 3) Fix a race between bpf_timer_cancel_and_free and bpf_timer_cancel, from Martin KaFai Lau. 4) Fix a xsk warning in skb_add_rx_frag() (under CONFIG_DEBUG_NET) due to incorrect truesize accounting, from Sebastian Andrzej Siewior. 5) Fix a NULL pointer dereference in sk_psock_verdict_data_ready, from Shigeru Yoshida. 6) Fix a resolve_btfids warning when bpf_cpumask symbol cannot be resolved, from Hari Bathini. bpf-for-netdev * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf, sockmap: Fix NULL pointer dereference in sk_psock_verdict_data_ready() selftests/bpf: Add negtive test cases for task iter bpf: Fix an issue due to uninitialized bpf_iter_task selftests/bpf: Test racing between bpf_timer_cancel_and_free and bpf_timer_cancel bpf: Fix racing between bpf_timer_cancel_and_free and bpf_timer_cancel selftest/bpf: Test the read of vsyscall page under x86-64 x86/mm: Disallow vsyscall page read for copy_from_kernel_nofault() x86/mm: Move is_vsyscall_vaddr() into asm/vsyscall.h bpf, scripts: Correct GPL license name xsk: Add truesize to skb_add_rx_frag(). bpf: Fix warning for bpf_cpumask in verifier ==================== Link: https://lore.kernel.org/r/20240221231826.1404-1-daniel@iogearbox.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-02-21ring-buffer: Do not let subbuf be bigger than write maskSteven Rostedt (Google)1-0/+4
The data on the subbuffer is measured by a write variable that also contains status flags. The counter is just 20 bits in length. If the subbuffer is bigger than then counter, it will fail. Make sure that the subbuffer can not be set to greater than the counter that keeps track of the data on the subbuffer. Link: https://lore.kernel.org/linux-trace-kernel/20240220095112.77e9cb81@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Fixes: 2808e31ec12e5 ("ring-buffer: Add interface for configuring trace sub buffer size") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-02-20sched/membarrier: reduce the ability to hammer on sys_membarrierLinus Torvalds1-0/+6
On some systems, sys_membarrier can be very expensive, causing overall slowdowns for everything. So put a lock on the path in order to serialize the accesses to prevent the ability for this to be called at too high of a frequency and saturate the machine. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-and-tested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Borislav Petkov <bp@alien8.de> Fixes: 22e4ebb97582 ("membarrier: Provide expedited private command") Fixes: c5f58bd58f43 ("membarrier: Provide GLOBAL_EXPEDITED command") Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-02-19bpf: Fix an issue due to uninitialized bpf_iter_taskYafang Shao1-0/+2
Failure to initialize it->pos, coupled with the presence of an invalid value in the flags variable, can lead to it->pos referencing an invalid task, potentially resulting in a kernel panic. To mitigate this risk, it's crucial to ensure proper initialization of it->pos to NULL. Fixes: ac8148d957f5 ("bpf: bpf_iter_task_next: use next_task(kit->task) rather than next_task(kit->pos)") Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/bpf/20240217114152.1623-2-laoar.shao@gmail.com
2024-02-19bpf: Fix racing between bpf_timer_cancel_and_free and bpf_timer_cancelMartin KaFai Lau1-1/+4
The following race is possible between bpf_timer_cancel_and_free and bpf_timer_cancel. It will lead a UAF on the timer->timer. bpf_timer_cancel(); spin_lock(); t = timer->time; spin_unlock(); bpf_timer_cancel_and_free(); spin_lock(); t = timer->timer; timer->timer = NULL; spin_unlock(); hrtimer_cancel(&t->timer); kfree(t); /* UAF on t */ hrtimer_cancel(&t->timer); In bpf_timer_cancel_and_free, this patch frees the timer->timer after a rcu grace period. This requires a rcu_head addition to the "struct bpf_hrtimer". Another kfree(t) happens in bpf_timer_init, this does not need a kfree_rcu because it is still under the spin_lock and timer->timer has not been visible by others yet. In bpf_timer_cancel, rcu_read_lock() is added because this helper can be used in a non rcu critical section context (e.g. from a sleepable bpf prog). Other timer->timer usages in helpers.c have been audited, bpf_timer_cancel() is the only place where timer->timer is used outside of the spin_lock. Another solution considered is to mark a t->flag in bpf_timer_cancel and clear it after hrtimer_cancel() is done. In bpf_timer_cancel_and_free, it busy waits for the flag to be cleared before kfree(t). This patch goes with a straight forward solution and frees timer->timer after a rcu grace period. Fixes: b00628b1c7d5 ("bpf: Introduce bpf timers.") Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/bpf/20240215211218.990808-1-martin.lau@linux.dev
2024-02-17Merge tag 'probes-fixes-v6.8-rc4' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probes fix from Masami Hiramatsu: - tracing/probes: Fix BTF structure member finder to find the members which are placed after any anonymous union member correctly. * tag 'probes-fixes-v6.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing/probes: Fix to search structure fields correctly
2024-02-17tracing/probes: Fix to search structure fields correctlyMasami Hiramatsu (Google)1-2/+2
Fix to search a field from the structure which has anonymous union correctly. Since the reference `type` pointer was updated in the loop, the search loop suddenly aborted where it hits an anonymous union. Thus it can not find the field after the anonymous union. This avoids updating the cursor `type` pointer in the loop. Link: https://lore.kernel.org/all/170791694361.389532.10047514554799419688.stgit@devnote2/ Fixes: 302db0f5b3d8 ("tracing/probes: Add a function to search a member of a struct/union") Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-02-16Merge tag 'wq-for-6.8-rc4-fixes' of ↵Linus Torvalds1-6/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fix from Tejun Heo: "Just one patch to revert commit ca10d851b9ad ("workqueue: Override implicit ordered attribute in workqueue_apply_unbound_cpumask()"). This commit could break ordering guarantees for ordered workqueues. The problem that the commit tried to resolve partially - making ordered workqueues follow unbound cpumask - is fully solved in wq/for-6.9 branch" * tag 'wq-for-6.8-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: Revert "workqueue: Override implicit ordered attribute in workqueue_apply_unbound_cpumask()"
2024-02-16Merge tag 'trace-v6.8-rc4' of ↵Linus Torvalds3-3/+7
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix the #ifndef that didn't have the 'CONFIG_' prefix on HAVE_DYNAMIC_FTRACE_WITH_REGS The fix to have dynamic trampolines work with x86 broke arm64 as the config used in the #ifdef was HAVE_DYNAMIC_FTRACE_WITH_REGS and not CONFIG_HAVE_DYNAMIC_FTRACE_WITH_REGS which removed the fix that the previous fix was to fix. - Fix tracing_on state The code to test if "tracing_on" is set incorrectly used ring_buffer_record_is_on() which returns false if the ring buffer isn't able to be written to. But the ring buffer disable has several bits that disable it. One is internal disabling which is used for resizing and other modifications of the ring buffer. But the "tracing_on" user space visible flag should only report if tracing is actually on and not internally disabled, as this can cause confusion as writing "1" when it is disabled will not enable it. Instead use ring_buffer_record_is_set_on() which shows the user space visible settings. - Fix a false positive kmemleak on saved cmdlines Now that the saved_cmdlines structure is allocated via alloc_page() and not via kmalloc() it has become invisible to kmemleak. The allocation done to one of its pointers was flagged as a dangling allocation leak. Make kmemleak aware of this allocation and free. - Fix synthetic event dynamic strings An update that cleaned up the synthetic event code removed the return value of trace_string(), and had it return zero instead of the length, causing dynamic strings in the synthetic event to always have zero size. - Clean up documentation and header files for seq_buf * tag 'trace-v6.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: seq_buf: Fix kernel documentation seq_buf: Don't use "proxy" headers tracing/synthetic: Fix trace_string() return value tracing: Inform kmemleak of saved_cmdlines allocation tracing: Use ring_buffer_record_is_set_on() in tracer_tracing_is_on() tracing: Fix HAVE_DYNAMIC_FTRACE_WITH_REGS ifdef
2024-02-15tracing/synthetic: Fix trace_string() return valueThorsten Blum1-1/+2
Fix trace_string() by assigning the string length to the return variable which got lost in commit ddeea494a16f ("tracing/synthetic: Use union instead of casts") and caused trace_string() to always return 0. Link: https://lore.kernel.org/linux-trace-kernel/20240214220555.711598-1-thorsten.blum@toblux.com Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Fixes: ddeea494a16f ("tracing/synthetic: Use union instead of casts") Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-02-14tracing: Inform kmemleak of saved_cmdlines allocationSteven Rostedt (Google)1-0/+3
The allocation of the struct saved_cmdlines_buffer structure changed from: s = kmalloc(sizeof(*s), GFP_KERNEL); s->saved_cmdlines = kmalloc_array(TASK_COMM_LEN, val, GFP_KERNEL); to: orig_size = sizeof(*s) + val * TASK_COMM_LEN; order = get_order(orig_size); size = 1 << (order + PAGE_SHIFT); page = alloc_pages(GFP_KERNEL, order); if (!page) return NULL; s = page_address(page); memset(s, 0, sizeof(*s)); s->saved_cmdlines = kmalloc_array(TASK_COMM_LEN, val, GFP_KERNEL); Where that s->saved_cmdlines allocation looks to be a dangling allocation to kmemleak. That's because kmemleak only keeps track of kmalloc() allocations. For allocations that use page_alloc() directly, the kmemleak needs to be explicitly informed about it. Add kmemleak_alloc() and kmemleak_free() around the page allocation so that it doesn't give the following false positive: unreferenced object 0xffff8881010c8000 (size 32760): comm "swapper", pid 0, jiffies 4294667296 hex dump (first 32 bytes): ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ backtrace (crc ae6ec1b9): [<ffffffff86722405>] kmemleak_alloc+0x45/0x80 [<ffffffff8414028d>] __kmalloc_large_node+0x10d/0x190 [<ffffffff84146ab1>] __kmalloc+0x3b1/0x4c0 [<ffffffff83ed7103>] allocate_cmdlines_buffer+0x113/0x230 [<ffffffff88649c34>] tracer_alloc_buffers.isra.0+0x124/0x460 [<ffffffff8864a174>] early_trace_init+0x14/0xa0 [<ffffffff885dd5ae>] start_kernel+0x12e/0x3c0 [<ffffffff885f5758>] x86_64_start_reservations+0x18/0x30 [<ffffffff885f582b>] x86_64_start_kernel+0x7b/0x80 [<ffffffff83a001c3>] secondary_startup_64_no_verify+0x15e/0x16b Link: https://lore.kernel.org/linux-trace-kernel/87r0hfnr9r.fsf@kernel.org/ Link: https://lore.kernel.org/linux-trace-kernel/20240214112046.09a322d6@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Fixes: 44dc5c41b5b1 ("tracing: Fix wasted memory in saved_cmdlines logic") Reported-by: Kalle Valo <kvalo@kernel.org> Tested-by: Kalle Valo <kvalo@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-02-13bpf: Fix warning for bpf_cpumask in verifierHari Bathini1-0/+2
Compiling with CONFIG_BPF_SYSCALL & !CONFIG_BPF_JIT throws the below warning: "WARN: resolve_btfids: unresolved symbol bpf_cpumask" Fix it by adding the appropriate #ifdef. Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Stanislav Fomichev <sdf@google.com> Acked-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/bpf/20240208100115.602172-1-hbathini@linux.ibm.com
2024-02-13tracing: Use ring_buffer_record_is_set_on() in tracer_tracing_is_on()Sven Schnelle1-1/+1
tracer_tracing_is_on() checks whether record_disabled is not zero. This checks both the record_disabled counter and the RB_BUFFER_OFF flag. Reading the source it looks like this function should only check for the RB_BUFFER_OFF flag. Therefore use ring_buffer_record_is_set_on(). This fixes spurious fails in the 'test for function traceon/off triggers' test from the ftrace testsuite when the system is under load. Link: https://lore.kernel.org/linux-trace-kernel/20240205065340.2848065-1-svens@linux.ibm.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Tested-By: Mete Durlu <meted@linux.ibm.com> Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-02-13tracing: Fix HAVE_DYNAMIC_FTRACE_WITH_REGS ifdefPetr Pavlu1-1/+1
Commit a8b9cf62ade1 ("ftrace: Fix DIRECT_CALLS to use SAVE_REGS by default") attempted to fix an issue with direct trampolines on x86, see its description for details. However, it wrongly referenced the HAVE_DYNAMIC_FTRACE_WITH_REGS config option and the problem is still present. Add the missing "CONFIG_" prefix for the logic to work as intended. Link: https://lore.kernel.org/linux-trace-kernel/20240213132434.22537-1-petr.pavlu@suse.com Fixes: a8b9cf62ade1 ("ftrace: Fix DIRECT_CALLS to use SAVE_REGS by default") Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-02-11Merge tag 'timers_urgent_for_v6.8_rc4' of ↵Linus Torvalds1-0/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Borislav Petkov: - Make sure a warning is issued when a hrtimer gets queued after the timers have been migrated on the CPU down path and thus said timer will get ignored * tag 'timers_urgent_for_v6.8_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: hrtimer: Report offline hrtimer enqueue
2024-02-10Merge tag 'mm-hotfixes-stable-2024-02-10-11-16' of ↵Linus Torvalds2-25/+35
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "21 hotfixes. 12 are cc:stable and the remainder pertain to post-6.7 issues or aren't considered to be needed in earlier kernel versions" * tag 'mm-hotfixes-stable-2024-02-10-11-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (21 commits) nilfs2: fix potential bug in end_buffer_async_write mm/damon/sysfs-schemes: fix wrong DAMOS tried regions update timeout setup nilfs2: fix hang in nilfs_lookup_dirty_data_buffers() MAINTAINERS: Leo Yan has moved mm/zswap: don't return LRU_SKIP if we have dropped lru lock fs,hugetlb: fix NULL pointer dereference in hugetlbs_fill_super mailmap: switch email address for John Moon mm: zswap: fix objcg use-after-free in entry destruction mm/madvise: don't forget to leave lazy MMU mode in madvise_cold_or_pageout_pte_range() arch/arm/mm: fix major fault accounting when retrying under per-VMA lock selftests: core: include linux/close_range.h for CLOSE_RANGE_* macros mm/memory-failure: fix crash in split_huge_page_to_list from soft_offline_page mm: memcg: optimize parent iteration in memcg_rstat_updated() nilfs2: fix data corruption in dsync block recovery for small block sizes mm/userfaultfd: UFFDIO_MOVE implementation should use ptep_get() exit: wait_task_zombie: kill the no longer necessary spin_lock_irq(siglock) fs/proc: do_task_stat: use sig->stats_lock to gather the threads/children stats fs/proc: do_task_stat: move thread_group_cputime_adjusted() outside of lock_task_sighand() getrusage: use sig->stats_lock rather than lock_task_sighand() getrusage: move thread_group_cputime_adjusted() outside of lock_task_sighand() ...
2024-02-09Merge tag 'trace-v6.8-rc3' of ↵Linus Torvalds2-38/+47
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix broken direct trampolines being called when another callback is attached the same function. ARM 64 does not support FTRACE_WITH_REGS, and when it added direct trampoline calls from ftrace, it removed the "WITH_REGS" flag from the ftrace_ops for direct trampolines. This broke x86 as x86 requires direct trampolines to have WITH_REGS. This wasn't noticed because direct trampolines work as long as the function it is attached to is not shared with other callbacks (like the function tracer). When there are other callbacks, a helper trampoline is called, to call all the non direct callbacks and when it returns, the direct trampoline is called. For x86, the direct trampoline sets a flag in the regs field to tell the x86 specific code to call the direct trampoline. But this only works if the ftrace_ops had WITH_REGS set. ARM does things differently that does not require this. For now, set WITH_REGS if the arch supports WITH_REGS (which ARM does not), and this makes it work for both ARM64 and x86. - Fix wasted memory in the saved_cmdlines logic. The saved_cmdlines is a cache that maps PIDs to COMMs that tracing can use. Most trace events only save the PID in the event. The saved_cmdlines file lists PIDs to COMMs so that the tracing tools can show an actual name and not just a PID for each event. There's an array of PIDs that map to a small set of saved COMM strings. The array is set to PID_MAX_DEFAULT which is usually set to 32768. When a PID comes in, it will add itself to this array along with the index into the COMM array (note if the system allows more than PID_MAX_DEFAULT, this cache is similar to cache lines as an update of a PID that has the same PID_MAX_DEFAULT bits set will flush out another task with the same matching bits set). A while ago, the size of this cache was changed to be dynamic and the array was moved into a structure and created with kmalloc(). But this new structure had the size of 131104 bytes, or 0x20020 in hex. As kmalloc allocates in powers of two, it was actually allocating 0x40000 bytes (262144) leaving 131040 bytes of wasted memory. The last element of this structure was a pointer to the COMM string array which defaulted to just saving 128 COMMs. By changing the last field of this structure to a variable length string, and just having it round up to fill the allocated memory, the default size of the saved COMM cache is now 8190. This not only uses the wasted space, but actually saves space by removing the extra allocation for the COMM names. * tag 'trace-v6.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Fix wasted memory in saved_cmdlines logic ftrace: Fix DIRECT_CALLS to use SAVE_REGS by default
2024-02-09tracing: Fix wasted memory in saved_cmdlines logicSteven Rostedt (Google)1-38/+37
While looking at improving the saved_cmdlines cache I found a huge amount of wasted memory that should be used for the cmdlines. The tracing data saves pids during the trace. At sched switch, if a trace occurred, it will save the comm of the task that did the trace. This is saved in a "cache" th