summaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)AuthorFilesLines
3 daysnet: Provide a PREEMPT_RT specific check for netdev_queue::_xmit_lockSebastian Andrzej Siewior1-5/+22
[ Upstream commit b824c3e16c1904bf80df489e293d1e3cbf98896d ] After acquiring netdev_queue::_xmit_lock the number of the CPU owning the lock is recorded in netdev_queue::xmit_lock_owner. This works as long as the BH context is not preemptible. On PREEMPT_RT the softirq context is preemptible and without the softirq-lock it is possible to have multiple user in __dev_queue_xmit() submitting a skb on the same CPU. This is fine in general but this means also that the current CPU is recorded as netdev_queue::xmit_lock_owner. This in turn leads to the recursion alert and the skb is dropped. Instead checking the for CPU number, that owns the lock, PREEMPT_RT can check if the lockowner matches the current task. Add netif_tx_owned() which returns true if the current context owns the lock by comparing the provided CPU number with the recorded number. This resembles the current check by negating the condition (the current check returns true if the lock is not owned). On PREEMPT_RT use rt_mutex_owner() to return the lock owner and compare the current task against it. Use the new helper in __dev_queue_xmit() and netif_local_xmit_active() which provides a similar check. Update comments regarding pairing READ_ONCE(). Reported-by: Bert Karwatzki <spasswolf@web.de> Closes: https://lore.kernel.org/all/20260216134333.412332-1-spasswolf@web.de Fixes: 3253cb49cbad4 ("softirq: Allow to drop the softirq-BKL lock on PREEMPT_RT") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reported-by: Bert Karwatzki <spasswolf@web.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://patch.msgid.link/20260302162631.uGUyIqDT@linutronix.de Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
3 daysindirect_call_wrapper: do not reevaluate function pointerEric Dumazet1-7/+11
[ Upstream commit 710f5c76580306cdb9ec51fac8fcf6a8faff7821 ] We have an increasing number of READ_ONCE(xxx->function) combined with INDIRECT_CALL_[1234]() helpers. Unfortunately this forces INDIRECT_CALL_[1234]() to read xxx->function many times, which is not what we wanted. Fix these macros so that xxx->function value is not reloaded. $ scripts/bloat-o-meter -t vmlinux.0 vmlinux add/remove: 0/0 grow/shrink: 1/65 up/down: 122/-1084 (-962) Function old new delta ip_push_pending_frames 59 181 +122 ip6_finish_output 687 681 -6 __udp_enqueue_schedule_skb 1078 1072 -6 ioam6_output 2319 2312 -7 xfrm4_rcv_encap_finish2 64 56 -8 xfrm4_output 297 289 -8 vrf_ip_local_out 278 270 -8 vrf_ip6_local_out 278 270 -8 seg6_input_finish 64 56 -8 rpl_output 700 692 -8 ipmr_forward_finish 124 116 -8 ip_forward_finish 143 135 -8 ip6mr_forward2_finish 100 92 -8 ip6_forward_finish 73 65 -8 input_action_end_bpf 1091 1083 -8 dst_input 52 44 -8 __xfrm6_output 801 793 -8 __xfrm4_output 83 75 -8 bpf_input 500 491 -9 __tcp_check_space 530 521 -9 input_action_end_dt6 291 280 -11 vti6_tnl_xmit 1634 1622 -12 bpf_xmit 1203 1191 -12 rpl_input 497 483 -14 rawv6_send_hdrinc 1355 1341 -14 ndisc_send_skb 1030 1016 -14 ipv6_srh_rcv 1377 1363 -14 ip_send_unicast_reply 1253 1239 -14 ip_rcv_finish 226 212 -14 ip6_rcv_finish 300 286 -14 input_action_end_x_core 205 191 -14 input_action_end_x 355 341 -14 input_action_end_t 205 191 -14 input_action_end_dx6_finish 127 113 -14 input_action_end_dx4_finish 373 359 -14 input_action_end_dt4 426 412 -14 input_action_end_core 186 172 -14 input_action_end_b6_encap 292 278 -14 input_action_end_b6 198 184 -14 igmp6_send 1332 1318 -14 ip_sublist_rcv 864 848 -16 ip6_sublist_rcv 1091 1075 -16 ipv6_rpl_srh_rcv 1937 1920 -17 xfrm_policy_queue_process 1246 1228 -18 seg6_output_core 903 885 -18 mld_sendpack 856 836 -20 NF_HOOK 756 736 -20 vti_tunnel_xmit 1447 1426 -21 input_action_end_dx6 664 642 -22 input_action_end 1502 1480 -22 sock_sendmsg_nosec 134 111 -23 ip6mr_forward2 388 364 -24 sock_recvmsg_nosec 134 109 -25 seg6_input_core 836 810 -26 ip_send_skb 172 146 -26 ip_local_out 140 114 -26 ip6_local_out 140 114 -26 __sock_sendmsg 162 136 -26 __ip_queue_xmit 1196 1170 -26 __ip_finish_output 405 379 -26 ipmr_queue_fwd_xmit 373 346 -27 sock_recvmsg 173 145 -28 ip6_xmit 1635 1607 -28 xfrm_output_resume 1418 1389 -29 ip_build_and_send_pkt 625 591 -34 dst_output 504 432 -72 Total: Before=25217686, After=25216724, chg -0.00% Fixes: 283c16a2dfd3 ("indirect call wrappers: helpers to speed-up indirect calls of builtin") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260227172603.1700433-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
3 dayspinctrl: generic: move function to amlogic-am4 driverConor Dooley1-5/+0
[ Upstream commit 9c5a40f2922a5a6d6b42e7b3d4c8e253918c07a1 ] pinconf_generic_dt_node_to_map_pinmux() is not actually a generic function, and really belongs in the amlogic-am4 driver. There are three reasons why. First, and least, of the reasons is that this function behaves differently to the other dt_node_to_map functions in a way that is not obvious from a first glance. This difference stems for the devicetree properties that the function is intended for use with, and how they are typically used. The other generic dt_node_to_map functions support platforms where the pins, groups and functions are described statically in the driver and require a function that will produce a mapping from dt nodes to these pre-established descriptions. No other code in the driver is require to be executed at runtime. pinconf_generic_dt_node_to_map_pinmux() on the other hand is intended for use with the pinmux property, where groups and functions are determined entirely from the devicetree. As a result, there are no statically defined groups and functions in the driver for this function to perform a mapping to. Other drivers that use the pinmux property (e.g. the k1) their dt_node_to_map function creates the groups and functions as the devicetree is parsed. Instead of that, pinconf_generic_dt_node_to_map_pinmux() requires that the devicetree is parsed twice, once by it and once at probe, so that the driver dynamically creates the groups and functions before the dt_node_to_map callback is executed. I don't believe this double parsing requirement is how developers would expect this to work and is not necessary given there are drivers that do not have this behaviour. Secondly and thirdly, the function bakes in some assumptions that only really match the amlogic platform about how the devicetree is constructed. These, to me, are problematic for something that claims to be generic. The other dt_node_to_map implementations accept a being called for either a node containing pin configuration properties or a node containing child nodes that each contain the configuration properties. IOW, they support the following two devicetree configurations: | cfg { | label: group { | pinmux = <asjhdasjhlajskd>; | config-item1; | }; | }; | label: cfg { | group1 { | pinmux = <dsjhlfka>; | config-item2; | }; | group2 { | pinmux = <lsdjhaf>; | config-item1; | }; | }; pinconf_generic_dt_node_to_map_pinmux() only supports the latter. The other assumption about devicetree configuration that the function makes is that the labeled node's parent is a "function node". The amlogic driver uses these "function nodes" to create the functions at probe time, and pinconf_generic_dt_node_to_map_pinmux() finds the parent of the node it is operating on's name as part of the mapping. IOW, it requires that the devicetree look like: | pinctrl@bla { | | func-foo { | label: group-default { | pinmuxes = <lskdf>; | }; | }; | }; and couldn't be used if the nodes containing the pinmux and configuration properties are children of the pinctrl node itself: | pinctrl@bla { | | label: group-default { | pinmuxes = <lskdf>; | }; | }; These final two reasons are mainly why I believe this is not suitable as a generic function, and should be moved into the driver that is the sole user and originator of the "generic" function. Signed-off-by: Conor Dooley <conor.dooley@microchip.com> Acked-by: Andy Shevchenko <andriy.shevchenko@intel.com> Signed-off-by: Linus Walleij <linusw@kernel.org> Stable-dep-of: a2539b92e4b7 ("pinctrl: meson: amlogic-a4: Fix device node reference leak in aml_dt_node_to_map_pinmux()") Signed-off-by: Sasha Levin <sashal@kernel.org>
3 daysnet: stmmac: remove support for lpi_intr_oRussell King (Oracle)1-1/+0
commit 14eb64db8ff07b58a35b98375f446d9e20765674 upstream. The dwmac databook for v3.74a states that lpi_intr_o is a sideband signal which should be used to ungate the application clock, and this signal is synchronous to the receive clock. The receive clock can run at 2.5, 25 or 125MHz depending on the media speed, and can stop under the control of the link partner. This means that the time it takes to clear is dependent on the negotiated media speed, and thus can be 8, 40, or 400ns after reading the LPI control and status register. It has been observed with some aggressive link partners, this clock can stop while lpi_intr_o is still asserted, meaning that the signal remains asserted for an indefinite period that the local system has no direct control over. The LPI interrupts will still be signalled through the main interrupt path in any case, and this path is not dependent on the receive clock. This, since we do not gate the application clock, and the chances of adding clock gating in the future are slim due to the clocks being ill-defined, lpi_intr_o serves no useful purpose. Remove the code which requests the interrupt, and all associated code. Reported-by: Ovidiu Panait <ovidiu.panait.rb@renesas.com> Tested-by: Ovidiu Panait <ovidiu.panait.rb@renesas.com> # Renesas RZ/V2H board Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/E1vnJbt-00000007YYN-28nm@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Ovidiu Panait <ovidiu.panait.rb@renesas.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
3 daystracing: Fix WARN_ON in tracing_buffers_mmap_closeQing Wang1-0/+1
commit e39bb9e02b68942f8e9359d2a3efe7d37ae6be0e upstream. When a process forks, the child process copies the parent's VMAs but the user_mapped reference count is not incremented. As a result, when both the parent and child processes exit, tracing_buffers_mmap_close() is called twice. On the second call, user_mapped is already 0, causing the function to return -ENODEV and triggering a WARN_ON. Normally, this isn't an issue as the memory is mapped with VM_DONTCOPY set. But this is only a hint, and the application can call madvise(MADVISE_DOFORK) which resets the VM_DONTCOPY flag. When the application does that, it can trigger this issue on fork. Fix it by incrementing the user_mapped reference count without re-mapping the pages in the VMA's open callback. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://patch.msgid.link/20260227025842.1085206-1-wangqing7171@gmail.com Fixes: cf9f0f7c4c5bb ("tracing: Allow user-space mapping of the ring-buffer") Reported-by: syzbot+3b5dd2030fe08afdf65d@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=3b5dd2030fe08afdf65d Tested-by: syzbot+3b5dd2030fe08afdf65d@syzkaller.appspotmail.com Signed-off-by: Qing Wang <wangqing7171@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
3 daysPM: sleep: core: Avoid bit field races related to work_in_progressXuewen Yan1-1/+1
[ Upstream commit 0491f3f9f664e7e0131eb4d2a8b19c49562e5c64 ] In all of the system suspend transition phases, the async processing of a device may be carried out in parallel with power.work_in_progress updates for the device's parent or suppliers and if it touches bit fields from the same group (for example, power.must_resume or power.wakeup_path), bit field corruption is possible. To avoid that, turn work_in_progress in struct dev_pm_info into a proper bool field and relocate it to save space. Fixes: aa7a9275ab81 ("PM: sleep: Suspend async parents after suspending children") Fixes: 443046d1ad66 ("PM: sleep: Make suspend of devices more asynchronous") Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com> Closes: https://lore.kernel.org/linux-pm/20260203063459.12808-1-xuewen.yan@unisoc.com/ Cc: All applicable <stable@vger.kernel.org> [ rjw: Added subject and changelog ] Link: https://patch.msgid.link/CAB8ipk_VX2VPm706Jwa1=8NSA7_btWL2ieXmBgHr2JcULEP76g@mail.gmail.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
3 daysx86/uprobes: Fix XOL allocation failure for 32-bit tasksOleg Nesterov1-0/+1
[ Upstream commit d55c571e4333fac71826e8db3b9753fadfbead6a ] This script #!/usr/bin/bash echo 0 > /proc/sys/kernel/randomize_va_space echo 'void main(void) {}' > TEST.c # -fcf-protection to ensure that the 1st endbr32 insn can't be emulated gcc -m32 -fcf-protection=branch TEST.c -o test bpftrace -e 'uprobe:./test:main {}' -c ./test "hangs", the probed ./test task enters an endless loop. The problem is that with randomize_va_space == 0 get_unmapped_area(TASK_SIZE - PAGE_SIZE) called by xol_add_vma() can not just return the "addr == TASK_SIZE - PAGE_SIZE" hint, this addr is used by the stack vma. arch_get_unmapped_area_topdown() doesn't take TIF_ADDR32 into account and in_32bit_syscall() is false, this leads to info.high_limit > TASK_SIZE. vm_unmapped_area() happily returns the high address > TASK_SIZE and then get_unmapped_area() returns -ENOMEM after the "if (addr > TASK_SIZE - len)" check. handle_swbp() doesn't report this failure (probably it should) and silently restarts the probed insn. Endless loop. I think that the right fix should change the x86 get_unmapped_area() paths to rely on TIF_ADDR32 rather than in_32bit_syscall(). Note also that if CONFIG_X86_X32_ABI=y, in_x32_syscall() falsely returns true in this case because ->orig_ax = -1. But we need a simple fix for -stable, so this patch just sets TS_COMPAT if the probed task is 32-bit to make in_ia32_syscall() true. Fixes: 1b028f784e8c ("x86/mm: Introduce mmap_compat_base() for 32-bit mmap()") Reported-by: Paulo Andrade <pandrade@redhat.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/aV5uldEvV7pb4RA8@redhat.com/ Cc: stable@vger.kernel.org Link: https://patch.msgid.link/aWO7Fdxn39piQnxu@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>
3 daysunwind_user/x86: Teach FP unwind about start of functionPeter Zijlstra1-0/+1
[ Upstream commit ae25884ad749e7f6e0c3565513bdc8aa2554a425 ] When userspace is interrupted at the start of a function, before we get a chance to complete the frame, unwind will miss one caller. X86 has a uprobe specific fixup for this, add bits to the generic unwinder to support this. Suggested-by: Jens Remus <jremus@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251024145156.GM4068168@noisy.programming.kicks-ass.net Stable-dep-of: d55c571e4333 ("x86/uprobes: Fix XOL allocation failure for 32-bit tasks") Signed-off-by: Sasha Levin <sashal@kernel.org>
3 daysunwind: Implement compat fp unwindPeter Zijlstra1-0/+1
[ Upstream commit c79dd946e370af3537edb854f210cba3a94b4516 ] It is important to be able to unwind compat tasks too. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20250924080119.613695709@infradead.org Stable-dep-of: d55c571e4333 ("x86/uprobes: Fix XOL allocation failure for 32-bit tasks") Signed-off-by: Sasha Levin <sashal@kernel.org>
3 daysbpf: Introduce tnum_step to step through tnum's membersHarishankar Vishwanathan1-0/+3
[ Upstream commit 76e954155b45294c502e3d3a9e15757c858ca55e ] This commit introduces tnum_step(), a function that, when given t, and a number z returns the smallest member of t larger than z. The number z must be greater or equal to the smallest member of t and less than the largest member of t. The first step is to compute j, a number that keeps all of t's known bits, and matches all unknown bits to z's bits. Since j is a member of the t, it is already a candidate for result. However, we want our result to be (minimally) greater than z. There are only two possible cases: (1) Case j <= z. In this case, we want to increase the value of j and make it > z. (2) Case j > z. In this case, we want to decrease the value of j while keeping it > z. (Case 1) j <= z t = xx11x0x0 z = 10111101 (189) j = 10111000 (184) ^ k (Case 1.1) Let's first consider the case where j < z. We will address j == z later. Since z > j, there had to be a bit position that was 1 in z and a 0 in j, beyond which all positions of higher significance are equal in j and z. Further, this position could not have been unknown in a, because the unknown positions of a match z. This position had to be a 1 in z and known 0 in t. Let k be position of the most significant 1-to-0 flip. In our example, k = 3 (starting the count at 1 at the least significant bit). Setting (to 1) the unknown bits of t in positions of significance smaller than k will not produce a result > z. Hence, we must set/unset the unknown bits at positions of significance higher than k. Specifically, we look for the next larger combination of 1s and 0s to place in those positions, relative to the combination that exists in z. We can achieve this by concatenating bits at unknown positions of t into an integer, adding 1, and writing the bits of that result back into the corresponding bit positions previously extracted from z. >From our example, considering only positions of significance greater than k: t = xx..x z = 10..1 + 1 ----- 11..0 This is the exact combination 1s and 0s we need at the unknown bits of t in positions of significance greater than k. Further, our result must only increase the value minimally above z. Hence, unknown bits in positions of significance smaller than k should remain 0. We finally have, result = 11110000 (240) (Case 1.2) Now consider the case when j = z, for example t = 1x1x0xxx z = 10110100 (180) j = 10110100 (180) Matching the unknown bits of the t to the bits of z yielded exactly z. To produce a number greater than z, we must set/unset the unknown bits in t, and *all* the unknown bits of t candidates for being set/unset. We can do this similar to Case 1.1, by adding 1 to the bits extracted from the masked bit positions of z. Essentially, this case is equivalent to Case 1.1, with k = 0. t = 1x1x0xxx z = .0.1.100 + 1 --------- .0.1.101 This is the exact combination of bits needed in the unknown positions of t. After recalling the known positions of t, we get result = 10110101 (181) (Case 2) j > z t = x00010x1 z = 10000010 (130) j = 10001011 (139) ^ k Since j > z, there had to be a bit position which was 0 in z, and a 1 in j, beyond which all positions of higher significance are equal in j and z. This position had to be a 0 in z and known 1 in t. Let k be the position of the most significant 0-to-1 flip. In our example, k = 4. Because of the 0-to-1 flip at position k, a member of t can become greater than z if the bits in positions greater than k are themselves >= to z. To make that member *minimally* greater than z, the bits in positions greater than k must be exactly = z. Hence, we simply match all of t's unknown bits in positions more significant than k to z's bits. In positions less significant than k, we set all t's unknown bits to 0 to retain minimality. In our example, in positions of greater significance than k (=4), t=x000. These positions are matched with z (1000) to produce 1000. In positions of lower significance than k, t=10x1. All unknown bits are set to 0 to produce 1001. The final result is: result = 10001001 (137) This concludes the computation for a result > z that is a member of t. The procedure for tnum_step() in this commit implements the idea described above. As a proof of correctness, we verified the algorithm against a logical specification of tnum_step. The specification asserts the following about the inputs t, z and output res that: 1. res is a member of t, and 2. res is strictly greater than z, and 3. there does not exist another value res2 such that 3a. res2 is also a member of t, and 3b. res2 is greater than z 3c. res2 is smaller than res We checked the implementation against this logical specification using an SMT solver. The verification formula in SMTLIB format is available at [1]. The verification returned an "unsat": indicating that no input assignment exists for which the implementation and the specification produce different outputs. In addition, we also automatically generated the logical encoding of the C implementation using Agni [2] and verified it against the same specification. This verification also returned an "unsat", confirming that the implementation is equivalent to the specification. The formula for this check is also available at [3]. Link: https://pastebin.com/raw/2eRWbiit [1] Link: https://github.com/bpfverif/agni [2] Link: https://pastebin.com/raw/EztVbBJ2 [3] Co-developed-by: Srinivas Narayana <srinivas.narayana@rutgers.edu> Signed-off-by: Srinivas Narayana <srinivas.narayana@rutgers.edu> Co-developed-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu> Signed-off-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu> Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Link: https://lore.kernel.org/r/93fdf71910411c0f19e282ba6d03b4c65f9c5d73.1772225741.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Stable-dep-of: efc11a667878 ("bpf: Improve bounds when tnum has a single possible value") Signed-off-by: Sasha Levin <sashal@kernel.org>
3 daysbpf: Add bitwise tracking for BPF_ENDTianci Cao1-0/+5
[ Upstream commit 9d21199842247ab05c675fb9b6c6ca393a5c0024 ] This patch implements bitwise tracking (tnum analysis) for BPF_END (byte swap) operation. Currently, the BPF verifier does not track value for BPF_END operation, treating the result as completely unknown. This limits the verifier's ability to prove safety of programs that perform endianness conversions, which are common in networking code. For example, the following code pattern for port number validation: int test(struct pt_regs *ctx) { __u64 x = bpf_get_prandom_u32(); x &= 0x3f00; // Range: [0, 0x3f00], var_off: (0x0; 0x3f00) x = bswap16(x); // Should swap to range [0, 0x3f], var_off: (0x0; 0x3f) if (x > 0x3f) goto trap; return 0; trap: return *(u64 *)NULL; // Should be unreachable } Currently generates verifier output: 1: (54) w0 &= 16128 ; R0=scalar(smin=smin32=0,smax=umax=smax32=umax32=16128,var_off=(0x0; 0x3f00)) 2: (d7) r0 = bswap16 r0 ; R0=scalar() 3: (25) if r0 > 0x3f goto pc+2 ; R0=scalar(smin=smin32=0,smax=umax=smax32=umax32=63,var_off=(0x0; 0x3f)) Without this patch, even though the verifier knows `x` has certain bits set, after bswap16, it loses all tracking information and treats port as having a completely unknown value [0, 65535]. According to the BPF instruction set[1], there are 3 kinds of BPF_END: 1. `bswap(16|32|64)`: opcode=0xd7 (BPF_END | BPF_ALU64 | BPF_TO_LE) - do unconditional swap 2. `le(16|32|64)`: opcode=0xd4 (BPF_END | BPF_ALU | BPF_TO_LE) - on big-endian: do swap - on little-endian: truncation (16/32-bit) or no-op (64-bit) 3. `be(16|32|64)`: opcode=0xdc (BPF_END | BPF_ALU | BPF_TO_BE) - on little-endian: do swap - on big-endian: truncation (16/32-bit) or no-op (64-bit) Since BPF_END operations are inherently bit-wise permutations, tnum (bitwise tracking) offers the most efficient and precise mechanism for value analysis. By implementing `tnum_bswap16`, `tnum_bswap32`, and `tnum_bswap64`, we can derive exact `var_off` values concisely, directly reflecting the bit-level changes. Here is the overview of changes: 1. In `tnum_bswap(16|32|64)` (kernel/bpf/tnum.c): Call `swab(16|32|64)` function on the value and mask of `var_off`, and do truncation for 16/32-bit cases. 2. In `adjust_scalar_min_max_vals` (kernel/bpf/verifier.c): Call helper function `scalar_byte_swap`. - Only do byte swap when * alu64 (unconditional swap) OR * switching between big-endian and little-endian machines. - If need do byte swap: * Firstly call `tnum_bswap(16|32|64)` to update `var_off`. * Then reset the bound since byte swap scrambles the range. - For 16/32-bit cases, truncate dst register to match the swapped size. This enables better verification of networking code that frequently uses byte swaps for protocol processing, reducing false positive rejections. [1] https://www.kernel.org/doc/Documentation/bpf/standardization/instruction-set.rst Co-developed-by: Shenghao Yuan <shenghaoyuan0928@163.com> Signed-off-by: Shenghao Yuan <shenghaoyuan0928@163.com> Co-developed-by: Yazhou Tang <tangyazhou518@outlook.com> Signed-off-by: Yazhou Tang <tangyazhou518@outlook.com> Signed-off-by: Tianci Cao <ziye@zju.edu.cn> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260204111503.77871-2-ziye@zju.edu.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org> Stable-dep-of: efc11a667878 ("bpf: Improve bounds when tnum has a single possible value") Signed-off-by: Sasha Levin <sashal@kernel.org>
3 dayssched/fair: Fix lag clampPeter Zijlstra1-0/+1
[ Upstream commit 6e3c0a4e1ad1e0455b7880fad02b3ee179f56c09 ] Vincent reported that he was seeing undue lag clamping in a mixed slice workload. Implement the max_slice tracking as per the todo comment. Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy") Reported-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: Shubhang Kaushik <shubhang@os.amperecomputing.com> Link: https://patch.msgid.link/20250422101628.GA33555@noisy.programming.kicks-ass.net Signed-off-by: Sasha Levin <sashal@kernel.org>
11 daystracing: Wake up poll waiters for hist files when removing an eventPetr Pavlu1-0/+5
[ Upstream commit 9678e53179aa7e907360f5b5b275769008a69b80 ] The event_hist_poll() function attempts to verify whether an event file is being removed, but this check may not occur or could be unnecessarily delayed. This happens because hist_poll_wakeup() is currently invoked only from event_hist_trigger() when a hist command is triggered. If the event file is being removed, no associated hist command will be triggered and a waiter will be woken up only after an unrelated hist command is triggered. Fix the issue by adding a call to hist_poll_wakeup() in remove_event_file_dir() after setting the EVENT_FILE_FL_FREED flag. This ensures that a task polling on a hist file is woken up and receives EPOLLERR. Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Tom Zanussi <zanussi@kernel.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Link: https://patch.msgid.link/20260219162737.314231-3-petr.pavlu@suse.com Fixes: 1bd13edbbed6 ("tracing/hist: Add poll(POLLIN) support on hist file") Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
11 daysfgraph: Do not call handlers direct when not using ftrace_opsSteven Rostedt1-3/+10
[ Upstream commit f4ff9f646a4d373f9e895c2f0073305da288bc0a ] The function graph tracer was modified to us the ftrace_ops of the function tracer. This simplified the code as well as allowed more features of the function graph tracer. Not all architectures were converted over as it required the implementation of HAVE_DYNAMIC_FTRACE_WITH_ARGS to implement. For those architectures, it still did it the old way where the function graph tracer handle was called by the function tracer trampoline. The handler then had to check the hash to see if the registered handlers wanted to be called by that function or not. In order to speed up the function graph tracer that used ftrace_ops, if only one callback was registered with function graph, it would call its function directly via a static call. Now, if the architecture does not support the use of using ftrace_ops and still has the ftrace function trampoline calling the function graph handler, then by doing a direct call it removes the check against the handler's hash (list of functions it wants callbacks to), and it may call that handler for functions that the handler did not request calls for. On 32bit x86, which does not support the ftrace_ops use with function graph tracer, it shows the issue: ~# trace-cmd start -p function -l schedule ~# trace-cmd show # tracer: function_graph # # CPU DURATION FUNCTION CALLS # | | | | | | | 2) * 11898.94 us | schedule(); 3) # 1783.041 us | schedule(); 1) | schedule() { ------------------------------------------ 1) bash-8369 => kworker-7669 ------------------------------------------ 1) | schedule() { ------------------------------------------ 1) kworker-7669 => bash-8369 ------------------------------------------ 1) + 97.004 us | } 1) | schedule() { [..] Now by starting the function tracer is another instance: ~# trace-cmd start -B foo -p function This causes the function graph tracer to trace all functions (because the function trace calls the function graph tracer for each on, and the function graph trace is doing a direct call): ~# trace-cmd show # tracer: function_graph # # CPU DURATION FUNCTION CALLS # | | | | | | | 1) 1.669 us | } /* preempt_count_sub */ 1) + 10.443 us | } /* _raw_spin_unlock_irqrestore */ 1) | tick_program_event() { 1) | clockevents_program_event() { 1) 1.044 us | ktime_get(); 1) 6.481 us | lapic_next_event(); 1) + 10.114 us | } 1) + 11.790 us | } 1) ! 181.223 us | } /* hrtimer_interrupt */ 1) ! 184.624 us | } /* __sysvec_apic_timer_interrupt */ 1) | irq_exit_rcu() { 1) 0.678 us | preempt_count_sub(); When it should still only be tracing the schedule() function. To fix this, add a macro FGRAPH_NO_DIRECT to be set to 0 when the architecture does not support function graph use of ftrace_ops, and set to 1 otherwise. Then use this macro to know to allow function graph tracer to call the handlers directly or not. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mark Rutland <mark.rutland@arm.com> Link: https://patch.msgid.link/20260218104244.5f14dade@gandalf.local.home Fixes: cc60ee813b503 ("function_graph: Use static_call and branch to optimize entry function") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
11 daysfbdev: Use device_create_with_groups() to fix sysfs groups registration raceHans de Goede1-1/+0
[ Upstream commit 68eeb0871e986ae5462439dae881e3a27bcef85f ] The fbdev sysfs attributes are registered after sending the uevent for the device creation, leaving a race window where e.g. udev rules may not be able to access the sysfs attributes because the registration is not done yet. Fix this by switching to device_create_with_groups(). This also results in a nice cleanup. After switching to device_create_with_groups() all that is left of fb_init_device() is setting the drvdata and that can be passed to device_create[_with_groups]() too. After which fb_init_device() can be completely removed. Dropping fb_init_device() + fb_cleanup_device() in turn allows removing fb_info.class_flag as they were the only user of this field. Fixes: 5fc830d6aca1 ("fbdev: Register sysfs groups through device_add_group") Cc: stable@vger.kernel.org Cc: Shixiong Ou <oushixiong@kylinos.cn> Signed-off-by: Hans de Goede <johannes.goede@oss.qualcomm.com> Signed-off-by: Helge Deller <deller@gmx.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
11 daysmm/vmscan: fix demotion targets checks in reclaim/demotionBing Jiao2-6/+6
[ Upstream commit 1aceed565ff172fc0331dd1d5e7e65139b711139 ] Patch series "mm/vmscan: fix demotion targets checks in reclaim/demotion", v9. This patch series addresses two issues in demote_folio_list(), can_demote(), and next_demotion_node() in reclaim/demotion. 1. demote_folio_list() and can_demote() do not correctly check demotion target against cpuset.mems_effective, which will cause (a) pages to be demoted to not-allowed nodes and (b) pages fail demotion even if the system still has allowed demotion nodes. Patch 1 fixes this bug by updating cpuset_node_allowed() and mem_cgroup_node_allowed() to return effective_mems, allowing directly logic-and operation against demotion targets. 2. next_demotion_node() returns a preferred demotion target, but it does not check the node against allowed nodes. Patch 2 ensures that next_demotion_node() filters against the allowed node mask and selects the closest demotion target to the source node. This patch (of 2): Fix two bugs in demote_folio_list() and can_demote() due to incorrect demotion target checks against cpuset.mems_effective in reclaim/demotion. Commit 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim") introduces the cpuset.mems_effective check and applies it to can_demote(). However: 1. It does not apply this check in demote_folio_list(), which leads to situations where pages are demoted to nodes that are explicitly excluded from the task's cpuset.mems. 2. It checks only the nodes in the immediate next demotion hierarchy and does not check all allowed demotion targets in can_demote(). This can cause pages to never be demoted if the nodes in the next demotion hierarchy are not set in mems_effective. These bugs break resource isolation provided by cpuset.mems. This is visible from userspace because pages can either fail to be demoted entirely or are demoted to nodes that are not allowed in multi-tier memory systems. To address these bugs, update cpuset_node_allowed() and mem_cgroup_node_allowed() to return effective_mems, allowing directly logic-and operation against demotion targets. Also update can_demote() and demote_folio_list() accordingly. Bug 1 reproduction: Assume a system with 4 nodes, where nodes 0-1 are top-tier and nodes 2-3 are far-tier memory. All nodes have equal capacity. Test script: echo 1 > /sys/kernel/mm/numa/demotion_enabled mkdir /sys/fs/cgroup/test echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control echo "0-2" > /sys/fs/cgroup/test/cpuset.mems echo $$ > /sys/fs/cgroup/test/cgroup.procs swapoff -a # Expectation: Should respect node 0-2 limit. # Observation: Node 3 shows significant allocation (MemFree drops) stress-ng --oomable --vm 1 --vm-bytes 150% --mbind 0,1 Bug 2 reproduction: Assume a system with 6 nodes, where nodes 0-2 are top-tier, node 3 is a far-tier node, and nodes 4-5 are the farthest-tier nodes. All nodes have equal capacity. Test script: echo 1 > /sys/kernel/mm/numa/demotion_enabled mkdir /sys/fs/cgroup/test echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control echo "0-2,4-5" > /sys/fs/cgroup/test/cpuset.mems echo $$ > /sys/fs/cgroup/test/cgroup.procs swapoff -a # Expectation: Pages are demoted to Nodes 4-5 # Observation: No pages are demoted before oom. stress-ng --oomable --vm 1 --vm-bytes 150% --mbind 0,1,2 Link: https://lkml.kernel.org/r/20260114205305.2869796-1-bingjiao@google.com Link: https://lkml.kernel.org/r/20260114205305.2869796-2-bingjiao@google.com Fixes: 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim") Signed-off-by: Bing Jiao <bingjiao@google.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Gregory Price <gourry@gourry.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
11 dayskcsan, compiler_types: avoid duplicate type issues in BPF Type FormatAlan Maguire1-7/+16
[ Upstream commit 9dc052234da736f7749f19ab6936342ec7dbe3ac ] Enabling KCSAN is causing a large number of duplicate types in BTF for core kernel structs like task_struct [1]. This is due to the definition in include/linux/compiler_types.h `#ifdef __SANITIZE_THREAD__ ... `#define __data_racy volatile .. `#else ... `#define __data_racy ... `#endif Because some objects in the kernel are compiled without KCSAN flags (KCSAN_SANITIZE) we sometimes get the empty __data_racy annotation for objects; as a result we get multiple conflicting representations of the associated structs in DWARF, and these lead to multiple instances of core kernel types in BTF since they cannot be deduplicated due to the additional modifier in some instances. Moving the __data_racy definition under CONFIG_KCSAN avoids this problem, since the volatile modifier will be present for both KCSAN and KCSAN_SANITIZE objects in a CONFIG_KCSAN=y kernel. Link: https://lkml.kernel.org/r/20260116091730.324322-1-alan.maguire@oracle.com Fixes: 31f605a308e6 ("kcsan, compiler_types: Introduce __data_racy type qualifier") Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Reported-by: Nilay Shroff <nilay@linux.ibm.com> Tested-by: Nilay Shroff <nilay@linux.ibm.com> Suggested-by: Marco Elver <elver@google.com> Reviewed-by: Marco Elver <elver@google.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com> Cc: Bart van Assche <bvanassche@acm.org> Cc: Daniel Borkman <daniel@iogearbox.net> Cc: Eduard Zingerman <eddyz87@gmail.com> Cc: Hao Luo <haoluo@google.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Jason A. Donenfeld <jason@zx2c4.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Kees Cook <kees@kernel.org> Cc: KP Singh <kpsingh@kernel.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Naman Jain <namjain@linux.microsoft.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stanislav Fomichev <sdf@fomichev.me> Cc: Uros Bizjak <ubizjak@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
11 dayscompiler-clang.h: require LLVM 19.1.0 or higher for __typeof_unqual__Nathan Chancellor1-1/+1
[ Upstream commit e8d899d301346a5591c9d1af06c3c9b3501cf84b ] When building the kernel using a version of LLVM between llvmorg-19-init (the first commit of the LLVM 19 development cycle) and the change in LLVM that actually added __typeof_unqual__ for all C modes [1], which might happen during a bisect of LLVM, there is a build failure: In file included from arch/x86/kernel/asm-offsets.c:9: In file included from include/linux/crypto.h:15: In file included from include/linux/completion.h:12: In file included from include/linux/swait.h:7: In file included from include/linux/spinlock.h:56: In file included from include/linux/preempt.h:79: arch/x86/include/asm/preempt.h:61:2: error: call to undeclared function '__typeof_unqual__'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration] 61 | raw_cpu_and_4(__preempt_count, ~PREEMPT_NEED_RESCHED); | ^ arch/x86/include/asm/percpu.h:478:36: note: expanded from macro 'raw_cpu_and_4' 478 | #define raw_cpu_and_4(pcp, val) percpu_binary_op(4, , "and", (pcp), val) | ^ arch/x86/include/asm/percpu.h:210:3: note: expanded from macro 'percpu_binary_op' 210 | TYPEOF_UNQUAL(_var) pto_tmp__; \ | ^ include/linux/compiler.h:248:29: note: expanded from macro 'TYPEOF_UNQUAL' 248 | # define TYPEOF_UNQUAL(exp) __typeof_unqual__(exp) | ^ The current logic of CC_HAS_TYPEOF_UNQUAL just checks for a major version of 19 but half of the 19 development cycle did not have support for __typeof_unqual__. Harden the logic of CC_HAS_TYPEOF_UNQUAL to avoid this error by only using __typeof_unqual__ with a released version of LLVM 19, which is greater than or equal to 19.1.0 with LLVM's versioning scheme that matches GCC's [2]. Link: https://github.com/llvm/llvm-project/commit/cc308f60d41744b5920ec2e2e5b25e1273c8704b [1] Link: https://github.com/llvm/llvm-project/commit/4532617ae420056bf32f6403dde07fb99d276a49 [2] Link: https://lkml.kernel.org/r/20260116-require-llvm-19-1-for-typeof_unqual-v1-1-3b9a4a4b212b@kernel.org Fixes: ac053946f5c4 ("compiler.h: introduce TYPEOF_UNQUAL() macro") Signed-off-by: Nathan Chancellor <nathan@kernel.org> Cc: Bill Wendling <morbo@google.com> Cc: Justin Stitt <justinstitt@google.com> Cc: Uros Bizjak <ubizjak@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
11 daysima: verify the previous kernel's IMA buffer lies in addressable RAMHarshit Mogalapalli1-0/+1
[ Upstream commit 10d1c75ed4382a8e79874379caa2ead8952734f9 ] Patch series "Address page fault in ima_restore_measurement_list()", v3. When the second-stage kernel is booted via kexec with a limiting command line such as "mem=<size>" we observe a pafe fault that happens. BUG: unable to handle page fault for address: ffff97793ff47000 RIP: ima_restore_measurement_list+0xdc/0x45a #PF: error_code(0x0000) not-present page This happens on x86_64 only, as this is already fixed in aarch64 in commit: cbf9c4b9617b ("of: check previous kernel's ima-kexec-buffer against memory bounds") This patch (of 3): When the second-stage kernel is booted with a limiting command line (e.g. "mem=<size>"), the IMA measurement buffer handed over from the previous kernel may fall outside the addressable RAM of the new kernel. Accessing such a buffer can fault during early restore. Introduce a small generic helper, ima_validate_range(), which verifies that a physical [start, end] range for the previous-kernel IMA buffer lies within addressable memory: - On x86, use pfn_range_is_mapped(). - On OF based architectures, use page_is_ram(). Link: https://lkml.kernel.org/r/20251231061609.907170-1-harshit.m.mogalapalli@oracle.com Link: https://lkml.kernel.org/r/20251231061609.907170-2-harshit.m.mogalapalli@oracle.com Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com> Reviewed-by: Mimi Zohar <zohar@linux.ibm.com> Cc: Alexander Graf <graf@amazon.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Borislav Betkov <bp@alien8.de> Cc: guoweikang <guoweikang.kernel@gmail.com> Cc: Henry Willard <henry.willard@oracle.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Bohac <jbohac@suse.cz> Cc: Joel Granados <joel.granados@kernel.org> Cc: Jonathan McDowell <noodles@fb.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Paul Webb <paul.x.webb@oracle.com> Cc: Sohil Mehta <sohil.mehta@intel.com> Cc: Sourabh Jain <sourabhjain@linux.ibm.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Yifei Liu <yifei.l.liu@oracle.com> Cc: Baoquan He <bhe@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
11 daysmfd: tps65219: Implement LOCK register handling for TPS65214Kory Maincent (TI.com)1-0/+2
[ Upstream commit d3fcf276b501a82d4504fd5b1ed40249546530d1 ] The TPS65214 PMIC variant has a LOCK_REG register that prevents writes to nearly all registers when locked. Unlock the registers at probe time and leave them unlocked permanently. This approach is justified because: - Register locking is very uncommon in typical system operation - No code path is expected to lock the registers during runtime - Adding a custom regmap write function would add overhead to every register write, including voltage changes triggered by CPU OPP transitions from the cpufreq governor which could happen quite frequently Cc: stable@vger.kernel.org Fixes: 7947219ab1a2d ("mfd: tps65219: Add support for TI TPS65214 PMIC") Reviewed-by: Andrew Davis <afd@ti.com> Signed-off-by: Kory Maincent (TI.com) <kory.maincent@bootlin.com> Link: https://patch.msgid.link/20251218-fix_tps65219-v5-1-8bb511417f3a@bootlin.com Signed-off-by: Lee Jones <lee@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
11 daysnet: add a common function to compute features for upper devicesHangbin Liu2-0/+19
[ Upstream commit 28098defc79fe7d29e6bfe4eb6312991f6bdc3d3 ] Some high level software drivers need to compute features from lower devices. But each has their own implementations and may lost some feature compute. Let's use one common function to compute features for kinds of these devices. The new helper uses the current bond implementation as the reference one, as the latter already handles all the relevant aspects: netdev features, TSO limits and dst retention. Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://patch.msgid.link/20251017034155.61990-2-liuhangbin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Stable-dep-of: bb4c698633c0 ("team: avoid NETDEV_CHANGEMTU event when unregistering slave") Signed-off-by: Sasha Levin <sashal@kernel.org>
11 daysipv6: Move ipv6_fl_list from ipv6_pinfo to inet_sock.Kuniyuki Iwashima1-1/+0
[ Upstream commit 1c17f4373d4db1e1f0ebd3ddcd8e7a642927a826 ] In {tcp6,udp6,raw6}_sock, struct ipv6_pinfo is always placed at the beginning of a new cache line because 1. __alignof__(struct tcp_sock) is 64 due to ____cacheline_aligned of __cacheline_group_begin(tcp_sock_write_tx) 2. __alignof__(struct udp_sock) is 64 due to ____cacheline_aligned of struct numa_drop_counters 3. in raw6_sock, struct numa_drop_counters is placed before struct ipv6_pinfo . struct ipv6_pinfo is 136 bytes, but the last cache line is only used by ipv6_fl_list: $ pahole -C ipv6_pinfo vmlinux struct ipv6_pinfo { ... /* --- cacheline 2 boundary (128 bytes) --- */ struct ipv6_fl_socklist * ipv6_fl_list; /* 128 8 */ /* size: 136, cachelines: 3, members: 23 */ Let's move ipv6_fl_list from struct ipv6_pinfo to struct inet_sock to save a full cache line for {tcp6,udp6,raw6}_sock. Now, struct ipv6_pinfo is 128 bytes, and {tcp6,udp6,raw6}_sock have 64 bytes less, while {tcp,udp,raw}_sock retain the same size. Before: # grep -E "^(RAW|UDP[^L\-]|TCP)" /proc/slabinfo | awk '{print $1, "\t", $4}' RAWv6 1408 UDPv6 1472 TCPv6 2560 RAW 1152 UDP 1280 TCP 2368 After: # grep -E "^(RAW|UDP[^L\-]|TCP)" /proc/slabinfo | awk '{print $1, "\t", $4}' RAWv6 1344 UDPv6 1408 TCPv6 2496 RAW 1152 UDP 1280 TCP 2368 Also, ipv6_fl_list and inet_flags (SNDFLOW bit) are placed in the same cache line. $ pahole -C inet_sock vmlinux ... /* --- cacheline 11 boundary (704 bytes)