summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
39 hoursLinux 6.18.17v6.18.17Sasha Levin1-1/+1
Tested-by: Brett A C Sheffield <bacs@librecast.net> Tested-by: Jon Hunter <jonathanh@nvidia.com> Tested-by: Dileep malepu <dileep.debian@gmail.com> Tested-by: Jeffrin Jose T <jeffrin@rajagiritech.edu.in> Tested-by: Florian Fainelli <florian.fainelli@broadcom.com> Tested-by: Ron Economos <re@w6rz.net> Tested-by: Peter Schneider <pschneider1968@googlemail.com> Tested-by: Mark Brown <broonie@kernel.org> Tested-by: Shuah Khan <skhan@linuxfoundation.org> Tested-by: Barry K. Nathan <barryn@pobox.com> Tested-by: Miguel Ojeda <ojeda@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hoursselftests/bpf: Avoid simplification of crafted bounds testPaul Chaignon1-1/+1
[ Upstream commit 024cea2d647ed8ab942f19544b892d324dba42b4 ] The reg_bounds_crafted tests validate the verifier's range analysis logic. They focus on the actual ranges and thus ignore the tnum. As a consequence, they carry the assumption that the tested cases can be reproduced in userspace without using the tnum information. Unfortunately, the previous change the refinement logic breaks that assumption for one test case: (u64)2147483648 (u32)<op> [4294967294; 0x100000000] The tested bytecode is shown below. Without our previous improvement, on the false branch of the condition, R7 is only known to have u64 range [0xfffffffe; 0x100000000]. With our improvement, and using the tnum information, we can deduce that R7 equals 0x100000000. 19: (bc) w0 = w6 ; R6=0x80000000 20: (bc) w0 = w7 ; R7=scalar(smin=umin=0xfffffffe,smax=umax=0x100000000,smin32=-2,smax32=0,var_off=(0x0; 0x1ffffffff)) 21: (be) if w6 <= w7 goto pc+3 ; R6=0x80000000 R7=0x100000000 R7's tnum is (0; 0x1ffffffff). On the false branch, regs_refine_cond_op refines R7's u32 range to [0; 0x7fffffff]. Then, __reg32_deduce_bounds refines the s32 range to 0 using u32 and finally also sets u32=0. From this, __reg_bound_offset improves the tnum to (0; 0x100000000). Finally, our previous patch uses this new tnum to deduce that it only intersect with u64=[0xfffffffe; 0x100000000] in a single value: 0x100000000. Because the verifier uses the tnum to reach this constant value, the selftest is unable to reproduce it by only simulating ranges. The solution implemented in this patch is to change the test case such that there is more than one overlap value between u64 and the tnum. The max. u64 value is thus changed from 0x100000000 to 0x300000000. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Link: https://lore.kernel.org/r/50641c6a7ef39520595dcafa605692427c1006ec.1772225741.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hoursbpf: collect only live registers in linked regsEduard Zingerman4-38/+73
[ Upstream commit 2658a1720a1944fbaeda937000ad2b3c3dfaf1bb ] Fix an inconsistency between func_states_equal() and collect_linked_regs(): - regsafe() uses check_ids() to verify that cached and current states have identical register id mapping. - func_states_equal() calls regsafe() only for registers computed as live by compute_live_registers(). - clean_live_states() is supposed to remove dead registers from cached states, but it can skip states belonging to an iterator-based loop. - collect_linked_regs() collects all registers sharing the same id, ignoring the marks computed by compute_live_registers(). Linked registers are stored in the state's jump history. - backtrack_insn() marks all linked registers for an instruction as precise whenever one of the linked registers is precise. The above might lead to a scenario: - There is an instruction I with register rY known to be dead at I. - Instruction I is reached via two paths: first A, then B. - On path A: - There is an id link between registers rX and rY. - Checkpoint C is created at I. - Linked register set {rX, rY} is saved to the jump history. - rX is marked as precise at I, causing both rX and rY to be marked precise at C. - On path B: - There is no id link between registers rX and rY, otherwise register states are sub-states of those in C. - Because rY is dead at I, check_ids() returns true. - Current state is considered equal to checkpoint C, propagate_precision() propagates spurious precision mark for register rY along the path B. - Depending on a program, this might hit verifier_bug() in the backtrack_insn(), e.g. if rY ∈ [r1..r5] and backtrack_insn() spots a function call. The reproducer program is in the next patch. This was hit by sched_ext scx_lavd scheduler code. Changes in tests: - verifier_scalar_ids.c selftests need modification to preserve some registers as live for __msg() checks. - exceptions_assert.c adjusted to match changes in the verifier log, R0 is dead after conditional instruction and thus does not get range. - precise.c adjusted to match changes in the verifier log, register r9 is dead after comparison and it's range is not important for test. Reported-by: Emil Tsalapatis <emil@etsalapatis.com> Fixes: 0fb3cf6110a5 ("bpf: use register liveness information for func_states_equal") Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260306-linked-regs-and-propagate-precision-v1-1-18e859be570d@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hourstracing: Add NULL pointer check to trigger_data_free()Guenter Roeck1-0/+3
[ Upstream commit 457965c13f0837a289c9164b842d0860133f6274 ] If trigger_data_alloc() fails and returns NULL, event_hist_trigger_parse() jumps to the out_free error path. While kfree() safely handles a NULL pointer, trigger_data_free() does not. This causes a NULL pointer dereference in trigger_data_free() when evaluating data->cmd_ops->set_filter. Fix the problem by adding a NULL pointer check to trigger_data_free(). The problem was found by an experimental code review agent based on gemini-3.1-pro while reviewing backports into v6.18.y. Cc: Miaoqian Lin <linmq006@gmail.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://patch.msgid.link/20260305193339.2810953-1-linux@roeck-us.net Fixes: 0550069cc25f ("tracing: Properly process error handling in event_hist_trigger_parse()") Assisted-by: Gemini:gemini-3.1-pro Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hoursselftest/arm64: Fix sve2p1_sigill() to hwcap testYifan Wu1-2/+2
[ Upstream commit d87c828daa7ead9763416f75cc416496969cf1dc ] The FEAT_SVE2p1 is indicated by ID_AA64ZFR0_EL1.SVEver. However, the BFADD requires the FEAT_SVE_B16B16, which is indicated by ID_AA64ZFR0_EL1.B16B16. This could cause the test to incorrectly fail on a CPU that supports FEAT_SVE2.1 but not FEAT_SVE_B16B16. LD1Q Gather load quadwords which is decoded from SVE encodings and implied by FEAT_SVE2p1. Fixes: c5195b027d29 ("kselftest/arm64: Add SVE 2.1 to hwcap test") Signed-off-by: Yifan Wu <wuyifan50@huawei.com> Reviewed-by: Mark Brown <broonie@kernel.org> Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hoursata: libata-eh: Fix detection of deferred qc timeoutsGuenter Roeck1-1/+1
[ Upstream commit ee0e6e69a772d601e152e5368a1da25d656122a8 ] If the ata_qc_for_each_raw() loop finishes without finding a matching SCSI command for any QC, the variable qc will hold a pointer to the last element examined, which has the tag i == ATA_MAX_QUEUE - 1. This qc can match the port deferred QC (ap->deferred_qc). If that happens, the condition qc == ap->deferred_qc evaluates to true despite the loop not breaking with a match on the SCSI command for this QC. In that case, the error handler mistakenly intercepts a command that has not been issued yet and that has not timed out, and thus erroneously returning a timeout error. Fix the problem by checking for i < ATA_MAX_QUEUE in addition to qc == ap->deferred_qc. The problem was found by an experimental code review agent based on gemini-3.1-pro while reviewing backports into v6.18.y. Assisted-by: Gemini:gemini-3.1-pro Fixes: eddb98ad9364 ("ata: libata-eh: correctly handle deferred qc timeouts") Signed-off-by: Guenter Roeck <linux@roeck-us.net> [cassel: modified commit log as suggested by Damien] Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Niklas Cassel <cassel@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hoursxdp: produce a warning when calculated tailroom is negativeLarysa Zaremba1-1/+2
[ Upstream commit 8821e857759be9db3cde337ad328b71fe5c8a55f ] Many ethernet drivers report xdp Rx queue frag size as being the same as DMA write size. However, the only user of this field, namely bpf_xdp_frags_increase_tail(), clearly expects a truesize. Such difference leads to unspecific memory corruption issues under certain circumstances, e.g. in ixgbevf maximum DMA write size is 3 KB, so when running xskxceiver's XDP_ADJUST_TAIL_GROW_MULTI_BUFF, 6K packet fully uses all DMA-writable space in 2 buffers. This would be fine, if only rxq->frag_size was properly set to 4K, but value of 3K results in a negative tailroom, because there is a non-zero page offset. We are supposed to return -EINVAL and be done with it in such case, but due to tailroom being stored as an unsigned int, it is reported to be somewhere near UINT_MAX, resulting in a tail being grown, even if the requested offset is too much (it is around 2K in the abovementioned test). This later leads to all kinds of unspecific calltraces. [ 7340.337579] xskxceiver[1440]: segfault at 1da718 ip 00007f4161aeac9d sp 00007f41615a6a00 error 6 [ 7340.338040] xskxceiver[1441]: segfault at 7f410000000b ip 00000000004042b5 sp 00007f415bffecf0 error 4 [ 7340.338179] in libc.so.6[61c9d,7f4161aaf000+160000] [ 7340.339230] in xskxceiver[42b5,400000+69000] [ 7340.340300] likely on CPU 6 (core 0, socket 6) [ 7340.340302] Code: ff ff 01 e9 f4 fe ff ff 0f 1f 44 00 00 4c 39 f0 74 73 31 c0 ba 01 00 00 00 f0 0f b1 17 0f 85 ba 00 00 00 49 8b 87 88 00 00 00 <4c> 89 70 08 eb cc 0f 1f 44 00 00 48 8d bd f0 fe ff ff 89 85 ec fe [ 7340.340888] likely on CPU 3 (core 0, socket 3) [ 7340.345088] Code: 00 00 00 ba 00 00 00 00 be 00 00 00 00 89 c7 e8 31 ca ff ff 89 45 ec 8b 45 ec 85 c0 78 07 b8 00 00 00 00 eb 46 e8 0b c8 ff ff <8b> 00 83 f8 69 74 24 e8 ff c7 ff ff 8b 00 83 f8 0b 74 18 e8 f3 c7 [ 7340.404334] Oops: general protection fault, probably for non-canonical address 0x6d255010bdffc: 0000 [#1] SMP NOPTI [ 7340.405972] CPU: 7 UID: 0 PID: 1439 Comm: xskxceiver Not tainted 6.19.0-rc1+ #21 PREEMPT(lazy) [ 7340.408006] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-5.fc42 04/01/2014 [ 7340.409716] RIP: 0010:lookup_swap_cgroup_id+0x44/0x80 [ 7340.410455] Code: 83 f8 1c 73 39 48 ba ff ff ff ff ff ff ff 03 48 8b 04 c5 20 55 fa bd 48 21 d1 48 89 ca 83 e1 01 48 d1 ea c1 e1 04 48 8d 04 90 <8b> 00 48 83 c4 10 d3 e8 c3 cc cc cc cc 31 c0 e9 98 b7 dd 00 48 89 [ 7340.412787] RSP: 0018:ffffcc5c04f7f6d0 EFLAGS: 00010202 [ 7340.413494] RAX: 0006d255010bdffc RBX: ffff891f477895a8 RCX: 0000000000000010 [ 7340.414431] RDX: 0001c17e3fffffff RSI: 00fa070000000000 RDI: 000382fc7fffffff [ 7340.415354] RBP: 00fa070000000000 R08: ffffcc5c04f7f8f8 R09: ffffcc5c04f7f7d0 [ 7340.416283] R10: ffff891f4c1a7000 R11: ffffcc5c04f7f9c8 R12: ffffcc5c04f7f7d0 [ 7340.417218] R13: 03ffffffffffffff R14: 00fa06fffffffe00 R15: ffff891f47789500 [ 7340.418229] FS: 0000000000000000(0000) GS:ffff891ffdfaa000(0000) knlGS:0000000000000000 [ 7340.419489] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 7340.420286] CR2: 00007f415bfffd58 CR3: 0000000103f03002 CR4: 0000000000772ef0 [ 7340.421237] PKRU: 55555554 [ 7340.421623] Call Trace: [ 7340.421987] <TASK> [ 7340.422309] ? softleaf_from_pte+0x77/0xa0 [ 7340.422855] swap_pte_batch+0xa7/0x290 [ 7340.423363] zap_nonpresent_ptes.constprop.0.isra.0+0xd1/0x270 [ 7340.424102] zap_pte_range+0x281/0x580 [ 7340.424607] zap_pmd_range.isra.0+0xc9/0x240 [ 7340.425177] unmap_page_range+0x24d/0x420 [ 7340.425714] unmap_vmas+0xa1/0x180 [ 7340.426185] exit_mmap+0xe1/0x3b0 [ 7340.426644] __mmput+0x41/0x150 [ 7340.427098] exit_mm+0xb1/0x110 [ 7340.427539] do_exit+0x1b2/0x460 [ 7340.427992] do_group_exit+0x2d/0xc0 [ 7340.428477] get_signal+0x79d/0x7e0 [ 7340.428957] arch_do_signal_or_restart+0x34/0x100 [ 7340.429571] exit_to_user_mode_loop+0x8e/0x4c0 [ 7340.430159] do_syscall_64+0x188/0x6b0 [ 7340.430672] ? __do_sys_clone3+0xd9/0x120 [ 7340.431212] ? switch_fpu_return+0x4e/0xd0 [ 7340.431761] ? arch_exit_to_user_mode_prepare.isra.0+0xa1/0xc0 [ 7340.432498] ? do_syscall_64+0xbb/0x6b0 [ 7340.433015] ? __handle_mm_fault+0x445/0x690 [ 7340.433582] ? count_memcg_events+0xd6/0x210 [ 7340.434151] ? handle_mm_fault+0x212/0x340 [ 7340.434697] ? do_user_addr_fault+0x2b4/0x7b0 [ 7340.435271] ? clear_bhb_loop+0x30/0x80 [ 7340.435788] ? clear_bhb_loop+0x30/0x80 [ 7340.436299] ? clear_bhb_loop+0x30/0x80 [ 7340.436812] ? clear_bhb_loop+0x30/0x80 [ 7340.437323] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 7340.437973] RIP: 0033:0x7f4161b14169 [ 7340.438468] Code: Unable to access opcode bytes at 0x7f4161b1413f. [ 7340.439242] RSP: 002b:00007ffc6ebfa770 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca [ 7340.440173] RAX: fffffffffffffe00 RBX: 00000000000005a1 RCX: 00007f4161b14169 [ 7340.441061] RDX: 00000000000005a1 RSI: 0000000000000109 RDI: 00007f415bfff990 [ 7340.441943] RBP: 00007ffc6ebfa7a0 R08: 0000000000000000 R09: 00000000ffffffff [ 7340.442824] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [ 7340.443707] R13: 0000000000000000 R14: 00007f415bfff990 R15: 00007f415bfff6c0 [ 7340.444586] </TASK> [ 7340.444922] Modules linked in: rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency_common skx_edac_common nfit libnvdimm kvm_intel vfat fat kvm snd_pcm irqbypass rapl iTCO_wdt snd_timer intel_pmc_bxt iTCO_vendor_support snd ixgbevf virtio_net soundcore i2c_i801 pcspkr libeth_xdp net_failover i2c_smbus lpc_ich failover libeth virtio_balloon joydev 9p fuse loop zram lz4hc_compress lz4_compress 9pnet_virtio 9pnet netfs ghash_clmulni_intel serio_raw qemu_fw_cfg [ 7340.449650] ---[ end trace 0000000000000000 ]--- The issue can be fixed in all in-tree drivers, but we cannot just trust OOT drivers to not do this. Therefore, make tailroom a signed int and produce a warning when it is negative to prevent such mistakes in the future. Fixes: bf25146a5595 ("bpf: add frags support to the bpf_xdp_adjust_tail() API") Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Link: https://patch.msgid.link/20260305111253.2317394-10-larysa.zaremba@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hoursnet: enetc: use truesize as XDP RxQ info frag_sizeLarysa Zaremba1-1/+1
[ Upstream commit f8e18abf183dbd636a8725532c7f5aa58957de84 ] The only user of frag_size field in XDP RxQ info is bpf_xdp_frags_increase_tail(). It clearly expects truesize instead of DMA write size. Different assumptions in enetc driver configuration lead to negative tailroom. Set frag_size to the same value as frame_sz. Fixes: 2768b2e2f7d2 ("net: enetc: register XDP RX queues with frag_size") Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Link: https://patch.msgid.link/20260305111253.2317394-9-larysa.zaremba@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hoursi40e: use xdp.frame_sz as XDP RxQ info frag_sizeLarysa Zaremba1-3/+6
[ Upstream commit c69d22c6c46a1d792ba8af3d8d6356fdc0e6f538 ] The only user of frag_size field in XDP RxQ info is bpf_xdp_frags_increase_tail(). It clearly expects whole buffer size instead of DMA write size. Different assumptions in i40e driver configuration lead to negative tailroom. Set frag_size to the same value as frame_sz in shared pages mode, use new helper to set frag_size when AF_XDP ZC is active. Fixes: a045d2f2d03d ("i40e: set xdp_rxq_info::frag_size") Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Link: https://patch.msgid.link/20260305111253.2317394-7-larysa.zaremba@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hoursi40e: fix registering XDP RxQ infoLarysa Zaremba2-17/+22
[ Upstream commit 8f497dc8a61429cc004720aa8e713743355d80cf ] Current way of handling XDP RxQ info in i40e has a problem, where frag_size is not updated when xsk_buff_pool is detached or when MTU is changed, this leads to growing tail always failing for multi-buffer packets. Couple XDP RxQ info registering with buffer allocations and unregistering with cleaning the ring. Fixes: a045d2f2d03d ("i40e: set xdp_rxq_info::frag_size") Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Link: https://patch.msgid.link/20260305111253.2317394-6-larysa.zaremba@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hoursxsk: introduce helper to determine rxq->frag_sizeLarysa Zaremba1-0/+10
[ Upstream commit 16394d80539937d348dd3b9ea32415c54e67a81b ] rxq->frag_size is basically a step between consecutive strictly aligned frames. In ZC mode, chunk size fits exactly, but if chunks are unaligned, there is no safe way to determine accessible space to grow tailroom. Report frag_size to be zero, if chunks are unaligned, chunk_size otherwise. Fixes: 24ea50127ecf ("xsk: support mbuf on ZC RX") Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Link: https://patch.msgid.link/20260305111253.2317394-3-larysa.zaremba@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hoursxdp: use modulo operation to calculate XDP frag tailroomLarysa Zaremba1-1/+2
[ Upstream commit 88b6b7f7b216108a09887b074395fa7b751880b1 ] The current formula for calculating XDP tailroom in mbuf packets works only if each frag has its own page (if rxq->frag_size is PAGE_SIZE), this defeats the purpose of the parameter overall and without any indication leads to negative calculated tailroom on at least half of frags, if shared pages are used. There are not many drivers that set rxq->frag_size. Among them: * i40e and enetc always split page uniformly between frags, use shared pages * ice uses page_pool frags via libeth, those are power-of-2 and uniformly distributed across page * idpf has variable frag_size with XDP on, so current API is not applicable * mlx5, mtk and mvneta use PAGE_SIZE or 0 as frag_size for page_pool As for AF_XDP ZC, only ice, i40e and idpf declare frag_size for it. Modulo operation yields good results for aligned chunks, they are all power-of-2, between 2K and PAGE_SIZE. Formula without modulo fails when chunk_size is 2K. Buffers in unaligned mode are not distributed uniformly, so modulo operation would not work. To accommodate unaligned buffers, we could define frag_size as data + tailroom, and hence do not subtract offset when calculating tailroom, but this would necessitate more changes in the drivers. Define rxq->frag_size as an even portion of a page that fully belongs to a single frag. When calculating tailroom, locate the data start within such portion by performing a modulo operation on page offset. Fixes: bf25146a5595 ("bpf: add frags support to the bpf_xdp_adjust_tail() API") Acked-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Link: https://patch.msgid.link/20260305111253.2317394-2-larysa.zaremba@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hoursnet/sched: act_ife: Fix metalist update behaviorJamal Hadi Salim2-52/+45
[ Upstream commit e2cedd400c3ec0302ffca2490e8751772906ac23 ] Whenever an ife action replace changes the metalist, instead of replacing the old data on the metalist, the current ife code is appending the new metadata. Aside from being innapropriate behavior, this may lead to an unbounded addition of metadata to the metalist which might cause an out of bounds error when running the encode op: [ 138.423369][ C1] ================================================================== [ 138.424317][ C1] BUG: KASAN: slab-out-of-bounds in ife_tlv_meta_encode (net/ife/ife.c:168) [ 138.424906][ C1] Write of size 4 at addr ffff8880077f4ffe by task ife_out_out_bou/255 [ 138.425778][ C1] CPU: 1 UID: 0 PID: 255 Comm: ife_out_out_bou Not tainted 7.0.0-rc1-00169-gfbdfa8da05b6 #624 PREEMPT(full) [ 138.425795][ C1] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 138.425800][ C1] Call Trace: [ 138.425804][ C1] <IRQ> [ 138.425808][ C1] dump_stack_lvl (lib/dump_stack.c:122) [ 138.425828][ C1] print_report (mm/kasan/report.c:379 mm/kasan/report.c:482) [ 138.425839][ C1] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221) [ 138.425844][ C1] ? __virt_addr_valid (./arch/x86/include/asm/preempt.h:95 (discriminator 1) ./include/linux/rcupdate.h:975 (discriminator 1) ./include/linux/mmzone.h:2207 (discriminator 1) arch/x86/mm/physaddr.c:54 (discriminator 1)) [ 138.425853][ C1] ? ife_tlv_meta_encode (net/ife/ife.c:168) [ 138.425859][ C1] kasan_report (mm/kasan/report.c:221 mm/kasan/report.c:597) [ 138.425868][ C1] ? ife_tlv_meta_encode (net/ife/ife.c:168) [ 138.425878][ C1] kasan_check_range (mm/kasan/generic.c:186 (discriminator 1) mm/kasan/generic.c:200 (discriminator 1)) [ 138.425884][ C1] __asan_memset (mm/kasan/shadow.c:84 (discriminator 2)) [ 138.425889][ C1] ife_tlv_meta_encode (net/ife/ife.c:168) [ 138.425893][ C1] ? ife_tlv_meta_encode (net/ife/ife.c:171) [ 138.425898][ C1] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221) [ 138.425903][ C1] ife_encode_meta_u16 (net/sched/act_ife.c:57) [ 138.425910][ C1] ? __pfx_do_raw_spin_lock (kernel/locking/spinlock_debug.c:114) [ 138.425916][ C1] ? __asan_memcpy (mm/kasan/shadow.c:105 (discriminator 3)) [ 138.425921][ C1] ? __pfx_ife_encode_meta_u16 (net/sched/act_ife.c:45) [ 138.425927][ C1] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221) [ 138.425931][ C1] tcf_ife_act (net/sched/act_ife.c:847 net/sched/act_ife.c:879) To solve this issue, fix the replace behavior by adding the metalist to the ife rcu data structure. Fixes: aa9fd9a325d51 ("sched: act: ife: update parameters via rcu handling") Reported-by: Ruitong Liu <cnitlrt@gmail.com> Tested-by: Ruitong Liu <cnitlrt@gmail.com> Co-developed-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260304140603.76500-1-jhs@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hoursnet: ipv6: fix panic when IPv4 route references loopback IPv6 nexthopJiayuan Chen1-5/+3
[ Upstream commit 21ec92774d1536f71bdc90b0e3d052eff99cf093 ] When a standalone IPv6 nexthop object is created with a loopback device (e.g., "ip -6 nexthop add id 100 dev lo"), fib6_nh_init() misclassifies it as a reject route. This is because nexthop objects have no destination prefix (fc_dst=::), causing fib6_is_reject() to match any loopback nexthop. The reject path skips fib_nh_common_init(), leaving nhc_pcpu_rth_output unallocated. If an IPv4 route later references this nexthop, __mkroute_output() dereferences NULL nhc_pcpu_rth_output and panics. Simplify the check in fib6_nh_init() to only match explicit reject routes (RTF_REJECT) instead of using fib6_is_reject(). The loopback promotion heuristic in fib6_is_reject() is handled separately by ip6_route_info_create_nh(). After this change, the three cases behave as follows: 1. Explicit reject route ("ip -6 route add unreachable 2001:db8::/64"): RTF_REJECT is set, enters reject path, skips fib_nh_common_init(). No behavior change. 2. Implicit loopback reject route ("ip -6 route add 2001:db8::/32 dev lo"): RTF_REJECT is not set, takes normal path, fib_nh_common_init() is called. ip6_route_info_create_nh() still promotes it to reject afterward. nhc_pcpu_rth_output is allocated but unused, which is harmless. 3. Standalone nexthop object ("ip -6 nexthop add id 100 dev lo"): RTF_REJECT is not set, takes normal path, fib_nh_common_init() is called. nhc_pcpu_rth_output is properly allocated, fixing the crash when IPv4 routes reference this nexthop. Suggested-by: Ido Schimmel <idosch@nvidia.com> Fixes: 493ced1ac47c ("ipv4: Allow routes to use nexthop objects") Reported-by: syzbot+334190e097a98a1b81bb@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/698f8482.a70a0220.2c38d7.00ca.GAE@google.com/T/ Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20260304113817.294966-2-jiayuan.chen@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hoursnet: vxlan: fix nd_tbl NULL dereference when IPv6 is disabledFernando Fernandez Mancera1-0/+5
[ Upstream commit 168ff39e4758897d2eee4756977d036d52884c7e ] When booting with the 'ipv6.disable=1' parameter, the nd_tbl is never initialized because inet6_init() exits before ndisc_init() is called which initializes it. If an IPv6 packet is injected into the interface, route_shortcircuit() is called and a NULL pointer dereference happens on neigh_lookup(). BUG: kernel NULL pointer dereference, address: 0000000000000380 Oops: Oops: 0000 [#1] SMP NOPTI [...] RIP: 0010:neigh_lookup+0x20/0x270 [...] Call Trace: <TASK> vxlan_xmit+0x638/0x1ef0 [vxlan] dev_hard_start_xmit+0x9e/0x2e0 __dev_queue_xmit+0xbee/0x14e0 packet_sendmsg+0x116f/0x1930 __sys_sendto+0x1f5/0x200 __x64_sys_sendto+0x24/0x30 do_syscall_64+0x12f/0x1590 entry_SYSCALL_64_after_hwframe+0x76/0x7e Fix this by adding an early check on route_shortcircuit() when protocol is ETH_P_IPV6. Note that ipv6_mod_enabled() cannot be used here because VXLAN can be built-in even when IPv6 is built as a module. Fixes: e15a00aafa4b ("vxlan: add ipv6 route short circuit support") Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Link: https://patch.msgid.link/20260304120357.9778-2-fmancera@suse.de Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hoursnet: bridge: fix nd_tbl NULL dereference when IPv6 is disabledFernando Fernandez Mancera2-2/+2
[ Upstream commit e5e890630533bdc15b26a34bb8e7ef539bdf1322 ] When booting with the 'ipv6.disable=1' parameter, the nd_tbl is never initialized because inet6_init() exits before ndisc_init() is called which initializes it. Then, if neigh_suppress is enabled and an ICMPv6 Neighbor Discovery packet reaches the bridge, br_do_suppress_nd() will dereference ipv6_stub->nd_tbl which is NULL, passing it to neigh_lookup(). This causes a kernel NULL pointer dereference. BUG: kernel NULL pointer dereference, address: 0000000000000268 Oops: 0000 [#1] PREEMPT SMP NOPTI [...] RIP: 0010:neigh_lookup+0x16/0xe0 [...] Call Trace: <IRQ> ? neigh_lookup+0x16/0xe0 br_do_suppress_nd+0x160/0x290 [bridge] br_handle_frame_finish+0x500/0x620 [bridge] br_handle_frame+0x353/0x440 [bridge] __netif_receive_skb_core.constprop.0+0x298/0x1110 __netif_receive_skb_one_core+0x3d/0xa0 process_backlog+0xa0/0x140 __napi_poll+0x2c/0x170 net_rx_action+0x2c4/0x3a0 handle_softirqs+0xd0/0x270 do_softirq+0x3f/0x60 Fix this by replacing IS_ENABLED(IPV6) call with ipv6_mod_enabled() in the callers. This is in essence disabling NS/NA suppression when IPv6 is disabled. Fixes: ed842faeb2bd ("bridge: suppress nd pkts on BR_NEIGH_SUPPRESS ports") Reported-by: Guruprasad C P <gurucp2005@gmail.com> Closes: https://lore.kernel.org/netdev/CAHXs0ORzd62QOG-Fttqa2Cx_A_VFp=utE2H2VTX5nqfgs7LDxQ@mail.gmail.com/ Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20260304120357.9778-1-fmancera@suse.de Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hoursselftests/harness: order TEST_F and XFAIL_ADD constructorsSun Jian1-2/+5
[ Upstream commit 6be2681514261324c8ee8a1c6f76cefdf700220f ] TEST_F() allocates and registers its struct __test_metadata via mmap() inside its constructor, and only then assigns the _##fixture_##test##_object pointer. XFAIL_ADD() runs in a constructor too and reads _##fixture_##test##_object to initialize xfail->test. If XFAIL_ADD runs first, xfail->test can be NULL and the expected failure will be reported as FAIL. Use constructor priorities to ensure TEST_F registration runs before XFAIL_ADD, without adding extra state or runtime lookups. Fixes: 2709473c9386 ("selftests: kselftest_harness: support using xfail") Signed-off-by: Sun Jian <sun.jian.kdev@gmail.com> Link: https://patch.msgid.link/20260225111451.347923-1-sun.jian.kdev@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hourskselftest/harness: Use helper to avoid zero-size memset warningWake Liu1-1/+7
[ Upstream commit 19b8a76cd99bde6d299e60490f3e62b8d3df3997 ] When building kselftests with a toolchain that enables source fortification (e.g., Android's build environment, which uses -D_FORTIFY_SOURCE=3), a build failure occurs in tests that use an empty FIXTURE(). The root cause is that an empty fixture struct results in `sizeof(self_private)` evaluating to 0. The compiler's fortification checks then detect the `memset()` call with a compile-time constant size of 0, issuing a `-Wuser-defined-warnings` which is promoted to an error by `-Werror`. An initial attempt to guard the call with `if (sizeof(self_private) > 0)` was insufficient. The compiler's static analysis is aggressive enough to flag the `memset(..., 0)` pattern before evaluating the conditional, thus still triggering the error. To resolve this robustly, this change introduces a `static inline` helper function, `__kselftest_memset_safe()`. This function wraps the size check and the `memset()` call. By replacing the direct `memset()` in the `__TEST_F_IMPL` macro with a call to this helper, we create an abstraction boundary. This prevents the compiler's static analyzer from "seeing" the problematic pattern at the macro expansion site, resolving the build failure. Build Context: Compiler: Android (14488419, +pgo, +bolt, +lto, +mlgo, based on r584948) clang version 22.0.0 (https://android.googlesource.com/toolchain/llvm-project 2d65e4108033380e6fe8e08b1f1826cd2bfb0c99) Relevant Options: -O2 -Wall -Werror -D_FORTIFY_SOURCE=3 -target i686-linux-android10000 Test: m kselftest_futex_futex_requeue_pi Removed Gerrit Change-Id Shuah Khan <skhan@linuxfoundation.org> Link: https://lore.kernel.org/r/20251224084120.249417-1-wakel@google.com Signed-off-by: Wake Liu <wakel@google.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org> Stable-dep-of: 6be268151426 ("selftests/harness: order TEST_F and XFAIL_ADD constructors") Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hoursnet: ethernet: mtk_eth_soc: Reset prog ptr to old_prog in case of error in ↵Lorenzo Bianconi1-3/+12
mtk_xdp_setup() [ Upstream commit 0abc73c8a40fd64ac1739c90bb4f42c418d27a5e ] Reset eBPF program pointer to old_prog and do not decrease its ref-count if mtk_open routine in mtk_xdp_setup() fails. Fixes: 7c26c20da5d42 ("net: ethernet: mtk_eth_soc: add basic XDP support") Suggested-by: Paolo Valerio <pvalerio@redhat.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://patch.msgid.link/20260303-mtk-xdp-prog-ptr-fix-v2-1-97b6dbbe240f@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hoursnetfilter: nft_set_pipapo: split gc into unlink and reclaim phaseFlorian Westphal4-13/+50
[ Upstream commit 9df95785d3d8302f7c066050117b04cd3c2048c2 ] Yiming Qian reports Use-after-free in the pipapo set type: Under a large number of expired elements, commit-time GC can run for a very long time in a non-preemptible context, triggering soft lockup warnings and RCU stall reports (local denial of service). We must split GC in an unlink and a reclaim phase. We cannot queue elements for freeing until pointers have been swapped. Expired elements are still exposed to both the packet path and userspace dumpers via the live copy of the data structure. call_rcu() does not protect us: dump operations or element lookups starting after call_rcu has fired can still observe the free'd element, unless the commit phase has made enough progress to swap the clone and live pointers before any new reader has picked up the old version. This a similar approach as done recently for the rbtree backend in commit 35f83a75529a ("netfilter: nft_set_rbtree: don't gc elements on insert"). Fixes: 3c4287f62044 ("nf_tables: Add set type for arbitrary concatenation of ranges") Reported-by: Yiming Qian <yimingqian591@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hoursnetfilter: nf_tables: clone set on flush onlyPablo Neira Ayuso5-6/+26
[ Upstream commit fb7fb4016300ac622c964069e286dc83166a5d52 ] Syzbot with fault injection triggered a failing memory allocation with GFP_KERNEL which results in a WARN splat: iter.err WARNING: net/netfilter/nf_tables_api.c:845 at nft_map_deactivate+0x34e/0x3c0 net/netfilter/nf_tables_api.c:845, CPU#0: syz.0.17/5992 Modules linked in: CPU: 0 UID: 0 PID: 5992 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2026 RIP: 0010:nft_map_deactivate+0x34e/0x3c0 net/netfilter/nf_tables_api.c:845 Code: 8b 05 86 5a 4e 09 48 3b 84 24 a0 00 00 00 75 62 48 8d 65 d8 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc cc e8 63 6d fa f7 90 <0f> 0b 90 43 +80 7c 35 00 00 0f 85 23 fe ff ff e9 26 fe ff ff 89 d9 RSP: 0018:ffffc900045af780 EFLAGS: 00010293 RAX: ffffffff89ca45bd RBX: 00000000fffffff4 RCX: ffff888028111e40 RDX: 0000000000000000 RSI: 00000000fffffff4 RDI: 0000000000000000 RBP: ffffc900045af870 R08: 0000000000400dc0 R09: 00000000ffffffff R10: dffffc0000000000 R11: fffffbfff1d141db R12: ffffc900045af7e0 R13: 1ffff920008b5f24 R14: dffffc0000000000 R15: ffffc900045af920 FS: 000055557a6a5500(0000) GS:ffff888125496000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fb5ea271fc0 CR3: 000000003269e000 CR4: 00000000003526f0 Call Trace: <TASK> __nft_release_table+0xceb/0x11f0 net/netfilter/nf_tables_api.c:12115 nft_rcv_nl_event+0xc25/0xdb0 net/netfilter/nf_tables_api.c:12187 notifier_call_chain+0x19d/0x3a0 kernel/notifier.c:85 blocking_notifier_call_chain+0x6a/0x90 kernel/notifier.c:380 netlink_release+0x123b/0x1ad0 net/netlink/af_netlink.c:761 __sock_release net/socket.c:662 [inline] sock_close+0xc3/0x240 net/socket.c:1455 Restrict set clone to the flush set command in the preparation phase. Add NFT_ITER_UPDATE_CLONE and use it for this purpose, update the rbtree and pipapo backends to only clone the set when this iteration type is used. As for the existing NFT_ITER_UPDATE type, update the pipapo backend to use the existing set clone if available, otherwise use the existing set representation. After this update, there is no need to clone a set that is being deleted, this includes bound anonymous set. An alternative approach to NFT_ITER_UPDATE_CLONE is to add a .clone interface and call it from the flush set path. Reported-by: syzbot+4924a0edc148e8b4b342@syzkaller.appspotmail.com Fixes: 3f1d886cc7c3 ("netfilter: nft_set_pipapo: move cloning of match info to insert/removal path") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hoursnetfilter: nf_tables: unconditionally bump set->nelems before insertionPablo Neira Ayuso1-14/+16
[ Upstream commit def602e498a4f951da95c95b1b8ce8ae68aa733a ] In case that the set is full, a new element gets published then removed without waiting for the RCU grace period, while RCU reader can be walking over it already. To address this issue, add the element transaction even if set is full, but toggle the set_full flag to report -ENFILE so the abort path safely unwinds the set to its previous state. As for element updates, decrement set->nelems to restore it. A simpler fix is to call synchronize_rcu() in the error path. However, with a large batch adding elements to already maxed-out set, this could cause noticeable slowdown of such batches. Fixes: 35d0ac9070ef ("netfilter: nf_tables: fix set->nelems counting with no NLM_F_EXCL") Reported-by: Inseo An <y0un9sa@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hoursnet: Provide a PREEMPT_RT specific check for netdev_queue::_xmit_lockSebastian Andrzej Siewior3-10/+24
[ Upstream commit b824c3e16c1904bf80df489e293d1e3cbf98896d ] After acquiring netdev_queue::_xmit_lock the number of the CPU owning the lock is recorded in netdev_queue::xmit_lock_owner. This works as long as the BH context is not preemptible. On PREEMPT_RT the softirq context is preemptible and without the softirq-lock it is possible to have multiple user in __dev_queue_xmit() submitting a skb on the same CPU. This is fine in general but this means also that the current CPU is recorded as netdev_queue::xmit_lock_owner. This in turn leads to the recursion alert and the skb is dropped. Instead checking the for CPU number, that owns the lock, PREEMPT_RT can check if the lockowner matches the current task. Add netif_tx_owned() which returns true if the current context owns the lock by comparing the provided CPU number with the recorded number. This resembles the current check by negating the condition (the current check returns true if the lock is not owned). On PREEMPT_RT use rt_mutex_owner() to return the lock owner and compare the current task against it. Use the new helper in __dev_queue_xmit() and netif_local_xmit_active() which provides a similar check. Update comments regarding pairing READ_ONCE(). Reported-by: Bert Karwatzki <spasswolf@web.de> Closes: https://lore.kernel.org/all/20260216134333.412332-1-spasswolf@web.de Fixes: 3253cb49cbad4 ("softirq: Allow to drop the softirq-BKL lock on PREEMPT_RT") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reported-by: Bert Karwatzki <spasswolf@web.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://patch.msgid.link/20260302162631.uGUyIqDT@linutronix.de Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hoursblock: use trylock to avoid lockdep circular dependency in sysfsMing Lei2-2/+18
[ Upstream commit ce8ee8583ed83122405eabaa8fb351be4d9dc65c ] Use trylock instead of blocking lock acquisition for update_nr_hwq_lock in queue_requests_store() and elv_iosched_store() to avoid circular lock dependency with kernfs active reference during concurrent disk deletion: update_nr_hwq_lock -> kn->active (via del_gendisk -> kobject_del) kn->active -> update_nr_hwq_lock (via sysfs write path) Return -EBUSY when the lock is not immediately available. Reported-and-tested-by: Yi Zhang <yi.zhang@redhat.com> Closes: https://lore.kernel.org/linux-block/CAHj4cs-em-4acsHabMdT=jJhXkCzjnprD-aQH1OgrZo4nTnmMw@mail.gmail.com/ Fixes: 626ff4f8ebcb ("blk-mq: convert to serialize updating nr_requests with update_nr_hwq_lock") Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hoursnet: stmmac: Defer VLAN HW configuration when interface is downOvidiu Panait2-19/+26
[ Upstream commit 2cd70e3968f505996d5fefdf7ca684f0f4575734 ] VLAN register accesses on the MAC side require the PHY RX clock to be active. When the network interface is down, the PHY is suspended and the RX clock is unavailable, causing VLAN operations to fail with timeouts. The VLAN core automatically removes VID 0 after the interface goes down and re-adds it when it comes back up, so these timeouts happen during normal interface down/up: # ip link set end1 down renesas-gbeth 15c40000.ethernet end1: Timeout accessing MAC_VLAN_Tag_Filter renesas-gbeth 15c40000.ethernet end1: failed to kill vid 0081/0 Adding VLANs while the interface is down also fails: # ip link add link end1 name end1.10 type vlan id 10 renesas-gbeth 15c40000.ethernet end1: Timeout accessing MAC_VLAN_Tag_Filter RTNETLINK answers: Device or resource busy To fix this, check if the interface is up before accessing VLAN registers. The software state is always kept up to date regardless of interface state. When the interface is brought up, stmmac_vlan_restore() is called to write the VLAN state to hardware. Fixes: ed64639bc1e0 ("net: stmmac: Add support for VLAN Rx filtering") Signed-off-by: Ovidiu Panait <ovidiu.panait.rb@renesas.com> Link: https://patch.msgid.link/20260303145828.7845-5-ovidiu.panait.rb@renesas.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hoursnet: stmmac: Fix VLAN HW state restoreOvidiu Panait2-12/+22
[ Upstream commit bd7ad51253a76fb35886d01cfe9a37f0e4ed6709 ] When the network interface is opened or resumed, a DMA reset is performed, which resets all hardware state, including VLAN state. Currently, only the resume path is restoring the VLAN state via stmmac_restore_hw_vlan_rx_fltr(), but that is incomplete: the VLAN hash table and the VLAN_TAG control bits are not restored. Therefore, add stmmac_vlan_restore(), which restores the full VLAN state by updating both the HW filter entries and the hash table, and call it from both the open and resume paths. The VLAN restore is moved outside of phylink_rx_clk_stop_block/unblock in the resume path because receive clock stop is already disabled when stmmac supports VLAN. Also, remove the hash readback code in vlan_restore_hw_rx_fltr() that attempts to restore VTHM by reading VLAN_HASH_TABLE, as it always reads zero after DMA reset, making it dead code. Fixes: 3cd1cfcba26e ("net: stmmac: Implement VLAN Hash Filtering in XGMAC") Fixes: ed64639bc1e0 ("net: stmmac: Add support for VLAN Rx filtering") Signed-off-by: Ovidiu Panait <ovidiu.panait.rb@renesas.com> Link: https://patch.msgid.link/20260303145828.7845-4-ovidiu.panait.rb@renesas.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Stable-dep-of: 2cd70e3968f5 ("net: stmmac: Defer VLAN HW configuration when interface is down") Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hoursnet: stmmac: Improve double VLAN handlingOvidiu Panait3-4/+21
[ Upstream commit e38200e361cbe331806dc454c76c11c7cd95e1b9 ] The double VLAN bits (EDVLP, ESVL, DOVLTC) are handled inconsistently between the two vlan_update_hash() implementations: - dwxgmac2_update_vlan_hash() explicitly clears the double VLAN bits when is_double is false, meaning that adding a 802.1Q VLAN will disable double VLAN mode: $ ip link add link eth0 name eth0.200 type vlan id 200 protocol 802.1ad $ ip link add link eth0 name eth0.100 type vlan id 100 # Double VLAN bits no longer set - vlan_update_hash() sets these bits and only clears them when the last VLAN has been removed, so double VLAN mode remains enabled even after all 802.1AD VLANs are removed. Address both issues by tracking the number of active 802.1AD VLANs in priv->num_double_vlans. Pass this count to stmmac_vlan_update() so both implementations correctly set the double VLAN bits when any 802.1AD VLAN is active, and clear them only when none remain. Also update vlan_update_hash() to explicitly clear the double VLAN bits when is_double is false, matching the dwxgmac2 behavior. Signed-off-by: Ovidiu Panait <ovidiu.panait.rb@renesas.com> Link: https://patch.msgid.link/20260303145828.7845-3-ovidiu.panait.rb@renesas.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Stable-dep-of: 2cd70e3968f5 ("net: stmmac: Defer VLAN HW configuration when interface is down") Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hoursnet: stmmac: Fix error handling in VLAN add and delete pathsOvidiu Panait1-4/+14
[ Upstream commit 35dfedce442c4060cfe5b98368bc9643fb995716 ] stmmac_vlan_rx_add_vid() updates active_vlans and the VLAN hash register before writing the HW filter entry. If the filter write fails, it leaves a stale VID in active_vlans and the hash register. stmmac_vlan_rx_kill_vid() has the reverse problem: it clears active_vlans before removing the HW filter. On failure, the VID is gone from active_vlans but still present in the HW filter table. To fix this, reorder the operations to update the hash table first, then attempt the HW filter operation. If the HW filter fails, roll back both the active_vlans bitmap and the hash table by calling stmmac_vlan_update() again. Fixes: ed64639bc1e0 ("net: stmmac: Add support for VLAN Rx filtering") Signed-off-by: Ovidiu Panait <ovidiu.panait.rb@renesas.com> Link: https://patch.msgid.link/20260303145828.7845-2-ovidiu.panait.rb@renesas.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hoursnfc: rawsock: cancel tx_work before socket teardownJakub Kicinski1-0/+11
[ Upstream commit d793458c45df2aed498d7f74145eab7ee22d25aa ] In rawsock_release(), cancel any pending tx_work and purge the write queue before orphaning the socket. rawsock_tx_work runs on the system workqueue and calls nfc_data_exchange which dereferences the NCI device. Without synchronization, tx_work can race with socket and device teardown when a process is killed (e.g. by SIGKILL), leading to use-after-free or leaked references. Set SEND_SHUTDOWN first so that if tx_work is already running it will see the flag and skip transmitting, then use cancel_work_sync to wait for any in-progress execution to finish, and finally purge any remaining queued skbs. Fixes: 23b7869c0fd0 ("NFC: add the NFC socket raw protocol") Reviewed-by: Joe Damato <joe@dama.to> Link: https://patch.msgid.link/20260303162346.2071888-6-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hoursnfc: nci: clear NCI_DATA_EXCHANGE before calling completion callbackJakub Kicinski1-4/+8
[ Upstream commit 0efdc02f4f6d52f8ca5d5889560f325a836ce0a8 ] Move clear_bit(NCI_DATA_EXCHANGE) before invoking the data exchange callback in nci_data_exchange_complete(). The callback (e.g. rawsock_data_exchange_complete) may immediately schedule another data exchange via schedule_work(tx_work). On a multi-CPU system, tx_work can run and reach nci_transceive() before the current nci_data_exchange_complete() clears the flag, causing test_and_set_bit(NCI_DATA_EXCHANGE) to return -EBUSY and the new transfer to fail. This causes intermittent flakes in nci/nci_dev in NIPA: # # RUN NCI.NCI1_0.t4t_tag_read ... # # t4t_tag_read: Test terminated by timeout # # FAIL NCI.NCI1_0.t4t_tag_read # not ok 3 NCI.NCI1_0.t4t_tag_read Fixes: 38f04c6b1b68 ("NFC: protect nci_data_exchange transactions") Reviewed-by: Joe Damato <joe@dama.to> Link: https://patch.msgid.link/20260303162346.2071888-5-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hoursnfc: nci: complete pending data exchange on device closeJakub Kicinski1-0/+9
[ Upstream commit 66083581945bd5b8e99fe49b5aeb83d03f62d053 ] In nci_close_device(), complete any pending data exchange before closing. The data exchange callback (e.g. rawsock_data_exchange_complete) holds a socket reference. NIPA occasionally hits this leak: unreferenced object 0xff1100000f435000 (size 2048): comm "nci_dev", pid 3954, jiffies 4295441245 hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 27 00 01 40 00 00 00 00 00 00 00 00 00 00 00 00 '..@............ backtrace (crc ec2b3c5): __kmalloc_noprof+0x4db/0x730 sk_prot_alloc.isra.0+0xe4/0x1d0 sk_alloc+0x36/0x760 rawsock_create+0xd1/0x540 nfc_sock_create+0x11f/0x280 __sock_create+0x22d/0x630 __sys_socket+0x115/0x1d0 __x64_sys_socket+0x72/0xd0 do_syscall_64+0x117/0xfc0 entry_SYSCALL_64_after_hwframe+0x4b/0x53 Fixes: 38f04c6b1b68 ("NFC: protect nci_data_exchange transactions") Reviewed-by: Joe Damato <joe@dama.to> Link: https://patch.msgid.link/20260303162346.2071888-4-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hoursnfc: nci: free skb on nci_transceive early error pathsJakub Kicinski1-2/+7
[ Upstream commit 7bd4b0c4779f978a6528c9b7937d2ca18e936e2c ] nci_transceive() takes ownership of the skb passed by the caller, but the -EPROTO, -EINVAL, and -EBUSY error paths return without freeing it. Due to issues clearing NCI_DATA_EXCHANGE fixed by subsequent changes the nci/nci_dev selftest hits the error path occasionally in NIPA, and kmemleak detects leaks: unreferenced object 0xff11000015ce6a40 (size 640): comm "nci_dev", pid 3954, jiffies 4295441246 hex dump (first 32 bytes): 6b 6b 6b 6b 00 a4 00 0c 02 e1 03 6b 6b 6b 6b 6b kkkk.......kkkkk 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk backtrace (crc 7c40cc2a): kmem_cache_alloc_node_noprof+0x492/0x630 __alloc_skb+0x11e/0x5f0 alloc_skb_with_frags+0xc6/0x8f0 sock_alloc_send_pskb+0x326/0x3f0 nfc_alloc_send_skb+0x94/0x1d0 rawsock_sendmsg+0x162/0x4c0 do_syscall_64+0x117/0xfc0 Fixes: 6a2968aaf50c ("NFC: basic NCI protocol implementation") Reviewed-by: Joe Damato <joe@dama.to> Link: https://patch.msgid.link/20260303162346.2071888-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hoursnet: devmem: use READ_ONCE/WRITE_ONCE on binding->devBobby Eshleman2-3/+5
[ Upstream commit 40bf00ec2ee271df5ba67593991760adf8b5d0ed ] binding->dev is protected on the write-side in mp_dmabuf_devmem_uninstall() against concurrent writes, but due to the concurrent bare reads in net_devmem_get_binding() and validate_xmit_unreadable_skb() it should be wrapped in a READ_ONCE/WRITE_ONCE pair to make sure no compiler optimizations play with the underlying register in unforeseen ways. Doesn't present a critical bug because the known compiler optimizations don't result in bad behavior. There is no tearing on u64, and load omissions/invented loads would only break if additional binding->dev references were inlined together (they aren't right now). This just more strictly follows the linux memory model (i.e., "Lock-Protected Writes With Lockless Reads" in tools/memory-model/Documentation/access-marking.txt). Fixes: bd61848900bf ("net: devmem: Implement TX path") Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com> Link: https://patch.msgid.link/20260302-devmem-membar-fix-v2-1-5b33c9cbc28b@meta.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hoursnet_sched: sch_fq: clear q->band_pkt_count[] in fq_reset()Eric Dumazet1-0/+1
[ Upstream commit a4c2b8be2e5329e7fac6e8f64ddcb8958155cfcb ] When/if a NIC resets, queues are deactivated by dev_deactivate_many(), then reactivated when the reset operation completes. fq_reset() removes all the skbs from various queues. If we do not clear q->band_pkt_count[], these counters keep growing and can eventually reach sch->limit, preventing new packets to be queued. Many thanks to Praveen for discovering the root cause. Fixes: 29f834aa326e ("net_sched: sch_fq: add 3 bands and WRR scheduling") Diagnosed-by: Praveen Kaligineedi <pkaligineedi@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260304015640.961780-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hoursnet: nfc: nci: Fix zero-length proprietary notificationsIan Ray1-1/+11
[ Upstream commit f7d92f11bd33a6eb49c7c812255ef4ab13681f0f ] NCI NFC controllers may have proprietary OIDs with zero-length payload. One example is: drivers/nfc/nxp-nci/core.c, NXP_NCI_RF_TXLDO_ERROR_NTF. Allow a zero length payload in proprietary notifications *only*. Before: -- >8 -- kernel: nci: nci_recv_frame: len 3 -- >8 -- After: -- >8 -- kernel: nci: nci_recv_frame: len 3 kernel: nci: nci_ntf_packet: NCI RX: MT=ntf, PBF=0, GID=0x1, OID=0x23, plen=0 kernel: nci: nci_ntf_packet: unknown ntf opcode 0x123 kernel: nfc nfc0: NFC: RF transmitter couldn't start. Bad power and/or configuration? -- >8 -- After fixing the hardware: -- >8 -- kernel: nci: nci_recv_frame: len 27 kernel: nci: nci_ntf_packet: NCI RX: MT=ntf, PBF=0, GID=0x1, OID=0x5, plen=24 kernel: nci: nci_rf_intf_activated_ntf_packet: rf_discovery_id 1 -- >8 -- Fixes: d24b03535e5e ("nfc: nci: Fix uninit-value in nci_dev_up and nci_ntf_packet") Signed-off-by: Ian Ray <ian.ray@gehealthcare.com> Link: https://patch.msgid.link/20260302163238.140576-1-ian.ray@gehealthcare.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hourstcp: secure_seq: add back ports to TS offsetEric Dumazet8-108/+127
[ Upstream commit 165573e41f2f66ef98940cf65f838b2cb575d9d1 ] This reverts 28ee1b746f49 ("secure_seq: downgrade to per-host timestamp offsets") tcp_tw_recycle went away in 2017. Zhouyan Deng reported off-path TCP source port leakage via SYN cookie side-channel that can be fixed in multiple ways. One of them is to bring back TCP ports in TS offset randomization. As a bonus, we perform a single siphash() computation to provide both an ISN and a TS offset. Fixes: 28ee1b746f49 ("secure_seq: downgrade to per-host timestamp offsets") Reported-by: Zhouyan Deng <dengzhouyan_nwpu@163.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Acked-by: Florian Westphal <fw@strlen.de> Link: https://patch.msgid.link/20260302205527.1982836-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hoursnet: sched: avoid qdisc_reset_all_tx_gt() vs dequeue race for lockless qdiscsKoichiro Den1-0/+10
[ Upstream commit 7f083faf59d14c04e01ec05a7507f036c965acf8 ] When shrinking the number of real tx queues, netif_set_real_num_tx_queues() calls qdisc_reset_all_tx_gt() to flush qdiscs for queues which will no longer be used. qdisc_reset_all_tx_gt() currently serializes qdisc_reset() with qdisc_lock(). However, for lockless qdiscs, the dequeue path is serialized by qdisc_run_begin/end() using qdisc->seqlock instead, so qdisc_reset() can run concurrently with __qdisc_run() and free skbs while they are still being dequeued, leading to UAF. This can easily be reproduced on e.g. virtio-net by imposing heavy traffic while frequently changing the number of queue pairs: iperf3 -ub0 -c $peer -t 0 & while :; do ethtool -L eth0 combined 1 ethtool -L eth0 combined 2 done With KASAN enabled, this leads to reports like: BUG: KASAN: slab-use-after-free in __qdisc_run+0x133f/0x1760 ... Call Trace: <TASK> ... __qdisc_run+0x133f/0x1760 __dev_queue_xmit+0x248f/0x3550 ip_finish_output2+0xa42/0x2110 ip_output+0x1a7/0x410 ip_send_skb+0x2e6/0x480 udp_send_skb+0xb0a/0x1590 udp_sendmsg+0x13c9/0x1fc0 ... </TASK> Allocated by task 1270 on cpu 5 at 44.558414s: ... alloc_skb_with_frags+0x84/0x7c0 sock_alloc_send_pskb+0x69a/0x830 __ip_append_data+0x1b86/0x48c0 ip_make_skb+0x1e8/0x2b0 udp_sendmsg+0x13a6/0x1fc0 ... Freed by task 1306 on cpu 3 at 44.558445s: ... kmem_cache_free+0x117/0x5e0 pfifo_fast_reset+0x14d/0x580 qdisc_reset+0x9e/0x5f0 netif_set_real_num_tx_queues+0x303/0x840 virtnet_set_channels+0x1bf/0x260 [virtio_net] ethnl_set_channels+0x684/0xae0 ethnl_default_set_doit+0x31a/0x890 ... Serialize qdisc_reset_all_tx_gt() against the lockless dequeue path by taking qdisc->seqlock for TCQ_F_NOLOCK qdiscs, matching the serialization model already used by dev_reset_queue(). Additionally clear QDISC_STATE_NON_EMPTY after reset so the qdisc state reflects an empty queue, avoiding needless re-scheduling. Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking") Signed-off-by: Koichiro Den <den@valinux.co.jp> Link: https://patch.msgid.link/20260228145307.3955532-1-den@valinux.co.jp Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hourshwmon: (max6639) fix inverted polarityOlivier Sobrie1-1/+1
[ Upstream commit 170a4b21f49b3dcff3115b4c90758f0a0d77375a ] According to MAX6639 documentation: D1: PWM Output Polarity. PWM output is low at 100% duty cycle when this bit is set to zero. PWM output is high at 100% duty cycle when this bit is set to 1. Up to commit 0f33272b60ed ("hwmon: (max6639) : Update hwmon init using info structure"), the polarity was set to high (0x2) when no platform data was set. After the patch, the polarity register wasn't set anymore if no platform data was specified. Nowadays, since commit 7506ebcd662b ("hwmon: (max6639) : Configure based on DT property"), it is always set to low which doesn't match with the comment above and change the behavior compared to versions prior 0f33272b60ed. Fixes: 0f33272b60ed ("hwmon: (max6639) : Update hwmon init using info structure") Signed-off-by: Olivier Sobrie <olivier@sobrie.be> Link: https://lore.kernel.org/r/20260304212039.570274-1-olivier@sobrie.be Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hourstimekeeping: Fix timex status validation for auxiliary clocksMiroslav Lichvar1-2/+4
[ Upstream commit e48a869957a70cc39b4090cd27c36a86f8db9b92 ] The timekeeping_validate_timex() function validates the timex status of an auxiliary system clock even when the status is not to be changed, which causes unexpected errors for applications that make read-only clock_adjtime() calls, or set some other timex fields, but without clearing the status field. Do the AUX-specific status validation only when the modes field contains ADJ_STATUS, i.e. the application is actually trying to change the status. This makes the AUX-specific clock_adjtime() behavior consistent with CLOCK_REALTIME. Fixes: 4eca49d0b621 ("timekeeping: Prepare do_adtimex() for auxiliary clocks") Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260225085231.276751-1-mlichvar@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hoursnvme: fix memory allocation in nvme_pr_read_keys()Sungwoo Kim1-2/+2
[ Upstream commit c3320153769f05fd7fe9d840cb555dd3080ae424 ] nvme_pr_read_keys() takes num_keys from userspace and uses it to calculate the allocation size for rse via struct_size(). The upper limit is PR_KEYS_MAX (64K). A malicious or buggy userspace can pass a large num_keys value that results in a 4MB allocation attempt at most, causing a warning in the page allocator when the order exceeds MAX_PAGE_ORDER. To fix this, use kvzalloc() instead of kzalloc(). This bug has the same reasoning and fix with the patch below: https://lore.kernel.org/linux-block/20251212013510.3576091-1-kartikey406@gmail.com/ Warning log: WARNING: mm/page_alloc.c:5216 at __alloc_frozen_pages_noprof+0x5aa/0x2300 mm/page_alloc.c:5216, CPU#1: syz-executor117/272 Modules linked in: CPU: 1 UID: 0 PID: 272 Comm: syz-executor117 Not tainted 6.19.0 #1 PREEMPT(voluntary) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014 RIP: 0010:__alloc_frozen_pages_noprof+0x5aa/0x2300 mm/page_alloc.c:5216 Code: ff 83 bd a8 fe ff ff 0a 0f 86 69 fb ff ff 0f b6 1d f9 f9 c4 04 80 fb 01 0f 87 3b 76 30 ff 83 e3 01 75 09 c6 05 e4 f9 c4 04 01 <0f> 0b 48 c7 85 70 fe ff ff 00 00 00 00 e9 8f fd ff ff 31 c0 e9 0d RSP: 0018:ffffc90000fcf450 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 1ffff920001f9ea0 RDX: 0000000000000000 RSI: 000000000000000b RDI: 0000000000040dc0 RBP: ffffc90000fcf648 R08: ffff88800b6c3380 R09: 0000000000000001 R10: ffffc90000fcf840 R11: ffff88807ffad280 R12: 0000000000000000 R13: 0000000000040dc0 R14: 0000000000000001 R15: ffffc90000fcf620 FS: 0000555565db33c0(0000) GS:ffff8880be26c000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000002000000c CR3: 0000000003b72000 CR4: 00000000000006f0 Call Trace: <TASK> alloc_pages_mpol+0x236/0x4d0 mm/mempolicy.c:2486 alloc_frozen_pages_noprof+0x149/0x180 mm/mempolicy.c:2557 ___kmalloc_large_node+0x10c/0x140 mm/slub.c:5598 __kmalloc_large_node_noprof+0x25/0xc0 mm/slub.c:5629 __do_kmalloc_node mm/slub.c:5645 [inline] __kmalloc_noprof+0x483/0x6f0 mm/slub.c:5669 kmalloc_noprof include/linux/slab.h:961 [inline] kzalloc_noprof include/linux/slab.h:1094 [inline] nvme_pr_read_keys+0x8f/0x4c0 drivers/nvme/host/pr.c:245 blkdev_pr_read_keys block/ioctl.c:456 [inline] blkdev_common_ioctl+0x1b71/0x29b0 block/ioctl.c:730 blkdev_ioctl+0x299/0x700 block/ioctl.c:786 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:597 [inline] __se_sys_ioctl fs/ioctl.c:583 [inline] __x64_sys_ioctl+0x1bf/0x220 fs/ioctl.c:583 x64_sys_call+0x1280/0x21b0 mnt/fuzznvme_1/fuzznvme/linux-build/v6.19/./arch/x86/include/generated/asm/syscalls_64.h:17 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0x71/0x330 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7fb893d3108d Code: 28 c3 e8 46 1e 00 00 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007ffff61f2f38 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007ffff61f3138 RCX: 00007fb893d3108d RDX: 0000000020000040 RSI: 00000000c01070ce RDI: 0000000000000003 RBP: 0000000000000001 R08: 0000000000000000 R09: 00007ffff61f3138 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001 R13: 00007ffff61f3128 R14: 00007fb893dae530 R15: 0000000000000001 </TASK> Fixes: 5fd96a4e15de (nvme: Add pr_ops read_keys support) Acked-by: Chao Shi <cshi008@fiu.edu> Acked-by: Weidong Zhu <weizhu@fiu.edu> Acked-by: Dave Tian <daveti@purdue.edu> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Sungwoo Kim <iam@sung-woo.kim> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hoursnvme: reject invalid pr_read_keys() num_keys valuesStefan Hajnoczi1-1/+5
[ Upstream commit 38ec8469f39e0e96e7dd9b76f05e0f8eb78be681 ] The pr_read_keys() interface has a u32 num_keys parameter. The NVMe Reservation Report command has a u32 maximum length. Reject num_keys values that are too large to fit. This will become important when pr_read_keys() is exposed to untrusted userspace via an <linux/pr.h> ioctl. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: c3320153769f ("nvme: fix memory allocation in nvme_pr_read_keys()") Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hoursdrm/xe/reg_sr: Fix leak on xa_store failureShuicheng Lin1-1/+3
[ Upstream commit 3091723785def05ebfe6a50866f87a044ae314ba ] Free the newly allocated entry when xa_store() fails to avoid a memory leak on the error path. v2: use goto fail_free. (Bala) Fixes: e5283bd4dfec ("drm/xe/reg_sr: Remove register pool") Cc: Balasubramani Vivekanandan <balasubramani.vivekanandan@intel.com> Cc: Matt Roper <matthew.d.roper@intel.com> Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patch.msgid.link/20260204172810.1486719-2-shuicheng.lin@intel.com Signed-off-by: Matt Roper <matthew.d.roper@intel.com> (cherry picked from commit 6bc6fec71ac45f52db609af4e62bdb96b9f5fadb) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hoursdrm/xe/gsc: Fix GSC proxy cleanup on early initialization failureZhanjun Dong2-8/+37
[ Upstream commit b3368ecca9538b88ddf982ea99064860fd5add97 ] xe_gsc_proxy_remove undoes what is done in both xe_gsc_proxy_init and xe_gsc_proxy_start; however, if we fail between those 2 calls, it is possible that the HW forcewake access hasn't been initialized yet and so we hit errors when the cleanup code tries to write GSC register. To avoid that, split the cleanup in 2 functions so that the HW cleanup is only called if the HW setup was completed successfully. Since the HW cleanup (interrupt disabling) is now removed from xe_gsc_proxy_remove, the cleanup on error paths in xe_gsc_proxy_start must be updated to disable interrupts before returning. Fixes: ff6cd29b690b ("drm/xe: Cleanup unwind of gt initialization") Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com> Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Link: https://patch.msgid.link/20260220225308.101469-1-zhanjun.dong@intel.com (cherry picked from commit 2b37c401b265c07b46408b5cb36a4b757c9b5060) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hoursi2c: i801: Revert "i2c: i801: replace acpi_lock with I2C bus lock"Charles Haithcock1-4/+10
[ Upstream commit cfc69c2e6c699c96949f7b0455195b0bfb7dc715 ] This reverts commit f707d6b9e7c18f669adfdb443906d46cfbaaa0c1. Under rare circumstances, multiple udev threads can collect i801 device info on boot and walk i801_acpi_io_handler somewhat concurrently. The first will note the area is reserved by acpi to prevent further touches. This ultimately causes the area to be deregistered. The second will enter i801_acpi_io_handler after the area is unregistered but before a check can be made that the area is unregistered. i2c_lock_bus relies on the now unregistered area containing lock_ops to lock the bus. The end result is a kernel panic on boot with the following backtrace; [ 14.971872] ioatdma 0000:09:00.2: enabling device (0100 -> 0102) [ 14.971873] BUG: kernel NULL pointer dereference, address: 0000000000000000 [ 14.971880] #PF: supervisor read access in kernel mode [ 14.971884] #PF: error_code(0x0000) - not-present page [ 14.971887] PGD 0 P4D 0 [ 14.971894] Oops: 0000 [#1] PREEMPT SMP PTI [ 14.971900] CPU: 5 PID: 956 Comm: systemd-udevd Not tainted 5.14.0-611.5.1.el9_7.x86_64 #1 [ 14.971905] Hardware name: XXXXXXXXXXXXXXXXXXXXXXX BIOS 1.20.10.SV91 01/30/2023 [ 14.971908] RIP: 0010:i801_acpi_io_handler+0x2d/0xb0 [i2c_i801] [ 14.971929] Code: 00 00 49 8b 40 20 41 57 41 56 4d 8b b8 30 04 00 00 49 89 ce 41 55 41 89 d5 41 54 49 89 f4 be 02 00 00 00 55 4c 89 c5 53 89 fb <48> 8b 00 4c 89 c7 e8 18 61 54 e9 80 bd 80 04 00 00 00 75 09 4c 3b [ 14.971933] RSP: 0018:ffffbaa841483838 EFLAGS: 00010282 [ 14.971938] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff9685e01ba568 [ 14.971941] RDX: 0000000000000008 RSI: 0000000000000002 RDI: 0000000000000000 [ 14.971944] RBP: ffff9685ca22f028 R08: ffff9685ca22f028 R09: ffff9685ca22f028 [ 14.971948] R10: 000000000000000b R11: 0000000000000580 R12: 0000000000000580 [ 14.971951] R13: 0000000000000008 R14: ffff9685e01ba568 R15: ffff9685c222f000 [ 14.971954] FS: 00007f8287c0ab40(0000) GS:ffff96a47f940000(0000) knlGS:0000000000000000 [ 14.971959] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 14.971963] CR2: 0000000000000000 CR3: 0000000168090001 CR4: 00000000003706f0 [ 14.971966] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 14.971968] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 14.971972] Call Trace: [ 14.971977] <TASK> [ 14.971981] ? show_trace_log_lvl+0x1c4/0x2df [ 14.971994] ? show_trace_log_lvl+0x1c4/0x2df [ 14.972003] ? acpi_ev_address_space_dispatch+0x16e/0x3c0 [ 14.972014] ? __die_body.cold+0x8/0xd [ 14.972021] ? page_fault_oops+0x132/0x170 [ 14.972028] ? exc_page_fault+0x61/0x150 [ 14.972036] ? asm_exc_page_fault+0x22/0x30 [ 14.972045] ? i801_acpi_io_handler+0x2d/0xb0 [i2c_i801] [ 14.972061] acpi_ev_address_space_dispatch+0x16e/0x3c0 [ 14.972069] ? __pfx_i801_acpi_io_handler+0x10/0x10 [i2c_i801] [ 14.972085] acpi_ex_access_region+0x5b/0xd0 [ 14.972093] acpi_ex_field_datum_io+0x73/0x2e0 [ 14.972100] acpi_ex_read_data_from_field+0x8e/0x230 [ 14.972106] acpi_ex_resolve_node_to_value+0x23d/0x310 [ 14.972114] acpi_ds_evaluate_name_path+0xad/0x110 [ 14.972121] acpi_ds_exec_end_op+0x321/0x510 [ 14.972127] acpi_ps_parse_loop+0xf7/0x680 [ 14.972136] acpi_ps_parse_aml+0x17a/0x3d0 [ 14.972143] acpi_ps_execute_method+0x137/0x270 [ 14.972150] acpi_ns_evaluate+0x1f4/0x2e0 [ 14.972158] acpi_evaluate_object+0x134/0x2f0 [ 14.972164] acpi_evaluate_integer+0x50/0xe0 [ 14.972173] ? vsnprintf+0x24b/0x570 [ 14.972181] acpi_ac_get_state.part.0+0x23/0x70 [ 14.972189] get_ac_property+0x4e/0x60 [ 14.972195] power_supply_show_property+0x90/0x1f0 [ 14.972205] add_prop_uevent+0x29/0x90 [ 14.972213] power_supply_uevent+0x109/0x1d0 [ 14.972222] dev_uevent+0x10e/0x2f0 [ 14.972228] uevent_show+0x8e/0x100 [ 14.972236] dev_attr_show+0x19/0x40 [ 14.972246] sysfs_kf_seq_show+0x9b/0x100 [ 14.972253] seq_read_iter+0x120/0x4b0 [ 14.972262] ? selinux_file_permission+0x106/0x150 [ 14.972273] vfs_read+0x24f/0x3a0 [ 14.972284] ksys_read+0x5f/0xe0 [ 14.972291] do_syscall_64+0x5f/0xe0 ... The kernel panic is mitigated by setting limiting the count of udev children to 1. Revert to using the acpi_lock to continue protecting marking the area as owned by firmware without relying on a lock in a potentially unmapped region of memory. Fixes: f707d6b9e7c1 ("i2c: i801: replace acpi_lock with I2C bus lock") Signed-off-by: Charles Haithcock <chaithco@redhat.com> [wsa: added Fixes-tag and updated comment stating the importance of the lock] Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hoursdrm/sched: Fix kernel-doc warning for drm_sched_job_done()Yujie Liu1-0/+1
[ Upstream commit 61ded1083b264ff67ca8c2de822c66b6febaf9a8 ] There is a kernel-doc warning for the scheduler: Warning: drivers/gpu/drm/scheduler/sched_main.c:367 function parameter 'result' not described in 'drm_sched_job_done' Fix the warning by describing the undocumented error code. Fixes: 539f9ee4b52a ("drm/scheduler: properly forward fence errors") Signed-off-by: Yujie Liu <yujie.liu@intel.com> [phasta: Flesh out commit message] Signed-off-by: Philipp Stanner <phasta@kernel.org> Link: https://patch.msgid.link/20260227082452.1802922-1-yujie.liu@intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hoursamd-xgbe: fix sleep while atomic on suspend/resumeRaju Rangoju3-14/+0
[ Upstream commit e2f27363aa6d983504c6836dd0975535e2e9dba0 ] The xgbe_powerdown() and xgbe_powerup() functions use spinlocks (spin_lock_irqsave) while calling functions that may sleep: - napi_disable() can sleep waiting for NAPI polling to complete - flush_workqueue() can sleep waiting for pending work items This causes a "BUG: scheduling while atomic" error during suspend/resume cycles on systems using the AMD XGBE Ethernet controller. The spinlock protection in these functions is unnecessary as these functions are called from suspend/resume paths which are already serialized by the PM core Fix this by removing the spinlock. Since only code that takes this lock is xgbe_powerdown() and xgbe_powerup(), remove it completely. Fixes: c5aa9e3b8156 ("amd-xgbe: Initial AMD 10GbE platform driver") Signed-off-by: Raju Rangoju <Raju.Rangoju@amd.com> Link: https://patch.msgid.link/20260302042124.1386445-1-Raju.Rangoju@amd.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hoursnet: ipv4: fix ARM64 alignment fault in multipath hash seedYung Chih Su2-3/+4
[ Upstream commit 4ee7fa6cf78ff26d783d39e2949d14c4c1cd5e7f ] `struct sysctl_fib_multipath_hash_seed` contains two u32 fields (user_seed and mp_seed), making it an 8-byte structure with a 4-byte alignment requirement. In `fib_multipath_hash_from_keys()`, the code evaluates the entire struct atomically via `READ_ONCE()`: mp_seed = READ_ONCE(net->ipv4.sysctl_fib_multipath_hash_seed).mp_seed; While this silently works on GCC by falling back to unaligned regular loads which the ARM64 kernel tolerates, it causes a fatal kernel panic when compiled with Clang and LTO enabled. Commit e35123d83ee3 ("arm64: lto: Strengthen READ_ONCE() to acquire when CONFIG_LTO=y") strengthens `READ_ONCE()` to use Load-Acquire instructions (`ldar` / `ldapr`) to prevent compiler reordering bugs under Clang LTO. Since the macro evaluates the full 8-byte struct, Clang emits a 64-bit `ldar` instruction. ARM64 architecture strictly requires `ldar` to be naturally aligned, thus executing it on a 4-byte aligned address triggers a strict Alignment Fault (FSC = 0x21). Fix the read side by moving the `READ_ONCE()` directly to the `u32` member, which emits a safe 32-bit `ldar Wn`. Furthermore, Eric Dumazet pointed out that `WRITE_ONCE()` on the entire struct in `proc_fib_multipath_hash_set_seed()` is also flawed. Analysis shows that Clang splits this 8-byte write into two separate 32-bit `str` instructions. While this avoids an alignment fault, it destroys atomicity and exposes a tear-write vulnerability. Fix this by explicitly splitting the write into two 32-bit `WRITE_ONCE()` operations. Finally, add the missing `READ_ONCE()` when reading `user_seed` in `proc_fib_multipath_hash_seed()` to ensure proper pairing and concurrency safety. Fixes: 4ee2a8cace3f ("net: ipv4: Add a sysctl to set multipath hash seed") Signed-off-by: Yung Chih Su <yuuchihsu@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260302060247.7066-1-yuuchihsu@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hoursipv6: fix NULL pointer deref in ip6_rt_get_dev_rcu()Jakub Kicinski1-1/+2
[ Upstream commit 2ffb4f5c2ccb2fa1c049dd11899aee7967deef5a ] l3mdev_master_dev_rcu() can return NULL when the slave device is being un-slaved from a VRF. All other callers deal with this, but we lost the fallback to loopback in ip6_rt_pcpu_alloc() -> ip6_rt_get_dev_rcu() with commit 4832c30d5458 ("net: ipv6: put host and anycast routes on device with address"). KASAN: null-ptr-deref in range [0x0000000000000108-0x000000000000010f] RIP: 0010:ip6_rt_pcpu_alloc (net/ipv6/route.c:1418) Call Trace: ip6_pol_route (net/ipv6/route.c:2318) fib6_rule_lookup (net/ipv6/fib6_rules.c:115) ip6_route_output_flags (net/ipv6/route.c:2607) vrf_process_v6_outbound (drivers/net/vrf.c:437) I was tempted to rework the un-slaving code to clear the flag first and insert synchronize_rcu() before we remove the upper. But looks like the explicit fallback to loopback_dev is an established pattern. And I guess avoiding the synchronize_rcu() is nice, too. Fixes: 4832c30d5458 ("net: ipv6: put host and anycast routes on device with address") Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20260301194548.927324-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hourssmb/client: fix buffer size for smb311_posix_qinfo in SMB311_posix_query_info()ZhangGuoDong1-1/+1
[ Upstream commit 9621b996e4db1dbc2b3dc5d5910b7d6179397320 ] SMB311_posix_query_info() is currently unused, but it may still be used in some stable versions, so these changes are submitted as a separate patch. Use `sizeof(struct smb311_posix_qinfo)` instead of sizeof its pointer, so the allocated buffer matches the actual struct size. Fixes: b1bc1874b885 ("smb311: Add support for SMB311 query info (non-compounded)") Reported-by: ChenXiaoSong <chenxiaosong@kylinos.cn> Signed-off-by: ZhangGuoDong <zhangguodong@kylinos.cn> Reviewed-by: ChenXiaoSong <chenxiaosong@kylinos.cn> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
40 hourssmb/client: fix buffer size for smb311_posix_qinfo in smb2_compound_op()ZhangGuoDong1-2/+2
[ Upstream commit 12c43a062acb0ac137fc2a4a106d4d084b8c5416 ] Use `sizeof(struct smb311_posix_qinfo)` instead of sizeof its pointer, so the allocated buffer matches the actual struct size. Fixes: 6a5f6592a0b6 ("SMB311: Add support for query info using posix extensions (level 100)") Reported-by: ChenXiaoSong <chenxiaosong@kylinos.cn> Signed-off-by: ZhangGuoDong <zhangguodong@kylinos.cn> Reviewed-by: ChenXiaoSong <chenxiaosong@kylinos.cn> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>