| Age | Commit message (Collapse) | Author | Files | Lines |
|
commit ec3797f043756a94ea2d0f106022e14ac4946c02 upstream.
OSD indexes come from untrusted network packets. Boundary checks are
added to validate these against map->max_osd.
[ idryomov: drop BUG_ON in ceph_get_primary_affinity(), minor cosmetic
edits ]
Cc: stable@vger.kernel.org
Signed-off-by: ziming zhang <ezrakiez@gmail.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 7fce830ecd0a0256590ee37eb65a39cbad3d64fc upstream.
The len field originates from untrusted network packets. Boundary
checks have been added to prevent potential out-of-bounds writes when
decrypting the connection secret or processing service tickets.
[ idryomov: changelog ]
Cc: stable@vger.kernel.org
Signed-off-by: ziming zhang <ezrakiez@gmail.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 076381c261374c587700b3accf410bdd2dba334e upstream.
The wait loop in __ceph_open_session() can race with the client
receiving a new monmap or osdmap shortly after the initial map is
received. Both ceph_monc_handle_map() and handle_one_map() install
a new map immediately after freeing the old one
kfree(monc->monmap);
monc->monmap = monmap;
ceph_osdmap_destroy(osdc->osdmap);
osdc->osdmap = newmap;
under client->monc.mutex and client->osdc.lock respectively, but
because neither is taken in have_mon_and_osd_map() it's possible for
client->monc.monmap->epoch and client->osdc.osdmap->epoch arms in
client->monc.monmap && client->monc.monmap->epoch &&
client->osdc.osdmap && client->osdc.osdmap->epoch;
condition to dereference an already freed map. This happens to be
reproducible with generic/395 and generic/397 with KASAN enabled:
BUG: KASAN: slab-use-after-free in have_mon_and_osd_map+0x56/0x70
Read of size 4 at addr ffff88811012d810 by task mount.ceph/13305
CPU: 2 UID: 0 PID: 13305 Comm: mount.ceph Not tainted 6.14.0-rc2-build2+ #1266
...
Call Trace:
<TASK>
have_mon_and_osd_map+0x56/0x70
ceph_open_session+0x182/0x290
ceph_get_tree+0x333/0x680
vfs_get_tree+0x49/0x180
do_new_mount+0x1a3/0x2d0
path_mount+0x6dd/0x730
do_mount+0x99/0xe0
__do_sys_mount+0x141/0x180
do_syscall_64+0x9f/0x100
entry_SYSCALL_64_after_hwframe+0x76/0x7e
</TASK>
Allocated by task 13305:
ceph_osdmap_alloc+0x16/0x130
ceph_osdc_init+0x27a/0x4c0
ceph_create_client+0x153/0x190
create_fs_client+0x50/0x2a0
ceph_get_tree+0xff/0x680
vfs_get_tree+0x49/0x180
do_new_mount+0x1a3/0x2d0
path_mount+0x6dd/0x730
do_mount+0x99/0xe0
__do_sys_mount+0x141/0x180
do_syscall_64+0x9f/0x100
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Freed by task 9475:
kfree+0x212/0x290
handle_one_map+0x23c/0x3b0
ceph_osdc_handle_map+0x3c9/0x590
mon_dispatch+0x655/0x6f0
ceph_con_process_message+0xc3/0xe0
ceph_con_v1_try_read+0x614/0x760
ceph_con_workfn+0x2de/0x650
process_one_work+0x486/0x7c0
process_scheduled_works+0x73/0x90
worker_thread+0x1c8/0x2a0
kthread+0x2ec/0x300
ret_from_fork+0x24/0x40
ret_from_fork_asm+0x1a/0x30
Rewrite the wait loop to check the above condition directly with
client->monc.mutex and client->osdc.lock taken as appropriate. While
at it, improve the timeout handling (previously mount_timeout could be
exceeded in case wait_event_interruptible_timeout() slept more than
once) and access client->auth_err under client->monc.mutex to match
how it's set in finish_auth().
monmap_show() and osdmap_show() now take the respective lock before
accessing the map as well.
Cc: stable@vger.kernel.org
Reported-by: David Howells <dhowells@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
mptcp_do_fastclose().
commit f07f4ea53e22429c84b20832fa098b5ecc0d4e35 upstream.
syzbot reported divide-by-zero in __tcp_select_window() by
MPTCP socket. [0]
We had a similar issue for the bare TCP and fixed in commit
499350a5a6e7 ("tcp: initialize rcv_mss to TCP_MIN_MSS instead
of 0").
Let's apply the same fix to mptcp_do_fastclose().
[0]:
Oops: divide error: 0000 [#1] SMP KASAN PTI
CPU: 0 UID: 0 PID: 6068 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/25/2025
RIP: 0010:__tcp_select_window+0x824/0x1320 net/ipv4/tcp_output.c:3336
Code: ff ff ff 44 89 f1 d3 e0 89 c1 f7 d1 41 01 cc 41 21 c4 e9 a9 00 00 00 e8 ca 49 01 f8 e9 9c 00 00 00 e8 c0 49 01 f8 44 89 e0 99 <f7> 7c 24 1c 41 29 d4 48 bb 00 00 00 00 00 fc ff df e9 80 00 00 00
RSP: 0018:ffffc90003017640 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88807b469e40
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffffc90003017730 R08: ffff888033268143 R09: 1ffff1100664d028
R10: dffffc0000000000 R11: ffffed100664d029 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS: 000055557faa0500(0000) GS:ffff888126135000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f64a1912ff8 CR3: 0000000072122000 CR4: 00000000003526f0
Call Trace:
<TASK>
tcp_select_window net/ipv4/tcp_output.c:281 [inline]
__tcp_transmit_skb+0xbc7/0x3aa0 net/ipv4/tcp_output.c:1568
tcp_transmit_skb net/ipv4/tcp_output.c:1649 [inline]
tcp_send_active_reset+0x2d1/0x5b0 net/ipv4/tcp_output.c:3836
mptcp_do_fastclose+0x27e/0x380 net/mptcp/protocol.c:2793
mptcp_disconnect+0x238/0x710 net/mptcp/protocol.c:3253
mptcp_sendmsg_fastopen+0x2f8/0x580 net/mptcp/protocol.c:1776
mptcp_sendmsg+0x1774/0x1980 net/mptcp/protocol.c:1855
sock_sendmsg_nosec net/socket.c:727 [inline]
__sock_sendmsg+0xe5/0x270 net/socket.c:742
__sys_sendto+0x3bd/0x520 net/socket.c:2244
__do_sys_sendto net/socket.c:2251 [inline]
__se_sys_sendto net/socket.c:2247 [inline]
__x64_sys_sendto+0xde/0x100 net/socket.c:2247
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xfa/0xfa0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f66e998f749
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffff9acedb8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 00007f66e9be5fa0 RCX: 00007f66e998f749
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003
RBP: 00007ffff9acee10 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
R13: 00007f66e9be5fa0 R14: 00007f66e9be5fa0 R15: 0000000000000006
</TASK>
Fixes: ae155060247b ("mptcp: fix duplicate reset on fastclose")
Reported-by: syzbot+3a92d359bc2ec6255a33@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/69260882.a70a0220.d98e3.00b4.GAE@google.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20251125195331.309558-1-kuniyu@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 27fd02860164bfa78cec2640dfad630d832e302c upstream.
When __mptcp_retrans() kicks-in, it schedules one or more subflows for
retransmission, but such subflows could be actually left alone if there
is no more data to retransmit and/or in case of concurrent fallback.
Scheduled subflows could be processed much later in time, i.e. when new
data will be transmitted, leading to bad subflow selection.
Explicitly clear all scheduled subflows before leaving the
retransmission function.
Fixes: ee2708aedad0 ("mptcp: use get_retrans wrapper")
Cc: stable@vger.kernel.org
Reported-by: Filip Pokryvka <fpokryvk@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20251125-net-mptcp-clear-sched-rtx-v1-1-1cea4ad2165f@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 43962db4a6f593903340c85591056a0cef812dfd upstream.
The crash in process_v2_sparse_read() for fscrypt-encrypted directories
has been reported. Issue takes place for Ceph msgr2 protocol in secure
mode. It can be reproduced by the steps:
sudo mount -t ceph :/ /mnt/cephfs/ -o name=admin,fs=cephfs,ms_mode=secure
(1) mkdir /mnt/cephfs/fscrypt-test-3
(2) cp area_decrypted.tar /mnt/cephfs/fscrypt-test-3
(3) fscrypt encrypt --source=raw_key --key=./my.key /mnt/cephfs/fscrypt-test-3
(4) fscrypt lock /mnt/cephfs/fscrypt-test-3
(5) fscrypt unlock --key=my.key /mnt/cephfs/fscrypt-test-3
(6) cat /mnt/cephfs/fscrypt-test-3/area_decrypted.tar
(7) Issue has been triggered
[ 408.072247] ------------[ cut here ]------------
[ 408.072251] WARNING: CPU: 1 PID: 392 at net/ceph/messenger_v2.c:865
ceph_con_v2_try_read+0x4b39/0x72f0
[ 408.072267] Modules linked in: intel_rapl_msr intel_rapl_common
intel_uncore_frequency_common intel_pmc_core pmt_telemetry pmt_discovery
pmt_class intel_pmc_ssram_telemetry intel_vsec kvm_intel joydev kvm irqbypass
polyval_clmulni ghash_clmulni_intel aesni_intel rapl input_leds psmouse
serio_raw i2c_piix4 vga16fb bochs vgastate i2c_smbus floppy mac_hid qemu_fw_cfg
pata_acpi sch_fq_codel rbd msr parport_pc ppdev lp parport efi_pstore
[ 408.072304] CPU: 1 UID: 0 PID: 392 Comm: kworker/1:3 Not tainted 6.17.0-rc7+
[ 408.072307] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.17.0-5.fc42 04/01/2014
[ 408.072310] Workqueue: ceph-msgr ceph_con_workfn
[ 408.072314] RIP: 0010:ceph_con_v2_try_read+0x4b39/0x72f0
[ 408.072317] Code: c7 c1 20 f0 d4 ae 50 31 d2 48 c7 c6 60 27 d5 ae 48 c7 c7 f8
8e 6f b0 68 60 38 d5 ae e8 00 47 61 fe 48 83 c4 18 e9 ac fc ff ff <0f> 0b e9 06
fe ff ff 4c 8b 9d 98 fd ff ff 0f 84 64 e7 ff ff 89 85
[ 408.072319] RSP: 0018:ffff88811c3e7a30 EFLAGS: 00010246
[ 408.072322] RAX: ffffed1024874c6f RBX: ffffea00042c2b40 RCX: 0000000000000f38
[ 408.072324] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 408.072325] RBP: ffff88811c3e7ca8 R08: 0000000000000000 R09: 00000000000000c8
[ 408.072326] R10: 00000000000000c8 R11: 0000000000000000 R12: 00000000000000c8
[ 408.072327] R13: dffffc0000000000 R14: ffff8881243a6030 R15: 0000000000003000
[ 408.072329] FS: 0000000000000000(0000) GS:ffff88823eadf000(0000)
knlGS:0000000000000000
[ 408.072331] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 408.072332] CR2: 000000c0003c6000 CR3: 000000010c106005 CR4: 0000000000772ef0
[ 408.072336] PKRU: 55555554
[ 408.072337] Call Trace:
[ 408.072338] <TASK>
[ 408.072340] ? sched_clock_noinstr+0x9/0x10
[ 408.072344] ? __pfx_ceph_con_v2_try_read+0x10/0x10
[ 408.072347] ? _raw_spin_unlock+0xe/0x40
[ 408.072349] ? finish_task_switch.isra.0+0x15d/0x830
[ 408.072353] ? __kasan_check_write+0x14/0x30
[ 408.072357] ? mutex_lock+0x84/0xe0
[ 408.072359] ? __pfx_mutex_lock+0x10/0x10
[ 408.072361] ceph_con_workfn+0x27e/0x10e0
[ 408.072364] ? metric_delayed_work+0x311/0x2c50
[ 408.072367] process_one_work+0x611/0xe20
[ 408.072371] ? __kasan_check_write+0x14/0x30
[ 408.072373] worker_thread+0x7e3/0x1580
[ 408.072375] ? __pfx__raw_spin_lock_irqsave+0x10/0x10
[ 408.072378] ? __pfx_worker_thread+0x10/0x10
[ 408.072381] kthread+0x381/0x7a0
[ 408.072383] ? __pfx__raw_spin_lock_irq+0x10/0x10
[ 408.072385] ? __pfx_kthread+0x10/0x10
[ 408.072387] ? __kasan_check_write+0x14/0x30
[ 408.072389] ? recalc_sigpending+0x160/0x220
[ 408.072392] ? _raw_spin_unlock_irq+0xe/0x50
[ 408.072394] ? calculate_sigpending+0x78/0xb0
[ 408.072395] ? __pfx_kthread+0x10/0x10
[ 408.072397] ret_from_fork+0x2b6/0x380
[ 408.072400] ? __pfx_kthread+0x10/0x10
[ 408.072402] ret_from_fork_asm+0x1a/0x30
[ 408.072406] </TASK>
[ 408.072407] ---[ end trace 0000000000000000 ]---
[ 408.072418] Oops: general protection fault, probably for non-canonical
address 0xdffffc0000000000: 0000 [#1] SMP KASAN NOPTI
[ 408.072984] KASAN: null-ptr-deref in range [0x0000000000000000-
0x0000000000000007]
[ 408.073350] CPU: 1 UID: 0 PID: 392 Comm: kworker/1:3 Tainted: G W
6.17.0-rc7+ #1 PREEMPT(voluntary)
[ 408.073886] Tainted: [W]=WARN
[ 408.074042] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.17.0-5.fc42 04/01/2014
[ 408.074468] Workqueue: ceph-msgr ceph_con_workfn
[ 408.074694] RIP: 0010:ceph_msg_data_advance+0x79/0x1a80
[ 408.074976] Code: fc ff df 49 8d 77 08 48 c1 ee 03 80 3c 16 00 0f 85 07 11 00
00 48 ba 00 00 00 00 00 fc ff df 49 8b 5f 08 48 89 de 48 c1 ee 03 <0f> b6 14 16
84 d2 74 09 80 fa 03 0f 8e 0f 0e 00 00 8b 13 83 fa 03
[ 408.075884] RSP: 0018:ffff88811c3e7990 EFLAGS: 00010246
[ 408.076305] RAX: ffff8881243a6388 RBX: 0000000000000000 RCX: 0000000000000000
[ 408.076909] RDX: dffffc0000000000 RSI: 0000000000000000 RDI: ffff8881243a6378
[ 408.077466] RBP: ffff88811c3e7a20 R08: 0000000000000000 R09: 00000000000000c8
[ 408.078034] R10: ffff8881243a6388 R11: 0000000000000000 R12: ffffed1024874c71
[ 408.078575] R13: dffffc0000000000 R14: ffff8881243a6030 R15: ffff8881243a6378
[ 408.079159] FS: 0000000000000000(0000) GS:ffff88823eadf000(0000)
knlGS:0000000000000000
[ 408.079736] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 408.080039] CR2: 000000c0003c6000 CR3: 000000010c106005 CR4: 0000000000772ef0
[ 408.080376] PKRU: 55555554
[ 408.080513] Call Trace:
[ 408.080630] <TASK>
[ 408.080729] ceph_con_v2_try_read+0x49b9/0x72f0
[ 408.081115] ? __pfx_ceph_con_v2_try_read+0x10/0x10
[ 408.081348] ? _raw_spin_unlock+0xe/0x40
[ 408.081538] ? finish_task_switch.isra.0+0x15d/0x830
[ 408.081768] ? __kasan_check_write+0x14/0x30
[ 408.081986] ? mutex_lock+0x84/0xe0
[ 408.082160] ? __pfx_mutex_lock+0x10/0x10
[ 408.082343] ceph_con_workfn+0x27e/0x10e0
[ 408.082529] ? metric_delayed_work+0x311/0x2c50
[ 408.082737] process_one_work+0x611/0xe20
[ 408.082948] ? __kasan_check_write+0x14/0x30
[ 408.083156] worker_thread+0x7e3/0x1580
[ 408.083331] ? __pfx__raw_spin_lock_irqsave+0x10/0x10
[ 408.083557] ? __pfx_worker_thread+0x10/0x10
[ 408.083751] kthread+0x381/0x7a0
[ 408.083922] ? __pfx__raw_spin_lock_irq+0x10/0x10
[ 408.084139] ? __pfx_kthread+0x10/0x10
[ 408.084310] ? __kasan_check_write+0x14/0x30
[ 408.084510] ? recalc_sigpending+0x160/0x220
[ 408.084708] ? _raw_spin_unlock_irq+0xe/0x50
[ 408.084917] ? calculate_sigpending+0x78/0xb0
[ 408.085138] ? __pfx_kthread+0x10/0x10
[ 408.085335] ret_from_fork+0x2b6/0x380
[ 408.085525] ? __pfx_kthread+0x10/0x10
[ 408.085720] ret_from_fork_asm+0x1a/0x30
[ 408.085922] </TASK>
[ 408.086036] Modules linked in: intel_rapl_msr intel_rapl_common
intel_uncore_frequency_common intel_pmc_core pmt_telemetry pmt_discovery
pmt_class intel_pmc_ssram_telemetry intel_vsec kvm_intel joydev kvm irqbypass
polyval_clmulni ghash_clmulni_intel aesni_intel rapl input_leds psmouse
serio_raw i2c_piix4 vga16fb bochs vgastate i2c_smbus floppy mac_hid qemu_fw_cfg
pata_acpi sch_fq_codel rbd msr parport_pc ppdev lp parport efi_pstore
[ 408.087778] ---[ end trace 0000000000000000 ]---
[ 408.088007] RIP: 0010:ceph_msg_data_advance+0x79/0x1a80
[ 408.088260] Code: fc ff df 49 8d 77 08 48 c1 ee 03 80 3c 16 00 0f 85 07 11 00
00 48 ba 00 00 00 00 00 fc ff df 49 8b 5f 08 48 89 de 48 c1 ee 03 <0f> b6 14 16
84 d2 74 09 80 fa 03 0f 8e 0f 0e 00 00 8b 13 83 fa 03
[ 408.089118] RSP: 0018:ffff88811c3e7990 EFLAGS: 00010246
[ 408.089357] RAX: ffff8881243a6388 RBX: 0000000000000000 RCX: 0000000000000000
[ 408.089678] RDX: dffffc0000000000 RSI: 0000000000000000 RDI: ffff8881243a6378
[ 408.090020] RBP: ffff88811c3e7a20 R08: 0000000000000000 R09: 00000000000000c8
[ 408.090360] R10: ffff8881243a6388 R11: 0000000000000000 R12: ffffed1024874c71
[ 408.090687] R13: dffffc0000000000 R14: ffff8881243a6030 R15: ffff8881243a6378
[ 408.091035] FS: 0000000000000000(0000) GS:ffff88823eadf000(0000)
knlGS:0000000000000000
[ 408.091452] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 408.092015] CR2: 000000c0003c6000 CR3: 000000010c106005 CR4: 0000000000772ef0
[ 408.092530] PKRU: 55555554
[ 417.112915]
==================================================================
[ 417.113491] BUG: KASAN: slab-use-after-free in
__mutex_lock.constprop.0+0x1522/0x1610
[ 417.114014] Read of size 4 at addr ffff888124870034 by task kworker/2:0/4951
[ 417.114587] CPU: 2 UID: 0 PID: 4951 Comm: kworker/2:0 Tainted: G D W
6.17.0-rc7+ #1 PREEMPT(voluntary)
[ 417.114592] Tainted: [D]=DIE, [W]=WARN
[ 417.114593] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.17.0-5.fc42 04/01/2014
[ 417.114596] Workqueue: events handle_timeout
[ 417.114601] Call Trace:
[ 417.114602] <TASK>
[ 417.114604] dump_stack_lvl+0x5c/0x90
[ 417.114610] print_report+0x171/0x4dc
[ 417.114613] ? __pfx__raw_spin_lock_irqsave+0x10/0x10
[ 417.114617] ? kasan_complete_mode_report_info+0x80/0x220
[ 417.114621] kasan_report+0xbd/0x100
[ 417.114625] ? __mutex_lock.constprop.0+0x1522/0x1610
[ 417.114628] ? __mutex_lock.constprop.0+0x1522/0x1610
[ 417.114630] __asan_report_load4_noabort+0x14/0x30
[ 417.114633] __mutex_lock.constprop.0+0x1522/0x1610
[ 417.114635] ? queue_con_delay+0x8d/0x200
[ 417.114638] ? __pfx___mutex_lock.constprop.0+0x10/0x10
[ 417.114641] ? __send_subscribe+0x529/0xb20
[ 417.114644] __mutex_lock_slowpath+0x13/0x20
[ 417.114646] mutex_lock+0xd4/0xe0
[ 417.114649] ? __pfx_mutex_lock+0x10/0x10
[ 417.114652] ? ceph_monc_renew_subs+0x2a/0x40
[ 417.114654] ceph_con_keepalive+0x22/0x110
[ 417.114656] handle_timeout+0x6b3/0x11d0
[ 417.114659] ? _raw_spin_unlock_irq+0xe/0x50
[ 417.114662] ? __pfx_handle_timeout+0x10/0x10
[ 417.114664] ? queue_delayed_work_on+0x8e/0xa0
[ 417.114669] process_one_work+0x611/0xe20
[ 417.114672] ? __kasan_check_write+0x14/0x30
[ 417.114676] worker_thread+0x7e3/0x1580
[ 417.114678] ? __pfx__raw_spin_lock_irqsave+0x10/0x10
[ 417.114682] ? __pfx_sched_setscheduler_nocheck+0x10/0x10
[ 417.114687] ? __pfx_worker_thread+0x10/0x10
[ 417.114689] kthread+0x381/0x7a0
[ 417.114692] ? __pfx__raw_spin_lock_irq+0x10/0x10
[ 417.114694] ? __pfx_kthread+0x10/0x10
[ 417.114697] ? __kasan_check_write+0x14/0x30
[ 417.114699] ? recalc_sigpending+0x160/0x220
[ 417.114703] ? _raw_spin_unlock_irq+0xe/0x50
[ 417.114705] ? calculate_sigpending+0x78/0xb0
[ 417.114707] ? __pfx_kthread+0x10/0x10
[ 417.114710] ret_from_fork+0x2b6/0x380
[ 417.114713] ? __pfx_kthread+0x10/0x10
[ 417.114715] ret_from_fork_asm+0x1a/0x30
[ 417.114720] </TASK>
[ 417.125171] Allocated by task 2:
[ 417.125333] kasan_save_stack+0x26/0x60
[ 417.125522] kasan_save_track+0x14/0x40
[ 417.125742] kasan_save_alloc_info+0x39/0x60
[ 417.125945] __kasan_slab_alloc+0x8b/0xb0
[ 417.126133] kmem_cache_alloc_node_noprof+0x13b/0x460
[ 417.126381] copy_process+0x320/0x6250
[ 417.126595] kernel_clone+0xb7/0x840
[ 417.126792] kernel_thread+0xd6/0x120
[ 417.126995] kthreadd+0x85c/0xbe0
[ 417.127176] ret_from_fork+0x2b6/0x380
[ 417.127378] ret_from_fork_asm+0x1a/0x30
[ 417.127692] Freed by task 0:
[ 417.127851] kasan_save_stack+0x26/0x60
[ 417.128057] kasan_save_track+0x14/0x40
[ 417.128267] kasan_save_free_info+0x3b/0x60
[ 417.128491] __kasan_slab_free+0x6c/0xa0
[ 417.128708] kmem_cache_free+0x182/0x550
[ 417.128906] free_task+0xeb/0x140
[ 417.129070] __put_task_struct+0x1d2/0x4f0
[ 417.129259] __put_task_struct_rcu_cb+0x15/0x20
[ 417.129480] rcu_do_batch+0x3d3/0xe70
[ 417.129681] rcu_core+0x549/0xb30
[ 417.129839] rcu_core_si+0xe/0x20
[ 417.130005] handle_softirqs+0x160/0x570
[ 417.130190] __irq_exit_rcu+0x189/0x1e0
[ 417.130369] irq_exit_rcu+0xe/0x20
[ 417.130531] sysvec_apic_timer_interrupt+0x9f/0xd0
[ 417.130768] asm_sysvec_apic_timer_interrupt+0x1b/0x20
[ 417.131082] Last potentially related work creation:
[ 417.131305] kasan_save_stack+0x26/0x60
[ 417.131484] kasan_record_aux_stack+0xae/0xd0
[ 417.131695] __call_rcu_common+0xcd/0x14b0
[ 417.131909] call_rcu+0x31/0x50
[ 417.132071] delayed_put_task_struct+0x128/0x190
[ 417.132295] rcu_do_batch+0x3d3/0xe70
[ 417.132478] rcu_core+0x549/0xb30
[ 417.132658] rcu_core_si+0xe/0x20
[ 417.132808] handle_softirqs+0x160/0x570
[ 417.132993] __irq_exit_rcu+0x189/0x1e0
[ 417.133181] irq_exit_rcu+0xe/0x20
[ 417.133353] sysvec_apic_timer_interrupt+0x9f/0xd0
[ 417.133584] asm_sysvec_apic_timer_interrupt+0x1b/0x20
[ 417.133921] Second to last potentially related work creation:
[ 417.134183] kasan_save_stack+0x26/0x60
[ 417.134362] kasan_record_aux_stack+0xae/0xd0
[ 417.134566] __call_rcu_common+0xcd/0x14b0
[ 417.134782] call_rcu+0x31/0x50
[ 417.134929] put_task_struct_rcu_user+0x58/0xb0
[ 417.135143] finish_task_switch.isra.0+0x5d3/0x830
[ 417.135366] __schedule+0xd30/0x5100
[ 417.135534] schedule_idle+0x5a/0x90
[ 417.135712] do_idle+0x25f/0x410
[ 417.135871] cpu_startup_entry+0x53/0x70
[ 417.136053] start_secondary+0x216/0x2c0
[ 417.136233] common_startup_64+0x13e/0x141
[ 417.136894] The buggy address belongs to the object at ffff888124870000
which belongs to the cache task_struct of size 10504
[ 417.138122] The buggy address is located 52 bytes inside of
freed 10504-byte region [ffff888124870000, ffff888124872908)
[ 417.139465] The buggy address belongs to the physical page:
[ 417.140016] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0
pfn:0x124870
[ 417.140789] head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0
pincount:0
[ 417.141519] memcg:ffff88811aa20e01
[ 417.141874] anon flags:
0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff)
[ 417.142600] page_type: f5(slab)
[ 417.142922] raw: 0017ffffc0000040 ffff88810094f040 0000000000000000
dead000000000001
[ 417.143554] raw: 0000000000000000 0000000000030003 00000000f5000000
ffff88811aa20e01
[ 417.143954] head: 0017ffffc0000040 ffff88810094f040 0000000000000000
dead000000000001
[ 417.144329] head: 0000000000000000 0000000000030003 00000000f5000000
ffff88811aa20e01
[ 417.144710] head: 0017ffffc0000003 ffffea0004921c01 00000000ffffffff
00000000ffffffff
[ 417.145106] head: ffffffffffffffff 0000000000000000 00000000ffffffff
0000000000000008
[ 417.145485] page dumped because: kasan: bad access detected
[ 417.145859] Memory state around the buggy address:
[ 417.146094] ffff88812486ff00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
fc
[ 417.146439] ffff88812486ff80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
fc
[ 417.146791] >ffff888124870000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb
fb
[ 417.147145] ^
[ 417.147387] ffff888124870080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
fb
[ 417.147751] ffff888124870100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
fb
[ 417.148123]
==================================================================
First of all, we have warning in get_bvec_at() because
cursor->total_resid contains zero value. And, finally,
we have crash in ceph_msg_data_advance() because
cursor->data is NULL. It means that get_bvec_at()
receives not initialized ceph_msg_data_cursor structure
because data is NULL and total_resid contains zero.
Moreover, we don't have likewise issue for the case of
Ceph msgr1 protocol because ceph_msg_data_cursor_init()
has been called before reading sparse data.
This patch adds calling of ceph_msg_data_cursor_init()
in the beginning of process_v2_sparse_read() with
the goal to guarantee that logic of reading sparse data
works correctly for the case of Ceph msgr2 protocol.
Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/73152
Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit b3e528a5811bbc8246dbdb962f0812dc9b721681 ]
On transmit, we are currently relying on skb->dev being set by
mctp_local_output() when we first set up the skb destination fields.
However, forwarded skbs do not use the local_output path, so will retain
their incoming netdev as their ->dev on tx. This does not work when
we're forwarding between interfaces.
Set skb->dev unconditionally in the transmit path, to allow for proper
forwarding.
We keep the skb->dev initialisation in mctp_local_output(), as we use it
for fragmentation.
Fixes: 269936db5eb3 ("net: mctp: separate routing database from routing operations")
Suggested-by: Vince Chang <vince_chang@aspeedtech.com>
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://patch.msgid.link/20251125-dev-forward-v1-1-54ecffcd0616@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 0ebc27a4c67d44e5ce88d21cdad8201862b78837 ]
Since commit 30f241fcf52a ("xsk: Fix immature cq descriptor
production"), the descriptor number is stored in skb control block and
xsk_cq_submit_addr_locked() relies on it to put the umem addrs onto
pool's completion queue.
skb control block shouldn't be used for this purpose as after transmit
xsk doesn't have control over it and other subsystems could use it. This
leads to the following kernel panic due to a NULL pointer dereference.
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: Oops: 0000 [#1] SMP NOPTI
CPU: 2 UID: 1 PID: 927 Comm: p4xsk.bin Not tainted 6.16.12+deb14-cloud-amd64 #1 PREEMPT(lazy) Debian 6.16.12-1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-debian-1.17.0-1 04/01/2014
RIP: 0010:xsk_destruct_skb+0xd0/0x180
[...]
Call Trace:
<IRQ>
? napi_complete_done+0x7a/0x1a0
ip_rcv_core+0x1bb/0x340
ip_rcv+0x30/0x1f0
__netif_receive_skb_one_core+0x85/0xa0
process_backlog+0x87/0x130
__napi_poll+0x28/0x180
net_rx_action+0x339/0x420
handle_softirqs+0xdc/0x320
? handle_edge_irq+0x90/0x1e0
do_softirq.part.0+0x3b/0x60
</IRQ>
<TASK>
__local_bh_enable_ip+0x60/0x70
__dev_direct_xmit+0x14e/0x1f0
__xsk_generic_xmit+0x482/0xb70
? __remove_hrtimer+0x41/0xa0
? __xsk_generic_xmit+0x51/0xb70
? _raw_spin_unlock_irqrestore+0xe/0x40
xsk_sendmsg+0xda/0x1c0
__sys_sendto+0x1ee/0x200
__x64_sys_sendto+0x24/0x30
do_syscall_64+0x84/0x2f0
? __pfx_pollwake+0x10/0x10
? __rseq_handle_notify_resume+0xad/0x4c0
? restore_fpregs_from_fpstate+0x3c/0x90
? switch_fpu_return+0x5b/0xe0
? do_syscall_64+0x204/0x2f0
? do_syscall_64+0x204/0x2f0
? do_syscall_64+0x204/0x2f0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
</TASK>
[...]
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: 0x1c000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
Instead use the skb destructor_arg pointer along with pointer tagging.
As pointers are always aligned to 8B, use the bottom bit to indicate
whether this a single address or an allocated struct containing several
addresses.
Fixes: 30f241fcf52a ("xsk: Fix immature cq descriptor production")
Closes: https://lore.kernel.org/netdev/0435b904-f44f-48f8-afb0-68868474bf1c@nop.hu/
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20251124171409.3845-1-fmancera@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit c30d084960cf316c95fbf145d39974ce1ff7889c ]
We are unnecessarily setting a bunch of skb fields per each processed
descriptor, which is redundant for fragmented frames.
Let us set these respective members for first fragment only. To address
both paths that we have within xsk_build_skb(), move assignments onto
xsk_set_destructor_arg() and rename it to xsk_skb_init_misc().
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20250925160009.2474816-2-maciej.fijalkowski@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Stable-dep-of: 0ebc27a4c67d ("xsk: avoid data corruption on cq descriptor number")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 545d7827b2cd5de5eb85580cebeda6b35b3ff443 ]
The change eed467b517e8 ("Bluetooth: fix passkey uninitialized when used")
introduced a goto that bypasses the creation of temporary mackey and ltk
which are later used by the likes of DHKey Check step.
Later ffee202a78c2 ("Bluetooth: Always request for user confirmation for
Just Works (LE SC)") which means confirm_hint is always set in case
JUST_WORKS so the branch checking for an existing LTK becomes pointless
as confirm_hint will always be set, so this just merge both cases of
malicious or legitimate devices to be confirmed before continuing with the
pairing procedure.
Link: https://github.com/bluez/bluez/issues/1622
Fixes: eed467b517e8 ("Bluetooth: fix passkey uninitialized when used")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 79a2d4678ba90bdba577dc3af88cc900d6dcd5ee ]
The hdev lock/lookup/unlock/use pattern in the packet RX path doesn't
ensure hci_conn* is not concurrently modified/deleted. This locking
appears to be leftover from before conn_hash started using RCU
commit bf4c63252490b ("Bluetooth: convert conn hash to RCU")
and not clear if it had purpose since then.
Currently, there are code paths that delete hci_conn* from elsewhere
than the ordered hdev->workqueue where the RX work runs in. E.g.
commit 5af1f84ed13a ("Bluetooth: hci_sync: Fix UAF on hci_abort_conn_sync")
introduced some of these, and there probably were a few others before
it. It's better to do the locking so that even if these run
concurrently no UAF is possible.
Move the lookup of hci_conn and associated socket-specific conn to
protocol recv handlers, and do them within a single critical section
to cover hci_conn* usage and lookup.
syzkaller has reported a crash that appears to be this issue:
[Task hdev->workqueue] [Task 2]
hci_disconnect_all_sync
l2cap_recv_acldata(hcon)
hci_conn_get(hcon)
hci_abort_conn_sync(hcon)
hci_dev_lock
hci_dev_lock
hci_conn_del(hcon)
v-------------------------------- hci_dev_unlock
hci_conn_put(hcon)
conn = hcon->l2cap_data (UAF)
Fixes: 5af1f84ed13a ("Bluetooth: hci_sync: Fix UAF on hci_abort_conn_sync")
Reported-by: syzbot+d32d77220b92eddd89ad@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=d32d77220b92eddd89ad
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 89bb613511cc21ed5ba6bddc1c9b9ae9c0dad392 ]
There is a potential race condition between sock bind and socket write
iter. bind may free the same cmd via mgmt_pending before write iter sends
the cmd, just as syzbot reported in UAF[1].
Here we use hci_dev_lock to synchronize the two, thereby avoiding the
UAF mentioned in [1].
[1]
syzbot reported:
BUG: KASAN: slab-use-after-free in mgmt_pending_remove+0x3b/0x210 net/bluetooth/mgmt_util.c:316
Read of size 8 at addr ffff888077164818 by task syz.0.17/5989
Call Trace:
mgmt_pending_remove+0x3b/0x210 net/bluetooth/mgmt_util.c:316
set_link_security+0x5c2/0x710 net/bluetooth/mgmt.c:1918
hci_mgmt_cmd+0x9c9/0xef0 net/bluetooth/hci_sock.c:1719
hci_sock_sendmsg+0x6ca/0xef0 net/bluetooth/hci_sock.c:1839
sock_sendmsg_nosec net/socket.c:727 [inline]
__sock_sendmsg+0x21c/0x270 net/socket.c:742
sock_write_iter+0x279/0x360 net/socket.c:1195
Allocated by task 5989:
mgmt_pending_add+0x35/0x140 net/bluetooth/mgmt_util.c:296
set_link_security+0x557/0x710 net/bluetooth/mgmt.c:1910
hci_mgmt_cmd+0x9c9/0xef0 net/bluetooth/hci_sock.c:1719
hci_sock_sendmsg+0x6ca/0xef0 net/bluetooth/hci_sock.c:1839
sock_sendmsg_nosec net/socket.c:727 [inline]
__sock_sendmsg+0x21c/0x270 net/socket.c:742
sock_write_iter+0x279/0x360 net/socket.c:1195
Freed by task 5991:
mgmt_pending_free net/bluetooth/mgmt_util.c:311 [inline]
mgmt_pending_foreach+0x30d/0x380 net/bluetooth/mgmt_util.c:257
mgmt_index_removed+0x112/0x2f0 net/bluetooth/mgmt.c:9477
hci_sock_bind+0xbe9/0x1000 net/bluetooth/hci_sock.c:1314
Fixes: 6fe26f694c82 ("Bluetooth: MGMT: Protect mgmt_pending list with its own lock")
Reported-by: syzbot+9aa47cd4633a3cf92a80@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=9aa47cd4633a3cf92a80
Tested-by: syzbot+9aa47cd4633a3cf92a80@syzkaller.appspotmail.com
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 275ddfeb3fdc274050c2173ffd985b1e80a9aa37 ]
HCI_OP_NOP means no command was actually sent so there is no point in
triggering cmd_timer which may cause a hdev->reset in the process since
it is assumed that the controller is stuck processing a command.
Fixes: e2d471b7806b ("Bluetooth: ISO: Fix not using SID from adv report")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 92e239e36d600002559074994a545fcfac9afd2d ]
Fix inverted WARN_ON_ONCE condition that prevented normal address
removal counter updates. The current code only executes decrement
logic when the counter is already 0 (abnormal state), while
normal removals (counter > 0) are ignored.
Signed-off-by: Gang Yan <yangang@kylinos.cn>
Fixes: 636113918508 ("mptcp: pm: remove '_nl' from mptcp_pm_nl_rm_addr_received")
Cc: stable@vger.kernel.org
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20251118-net-mptcp-misc-fixes-6-18-rc6-v1-10-806d3781c95f@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
[ Context ]
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit c15d5c62ab313c19121f10e25d4fec852bd1c40c ]
When a netdev issues a RX async resync request for a TLS connection,
the TLS module handles it by logging record headers and attempting to
match them to the tcp_sn provided by the device. If a match is found,
the TLS module approves the tcp_sn for resynchronization.
While waiting for a device response, the TLS module also increments
rcd_delta each time a new TLS record is received, tracking the distance
from the original resync request.
However, if the device response is delayed or fails (e.g due to
unstable connection and device getting out of tracking, hardware
errors, resource exhaustion etc.), the TLS module keeps logging and
incrementing, which can lead to a WARN() when rcd_delta exceeds the
threshold.
To address this, introduce tls_offload_rx_resync_async_request_cancel()
to explicitly cancel resync requests when a device response failure is
detected. Call this helper also as a final safeguard when rcd_delta
crosses its threshold, as reaching this point implies that earlier
cancellation did not occur.
Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1761508983-937977-3-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 002541ef650b742a198e4be363881439bb9d86b4 ]
During connect(), acting on a signal/timeout by disconnecting an already
established socket leads to several issues:
1. connect() invoking vsock_transport_cancel_pkt() ->
virtio_transport_purge_skbs() may race with sendmsg() invoking
virtio_transport_get_credit(). This results in a permanently elevated
`vvs->bytes_unsent`. Which, in turn, confuses the SOCK_LINGER handling.
2. connect() resetting a connected socket's state may race with socket
being placed in a sockmap. A disconnected socket remaining in a sockmap
breaks sockmap's assumptions. And gives rise to WARNs.
3. connect() transitioning SS_CONNECTED -> SS_UNCONNECTED allows for a
transport change/drop after TCP_ESTABLISHED. Which poses a problem for
any simultaneous sendmsg() or connect() and may result in a
use-after-free/null-ptr-deref.
Do not disconnect socket on signal/timeout. Keep the logic for unconnected
sockets: they don't linger, can't be placed in a sockmap, are rejected by
sendmsg().
[1]: https://lore.kernel.org/netdev/e07fd95c-9a38-4eea-9638-133e38c2ec9b@rbox.co/
[2]: https://lore.kernel.org/netdev/20250317-vsock-trans-signal-race-v4-0-fc8837f3f1d4@rbox.co/
[3]: https://lore.kernel.org/netdev/60f1b7db-3099-4f6a-875e-af9f6ef194f6@rbox.co/
Fixes: d021c344051a ("VSOCK: Introduce VM Sockets")
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://patch.msgid.link/20251119-vsock-interrupted-connect-v2-1-70734cf1233f@rbox.co
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
unix_stream_read_generic().
[ Upstream commit 7bf3a476ce43833c49fceddbe94ff3472e04e9bc ]
Miao Wang reported a bug of SO_PEEK_OFF on AF_UNIX SOCK_STREAM
socket.
The unexpected behaviour is triggered when the peek offset is
larger than the recv queue and the thread is unblocked by new
data.
Let's assume a socket which has "aaaa" in the recv queue and
the peek offset is 4.
First, unix_stream_read_generic() reads the offset 4 and skips
the skb(s) of "aaaa" with the code below:
skip = max(sk_peek_offset(sk, flags), 0); /* @skip is 4. */
do {
...
while (skip >= unix_skb_len(skb)) {
skip -= unix_skb_len(skb);
...
skb = skb_peek_next(skb, &sk->sk_receive_queue);
if (!skb)
goto again; /* @skip is 0. */
}
The thread jumps to the 'again' label and goes to sleep since
new data has not arrived yet.
Later, new data "bbbb" unblocks the thread, and the thread jumps
to the 'redo:' label to restart the entire process from the first
skb in the recv queue.
do {
...
redo:
...
last = skb = skb_peek(&sk->sk_receive_queue);
...
again:
if (skb == NULL) {
...
timeo = unix_stream_data_wait(sk, timeo, last,
last_len, freezable);
...
goto redo; /* @skip is 0 !! */
However, the peek offset is not reset in the path.
If the buffer size is 8, recv() will return "aaaabbbb" without
skipping any data, and the final offset will be 12 (the original
offset 4 + peeked skbs' length 8).
After sleeping in unix_stream_read_generic(), we have to fetch the
peek offset again.
Let's move the redo label before mutex_lock(&u->iolock).
Fixes: 9f389e35674f ("af_unix: return data from multiple SKBs on recv() with MSG_PEEK flag")
Reported-by: Miao Wang <shankerwangmiao@gmail.com>
Closes: https://lore.kernel.org/netdev/3B969F90-F51F-4B9D-AB1A-994D9A54D460@gmail.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251117174740.3684604-2-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit f94c1a114ac209977bdf5ca841b98424295ab1f0 ]
The function devl_rate_nodes_destroy is documented to "Unset parent for
all rate objects". However, it was only calling the driver-specific
`rate_leaf_parent_set` or `rate_node_parent_set` ops and decrementing
the parent's refcount, without actually setting the
`devlink_rate->parent` pointer to NULL.
This leaves a dangling pointer in the `devlink_rate` struct, which cause
refcount error in netdevsim[1] and mlx5[2]. In addition, this is
inconsistent with the behavior of `devlink_nl_rate_parent_node_set`,
where the parent pointer is correctly cleared.
This patch fixes the issue by explicitly setting `devlink_rate->parent`
to NULL after notifying the driver, thus fulfilling the function's
documented behavior for all rate objects.
[1]
repro steps:
echo 1 > /sys/bus/netdevsim/new_device
devlink dev eswitch set netdevsim/netdevsim1 mode switchdev
echo 1 > /sys/bus/netdevsim/devices/netdevsim1/sriov_numvfs
devlink port function rate add netdevsim/netdevsim1/test_node
devlink port function rate set netdevsim/netdevsim1/128 parent test_node
echo 1 > /sys/bus/netdevsim/del_device
dmesg:
refcount_t: decrement hit 0; leaking memory.
WARNING: CPU: 8 PID: 1530 at lib/refcount.c:31 refcount_warn_saturate+0x42/0xe0
CPU: 8 UID: 0 PID: 1530 Comm: bash Not tainted 6.18.0-rc4+ #1 NONE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
RIP: 0010:refcount_warn_saturate+0x42/0xe0
Call Trace:
<TASK>
devl_rate_leaf_destroy+0x8d/0x90
__nsim_dev_port_del+0x6c/0x70 [netdevsim]
nsim_dev_reload_destroy+0x11c/0x140 [netdevsim]
nsim_drv_remove+0x2b/0xb0 [netdevsim]
device_release_driver_internal+0x194/0x1f0
bus_remove_device+0xc6/0x130
device_del+0x159/0x3c0
device_unregister+0x1a/0x60
del_device_store+0x111/0x170 [netdevsim]
kernfs_fop_write_iter+0x12e/0x1e0
vfs_write+0x215/0x3d0
ksys_write+0x5f/0xd0
do_syscall_64+0x55/0x10f0
entry_SYSCALL_64_after_hwframe+0x4b/0x53
[2]
devlink dev eswitch set pci/0000:08:00.0 mode switchdev
devlink port add pci/0000:08:00.0 flavour pcisf pfnum 0 sfnum 1000
devlink port function rate add pci/0000:08:00.0/group1
devlink port function rate set pci/0000:08:00.0/32768 parent group1
modprobe -r mlx5_ib mlx5_fwctl mlx5_core
dmesg:
refcount_t: decrement hit 0; leaking memory.
WARNING: CPU: 7 PID: 16151 at lib/refcount.c:31 refcount_warn_saturate+0x42/0xe0
CPU: 7 UID: 0 PID: 16151 Comm: bash Not tainted 6.17.0-rc7_for_upstream_min_debug_2025_10_02_12_44 #1 NONE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
RIP: 0010:refcount_warn_saturate+0x42/0xe0
Call Trace:
<TASK>
devl_rate_leaf_destroy+0x8d/0x90
mlx5_esw_offloads_devlink_port_unregister+0x33/0x60 [mlx5_core]
mlx5_esw_offloads_unload_rep+0x3f/0x50 [mlx5_core]
mlx5_eswitch_unload_sf_vport+0x40/0x90 [mlx5_core]
mlx5_sf_esw_event+0xc4/0x120 [mlx5_core]
notifier_call_chain+0x33/0xa0
blocking_notifier_call_chain+0x3b/0x50
mlx5_eswitch_disable_locked+0x50/0x110 [mlx5_core]
mlx5_eswitch_disable+0x63/0x90 [mlx5_core]
mlx5_unload+0x1d/0x170 [mlx5_core]
mlx5_uninit_one+0xa2/0x130 [mlx5_core]
remove_one+0x78/0xd0 [mlx5_core]
pci_device_remove+0x39/0xa0
device_release_driver_internal+0x194/0x1f0
unbind_store+0x99/0xa0
kernfs_fop_write_iter+0x12e/0x1e0
vfs_write+0x215/0x3d0
ksys_write+0x5f/0xd0
do_syscall_64+0x53/0x1f0
entry_SYSCALL_64_after_hwframe+0x4b/0x53
Fixes: d75559845078 ("devlink: Allow setting parent node of rate objects")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1763381149-1234377-1-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit dfe28c4167a9259fc0c372d9f9473e1ac95cff67 ]
The validation of the set(nsh(...)) action is completely wrong.
It runs through the nsh_key_put_from_nlattr() function that is the
same function that validates NSH keys for the flow match and the
push_nsh() action. However, the set(nsh(...)) has a very different
memory layout. Nested attributes in there are doubled in size in
case of the masked set(). That makes proper validation impossible.
There is also confusion in the code between the 'masked' flag, that
says that the nested attributes are doubled in size containing both
the value and the mask, and the 'is_mask' that says that the value
we're parsing is the mask. This is causing kernel crash on trying to
write into mask part of the match with SW_FLOW_KEY_PUT() during
validation, while validate_nsh() doesn't allocate any memory for it:
BUG: kernel NULL pointer dereference, address: 0000000000000018
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 1c2383067 P4D 1c2383067 PUD 20b703067 PMD 0
Oops: Oops: 0000 [#1] SMP NOPTI
CPU: 8 UID: 0 Kdump: loaded Not tainted 6.17.0-rc4+ #107 PREEMPT(voluntary)
RIP: 0010:nsh_key_put_from_nlattr+0x19d/0x610 [openvswitch]
Call Trace:
<TASK>
validate_nsh+0x60/0x90 [openvswitch]
validate_set.constprop.0+0x270/0x3c0 [openvswitch]
__ovs_nla_copy_actions+0x477/0x860 [openvswitch]
ovs_nla_copy_actions+0x8d/0x100 [openvswitch]
ovs_packet_cmd_execute+0x1cc/0x310 [openvswitch]
genl_family_rcv_msg_doit+0xdb/0x130
genl_family_rcv_msg+0x14b/0x220
genl_rcv_msg+0x47/0xa0
netlink_rcv_skb+0x53/0x100
genl_rcv+0x24/0x40
netlink_unicast+0x280/0x3b0
netlink_sendmsg+0x1f7/0x430
____sys_sendmsg+0x36b/0x3a0
___sys_sendmsg+0x87/0xd0
__sys_sendmsg+0x6d/0xd0
do_syscall_64+0x7b/0x2c0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
The third issue with this process is that while trying to convert
the non-masked set into masked one, validate_set() copies and doubles
the size of the OVS_KEY_ATTR_NSH as if it didn't have any nested
attributes. It should be copying each nested attribute and doubling
them in size independently. And the process must be properly reversed
during the conversion back from masked to a non-masked variant during
the flow dump.
In the end, the only two outcomes of trying to use this action are
either validation failure or a kernel crash. And if somehow someone
manages to install a flow with such an action, it will most definitely
not do what it is supposed to, since all the keys and the masks are
mixed up.
Fixing all the issues is a complex task as it requires re-writing
most of the validation code.
Given that and the fact that this functionality never worked since
introduction, let's just remove it altogether. It's better to
re-introduce it later with a proper implementation instead of trying
to fix it in stable releases.
Fixes: b2d0f5d5dc53 ("openvswitch: enable NSH support")
Reported-by: Junvy Yang <zhuque@tencent.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Link: https://patch.msgid.link/20251112112246.95064-1-i.maximets@ovn.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit f796a8dec9beafcc0f6f0d3478ed685a15c5e062 ]
The ethtool tsconfig Netlink path can trigger a null pointer
dereference. A call chain such as:
tsconfig_prepare_data() ->
dev_get_hwtstamp_phylib() ->
vlan_hwtstamp_get() ->
generic_hwtstamp_get_lower() ->
generic_hwtstamp_ioctl_lower()
results in generic_hwtstamp_ioctl_lower() being called with
kernel_cfg->ifr as NULL.
The generic_hwtstamp_ioctl_lower() function does not expect
a NULL ifr and dereferences it, leading to a system crash.
Fix this by adding a NULL check for kernel_cfg->ifr in
generic_hwtstamp_ioctl_lower(). If ifr is NULL, return -EINVAL.
Fixes: 6e9e2eed4f39 ("net: ethtool: Add support for tsconfig command to get/set hwtstamp config")
Closes: https://lore.kernel.org/cd6a7056-fa6d-43f8-b78a-f5e811247ba8@linux.dev
Signed-off-by: Jiaming Zhang <r772577952@gmail.com>
Reviewed-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://patch.msgid.link/20251111173652.749159-2-r772577952@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 59630e2ccd728703cc826e3a3515d70f8c7a766c ]
Add a check to ensure locally generated packets (skb->sk != NULL) do
not use direct output in tunnel mode, as these packets require proper
L2 header setup that is handled by the normal XFRM processing path.
Fixes: 5eddd76ec2fd ("xfrm: fix tunnel mode TX datapath in packet offload mode")
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 61fafbee6cfed283c02a320896089f658fa67e56 ]
The GSO segmentation functions for ESP tunnel mode
(xfrm4_tunnel_gso_segment and xfrm6_tunnel_gso_segment) were
determining the inner packet's L2 protocol type by checking the static
x->inner_mode.family field from the xfrm state.
This is unreliable. In tunnel mode, the state's actual inner family
could be defined by x->inner_mode.family or by
x->inner_mode_iaf.family. Checking only the former can lead to a
mismatch with the actual packet being processed, causing GSO to create
segments with the wrong L2 header type.
This patch fixes the bug by deriving the inner mode directly from the
packet's inner protocol stored in XFRM_MODE_SKB_CB(skb)->protocol.
Instead of replicating the code, this patch modifies the
xfrm_ip2inner_mode helper function. It now correctly returns
&x->inner_mode if the selector family (x->sel.family) is already
specified, thereby handling both specific and AF_UNSPEC cases
appropriately.
With this change, ESP GSO can use xfrm_ip2inner_mode to get the
correct inner mode. It doesn't affect existing callers, as the updated
logic now mirrors the checks they were already performing externally.
Fixes: 26dbd66eab80 ("esp: choose the correct inner protocol for GSO on inter address family tunnels")
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 082ef944e55da8a9a8df92e3842ca82a626d359a ]
In the output path, xfrm_dev_offload_ok and xfrm_get_inner_ipproto
need to determine the protocol family of the inner packet (skb) before
it gets encapsulated.
In xfrm_dev_offload_ok, the code checked x->inner_mode.family. This is
unreliable because, for states handling both IPv4 and IPv6, the
relevant inner family could be either x->inner_mode.family or
x->inner_mode_iaf.family. Checking only the former can lead to a
mismatch with the actual packet being processed.
In xfrm_get_inner_ipproto, the code checked x->outer_mode.family. This
is also incorrect for tunnel mode, as the inner packet's family can be
different from the outer header's family.
At both of these call sites, the skb variable holds the original inner
packet. The most direct and reliable source of truth for its protocol
family is its destination entry. This patch fixes the issue by using
skb_dst(skb)->ops->family to ensure protocol-specific headers are only
accessed for the correct packet type.
Fixes: 91d8a53db219 ("xfrm: fix offloading of cross-family tunnels")
Fixes: 45a98ef4922d ("net/xfrm: IPsec tunnel mode fix inner_ipproto setting in sec_path")
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 1dcf617bec5cb85f68ca19969e7537ef6f6931d3 ]
xfrm_state_construct can fail without setting an error if the
requested pcpu_num value is too big. Set err and add an extack message
to avoid confusing userspace.
Fixes: 1ddf9916ac09 ("xfrm: Add support for per cpu xfrm state handling.")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 7f02285764790e0ff1a731b4187fa3e389ed02c7 ]
In case xfrm_state_migrate fails after calling xfrm_dev_state_add, we
directly release the last reference and destroy the new state, without
calling xfrm_dev_state_delete (this only happens in
__xfrm_state_delete, which we're not calling on this path, since the
state was never added).
Call xfrm_dev_state_delete on error when an offload configuration was
provided.
Fixes: ab244a394c7f ("xfrm: Migrate offload configuration")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
were never added
[ Upstream commit 10deb69864840ccf96b00ac2ab3a2055c0c04721 ]
In commit b441cf3f8c4b ("xfrm: delete x->tunnel as we delete x"), I
missed the case where state creation fails between full
initialization (->init_state has been called) and being inserted on
the lists.
In this situation, ->init_state has been called, so for IPcomp
tunnels, the fallback tunnel has been created and added onto the
lists, but the user state never gets added, because we fail before
that. The user state doesn't go through __xfrm_state_delete, so we
don't call xfrm_state_delete_tunnel for those states, and we end up
leaking the FB tunnel.
There are several codepaths affected by this: the add/update paths, in
both net/key and xfrm, and the migrate code (xfrm_migrate,
xfrm_state_migrate). A "proper" rollback of the init_state work would
probably be doable in the add/update code, but for migrate it gets
more complicated as multiple states may be involved.
At some point, the new (not-inserted) state will be destroyed, so call
xfrm_state_delete_tunnel during xfrm_state_gc_destroy. Most states
will have their fallback tunnel cleaned up during __xfrm_state_delete,
which solves the issue that b441cf3f8c4b (and other patches before it)
aimed at. All states (including FB tunnels) will be removed from the
lists once xfrm_state_fini has called flush_work(&xfrm_state_gc_work).
Reported-by: syzbot+999eb23467f83f9bf9bf@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=999eb23467f83f9bf9bf
Fixes: b441cf3f8c4b ("xfrm: delete x->tunnel as we delete x")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 8d2a2a49c30f67a480fa9ed25e08436a446f057e ]
We're not updating x1, but we still need to put() it.
Fixes: a4a87fa4e96c ("xfrm: Add Direction to the SA in or out")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 1bba3f219c5e8c29e63afa3c1fc24f875ebec119 upstream.
In case of DSS corruption, the MPTCP protocol tries to avoid the subflow
reset if fallback is possible. Such corruptions happen in the receive
path; to ensure fallback is possible the stack additionally needs to
check for OoO data, otherwise the fallback will break the data stream.
Fixes: e32d262c89e2 ("mptcp: handle consistently DSS corruption")
Cc: stable@vger.kernel.org
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/598
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20251118-net-mptcp-misc-fixes-6-18-rc6-v1-4-806d3781c95f@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit fff0c87996672816a84c3386797a5e69751c5888 upstream.
With the current fastclose implementation, the mptcp_do_fastclose()
helper is in charge of two distinct actions: send the fastclose reset
and cleanup the subflows.
Formally decouple the two steps, ensuring that mptcp explicitly closes
all the subflows after the mentioned helper.
This will make the upcoming fix simpler, and allows dropping the 2nd
argument from mptcp_destroy_common(). The Fixes tag is then the same as
in the next commit to help with the backports.
Fixes: d21f83485518 ("mptcp: use fastclose on more edge scenarios")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Geliang Tang <geliang@kernel.org>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20251118-net-mptcp-misc-fixes-6-18-rc6-v1-5-806d3781c95f@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 4f102d747cadd8f595f2b25882eed9bec1675fb1 upstream.
The rcv window is shared among all the subflows. Currently, MPTCP sync
the TCP-level rcv window with the MPTCP one at tcp_transmit_skb() time.
The above means that incoming data may sporadically observe outdated
TCP-level rcv window and being wrongly dropped by TCP.
Address the issue checking for the edge condition before queuing the
data at TCP level, and eventually syncing the rcv window as needed.
Note that the issue is actually present from the very first MPTCP
implementation, but backports older than the blamed commit below will
range from impossible to useless.
Before:
$ nstat -n; sleep 1; nstat -z TcpExtBeyondWindow
TcpExtBeyondWindow 14 0.0
After:
$ nstat -n; sleep 1; nstat -z TcpExtBeyondWindow
TcpExtBeyondWindow 0 0.0
Fixes: fa3fe2b15031 ("mptcp: track window announced to peer")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20251118-net-mptcp-misc-fixes-6-18-rc6-v1-2-806d3781c95f@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 17393fa7b7086664be519e7230cb6ed7ec7d9462 upstream.
I'm observing very frequent self-tests failures in case of fallback when
running on a CONFIG_PREEMPT kernel.
The root cause is that subflow_sched_work_if_closed() closes any subflow
as soon as it is half-closed and has no incoming data pending.
That works well for regular subflows - MPTCP needs bi-directional
connectivity to operate on a given subflow - but for fallback socket is
race prone.
When TCP peer closes the connection before the MPTCP one,
subflow_sched_work_if_closed() will schedule the MPTCP worker to
gracefully close the subflow, and shortly after will do another schedule
to inject and process a dummy incoming DATA_FIN.
On CONFIG_PREEMPT kernel, the MPTCP worker can kick-in and close the
fallback subflow before subflow_sched_work_if_closed() is able to create
the dummy DATA_FIN, unexpectedly interrupting the transfer.
Address the issue explicitly avoiding closing fallback subflows on when
the peer is only half-closed.
Note that, when the subflow is able to create the DATA_FIN before the
worker invocation, the worker will change the msk state before trying to
close the subflow and will skip the latter operation as the msk will not
match anymore the precondition in __mptcp_close_subflow().
Fixes: f09b0ad55a11 ("mptcp: close subflow when receiving TCP+FIN")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20251118-net-mptcp-misc-fixes-6-18-rc6-v1-3-806d3781c95f@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit ae155060247be8dcae3802a95bd1bdf93ab3215d upstream.
The CI reports sporadic failures of the fastclose self-tests. The root
cause is a duplicate reset, not carrying the relevant MPTCP option.
In the failing scenario the bad reset is received by the peer before
the fastclose one, preventing the reception of the latter.
Indeed there is window of opportunity at fastclose time for the
following race:
mptcp_do_fastclose
__mptcp_close_ssk
__tcp_close()
tcp_set_state() [1]
tcp_send_active_reset() [2]
After [1] the stack will send reset to in-flight data reaching the now
closed port. Such reset may race with [2].
Address the issue explicitly sending a single reset on fastclose before
explicitly moving the subflow to close status.
Fixes: d21f83485518 ("mptcp: use fastclose on more edge scenarios")
Cc: stable@vger.kernel.org
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/596
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Geliang Tang <geliang@kernel.org>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20251118-net-mptcp-misc-fixes-6-18-rc6-v1-6-806d3781c95f@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 5e15395f6d9ec07395866c5511f4b4ac566c0c9b upstream.
mptcp_cleanup_rbuf() needs to know the last most recent, mptcp-level
rcv_wnd sent, and such information is tracked into the msk->old_wspace
field, updated at ack transmission time by mptcp_write_options().
Fallback socket do not add any mptcp options, such helper is never
invoked, and msk->old_wspace value remain stale. That in turn makes
ack generation at recvmsg() time quite random.
Address the issue ensuring mptcp_write_options() is invoked even for
fallback sockets, and just update the needed info in such a case.
The issue went unnoticed for a long time, as mptcp currently overshots
the fallback socket receive buffer autotune significantly. It is going
to change in the near future.
Fixes: e3859603ba13 ("mptcp: better msk receive window updates")
Cc: stable@vger.kernel.org
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/594
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Geliang Tang <geliang@kernel.org>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20251118-net-mptcp-misc-fixes-6-18-rc6-v1-1-806d3781c95f@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 426358d9be7ce3518966422f87b96f1bad27295f upstream.
mptcp_pm_del_add_timer() can call sk_stop_timer_sync(sk, &entry->add_timer)
while another might have free entry already, as reported by syzbot.
Add RCU protection to fix this issue.
Also change confusing add_timer variable with stop_timer boolean.
syzbot report:
BUG: KASAN: slab-use-after-free in __timer_delete_sync+0x372/0x3f0 kernel/time/timer.c:1616
Read of size 4 at addr ffff8880311e4150 by task kworker/1:1/44
CPU: 1 UID: 0 PID: 44 Comm: kworker/1:1 Not tainted syzkaller #0 PREEMPT_{RT,(full)}
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/02/2025
Workqueue: events mptcp_worker
Call Trace:
<TASK>
dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120
print_address_description mm/kasan/report.c:378 [inline]
print_report+0xca/0x240 mm/kasan/report.c:482
kasan_report+0x118/0x150 mm/kasan/report.c:595
__timer_delete_sync+0x372/0x3f0 kernel/time/timer.c:1616
sk_stop_timer_sync+0x1b/0x90 net/core/sock.c:3631
mptcp_pm_del_add_timer+0x283/0x310 net/mptcp/pm.c:362
mptcp_incoming_options+0x1357/0x1f60 net/mptcp/options.c:1174
tcp_data_queue+0xca/0x6450 net/ipv4/tcp_input.c:5361
tcp_rcv_established+0x1335/0x2670 net/ipv4/tcp_input.c:6441
tcp_v4_do_rcv+0x98b/0xbf0 net/ipv4/tcp_ipv4.c:1931
tcp_v4_rcv+0x252a/0x2dc0 net/ipv4/tcp_ipv4.c:2374
ip_protocol_deliver_rcu+0x221/0x440 net/ipv4/ip_input.c:205
ip_local_deliver_finish+0x3bb/0x6f0 net/ipv4/ip_input.c:239
NF_HOOK+0x30c/0x3a0 include/linux/netfilter.h:318
NF_HOOK+0x30c/0x3a0 include/linux/netfilter.h:318
__netif_receive_skb_one_core net/core/dev.c:6079 [inline]
__netif_receive_skb+0x143/0x380 net/core/dev.c:6192
process_backlog+0x31e/0x900 net/core/dev.c:6544
__napi_poll+0xb6/0x540 net/core/dev.c:7594
napi_poll net/core/dev.c:7657 [inline]
net_rx_action+0x5f7/0xda0 net/core/dev.c:7784
handle_softirqs+0x22f/0x710 kernel/softirq.c:622
__do_softirq kernel/softirq.c:656 [inline]
__local_bh_enable_ip+0x1a0/0x2e0 kernel/softirq.c:302
mptcp_pm_send_ack net/mptcp/pm.c:210 [inline]
mptcp_pm_addr_send_ack+0x41f/0x500 net/mptcp/pm.c:-1
mptcp_pm_worker+0x174/0x320 net/mptcp/pm.c:1002
mptcp_worker+0xd5/0x1170 net/mptcp/protocol.c:2762
process_one_work kernel/workqueue.c:3263 [inline]
process_scheduled_works+0xae1/0x17b0 kernel/workqueue.c:3346
worker_thread+0x8a0/0xda0 kernel/workqueue.c:3427
kthread+0x711/0x8a0 kernel/kthread.c:463
ret_from_fork+0x4bc/0x870 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
</TASK>
Allocated by task 44:
kasan_save_stack mm/kasan/common.c:56 [inline]
kasan_save_track+0x3e/0x80 mm/kasan/common.c:77
poison_kmalloc_redzone mm/kasan/common.c:400 [inline]
__kasan_kmalloc+0x93/0xb0 mm/kasan/common.c:417
kasan_kmalloc include/linux/kasan.h:262 [inline]
__kmalloc_cache_noprof+0x1ef/0x6c0 mm/slub.c:5748
kmalloc_noprof include/linux/slab.h:957 [inline]
mptcp_pm_alloc_anno_list+0x104/0x460 net/mptcp/pm.c:385
mptcp_pm_create_subflow_or_signal_addr+0xf9d/0x1360 net/mptcp/pm_kernel.c:355
mptcp_pm_nl_fully_established net/mptcp/pm_kernel.c:409 [inline]
__mptcp_pm_kernel_worker+0x417/0x1ef0 net/mptcp/pm_kernel.c:1529
mptcp_pm_worker+0x1ee/0x320 net/mptcp/pm.c:1008
mptcp_worker+0xd5/0x1170 net/mptcp/protocol.c:2762
process_one_work kernel/workqueue.c:3263 [inline]
process_scheduled_works+0xae1/0x17b0 kernel/workqueue.c:3346
worker_thread+0x8a0/0xda0 kernel/workqueue.c:3427
kthread+0x711/0x8a0 kernel/kthread.c:463
ret_from_fork+0x4bc/0x870 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
Freed by task 6630:
kasan_save_stack mm/kasan/common.c:56 [inline]
kasan_save_track+0x3e/0x80 mm/kasan/common.c:77
__kasan_save_free_info+0x46/0x50 mm/kasan/generic.c:587
kasan_save_free_info mm/kasan/kasan.h:406 [inline]
poison_slab_object mm/kasan/common.c:252 [inline]
__kasan_slab_free+0x5c/0x80 mm/kasan/common.c:284
kasan_slab_free include/linux/kasan.h:234 [inline]
slab_free_hook mm/slub.c:2523 [inline]
slab_free mm/slub.c:6611 [inline]
kfree+0x197/0x950 mm/slub.c:6818
mptcp_remove_anno_list_by_saddr+0x2d/0x40 net/mptcp/pm.c:158
mptcp_pm_flush_addrs_and_subflows net/mptcp/pm_kernel.c:1209 [inline]
mptcp_nl_flush_addrs_list net/mptcp/pm_kernel.c:1240 [inline]
mptcp_pm_nl_flush_addrs_doit+0x593/0xbb0 net/mptcp/pm_kernel.c:1281
genl_family_rcv_msg_doit+0x215/0x300 net/netlink/genetlink.c:1115
genl_family_rcv_msg net/netlink/genetlink.c:1195 [inline]
genl_rcv_msg+0x60e/0x790 net/netlink/genetlink.c:1210
netlink_rcv_skb+0x208/0x470 net/netlink/af_netlink.c:2552
genl_rcv+0x28/0x40 net/netlink/genetlink.c:1219
netlink_unicast_kernel net/netlink/af_netlink.c:1320 [inline]
netlink_unicast+0x846/0xa10 net/netlink/af_netlink.c:1346
netlink_sendmsg+0x805/0xb30 net/netlink/af_netlink.c:1896
sock_sendmsg_nosec net/socket.c:727 [inline]
__sock_sendmsg+0x21c/0x270 net/socket.c:742
____sys_sendmsg+0x508/0x820 net/socket.c:2630
___sys_sendmsg+0x21f/0x2a0 net/socket.c:2684
__sys_sendmsg net/socket.c:2716 [inline]
__do_sys_sendmsg net/socket.c:2721 [inline]
__se_sys_sendmsg net/socket.c:2719 [inline]
__x64_sys_sendmsg+0x1a1/0x260 net/socket.c:2719
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xfa/0xfa0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Cc: stable@vger.kernel.org
Fixes: 00cfd77b9063 ("mptcp: retransmit ADD_ADDR when timeout")
Reported-by: syzbot+2a6fbf0f0530375968df@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/691ad3c3.a70a0220.f6df1.0004.GAE@google.com
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Geliang Tang <geliang@kernel.org>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20251117100745.1913963-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 035bca3f017ee9dea3a5a756e77a6f7138cc6eea upstream.
syzbot reported use-after-free in mptcp_schedule_work() [1]
Issue here is that mptcp_schedule_work() schedules a work,
then gets a refcount on sk->sk_refcnt if the work was scheduled.
This refcount will be released by mptcp_worker().
[A] if (schedule_work(...)) {
[B] sock_hold(sk);
return true;
}
Problem is that mptcp_worker() can run immediately and complete before [B]
We need instead :
sock_hold(sk);
if (schedule_work(...))
return true;
sock_put(sk);
[1]
refcount_t: addition on 0; use-after-free.
WARNING: CPU: 1 PID: 29 at lib/refcount.c:25 refcount_warn_saturate+0xfa/0x1d0 lib/refcount.c:25
Call Trace:
<TASK>
__refcount_add include/linux/refcount.h:-1 [inline]
__refcount_inc include/linux/refcount.h:366 [inline]
refcount_inc include/linux/refcount.h:383 [inline]
sock_hold include/net/sock.h:816 [inline]
mptcp_schedule_work+0x164/0x1a0 net/mptcp/protocol.c:943
mptcp_tout_timer+0x21/0xa0 net/mptcp/protocol.c:2316
call_timer_fn+0x17e/0x5f0 kernel/time/timer.c:1747
expire_timers kernel/time/timer.c:1798 [inline]
__run_timers kernel/time/timer.c:2372 [inline]
__run_timer_base+0x648/0x970 kernel/time/timer.c:2384
run_timer_base kernel/time/timer.c:2393 [inline]
run_timer_softirq+0xb7/0x180 kernel/time/timer.c:2403
handle_softirqs+0x22f/0x710 kernel/softirq.c:622
__do_softirq kernel/softirq.c:656 [inline]
run_ktimerd+0xcf/0x190 kernel/softirq.c:1138
smpboot_thread_fn+0x542/0xa60 kernel/smpboot.c:160
kthread+0x711/0x8a0 kernel/kthread.c:463
ret_from_fork+0x4bc/0x870 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
Cc: stable@vger.kernel.org
Fixes: 3b1d6210a957 ("mptcp: implement and use MPTCP-level retransmission")
Reported-by: syzbot+355158e7e301548a1424@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/6915b46f.050a0220.3565dc.0028.GAE@google.com/T/#u
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20251113103924.3737425-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit c77b3b79a92e3345aa1ee296180d1af4e7031f8f upstream.
The sockmap feature allows bpf syscall from userspace, or based
on bpf sockops, replacing the sk_prot of sockets during protocol stack
processing with sockmap's custom read/write interfaces.
'''
tcp_rcv_state_process()
syn_recv_sock()/subflow_syn_recv_sock()
tcp_init_transfer(BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB)
bpf_skops_established <== sockops
bpf_sock_map_update(sk) <== call bpf helper
tcp_bpf_update_proto() <== update sk_prot
'''
When the server has MPTCP enabled but the client sends a TCP SYN
without MPTCP, subflow_syn_recv_sock() performs a fallback on the
subflow, replacing the subflow sk's sk_prot with the native sk_prot.
'''
subflow_syn_recv_sock()
subflow_ulp_fallback()
subflow_drop_ctx()
mptcp_subflow_ops_undo_override()
'''
Then, this subflow can be normally used by sockmap, which replaces the
native sk_prot with sockmap's custom sk_prot. The issue occurs when the
user executes accept::mptcp_stream_accept::mptcp_fallback_tcp_ops().
Here, it uses sk->sk_prot to compare with the native sk_prot, but this
is incorrect when sockmap is used, as we may incorrectly set
sk->sk_socket->ops.
This fix uses the more generic sk_family for the comparison instead.
Additionally, this also prevents a WARNING from occurring:
result from ./scripts/decode_stacktrace.sh:
------------[ cut here ]------------
WARNING: CPU: 0 PID: 337 at net/mptcp/protocol.c:68 mptcp_stream_accept \
(net/mptcp/protocol.c:4005)
Modules linked in:
...
PKRU: 55555554
Call Trace:
<TASK>
do_accept (net/socket.c:1989)
__sys_accept4 (net/socket.c:2028 net/socket.c:2057)
__x64_sys_accept (net/socket.c:2067)
x64_sys_call (arch/x86/entry/syscall_64.c:41)
do_syscall_64 (arch/x86/entry/syscall_64.c:63 arch/x86/entry/syscall_64.c:94)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
RIP: 0033:0x7f87ac92b83d
---[ end trace 0000000000000000 ]---
Fixes: 0b4f33def7bb ("mptcp: fix tcp fallback crash")
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Cc: <stable@vger.kernel.org>
Link: https://patch.msgid.link/20251111060307.194196-3-jiayuan.chen@linux.dev
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit fbade4bd08ba52cbc74a71c4e86e736f059f99f7 upstream.
The sockmap feature allows bpf syscall from userspace, or based on bpf
sockops, replacing the sk_prot of sockets during protocol stack processing
with sockmap's custom read/write interfaces.
'''
tcp_rcv_state_process()
subflow_syn_recv_sock()
tcp_init_transfer(BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB)
bpf_skops_established <== sockops
bpf_sock_map_update(sk) <== call bpf helper
tcp_bpf_update_proto() <== update sk_prot
'''
Consider two scenarios:
1. When the server has MPTCP enabled and the client also requests MPTCP,
the sk passed to the BPF program is a subflow sk. Since subflows only
handle partial data, replacing their sk_prot is meaningless and will
cause traffic disruption.
2. When the server has MPTCP enabled but the client sends a TCP SYN
without MPTCP, subflow_syn_recv_sock() performs a fallback on the
subflow, replacing the subflow sk's sk_prot with the native sk_prot.
'''
subflow_ulp_fallback()
subflow_drop_ctx()
mptcp_subflow_ops_undo_override()
'''
Subsequently, accept::mptcp_stream_accept::mptcp_fallback_tcp_ops()
converts the subflow to plain TCP.
For the first case, we should prevent it from being combined with sockmap
by setting sk_prot->psock_update_sk_prot to NULL, which will be blocked by
sockmap's own flow.
For the second case, since subflow_syn_recv_sock() has already restored
sk_prot to native tcp_prot/tcpv6_prot, no further action is needed.
Fixes: cec37a6e41aa ("mptcp: Handle MP_CAPABLE options for outgoing connections")
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Cc: <stable@vger.kernel.org>
Link: https://patch.msgid.link/20251111060307.194196-2-jiayuan.chen@linux.dev
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit a9da90e618cd0669a22bcc06a96209db5dd96e9b upstream.
While connecting, the MAC address can already no longer be
changed. The change is already rejected if netif_carrier_ok(),
but of course that's not true yet while connecting. Check for
auth_data or assoc_data, so the MAC address cannot be changed.
Also more comprehensively check that there are no stations on
the interface being changed - if any peer station is added it
will know about our address already, so we cannot change it.
Cc: stable@vger.kernel.org
Fixes: 3c06e91b40db ("wifi: mac80211: Support POWERED_ADDR_CHANGE feature")
Link: https://patch.msgid.link/20251105154119.f9f6c1df81bb.I9bb3760ede650fb96588be0d09a5a7bdec21b217@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 49c8d2c1f94cc2f4d1a108530d7ba52614b874c2 upstream.
commit efa95b01da18 ("netpoll: fix use after free") incorrectly
ignored the refcount and prematurely set dev->npinfo to NULL during
netpoll cleanup, leading to improper behavior and memory leaks.
Scenario causing lack of proper cleanup:
1) A netpoll is associated with a NIC (e.g., eth0) and netdev->npinfo is
allocated, and refcnt = 1
- Keep in mind that npinfo is shared among all netpoll instances. In
this case, there is just one.
2) Another netpoll is also associated with the same NIC and
npinfo->refcnt += 1.
- Now dev->npinfo->refcnt = 2;
- There is just one npinfo associated to the netdev.
3) When the first netpolls goes to clean up:
- The first cleanup succeeds and clears np->dev->npinfo, ignoring
refcnt.
- It basically calls `RCU_INIT_POINTER(np->dev->npinfo, NULL);`
- Set dev->npinfo = NULL, without proper cleanup
- No ->ndo_netpoll_cleanup() is either called
4) Now the second target tries to clean up
- The second cleanup fails because np->dev->npinfo is already NULL.
* In this case, ops->ndo_netpoll_cleanup() was never called, and
the skb pool is not cleaned as well (for the second netpoll
instance)
- This leaks npinfo and skbpool skbs, which is clearly reported by
kmemleak.
Revert commit efa95b01da18 ("netpoll: fix use after free") and adds
clarifying comments emphasizing that npinfo cleanup should only happen
once the refcount reaches zero, ensuring stable and correct netpoll
behavior.
Cc: <stable@vger.kernel.org> # 3.17.x
Cc: Jay Vosburgh <jv@jvosburgh.net>
Fixes: efa95b01da18 ("netpoll: fix use after free")
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20251107-netconsole_torture-v10-1-749227b55f63@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit ac1499fcd40fe06479e9b933347b837ccabc2a40 upstream.
The sit driver's packet transmission path calls: sit_tunnel_xmit() ->
update_or_create_fnhe(), which lead to fnhe_remove_oldest() being called
to delete entries exceeding FNHE_RECLAIM_DEPTH+random.
The race window is between fnhe_remove_oldest() selecting fnheX for
deletion and the subsequent kfree_rcu(). During this time, the
concurrent path's __mkroute_output() -> find_exception() can fetch the
soon-to-be-deleted fnheX, and rt_bind_exception() then binds it with a
new dst using a dst_hold(). When the original fnheX is freed via RCU,
the dst reference remains permanently leaked.
CPU 0 CPU 1
__mkroute_output()
find_exception() [fnheX]
update_or_create_fnhe()
fnhe_remove_oldest() [fnheX]
rt_bind_exception() [bind dst]
RCU callback [fnheX freed, dst leak]
This issue manifests as a device reference count leak and a warning in
dmesg when unregistering the net device:
unregister_netdevice: waiting for sitX to become free. Usage count = N
Ido Schimmel provided the simple test validation method [1].
The fix clears 'oldest->fnhe_daddr' before calling fnhe_flush_routes().
Since rt_bind_exception() checks this field, setting it to zero prevents
the stale fnhe from being reused and bound to a new dst just before it
is freed.
[1]
ip netns add ns1
ip -n ns1 link set dev lo up
ip -n ns1 address add 192.0.2.1/32 dev lo
ip -n ns1 link add name dummy1 up type dummy
ip -n ns1 route add 192.0.2.2/32 dev dummy1
ip -n ns1 link add name gretap1 up arp off type gretap \
local 192.0.2.1 remote 192.0.2.2
ip -n ns1 route add 198.51.0.0/16 dev gretap1
taskset -c 0 ip netns exec ns1 mausezahn gretap1 \
-A 198.51.100.1 -B 198.51.0.0/16 -t udp -p 1000 -c 0 -q &
taskset -c 2 ip netns exec ns1 mausezahn gretap1 \
-A 198.51.100.1 -B 198.51.0.0/16 -t udp -p 1000 -c 0 -q &
sleep 10
ip netns pids ns1 | xargs kill
ip netns del ns1
Cc: stable@vger.kernel.org
Fixes: 67d6d681e15b ("ipv4: make exception cache less predictible")
Signed-off-by: Chuang Wang <nashuiliang@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251111064328.24440-1-nashuiliang@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 4da4e4bde1c453ac5cc2dce5def81d504ae257ee upstream.
The `len` member of the sk_buff is an unsigned int. This is cast to
`ssize_t` (a signed type) for the first sk_buff in the comparison,
but not the second sk_buff. On 32-bit systems, this can result in
an integer underflow for certain values because unsigned arithmetic
is being used.
This appears to be an oversight: if the intention was to use unsigned
arithmetic, then the first cast would have been omitted. The change
ensures both len values are cast to `ssize_t`.
The underflow causes an issue with ktls when multiple TLS PDUs are
included in a single TCP segment. The mainline kernel does not use
strparser for ktls anymore, but this is still useful for other
features that still use strparser, and for backporting.
Signed-off-by: Nate Karstens <nate.karstens@garmin.com>
Cc: stable@vger.kernel.org
Fixes: 43a0c6751a32 ("strparser: Stream parser for messages")
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20251106222835.1871628-1-nate.karstens@garmin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 4ef92743625818932b9c320152b58274c05e5053 ]
syzbot found that cls_bpf_classify() is able to change
tc_skb_cb(skb)->drop_reason triggering a warning in sk_skb_reason_drop().
WARNING: CPU: 0 PID: 5965 at net/core/skbuff.c:1192 __sk_skb_reason_drop net/core/skbuff.c:1189 [inline]
WARNING: CPU: 0 PID: 5965 at net/core/skbuff.c:1192 sk_skb_reason_drop+0x76/0x170 net/core/skbuff.c:1214
struct tc_skb_cb has been added in commit ec624fe740b4 ("net/sched:
Extend qdisc control block with tc control block"), which added a wrong
interaction with db58ba459202 ("bpf: wire in data and data_end for
cls_act_bpf").
drop_reason was added later.
Add bpf_prog_run_data_pointers() helper to save/restore the net_sched
storage colliding with BPF data_meta/data_end.
Fixes: ec624fe740b4 ("net/sched: Extend qdisc control block with tc control block")
Reported-by: syzbot <syzkaller@googlegroups.com>
Closes: https://lore.kernel.org/netdev/6913437c.a70a0220.22f260.013b.GAE@google.com/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20251112125516.1563021-1-edumazet@google.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 90918e3b6404c2a37837b8f11692471b4c512de2 ]
Sequence adjustment may be required for FTP traffic with PASV/EPSV modes.
due to need to re-write packet payload (IP, port) on the ftp control
connection. This can require changes to the TCP length and expected
seq / ack_seq.
The easiest way to reproduce this issue is with PASV mode.
Example ruleset:
table inet ftp_nat {
ct helper ftp_helper {
type "ftp" protocol tcp
l3proto inet
}
chain prerouting {
type filter hook prerouting priority 0; policy accept;
tcp dport 21 ct state new ct helper set "ftp_helper"
}
}
table ip nat {
chain prerouting {
type nat hook prerouting priority -100; policy accept;
tcp dport 21 dnat ip prefix to ip daddr map {
192.168.100.1 : 192.168.13.2/32 }
}
chain postrouting {
type nat hook postrouting priority 100 ; policy accept;
tcp sport 21 snat ip prefix to ip saddr map {
192.168.13.2 : 192.168.100.1/32 }
}
}
Note that the ftp helper gets assigned *after* the dnat setup.
The inverse (nat after helper assign) is handled by an existing
check in nf_nat_setup_info() and will not show the problem.
Topoloy:
+-------------------+ +----------------------------------+
| FTP: 192.168.13.2 | <-> | NAT: 192.168.13.3, 192.168.100.1 |
+-------------------+ +----------------------------------+
|
+-----------------------+
| Client: 192.168.100.2 |
+-----------------------+
ftp nat changes do not work as expected in this case:
Connected to 192.168.100.1.
[..]
ftp> epsv
EPSV/EPRT on IPv4 off.
ftp> ls
227 Entering passive mode (192,168,100,1,209,129).
421 Service not available, remote server has closed connection.
Kernel logs:
Missing nfct_seqadj_ext_add() setup call
WARNING: CPU: 1 PID: 0 at net/netfilter/nf_conntrack_seqadj.c:41
[..]
__nf_nat_mangle_tcp_packet+0x100/0x160 [nf_nat]
nf_nat_ftp+0x142/0x280 [nf_nat_ftp]
help+0x4d1/0x880 [nf_conntrack_ftp]
nf_confirm+0x122/0x2e0 [nf_conntrack]
nf_hook_slow+0x3c/0xb0
..
Fix this by adding the required extension when a conntrack helper is assigned
to a connection that has a nat binding.
Fixes: 1a64edf54f55 ("netfilter: nft_ct: add helper set support")
Signed-off-by: Andrii Melnychenko <a.melnychenko@vyos.io>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit e060088db0bdf7932e0e3c2d24b7371c4c5b867c ]
l2cap_chan_put() is exported, so export also l2cap_chan_hold() for
modules.
l2cap_chan_hold() has use case in net/bluetooth/6lowpan.c
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit b2c26c82f7a94ec4da096f370e3612ee14424450 ]
For HSRv0, the path_id has the following meaning:
- 0000: PRP supervision frame
- 0001-1001: HSR ring identifier
- 1010-1011: Frames from PRP network (A/B, with RedBoxes)
- 1111: HSR supervision frame
Follow the IEC 62439-3:2010 standard more closely by setting the right
path_id for HSRv0 supervision frames (actually, it is correctly set when
the frame is constructed, but hsr_set_path_id() overwrites it) and set a
fixed HSR ring identifier of 1. The ring identifier seems to be generally
unused and we ignore it anyways on reception, but some fixed identifier is
definitely better than using one identifier in one direction and a wrong
identifier in the other.
This was also the behavior before commit f266a683a480 ("net/hsr: Better
frame dispatch") which introduced the alternating path_id. This was later
moved to hsr_set_path_id() in commit 451d8123f897 ("net: prp: add packet
handling support").
The IEC 62439-3:2010 also contains 6 unused bytes after the MacAddressA in
the HSRv0 supervision frames. Adjust a TODO comment accordingly.
Fixes: f266a683a480 ("net/hsr: Better frame dispatch")
Fixes: 451d8123f897 ("net: prp: add packet handling support")
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/ea0d5133cd593856b2fa673d6e2067bf1d4d1794.1762876095.git.fmaurer@redhat.com
Tested-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 96a3a03abf3d8cc38cd9cb0d280235fbcf7c3f7f ]
On HSRv0, no supervision frames were sent. The supervison frames were
generated successfully, but failed the check for a sufficiently long mac
header, i.e., at least sizeof(struct hsr_ethhdr), in hsr_fill_frame_info()
because the mac header only contained the ethernet header.
Fix this by including the HSR header in the mac header when generating HSR
supervision frames. Note that the mac header now also includes the TLV
fields. This matches how we set the headers on rx and also the size of
struct hsrv0_ethhdr_sp.
Reported-by: Hangbin Liu <liuhangbin@gmail.com>
Closes: https://lore.kernel.org/netdev/aMONxDXkzBZZRfE5@fedora/
Fixes: 9cfb5e7f0ded ("net: hsr: fix hsr_init_sk() vs network/transport headers.")
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/4354114fea9a642fe71f49aeeb6c6159d1d61840.1762876095.git.fmaurer@redhat.com
Tested-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 0345552a653ce5542affeb69ac5aa52177a5199b ]
After commit 100dfa74cad9 ("inet: dev_queue_xmit() llist adoption")
I started seeing many qdisc requeues on IDPF under high TX workload.
$ tc -s qd sh dev eth1 handle 1: ; sleep 1; tc -s qd sh dev eth1 handle 1:
qdisc mq 1: root
Sent 43534617319319 bytes 268186451819 pkt (dropped 0, overlimits 0 requeues 3532840114)
backlog 1056Kb 6675p requeues 3532840114
qdisc mq 1: root
Sent 43554665866695 bytes 268309964788 pkt (dropped 0, overlimits 0 requeues 3537737653)
backlog 781164b 4822p requeues 3537737653
This is caused by try_bulk_dequeue_skb() being only limited by BQL budget.
perf record -C120-239 -e qdisc:qdisc_dequeue sleep 1 ; perf script
...
netperf 75332 [146] 2711.138269: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1292 skbaddr=0xff378005a1e9f200
netperf 75332 [146] 2711.138953: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1213 skbaddr=0xff378004d607a500
netperf 75330 [144] 2711.139631: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1233 skbaddr=0xff3780046be20100
netperf 75333 [147] 2711.140356: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1093 skbaddr=0xff37800514845b00
netperf 75337 [151] 2711.141037: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1353 skbaddr=0xff37800460753300
netperf 75337 [151] 2711.141877: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1367 skbaddr=0xff378004e72c7b00
netperf 75330 [144] 2711.142643: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1202 skbaddr=0xff3780045bd60000
...
This is bad because :
1) Large batches hold one victim cpu for a very long time.
2) Driver often hit their own TX ring limit (all slots are used).
3) We call dev_requeue_skb()
4) Requeues are using a FIFO (q->gso_skb), breaking qdisc ability to
implement FQ or priority scheduling.
5) dequeue_skb() gets packets from q->gso_skb one skb at a time
with no xmit_more support. This is causing many spinlock games
between the qdisc and the device driver.
Requeues were supposed to be very rare, lets keep them this way.
Limit batch sizes to /proc/sys/net/core/dev_weight (default 64) as
__qdisc_run() was designed to use.
Fixes: 5772e9a3463b ("qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://patch.msgid.link/20251109161215.2574081-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit ce50039be49eea9b4cd8873ca6eccded1b4a130a ]
Fix a KMSAN kernel-infoleak detected by the syzbot .
[net?] KMSAN: kernel-infoleak in __skb_datagram_iter
In tcf_ife_dump(), the variable 'opt' was partially initialized using a
designatied initializer. While the padding bytes are reamined
uninitialized. nla_put() copies the entire structure into a
netlink message, these uninitialized bytes leaked to userspace.
Initialize the structure with memset before assigning its fields
to ensure all members and padding are cleared prior to beign copied.
This change silences the KMSAN report and prevents potential information
leaks from the kernel memory.
This fix has been tested and validated by syzbot. This patch closes the
bug reported at the following syzkaller link and ensures no infoleak.
Reported-by: syzbot+0c85cae3350b7d486aee@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=0c85cae3350b7d486aee
Tested-by: syzbot+0c85cae3350b7d486aee@syzkaller.appspotmail.com
Fixes: ef6980b6becb ("introduce IFE action")
Signed-off-by: Ranganath V N <vnranganath.20@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251109091336.9277-3-vnranganath.20@gmail.com
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 62b656e43eaeae445a39cd8021a4f47065af4389 ]
In tcf_connmark_dump(), the variable 'opt' was partially initialized using a
designatied initializer. While the padding bytes are reamined
uninitialized. nla_put() copies the entire structure into a
netlink message, these uninitialized bytes leaked to userspace.
Initialize the structure with memset before assigning its fields
to ensure all members and padding are cleared prior to beign copied.
Reported-by: syzbot+0c85cae3350b7d486aee@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=0c85cae3350b7d486aee
Tested-by: syzbot+0c85cae3350b7d486aee@syzkaller.appspotmail.com
Fixes: 22a5dc0e5e3e ("net: sched: Introduce connmark action")
Signed-off-by: Ranganath V N <vnranganath.20@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251109091336.9277-2-vnranganath.20@gmail.com
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 485e0626e58768f3c53ba61ab9e09d6b60a455f4 ]
This handles PA Sync Lost event which previously was assumed to be
handled with BIG Sync Lost but their lifetime are not the same thus why
there are 2 different events to inform when each sync is lost.
Fixes: b2a5f2e1c127 ("Bluetooth: hci_event: Add support for handling LE BIG Sync Lost event")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|