summaryrefslogtreecommitdiff
path: root/net/unix/af_unix.c
AgeCommit message (Collapse)AuthorFilesLines
2023-10-27af_unix: Remove module remnants.Kuniyuki Iwashima1-19/+4
Since commit 97154bcf4d1b ("af_unix: Kconfig: make CONFIG_UNIX bool"), af_unix.c is no longer built as module. Let's remove unnecessary #if condition, exitcall, and module macros. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20231026212305.45545-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-11bpf: Implement cgroup sockaddr hooks for unix socketsDaan De Meyer1-1/+34
These hooks allows intercepting connect(), getsockname(), getpeername(), sendmsg() and recvmsg() for unix sockets. The unix socket hooks get write access to the address length because the address length is not fixed when dealing with unix sockets and needs to be modified when a unix socket address is modified by the hook. Because abstract socket unix addresses start with a NUL byte, we cannot recalculate the socket address in kernelspace after running the hook by calculating the length of the unix socket path using strlen(). These hooks can be used when users want to multiplex syscall to a single unix socket to multiple different processes behind the scenes by redirecting the connect() and other syscalls to process specific sockets. We do not implement support for intercepting bind() because when using bind() with unix sockets with a pathname address, this creates an inode in the filesystem which must be cleaned up. If we rewrite the address, the user might try to clean up the wrong file, leaking the socket in the filesystem where it is never cleaned up. Until we figure out a solution for this (and a use case for intercepting bind()), we opt to not allow rewriting the sockaddr in bind() calls. We also implement recvmsg() support for connected streams so that after a connect() that is modified by a sockaddr hook, any corresponding recmvsg() on the connected socket can also be modified to make the connected program think it is connected to the "intended" remote. Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Daan De Meyer <daan.j.demeyer@gmail.com> Link: https://lore.kernel.org/r/20231011185113.140426-5-daan.j.demeyer@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-09-04af_unix: Fix data-race around unix_tot_inflight.Kuniyuki Iwashima1-1/+1
unix_tot_inflight is changed under spin_lock(unix_gc_lock), but unix_release_sock() reads it locklessly. Let's use READ_ONCE() for unix_tot_inflight. Note that the writer side was marked by commit 9d6d7f1cb67c ("af_unix: annote lockless accesses to unix_tot_inflight & gc_in_progress") BUG: KCSAN: data-race in unix_inflight / unix_release_sock write (marked) to 0xffffffff871852b8 of 4 bytes by task 123 on cpu 1: unix_inflight+0x130/0x180 net/unix/scm.c:64 unix_attach_fds+0x137/0x1b0 net/unix/scm.c:123 unix_scm_to_skb net/unix/af_unix.c:1832 [inline] unix_dgram_sendmsg+0x46a/0x14f0 net/unix/af_unix.c:1955 sock_sendmsg_nosec net/socket.c:724 [inline] sock_sendmsg+0x148/0x160 net/socket.c:747 ____sys_sendmsg+0x4e4/0x610 net/socket.c:2493 ___sys_sendmsg+0xc6/0x140 net/socket.c:2547 __sys_sendmsg+0x94/0x140 net/socket.c:2576 __do_sys_sendmsg net/socket.c:2585 [inline] __se_sys_sendmsg net/socket.c:2583 [inline] __x64_sys_sendmsg+0x45/0x50 net/socket.c:2583 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x72/0xdc read to 0xffffffff871852b8 of 4 bytes by task 4891 on cpu 0: unix_release_sock+0x608/0x910 net/unix/af_unix.c:671 unix_release+0x59/0x80 net/unix/af_unix.c:1058 __sock_release+0x7d/0x170 net/socket.c:653 sock_close+0x19/0x30 net/socket.c:1385 __fput+0x179/0x5e0 fs/file_table.c:321 ____fput+0x15/0x20 fs/file_table.c:349 task_work_run+0x116/0x1a0 kernel/task_work.c:179 resume_user_mode_work include/linux/resume_user_mode.h:49 [inline] exit_to_user_mode_loop kernel/entry/common.c:171 [inline] exit_to_user_mode_prepare+0x174/0x180 kernel/entry/common.c:204 __syscall_exit_to_user_mode_work kernel/entry/common.c:286 [inline] syscall_exit_to_user_mode+0x1a/0x30 kernel/entry/common.c:297 do_syscall_64+0x4b/0x90 arch/x86/entry/common.c:86 entry_SYSCALL_64_after_hwframe+0x72/0xdc value changed: 0x00000000 -> 0x00000001 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 4891 Comm: systemd-coredum Not tainted 6.4.0-rc5-01219-gfa0e21fa4443 #5 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 Fixes: 9305cfa4443d ("[AF_UNIX]: Make unix_tot_inflight counter non-atomic") Reported-by: syzkaller <syzkaller@googlegroups.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-29net: add missing data-race annotations around sk->sk_peek_offEric Dumazet1-1/+1
sk_getsockopt() runs locklessly, thus we need to annotate the read of sk->sk_peek_off. While we are at it, add corresponding annotations to sk_set_peek_off() and unix_set_peek_off(). Fixes: b9bb53f3836f ("sock: convert sk_peek_offset functions to WRITE_ONCE") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-27af_unix: Terminate sun_path when bind()ing pathname socket.Kuniyuki Iwashima1-5/+16
kernel test robot reported slab-out-of-bounds access in strlen(). [0] Commit 06d4c8a80836 ("af_unix: Fix fortify_panic() in unix_bind_bsd().") removed unix_mkname_bsd() call in unix_bind_bsd(). If sunaddr->sun_path is not terminated by user and we don't enable CONFIG_INIT_STACK_ALL_ZERO=y, strlen() will do the out-of-bounds access during file creation. Let's go back to strlen()-with-sockaddr_storage way and pack all 108 trickiness into unix_mkname_bsd() with bold comments. [0]: BUG: KASAN: slab-out-of-bounds in strlen (lib/string.c:?) Read of size 1 at addr ffff000015492777 by task fortify_strlen_/168 CPU: 0 PID: 168 Comm: fortify_strlen_ Not tainted 6.5.0-rc1-00333-g3329b603ebba #16 Hardware name: linux,dummy-virt (DT) Call trace: dump_backtrace (arch/arm64/kernel/stacktrace.c:235) show_stack (arch/arm64/kernel/stacktrace.c:242) dump_stack_lvl (lib/dump_stack.c:107) print_report (mm/kasan/report.c:365 mm/kasan/report.c:475) kasan_report (mm/kasan/report.c:590) __asan_report_load1_noabort (mm/kasan/report_generic.c:378) strlen (lib/string.c:?) getname_kernel (./include/linux/fortify-string.h:? fs/namei.c:226) kern_path_create (fs/namei.c:3926) unix_bind (net/unix/af_unix.c:1221 net/unix/af_unix.c:1324) __sys_bind (net/socket.c:1792) __arm64_sys_bind (net/socket.c:1801) invoke_syscall (arch/arm64/kernel/syscall.c:? arch/arm64/kernel/syscall.c:52) el0_svc_common (./include/linux/thread_info.h:127 arch/arm64/kernel/syscall.c:147) do_el0_svc (arch/arm64/kernel/syscall.c:189) el0_svc (./arch/arm64/include/asm/daifflags.h:28 arch/arm64/kernel/entry-common.c:133 arch/arm64/kernel/entry-common.c:144 arch/arm64/kernel/entry-common.c:648) el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:?) el0t_64_sync (arch/arm64/kernel/entry.S:591) Allocated by task 168: kasan_set_track (mm/kasan/common.c:45 mm/kasan/common.c:52) kasan_save_alloc_info (mm/kasan/generic.c:512) __kasan_kmalloc (mm/kasan/common.c:383) __kmalloc (mm/slab_common.c:? mm/slab_common.c:998) unix_bind (net/unix/af_unix.c:257 net/unix/af_unix.c:1213 net/unix/af_unix.c:1324) __sys_bind (net/socket.c:1792) __arm64_sys_bind (net/socket.c:1801) invoke_syscall (arch/arm64/kernel/syscall.c:? arch/arm64/kernel/syscall.c:52) el0_svc_common (./include/linux/thread_info.h:127 arch/arm64/kernel/syscall.c:147) do_el0_svc (arch/arm64/kernel/syscall.c:189) el0_svc (./arch/arm64/include/asm/daifflags.h:28 arch/arm64/kernel/entry-common.c:133 arch/arm64/kernel/entry-common.c:144 arch/arm64/kernel/entry-common.c:648) el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:?) el0t_64_sync (arch/arm64/kernel/entry.S:591) The buggy address belongs to the object at ffff000015492700 which belongs to the cache kmalloc-128 of size 128 The buggy address is located 0 bytes to the right of allocated 119-byte region [ffff000015492700, ffff000015492777) The buggy address belongs to the physical page: page:00000000aeab52ba refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x55492 anon flags: 0x3fffc0000000200(slab|node=0|zone=0|lastcpupid=0xffff) page_type: 0xffffffff() raw: 03fffc0000000200 ffff0000084018c0 fffffc00003d0e00 0000000000000005 raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff000015492600: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff000015492680: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc >ffff000015492700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 07 fc ^ ffff000015492780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff000015492800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb Fixes: 06d4c8a80836 ("af_unix: Fix fortify_panic() in unix_bind_bsd().") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/netdev/202307262110.659e5e8-oliver.sang@intel.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20230726190828.47874-1-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-07-25af_unix: Fix fortify_panic() in unix_bind_bsd().Kuniyuki Iwashima1-4/+2
syzkaller found a bug in unix_bind_bsd() [0]. We can reproduce it by bind()ing a socket on a path with length 108. 108 is the size of sun_addr of struct sockaddr_un and is the maximum valid length for the pathname socket. When calling bind(), we use struct sockaddr_storage as the actual buffer size, so terminating sun_addr[108] with null is legitimate as done in unix_mkname_bsd(). However, strlen(sunaddr) for such a case causes fortify_panic() if CONFIG_FORTIFY_SOURCE=y. __fortify_strlen() has no idea about the actual buffer size and see the string as unterminated. Let's use strnlen() to allow sun_addr to be unterminated at 107. [0]: detected buffer overflow in __fortify_strlen kernel BUG at lib/string_helpers.c:1031! Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP Modules linked in: CPU: 0 PID: 255 Comm: syz-executor296 Not tainted 6.5.0-rc1-00330-g60cc1f7d0605 #4 Hardware name: linux,dummy-virt (DT) pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : fortify_panic+0x1c/0x20 lib/string_helpers.c:1030 lr : fortify_panic+0x1c/0x20 lib/string_helpers.c:1030 sp : ffff800089817af0 x29: ffff800089817af0 x28: ffff800089817b40 x27: 1ffff00011302f68 x26: 000000000000006e x25: 0000000000000012 x24: ffff800087e60140 x23: dfff800000000000 x22: ffff800089817c20 x21: ffff800089817c8e x20: 000000000000006c x19: ffff00000c323900 x18: ffff800086ab1630 x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000001 x14: 1ffff00011302eb8 x13: 0000000000000000 x12: 0000000000000000 x11: 0000000000000000 x10: 0000000000000000 x9 : 64a26b65474d2a00 x8 : 64a26b65474d2a00 x7 : 0000000000000001 x6 : 0000000000000001 x5 : ffff800089817438 x4 : ffff800086ac99e0 x3 : ffff800080f19e8c x2 : 0000000000000001 x1 : 0000000100000000 x0 : 000000000000002c Call trace: fortify_panic+0x1c/0x20 lib/string_helpers.c:1030 _Z16__fortify_strlenPKcU25pass_dynamic_object_size1 include/linux/fortify-string.h:217 [inline] unix_bind_bsd net/unix/af_unix.c:1212 [inline] unix_bind+0xba8/0xc58 net/unix/af_unix.c:1326 __sys_bind+0x1ac/0x248 net/socket.c:1792 __do_sys_bind net/socket.c:1803 [inline] __se_sys_bind net/socket.c:1801 [inline] __arm64_sys_bind+0x7c/0x94 net/socket.c:1801 __invoke_syscall arch/arm64/kernel/syscall.c:38 [inline] invoke_syscall+0x98/0x2c0 arch/arm64/kernel/syscall.c:52 el0_svc_common+0x134/0x240 arch/arm64/kernel/syscall.c:139 do_el0_svc+0x64/0x198 arch/arm64/kernel/syscall.c:188 el0_svc+0x2c/0x7c arch/arm64/kernel/entry-common.c:647 el0t_64_sync_handler+0x84/0xfc arch/arm64/kernel/entry-common.c:665 el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:591 Code: aa0003e1 d0000e80 91030000 97ffc91a (d4210000) Fixes: df8fc4e934c1 ("kbuild: Enable -fstrict-flex-arrays=3") Reported-by: syzkaller <syzkaller@googlegroups.com> Suggested-by: Kees Cook <keescook@chromium.org> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20230724213425.22920-2-kuniyu@amazon.com Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-27net: scm: introduce and use scm_recv_unix helperAlexander Mikhalitsyn1-2/+2
Recently, our friends from bluetooth subsystem reported [1] that after commit 5e2ff6704a27 ("scm: add SO_PASSPIDFD and SCM_PIDFD") scm_recv() helper become unusable in kernel modules (because it uses unexported pidfd_prepare() API). We were aware of this issue and workarounded it in a hard way by commit 97154bcf4d1b ("af_unix: Kconfig: make CONFIG_UNIX bool"). But recently a new functionality was added in the scope of commit 817efd3cad74 ("Bluetooth: hci_sock: Forward credentials to monitor") and after that bluetooth can't be compiled as a kernel module. After some discussion in [1] we decided to split scm_recv() into two helpers, one won't support SCM_PIDFD (used for unix sockets), and another one will be completely the same as it was before commit 5e2ff6704a27 ("scm: add SO_PASSPIDFD and SCM_PIDFD"). Link: https://lore.kernel.org/lkml/CAJqdLrpFcga4n7wxBhsFqPQiN8PKFVr6U10fKcJ9W7AcZn+o6Q@mail.gmail.com/ [1] Fixes: 5e2ff6704a27 ("scm: add SO_PASSPIDFD and SCM_PIDFD") Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20230627174314.67688-3-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-27Revert "af_unix: Call scm_recv() only after scm_set_cred()."Kuniyuki Iwashima1-1/+1
This reverts commit 3f5f118bb657f94641ea383c7c1b8c09a5d46ea2. Konrad reported that desktop environment below cannot be reached after commit 3f5f118bb657 ("af_unix: Call scm_recv() only after scm_set_cred().") - postmarketOS (Alpine Linux w/ musl 1.2.4) - busybox 1.36.1 - GNOME 44.1 - networkmanager 1.42.6 - openrc 0.47 Regarding to the warning of SO_PASSPIDFD, I'll post another patch to suppress it by skipping SCM_PIDFD if scm->pid == NULL in scm_pidfd_recv(). Reported-by: Konrad Dybcio <konradybcio@kernel.org> Link: https://lore.kernel.org/netdev/8c7f9abd-4f84-7296-2788-1e130d6304a0@kernel.org/ Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Tested-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Gal Pressman <gal@nvidia.com> Link: https://lore.kernel.org/r/20230626205837.82086-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-24sock: Remove ->sendpage*() in favour of sendmsg(MSG_SPLICE_PAGES)David Howells1-19/+0
Remove ->sendpage() and ->sendpage_locked(). sendmsg() with MSG_SPLICE_PAGES should be used instead. This allows multiple pages and multipage folios to be passed through. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> # for net/can cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> cc: linux-afs@lists.infradead.org cc: mptcp@lists.linux.dev cc: rds-devel@oss.oracle.com cc: tipc-discussion@lists.sourceforge.net cc: virtualization@lists.linux-foundation.org Link: https://lore.kernel.org/r/20230623225513.2732256-16-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-24af_unix: Call scm_recv() only after scm_set_cred().Kuniyuki Iwashima1-1/+1
syzkaller hit a WARN_ON_ONCE(!scm->pid) in scm_pidfd_recv(). In unix_stream_read_generic(), if there is no skb in the queue, we could bail out the do-while loop without calling scm_set_cred(): 1. No skb in the queue 2. sk is non-blocking or shutdown(sk, RCV_SHUTDOWN) is called concurrently or peer calls close() If the socket is configured with SO_PASSCRED or SO_PASSPIDFD, scm_recv() would populate cmsg with garbage. Let's not call scm_recv() unless there is skb to receive. WARNING: CPU: 1 PID: 3245 at include/net/scm.h:138 scm_pidfd_recv include/net/scm.h:138 [inline] WARNING: CPU: 1 PID: 3245 at include/net/scm.h:138 scm_recv.constprop.0+0x754/0x850 include/net/scm.h:177 Modules linked in: CPU: 1 PID: 3245 Comm: syz-executor.1 Not tainted 6.4.0-rc5-01219-gfa0e21fa4443 #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 RIP: 0010:scm_pidfd_recv include/net/scm.h:138 [inline] RIP: 0010:scm_recv.constprop.0+0x754/0x850 include/net/scm.h:177 Code: 67 fd e9 55 fd ff ff e8 4a 70 67 fd e9 7f fd ff ff e8 40 70 67 fd e9 3e fb ff ff e8 36 70 67 fd e9 02 fd ff ff e8 8c 3a 20 fd <0f> 0b e9 fe fb ff ff e8 50 70 67 fd e9 2e f9 ff ff e8 46 70 67 fd RSP: 0018:ffffc90009af7660 EFLAGS: 00010216 RAX: 00000000000000a1 RBX: ffff888041e58a80 RCX: ffffc90003852000 RDX: 0000000000040000 RSI: ffffffff842675b4 RDI: 0000000000000007 RBP: ffffc90009af7810 R08: 0000000000000007 R09: 0000000000000013 R10: 00000000000000f8 R11: 0000000000000001 R12: ffffc90009af7db0 R13: 0000000000000000 R14: ffff888041e58a88 R15: 1ffff9200135eecc FS: 00007f6b7113f640(0000) GS:ffff88806cf00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f6b7111de38 CR3: 0000000012a6e002 CR4: 0000000000770ee0 PKRU: 55555554 Call Trace: <TASK> unix_stream_read_generic+0x5fe/0x1f50 net/unix/af_unix.c:2830 unix_stream_recvmsg+0x194/0x1c0 net/unix/af_unix.c:2880 sock_recvmsg_nosec net/socket.c:1019 [inline] sock_recvmsg+0x188/0x1d0 net/socket.c:1040 ____sys_recvmsg+0x210/0x610 net/socket.c:2712 ___sys_recvmsg+0xff/0x190 net/socket.c:2754 do_recvmmsg+0x25d/0x6c0 net/socket.c:2848 __sys_recvmmsg net/socket.c:2927 [inline] __do_sys_recvmmsg net/socket.c:2950 [inline] __se_sys_recvmmsg net/socket.c:2943 [inline] __x64_sys_recvmmsg+0x224/0x290 net/socket.c:2943 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3f/0x90 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x72/0xdc RIP: 0033:0x7f6b71da2e5d Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 73 9f 1b 00 f7 d8 64 89 01 48 RSP: 002b:00007f6b7113ecc8 EFLAGS: 00000246 ORIG_RAX: 000000000000012b RAX: ffffffffffffffda RBX: 00000000004bc050 RCX: 00007f6b71da2e5d RDX: 0000000000000007 RSI: 0000000020006600 RDI: 000000000000000b RBP: 00000000004bc050 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000120 R11: 0000000000000246 R12: 0000000000000000 R13: 000000000000006e R14: 00007f6b71e03530 R15: 0000000000000000 </TASK> Fixes: 5e2ff6704a27 ("scm: add SO_PASSPIDFD and SCM_PIDFD") Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: syzkaller <syzkaller@googlegroups.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Alexander Mikhalitsyn <alexander@mihalicyn.com> Link: https://lore.kernel.org/r/20230622184351.91544-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-12net: core: add getsockopt SO_PEERPIDFDAlexander Mikhalitsyn1-0/+16
Add SO_PEERPIDFD which allows to get pidfd of peer socket holder pidfd. This thing is direct analog of SO_PEERCRED which allows to get plain PID. Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Leon Romanovsky <leon@kernel.org> Cc: David Ahern <dsahern@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Kees Cook <keescook@chromium.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Kuniyuki Iwashima <kuniyu@amazon.com> Cc: Lennart Poettering <mzxreary@0pointer.de> Cc: Luca Boccassi <bluca@debian.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Stanislav Fomichev <sdf@google.com> Cc: bpf@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: netdev@vger.kernel.org Cc: linux-arch@vger.kernel.org Reviewed-by: Christian Brauner <brauner@kernel.org> Acked-by: Stanislav Fomichev <sdf@google.com> Tested-by: Luca Boccassi <bluca@debian.org> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-12scm: add SO_PASSPIDFD and SCM_PIDFDAlexander Mikhalitsyn1-5/+13
Implement SCM_PIDFD, a new type of CMSG type analogical to SCM_CREDENTIALS, but it contains pidfd instead of plain pid, which allows programmers not to care about PID reuse problem. We mask SO_PASSPIDFD feature if CONFIG_UNIX is not builtin because it depends on a pidfd_prepare() API which is not exported to the kernel modules. Idea comes from UAPI kernel group: https://uapi-group.org/kernel-features/ Big thanks to Christian Brauner and Lennart Poettering for productive discussions about this. Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Leon Romanovsky <leon@kernel.org> Cc: David Ahern <dsahern@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Kees Cook <keescook@chromium.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Kuniyuki Iwashima <kuniyu@amazon.com> Cc: Lennart Poettering <mzxreary@0pointer.de> Cc: Luca Boccassi <bluca@debian.org> Cc: linux-kernel@vger.kernel.org Cc: netdev@vger.kernel.org Cc: linux-arch@vger.kernel.org Tested-by: Luca Boccassi <bluca@debian.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-5/+2
Cross-merge networking fixes after downstream PR. Conflicts: net/ipv4/raw.c 3632679d9e4f ("ipv{4,6}/raw: fix output xfrm lookup wrt protocol") c85be08fc4fa ("raw: Stop using RTO_ONLINK.") https://lore.kernel.org/all/20230525110037.2b532b83@canb.auug.org.au/ Adjacent changes: drivers/net/ethernet/freescale/fec_main.c 9025944fddfe ("net: fec: add dma_wmb to ensure correct descriptor values") 144470c88c5d ("net: fec: using the standard return codes when xdp xmit errors") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-23unix: Convert unix_stream_sendpage() to use MSG_SPLICE_PAGESDavid Howells1-127/+7
Convert unix_stream_sendpage() to use sendmsg() with MSG_SPLICE_PAGES rather than directly splicing in the pages itself. This allows ->sendpage() to be replaced by something that can handle multiple multipage folios in a single transaction. Signed-off-by: David Howells <dhowells@redhat.com> cc: Kuniyuki Iwashima <kuniyu@amazon.com> cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-23af_unix: Support MSG_SPLICE_PAGESDavid Howells1-16/+33
Make AF_UNIX sendmsg() support MSG_SPLICE_PAGES, splicing in pages from the source iterator if possible and copying the data in otherwise. This allows ->sendpage() to be replaced by something that can handle multiple multipage folios in a single transaction. Signed-off-by: David Howells <dhowells@redhat.com> cc: Kuniyuki Iwashima <kuniyu@amazon.com> cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-23net: Pass max frags into skb_append_pagefrags()David Howells1-1/+1
Pass the maximum number of fragments into skb_append_pagefrags() rather than using MAX_SKB_FRAGS so that it can be used from code that wants to specify sysctl_max_skb_frags. Signed-off-by: David Howells <dhowells@redhat.com> cc: David Ahern <dsahern@kernel.org> cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-23bpf, sockmap: Pass skb ownership through read_skbJohn Fastabend1-5/+2
The read_skb hook calls consume_skb() now, but this means that if the recv_actor program wants to use the skb it needs to inc the ref cnt so that the consume_skb() doesn't kfree the sk_buff. This is problematic because in some error cases under memory pressure we may need to linearize the sk_buff from sk_psock_skb_ingress_enqueue(). Then we get this, skb_linearize() __pskb_pull_tail() pskb_expand_head() BUG_ON(skb_shared(skb)) Because we incremented users refcnt from sk_psock_verdict_recv() we hit the bug on with refcnt > 1 and trip it. To fix lets simply pass ownership of the sk_buff through the skb_read call. Then we can drop the consume from read_skb handlers and assume the verdict recv does any required kfree. Bug found while testing in our CI which runs in VMs that hit memory constraints rather regularly. William tested TCP read_skb handlers. [ 106.536188] ------------[ cut here ]------------ [ 106.536197] kernel BUG at net/core/skbuff.c:1693! [ 106.536479] invalid opcode: 0000 [#1] PREEMPT SMP PTI [ 106.536726] CPU: 3 PID: 1495 Comm: curl Not tainted 5.19.0-rc5 #1 [ 106.537023] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.16.0-1 04/01/2014 [ 106.537467] RIP: 0010:pskb_expand_head+0x269/0x330 [ 106.538585] RSP: 0018:ffffc90000138b68 EFLAGS: 00010202 [ 106.538839] RAX: 000000000000003f RBX: ffff8881048940e8 RCX: 0000000000000a20 [ 106.539186] RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff8881048940e8 [ 106.539529] RBP: ffffc90000138be8 R08: 00000000e161fd1a R09: 0000000000000000 [ 106.539877] R10: 0000000000000018 R11: 0000000000000000 R12: ffff8881048940e8 [ 106.540222] R13: 0000000000000003 R14: 0000000000000000 R15: ffff8881048940e8 [ 106.540568] FS: 00007f277dde9f00(0000) GS:ffff88813bd80000(0000) knlGS:0000000000000000 [ 106.540954] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 106.541227] CR2: 00007f277eeede64 CR3: 000000000ad3e000 CR4: 00000000000006e0 [ 106.541569] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 106.541915] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 106.542255] Call Trace: [ 106.542383] <IRQ> [ 106.542487] __pskb_pull_tail+0x4b/0x3e0 [ 106.542681] skb_ensure_writable+0x85/0xa0 [ 106.542882] sk_skb_pull_data+0x18/0x20 [ 106.543084] bpf_prog_b517a65a242018b0_bpf_skskb_http_verdict+0x3a9/0x4aa9 [ 106.543536] ? migrate_disable+0x66/0x80 [ 106.543871] sk_psock_verdict_recv+0xe2/0x310 [ 106.544258] ? sk_psock_write_space+0x1f0/0x1f0 [ 106.544561] tcp_read_skb+0x7b/0x120 [ 106.544740] tcp_data_queue+0x904/0xee0 [ 106.544931] tcp_rcv_established+0x212/0x7c0 [ 106.545142] tcp_v4_do_rcv+0x174/0x2a0 [ 106.545326] tcp_v4_rcv+0xe70/0xf60 [ 106.545500] ip_protocol_deliver_rcu+0x48/0x290 [ 106.545744] ip_local_deliver_finish+0xa7/0x150 Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()") Reported-by: William Findlay <will@isovalent.com> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: William Findlay <will@isovalent.com> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20230523025618.113937-2-john.fastabend@gmail.com
2023-05-10af_unix: Fix data races around sk->sk_shutdown.Kuniyuki Iwashima1-8/+12
KCSAN found a data race around sk->sk_shutdown where unix_release_sock() and unix_shutdown() update it under unix_state_lock(), OTOH unix_poll() and unix_dgram_poll() read it locklessly. We need to annotate the writes and reads with WRITE_ONCE() and READ_ONCE(). BUG: KCSAN: data-race in unix_poll / unix_release_sock write to 0xffff88800d0f8aec of 1 bytes by task 264 on cpu 0: unix_release_sock+0x75c/0x910 net/unix/af_unix.c:631 unix_release+0x59/0x80 net/unix/af_unix.c:1042 __sock_release+0x7d/0x170 net/socket.c:653 sock_close+0x19/0x30 net/socket.c:1397 __fput+0x179/0x5e0 fs/file_table.c:321 ____fput+0x15/0x20 fs/file_table.c:349 task_work_run+0x116/0x1a0 kernel/task_work.c:179 resume_user_mode_work include/linux/resume_user_mode.h:49 [inline] exit_to_user_mode_loop kernel/entry/common.c:171 [inline] exit_to_user_mode_prepare+0x174/0x180 kernel/entry/common.c:204 __syscall_exit_to_user_mode_work kernel/entry/common.c:286 [inline] syscall_exit_to_user_mode+0x1a/0x30 kernel/entry/common.c:297 do_syscall_64+0x4b/0x90 arch/x86/entry/common.c:86 entry_SYSCALL_64_after_hwframe+0x72/0xdc read to 0xffff88800d0f8aec of 1 bytes by task 222 on cpu 1: unix_poll+0xa3/0x2a0 net/unix/af_unix.c:3170 sock_poll+0xcf/0x2b0 net/socket.c:1385 vfs_poll include/linux/poll.h:88 [inline] ep_item_poll.isra.0+0x78/0xc0 fs/eventpoll.c:855 ep_send_events fs/eventpoll.c:1694 [inline] ep_poll fs/eventpoll.c:1823 [inline] do_epoll_wait+0x6c4/0xea0 fs/eventpoll.c:2258 __do_sys_epoll_wait fs/eventpoll.c:2270 [inline] __se_sys_epoll_wait fs/eventpoll.c:2265 [inline] __x64_sys_epoll_wait+0xcc/0x190 fs/eventpoll.c:2265 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x72/0xdc value changed: 0x00 -> 0x03 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 222 Comm: dbus-broker Not tainted 6.3.0-rc7-02330-gca6270c12e20 #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 Fixes: 3c73419c09a5 ("af_unix: fix 'poll for write'/ connected DGRAM sockets") Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-10af_unix: Fix a data race of sk->sk_receive_queue->qlen.Kuniyuki Iwashima1-1/+1
KCSAN found a data race of sk->sk_receive_queue->qlen where recvmsg() updates qlen under the queue lock and sendmsg() checks qlen under unix_state_sock(), not the queue lock, so the reader side needs READ_ONCE(). BUG: KCSAN: data-race in __skb_try_recv_from_queue / unix_wait_for_peer write (marked) to 0xffff888019fe7c68 of 4 bytes by task 49792 on cpu 0: __skb_unlink include/linux/skbuff.h:2347 [inline] __skb_try_recv_from_queue+0x3de/0x470 net/core/datagram.c:197 __skb_try_recv_datagram+0xf7/0x390 net/core/datagram.c:263 __unix_dgram_recvmsg+0x109/0x8a0 net/unix/af_unix.c:2452 unix_dgram_recvmsg+0x94/0xa0 net/unix/af_unix.c:2549 sock_recvmsg_nosec net/socket.c:1019 [inline] ____sys_recvmsg+0x3a3/0x3b0 net/socket.c:2720 ___sys_recvmsg+0xc8/0x150 net/socket.c:2764 do_recvmmsg+0x182/0x560 net/socket.c:2858 __sys_recvmmsg net/socket.c:2937 [inline] __do_sys_recvmmsg net/socket.c:2960 [inline] __se_sys_recvmmsg net/socket.c:2953 [inline] __x64_sys_recvmmsg+0x153/0x170 net/socket.c:2953 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x72/0xdc read to 0xffff888019fe7c68 of 4 bytes by task 49793 on cpu 1: skb_queue_len include/linux/skbuff.h:2127 [inline] unix_recvq_full net/unix/af_unix.c:229 [inline] unix_wait_for_peer+0x154/0x1a0 net/unix/af_unix.c:1445 unix_dgram_sendmsg+0x13bc/0x14b0 net/unix/af_unix.c:2048 sock_sendmsg_nosec net/socket.c:724 [inline] sock_sendmsg+0x148/0x160 net/socket.c:747 ____sys_sendmsg+0x20e/0x620 net/socket.c:2503 ___sys_sendmsg+0xc6/0x140 net/socket.c:2557 __sys_sendmmsg+0x11d/0x370 net/socket.c:2643 __do_sys_sendmmsg net/socket.c:2672 [inline] __se_sys_sendmmsg net/socket.c:2669 [inline] __x64_sys_sendmmsg+0x58/0x70 net/socket.c:2669 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x72/0xdc value changed: 0x0000000b -> 0x00000001 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 49793 Comm: syz-executor.0 Not tainted 6.3.0-rc7-02330-gca6270c12e20 #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-17af_unix: annotate lockless accesses to sk->sk_errEric Dumazet1-4/+5
unix_poll() and unix_dgram_poll() read sk->sk_err without any lock held. Add relevant READ_ONCE()/WRITE_ONCE() annotations. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-08af_unix: fix struct pid leaks in OOB supportEric Dumazet1-2/+8
syzbot reported struct pid leak [1]. Issue is that queue_oob() calls maybe_add_creds() which potentially holds a reference on a pid. But skb->destructor is not set (either directly or by calling unix_scm_to_skb()) This means that subsequent kfree_skb() or consume_skb() would leak this reference. In this fix, I chose to fully support scm even for the OOB message. [1] BUG: memory leak unreferenced object 0xffff8881053e7f80 (size 128): comm "syz-executor242", pid 5066, jiffies 4294946079 (age 13.220s) hex dump (first 32 bytes): 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffff812ae26a>] alloc_pid+0x6a/0x560 kernel/pid.c:180 [<ffffffff812718df>] copy_process+0x169f/0x26c0 kernel/fork.c:2285 [<ffffffff81272b37>] kernel_clone+0xf7/0x610 kernel/fork.c:2684 [<ffffffff812730cc>] __do_sys_clone+0x7c/0xb0 kernel/fork.c:2825 [<ffffffff849ad699>] do_syscall_x64 arch/x86/entry/common.c:50 [inline] [<ffffffff849ad699>] do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80 [<ffffffff84a0008b>] entry_SYSCALL_64_after_hwframe+0x63/0xcd Fixes: 314001f0bf92 ("af_unix: Add OOB support") Reported-by: syzbot+7699d9e5635c10253a27@syzkaller.appspotmail.com Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Rao Shoaib <rao.shoaib@oracle.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20230307164530.771896-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-21Merge tag 'net-next-6.3' of ↵Linus Torvalds1-10/+11
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core: - Add dedicated kmem_cache for typical/small skb->head, avoid having to access struct page at kfree time, and improve memory use. - Introduce sysctl to set default RPS configuration for new netdevs. - Define Netlink protocol specification format which can be used to describe messages used by each family and auto-generate parsers. Add tools for generating kernel data structures and uAPI headers. - Expose all net/core sysctls inside netns. - Remove 4s sleep in netpoll if carrier is instantly detected on boot. - Add configurable limit of MDB entries per port, and port-vlan. - Continue populating drop reasons throughout the stack. - Retire a handful of legacy Qdiscs and classifiers. Protocols: - Support IPv4 big TCP (TSO frames larger than 64kB). - Add IP_LOCAL_PORT_RANGE socket option, to control local port range on socket by socket basis. - Track and report in procfs number of MPTCP sockets used. - Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP path manager. - IPv6: don't check net.ipv6.route.max_size and rely on garbage collection to free memory (similarly to IPv4). - Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986). - ICMP: add per-rate limit counters. - Add support for user scanning requests in ieee802154. - Remove static WEP support. - Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate reporting. - WiFi 7 EHT channel puncturing support (client & AP). BPF: - Add a rbtree data structure following the "next-gen data structure" precedent set by recently added linked list, that is, by using kfunc + kptr instead of adding a new BPF map type. - Expose XDP hints via kfuncs with initial support for RX hash and timestamp metadata. - Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to better support decap on GRE tunnel devices not operating in collect metadata. - Improve x86 JIT's codegen for PROBE_MEM runtime error checks. - Remove the need for trace_printk_lock for bpf_trace_printk and bpf_trace_vprintk helpers. - Extend libbpf's bpf_tracing.h support for tracing arguments of kprobes/uprobes and syscall as a special case. - Significantly reduce the search time for module symbols by livepatch and BPF. - Enable cpumasks to be used as kptrs, which is useful for tracing programs tracking which tasks end up running on which CPUs in different time intervals. - Add support for BPF trampoline on s390x and riscv64. - Add capability to export the XDP features supported by the NIC. - Add __bpf_kfunc tag for marking kernel functions as kfuncs. - Add cgroup.memory=nobpf kernel parameter option to disable BPF memory accounting for container environments. Netfilter: - Remove the CLUSTERIP target. It has been marked as obsolete for years, and we still have WARN splats wrt races of the out-of-band /proc interface installed by this target. - Add 'destroy' commands to nf_tables. They are identical to the existing 'delete' commands, but do not return an error if the referenced object (set, chain, rule...) did not exist. Driver API: - Improve cpumask_local_spread() locality to help NICs set the right IRQ affinity on AMD platforms. - Separate C22 and C45 MDIO bus transactions more clearly. - Introduce new DCB table to control DSCP rewrite on egress. - Support configuration of Physical Layer Collision Avoidance (PLCA) Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of shared medium Ethernet. - Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing preemption of low priority frames by high priority frames. - Add support for controlling MACSec offload using netlink SET. - Rework devlink instance refcounts to allow registration and de-registration under the instance lock. Split the code into multiple files, drop some of the unnecessarily granular locks and factor out common parts of netlink operation handling. - Add TX frame aggregation parameters (for USB drivers). - Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning messages with notifications for debug. - Allow offloading of UDP NEW connections via act_ct. - Add support for per action HW stats in TC. - Support hardware miss to TC action (continue processing in SW from a specific point in the action chain). - Warn if old Wireless Extension user space interface is used with modern cfg80211/mac80211 drivers. Do not support Wireless Extensions for Wi-Fi 7 devices at all. Everyone should switch to using nl80211 interface instead. - Improve the CAN bit timing configuration. Use extack to return error messages directly to user space, update the SJW handling, including the definition of a new default value that will benefit CAN-FD controllers, by increasing their oscillator tolerance. New hardware / drivers: - Ethernet: - nVidia BlueField-3 support (control traffic driver) - Ethernet support for imx93 SoCs - Motorcomm yt8531 gigabit Ethernet PHY - onsemi NCN26000 10BASE-T1S PHY (with support for PLCA) - Microchip LAN8841 PHY (incl. cable diagnostics and PTP) - Amlogic gxl MDIO mux - WiFi: - RealTek RTL8188EU (rtl8xxxu) - Qualcomm Wi-Fi 7 devices (ath12k) - CAN: - Renesas R-Car V4H Drivers: - Bluetooth: - Set Per Platform Antenna Gain (PPAG) for Intel controllers. - Ethernet NICs: - Intel (1G, igc): - support TSN / Qbv / packet scheduling features of i226 model - Intel (100G, ice): - use GNSS subsystem instead of TTY - multi-buffer XDP support - extend support for GPIO pins to E823 devices - nVidia/Mellanox: - update the shared buffer configuration on PFC commands - implement PTP adjphase function for HW offset control - TC support for Geneve and GRE with VF tunnel offload - more efficient crypto key management method - multi-port eswitch support - Netronome/Corigine: - add DCB IEEE support - support IPsec offloading for NFP3800 - Freescale/NXP (enetc): - support XDP_REDIRECT for XDP non-linear buffers - improve reconfig, avoid link flap and waiting for idle - support MAC Merge layer - Other NICs: - sfc/ef100: add basic devlink support for ef100 - ionic: rx_push mode operation (writing descriptors via MMIO) - bnxt: use the auxiliary bus abstraction for RDMA - r8169: disable ASPM and reset bus in case of tx timeout - cpsw: support QSGMII mode for J721e CPSW9G - cpts: support pulse-per-second output - ngbe: add an mdio bus driver - usbnet: optimize usbnet_bh() by avoiding unnecessary queuing - r8152: handle devices with FW with NCM support - amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation - virtio-net: support multi buffer XDP - virtio/vsock: replace virtio_vsock_pkt with sk_buff - tsnep: XDP support - Ethernet high-speed switches: - nVidia/Mellanox (mlxsw): - add support for latency TLV (in FW control messages) - Microchip (sparx5): - separate explicit and implicit traffic forwarding rules, make the implicit rules always active - add support for egress DSCP rewrite - IS0 VCAP support (Ingress Classification) - IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS etc.) - ES2 VCAP support (Egress Access Control) - support for Per-Stream Filtering and Policing (802.1Q, 8.6.5.1) - Ethernet embedded switches: - Marvell (mv88e6xxx): - add MAB (port auth) offload support - enable PTP receive for mv88e6390 - NXP (ocelot): - support MAC Merge layer - support for the the vsc7512 internal copper phys - Microchip: - lan9303: convert to PHYLINK - lan966x: support TC flower filter statistics - lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x - lan937x: support Credit Based Shaper configuration - ksz9477: support Energy Efficient Ethernet - other: - qca8k: convert to regmap read/write API, use bulk operations - rswitch: Improve TX timestamp accuracy - Intel WiFi (iwlwifi): - EHT (Wi-Fi 7) rate reporting - STEP equalizer support: transfer some STEP (connection to radio on platforms with integrated wifi) related parameters from the BIOS to the firmware. - Qualcomm 802.11ax WiFi (ath11k): - IPQ5018 support - Fine Timing Measurement (FTM) responder role support - channel 177 support - MediaTek WiFi (mt76): - per-PHY LED support - mt7996: EHT (Wi-Fi 7) support - Wireless Ethernet Dispatch (WED) reset support - switch to using page pool allocator - RealTek WiFi (rtw89): - support new version of Bluetooth co-existance - Mobile: - rmnet: support TX aggregation" * tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1872 commits) page_pool: add a comment explaining the fragment counter usage net: ethtool: fix __ethtool_dev_mm_supported() implementation ethtool: pse-pd: Fix double word in comments xsk: add linux/vmalloc.h to xsk.c sefltests: netdevsim: wait for devlink instance after netns removal selftest: fib_tests: Always cleanup before exit net/mlx5e: Align IPsec ASO result memory to be as required by hardware net/mlx5e: TC, Set CT miss to the specific ct action instance net/mlx5e: Rename CHAIN_TO_REG to MAPPED_OBJ_TO_REG net/mlx5: Refactor tc miss handling to a single function net/mlx5: Kconfig: Make tc offload depend on tc skb extension net/sched: flower: Support hardware miss to tc action net/sched: flower: Move filter handle initialization earlier net/sched: cls_api: Support hardware miss to tc action net/sched: Rename user cookie and act cookie sfc: fix builds without CONFIG_RTC_LIB sfc: clean up some inconsistent indentings net/mlx4_en: Introduce flexible array to silence overflow warning net: lan966x: Fix possible deadlock inside PTP net/ulp: Remove redundant ->clone() test in inet_clone_ulp(). ...
2023-01-27net: add missing includes of linux/splice.hJakub Kicinski1-0/+1
Number of files depend on linux/splice.h getting included by linux/skbuff.h which soon will no longer be the case. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-18fs: port vfs_*() helpers to struct mnt_idmapChristian Brauner1-4/+4
Convert to struct mnt_idmap. Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevent on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-16unix: Improve locking scheme in unix_show_fdinfo()Kirill Tkhai1-10/+10
After switching to TCP_ESTABLISHED or TCP_LISTEN sk_state, alive SOCK_STREAM and SOCK_SEQPACKET sockets can't change it anymore (since commit 3ff8bff704f4 "unix: Fix race in SOCK_SEQPACKET's unix_dgram_sendmsg()"). Thus, we do not need to take lock here. Signed-off-by: Kirill Tkhai <tkhai@ya.ru> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-15unix: Fix race in SOCK_SEQPACKET's unix_dgram_sendmsg()Kirill Tkhai1-2/+9
There is a race resulting in alive SOCK_SEQPACKET socket may change its state from TCP_ESTABLISHED to TCP_CLOSE: unix_release_sock(peer) unix_dgram_sendmsg(sk) sock_orphan(peer) sock_set_flag(peer, SOCK_DEAD) sock_alloc_send_pskb() if !(sk->sk_shutdown & SEND_SHUTDOWN) OK if sock_flag(peer, SOCK_DEAD) sk->sk_state = TCP_CLOSE sk->sk_shutdown = SHUTDOWN_MASK After that socket sk remains almost normal: it is able to connect, listen, accept and recvmsg, while it can't sendmsg. Since this is the only possibility for alive SOCK_SEQPACKET to change the state in such way, we should better fix this strange and potentially danger corner case. Note, that we will return EPIPE here like this is normally done in sock_alloc_send_pskb(). Originally used ECONNREFUSED looks strange, since it's strange to return a specific retval in dependence of race in kernel, when user can't affect on this. Also, move TCP_CLOSE assignment for SOCK_DGRAM sockets under state lock to fix race with unix_dgram_connect(): unix_dgram_connect(other) unix_dgram_sendmsg(sk) unix_peer(sk) = NULL unix_state_unlock(sk) unix_state_double_lock(sk, other) sk->sk_state = TCP_ESTABLISHED unix_peer(sk) = other unix_state_double_unlock(sk, other) sk->sk_state = TCP_CLOSED This patch fixes both of these races. Fixes: 83301b5367a9 ("af_unix: Set TCP_ESTABLISHED for datagram sockets too") Signed-off-by: Kirill Tkhai <tkhai@ya.ru> Link: https://lore.kernel.org/r/135fda25-22d5-837a-782b-ceee50e19844@ya.ru Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-12-12af_unix: call proto_unregister() in the error path in af_unix_init()Yang Yingliang1-0/+1
If register unix_stream_proto returns error, unix_dgram_proto needs be unregistered. Fixes: 94531cfcbe79 ("af_unix: Add unix_stream_proto for sockmap") Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-11