summaryrefslogtreecommitdiff
path: root/net/core
AgeCommit message (Collapse)AuthorFilesLines
2021-04-21gro: ensure frag0 meets IP header alignmentEric Dumazet1-1/+2
commit 38ec4944b593fd90c5ef42aaaa53e66ae5769d04 upstream. After commit 0f6925b3e8da ("virtio_net: Do not pull payload in skb->head") Guenter Roeck reported one failure in his tests using sh architecture. After much debugging, we have been able to spot silent unaligned accesses in inet_gro_receive() The issue at hand is that upper networking stacks assume their header is word-aligned. Low level drivers are supposed to reserve NET_IP_ALIGN bytes before the Ethernet header to make that happen. This patch hardens skb_gro_reset_offset() to not allow frag0 fast-path if the fragment is not properly aligned. Some arches like x86, arm64 and powerpc do not care and define NET_IP_ALIGN as 0, this extra check will be a NOP for them. Note that if frag0 is not used, GRO will call pskb_may_pull() as many times as needed to pull network and transport headers. Fixes: 0f6925b3e8da ("virtio_net: Do not pull payload in skb->head") Fixes: 78a478d0efd9 ("gro: Inline skb_gro_header and cache frag0 virtual address") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Guenter Roeck <linux@roeck-us.net> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-04-21neighbour: Disregard DEAD dst in neigh_updateTong Zhu1-1/+1
[ Upstream commit d47ec7a0a7271dda08932d6208e4ab65ab0c987c ] After a short network outage, the dst_entry is timed out and put in DST_OBSOLETE_DEAD. We are in this code because arp reply comes from this neighbour after network recovers. There is a potential race condition that dst_entry is still in DST_OBSOLETE_DEAD. With that, another neighbour lookup causes more harm than good. In best case all packets in arp_queue are lost. This is counterproductive to the original goal of finding a better path for those packets. I observed a worst case with 4.x kernel where a dst_entry in DST_OBSOLETE_DEAD state is associated with loopback net_device. It leads to an ethernet header with all zero addresses. A packet with all zero source MAC address is quite deadly with mac80211, ath9k and 802.11 block ack. It fails ieee80211_find_sta_by_ifaddr in ath9k (xmit.c). Ath9k flushes tx queue (ath_tx_complete_aggr). BAW (block ack window) is not updated. BAW logic is damaged and ath9k transmission is disabled. Signed-off-by: Tong Zhu <zhutong@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-04-14xdp: fix xdp_return_frame() kernel BUG throw for page_pool memory modelOng Boon Leong1-1/+2
[ Upstream commit 622d13694b5f048c01caa7ba548498d9880d4cb0 ] xdp_return_frame() may be called outside of NAPI context to return xdpf back to page_pool. xdp_return_frame() calls __xdp_return() with napi_direct = false. For page_pool memory model, __xdp_return() calls xdp_return_frame_no_direct() unconditionally and below false negative kernel BUG throw happened under preempt-rt build: [ 430.450355] BUG: using smp_processor_id() in preemptible [00000000] code: modprobe/3884 [ 430.451678] caller is __xdp_return+0x1ff/0x2e0 [ 430.452111] CPU: 0 PID: 3884 Comm: modprobe Tainted: G U E 5.12.0-rc2+ #45 Changes in v2: - This patch fixes the issue by making xdp_return_frame_no_direct() is only called if napi_direct = true, as recommended for better by Jesper Dangaard Brouer. Thanks! Fixes: 2539650fadbf ("xdp: Helpers for disabling napi_direct of xdp_return_frame") Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-04-14net: let skb_orphan_partial wake-up waiters.Paolo Abeni1-9/+3
commit 9adc89af724f12a03b47099cd943ed54e877cd59 upstream. Currently the mentioned helper can end-up freeing the socket wmem without waking-up any processes waiting for more write memory. If the partially orphaned skb is attached to an UDP (or raw) socket, the lack of wake-up can hang the user-space. Even for TCP sockets not calling the sk destructor could have bad effects on TSQ. Address the issue using skb_orphan to release the sk wmem before setting the new sock_efree destructor. Additionally bundle the whole ownership update in a new helper, so that later other potential users could avoid duplicate code. v1 -> v2: - use skb_orphan() instead of sort of open coding it (Eric) - provide an helper for the ownership change (Eric) Fixes: f6ba8d33cfbb ("netem: fix skb_orphan_partial()") Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-04-14bpf, sockmap: Fix incorrect fwd_alloc accountingJohn Fastabend1-7/+5
commit 144748eb0c445091466c9b741ebd0bfcc5914f3d upstream. Incorrect accounting fwd_alloc can result in a warning when the socket is torn down, [18455.319240] WARNING: CPU: 0 PID: 24075 at net/core/stream.c:208 sk_stream_kill_queues+0x21f/0x230 [...] [18455.319543] Call Trace: [18455.319556] inet_csk_destroy_sock+0xba/0x1f0 [18455.319577] tcp_rcv_state_process+0x1b4e/0x2380 [18455.319593] ? lock_downgrade+0x3a0/0x3a0 [18455.319617] ? tcp_finish_connect+0x1e0/0x1e0 [18455.319631] ? sk_reset_timer+0x15/0x70 [18455.319646] ? tcp_schedule_loss_probe+0x1b2/0x240 [18455.319663] ? lock_release+0xb2/0x3f0 [18455.319676] ? __release_sock+0x8a/0x1b0 [18455.319690] ? lock_downgrade+0x3a0/0x3a0 [18455.319704] ? lock_release+0x3f0/0x3f0 [18455.319717] ? __tcp_close+0x2c6/0x790 [18455.319736] ? tcp_v4_do_rcv+0x168/0x370 [18455.319750] tcp_v4_do_rcv+0x168/0x370 [18455.319767] __release_sock+0xbc/0x1b0 [18455.319785] __tcp_close+0x2ee/0x790 [18455.319805] tcp_close+0x20/0x80 This currently happens because on redirect case we do skb_set_owner_r() with the original sock. This increments the fwd_alloc memory accounting on the original sock. Then on redirect we may push this into the queue of the psock we are redirecting to. When the skb is flushed from the queue we give the memory back to the original sock. The problem is if the original sock is destroyed/closed with skbs on another psocks queue then the original sock will not have a way to reclaim the memory before being destroyed. Then above warning will be thrown sockA sockB sk_psock_strp_read() sk_psock_verdict_apply() -- SK_REDIRECT -- sk_psock_skb_redirect() skb_queue_tail(psock_other->ingress_skb..) sk_close() sock_map_unref() sk_psock_put() sk_psock_drop() sk_psock_zap_ingress() At this point we have torn down our own psock, but have the outstanding skb in psock_other. Note that SK_PASS doesn't have this problem because the sk_psock_drop() logic releases the skb, its still associated with our psock. To resolve lets only account for sockets on the ingress queue that are still associated with the current socket. On the redirect case we will check memory limits per 6fa9201a89898, but will omit fwd_alloc accounting until skb is actually enqueued. When the skb is sent via skb_send_sock_locked or received with sk_psock_skb_ingress memory will be claimed on psock_other. Fixes: 6fa9201a89898 ("bpf, sockmap: Avoid returning unneeded EAGAIN when redirecting to self") Reported-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/161731444013.68884.4021114312848535993.stgit@john-XPS-13-9370 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-04-07bpf: Remove MTU check in __bpf_skb_max_lenJesper Dangaard Brouer1-8/+4
commit 6306c1189e77a513bf02720450bb43bd4ba5d8ae upstream. Multiple BPF-helpers that can manipulate/increase the size of the SKB uses __bpf_skb_max_len() as the max-length. This function limit size against the current net_device MTU (skb->dev->mtu). When a BPF-prog grow the packet size, then it should not be limited to the MTU. The MTU is a transmit limitation, and software receiving this packet should be allowed to increase the size. Further more, current MTU check in __bpf_skb_max_len uses the MTU from ingress/current net_device, which in case of redirects uses the wrong net_device. This patch keeps a sanity max limit of SKB_MAX_ALLOC (16KiB). The real limit is elsewhere in the system. Jesper's testing[1] showed it was not possible to exceed 8KiB when expanding the SKB size via BPF-helper. The limiting factor is the define KMALLOC_MAX_CACHE_SIZE which is 8192 for SLUB-allocator (CONFIG_SLUB) in-case PAGE_SIZE is 4096. This define is in-effect due to this being called from softirq context see code __gfp_pfmemalloc_flags() and __do_kmalloc_node(). Jakub's testing showed that frames above 16KiB can cause NICs to reset (but not crash). Keep this sanity limit at this level as memory layer can differ based on kernel config. [1] https://github.com/xdp-project/bpf-examples/tree/master/MTU-tests Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/161287788936.790810.2937823995775097177.stgit@firesoul Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-04-07flow_dissector: fix TTL and TOS dissection on IPv4 fragmentsDavide Caratti1-3/+3
[ Upstream commit d2126838050ccd1dadf310ffb78b2204f3b032b9 ] the following command: # tc filter add dev $h2 ingress protocol ip pref 1 handle 101 flower \ $tcflags dst_ip 192.0.2.2 ip_ttl 63 action drop doesn't drop all IPv4 packets that match the configured TTL / destination address. In particular, if "fragment offset" or "more fragments" have non zero value in the IPv4 header, setting of FLOW_DISSECTOR_KEY_IP is simply ignored. Fix this dissecting IPv4 TTL and TOS before fragment info; while at it, add a selftest for tc flower's match on 'ip_ttl' that verifies the correct behavior. Fixes: 518d8a2e9bad ("net/flow_dissector: add support for dissection of misc ip header fields") Reported-by: Shuang Li <shuali@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-30can: dev: Move device back to init netns on owning netns deleteMartin Willi1-1/+1
commit 3a5ca857079ea022e0b1b17fc154f7ad7dbc150f upstream. When a non-initial netns is destroyed, the usual policy is to delete all virtual network interfaces contained, but move physical interfaces back to the initial netns. This keeps the physical interface visible on the system. CAN devices are somewhat special, as they define rtnl_link_ops even if they are physical devices. If a CAN interface is moved into a non-initial netns, destroying that netns lets the interface vanish instead of moving it back to the initial netns. default_device_exit() skips CAN interfaces due to having rtnl_link_ops set. Reproducer: ip netns add foo ip link set can0 netns foo ip netns delete foo WARNING: CPU: 1 PID: 84 at net/core/dev.c:11030 ops_exit_list+0x38/0x60 CPU: 1 PID: 84 Comm: kworker/u4:2 Not tainted 5.10.19 #1 Workqueue: netns cleanup_net [<c010e700>] (unwind_backtrace) from [<c010a1d8>] (show_stack+0x10/0x14) [<c010a1d8>] (show_stack) from [<c086dc10>] (dump_stack+0x94/0xa8) [<c086dc10>] (dump_stack) from [<c086b938>] (__warn+0xb8/0x114) [<c086b938>] (__warn) from [<c086ba10>] (warn_slowpath_fmt+0x7c/0xac) [<c086ba10>] (warn_slowpath_fmt) from [<c0629f20>] (ops_exit_list+0x38/0x60) [<c0629f20>] (ops_exit_list) from [<c062a5c4>] (cleanup_net+0x230/0x380) [<c062a5c4>] (cleanup_net) from [<c0142c20>] (process_one_work+0x1d8/0x438) [<c0142c20>] (process_one_work) from [<c0142ee4>] (worker_thread+0x64/0x5a8) [<c0142ee4>] (worker_thread) from [<c0148a98>] (kthread+0x148/0x14c) [<c0148a98>] (kthread) from [<c0100148>] (ret_from_fork+0x14/0x2c) To properly restore physical CAN devices to the initial netns on owning netns exit, introduce a flag on rtnl_link_ops that can be set by drivers. For CAN devices setting this flag, default_device_exit() considers them non-virtual, applying the usual namespace move. The issue was introduced in the commit mentioned below, as at that time CAN devices did not have a dellink() operation. Fixes: e008b5fc8dc7 ("net: Simplfy default_device_exit and improve batching.") Link: https://lore.kernel.org/r/20210302122423.872326-1-martin@strongswan.org Signed-off-by: Martin Willi <martin@strongswan.org> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-30net, bpf: Fix ip6ip6 crash with collect_md populated skbsDaniel Borkmann1-22/+9
[ Upstream commit a188bb5638d41aa99090ebf2f85d3505ab13fba5 ] I ran into a crash where setting up a ip6ip6 tunnel device which was /not/ set to collect_md mode was receiving collect_md populated skbs for xmit. The BPF prog was populating the skb via bpf_skb_set_tunnel_key() which is assigning special metadata dst entry and then redirecting the skb to the device, taking ip6_tnl_start_xmit() -> ipxip6_tnl_xmit() -> ip6_tnl_xmit() and in the latter it performs a neigh lookup based on skb_dst(skb) where we trigger a NULL pointer dereference on dst->ops->neigh_lookup() since the md_dst_ops do not populate neigh_lookup callback with a fake handler. Transform the md_dst_ops into generic dst_blackhole_ops that can also be reused elsewhere when needed, and use them for the metadata dst entries as callback ops. Also, remove the dst_md_discard{,_out}() ops and rely on dst_discard{,_out}() from dst_init() which free the skb the same way modulo the splat. Given we will be able to recover just fine from there, avoid any potential splats iff this gets ever triggered in future (or worse, panic on warns when set). Fixes: f38a9eb1f77b ("dst: Metadata destinations") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-30net: Consolidate common blackhole dst opsDaniel Borkmann1-0/+38
[ Upstream commit c4c877b2732466b4c63217baad05c96f775912c7 ] Move generic blackhole dst ops to the core and use them from both ipv4_dst_blackhole_ops and ip6_dst_blackhole_ops where possible. No functional change otherwise. We need these also in other locations and having to define them over and over again is not great. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-30net: check all name nodes in __dev_alloc_nameJiri Bohac1-0/+12
[ Upstream commit 6c015a2256801597fadcbc11d287774c9c512fa5 ] __dev_alloc_name(), when supplied with a name containing '%d', will search for the first available device number to generate a unique device name. Since commit ff92741270bf8b6e78aa885f166b68c7a67ab13a ("net: introduce name_node struct to be used in hashlist") network devices may have alternate names. __dev_alloc_name() does take these alternate names into account, possibly generating a name that is already taken and failing with -ENFILE as a result. This demonstrates the bug: # rmmod dummy 2>/dev/null # ip link property add dev lo altname dummy0 # modprobe dummy numdummies=1 modprobe: ERROR: could not insert 'dummy': Too many open files in system Instead of creating a device named dummy1, modprobe fails. Fix this by checking all the names in the d->name_node list, not just d->name. Signed-off-by: Jiri Bohac <jbohac@suse.cz> Fixes: ff92741270bf ("net: introduce name_node struct to be used in hashlist") Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-30flow_dissector: fix byteorder of dissected ICMP IDAlexander Lobakin1-1/+1
[ Upstream commit a25f822285420486f5da434efc8d940d42a83bce ] flow_dissector_key_icmp::id is of type u16 (CPU byteorder), ICMP header has its ID field in network byteorder obviously. Sparse says: net/core/flow_dissector.c:178:43: warning: restricted __be16 degrades to integer Convert ID value to CPU byteorder when storing it into flow_dissector_key_icmp. Fixes: 5dec597e5cd0 ("flow_dissector: extract more ICMP information") Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-30drop_monitor: Perform cleanup upon probe registration failureIdo Schimmel1-0/+23
[ Upstream commit 9398e9c0b1d44eeb700e9e766c02bcc765c82570 ] In the rare case that drop_monitor fails to register its probe on the 'napi_poll' tracepoint, it will not deactivate its hysteresis timer as part of the error path. If the hysteresis timer was armed by the shortly lived 'kfree_skb' probe and user space retries to initiate tracing, a warning will be emitted for trying to initialize an active object [1]. Fix this by properly undoing all the operations that were done prior to probe registration, in both software and hardware code paths. Note that syzkaller managed to fail probe registration by injecting a slab allocation failure [2]. [1] ODEBUG: init active (active state 0) object type: timer_list hint: sched_send_work+0x0/0x60 include/linux/list.h:135 WARNING: CPU: 1 PID: 8649 at lib/debugobjects.c:505 debug_print_object+0x16e/0x250 lib/debugobjects.c:505 Modules linked in: CPU: 1 PID: 8649 Comm: syz-executor.0 Not tainted 5.11.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:debug_print_object+0x16e/0x250 lib/debugobjects.c:505 [...] Call Trace: __debug_object_init+0x524/0xd10 lib/debugobjects.c:588 debug_timer_init kernel/time/timer.c:722 [inline] debug_init kernel/time/timer.c:770 [inline] init_timer_key+0x2d/0x340 kernel/time/timer.c:814 net_dm_trace_on_set net/core/drop_monitor.c:1111 [inline] set_all_monitor_traces net/core/drop_monitor.c:1188 [inline] net_dm_monitor_start net/core/drop_monitor.c:1295 [inline] net_dm_cmd_trace+0x720/0x1220 net/core/drop_monitor.c:1339 genl_family_rcv_msg_doit+0x228/0x320 net/netlink/genetlink.c:739 genl_family_rcv_msg net/netlink/genetlink.c:783 [inline] genl_rcv_msg+0x328/0x580 net/netlink/genetlink.c:800 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2502 genl_rcv+0x24/0x40 net/netlink/genetlink.c:811 netlink_unicast_kernel net/netlink/af_netlink.c:1312 [inline] netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1338 netlink_sendmsg+0x856/0xd90 net/netlink/af_netlink.c:1927 sock_sendmsg_nosec net/socket.c:652 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:672 ____sys_sendmsg+0x6e8/0x810 net/socket.c:2348 ___sys_sendmsg+0xf3/0x170 net/socket.c:2402 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2435 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xae [2] FAULT_INJECTION: forcing a failure. name failslab, interval 1, probability 0, space 0, times 1 CPU: 1 PID: 8645 Comm: syz-executor.0 Not tainted 5.11.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: dump_stack+0xfa/0x151 should_fail.cold+0x5/0xa should_failslab+0x5/0x10 __kmalloc+0x72/0x3f0 tracepoint_add_func+0x378/0x990 tracepoint_probe_register+0x9c/0xe0 net_dm_cmd_trace+0x7fc/0x1220 genl_family_rcv_msg_doit+0x228/0x320 genl_rcv_msg+0x328/0x580 netlink_rcv_skb+0x153/0x420 genl_rcv+0x24/0x40 netlink_unicast+0x533/0x7d0 netlink_sendmsg+0x856/0xd90 sock_sendmsg+0xcf/0x120 ____sys_sendmsg+0x6e8/0x810 ___sys_sendmsg+0xf3/0x170 __sys_sendmsg+0xe5/0x1b0 do_syscall_64+0x2d/0x70 entry_SYSCALL_64_after_hwframe+0x44/0xae Fixes: 70c69274f354 ("drop_monitor: Initialize timer and work item upon tracing enable") Fixes: 8ee2267ad33e ("drop_monitor: Convert to using devlink tracepoint") Reported-by: syzbot+779559d6503f3a56213d@syzkaller.appspotmail.com Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-07pktgen: fix misuse of BUG_ON() in pktgen_thread_worker()Di Zhu1-1/+1
[ Upstream commit 275b1e88cabb34dbcbe99756b67e9939d34a99b6 ] pktgen create threads for all online cpus and bond these threads to relevant cpu repecivtily. when this thread firstly be woken up, it will compare cpu currently running with the cpu specified at the time of creation and if the two cpus are not equal, BUG_ON() will take effect causing panic on the system. Notice that these threads could be migrated to other cpus before start running because of the cpu hotplug after these threads have created. so the BUG_ON() used here seems unreasonable and we can replace it with WARN_ON() to just printf a warning other than panic the system. Signed-off-by: Di Zhu <zhudi21@huawei.com> Link: https://lore.kernel.org/r/20210125124229.19334-1-zhudi21@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-07net: fix dev_ifsioc_locked() race conditionCong Wang3-14/+50
commit 3b23a32a63219f51a5298bc55a65ecee866e79d0 upstream. dev_ifsioc_locked() is called with only RCU read lock, so when there is a parallel writer changing the mac address, it could get a partially updated mac address, as shown below: Thread 1 Thread 2 // eth_commit_mac_addr_change() memcpy(dev->dev_addr, addr->sa_data, ETH_ALEN); // dev_ifsioc_locked() memcpy(ifr->ifr_hwaddr.sa_data, dev->dev_addr,...); Close this race condition by guarding them with a RW semaphore, like netdev_get_name(). We can not use seqlock here as it does not allow blocking. The writers already take RTNL anyway, so this does not affect the slow path. To avoid bothering existing dev_set_mac_address() callers in drivers, introduce a new wrapper just for user-facing callers on ioctl and rtnetlink paths. Note, bonding also changes slave mac addresses but that requires a separate patch due to the complexity of bonding code. Fixes: 3710becf8a58 ("net: RCU locking for simple ioctl()") Reported-by: "Gong, Sishuai" <sishuai@purdue.edu> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-07net: fix up truesize of cloned skb in skb_prepare_for_shift()Marco Elver1-1/+13
commit 097b9146c0e26aabaa6ff3e5ea536a53f5254a79 upstream. Avoid the assumption that ksize(kmalloc(S)) == ksize(kmalloc(S)): when cloning an skb, save and restore truesize after pskb_expand_head(). This can occur if the allocator decides to service an allocation of the same size differently (e.g. use a different size class, or pass the allocation on to KFENCE). Because truesize is used for bookkeeping (such as sk_wmem_queued), a modified truesize of a cloned skb may result in corrupt bookkeeping and relevant warnings (such as in sk_stream_kill_queues()). Link: https://lkml.kernel.org/r/X9JR/J6dMMOy1obu@elver.google.com Reported-by: syzbot+7b99aafdcc2eedea6178@syzkaller.appspotmail.com Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20210201160420.2826895-1-elver@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04bpf: Fix bpf_fib_lookup helper MTU check for SKB ctxJesper Dangaard Brouer1-3/+10
[ Upstream commit 2c0a10af688c02adcf127aad29e923e0056c6b69 ] BPF end-user on Cilium slack-channel (Carlo Carraro) wants to use bpf_fib_lookup for doing MTU-check, but *prior* to extending packet size, by adjusting fib_params 'tot_len' with the packet length plus the expected encap size. (Just like the bpf_check_mtu helper supports). He discovered that for SKB ctx the param->tot_len was not used, instead skb->len was used (via MTU check in is_skb_forwardable() that checks against netdev MTU). Fix this by using fib_params 'tot_len' for MTU check. If not provided (e.g. zero) then keep existing TC behaviour intact. Notice that 'tot_len' for MTU check is done like XDP code-path, which checks against FIB-dst MTU. V16: - Revert V13 optimization, 2nd lookup is against egress/resulting netdev V13: - Only do ifindex lookup one time, calling dev_get_by_index_rcu(). V10: - Use same method as XDP for 'tot_len' MTU check Fixes: 4c79579b44b1 ("bpf: Change bpf_fib_lookup to return lookup status") Reported-by: Carlo Carraro <colrack@gmail.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/161287789444.790810.15247494756551413508.stgit@firesoul Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-02-23net/sched: fix miss init the mru in qdisc_skb_cbwenxu1-0/+2
[ Upstream commit aadaca9e7c392dbf877af8cefb156199f1a67bbe ] The mru in the qdisc_skb_cb should be init as 0. Only defrag packets in the act_ct will set the value. Fixes: 038ebb1a713d ("net/sched: act_ct: fix miss set mru for ovs after defrag in act_ct") Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-02-17net: gro: do not keep too many GRO packets in napi->rx_listEric Dumazet1-5/+6
commit 8dc1c444df193701910f5e80b5d4caaf705a8fb0 upstream. Commit c80794323e82 ("net: Fix packet reordering caused by GRO and listified RX cooperation") had the unfortunate effect of adding latencies in common workloads. Before the patch, GRO packets were immediately passed to upper stacks. After the patch, we can accumulate quite a lot of GRO packets (depdending on NAPI budget). My fix is counting in napi->rx_count number of segments instead of number of logical packets. Fixes: c80794323e82 ("net: Fix packet reordering caused by GRO and listified RX cooperation") Signed-off-by: Eric Dumazet <edumazet@google.com> Bisected-by: John Sperbeck <jsperbeck@google.com> Tested-by: Jian Yang <jianyang@google.com> Cc: Maxim Mikityanskiy <maximmi@mellanox.com> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Edward Cree <ecree.xilinx@gmail.com> Reviewed-by: Alexander Lobakin <alobakin@pm.me> Link: https://lore.kernel.org/r/20210204213146.4192368-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-17udp: fix skb_copy_and_csum_datagram with odd segment sizesWillem de Bruijn1-2/+10
commit 52cbd23a119c6ebf40a527e53f3402d2ea38eccb upstream. When iteratively computing a checksum with csum_block_add, track the offset "pos" to correctly rotate in csum_block_add when offset is odd. The open coded implementation of skb_copy_and_csum_datagram did this. With the switch to __skb_datagram_iter calling csum_and_copy_to_iter, pos was reinitialized to 0 on each call. Bring back the pos by passing it along with the csum to the callback. Changes v1->v2 - pass csum value, instead of csump pointer (Alexander Duyck) Link: https://lore.kernel.org/netdev/20210128152353.GB27281@optiplex/ Fixes: 950fcaecd5cc ("datagram: consolidate datagram copy to iter helpers") Reported-by: Oliver Graute <oliver.graute@gmail.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20210203192952.1849843-1-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-10neighbour: Prevent a dead entry from updating gc_listChinmay Agarwal1-3/+4
commit eb4e8fac00d1e01ada5e57c05d24739156086677 upstream. Following race condition was detected: <CPU A, t0> - neigh_flush_dev() is under execution and calls neigh_mark_dead(n) marking the neighbour entry 'n' as dead. <CPU B, t1> - Executing: __netif_receive_skb() -> __netif_receive_skb_core() -> arp_rcv() -> arp_process().arp_process() calls __neigh_lookup() which takes a reference on neighbour entry 'n'. <CPU A, t2> - Moves further along neigh_flush_dev() and calls neigh_cleanup_and_release(n), but since reference count increased in t2, 'n' couldn't be destroyed. <CPU B, t3> - Moves further along, arp_process() and calls neigh_update()-> __neigh_update() -> neigh_update_gc_list(), which adds the neighbour entry back in gc_list(neigh_mark_dead(), removed it earlier in t0 from gc_list) <CPU B, t4> - arp_process() finally calls neigh_release(n), destroying the neighbour entry. This leads to 'n' still being part of gc_list, but the actual neighbour structure has been freed. The situation can be prevented from happening if we disallow a dead entry to have any possibility of updating gc_list. This is what the patch intends to achieve. Fixes: 9c29a2f55ec0 ("neighbor: Fix locking order for gc_list changes") Signed-off-by: Chinmay Agarwal <chinagar@codeaurora.org> Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20210127165453.GA20514@chinagar-linux.qualcomm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-27net: core: devlink: use right genl user_ptr when handling port param get/setOleksandr Mazur1-2/+2
commit 7e238de8283acd32c26c2bc2a50672d0ea862ff7 upstream. Fix incorrect user_ptr dereferencing when handling port param get/set: idx [0] stores the 'struct devlink' pointer; idx [1] stores the 'struct devlink_port' pointer; Fixes: 637989b5d77e ("devlink: Always use user_ptr[0] for devlink and simplify post_doit") CC: Parav Pandit <parav@mellanox.com> Signed-off-by: Oleksandr Mazur <oleksandr.mazur@plvision.eu> Signed-off-by: Vadym Kochan <vadym.kochan@plvision.eu> Link: https://lore.kernel.org/r/20210119085333.16833-1-vadym.kochan@plvision.eu Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-27net: Disable NETIF_F_HW_TLS_RX when RXCSUM is disabledTariq Toukan1-0/+5
commit a3eb4e9d4c9218476d05c52dfd2be3d6fdce6b91 upstream. With NETIF_F_HW_TLS_RX packets are decrypted in HW. This cannot be logically done when RXCSUM offload is off. Fixes: 14136564c8ee ("net: Add TLS RX offload feature") Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Boris Pismenny <borisp@nvidia.com> Link: https://lore.kernel.org/r/20210117151538.9411-1-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-27net_sched: gen_estimator: support large ewma logEric Dumazet1-4/+7
commit dd5e073381f2ada3630f36be42833c6e9c78b75e upstream. syzbot report reminded us that very big ewma_log were supported in the past, even if they made litle sense. tc qdisc replace dev xxx root est 1sec 131072sec ... While fixing the bug, also add boundary checks for ewma_log, in line with range supported by iproute2. UBSAN: shift-out-of-bounds in net/core/gen_estimator.c:83:38 shift exponent -1 is negative CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.10.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:79 [inline] dump_stack+0x107/0x163 lib/dump_stack.c:120 ubsan_epilogue+0xb/0x5a lib/ubsan.c:148 __ubsan_handle_shift_out_of_bounds.cold+0xb1/0x181 lib/ubsan.c:395 est_timer.cold+0xbb/0x12d net/core/gen_estimator.c:83 call_timer_fn+0x1a5/0x710 kernel/time/timer.c:1417 expire_timers kernel/time/timer.c:1462 [inline] __run_timers.part.0+0x692/0xa80 kernel/time/timer.c:1731 __run_timers kernel/time/timer.c:1712 [inline] run_timer_softirq+0xb3/0x1d0 kernel/time/timer.c:1744 __do_softirq+0x2bc/0xa77 kernel/softirq.c:343 asm_call_irq_on_stack+0xf/0x20 </IRQ> __run_on_irqstack arch/x86/include/asm/irq_stack.h:26 [inline] run_on_irqstack_cond arch/x86/include/asm/irq_stack.h:77 [inline] do_softirq_own_stack+0xaa/0xd0 arch/x86/kernel/irq_64.c:77 invoke_softirq kernel/softirq.c:226 [inline] __irq_exit_rcu+0x17f/0x200 kernel/softirq.c:420 irq_exit_rcu+0x5/0x20 kernel/softirq.c:432 sysvec_apic_timer_interrupt+0x4d/0x100 arch/x86/kernel/apic/apic.c:1096 asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:628 RIP: 0010:native_save_fl arch/x86/include/asm/irqflags.h:29 [inline] RIP: 0010:arch_local_save_flags arch/x86/include/asm/irqflags.h:79 [inline] RIP: 0010:arch_irqs_disabled arch/x86/include/asm/irqflags.h:169 [inline] RIP: 0010:acpi_safe_halt drivers/acpi/processor_idle.c:111 [inline] RIP: 0010:acpi_idle_do_entry+0x1c9/0x250 drivers/acpi/processor_idle.c:516 Fixes: 1c0d32fde5bd ("net_sched: gen_estimator: complete rewrite of rate estimators") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Link: https://lore.kernel.org/r/20210114181929.1717985-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-27skbuff: back tiny skbs with kmalloc() in __netdev_alloc_skb() tooAlexander Lobakin1-1/+5
commit 66c556025d687dbdd0f748c5e1df89c977b6c02a upstream. Commit 3226b158e67c ("net: avoid 32 x truesize under-estimation for tiny skbs") ensured that skbs with data size lower than 1025 bytes will be kmalloc'ed to avoid excessive page cache fragmentation and memory consumption. However, the fix adressed only __napi_alloc_skb() (primarily for virtio_net and napi_get_frags()), but the issue can still be achieved through __netdev_alloc_skb(), which is still used by several drivers. Drivers often allocate a tiny skb for headers and place the rest of the frame to frags (so-called copybreak). Mirror the condition to __netdev_alloc_skb() to handle this case too. Since v1 [0]: - fix "Fixes:" tag; - refine commit message (mention copybreak usecase). [0] https://lore.kernel.org/netdev/20210114235423.232737-1-alobakin@pm.me Fixes: a1c7fff7e18f ("net: netdev_alloc_skb() use build_skb()") Signed-off-by: Alexander Lobakin <alobakin@pm.me> Link: https://lore.kernel.org/r/20210115150354.85967-1-alobakin@pm.me Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-23net: avoid 32 x truesize under-estimation for tiny skbsEric Dumazet1-2/+7
[ Upstream commit 3226b158e67cfaa677fd180152bfb28989cb2fac ] Both virtio net and napi_get_frags() allocate skbs with a very small skb->head While using page fragments instead of a kmalloc backed skb->head might give a small performance improvement in some cases, there is a huge risk of under estimating memory usage. For both GOOD_COPY_LEN and GRO_MAX_HEAD, we can fit at least 32 allocations per page (order-3 page in x86), or even 64 on PowerPC We have been tracking OOM issues on GKE hosts hitting tcp_mem limits but consuming far more memory for TCP buffers than instructed in tcp_mem[2] Even if we force napi_alloc_skb() to only use order-0 pages, the issue would still be there on arches with PAGE_SIZE >= 32768 This patch makes sure that small skb head are kmalloc backed, so that other objects in the slab page can be reused instead of being held as long as skbs are sitting in socket queues. Note that we might in the future use the sk_buff napi cache, instead of going through a more expensive __alloc_skb() Another idea would be to use separate page sizes depending on the allocated length (to never have more than 4 frags per page) I would like to thank Greg Thelen for his precious help on this matter, analysing crash dumps is always a time consuming task. Fixes: fd11a83dd363 ("net: Pull out core bits of __netdev_alloc_skb and add __napi_alloc_skb") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Greg Thelen <gthelen@google.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Link: https://lore.kernel.org/r/20210113161819.1155526-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-23udp: Prevent reuseport_select_sock from reading uninitialized socksBaptiste Lepers1-1/+1
[ Upstream commit fd2ddef043592e7de80af53f47fa46fd3573086e ] reuse->socks[] is modified concurrently by reuseport_add_sock. To prevent reading values that have not been fully initialized, only read the array up until the last known safe index instead of incorrectly re-reading the last index of the array. Fixes: acdcecc61285f ("udp: correct reuseport selection with connected sockets") Signed-off-by: Baptiste Lepers <baptiste.lepers@gmail.com> Acked-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20210107051110.12247-1-baptiste.lepers@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-23net: fix use-after-free when UDP GRO with shared fraglistDongseok Yi1-1/+19
[ Upstream commit 53475c5dd856212e91538a9501162e821cc1f791 ] skbs in fraglist could be shared by a BPF filter loaded at TC. If TC writes, it will call skb_ensure_writable -> pskb_expand_head to create a private linear section for the head_skb. And then call skb_clone_fraglist -> skb_get on each skb in the fraglist. skb_segment_list overwrites part of the skb linear section of each fragment itself. Even after skb_clone, the frag_skbs share their linear section with their clone in PF_PACKET. Both sk_receive_queue of PF_PACKET and PF_INET (or PF_INET6) can have a link for the same frag_skbs chain. If a new skb (not frags) is queued to one of the sk_receive_queue, multiple ptypes can see and release this. It causes use-after-free. [ 4443.426215] ------------[ cut here ]------------ [ 4443.426222] refcount_t: underflow; use-after-free. [ 4443.426291] WARNING: CPU: 7 PID: 28161 at lib/refcount.c:190 refcount_dec_and_test_checked+0xa4/0xc8 [ 4443.426726] pstate: 60400005 (nZCv daif +PAN -UAO) [ 4443.426732] pc : refcount_dec_and_test_checked+0xa4/0xc8 [ 4443.426737] lr : refcount_dec_and_test_checked+0xa0/0xc8 [ 4443.426808] Call trace: [ 4443.426813] refcount_dec_and_test_checked+0xa4/0xc8 [ 4443.426823] skb_release_data+0x144/0x264 [ 4443.426828] kfree_skb+0x58/0xc4 [ 4443.426832] skb_queue_purge+0x64/0x9c [ 4443.426844] packet_set_ring+0x5f0/0x820 [ 4443.426849] packet_setsockopt+0x5a4/0xcd0 [ 4443.426853] __sys_setsockopt+0x188/0x278 [ 4443.426858] __arm64_sys_setsockopt+0x28/0x38 [ 4443.426869] el0_svc_common+0xf0/0x1d0 [ 4443.426873] el0_svc_handler+0x74/0x98 [ 4443.426880] el0_svc+0x8/0xc Fixes: 3a1296a38d0c (net: Support GRO/GSO fraglist chaining.) Signed-off-by: Dongseok Yi <dseok.yi@samsung.com> Acked-by: Willem de Bruijn <willemb@google.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/1610072918-174177-1-git-send-email-dseok.yi@samsung.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-17net: drop bogus skb with CHECKSUM_PARTIAL and offset beyond end of trimmed ↵Vasily Averin1-0/+6
packet commit 54970a2fbb673f090b7f02d7f57b10b2e0707155 upstream. syzbot reproduces BUG_ON in skb_checksum_help(): tun creates (bogus) skb with huge partial-checksummed area and small ip packet inside. Then ip_rcv trims the skb based on size of internal ip packet, after that csum offset points beyond of trimmed skb. Then checksum_tg() called via netfilter hook triggers BUG_ON: offset = skb_checksum_start_offset(skb); BUG_ON(offset >= skb_headlen(skb)); To work around the problem this patch forces pskb_trim_rcsum_slow() to return -EINVAL in described scenario. It allows its callers to drop such kind of packets. Link: https://syzkaller.appspot.com/bug?id=b419a5ca95062664fe1a60b764621eb4526e2cd0 Reported-by: syzbot+7010af67ced6105e5ab6@syzkaller.appspotmail.com Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Acked-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/1b2494af-2c56-8ee2-7bc0-923fcad1cdf8@virtuozzo.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-12net-sysfs: take the rtnl lock when accessing xps_rxqs_map and num_tcAntoine Tenart1-5/+18
[ Upstream commit 4ae2bb81649dc03dfc95875f02126b14b773f7ab ] Accesses to dev->xps_rxqs_map (when using dev->num_tc) should be protected by the rtnl lock, like we do for netif_set_xps_queue. I didn't see an actual bug being triggered, but let's be safe here and take the rtnl lock while accessing the map in sysfs. Fixes: 8af2c06ff4b1 ("net-sysfs: Add interface for Rx queue(s) map per Tx queue") Signed-off-by: Antoine Tenart <atenart@kernel.org> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-12net-sysfs: take the rtnl lock when storing xps_rxqsAntoine Tenart1-0/+7
[ Upstream commit 2d57b4f142e0b03e854612b8e28978935414bced ] Two race conditions can be triggered when storing xps rxqs, resulting in various oops and invalid memory accesses: 1. Calling netdev_set_num_tc while netif_set_xps_queue: - netif_set_xps_queue uses dev->tc_num as one of the parameters to compute the size of new_dev_maps when allocating it. dev->tc_num is also used to access the map, and the compiler may generate code to retrieve this field multiple times in the function. - netdev_set_num_tc sets dev->tc_num. If new_dev_maps is allocated using dev->tc_num and then dev->tc_num is set to a higher value through netdev_set_num_tc, later accesses to new_dev_maps in netif_set_xps_queue could lead to accessing memory outside of new_dev_maps; triggering an oops. 2. Calling netif_set_xps_queue while netdev_set_num_tc is running: 2.1. netdev_set_num_tc starts by resetting the xps queues, dev->tc_num isn't updated yet. 2.2. netif_set_xps_queue is called, setting up the map with the *old* dev->num_tc. 2.3. netdev_set_num_tc updates dev->tc_num. 2.4. Later accesses to the map lead to out of bound accesses and oops. A similar issue can be found with netdev_reset_tc. One way of triggering this is to set an iface up (for which the driver uses netdev_set_num_tc in the open path, such as bnx2x) and writing to xps_rxqs in a concurrent thread. With the right timing an oops is triggered. Both issues have the same fix: netif_set_xps_queue, netdev_set_num_tc and netdev_reset_tc should be mutually exclusive. We do that by taking the rtnl lock in xps_rxqs_store. Fixes: 8af2c06ff4b1 ("net-sysfs: Add interface for Rx queue(s) map per Tx queue") Signed-off-by: Antoine Tenart <atenart@kernel.org> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-12net-sysfs: take the rtnl lock when accessing xps_cpus_map and num_tcAntoine Tenart1-7/+22
[ Upstream commit fb25038586d0064123e393cadf1fadd70a9df97a ] Accesses to dev->xps_cpus_map (when using dev->num_tc) should be protected by the rtnl lock, like we do for netif_set_xps_queue. I didn't see an actual bug being triggered, but let's be safe here and take the rtnl lock while accessing the map in sysfs. Fixes: 184c449f91fe ("net: Add support for XPS with QoS via traffic classes") Signed-off-by: Antoine Tenart <atenart@kernel.org> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-12net-sysfs: take the rtnl lock when storing xps_cpusAntoine Tenart1-0/+6
[ Upstream commit 1ad58225dba3f2f598d2c6daed4323f24547168f ] Two race conditions can be triggered when storing xps cpus, resulting in various oops and invalid memory accesses: 1. Calling netdev_set_num_tc while netif_set_xps_queue: - netif_set_xps_queue uses dev->tc_num as one of the parameters to compute the size of new_dev_maps when allocating it. dev->tc_num is also used to access the map, and the compiler may generate code to retrieve this field multiple times in the function. - netdev_set_num_tc sets dev->tc_num. If new_dev_maps is allocated using dev->tc_num and then dev->tc_num is set to a higher value through netdev_set_num_tc, later accesses to new_dev_maps in netif_set_xps_queue could lead to accessing memory outside of new_dev_maps; triggering an oops. 2. Calling netif_set_xps_queue while netdev_set_num_tc is running: 2.1. netdev_set_num_tc starts by resetting the xps queues, dev->tc_num isn't updated yet. 2.2. netif_set_xps_queue is called, setting up the map with the *old* dev->num_tc. 2.3. netdev_set_num_tc updates dev->tc_num. 2.4. Later accesses to the map lead to out of bound accesses and oops. A similar issue can be found with netdev_reset_tc. One way of triggering this is to set an iface up (for which the driver uses netdev_set_num_tc in the open path, such as bnx2x) and writing to xps_cpus in a concurrent thread. With the right timing an oops is triggered. Both issues have the same fix: netif_set_xps_queue, netdev_set_num_tc and netdev_reset_tc should be mutually exclusive. We do that by taking the rtnl lock in xps_cpus_store. Fixes: 184c449f91fe ("net: Add support for XPS with QoS via traffic classes") Signed-off-by: Antoine Tenart <atenart@kernel.org> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-10Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller3-27/+36
Alexei Starovoitov says: ==================== pull-request: bpf 2020-12-10 The following pull-request contains BPF updates for your *net* tree. We've added 21 non-merge commits during the last 12 day(s) which contain a total of 21 files changed, 163 insertions(+), 88 deletions(-). The main changes are: 1) Fix propagation of 32-bit signed bounds from 64-bit bounds, from Alexei. 2) Fix ring_buffer__poll() return value, from Andrii. 3) Fix race in lwt_bpf, from Cong. 4) Fix test_offload, from Toke. 5) Various xsk fixes. Please consider pulling these changes from: git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git Thanks a lot! Also thanks to reporters, reviewers and testers of commits in this pull-request: Cong Wang, Hulk Robot, Jakub Kicinski, Jean-Philippe Brucker, John Fastabend, Magnus Karlsson, Maxim Miki