summaryrefslogtreecommitdiff
path: root/net/sched
AgeCommit message (Collapse)AuthorFilesLines
2023-02-06net/sched: mqprio: refactor offloading and unoffloading to dedicated functionsVladimir Oltean1-43/+59
Some more logic will be added to mqprio offloading, so split that code up from mqprio_init(), which is already large, and create a new function, mqprio_enable_offload(), similar to taprio_enable_offload(). Also create the opposite function mqprio_disable_offload(). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-06net/sched: mqprio: refactor nlattr parsing to a separate functionVladimir Oltean1-51/+63
mqprio_init() is quite large and unwieldy to add more code to. Split the netlink attribute parsing to a dedicated function. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-03net/sched: act_ct: offload UDP NEW connectionsVlad Buslov1-12/+39
Modify the offload algorithm of UDP connections to the following: - Offload NEW connection as unidirectional. - When connection state changes to ESTABLISHED also update the hardware flow. However, in order to prevent act_ct from spamming offload add wq for every packet coming in reply direction in this state verify whether connection has already been updated to ESTABLISHED in the drivers. If that it the case, then skip flow_table and let conntrack handle such packets which will also allow conntrack to potentially promote the connection to ASSURED. - When connection state changes to ASSURED set the flow_table flow NF_FLOW_HW_BIDIRECTIONAL flag which will cause refresh mechanism to offload the reply direction. All other protocols have their offload algorithm preserved and are always offloaded as bidirectional. Note that this change tries to minimize the load on flow_table add workqueue. First, it tracks the last ctinfo that was offloaded by using new flow 'NF_FLOW_HW_ESTABLISHED' flag and doesn't schedule the refresh for reply direction packets when the offloads have already been updated with current ctinfo. Second, when 'add' task executes on workqueue it always update the offload with current flow state (by checking 'bidirectional' flow flag and obtaining actual ctinfo/cookie through meta action instead of caching any of these from the moment of scheduling the 'add' work) preventing the need from scheduling more updates if state changed concurrently while the 'add' work was pending on workqueue. Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-03net/sched: act_ct: set ctinfo in meta action depending on ct stateVlad Buslov1-2/+4
Currently tcf_ct_flow_table_fill_actions() function assumes that only established connections can be offloaded and always sets ctinfo to either IP_CT_ESTABLISHED or IP_CT_ESTABLISHED_REPLY strictly based on direction without checking actual connection state. To enable UDP NEW connection offload set the ctinfo, metadata cookie and NF_FLOW_HW_ESTABLISHED flow_offload flags bit based on ct->status value. Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-03netfilter: flowtable: cache info of last offloadVlad Buslov1-5/+7
Modify flow table offload to cache the last ct info status that was passed to the driver offload callbacks by extending enum nf_flow_flags with new "NF_FLOW_HW_ESTABLISHED" flag. Set the flag if ctinfo was 'established' during last act_ct meta actions fill call. This infrastructure change is necessary to optimize promoting of UDP connections from 'new' to 'established' in following patches in this series. Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-1/+4
net/core/gro.c 7d2c89b32587 ("skb: Do mix page pool and page referenced frags in GRO") b1a78b9b9886 ("net: add support for ipv4 big tcp") https://lore.kernel.org/all/20230203094454.5766f160@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-02net/sched: simplify tcf_pedit_actPedro Tammela1-70/+67
Remove the check for a negative number of keys as this cannot ever happen Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-02-02net/sched: transition act_pedit to rcu and percpu statsPedro Tammela1-61/+87
The software pedit action didn't get the same love as some of the other actions and it's still using spinlocks and shared stats in the datapath. Transition the action to rcu and percpu stats as this improves the action's performance dramatically on multiple cpu deployments. Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-02-01net: sched: use skb_ip_totlen and iph_totlenXin Long2-2/+2
There are 1 action and 1 qdisc that may process IPv4 TCP GSO packets and access iph->tot_len, replace them with skb_ip_totlen() and iph_totlen() accordingly. Note that we don't need to replace the one in tcf_csum_ipv4(), as it will return for TCP GSO packets in tcf_csum_ipv4_tcp(). Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-31net: sched: sch: Bounds check priorityKees Cook1-1/+4
Nothing was explicitly bounds checking the priority index used to access clpriop[]. WARN and bail out early if it's pathological. Seen with GCC 13: ../net/sched/sch_htb.c: In function 'htb_activate_prios': ../net/sched/sch_htb.c:437:44: warning: array subscript [0, 31] is outside array bounds of 'struct htb_prio[8]' [-Warray-bounds=] 437 | if (p->inner.clprio[prio].feed.rb_node) | ~~~~~~~~~~~~~~~^~~~~~ ../net/sched/sch_htb.c:131:41: note: while referencing 'clprio' 131 | struct htb_prio clprio[TC_HTB_NUMPRIO]; | ^~~~~~ Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Jiri Pirko <jiri@resnulli.us> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: netdev@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Cong Wang <cong.wang@bytedance.com> Link: https://lore.kernel.org/r/20230127224036.never.561-kees@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-1/+0
Conflicts: drivers/net/ethernet/intel/ice/ice_main.c 418e53401e47 ("ice: move devlink port creation/deletion") 643ef23bd9dd ("ice: Introduce local var for readability") https://lore.kernel.org/all/20230127124025.0dacef40@canb.auug.org.au/ https://lore.kernel.org/all/20230124005714.3996270-1-anthony.l.nguyen@intel.com/ drivers/net/ethernet/engleder/tsnep_main.c 3d53aaef4332 ("tsnep: Fix TX queue stop/wake for multiple queues") 25faa6a4c5ca ("tsnep: Replace TX spin_lock with __netif_tx_lock") https://lore.kernel.org/all/20230127123604.36bb3e99@canb.auug.org.au/ net/netfilter/nf_conntrack_proto_sctp.c 13bd9b31a969 ("Revert "netfilter: conntrack: add sctp DATA_SENT state"") a44b7651489f ("netfilter: conntrack: unify established states for SCTP paths") f71cb8f45d09 ("netfilter: conntrack: sctp: use nf log infrastructure for invalid packets") https://lore.kernel.org/all/20230127125052.674281f9@canb.auug.org.au/ https://lore.kernel.org/all/d36076f3-6add-a442-6d4b-ead9f7ffff86@tessares.net/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-25net: Kconfig: fix spellosRandy Dunlap1-1/+1
Fix spelling in net/ Kconfig files. (reported by codespell) Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: Jozsef Kadlecsik <kadlec@netfilter.org> Cc: Florian Westphal <fw@strlen.de> Cc: coreteam@netfilter.org Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Jiri Pirko <jiri@resnulli.us> Link: https://lore.kernel.org/r/20230124181724.18166-1-rdunlap@infradead.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-24net/sched: sch_taprio: do not schedule in taprio_reset()Eric Dumazet1-1/+0
As reported by syzbot and hinted by Vinicius, I should not have added a qdisc_synchronize() call in taprio_reset() taprio_reset() can be called with qdisc spinlock held (and BH disabled) as shown in included syzbot report [1]. Only taprio_destroy() needed this synchronization, as explained in the blamed commit changelog. [1] BUG: scheduling while atomic: syz-executor150/5091/0x00000202 2 locks held by syz-executor150/5091: Modules linked in: Preemption disabled at: [<0000000000000000>] 0x0 Kernel panic - not syncing: scheduling while atomic: panic_on_warn set ... CPU: 1 PID: 5091 Comm: syz-executor150 Not tainted 6.2.0-rc3-syzkaller-00219-g010a74f52203 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/12/2023 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xd1/0x138 lib/dump_stack.c:106 panic+0x2cc/0x626 kernel/panic.c:318 check_panic_on_warn.cold+0x19/0x35 kernel/panic.c:238 __schedule_bug.cold+0xd5/0xfe kernel/sched/core.c:5836 schedule_debug kernel/sched/core.c:5865 [inline] __schedule+0x34e4/0x5450 kernel/sched/core.c:6500 schedule+0xde/0x1b0 kernel/sched/core.c:6682 schedule_timeout+0x14e/0x2a0 kernel/time/timer.c:2167 schedule_timeout_uninterruptible kernel/time/timer.c:2201 [inline] msleep+0xb6/0x100 kernel/time/timer.c:2322 qdisc_synchronize include/net/sch_generic.h:1295 [inline] taprio_reset+0x93/0x270 net/sched/sch_taprio.c:1703 qdisc_reset+0x10c/0x770 net/sched/sch_generic.c:1022 dev_reset_queue+0x92/0x130 net/sched/sch_generic.c:1285 netdev_for_each_tx_queue include/linux/netdevice.h:2464 [inline] dev_deactivate_many+0x36d/0x9f0 net/sched/sch_generic.c:1351 dev_deactivate+0xed/0x1b0 net/sched/sch_generic.c:1374 qdisc_graft+0xe4a/0x1380 net/sched/sch_api.c:1080 tc_modify_qdisc+0xb6b/0x19a0 net/sched/sch_api.c:1689 rtnetlink_rcv_msg+0x43e/0xca0 net/core/rtnetlink.c:6141 netlink_rcv_skb+0x165/0x440 net/netlink/af_netlink.c:2564 netlink_unicast_kernel net/netlink/af_netlink.c:1330 [inline] netlink_unicast+0x547/0x7f0 net/netlink/af_netlink.c:1356 netlink_sendmsg+0x91b/0xe10 net/netlink/af_netlink.c:1932 sock_sendmsg_nosec net/socket.c:714 [inline] sock_sendmsg+0xd3/0x120 net/socket.c:734 ____sys_sendmsg+0x712/0x8c0 net/socket.c:2476 ___sys_sendmsg+0x110/0x1b0 net/socket.c:2530 __sys_sendmsg+0xf7/0x1c0 net/socket.c:2559 do_syscall_x64 arch/x86/entry/common.c:50 [inline] Fixes: 3a415d59c1db ("net/sched: sch_taprio: fix possible use-after-free") Link: https://lore.kernel.org/netdev/167387581653.2747.13878941339893288655.git-patchwork-notify@kernel.org/T/ Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Vinicius Costa Gomes <vinicius.gomes@intel.com> Link: https://lore.kernel.org/r/20230123084552.574396-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-24act_mirred: use the backlog for nested calls to mirred ingressDavide Caratti1-0/+7
William reports kernel soft-lockups on some OVS topologies when TC mirred egress->ingress action is hit by local TCP traffic [1]. The same can also be reproduced with SCTP (thanks Xin for verifying), when client and server reach themselves through mirred egress to ingress, and one of the two peers sends a "heartbeat" packet (from within a timer). Enqueueing to backlog proved to fix this soft lockup; however, as Cong noticed [2], we should preserve - when possible - the current mirred behavior that counts as "overlimits" any eventual packet drop subsequent to the mirred forwarding action [3]. A compromise solution might use the backlog only when tcf_mirred_act() has a nest level greater than one: change tcf_mirred_forward() accordingly. Also, add a kselftest that can reproduce the lockup and verifies TC mirred ability to account for further packet drops after TC mirred egress->ingress (when the nest level is 1). [1] https://lore.kernel.org/netdev/33dc43f587ec1388ba456b4915c75f02a8aae226.1663945716.git.dcaratti@redhat.com/ [2] https://lore.kernel.org/netdev/Y0w%2FWWY60gqrtGLp@pop-os.localdomain/ [3] such behavior is not guaranteed: for example, if RPS or skb RX timestamping is enabled on the mirred target device, the kernel can defer receiving the skb and return NET_RX_SUCCESS inside tcf_mirred_forward(). Reported-by: William Zhao <wizhao@redhat.com> CC: Xin Long <lucien.xin@gmail.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-24net/sched: act_mirred: better wording on protection against excessive stack ↵Davide Caratti1-8/+8
growth with commit e2ca070f89ec ("net: sched: protect against stack overflow in TC act_mirred"), act_mirred protected itself against excessive stack growth using per_cpu counter of nested calls to tcf_mirred_act(), and capping it to MIRRED_RECURSION_LIMIT. However, such protection does not detect recursion/loops in case the packet is enqueued to the backlog (for example, when the mirred target device has RPS or skb timestamping enabled). Change the wording from "recursion" to "nesting" to make it more clear to readers. CC: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski3-11/+21
drivers/net/ipa/ipa_interrupt.c drivers/net/ipa/ipa_interrupt.h 9ec9b2a30853 ("net: ipa: disable ipa interrupt during suspend") 8e461e1f092b ("net: ipa: introduce ipa_interrupt_enable()") d50ed3558719 ("net: ipa: enable IPA interrupt handlers separate from registration") https://lore.kernel.org/all/20230119114125.5182c7ab@canb.auug.org.au/ https://lore.kernel.org/all/79e46152-8043-a512-79d9-c3b905462774@tessares.net/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-18net: sched: gred: prevent races when adding offloads to statsJakub Kicinski1-0/+2
Naresh reports seeing a warning that gred is calling u64_stats_update_begin() with preemption enabled. Arnd points out it's coming from _bstats_update(). We should be holding the qdisc lock when writing to stats, they are also updated from the datapath. Reported-by: Linux Kernel Functional Testing <lkft@linaro.org> Link: https://lore.kernel.org/all/CA+G9fYsTr9_r893+62u6UGD3dVaCE-kN9C-Apmb2m=hxjc1Cqg@mail.gmail.com/ Fixes: e49efd5288bd ("net: sched: gred: support reporting stats from offloads") Link: https://lore.kernel.org/r/20230113044137.1383067-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-17sched: add new attr TCA_EXT_WARN_MSG to report tc extact messageHangbin Liu3-49/+83
We will report extack message if there is an error via netlink_ack(). But if the rule is not to be exclusively executed by the hardware, extack is not passed along and offloading failures don't get logged. In commit 81c7288b170a ("sched: cls: enable verbose logging") Marcelo made cls could log verbose info for offloading failures, which helps improving Open vSwitch debuggability when using flower offloading. It would also be helpful if userspace monitor tools, like "tc monitor", could log this kind of message, as it doesn't require vswitchd log level adjusment. Let's add a new tc attributes to report the extack message so the monitor program could receive the failures. e.g. # tc monitor added chain dev enp3s0f1np1 parent ffff: chain 0 added filter dev enp3s0f1np1 ingress protocol all pref 49152 flower chain 0 handle 0x1 ct_state +trk+new not_in_hw action order 1: gact action drop random type none pass val 0 index 1 ref 1 bind 1 Warning: mlx5_core: matching on ct_state +new isn't supported. In this patch I only report the extack message on add/del operations. It doesn't look like we need to report the extack message on get/dump operations. Note this message not only reporte to multicast groups, it could also be reported unicast, which may affect the current usersapce tool's behaivor. Suggested-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://lore.kernel.org/r/20230113034353.2766735-1-liuhangbin@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-16net/sched: sch_taprio: fix possible use-after-freeEric Dumazet1-0/+3
syzbot reported a nasty crash [1] in net_tx_action() which made little sense until we got a repro. This repro installs a taprio qdisc, but providing an invalid TCA_RATE attribute. qdisc_create() has to destroy the just initialized taprio qdisc, and taprio_destroy() is called. However, the hrtimer used by taprio had already fired, therefore advance_sched() called __netif_schedule(). Then net_tx_action was trying to use a destroyed qdisc. We can not undo the __netif_schedule(), so we must wait until one cpu serviced the qdisc before we can proceed. Many thanks to Alexander Potapenko for his help. [1] BUG: KMSAN: uninit-value in queued_spin_trylock include/asm-generic/qspinlock.h:94 [inline] BUG: KMSAN: uninit-value in do_raw_spin_trylock include/linux/spinlock.h:191 [inline] BUG: KMSAN: uninit-value in __raw_spin_trylock include/linux/spinlock_api_smp.h:89 [inline] BUG: KMSAN: uninit-value in _raw_spin_trylock+0x92/0xa0 kernel/locking/spinlock.c:138 queued_spin_trylock include/asm-generic/qspinlock.h:94 [inline] do_raw_spin_trylock include/linux/spinlock.h:191 [inline] __raw_spin_trylock include/linux/spinlock_api_smp.h:89 [inline] _raw_spin_trylock+0x92/0xa0 kernel/locking/spinlock.c:138 spin_trylock include/linux/spinlock.h:359 [inline] qdisc_run_begin include/net/sch_generic.h:187 [inline] qdisc_run+0xee/0x540 include/net/pkt_sched.h:125 net_tx_action+0x77c/0x9a0 net/core/dev.c:5086 __do_softirq+0x1cc/0x7fb kernel/softirq.c:571 run_ksoftirqd+0x2c/0x50 kernel/softirq.c:934 smpboot_thread_fn+0x554/0x9f0 kernel/smpboot.c:164 kthread+0x31b/0x430 kernel/kthread.c:376 ret_from_fork+0x1f/0x30 Uninit was created at: slab_post_alloc_hook mm/slab.h:732 [inline] slab_alloc_node mm/slub.c:3258 [inline] __kmalloc_node_track_caller+0x814/0x1250 mm/slub.c:4970 kmalloc_reserve net/core/skbuff.c:358 [inline] __alloc_skb+0x346/0xcf0 net/core/skbuff.c:430 alloc_skb include/linux/skbuff.h:1257 [inline] nlmsg_new include/net/netlink.h:953 [inline] netlink_ack+0x5f3/0x12b0 net/netlink/af_netlink.c:2436 netlink_rcv_skb+0x55d/0x6c0 net/netlink/af_netlink.c:2507 rtnetlink_rcv+0x30/0x40 net/core/rtnetlink.c:6108 netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline] netlink_unicast+0xf3b/0x1270 net/netlink/af_netlink.c:1345 netlink_sendmsg+0x1288/0x1440 net/netlink/af_netlink.c:1921 sock_sendmsg_nosec net/socket.c:714 [inline] sock_sendmsg net/socket.c:734 [inline] ____sys_sendmsg+0xabc/0xe90 net/socket.c:2482 ___sys_sendmsg+0x2a1/0x3f0 net/socket.c:2536 __sys_sendmsg net/socket.c:2565 [inline] __do_sys_sendmsg net/socket.c:2574 [inline] __se_sys_sendmsg net/socket.c:2572 [inline] __x64_sys_sendmsg+0x367/0x540 net/socket.c:2572 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd CPU: 0 PID: 13 Comm: ksoftirqd/0 Not tainted 6.0.0-rc2-syzkaller-47461-gac3859c02d7f #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/22/2022 Fixes: 5a781ccbd19e ("tc: Add support for configuring the taprio scheduler") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-13sch_htb: Avoid grafting on htb_destroy_class_offload when destroying htbRahul Rameshbabu1-11/+16
Peek at old qdisc and graft only when deleting a leaf class in the htb, rather than when deleting the htb itself. Do not peek at the qdisc of the netdev queue when destroying the htb. The caller may already have grafted a new qdisc that is not part of the htb structure being destroyed. This fix resolves two use cases. 1. Using tc to destroy the htb. - Netdev was being prematurely activated before the htb was fully destroyed. 2. Using tc to replace the htb with another qdisc (which also leads to the htb being destroyed). - Premature netdev activation like previous case. Newly grafted qdisc was also getting accidentally overwritten when destroying the htb. Fixes: d03b195b5aa0 ("sch_htb: Hierarchical QoS hardware offload") Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Maxim Mikityanskiy <maxtram95@gmail.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20230113005528.302625-1-rrameshbabu@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-10net: sched: disallow noqueue for qdisc classesFrederick Lawler1-0/+5
While experimenting with applying noqueue to a classful queue discipline, we discovered a NULL pointer dereference in the __dev_queue_xmit() path that generates a kernel OOPS: # dev=enp0s5 # tc qdisc replace dev $dev root handle 1: htb default 1 # tc class add dev $dev parent 1: classid 1:1 htb rate 10mbit # tc qdisc add dev $dev parent 1:1 handle 10: noqueue # ping -I $dev -w 1 -c 1 1.1.1.1 [ 2.172856] BUG: kernel NULL pointer dereference, address: 0000000000000000 [ 2.173217] #PF: supervisor instruction fetch in kernel mode ... [ 2.178451] Call Trace: [ 2.178577] <TASK> [ 2.178686] htb_enqueue+0x1c8/0x370 [ 2.178880] dev_qdisc_enqueue+0x15/0x90 [ 2.179093] __dev_queue_xmit+0x798/0xd00 [ 2.179305] ? _raw_write_lock_bh+0xe/0x30 [ 2.179522] ? __local_bh_enable_ip+0x32/0x70 [ 2.179759] ? ___neigh_create+0x610/0x840 [ 2.179968] ? eth_header+0x21/0xc0 [ 2.180144] ip_finish_output2+0x15e/0x4f0 [ 2.180348] ? dst_output+0x30/0x30 [ 2.180525] ip_push_pending_frames+0x9d/0xb0 [ 2.180739] raw_sendmsg+0x601/0xcb0 [ 2.180916] ? _raw_spin_trylock+0xe/0x50 [ 2.181112] ? _raw_spin_unlock_irqrestore+0x16/0x30 [ 2.181354] ? get_page_from_freelist+0xcd6/0xdf0 [ 2.181594] ? sock_sendmsg+0x56/0x60 [ 2.181781] sock_sendmsg+0x56/0x60 [ 2.181958] __sys_sendto+0xf7/0x160 [ 2.182139] ? handle_mm_fault+0x6e/0x1d0 [ 2.182366] ? do_user_addr_fault+0x1e1/0x660 [ 2.182627] __x64_sys_sendto+0x1b/0x30 [ 2.182881] do_syscall_64+0x38/0x90 [ 2.183085] entry_SYSCALL_64_after_hwframe+0x63/0xcd ... [ 2.187402] </TASK> Previously in commit d66d6c3152e8 ("net: sched: register noqueue qdisc"), NULL was set for the noqueue discipline on noqueue init so that __dev_queue_xmit() falls through for the noqueue case. This also sets a bypass of the enqueue NULL check in the register_qdisc() function for the struct noqueue_disc_ops. Classful queue disciplines make it past the NULL check in __dev_queue_xmit() because the discipline is set to htb (in this case), and then in the call to __dev_xmit_skb(), it calls into htb_enqueue() which grabs a leaf node for a class and then calls qdisc_enqueue() by passing in a queue discipline which assumes ->enqueue() is not set to NULL. Fix this by not allowing classes to be assigned to the noqueue discipline. Linux TC Notes states that classes cannot be set to the noqueue discipline. [1] Let's enforce that here. Links: 1. https://linux-tc-notes.sourceforge.net/tc/doc/sch_noqueue.txt Fixes: d66d6c3152e8 ("net: sched: register noqueue qdisc") Cc: stable@vger.kernel.org Signed-off-by: Frederick Lawler <fred@cloudflare.com> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/r/20230109163906.706000-1-fred@cloudflare.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-09net/sched: act_mpls: Fix warning during failed attribute validationIdo Schimmel1-1/+7
The 'TCA_MPLS_LABEL' attribute is of 'NLA_U32' type, but has a validation type of 'NLA_VALIDATE_FUNCTION'. This is an invalid combination according to the comment above 'struct nla_policy': " Meaning of `validate' field, use via NLA_POLICY_VALIDATE_FN: NLA_BINARY Validation function called for the attribute. All other Unused - but note that it's a union " This can trigger the warning [1] in nla_get_range_unsigned() when validation of the attribute fails. Despite being of 'NLA_U32' type, the associated 'min'/'max' fields in the policy are negative as they are aliased by the 'validate' field. Fix by changing the attribute type to 'NLA_BINARY' which is consistent with the above comment and all other users of NLA_POLICY_VALIDATE_FN(). As a result, move the length validation to the validation function. No regressions in MPLS tests: # ./tdc.py -f tc-tests/actions/mpls.json [...] # echo $? 0 [1] WARNING: CPU: 0 PID: 17743 at lib/nlattr.c:118 nla_get_range_unsigned+0x1d8/0x1e0 lib/nlattr.c:117 Modules linked in: CPU: 0 PID: 17743 Comm: syz-executor.0 Not tainted 6.1.0-rc8 #3 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812dda519-prebuilt.qemu.org 04/01/2014 RIP: 0010:nla_get_range_unsigned+0x1d8/0x1e0 lib/nlattr.c:117 [...] Call Trace: <TASK> __netlink_policy_dump_write_attr+0x23d/0x990 net/netlink/policy.c:310 netlink_policy_dump_write_attr+0x22/0x30 net/netlink/policy.c:411 netlink_ack_tlv_fill net/netlink/af_netlink.c:2454 [inline] netlink_ack+0x546/0x760 net/netlink/af_netlink.c:2506 netlink_rcv_skb+0x1b7/0x240 net/netlink/af_netlink.c:2546 rtnetlink_rcv+0x18/0x20 net/core/rtnetlink.c:6109 netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline] netlink_unicast+0x5e9/0x6b0 net/netlink/af_netlink.c:1345 netlink_sendmsg+0x739/0x860 net/netlink/af_netlink.c:1921 sock_sendmsg_nosec net/socket.c:714 [inline] sock_sendmsg net/socket.c:734 [inline] ____sys_sendmsg+0x38f/0x500 net/socket.c:2482 ___sys_sendmsg net/socket.c:2536 [inline] __sys_sendmsg+0x197/0x230 net/socket.c:2565 __do_sys_sendmsg net/socket.c:2574 [inline] __se_sys_sendmsg net/socket.c:2572 [inline] __x64_sys_sendmsg+0x42/0x50 net/socket.c:2572 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd Link: https://lore.kernel.org/netdev/CAO4mrfdmjvRUNbDyP0R03_DrD_eFCLCguz6OxZ2TYRSv0K9gxA@mail.gmail.com/ Fixes: 2a2ea50870ba ("net: sched: add mpls manipulation actions to TC") Reported-by: Wei Chen <harperchen1110@gmail.com> Tested-by: Wei Chen <harperchen1110@gmail.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Link: https://lore.kernel.org/r/20230107171004.608436-1-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-05Merge tag 'net-6.2-rc3' of ↵Linus Torvalds4-15/+14
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from bpf, wifi, and netfilter. Current release - regressions: - bpf: fix nullness propagation for reg to reg comparisons, avoid null-deref - inet: control sockets should not use current thread task_frag - bpf: always use maximal size for copy_array() - eth: bnxt_en: don't link netdev to a devlink port for VFs Current release - new code bugs: - rxrpc: fix a couple of potential use-after-frees - netfilter: conntrack: fix IPv6 exthdr error check - wifi: iwlwifi: fw: skip PPAG for JF, avoid FW crashes - eth: dsa: qca8k: various fixes for the in-band register access - eth: nfp: fix schedule in atomic context when sync mc address - eth: renesas: rswitch: fix getting mac address from device tree - mobile: ipa: use proper endpoint mask for suspend Previous releases - regressions: - tcp: add TIME_WAIT sockets in bhash2, fix regression caught by Jiri / python tests - net: tc: don't intepret cls results when asked to drop, fix oob-access - vrf: determine the dst using the original ifindex for multicast - eth: bnxt_en: - fix XDP RX path if BPF adjusted packet length - fix HDS (header placement) and jumbo thresholds for RX packets - eth: ice: xsk: do not use xdp_return_frame() on tx_buf->raw_buf, avoid memory corruptions Previous releases - always broken: - ulp: prevent ULP without clone op from entering the LISTEN status - veth: fix race with AF_XDP exposing old or uninitialized descriptors - bpf: - pull before calling skb_postpull_rcsum() (fix checksum support and avoid a WARN()) - fix panic due to wrong pageattr of im->image (when livepatch and kretfunc coexist) - keep a reference to the mm, in case the task is dead - mptcp: fix deadlock in fastopen error path - netfilter: - nf_tables: perform type checking for existing sets - nf_tables: honor set timeout and garbage collection updates - ipset: fix hash:net,port,net hang with /0 subnet - ipset: avoid hung task warning when adding/deleting entries - selftests: net: - fix cmsg_so_mark.sh test hang on non-x86 systems - fix the arp_ndisc_evict_nocarrier test for IPv6 - usb: rndis_host: secure rndis_query check against int overflow - eth: r8169: fix dmar pte write access during suspend/resume with WOL - eth: lan966x: fix configuration of the PCS - eth: sparx5: fix reading of the MAC address - eth: qed: allow sleep in qed_mcp_trace_dump() - eth: hns3: - fix interrupts re-initialization after VF FLR - fix handling of promisc when MAC addr table gets full - refine the handling for VF heartbeat - eth: mlx5: - properly handle ingress QinQ-tagged packets on VST - fix io_eq_size and event_eq_size params validation on big endian - fix RoCE setting at HCA level if not supported at all - don't turn CQE compression on by default for IPoIB - eth: ena: - fix toeplitz initial hash key value - account for the number of XDP-processed bytes in interface stats - fix rx_copybreak value update Misc: - ethtool: harden phy stat handling against buggy drivers - docs: netdev: convert maintainer's doc from FAQ to a normal document" * tag 'net-6.2-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (112 commits) caif: fix memory leak in cfctrl_linkup_request() inet: control sockets should not use current thread task_frag net/ulp: prevent ULP without clone op from entering the LISTEN status qed: allow sleep in qed_mcp_trace_dump() MAINTAINERS: Update maintainers for ptp_vmw driver usb: rndis_host: Secure rndis_query check against int overflow net: dpaa: Fix dtsec check for PCS availability octeontx2-pf: Fix lmtst ID used in aura free drivers/net/bonding/bond_3ad: return when there's no aggregator netfilter: ipset: Rework long task execution when adding/deleting entries netfilter: ipset: fix hash:net,port,net hang with /0 subnet net: sparx5: Fix reading of the MAC address vxlan: Fix memory leaks in error path net: sched: htb: fix htb_classify() kernel-doc net: sched: cbq: dont intepret cls results when asked to drop net: sched: atm: dont intepret cls results when asked to drop dt-bindings: net: marvell,orion-mdio: Fix examples dt-bindings: net: sun8i-emac: Add phy-supply property net: ipa: use proper endpoint mask for suspend selftests: net: return non-zero for failures reported in arp_ndisc_evict_nocarrier ...
2023-01-02net: sched: htb: fix htb_classify() kernel-docRandy Dunlap1-2/+6
Fix W=1 kernel-doc warning: net/sched/sch_htb.c:214: warning: expecting prototype for htb_classify(). Prototype was for HTB_DIRECT() instead by moving the HTB_DIRECT() macro above the function. Add kernel-doc notation for function parameters as well. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Jiri Pirko <jiri@resnulli.us> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-02net: sched: cbq: dont intepret cls results when asked to dropJamal Hadi Salim1-2/+2
If asked to drop a packet via TC_ACT_SHOT it is unsafe to assume that res.class contains a valid pointer Sample splat reported by Kyle Zeng [ 5.405624] 0: reclassify loop, rule prio 0, protocol 800 [ 5.406326] ================================================================== [ 5.407240] BUG: KASAN: slab-out-of-bounds in cbq_enqueue+0x54b/0xea0 [ 5.407987] Read of size 1 at addr ffff88800e3122aa by task poc/299 [ 5.408731] [ 5.408897] CPU: 0 PID: 299 Comm: poc Not tainted 5.10.155+ #15 [ 5.409516] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 [ 5.410439] Call Trace: [ 5.410764] dump_stack+0x87/0xcd [ 5.411153] print_address_description+0x7a/0x6b0 [ 5.411687] ? vprintk_func+0xb9/0xc0 [ 5.411905] ? printk+0x76/0x96 [ 5.412110] ? cbq_enqueue+0x54b/0xea0 [ 5.412323] kasan_report+0x17d/0x220 [ 5.412591] ? cbq_enqueue+0x54b/0xea0 [ 5.412803] __asan_report_load1_noabort+0x10/0x20 [ 5.413119] cbq_enqueue+0x54b/0xea0 [ 5.413400] ? __kasan_check_write+0x10/0x20 [ 5.413679] __dev_queue_xmit+0x9c0/0x1db0 [ 5.413922] dev_queue_xmit+0xc/0x10 [ 5.414136] ip_finish_output2+0x8bc/0xcd0 [ 5.414436] __ip_finish_output+0x472/0x7a0 [ 5.414692] ip_finish_output+0x5c/0x190 [ 5.414940] ip_output+0x2d8/0x3c0 [ 5.415150] ? ip_mc_finish_output+0x320/0x320 [ 5.415429] __ip_queue_xmit+0x753/0x1760 [ 5.415664] ip_queue_xmit+0x47/0x60 [ 5.415874] __tcp_transmit_skb+0x1ef9/0x34c0 [ 5.416129] tcp_connect+0x1f5e/0x4cb0 [ 5.416347] tcp_v4_connect+0xc8d/0x18c0 [ 5.416577] __inet_stream_connect+0x1ae/0xb40 [ 5.416836] ? local_bh_enable+0x11/0x20 [ 5.417066] ? lock_sock_nested+0x175/0x1d0 [ 5.417309] inet_stream_connect+0x5d/0x90 [ 5.417548] ? __inet_stream_connect+0xb40/0xb40 [ 5.417817] __sys_connect+0x260/0x2b0 [ 5.418037] __x64_sys_connect+0x76/0x80 [ 5.418267] do_syscall_64+0x31/0x50 [ 5.418477] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [ 5.418770] RIP: 0033:0x473bb7 [ 5.418952] Code: 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2a 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 18 89 54 24 0c 48 89 34 24 89 [ 5.420046] RSP: 002b:00007fffd20eb0f8 EFLAGS: 00000246 ORIG_RAX: 000000000000002a [ 5.420472] RAX: ffffffffffffffda RBX: 00007fffd20eb578 RCX: 0000000000473bb7 [ 5.420872] RDX: 0000000000000010 RSI: 00007fffd20eb110 RDI: 0000000000000007 [ 5.421271] RBP: 00007fffd20eb150 R08: 0000000000000001 R09: 0000000000000004 [ 5.421671] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001 [ 5.422071] R13: 00007fffd20eb568 R14: 00000000004fc740 R15: 0000000000000002 [ 5.422471] [ 5.422562] Allocated by task 299: [ 5.422782] __kasan_kmalloc+0x12d/0x160 [ 5.423007] kasan_kmalloc+0x5/0x10 [ 5.423208] kmem_cache_alloc_trace+0x201/0x2e0 [ 5.423492] tcf_proto_create+0x65/0x290 [ 5.423721] tc_new_tfilter+0x137e/0x1830 [ 5.423957] rtnetlink_rcv_msg+0x730/0x9f0 [ 5.424197] netlink_rcv_skb+0x166/0x300 [ 5.424428] rtnetlink_rcv+0x11/0x20 [ 5.424639] netlink_unicast+0x673/0x860 [ 5.424870] netlink_sendmsg+0x6af/0x9f0 [ 5.425100] __sys_sendto+0x58d/0x5a0 [ 5.425315] __x64_sys_sendto+0xda/0xf0 [ 5.425539] do_syscall_64+0x31/0x50 [ 5.425764] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [ 5.426065] [ 5.426157] The buggy address belongs to the object at ffff88800e312200 [ 5.426157] which belongs to the cache kmalloc-128 of size 128 [ 5.426955] The buggy address is located 42 bytes to the right of [ 5.426955] 128-byte region [ffff88800e312200, ffff88800e312280) [ 5.427688] The buggy address belongs to the page: [ 5.427992] page:000000009875fabc refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0xe312 [ 5.428562] flags: 0x100000000000200(slab) [ 5.428812] raw: 0100000000000200 dead000000000100 dead000000000122 ffff888007843680 [ 5.429325] raw: 0000000000000000 0000000000100010 00000001ffffffff ffff88800e312401 [ 5.429875] page dumped because: kasan: bad access detected [ 5.430214] page->mem_cgroup:ffff88800e312401 [ 5.430471] [ 5.430564] Memory state around the buggy address: [ 5.430846] ffff88800e312180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 5.431267] ffff88800e312200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 fc [ 5.431705] >ffff88800e312280: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 5.432123] ^ [ 5.432391] ffff88800e312300: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 fc [ 5.432810] ffff88800e312380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 5.433229] ================================================================== [ 5.433648] Disabling lock debugging due to kernel taint Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: Kyle Zeng <zengyhkyle@gmail.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-02net: sched: atm: dont intepret cls results when asked to dropJamal Hadi Salim1-1/+4
If asked to drop a packet via TC_ACT_SHOT it is unsafe to assume res.class contains a valid pointer Fixes: b0188d4dbe5f ("[NET_SCHED]: sch_atm: Lindent") Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-26net: sched: fix memory leak in tcindex_set_parmsHawkins Jiawei1-10/+2
Syzkaller reports a memory leak as follows: ==================================== BUG: memory leak unreferenced object 0xffff88810c287f00 (size 256): comm "syz-executor105", pid 3600, jiffies 4294943292 (age 12.990s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffff814cf9f0>] kmalloc_trace+0x20/0x90 mm/slab_common.c:1046 [<ffffffff839c9e07>] kmalloc include/linux/slab.h:576 [inline] [<ffffffff839c9e07>] kmalloc_array include/linux/slab.h:627 [inline] [<ffffffff839c9e07>] kcalloc include/linux/slab.h:659 [inline] [<ffffffff839c9e07>] tcf_exts_init include/net/pkt_cls.h:250 [inline] [<ffffffff839c9e07>] tcindex_set_parms+0xa7/0xbe0 net/sched/cls_tcindex.c:342 [<ffffffff839caa1f>] tcindex_change+0xdf/0x120 net/sched/cls_tcindex.c:553 [<ffffffff8394db62>] tc_new_tfilter+0x4f2/0x1100 net/sched/cls_api.c:2147 [<ffffffff8389e91c>] rtnetlink_rcv_msg+0x4dc/0x5d0 net/core/rtnetlink.c:6082 [<ffffffff839eba67>] netlink_rcv_skb+0x87/0x1d0 net/netlink/af_netlink.c:2540 [<ffffffff839eab87>] netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline] [<ffffffff839eab87>] netlink_unicast+0x397/0x4c0 net/netlink/af_netlink.c:1345 [<ffffffff839eb046>] netlink_sendmsg+0x396/0x710 net/netlink/af_netlink.c:1921 [<ffffffff8383e796>] sock_sendmsg_nosec net/socket.c:714 [inline] [<ffffffff8383e796>] sock_sendmsg+0x56/0x80 net/socket.c:734 [<ffffffff8383eb08>] ____sys_sendmsg+0x178/0x410 net/socket.c:2482 [<ffffffff83843678>] ___sys_sendmsg+0xa8/0x110 net/socket.c:2536 [<ffffffff838439c5>] __sys_sendmmsg+0x105/0x330 net/socket.c:2622 [<ffffffff83843c14>] __do_sys_sendmmsg net/socket.c:2651 [inline] [<ffffffff83843c14>] __se_sys_sendmmsg net/socket.c:2648 [inline] [<ffffffff83843c14>] __x64_sys_sendmmsg+0x24/0x30 net/socket.c:2648 [<ffffffff84605fd5>] do_syscall_x64 arch/x86/entry/common.c:50 [inline] [<ffffffff84605fd5>] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 [<ffffffff84800087>] entry_SYSCALL_64_after_hwframe+0x63/0xcd ==================================== Kernel uses tcindex_change() to change an existing filter properties. Yet the problem is that, during the process of changing, if `old_r` is retrieved from `p->perfect`, then kernel uses tcindex_alloc_perfect_hash() to newly allocate filter results, uses tcindex_filter_result_init() to clear the old filter result, without destroying its tcf_exts structure, which triggers the above memory leak. To be more specific, there are only two source for the `old_r`, according to the tcindex_lookup(). `old_r` is retrieved from `p->perfect`, or `old_r` is retrieved from `p->h`. * If `old_r` is retrieved from `p->perfect`, kernel uses tcindex_alloc_perfect_hash() to newly allocate the filter results. Then `r` is assigned with `cp->perfect + handle`, which is newly allocated. So condition `old_r && old_r != r` is true in this situation, and kernel uses tcindex_filter_result_init() to clear the old filter result, without destroying its tcf_exts structure * If `old_r` is retrieved from `p->h`, then `p->perfect` is NULL according to the tcindex_lookup(). Considering that `cp->h` is directly copied from `p->h` and `p->perfect` is NULL, `r` is assigned with `tcindex_lookup(cp, handle)`, whose value should be the same as `old_r`, so condition `old_r && old_r != r` is false in this situation, kernel ignores using tcindex_filter_result_init() to clear the old filter result. So only when `old_r` is retrieved from `p->perfect` does kernel use tcindex_filter_result_init() to clear the old filter result, which triggers the above memory leak. Considering that there already exists a tc_filter_wq workqueue to destroy the old tcindex_data by tcindex_partial_destroy_work() at the end of tcindex_set_parms(), this patch solves this memory leak bug by removing this old filter result clearing part and delegating it to the tc_filter_wq workqueue. Note that this patch doesn't introduce any other issues. If `old_r` is retrieved from `p->perfect`, this patch just delegates old filter result clearing part to the tc_filter_wq workqueue; If `old_r` is retrieved from `p->h`, kernel doesn't reach the old filter result clearing part, so removing this part has no effect. [Thanks to the suggestion from Jakub Kicinski, Cong Wang, Paolo Abeni and Dmitry Vyukov] Fixes: b9a24bb76bf6 ("net_sched: properly handle failure case of tcf_exts_init()") Link: https://lore.kernel.org/all/0000000000001de5c505ebc9ec59@google.com/ Reported-by: syzbot+232ebdbd36706c965ebf@syzkaller.appspotmail.com Tested-by: syzbot+232ebdbd36706c965ebf@syzkaller.appspotmail.com Cc: Cong Wang <cong.wang@bytedance.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Dmitry Vyukov <dvyukov@google.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Hawkins Jiawei <yin31149@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-25treewide: Convert del_timer*() to timer_shutdown*()Steven Rostedt (Google)1-1/+1
Due to several bugs caused by timers being re-armed after they are shutdown and just before they are freed, a new state of timers was added called "shutdown". After a timer is set to this state, then it can no longer be re-armed. The following script was run to find all the trivial locations where del_timer() or del_timer_sync() is called in the same function that the object holding the timer is freed. It also ignores any locations where the timer->function is modified between the del_timer*() and the free(), as that is not considered a "trivial" case. This was created by using a coccinelle script and the following commands: $ cat timer.cocci @@ expression ptr, slab; identifier timer, rfield; @@ ( - del_timer(&ptr->timer); + timer_shutdown(&ptr->timer); | - del_timer_sync(&ptr->timer); + timer_shutdown_sync(&ptr->timer); ) ... when strict when != ptr->timer ( kfree_rcu(ptr, rfield); | kmem_cache_free(slab, ptr); | kfree(ptr); ) $ spatch timer.cocci . > /tmp/t.patch $ patch -p1 < /tmp/t.patch Link: https://lore.kernel.org/lkml/20221123201306.823305113@linutronix.de/ Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Acked-by: Pavel Machek <pavel@ucw.cz> [ LED ] Acked-by: Kalle Valo <kvalo@kernel.org> [ wireless ] Acked-by: Paolo Abeni <pabeni@redhat.com> [ networking ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-12-19net_sched: reject TCF_EM_SIMPLE case for complex ematch moduleCong Wang1-0/+2
When TCF_EM_SIMPLE was introduced, it is supposed to be convenient for ematch implementation: https://lore.kernel.org/all/20050105110048.GO26856@postel.suug.ch/ "You don't have to, providing a 32bit data chunk without TCF_EM_SIMPLE set will simply result in allocating & copy. It's an optimization, nothing more." So if an ematch module provides ops->datalen that means it wants a complex data structure (saved in its em->data) instead of a simple u32 value. We should simply reject such a combination, otherwise this u32 could be misinterpreted as a pointer. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-and-tested-by: syzbot+4caeae4c7103813598ae@syzkaller.appspotmail.com Reported-by: Jun Nie <jun.nie@linaro.org> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-13Merge tag 'net-next-6.2' of ↵Linus Torvalds37-215/+261
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Paolo Abeni: "Core: - Allow live renaming when an interface is up - Add retpoline wrappers for tc, improving considerably the performances of complex queue discipline configurations - Add inet drop monitor support - A few GRO performance improvements - Add infrastructure for atomic dev stats, addressing long standing data races - De-duplicate common code between OVS and conntrack offloading infrastructure - A bunch of UBSAN_BOUNDS/FORTIFY_SOURCE improvements - Netfilter: introduce packet parser for tunneled packets - Replace IPVS timer-based estimators with kthreads to scale up the workload with the number of available CPUs - Add the helper support for connection-tracking OVS offload BPF: - Support for user defined BPF objects: the use case is to allocate own objects, build own object hierarchies and use the building blocks to build own data structures flexibly, for example, linked lists in BPF - Make cgroup local storage available to non-cgroup attached BPF programs - Avoid unnecessary deadlock detection and failures wrt BPF task storage helpers - A relevant bunch of BPF verifier fixes and improvements - Veristat tool improvements to support custom filtering, sorting, and replay of results - Add LLVM disassembler as default library for dumping JITed code - Lots of new BPF documentation for various BPF maps - Add bpf_rcu_read_{,un}lock() support for sleepable programs - Add RCU grace period chaining to BPF to wait for the completion of access from both sleepable and non-sleepable BPF programs - Add support storing struct task_struct objects as kptrs in maps - Improve helper UAPI by explicitly defining BPF_FUNC_xxx integer values - Add libbpf *_opts API-variants for bpf_*_get_fd_by_id() functions Protocols: - TCP: implement Protective Load Balancing across switch links - TCP: allow dynamically disabling TCP-MD5 static key, reverting back to fast[er]-path - UDP: In