Age | Commit message (Collapse) | Author | Files | Lines |
|
[ Upstream commit 9d85a6f44bd5585761947f40f7821c9cd78a1bbe ]
The 4th parameter in tc_chain_notify() should be flags rather than seq.
Let's change it back correctly.
Fixes: 32a4f5ecd738 ("net: sched: introduce chain object to uapi")
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 727d6a8b7ef3d25080fad228b2c4a1d4da5999c6 ]
Currently tcf_skbmod_act() assumes that packets use Ethernet as their L2
protocol, which is not always the case. As an example, for CAN devices:
$ ip link add dev vcan0 type vcan
$ ip link set up vcan0
$ tc qdisc add dev vcan0 root handle 1: htb
$ tc filter add dev vcan0 parent 1: protocol ip prio 10 \
matchall action skbmod swap mac
Doing the above silently corrupts all the packets. Do not perform skbmod
actions for non-Ethernet packets.
Fixes: 86da71b57383 ("net_sched: Introduce skbmod action")
Reviewed-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit f5051bcece50140abd1a11a2d36dc3ec5484fc32 ]
Syzbot reported memory leak in tcindex_set_parms(). The problem was in
non-freed perfect hash in tcindex_partial_destroy_work().
In tcindex_set_parms() new tcindex_data is allocated and some fields from
old one are copied to new one, but not the perfect hash. Since
tcindex_partial_destroy_work() is the destroy function for old
tcindex_data, we need to free perfect hash to avoid memory leak.
Reported-and-tested-by: syzbot+f0bbb2287b8993d4fa74@syzkaller.appspotmail.com
Fixes: 331b72922c5f ("net: sched: RCU cls_tcindex")
Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 77ac5e40c44eb78333fbc38482d61fc2af7dda0a upstream.
When cleaning up the nf_table in tcf_ct_flow_table_cleanup_work
there is no guarantee that the callback list, added to by
nf_flow_table_offload_add_cb, is empty. This means that it is
possible that the flow_block_cb memory allocated will be lost.
Fix this by iterating the list and freeing the flow_block_cb entries
before freeing the nf_table entry (via freeing ct_ft).
Fixes: 978703f42549 ("netfilter: flowtable: Add API for registering to flow table events")
Signed-off-by: Louis Peens <louis.peens@corigine.com>
Signed-off-by: Yinjun Zhang <yinjun.zhang@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 8955b90c3cdad199137809aac8ccbbb585355913 upstream.
The confirm operation should be checked. If there are any failed,
the packet should be dropped like in ovs and netfilter.
Fixes: b57dc7c13ea9 ("net/sched: Introduce action ct")
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 55d96f72e8ddc0a294e0b9c94016edbb699537e1 ]
When nla_put_u32() fails, 'ret' could be 0, it should
return error code in tcf_del_walker().
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 05ff8435e50569a0a6b95e5ceaea43696e8827ab ]
modern userspace applications, like OVN, can configure the TC datapath to
"recirculate" packets several times. If more than 4 "recirculation" rules
are configured, packets can be dropped by __tcf_classify().
Changing the maximum number of reclassifications (from 4 to 16) should be
sufficient to prevent drops in most use cases, and guard against loops at
the same time.
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 3f2db250099f46988088800052cdf2332c7aba61 ]
Syzbot reported warning in tcindex_alloc_perfect_hash. The problem
was in too big cp->hash, which triggers warning in kmalloc. Since
cp->hash comes from userspace, there is no need to warn if value
is not correct
Fixes: b9a24bb76bf6 ("net_sched: properly handle failure case of tcf_exts_init()")
Reported-and-tested-by: syzbot+1071ad60cd7df39fdadb@syzkaller.appspotmail.com
Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
Acked-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 0cd58e5c53babb9237b741dbef711f0a9eb6d3fd ]
If qfq_change_class() is unable to allocate memory for qfq_aggregate,
it frees the class that has been inserted in the class hash table,
but does not unhash it.
Defer the insertion after the problematic allocation.
BUG: KASAN: use-after-free in hlist_add_head include/linux/list.h:884 [inline]
BUG: KASAN: use-after-free in qdisc_class_hash_insert+0x200/0x210 net/sched/sch_api.c:731
Write of size 8 at addr ffff88814a534f10 by task syz-executor.4/31478
CPU: 0 PID: 31478 Comm: syz-executor.4 Not tainted 5.13.0-rc6-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:79 [inline]
dump_stack+0x141/0x1d7 lib/dump_stack.c:120
print_address_description.constprop.0.cold+0x5b/0x2f8 mm/kasan/report.c:233
__kasan_report mm/kasan/report.c:419 [inline]
kasan_report.cold+0x7c/0xd8 mm/kasan/report.c:436
hlist_add_head include/linux/list.h:884 [inline]
qdisc_class_hash_insert+0x200/0x210 net/sched/sch_api.c:731
qfq_change_class+0x96c/0x1990 net/sched/sch_qfq.c:489
tc_ctl_tclass+0x514/0xe50 net/sched/sch_api.c:2113
rtnetlink_rcv_msg+0x44e/0xad0 net/core/rtnetlink.c:5564
netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2504
netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline]
netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1340
netlink_sendmsg+0x856/0xd90 net/netlink/af_netlink.c:1929
sock_sendmsg_nosec net/socket.c:654 [inline]
sock_sendmsg+0xcf/0x120 net/socket.c:674
____sys_sendmsg+0x6e8/0x810 net/socket.c:2350
___sys_sendmsg+0xf3/0x170 net/socket.c:2404
__sys_sendmsg+0xe5/0x1b0 net/socket.c:2433
do_syscall_64+0x3a/0xb0 arch/x86/entry/common.c:47
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x4665d9
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fdc7b5f0188 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 000000000056bf80 RCX: 00000000004665d9
RDX: 0000000000000000 RSI: 00000000200001c0 RDI: 0000000000000003
RBP: 00007fdc7b5f01d0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002
R13: 00007ffcf7310b3f R14: 00007fdc7b5f0300 R15: 0000000000022000
Allocated by task 31445:
kasan_save_stack+0x1b/0x40 mm/kasan/common.c:38
kasan_set_track mm/kasan/common.c:46 [inline]
set_alloc_info mm/kasan/common.c:428 [inline]
____kasan_kmalloc mm/kasan/common.c:507 [inline]
____kasan_kmalloc mm/kasan/common.c:466 [inline]
__kasan_kmalloc+0x9b/0xd0 mm/kasan/common.c:516
kmalloc include/linux/slab.h:556 [inline]
kzalloc include/linux/slab.h:686 [inline]
qfq_change_class+0x705/0x1990 net/sched/sch_qfq.c:464
tc_ctl_tclass+0x514/0xe50 net/sched/sch_api.c:2113
rtnetlink_rcv_msg+0x44e/0xad0 net/core/rtnetlink.c:5564
netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2504
netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline]
netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1340
netlink_sendmsg+0x856/0xd90 net/netlink/af_netlink.c:1929
sock_sendmsg_nosec net/socket.c:654 [inline]
sock_sendmsg+0xcf/0x120 net/socket.c:674
____sys_sendmsg+0x6e8/0x810 net/socket.c:2350
___sys_sendmsg+0xf3/0x170 net/socket.c:2404
__sys_sendmsg+0xe5/0x1b0 net/socket.c:2433
do_syscall_64+0x3a/0xb0 arch/x86/entry/common.c:47
entry_SYSCALL_64_after_hwframe+0x44/0xae
Freed by task 31445:
kasan_save_stack+0x1b/0x40 mm/kasan/common.c:38
kasan_set_track+0x1c/0x30 mm/kasan/common.c:46
kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:357
____kasan_slab_free mm/kasan/common.c:360 [inline]
____kasan_slab_free mm/kasan/common.c:325 [inline]
__kasan_slab_free+0xfb/0x130 mm/kasan/common.c:368
kasan_slab_free include/linux/kasan.h:212 [inline]
slab_free_hook mm/slub.c:1583 [inline]
slab_free_freelist_hook+0xdf/0x240 mm/slub.c:1608
slab_free mm/slub.c:3168 [inline]
kfree+0xe5/0x7f0 mm/slub.c:4212
qfq_change_class+0x10fb/0x1990 net/sched/sch_qfq.c:518
tc_ctl_tclass+0x514/0xe50 net/sched/sch_api.c:2113
rtnetlink_rcv_msg+0x44e/0xad0 net/core/rtnetlink.c:5564
netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2504
netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline]
netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1340
netlink_sendmsg+0x856/0xd90 net/netlink/af_netlink.c:1929
sock_sendmsg_nosec net/socket.c:654 [inline]
sock_sendmsg+0xcf/0x120 net/socket.c:674
____sys_sendmsg+0x6e8/0x810 net/socket.c:2350
___sys_sendmsg+0xf3/0x170 net/socket.c:2404
__sys_sendmsg+0xe5/0x1b0 net/socket.c:2433
do_syscall_64+0x3a/0xb0 arch/x86/entry/common.c:47
entry_SYSCALL_64_after_hwframe+0x44/0xae
The buggy address belongs to the object at ffff88814a534f00
which belongs to the cache kmalloc-128 of size 128
The buggy address is located 16 bytes inside of
128-byte region [ffff88814a534f00, ffff88814a534f80)
The buggy address belongs to the page:
page:ffffea0005294d00 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x14a534
flags: 0x57ff00000000200(slab|node=1|zone=2|lastcpupid=0x7ff)
raw: 057ff00000000200 ffffea00004fee00 0000000600000006 ffff8880110418c0
raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 0, migratetype Unmovable, gfp_mask 0x12cc0(GFP_KERNEL|__GFP_NOWARN|__GFP_NORETRY), pid 29797, ts 604817765317, free_ts 604810151744
prep_new_page mm/page_alloc.c:2358 [inline]
get_page_from_freelist+0x1033/0x2b60 mm/page_alloc.c:3994
__alloc_pages+0x1b2/0x500 mm/page_alloc.c:5200
alloc_pages+0x18c/0x2a0 mm/mempolicy.c:2272
alloc_slab_page mm/slub.c:1646 [inline]
allocate_slab+0x2c5/0x4c0 mm/slub.c:1786
new_slab mm/slub.c:1849 [inline]
new_slab_objects mm/slub.c:2595 [inline]
___slab_alloc+0x4a1/0x810 mm/slub.c:2758
__slab_alloc.constprop.0+0xa7/0xf0 mm/slub.c:2798
slab_alloc_node mm/slub.c:2880 [inline]
slab_alloc mm/slub.c:2922 [inline]
__kmalloc+0x315/0x330 mm/slub.c:4050
kmalloc include/linux/slab.h:561 [inline]
kzalloc include/linux/slab.h:686 [inline]
__register_sysctl_table+0x112/0x1090 fs/proc/proc_sysctl.c:1318
mpls_dev_sysctl_register+0x1b7/0x2d0 net/mpls/af_mpls.c:1421
mpls_add_dev net/mpls/af_mpls.c:1472 [inline]
mpls_dev_notify+0x214/0x8b0 net/mpls/af_mpls.c:1588
notifier_call_chain+0xb5/0x200 kernel/notifier.c:83
call_netdevice_notifiers_info+0xb5/0x130 net/core/dev.c:2121
call_netdevice_notifiers_extack net/core/dev.c:2133 [inline]
call_netdevice_notifiers net/core/dev.c:2147 [inline]
register_netdevice+0x106b/0x1500 net/core/dev.c:10312
veth_newlink+0x585/0xac0 drivers/net/veth.c:1547
__rtnl_newlink+0x1062/0x1710 net/core/rtnetlink.c:3452
rtnl_newlink+0x64/0xa0 net/core/rtnetlink.c:3500
page last free stack trace:
reset_page_owner include/linux/page_owner.h:24 [inline]
free_pages_prepare mm/page_alloc.c:1298 [inline]
free_pcp_prepare+0x223/0x300 mm/page_alloc.c:1342
free_unref_page_prepare mm/page_alloc.c:3250 [inline]
free_unref_page+0x12/0x1d0 mm/page_alloc.c:3298
__vunmap+0x783/0xb60 mm/vmalloc.c:2566
free_work+0x58/0x70 mm/vmalloc.c:80
process_one_work+0x98d/0x1600 kernel/workqueue.c:2276
worker_thread+0x64c/0x1120 kernel/workqueue.c:2422
kthread+0x3b1/0x4a0 kernel/kthread.c:313
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
Memory state around the buggy address:
ffff88814a534e00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88814a534e80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff88814a534f00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff88814a534f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff88814a535000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Fixes: 462dbc9101acd ("pkt_sched: QFQ Plus: fair-queueing service at DRR cost")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 9c5eee0afca09cbde6bd00f77876754aaa552970 ]
Currently vlan modification action checks existence of vlan priority by
comparing it to 0. Therefore it is impossible to modify existing vlan
tag to have priority 0.
For example, the following tc command will change the vlan id but will
not affect vlan priority:
tc filter add dev eth1 ingress matchall action vlan modify id 300 \
priority 0 pipe mirred egress redirect dev eth2
The incoming packet on eth1:
ethertype 802.1Q (0x8100), vlan 200, p 4, ethertype IPv4
will be changed to:
ethertype 802.1Q (0x8100), vlan 300, p 4, ethertype IPv4
although the user has intended to have p == 0.
The fix is to add tcfv_push_prio_exists flag to struct tcf_vlan_params
and rely on it when deciding to set the priority.
Fixes: 45a497f2d149a4a8061c (net/sched: act_vlan: Introduce TCA_VLAN_ACT_MODIFY vlan action)
Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit ba91c49dedbde758ba0b72f57ac90b06ddf8e548 ]
The TCP option parser in cake qdisc (cake_get_tcpopt and
cake_tcph_may_drop) could read one byte out of bounds. When the length
is 1, the execution flow gets into the loop, reads one byte of the
opcode, and if the opcode is neither TCPOPT_EOL nor TCPOPT_NOP, it reads
one more byte, which exceeds the length of 1.
This fix is inspired by commit 9609dad263f8 ("ipv4: tcp_input: fix stack
out of bounds when parsing TCP options.").
v2 changes:
Added doff validation in cake_get_tcphdr to avoid parsing garbage as TCP
header. Although it wasn't strictly an out-of-bounds access (memory was
allocated), garbage values could be read where CAKE expected the TCP
header if doff was smaller than 5.
Cc: Young Xiao <92siuyang@gmail.com>
Fixes: 8b7138814f29 ("sch_cake: Add optional ACK filter")
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 13c62f5371e3eb4fc3400cfa26e64ca75f888008 ]
This this the counterpart of 8aa7b526dc0b ("openvswitch: handle DNAT
tuple collision") for act_ct. From that commit changelog:
"""
With multiple DNAT rules it's possible that after destination
translation the resulting tuples collide.
...
Netfilter handles this case by allocating a null binding for SNAT at
egress by default. Perform the same operation in openvswitch for DNAT
if no explicit SNAT is requested by the user and allocate a null binding
for SNAT for packets in the "original" direction.
"""
Fixes: 95219afbb980 ("act_ct: support asymmetric conntrack")
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit fb91702b743dec78d6507c53a2dec8a8883f509d ]
Fix current behavior of skipping template allocation in case the
ct action is in zone 0.
Skipping the allocation may cause the datapath ct code to ignore the
entire ct action with all its attributes (commit, nat) in case the ct
action in zone 0 was preceded by a ct clear action.
The ct clear action sets the ct_state to untracked and resets the
skb->_nfct pointer. Under these conditions and without an allocated
ct template, the skb->_nfct pointer will remain NULL which will
cause the tc ct action handler to exit without handling commit and nat
actions, if such exist.
For example, the following rule in OVS dp:
recirc_id(0x2),ct_state(+new-est-rel-rpl+trk),ct_label(0/0x1), \
in_port(eth0),actions:ct_clear,ct(commit,nat(src=10.11.0.12)), \
recirc(0x37a)
Will result in act_ct skipping the commit and nat actions in zone 0.
The change removes the skipping of template allocation for zone 0 and
treats it the same as any other zone.
Fixes: b57dc7c13ea9 ("net/sched: Introduce action ct")
Signed-off-by: Ariel Levkovich <lariel@nvidia.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Link: https://lore.kernel.org/r/20210526170110.54864-1-lariel@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 0cc254e5aa37cf05f65bcdcdc0ac5c58010feb33 ]
Currently established connections are not offloaded if the filter has a
"ct commit" action. This behavior will not offload connections of the
following scenario:
$ tc_filter add dev $DEV ingress protocol ip prio 1 flower \
ct_state -trk \
action ct commit action goto chain 1
$ tc_filter add dev $DEV ingress protocol ip chain 1 prio 1 flower \
action mirred egress redirect dev $DEV2
$ tc_filter add dev $DEV2 ingress protocol ip prio 1 flower \
action ct commit action goto chain 1
$ tc_filter add dev $DEV2 ingress protocol ip prio 1 chain 1 flower \
ct_state +trk+est \
action mirred egress redirect dev $DEV
Offload established connections, regardless of the commit flag.
Fixes: 46475bb20f4b ("net/sched: act_ct: Software offload of established flows")
Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Link: https://lore.kernel.org/r/1622029449-27060-1-git-send-email-paulb@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 9453d45ecb6c2199d72e73c993e9d98677a2801b ]
Function skb_ext_add() doesn't initialize created skb extension with any
value and leaves it up to the user. However, since extension of type
TC_SKB_EXT originally contained only single value tc_skb_ext->chain its
users used to just assign the chain value without setting whole extension
memory to zero first. This assumption changed when TC_SKB_EXT extension was
extended with additional fields but not all users were updated to
initialize the new fields which leads to use of uninitialized memory
afterwards. UBSAN log:
[ 778.299821] UBSAN: invalid-load in net/openvswitch/flow.c:899:28
[ 778.301495] load of value 107 is not a valid value for type '_Bool'
[ 778.303215] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.12.0-rc7+ #2
[ 778.304933] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[ 778.307901] Call Trace:
[ 778.308680] <IRQ>
[ 778.309358] dump_stack+0xbb/0x107
[ 778.310307] ubsan_epilogue+0x5/0x40
[ 778.311167] __ubsan_handle_load_invalid_value.cold+0x43/0x48
[ 778.312454] ? memset+0x20/0x40
[ 778.313230] ovs_flow_key_extract.cold+0xf/0x14 [openvswitch]
[ 778.314532] ovs_vport_receive+0x19e/0x2e0 [openvswitch]
[ 778.315749] ? ovs_vport_find_upcall_portid+0x330/0x330 [openvswitch]
[ 778.317188] ? create_prof_cpu_mask+0x20/0x20
[ 778.318220] ? arch_stack_walk+0x82/0xf0
[ 778.319153] ? secondary_startup_64_no_verify+0xb0/0xbb
[ 778.320399] ? stack_trace_save+0x91/0xc0
[ 778.321362] ? stack_trace_consume_entry+0x160/0x160
[ 778.322517] ? lock_release+0x52e/0x760
[ 778.323444] netdev_frame_hook+0x323/0x610 [openvswitch]
[ 778.324668] ? ovs_netdev_get_vport+0xe0/0xe0 [openvswitch]
[ 778.325950] __netif_receive_skb_core+0x771/0x2db0
[ 778.327067] ? lock_downgrade+0x6e0/0x6f0
[ 778.328021] ? lock_acquire+0x565/0x720
[ 778.328940] ? generic_xdp_tx+0x4f0/0x4f0
[ 778.329902] ? inet_gro_receive+0x2a7/0x10a0
[ 778.330914] ? lock_downgrade+0x6f0/0x6f0
[ 778.331867] ? udp4_gro_receive+0x4c4/0x13e0
[ 778.332876] ? lock_release+0x52e/0x760
[ 778.333808] ? dev_gro_receive+0xcc8/0x2380
[ 778.334810] ? lock_downgrade+0x6f0/0x6f0
[ 778.335769] __netif_receive_skb_list_core+0x295/0x820
[ 778.336955] ? process_backlog+0x780/0x780
[ 778.337941] ? mlx5e_rep_tc_netdevice_event_unregister+0x20/0x20 [mlx5_core]
[ 778.339613] ? seqcount_lockdep_reader_access.constprop.0+0xa7/0xc0
[ 778.341033] ? kvm_clock_get_cycles+0x14/0x20
[ 778.342072] netif_receive_skb_list_internal+0x5f5/0xcb0
[ 778.343288] ? __kasan_kmalloc+0x7a/0x90
[ 778.344234] ? mlx5e_handle_rx_cqe_mpwrq+0x9e0/0x9e0 [mlx5_core]
[ 778.345676] ? mlx5e_xmit_xdp_frame_mpwqe+0x14d0/0x14d0 [mlx5_core]
[ 778.347140] ? __netif_receive_skb_list_core+0x820/0x820
[ 778.348351] ? mlx5e_post_rx_mpwqes+0xa6/0x25d0 [mlx5_core]
[ 778.349688] ? napi_gro_flush+0x26c/0x3c0
[ 778.350641] napi_complete_done+0x188/0x6b0
[ 778.351627] mlx5e_napi_poll+0x373/0x1b80 [mlx5_core]
[ 778.352853] __napi_poll+0x9f/0x510
[ 778.353704] ? mlx5_flow_namespace_set_mode+0x260/0x260 [mlx5_core]
[ 778.355158] net_rx_action+0x34c/0xa40
[ 778.356060] ? napi_threaded_poll+0x3d0/0x3d0
[ 778.357083] ? sched_clock_cpu+0x18/0x190
[ 778.358041] ? __common_interrupt+0x8e/0x1a0
[ 778.359045] __do_softirq+0x1ce/0x984
[ 778.359938] __irq_exit_rcu+0x137/0x1d0
[ 778.360865] irq_exit_rcu+0xa/0x20
[ 778.361708] common_interrupt+0x80/0xa0
[ 778.362640] </IRQ>
[ 778.363212] asm_common_interrupt+0x1e/0x40
[ 778.364204] RIP: 0010:native_safe_halt+0xe/0x10
[ 778.365273] Code: 4f ff ff ff 4c 89 e7 e8 50 3f 40 fe e9 dc fe ff ff 48 89 df e8 43 3f 40 fe eb 90 cc e9 07 00 00 00 0f 00 2d 74 05 62 00 fb f4 <c3> 90 e9 07 00 00 00 0f 00 2d 64 05 62 00 f4 c3 cc cc 0f 1f 44 00
[ 778.369355] RSP: 0018:ffffffff84407e48 EFLAGS: 00000246
[ 778.370570] RAX: ffff88842de46a80 RBX: ffffffff84425840 RCX: ffffffff83418468
[ 778.372143] RDX: 000000000026f1da RSI: 0000000000000004 RDI: ffffffff8343af5e
[ 778.373722] RBP: fffffbfff0884b08 R08: 0000000000000000 R09: ffff88842de46bcb
[ 778.375292] R10: ffffed1085bc8d79 R11: 0000000000000001 R12: 0000000000000000
[ 778.376860] R13: ffffffff851124a0 R14: 0000000000000000 R15: dffffc0000000000
[ 778.378491] ? rcu_eqs_enter.constprop.0+0xb8/0xe0
[ 778.379606] ? default_idle_call+0x5e/0xe0
[ 778.380578] default_idle+0xa/0x10
[ 778.381406] default_idle_call+0x96/0xe0
[ 778.382350] do_idle+0x3d4/0x550
[ 778.383153] ? arch_cpu_idle_exit+0x40/0x40
[ 778.384143] cpu_startup_entry+0x19/0x20
[ 778.385078] start_kernel+0x3c7/0x3e5
[ 778.385978] secondary_startup_64_no_verify+0xb0/0xbb
Fix the issue by providing new function tc_skb_ext_alloc() that allocates
tc skb extension and initializes its memory to 0 before returning it to the
caller. Change all existing users to use new API instead of calling
skb_ext_add() directly.
Fixes: 038ebb1a713d ("net/sched: act_ct: fix miss set mru for ovs after defrag in act_ct")
Fixes: d29334c15d33 ("net/sched: act_api: fix miss set post_ct for ovs after do conntrack in act_ct")
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Acked-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 9b76eade16423ef06829cccfe3e100cfce31afcd ]
If Qdisc_ops->init() is failed, Qdisc_ops->reset() would be called.
When dsmark_init(Qdisc_ops->init()) is failed, it possibly doesn't
initialize dsmark_qdisc_data->q. But dsmark_reset(Qdisc_ops->reset())
uses dsmark_qdisc_data->q pointer wihtout any null checking.
So, panic would occur.
Test commands:
sysctl net.core.default_qdisc=dsmark -w
ip link add dummy0 type dummy
ip link add vw0 link dummy0 type virt_wifi
ip link set vw0 up
Splat looks like:
KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
CPU: 3 PID: 684 Comm: ip Not tainted 5.12.0+ #910
RIP: 0010:qdisc_reset+0x2b/0x680
Code: 1f 44 00 00 48 b8 00 00 00 00 00 fc ff df 41 57 41 56 41 55 41 54
55 48 89 fd 48 83 c7 18 53 48 89 fa 48 c1 ea 03 48 83 ec 20 <80> 3c 02
00 0f 85 09 06 00 00 4c 8b 65 18 0f 1f 44 00 00 65 8b 1d
RSP: 0018:ffff88800fda6bf8 EFLAGS: 00010282
RAX: dffffc0000000000 RBX: ffff8880050ed800 RCX: 0000000000000000
RDX: 0000000000000003 RSI: ffffffff99e34100 RDI: 0000000000000018
RBP: 0000000000000000 R08: fffffbfff346b553 R09: fffffbfff346b553
R10: 0000000000000001 R11: fffffbfff346b552 R12: ffffffffc0824940
R13: ffff888109e83800 R14: 00000000ffffffff R15: ffffffffc08249e0
FS: 00007f5042287680(0000) GS:ffff888119800000(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055ae1f4dbd90 CR3: 0000000006760002 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
? rcu_read_lock_bh_held+0xa0/0xa0
dsmark_reset+0x3d/0xf0 [sch_dsmark]
qdisc_reset+0xa9/0x680
qdisc_destroy+0x84/0x370
qdisc_create_dflt+0x1fe/0x380
attach_one_default_qdisc.constprop.41+0xa4/0x180
dev_activate+0x4d5/0x8c0
? __dev_open+0x268/0x390
__dev_open+0x270/0x390
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit dcad9ee9e0663d74a89b25b987f9c7be86432812 ]
The netdev qeueue might be stopped when byte queue limit has
reached or tx hw ring is full, net_tx_action() may still be
rescheduled if STATE_MISSED is set, which consumes unnecessary
cpu without dequeuing and transmiting any skb because the
netdev queue is stopped, see qdisc_run_end().
This patch fixes it by checking the netdev queue state before
calling qdisc_run() and clearing STATE_MISSED if netdev queue is
stopped during qdisc_run(), the net_tx_action() is rescheduled
again when netdev qeueue is restarted, see netif_tx_wake_queue().
As there is time window between netif_xmit_frozen_or_stopped()
checking and STATE_MISSED clearing, between which STATE_MISSED
may set by net_tx_action() scheduled by netif_tx_wake_queue(),
so set the STATE_MISSED again if netdev queue is restarted.
Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking")
Reported-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 102b55ee92f9fda4dde7a45d2b20538e6e3e3d1e ]
Currently qdisc_run() checks the STATE_DEACTIVATED of lockless
qdisc before calling __qdisc_run(), which ultimately clear the
STATE_MISSED when all the skb is dequeued. If STATE_DEACTIVATED
is set before clearing STATE_MISSED, there may be rescheduling
of net_tx_action() at the end of qdisc_run_end(), see below:
CPU0(net_tx_atcion) CPU1(__dev_xmit_skb) CPU2(dev_deactivate)
. . .
. set STATE_MISSED .
. __netif_schedule() .
. . set STATE_DEACTIVATED
. . qdisc_reset()
. . .
.<--------------- . synchronize_net()
clear __QDISC_STATE_SCHED | . .
. | . .
. | . some_qdisc_is_busy()
. | . return *false*
. | . .
test STATE_DEACTIVATED | . .
__qdisc_run() *not* called | . .
. | . .
test STATE_MISS | . .
__netif_schedule()--------| . .
. . .
. . .
__qdisc_run() is not called by net_tx_atcion() in CPU0 because
CPU2 has set STATE_DEACTIVATED flag during dev_deactivate(), and
STATE_MISSED is only cleared in __qdisc_run(), __netif_schedule
is called at the end of qdisc_run_end(), causing tx action
rescheduling problem.
qdisc_run() called by net_tx_action() runs in the softirq context,
which should has the same semantic as the qdisc_run() called by
__dev_xmit_skb() protected by rcu_read_lock_bh(). And there is a
synchronize_net() between STATE_DEACTIVATED flag being set and
qdisc_reset()/some_qdisc_is_busy in dev_deactivate(), we can safely
bail out for the deactived lockless qdisc in net_tx_action(), and
qdisc_reset() will reset all skb not dequeued yet.
So add the rcu_read_lock() explicitly to protect the qdisc_run()
and do the STATE_DEACTIVATED checking in net_tx_action() before
calling qdisc_run_begin(). Another option is to do the checking in
the qdisc_run_end(), but it will add unnecessary overhead for
non-tx_action case, because __dev_queue_xmit() will not see qdisc
with STATE_DEACTIVATED after synchronize_net(), the qdisc with
STATE_DEACTIVATED can only be seen by net_tx_action() because of
__netif_schedule().
The STATE_DEACTIVATED checking in qdisc_run() is to avoid race
between net_tx_action() and qdisc_reset(), see:
commit d518d2ed8640 ("net/sched: fix race between deactivation
and dequeue for NOLOCK qdisc"). As the bailout added above for
deactived lockless qdisc in net_tx_action() provides better
protection for the race without calling qdisc_run() at all, so
remove the STATE_DEACTIVATED checking in qdisc_run().
After qdisc_reset(), there is no skb in qdisc to be dequeued, so
clear the STATE_MISSED in dev_reset_queue() too.
Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking")
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
V8: Clearing STATE_MISSED before calling __netif_schedule() has
avoid the endless rescheduling problem, but there may still
be a unnecessary rescheduling, so adjust the commit log.
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit a90c57f2cedd52a511f739fb55e6244e22e1a2fb ]
Lockless qdisc has below concurrent problem:
cpu0 cpu1
. .
q->enqueue .
. .
qdisc_run_begin() .
. .
dequeue_skb() .
. .
sch_direct_xmit() .
. .
. q->enqueue
. qdisc_run_begin()
. return and do nothing
. .
qdisc_run_end() .
cpu1 enqueue a skb without calling __qdisc_run() because cpu0
has not released the lock yet and spin_trylock() return false
for cpu1 in qdisc_run_begin(), and cpu0 do not see the skb
enqueued by cpu1 when calling dequeue_skb() because cpu1 may
enqueue the skb after cpu0 calling dequeue_skb() and before
cpu0 calling qdisc_run_end().
Lockless qdisc has below another concurrent problem when
tx_action is involved:
cpu0(serving tx_action) cpu1 cpu2
. . .
. q->enqueue .
. qdisc_run_begin() .
. dequeue_skb() .
. . q->enqueue
. . .
. sch_direct_xmit() .
. . qdisc_run_begin()
. . return and do nothing
. . .
clear __QDISC_STATE_SCHED . .
qdisc_run_begin() . .
return and do nothing . .
. . .
. qdisc_run_end() .
This patch fixes the above data race by:
1. If the first spin_trylock() return false and STATE_MISSED is
not set, set STATE_MISSED and retry another spin_trylock() in
case other CPU may not see STATE_MISSED after it releases the
lock.
2. reschedule if STATE_MISSED is set after the lock is released
at the end of qdisc_run_end().
For tx_action case, STATE_MISSED is also set when cpu1 is at the
end if qdisc_run_end(), so tx_action will be rescheduled again
to dequeue the skb enqueued by cpu2.
Clear STATE_MISSED before retrying a dequeuing when dequeuing
returns NULL in order to reduce the overhead of the second
spin_trylock() and __netif_schedule() calling.
Also clear the STATE_MISSED before calling __netif_schedule()
at the end of qdisc_run_end() to avoid doing another round of
dequeuing in the pfifo_fast_dequeue().
The performance impact of this patch, tested using pktgen and
dummy netdev with pfifo_fast qdisc attached:
threads without+this_patch with+this_patch delta
1 2.61Mpps 2.60Mpps -0.3%
2 3.97Mpps 3.82Mpps -3.7%
4 5.62Mpps 5.59Mpps -0.5%
8 2.78Mpps 2.77Mpps -0.3%
16 2.22Mpps 2.22Mpps -0.0%
Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking")
Acked-by: Jakub Kicinski <kuba@kernel.org>
Tested-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit e70f7a11876a1a788ceadf75e9e5f7af2c868680 upstream.
the following script:
# tc qdisc add dev eth0 handle 0x1 root fq_pie flows 2
# tc qdisc add dev eth0 clsact
# tc filter add dev eth0 egress matchall action skbedit priority 0x10002
# ping 192.0.2.2 -I eth0 -c2 -w1 -q
produces the following splat:
BUG: KASAN: slab-out-of-bounds in fq_pie_qdisc_enqueue+0x1314/0x19d0 [sch_fq_pie]
Read of size 4 at addr ffff888171306924 by task ping/942
CPU: 3 PID: 942 Comm: ping Not tainted 5.12.0+ #441
Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014
Call Trace:
dump_stack+0x92/0xc1
print_address_description.constprop.7+0x1a/0x150
kasan_report.cold.13+0x7f/0x111
fq_pie_qdisc_enqueue+0x1314/0x19d0 [sch_fq_pie]
__dev_queue_xmit+0x1034/0x2b10
ip_finish_output2+0xc62/0x2120
__ip_finish_output+0x553/0xea0
ip_output+0x1ca/0x4d0
ip_send_skb+0x37/0xa0
raw_sendmsg+0x1c4b/0x2d00
sock_sendmsg+0xdb/0x110
__sys_sendto+0x1d7/0x2b0
__x64_sys_sendto+0xdd/0x1b0
do_syscall_64+0x3c/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7fe69735c3eb
Code: 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 f3 0f 1e fa 48 8d 05 75 42 2c 00 41 89 ca 8b 00 85 c0 75 14 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 75 c3 0f 1f 40 00 41 57 4d 89 c7 41 56 41 89
RSP: 002b:00007fff06d7fb38 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 000055e961413700 RCX: 00007fe69735c3eb
RDX: 0000000000000040 RSI: 000055e961413700 RDI: 0000000000000003
RBP: 0000000000000040 R08: 000055e961410500 R09: 0000000000000010
R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff06d81260
R13: 00007fff06d7fb40 R14: 00007fff06d7fc30 R15: 000055e96140f0a0
Allocated by task 917:
kasan_save_stack+0x19/0x40
__kasan_kmalloc+0x7f/0xa0
__kmalloc_node+0x139/0x280
fq_pie_init+0x555/0x8e8 [sch_fq_pie]
qdisc_create+0x407/0x11b0
tc_modify_qdisc+0x3c2/0x17e0
rtnetlink_rcv_msg+0x346/0x8e0
netlink_rcv_skb+0x120/0x380
netlink_unicast+0x439/0x630
netlink_sendmsg+0x719/0xbf0
sock_sendmsg+0xe2/0x110
____sys_sendmsg+0x5ba/0x890
___sys_sendmsg+0xe9/0x160
__sys_sendmsg+0xd3/0x170
do_syscall_64+0x3c/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
The buggy address belongs to the object at ffff888171306800
which belongs to the cache kmalloc-256 of size 256
The buggy address is located 36 bytes to the right of
256-byte region [ffff888171306800, ffff888171306900)
The buggy address belongs to the page:
page:00000000bcfb624e refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x171306
head:00000000bcfb624e order:1 compound_mapcount:0
flags: 0x17ffffc0010200(slab|head|node=0|zone=2|lastcpupid=0x1fffff)
raw: 0017ffffc0010200 dead000000000100 dead000000000122 ffff888100042b40
raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff888171306800: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff888171306880: 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc
>ffff888171306900: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
^
ffff888171306980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff888171306a00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
fix fq_pie traffic path to avoid selecting 'q->flows + q->flows_cnt' as a
valid flow: it's an address beyond the allocated memory.
Fixes: ec97ecf1ebe4 ("net: sched: add Flow Queue PIE packet scheduler")
CC: stable@vger.kernel.org
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 3a62fed2fd7b6fea96d720e779cafc30dfb3a22e upstream.
the patch that fixed an endless loop in_fq_pie_init() was not considering
that 65535 is a valid class id. The correct bugfix for this infinite loop
is to change 'idx' to become an u32, like Colin proposed in the past [1].
Fix this as follows:
- restore 65536 as maximum possible values of 'flows_cnt'
- use u32 'idx' when iterating on 'q->flows'
- fix the TDC selftest
This reverts commit bb2f930d6dd708469a587dc9ed1efe1ef969c0bf.
[1] https://lore.kernel.org/netdev/20210407163808.499027-1-colin.king@canonical.com/
CC: Colin Ian King <colin.king@canonical.com>
CC: stable@vger.kernel.org
Fixes: bb2f930d6dd7 ("net/sched: fix infinite loop in sch_fq_pie")
Fixes: ec97ecf1ebe4 ("net: sched: add Flow Queue PIE packet scheduler")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit ed8157f1ebf1ae81a8fa2653e3f20d2076fad1c9 ]
There is a reproducible sequence from the userland that will trigger a WARN_ON()
condition in taprio_get_start_time, which causes kernel to panic if configured
as "panic_on_warn". Catch this condition in parse_taprio_schedule to
prevent this condition.
Reported as bug on syzkaller:
https://syzkaller.appspot.com/bug?extid=d50710fd0873a9c6b40c
Reported-by: syzbot+d50710fd0873a9c6b40c@syzkaller.appspotmail.com
Signed-off-by: Du Cheng <ducheng2@gmail.com>
Acked-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 6215afcb9a7e35cef334dc0ae7f998cc72c8465f ]
A make W=1 build complains that:
net/sched/cls_flower.c:214:20: warning: cast from restricted __be16
net/sched/cls_flower.c:214:20: warning: incorrect type in argument 1 (different base types)
net/sched/cls_flower.c:214:20: expected unsigned short [usertype] val
net/sched/cls_flower.c:214:20: got restricted __be16 [usertype] dst
This is because we use htons on struct flow_dissector_key_ports members
src and dst, which are defined as __be16, so they are already in network
byte order, not host. The byte swap function for the other direction
should have been used.
Because htons and ntohs do the same thing (either both swap, or none
does), this change has no functional effect except to silence the
warnings.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 4ba86128ba077fbb7d86516ae24ed642e6c3adef upstream.
This reverts commit 6855e8213e06efcaf7c02a15e12b1ae64b9a7149.
Following commit in series fixes the issue without introducing regression
in error rollback of tcf_action_destroy().
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 6855e8213e06efcaf7c02a15e12b1ae64b9a7149 upstream.
Currently, action creation using ACT API in replace mode is buggy.
When invoking for non-existent action index 42,
tc action replace action bpf obj foo.o sec <xyz> index 42
kernel creates the action, fills up the netlink response, and then just
deletes the action after notifying userspace.
tc action show action bpf
doesn't list the action.
This happens due to the following sequence when ovr = 1 (replace mode)
is enabled:
tcf_idr_check_alloc is used to atomically check and either obtain
reference for existing action at index, or reserve the index slot using
a dummy entry (ERR_PTR(-EBUSY)).
This is necessary as pointers to these actions will be held after
dropping the idrinfo lock, so bumping the reference count is necessary
as we need to insert the actions, and notify userspace by dumping their
attributes. Finally, we drop the reference we took using the
tcf_action_put_many call in tcf_action_add. However, for the case where
a new action is created due to free index, its refcount remains one.
This when paired with the put_many call leads to the kernel setting up
the action, notifying userspace of its creation, and then tearing it
down. For existing actions, the refcount is still held so they remain
unaffected.
Fortunately due to rtnl_lock serialization requirement, such an action
with refcount == 1 will not be concurrently deleted by anything else, at
best CLS API can move its refcount up and down by binding to it after it
has been published from tcf_idr_insert_many. Since refcount is atleast
one until put_many call, CLS API cannot delete it. Also __tcf_action_put
release path already ensures deterministic outcome (either new action
will be created or existing action will be reused in case CLS API tries
to bind to action concurrently) due to idr lock serialization.
We fix this by making refcount of newly created actions as 2 in ACT API
replace mode. A relaxed store will suffice as visibility is ensured only
after the tcf_idr_insert_many call.
Note that in case of creation or overwriting using CLS API only (i.e.
bind = 1), overwriting existing action object is not allowed, and any
such request is silently ignored (without error).
The refcount bump that occurs in tcf_idr_check_alloc call there for
existing action will pair with tcf_exts_destroy call made from the
owner module for the same action. In case of action creation, there
is no existing action, so no tcf_exts_destroy callback happens.
This means no code changes for CLS API.
Fixes: cae422f379f3 ("net: sched: use reference counting action init")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 990b03b05b2fba79de2a1ee9dc359fc552d95ba6 ]
The 'unlocked_driver_cb' struct field in 'bo' is not being initialized
in tcf_block_offload_init(). The uninitialized 'unlocked_driver_cb'
will be used when calling unlocked_driver_cb(). So initialize 'bo' to
zero to avoid the issue.
Addresses-Coverity: ("Uninitialized scalar variable")
Fixes: 0fdcf78d5973 ("net: use flow_indr_dev_setup_offload()")
Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit b3650bf76a32380d4d80a3e21b5583e7303f216c ]
With recent changes that separated action module load from action
initialization tcf_action_init() function error handling code was modified
to manually release the loaded modules if loading/initialization of any
further action in same batch failed. For the case when all modules
successfully loaded and some of the actions were initialized before one of
them failed in init handler. In this case for all previous actions the
module will be released twice by the error handler: First time by the loop
that manually calls module_put() for all ops, and second time by the action
destroy code that puts the module after destroying the action.
Reproduction:
$ sudo tc actions add action simple sdata \"2\" index 2
$ sudo tc actions add action simple sdata \"1\" index 1 \
action simple sdata \"2\" index 2
RTNETLINK answers: File exists
We have an error talking to the kernel
$ sudo tc actions ls action simple
total acts 1
action order 0: Simple <"2">
index 2 ref 1 bind 0
$ sudo tc actions flush action simple
$ sudo tc actions ls action simple
$ sudo tc actions add action simple sdata \"2\" index 2
Error: Failed to load TC action module.
We have an error talking to the kernel
$ lsmod | grep simple
act_simple 20480 -1
Fix the issue by modifying module reference counting handling in action
initialization code:
- Get module reference in tcf_idr_create() and put it in tcf_idr_release()
instead of taking over the reference held by the caller.
- Modify users of tcf_action_init_1() to always release the module
reference which they obtain before calling init function instead of
assuming that created action takes over the reference.
- Finally, modify tcf_action_init_1() to not release the module reference
when overwriting existing action as this is no longer necessary since both
upper and lower layers obtain and manage their own module references
independently.
Fixes: d349f9976868 ("net_sched: fix RTNL deadlock again caused by request_module()")
Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 87c750e8c38bce706eb32e4d8f1e3402f2cebbd4 upstream.
Action init code increments reference counter when it changes an action.
This is the desired behavior for cls API which needs to obtain action
reference for every classifier that points to action. However, act API just
needs to change the action and releases the reference before returning.
This sequence breaks when the requested action doesn't exist, which causes
act API init code to create new action with specified index, but action is
still released before returning and is deleted (unless it was referenced
concurrently by cls API).
Reproduction:
$ sudo tc actions ls action gact
$ sudo tc actions change action gact drop index 1
$ sudo tc actions ls action gact
Extend tcf_action_init() to accept 'init_res' array and initialize it with
action->ops->init() result. In tcf_action_add() remove pointers to created
actions from actions array before passing it to tcf_action_put_many().
Fixes: cae422f379f3 ("net: sched: use reference counting action init")
Reported-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|