path: root/kernel
2024-08-07  genirq/irqdesc: Honor caller provided affinity in alloc_desc()  (Shay Drory, 1 file, -0/+1)

Currently, whenever a caller is providing an affinity hint for an interrupt, the allocation code uses it to calculate the node and copies the cpumask into irq_desc::affinity. If the affinity for the interrupt is not marked 'managed' then the startup of the interrupt ignores irq_desc::affinity and uses the system default affinity mask.

Prevent this by setting the IRQD_AFFINITY_SET flag for the interrupt in the allocator, which causes irq_setup_affinity() to use irq_desc::affinity on interrupt startup if the mask contains an online CPU.

[ tglx: Massaged changelog ]

Fixes: 45ddcecbfa94 ("genirq: Use affinity hint in irqdesc allocation")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/all/20240806072044.837827-1-shayd@nvidia.com
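A hedged sketch of the change described above, assuming the descriptor allocation path looks roughly like upstream's alloc_descs(); the surrounding lines are paraphrased and only the flag-setting line is the point:

    /* Sketch only: mark a caller-provided affinity as "set" so that
     * irq_setup_affinity() honours it at startup instead of falling
     * back to the system default affinity mask. */
    if (affinity) {
            if (affinity->is_managed)
                    flags = IRQD_AFFINITY_MANAGED | IRQD_MANAGED_SHUTDOWN;
            flags |= IRQD_AFFINITY_SET;             /* the added line */
            mask = &affinity->mask;
            node = cpu_to_node(cpumask_first(mask));
            affinity++;
    }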
2024-08-07  lockdep: Fix lockdep_set_notrack_class() for CONFIG_LOCK_STAT  (Kent Overstreet, 1 file, -0/+6)

We won't find a contended lock if it's not being tracked.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-06  dma-debug: avoid deadlock between dma debug vs printk and netconsole  (Rik van Riel, 1 file, -1/+4)

Currently the dma debugging code can end up indirectly calling printk under the radix_lock. This happens when a radix tree node allocation fails.

This is a problem because the printk code, when used together with netconsole, can end up inside the dma debugging code while trying to transmit a message over netcons.

This creates the possibility of either a circular deadlock on the same CPU, with that CPU trying to grab the radix_lock twice, or an ABBA deadlock between different CPUs, where one CPU grabs the console lock first and then waits for the radix_lock, while the other CPU is holding the radix_lock and is waiting for the console lock.

The trace captured by lockdep is of the ABBA variant.

  -> #2 (&dma_entry_hash[i].lock){-.-.}-{2:2}:
     _raw_spin_lock_irqsave+0x5a/0x90
     debug_dma_map_page+0x79/0x180
     dma_map_page_attrs+0x1d2/0x2f0
     bnxt_start_xmit+0x8c6/0x1540
     netpoll_start_xmit+0x13f/0x180
     netpoll_send_skb+0x20d/0x320
     netpoll_send_udp+0x453/0x4a0
     write_ext_msg+0x1b9/0x460
     console_flush_all+0x2ff/0x5a0
     console_unlock+0x55/0x180
     vprintk_emit+0x2e3/0x3c0
     devkmsg_emit+0x5a/0x80
     devkmsg_write+0xfd/0x180
     do_iter_readv_writev+0x164/0x1b0
     vfs_writev+0xf9/0x2b0
     do_writev+0x6d/0x110
     do_syscall_64+0x80/0x150
     entry_SYSCALL_64_after_hwframe+0x4b/0x53

  -> #0 (console_owner){-.-.}-{0:0}:
     __lock_acquire+0x15d1/0x31a0
     lock_acquire+0xe8/0x290
     console_flush_all+0x2ea/0x5a0
     console_unlock+0x55/0x180
     vprintk_emit+0x2e3/0x3c0
     _printk+0x59/0x80
     warn_alloc+0x122/0x1b0
     __alloc_pages_slowpath+0x1101/0x1120
     __alloc_pages+0x1eb/0x2c0
     alloc_slab_page+0x5f/0x150
     new_slab+0x2dc/0x4e0
     ___slab_alloc+0xdcb/0x1390
     kmem_cache_alloc+0x23d/0x360
     radix_tree_node_alloc+0x3c/0xf0
     radix_tree_insert+0xf5/0x230
     add_dma_entry+0xe9/0x360
     dma_map_page_attrs+0x1d2/0x2f0
     __bnxt_alloc_rx_frag+0x147/0x180
     bnxt_alloc_rx_data+0x79/0x160
     bnxt_rx_skb+0x29/0xc0
     bnxt_rx_pkt+0xe22/0x1570
     __bnxt_poll_work+0x101/0x390
     bnxt_poll+0x7e/0x320
     __napi_poll+0x29/0x160
     net_rx_action+0x1e0/0x3e0
     handle_softirqs+0x190/0x510
     run_ksoftirqd+0x4e/0x90
     smpboot_thread_fn+0x1a8/0x270
     kthread+0x102/0x120
     ret_from_fork+0x2f/0x40
     ret_from_fork_asm+0x11/0x20

This bug is more likely than it seems, because when one CPU has run out of memory, chances are the other has too.

The good news is, this bug is hidden behind the CONFIG_DMA_API_DEBUG, so not many users are likely to trigger it.

Signed-off-by: Rik van Riel <riel@surriel.com>
Reported-by: Konstantin Ovsepian <ovs@meta.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-08-05  workqueue: Correct declaration of cpu_pwq in struct workqueue_struct  (Uros Bizjak, 1 file, -1/+1)

cpu_pwq is used in various percpu functions that expect the variable to be in the __percpu address space. Correct the declaration of cpu_pwq to

    struct pool_workqueue __rcu * __percpu *cpu_pwq

to declare the variable as a __percpu pointer.

The patch also fixes the following sparse errors:

    workqueue.c:380:37: warning: duplicate [noderef]
    workqueue.c:380:37: error: multiple address spaces given: __rcu & __percpu
    workqueue.c:2271:15: error: incompatible types in comparison expression (different address spaces):
    workqueue.c:2271:15:    struct pool_workqueue [noderef] __rcu *
    workqueue.c:2271:15:    struct pool_workqueue [noderef] __percpu *

and uncovers a couple of existing "incorrect type in assignment" warnings (from the __rcu address space), which this patch does not address.

Found by GCC's named address space checks.

There were no changes in the resulting object files.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-05  workqueue: Fix spurious data race in __flush_work()  (Tejun Heo, 1 file, -20/+25)

When flushing a work item for cancellation, __flush_work() knows that it exclusively owns the work item through its PENDING bit. 134874e2eee9 ("workqueue: Allow cancel_work_sync() and disable_work() from atomic contexts on BH work items") added a read of @work->data to determine whether to use busy wait for BH work items that are being canceled. While the read is safe when @from_cancel, @work->data was read before testing @from_cancel to simplify code structure:

    data = *work_data_bits(work);
    if (from_cancel &&
        !WARN_ON_ONCE(data & WORK_STRUCT_PWQ) && (data & WORK_OFFQ_BH)) {

While the read data was never used if !@from_cancel, this could trigger KCSAN data race detection spuriously:

    ==================================================================
    BUG: KCSAN: data-race in __flush_work / __flush_work

    write to 0xffff8881223aa3e8 of 8 bytes by task 3998 on cpu 0:
     instrument_write include/linux/instrumented.h:41 [inline]
     ___set_bit include/asm-generic/bitops/instrumented-non-atomic.h:28 [inline]
     insert_wq_barrier kernel/workqueue.c:3790 [inline]
     start_flush_work kernel/workqueue.c:4142 [inline]
     __flush_work+0x30b/0x570 kernel/workqueue.c:4178
     flush_work kernel/workqueue.c:4229 [inline]
     ...

    read to 0xffff8881223aa3e8 of 8 bytes by task 50 on cpu 1:
     __flush_work+0x42a/0x570 kernel/workqueue.c:4188
     flush_work kernel/workqueue.c:4229 [inline]
     flush_delayed_work+0x66/0x70 kernel/workqueue.c:4251
     ...

    value changed: 0x0000000000400000 -> 0xffff88810006c00d

Reorganize the code so that @from_cancel is tested before @work->data is accessed. The only problem is triggering KCSAN detection spuriously. This shouldn't need READ_ONCE() or other access qualifiers.

No functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: syzbot+b3e4f2f51ed645fd5df2@syzkaller.appspotmail.com
Fixes: 134874e2eee9 ("workqueue: Allow cancel_work_sync() and disable_work() from atomic contexts on BH work items")
Link: http://lkml.kernel.org/r/000000000000ae429e061eea2157@google.com
Cc: Jens Axboe <axboe@kernel.dk>
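A hedged sketch of the reorganization described above (paraphrased, not the exact upstream diff); 'barr' is assumed to be the local wait barrier used by __flush_work(). The point is that @work->data is only read once @from_cancel is known to be true:

    if (from_cancel) {
            unsigned long data = *work_data_bits(work);

            /* Busy-wait only for BH work items that are being cancelled. */
            if (!WARN_ON_ONCE(data & WORK_STRUCT_PWQ) &&
                (data & WORK_OFFQ_BH)) {
                    while (!try_wait_for_completion(&barr.done))
                            cpu_relax();
                    return true;
            }
    }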
2024-08-05  workqueue: Remove incorrect "WARN_ON_ONCE(!list_empty(&worker->entry));" from dying worker  (Lai Jiangshan, 1 file, -1/+0)

The commit 68f83057b913 ("workqueue: Reap workers via kthread_stop() and remove detach_completion") changes the procedure of destroying workers; the dying workers are kept in the cull_list in wake_dying_workers() with the pool lock held and removed from the cull_list by the newly added reap_dying_workers() without the pool lock.

This can cause a warning if the dying worker is woken up earlier than reaped, as reported by Marc:

2024/07/23 18:01:21 [M83LP63]: [  157.267727] ------------[ cut here ]------------
2024/07/23 18:01:21 [M83LP63]: [  157.267735] WARNING: CPU: 21 PID: 725 at kernel/workqueue.c:3340 worker_thread+0x54e/0x558
2024/07/23 18:01:21 [M83LP63]: [  157.267746] Modules linked in: binfmt_misc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables sunrpc dm_service_time s390_trng vfio_ccw mdev vfio_iommu_type1 vfio sch_fq_codel
2024/07/23 18:01:21 [M83LP63]:  loop dm_multipath configfs nfnetlink lcs ctcm fsm zfcp scsi_transport_fc ghash_s390 prng chacha_s390 libchacha aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common scm_block eadm_sch scsi_dh_rdac scsi_dh_emc scsi_dh_alua pkey zcrypt rng_core autofs4
2024/07/23 18:01:21 [M83LP63]: [  157.267792] CPU: 21 PID: 725 Comm: kworker/dying Not tainted 6.10.0-rc2-00239-g68f83057b913 #95
2024/07/23 18:01:21 [M83LP63]: [  157.267796] Hardware name: IBM 3906 M04 704 (LPAR)
2024/07/23 18:01:21 [M83LP63]: [  157.267802] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
2024/07/23 18:01:21 [M83LP63]: [  157.267797] Krnl PSW : 0704d00180000000 000003d600fcd9fa (worker_thread+0x552/0x558)
2024/07/23 18:01:21 [M83LP63]: [  157.267806] Krnl GPRS: 6479696e6700776f 000002c901b62780 000003d602493ec8 000002c914954600
2024/07/23 18:01:21 [M83LP63]: [  157.267809]            0000000000000000 0000000000000008 000002c901a85400 000002c90719e840
2024/07/23 18:01:21 [M83LP63]: [  157.267811]            000002c90719e880 000002c901a85420 000002c91127adf0 000002c901a85400
2024/07/23 18:01:21 [M83LP63]: [  157.267813]            000002c914954600 0000000000000000 000003d600fcd772 000003560452bd98
2024/07/23 18:01:21 [M83LP63]: [  157.267822] Krnl Code: 000003d600fcd9ec: c0e500674262  brasl  %r14,000003d601cb5eb0
2024/07/23 18:01:21 [M83LP63]: [  157.267822]            000003d600fcd9f2: a7f4ffc8      brc    15,000003d600fcd982
2024/07/23 18:01:21 [M83LP63]: [  157.267822]           #000003d600fcd9f6: af000000      mc     0,0
2024/07/23 18:01:21 [M83LP63]: [  157.267822]           >000003d600fcd9fa: a7f4fec2      brc    15,000003d600fcd77e
2024/07/23 18:01:21 [M83LP63]: [  157.267822]            000003d600fcd9fe: 0707          bcr    0,%r7
2024/07/23 18:01:21 [M83LP63]: [  157.267822]            000003d600fcda00: c00400682e10  brcl   0,000003d601cd3620
2024/07/23 18:01:21 [M83LP63]: [  157.267822]            000003d600fcda06: eb7ff0500024  stmg   %r7,%r15,80(%r15)
2024/07/23 18:01:21 [M83LP63]: [  157.267822]            000003d600fcda0c: b90400ef      lgr    %r14,%r15
2024/07/23 18:01:21 [M83LP63]: [  157.267853] Call Trace:
2024/07/23 18:01:21 [M83LP63]: [  157.267855]  [<000003d600fcd9fa>] worker_thread+0x552/0x558
2024/07/23 18:01:21 [M83LP63]: [  157.267859] ([<000003d600fcd772>] worker_thread+0x2ca/0x558)
2024/07/23 18:01:21 [M83LP63]: [  157.267862]  [<000003d600fd6c80>] kthread+0x120/0x128
2024/07/23 18:01:21 [M83LP63]: [  157.267865]  [<000003d600f5305c>] __ret_from_fork+0x3c/0x58
2024/07/23 18:01:21 [M83LP63]: [  157.267868]  [<000003d601cc746a>] ret_from_fork+0xa/0x30
2024/07/23 18:01:21 [M83LP63]: [  157.267873] Last Breaking-Event-Address:
2024/07/23 18:01:21 [M83LP63]: [  157.267874]  [<000003d600fcd778>] worker_thread+0x2d0/0x558

Since the procedure of destroying workers is changed, the WARN_ON_ONCE() becomes incorrect and should be removed.

Cc: Marc Hartmayer <mhartmay@linux.ibm.com>
Link: https://lore.kernel.org/lkml/87le1sjd2e.fsf@linux.ibm.com/
Reported-by: Marc Hartmayer <mhartmay@linux.ibm.com>
Fixes: 68f83057b913 ("workqueue: Reap workers via kthread_stop() and remove detach_completion")
Cc: stable@vger.kernel.org # v6.11+
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-05  workqueue: Fix UBSAN 'subtraction overflow' error in shift_and_mask()  (Will Deacon, 1 file, -1/+1)

UBSAN reports the following 'subtraction overflow' error when booting in a virtual machine on Android:

 | Internal error: UBSAN: integer subtraction overflow: 00000000f2005515 [#1] PREEMPT SMP
 | Modules linked in:
 | CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.10.0-00006-g3cbe9e5abd46-dirty #4
 | Hardware name: linux,dummy-virt (DT)
 | pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
 | pc : cancel_delayed_work+0x34/0x44
 | lr : cancel_delayed_work+0x2c/0x44
 | sp : ffff80008002ba60
 | x29: ffff80008002ba60 x28: 0000000000000000 x27: 0000000000000000
 | x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
 | x23: 0000000000000000 x22: 0000000000000000 x21: ffff1f65014cd3c0
 | x20: ffffc0e84c9d0da0 x19: ffffc0e84cab3558 x18: ffff800080009058
 | x17: 00000000247ee1f8 x16: 00000000247ee1f8 x15: 00000000bdcb279d
 | x14: 0000000000000001 x13: 0000000000000075 x12: 00000a0000000000
 | x11: ffff1f6501499018 x10: 00984901651fffff x9 : ffff5e7cc35af000
 | x8 : 0000000000000001 x7 : 3d4d455453595342 x6 : 000000004e514553
 | x5 : ffff1f6501499265 x4 : ffff1f650ff60b10 x3 : 0000000000000620
 | x2 : ffff80008002ba78 x1 : 0000000000000000 x0 : 0000000000000000
 | Call trace:
 |  cancel_delayed_work+0x34/0x44
 |  deferred_probe_extend_timeout+0x20/0x70
 |  driver_register+0xa8/0x110
 |  __platform_driver_register+0x28/0x3c
 |  syscon_init+0x24/0x38
 |  do_one_initcall+0xe4/0x338
 |  do_initcall_level+0xac/0x178
 |  do_initcalls+0x5c/0xa0
 |  do_basic_setup+0x20/0x30
 |  kernel_init_freeable+0x8c/0xf8
 |  kernel_init+0x28/0x1b4
 |  ret_from_fork+0x10/0x20
 | Code: f9000fbf 97fffa2f 39400268 37100048 (d42aa2a0)
 | ---[ end trace 0000000000000000 ]---
 | Kernel panic - not syncing: UBSAN: integer subtraction overflow: Fatal exception

This is due to shift_and_mask() using a signed immediate to construct the mask and being called with a shift of 31 (WORK_OFFQ_POOL_SHIFT) so that it ends up decrementing from INT_MIN.

Use an unsigned constant '1U' to generate the mask in shift_and_mask().

Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Fixes: 1211f3b21c2a ("workqueue: Preserve OFFQ bits in cancel[_sync] paths")
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
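A small sketch of the helper after the change described above; the signature is assumed from the commit text rather than quoted from the source:

    /* With a signed '1', (1 << 31) - 1 underflows from INT_MIN; the
     * unsigned constant keeps the mask computation well defined. */
    static inline unsigned long shift_and_mask(unsigned long v, u32 shift, u32 bits)
    {
            return (v >> shift) & ((1U << bits) - 1);
    }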
2024-08-05  cgroup/cpuset: Eliminate unnecessary sched domains rebuilds in hotplug  (Waiman Long, 1 file, -16/+17)

It was found that some hotplug operations may cause multiple rebuild_sched_domains_locked() calls. Some of those intermediate calls may use cpuset states not in the final correct form, leading to incorrect sched domain settings.

Fix this problem by using the existing force_rebuild flag to inhibit immediate rebuild_sched_domains_locked() calls if set and only doing one final call at the end. Also rename the force_rebuild flag to force_sd_rebuild to make its meaning clearer.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-05  cgroup/cpuset: Clear effective_xcpus on cpus_allowed clearing only if cpus.exclusive not set  (Waiman Long, 1 file, -1/+2)

Commit e2ffe502ba45 ("cgroup/cpuset: Add cpuset.cpus.exclusive for v2") adds a user writable cpuset.cpus.exclusive file for setting exclusive CPUs to be used for the creation of partitions. Since then effective_xcpus depends on both the cpuset.cpus and cpuset.cpus.exclusive setting. If cpuset.cpus.exclusive is set, effective_xcpus will depend only on cpuset.cpus.exclusive. When it is not set, effective_xcpus will be set according to the cpuset.cpus value when the cpuset becomes a valid partition root.

When cpuset.cpus is being cleared by the user, effective_xcpus should only be cleared when cpuset.cpus.exclusive is not set. However, that is not currently the case.

  # cd /sys/fs/cgroup/
  # mkdir test
  # echo +cpuset > cgroup.subtree_control
  # cd test
  # echo 3 > cpuset.cpus.exclusive
  # cat cpuset.cpus.exclusive.effective
  3
  # echo > cpuset.cpus
  # cat cpuset.cpus.exclusive.effective	// was cleared

Fix it by clearing effective_xcpus only if cpuset.cpus.exclusive is not set.

Fixes: e2ffe502ba45 ("cgroup/cpuset: Add cpuset.cpus.exclusive for v2")
Cc: stable@vger.kernel.org # v6.7+
Reported-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
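A hedged sketch of the guard described above, in update_cpumask()-style pseudocode; the field names (cpus_allowed, exclusive_cpus, effective_xcpus) follow the commit text and may differ slightly from the real source:

    /* Clear effective_xcpus on an empty cpuset.cpus write only when the
     * user has not configured cpuset.cpus.exclusive explicitly. */
    if (cpumask_empty(trialcs->cpus_allowed) &&
        cpumask_empty(cs->exclusive_cpus))
            cpumask_clear(trialcs->effective_xcpus);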
2024-08-05  cgroup/cpuset: fix panic caused by partcmd_update  (Chen Ridong, 1 file, -0/+2)

We found a bug, shown below:

BUG: unable to handle page fault for address: 00000003
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 3 PID: 358 Comm: bash Tainted: G W I 6.6.0-10893-g60d6
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/4
RIP: 0010:partition_sched_domains_locked+0x483/0x600
Code: 01 48 85 d2 74 0d 48 83 05 29 3f f8 03 01 f3 48 0f bc c2 89 c0 48 9
RSP: 0018:ffffc90000fdbc58 EFLAGS: 00000202
RAX: 0000000100000003 RBX: ffff888100b3dfa0 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000002fe80
RBP: ffff888100b3dfb0 R08: 0000000000000001 R09: 0000000000000000
R10: ffffc90000fdbcb0 R11: 0000000000000004 R12: 0000000000000002
R13: ffff888100a92b48 R14: 0000000000000000 R15: 0000000000000000
FS:  00007f44a5425740(0000) GS:ffff888237d80000(0000) knlGS:0000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000100030973 CR3: 000000010722c000 CR4: 00000000000006e0
Call Trace:
 <TASK>
 ? show_regs+0x8c/0xa0
 ? __die_body+0x23/0xa0
 ? __die+0x3a/0x50
 ? page_fault_oops+0x1d2/0x5c0
 ? partition_sched_domains_locked+0x483/0x600
 ? search_module_extables+0x2a/0xb0
 ? search_exception_tables+0x67/0x90
 ? kernelmode_fixup_or_oops+0x144/0x1b0
 ? __bad_area_nosemaphore+0x211/0x360
 ? up_read+0x3b/0x50
 ? bad_area_nosemaphore+0x1a/0x30
 ? exc_page_fault+0x890/0xd90
 ? __lock_acquire.constprop.0+0x24f/0x8d0
 ? __lock_acquire.constprop.0+0x24f/0x8d0
 ? asm_exc_page_fault+0x26/0x30
 ? partition_sched_domains_locked+0x483/0x600
 ? partition_sched_domains_locked+0xf0/0x600
 rebuild_sched_domains_locked+0x806/0xdc0
 update_partition_sd_lb+0x118/0x130
 cpuset_write_resmask+0xffc/0x1420
 cgroup_file_write+0xb2/0x290
 kernfs_fop_write_iter+0x194/0x290
 new_sync_write+0xeb/0x160
 vfs_write+0x16f/0x1d0
 ksys_write+0x81/0x180
 __x64_sys_write+0x21/0x30
 x64_sys_call+0x2f25/0x4630
 do_syscall_64+0x44/0xb0
 entry_SYSCALL_64_after_hwframe+0x78/0xe2
RIP: 0033:0x7f44a553c887

It can be reproduced with these commands:

  cd /sys/fs/cgroup/
  mkdir test
  cd test/
  echo +cpuset > ../cgroup.subtree_control
  echo root > cpuset.cpus.partition
  cat /sys/fs/cgroup/cpuset.cpus.effective
  0-3
  echo 0-3 > cpuset.cpus		// taking away all cpus from root

This issue is caused by the incorrect rebuilding of scheduling domains. In this scenario, test/cpuset.cpus.partition should be an invalid root and should not trigger the rebuilding of scheduling domains. When update_parent_effective_cpumask() is called with partcmd_update and a non-null newmask, it should recheck whether newmask leaves any CPUs available for a parent/cs that has tasks.

Fixes: 0c7f293efc87 ("cgroup/cpuset: Add cpuset.cpus.exclusive.effective for v2")
Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-05  timekeeping: Fix bogus clock_was_set() invocation in do_adjtimex()  (Thomas Gleixner, 1 file, -1/+1)

The addition of the bases argument to clock_was_set() fixed up all call sites correctly except for do_adjtimex(). This uses CLOCK_REALTIME instead of CLOCK_SET_WALL as argument. CLOCK_REALTIME is 0.

As a result the effect of that clock_was_set() notification is incomplete and might result in timers expiring late because the hrtimer code does not re-evaluate the affected clock bases.

Use CLOCK_SET_WALL instead of CLOCK_REALTIME to tell the hrtimers code which clock bases need to be re-evaluated.

Fixes: 17a1b8826b45 ("hrtimer: Add bases argument to clock_was_set()")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/877ccx7igo.ffs@tglx
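Given the -1/+1 diffstat, the fix boils down to something like the following; an illustrative sketch based on the commit text, not a quote of the diff:

    /* do_adjtimex(): notify with the wall-clock base mask, not the
     * CLOCK_REALTIME clock id (which is 0 and selects no bases). */
    clock_was_set(CLOCK_SET_WALL);	/* was: clock_was_set(CLOCK_REALTIME) */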
2024-08-05  ntp: Safeguard against time_constant overflow  (Justin Stitt, 1 file, -3/+2)

Using syzkaller with the recently reintroduced signed integer overflow sanitizer produces this UBSAN report:

    UBSAN: signed-integer-overflow in ../kernel/time/ntp.c:738:18
    9223372036854775806 + 4 cannot be represented in type 'long'
    Call Trace:
     handle_overflow+0x171/0x1b0
     __do_adjtimex+0x1236/0x1440
     do_adjtimex+0x2be/0x740

The user supplied time_constant value is incremented by four and then clamped to the operating range.

Before commit eea83d896e31 ("ntp: NTP4 user space bits update") the user supplied value was sanity checked to be in the operating range. That change removed the sanity check and relied on clamping after incrementing which does not work correctly when the user supplied value is in the overflow zone of the '+ 4' operation.

The operation requires CAP_SYS_TIME and the side effect of the overflow is NTP getting out of sync.

Similar to the fixups for time_maxerror and time_esterror, clamp the user space supplied value to the operating range.

[ tglx: Switch to clamping ]

Fixes: eea83d896e31 ("ntp: NTP4 user space bits update")
Signed-off-by: Justin Stitt <justinstitt@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20240517-b4-sio-ntp-c-v2-1-f3a80096f36f@google.com
Closes: https://github.com/KSPP/linux/issues/352
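A hedged sketch of the clamping approach described above; MAXTC and the STA_NANO handling are assumed from the usual kernel/time/ntp.c conventions:

    if (txc->modes & ADJ_TIMECONST) {
            /* Clamp the user value first so the '+ 4' below cannot overflow. */
            time_constant = clamp(txc->constant, 0, MAXTC);
            if (!(time_status & STA_NANO))
                    time_constant += 4;
            time_constant = clamp(time_constant, 0, MAXTC);
    }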
2024-08-05  ntp: Clamp maxerror and esterror to operating range  (Justin Stitt, 1 file, -2/+2)

Using syzkaller alongside the newly reintroduced signed integer overflow sanitizer spits out this report:

    UBSAN: signed-integer-overflow in ../kernel/time/ntp.c:461:16
    9223372036854775807 + 500 cannot be represented in type 'long'
    Call Trace:
     handle_overflow+0x171/0x1b0
     second_overflow+0x2d6/0x500
     accumulate_nsecs_to_secs+0x60/0x160
     timekeeping_advance+0x1fe/0x890
     update_wall_time+0x10/0x30

time_maxerror is unconditionally incremented and the result is checked against NTP_PHASE_LIMIT, but the increment itself can overflow, resulting in wrap-around to negative space.

Before commit eea83d896e31 ("ntp: NTP4 user space bits update") the user supplied value was sanity checked to be in the operating range. That change removed the sanity check and relied on clamping in handle_overflow() which does not work correctly when the user supplied value is in the overflow zone of the '+ 500' operation.

The operation requires CAP_SYS_TIME and the side effect of the overflow is NTP getting out of sync.

Miroslav confirmed that the input value should be clamped to the operating range and the same applies to time_esterror. The latter is not used by the kernel, but the value still should be in the operating range as it was before the sanity check got removed.

Clamp them to the operating range.

[ tglx: Changed it to clamping and included time_esterror ]

Fixes: eea83d896e31 ("ntp: NTP4 user space bits update")
Signed-off-by: Justin Stitt <justinstitt@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Link: https://lore.kernel.org/all/20240517-b4-sio-ntp-usec-v2-1-d539180f2b79@google.com
Closes: https://github.com/KSPP/linux/issues/354
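Again as a sketch only, with NTP_PHASE_LIMIT as the operating-range bound referenced in the report above:

    /* Constrain the user-supplied values so later unconditional
     * increments cannot wrap them into negative space. */
    if (txc->modes & ADJ_MAXERROR)
            time_maxerror = clamp(txc->maxerror, 0, NTP_PHASE_LIMIT);
    if (txc->modes & ADJ_ESTERROR)
            time_esterror = clamp(txc->esterror, 0, NTP_PHASE_LIMIT);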
2024-08-05  kprobes: Fix to check symbol prefixes correctly  (Masami Hiramatsu (Google), 1 file, -2/+2)

Since str_has_prefix() takes the prefix as the 2nd argument and the string as the first, is_cfi_preamble_symbol() always fails to check the prefix. Fix the function parameter order so that it correctly checks the prefix.

Link: https://lore.kernel.org/all/172260679559.362040.7360872132937227206.stgit@devnote2/
Fixes: de02f2ac5d8c ("kprobes: Prohibit probing on CFI preamble symbol")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
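For illustration, a hedged sketch of the corrected check; the helper body is paraphrased rather than quoted, and the "__cfi_"/"__pfx_" prefixes come from the CFI preamble naming scheme:

    static bool is_cfi_preamble_symbol(unsigned long addr)
    {
            char symbuf[KSYM_NAME_LEN];

            if (lookup_symbol_name(addr, symbuf))
                    return false;

            /* str_has_prefix(string, prefix): the string comes first. */
            return str_has_prefix(symbuf, "__cfi_") ||
                   str_has_prefix(symbuf, "__pfx_");
    }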
2024-08-04  profiling: remove profile=sleep support  (Tetsuo Handa, 2 files, -20/+1)

The kernel sleep profile is no longer working due to a recursive locking bug introduced by commit 42a20f86dc19 ("sched: Add wrapper for get_wchan() to keep task blocked").

Booting with the 'profile=sleep' kernel command line option added or executing

  # echo -n sleep > /sys/kernel/profiling

after boot causes the system to lock up.

Lockdep reports

  kthreadd/3 is trying to acquire lock:
  ffff93ac82e08d58 (&p->pi_lock){....}-{2:2}, at: get_wchan+0x32/0x70

  but task is already holding lock:
  ffff93ac82e08d58 (&p->pi_lock){....}-{2:2}, at: try_to_wake_up+0x53/0x370

with the call trace being

   lock_acquire+0xc8/0x2f0
   get_wchan+0x32/0x70
   __update_stats_enqueue_sleeper+0x151/0x430
   enqueue_entity+0x4b0/0x520
   enqueue_task_fair+0x92/0x6b0
   ttwu_do_activate+0x73/0x140
   try_to_wake_up+0x213/0x370
   swake_up_locked+0x20/0x50
   complete+0x2f/0x40
   kthread+0xfb/0x180

However, since nobody noticed this regression for more than two years, let's remove 'profile=sleep' support based on the assumption that nobody needs this functionality.

Fixes: 42a20f86dc19 ("sched: Add wrapper for get_wchan() to keep task blocked")
Cc: stable@vger.kernel.org # v5.16+
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-08-04  Merge tag 'timers-urgent-2024-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds, 2 files, -2/+3)

Pull timer fixes from Thomas Gleixner:
 "Two fixes for the timer/clocksource code:

  - The recent fix to make the take over of the broadcast timer more
    reliable retrieves a per CPU pointer in preemptible context.

    This went unnoticed in testing as some compilers hoist the access
    into the non-preemptible section where the pointer is actually
    used, but obviously compilers can rightfully invoke it where the
    code put it.

    Move it into the non-preemptible section right to the actual usage
    side to cure it.

  - The clocksource watchdog is supposed to emit a warning when the
    retry count is greater than one and the number of retries reaches
    the limit.

    The condition is backwards and warns always when the count is
    greater than one. Fix up the condition to prevent spamming dmesg"

* tag 'timers-urgent-2024-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  clocksource: Fix brown-bag boolean thinko in cs_watchdog_read()
  tick/broadcast: Move per CPU pointer access into the atomic section
2024-08-04  Merge tag 'sched-urgent-2024-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds, 2 files, -21/+53)

Pull scheduler fixes from Thomas Gleixner:

 - When stime is larger than rtime due to accounting imprecision, then
   utime = rtime - stime becomes negative. As this is unsigned math, the
   result becomes a huge positive number.

   Cure it by resetting stime to rtime in that case, so utime becomes 0.

 - Restore consistent state when sched_cpu_deactivate() fails.

   When offlining a CPU fails in sched_cpu_deactivate() after the SMT
   present counter has been decremented, then the function aborts but
   fails to increment the SMT present counter and leaves it imbalanced.
   Consecutive operations cause it to underflow. Add the missing fixup
   for the error path.

   For SMT accounting the runqueue needs to be marked online again in
   the error exit path to restore consistent state.

* tag 'sched-urgent-2024-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/core: Fix unbalance set_rq_online/offline() in sched_cpu_deactivate()
  sched/core: Introduce sched_set_rq_on/offline() helper
  sched/smt: Fix unbalance sched_smt_present dec/inc
  sched/smt: Introduce sched_smt_present_inc/dec() helper
  sched/cputime: Fix mul_u64_u64_div_u64() precision for cputime
2024-08-04  Merge tag 'locking-urgent-2024-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds, 2 files, -3/+3)

Pull locking fixes from Thomas Gleixner:
 "Two fixes for locking and jump labels:

  - Ensure that the atomic_cmpxchg() conditions are correct and
    evaluating to true on any non-zero value except 1. The missing
    check of the return value leads to inconsistent state of the jump
    label counter.

  - Add a missing type conversion in the paravirt spinlock code which
    makes Loongson build again"

* tag 'locking-urgent-2024-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  jump_label: Fix the fix, brown paper bags galore
  locking/pvqspinlock: Correct the type of "old" variable in pv_kick_node()
2024-08-02  clocksource: Fix brown-bag boolean thinko in cs_watchdog_read()  (Paul E. McKenney, 1 file, -1/+1)

The current "nretries > 1 || nretries >= max_retries" check in cs_watchdog_read() will always evaluate to true, and thus pr_warn(), if nretries is greater than 1.

The intent is instead to never warn on the first try, but otherwise warn if the successful retry was the last retry. Therefore, change that "||" to "&&".

Fixes: db3a34e17433 ("clocksource: Retry clock read if long delays detected")
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20240802154618.4149953-2-paulmck@kernel.org
2024-07-31  jump_label: Fix the fix, brown paper bags galore  (Peter Zijlstra, 1 file, -2/+2)

Per the example of:

  !atomic_cmpxchg(&key->enabled, 0, 1)

the inverse was written as:

  atomic_cmpxchg(&key->enabled, 1, 0)

except of course, that while !old is only true for old == 0, old is true for everything except old == 0.

Fix it to read:

  atomic_cmpxchg(&key->enabled, 1, 0) == 1

such that only the 1->0 transition returns true and goes on to disable the keys.

Fixes: 83ab38ef0a0b ("jump_label: Fix concurrency issues in static_key_slow_dec()")
Reported-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lkml.kernel.org/r/20240731105557.GY33588@noisy.programming.kicks-ass.net
2024-07-31  tick/broadcast: Move per CPU pointer access into the atomic section  (Thomas Gleixner, 1 file, -1/+2)

The recent fix for making the take over of the broadcast timer more reliable retrieves a per CPU pointer in preemptible context.

This went unnoticed as compilers hoist the access into the non-preemptible region where the pointer is actually used. But of course it's valid that the compiler keeps it at the place where the code puts it, which rightfully triggers:

  BUG: using smp_processor_id() in preemptible [00000000] code:
       caller is hotplug_cpu__broadcast_tick_pull+0x1c/0xc0

Move it to the actual usage site which is in a non-preemptible region.

Fixes: f7d43dd206e7 ("tick/broadcast: Make takeover of broadcast hrtimer reliable")
Reported-by: David Wang <00107082@163.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Yu Liao <liaoyu15@huawei.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/87ttg56ers.ffs@tglx
2024-07-29  profiling: remove stale percpu flip buffer variables  (Linus Torvalds, 1 file, -6/+0)

For some reason I didn't see this issue on my arm64 or x86-64 builds, but Stephen Rothwell reports that commit 2accfdb7eff6 ("profiling: attempt to remove per-cpu profile flip buffer") left these static variables around, and the powerpc build is unhappy about them:

  kernel/profile.c:52:28: warning: 'cpu_profile_flip' defined but not used [-Wunused-variable]
     52 | static DEFINE_PER_CPU(int, cpu_profile_flip);
        |                            ^~~~~~~~~~~~~~~~
  ..

So remove these stale left-over remnants too.

Fixes: 2accfdb7eff6 ("profiling: attempt to remove per-cpu profile flip buffer")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-07-29  task_work: make TWA_NMI_CURRENT handling conditional on IRQ_WORK  (Linus Torvalds, 1 file, -0/+6)

The TWA_NMI_CURRENT handling very much depends on IRQ_WORK, but that isn't universally enabled everywhere.

Maybe the IRQ_WORK infrastructure should just be unconditional - x86 ends up indirectly enabling it through unconditionally enabling PERF_EVENTS, for example. But it also gets enabled by having SMP support, or even if you just have PRINTK enabled.

But in the meantime TWA_NMI_CURRENT causes tons of build failures on various odd minimal configs. Which did show up in linux-next, but despite that nobody bothered to fix it or even inform me until -rc1 was out.

Fixes: 466e4d801cd4 ("task_work: Add TWA_NMI_CURRENT as an additional notify mode")
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Reported-by: kernelci.org bot <bot@kernelci.org>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-07-29  profiling: attempt to remove per-cpu profile flip buffer  (Linus Torvalds, 1 file, -181/+2)

This is the really old legacy kernel profiling code, which has long since been obviated by "real profiling" (ie 'prof' and company), and mainly remains as a source of syzbot reports.

There are anecdotal reports that people still use it for boot-time profiling, but it's unlikely that such use would care about the old NUMA optimizations in this code from 2004 (commit ad02973d42: "profile: 512x Altix timer interrupt livelock fix" in the BK import archive at [1]).

So in order to head off future syzbot reports, let's try to simplify this code and get rid of the per-cpu profile buffers that are quite a large portion of the complexity footprint of this thing (including CPU hotplug callbacks etc).

It's unlikely anybody will actually notice, or possibly, as Thomas put it: "Only people who indulge in nostalgia will notice :)".

That said, if it turns out that this code is actually actively used by somebody, we can always revert this removal. Thus the "attempt" in the summary line.

[ Note: in a small nod to "the profiling code can cause NUMA problems", this also removes the "increment the last entry in the profiling array on any unknown hits" logic. That would account any program counter in a module to that single counter location, and might exacerbate any NUMA cacheline bouncing issues ]

Link: https://lore.kernel.org/all/CAHk-=wgs52BxT4Zjmjz8aNvHWKxf5_ThBY4bYL1Y6CTaNL2dTw@mail.gmail.com/
Link: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git [1]
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-07-29  profiling: remove prof_cpu_mask  (Tetsuo Handa, 2 files, -40/+13)

syzbot is reporting uninit-value at profile_hits(), for there is a race window between

  if (!alloc_cpumask_var(&prof_cpu_mask, GFP_KERNEL))
          return -ENOMEM;
  cpumask_copy(prof_cpu_mask, cpu_possible_mask);

in profile_init() and

  cpumask_available(prof_cpu_mask) &&
  cpumask_test_cpu(smp_processor_id(), prof_cpu_mask))

in profile_tick(); prof_cpu_mask remains uninitialized until cpumask_copy() completes, while cpumask_available(prof_cpu_mask) returns true as soon as alloc_cpumask_var(&prof_cpu_mask) completes.

We could replace alloc_cpumask_var() with zalloc_cpumask_var() and call cpumask_copy() from create_proc_profile() on only UP kernels, for profile_online_cpu() calls cpumask_set_cpu() as needed via cpuhp_setup_state(CPUHP_AP_ONLINE_DYN) on SMP kernels. But this patch removes prof_cpu_mask because it seems unnecessary.

The cpumask_test_cpu(smp_processor_id(), prof_cpu_mask) test in profile_tick() is likely always true, because a CPU cannot call profile_tick() if that CPU is offline, cpumask_set_cpu(cpu, prof_cpu_mask) is called when that CPU becomes online, and cpumask_clear_cpu(cpu, prof_cpu_mask) is called when that CPU becomes offline. This test could be false during the transition between online and offline.

But according to include/linux/cpuhotplug.h, CPUHP_PROFILE_PREPARE belongs to the PREPARE section, which means that the CPU subjected to profile_dead_cpu() cannot be inside profile_tick() (i.e. no risk of a use-after-free bug) because interrupts for that CPU are disabled during the PREPARE section. Therefore, this test is guaranteed to be true, and can be removed. (Since profile_hits() checks prof_buffer != NULL, we don't need to check prof_buffer != NULL here unless get_irq_regs() or user_mode() is so slow that we want to avoid it when prof_buffer == NULL.)

do_profile_hits() is called from profile_tick() from the timer interrupt only if cpumask_test_cpu(smp_processor_id(), prof_cpu_mask) is true and prof_buffer is not NULL. But syzbot is also reporting that sometimes do_profile_hits() is called while the current thread is still doing vzalloc(), where prof_buffer must be NULL at that moment. This indicates that multiple threads concurrently tried to write to /sys/kernel/profiling, which caused prof_buffer to be re-allocated even though somebody had already allocated it. Fix this by using serialization.

Reported-by: syzbot <syzbot+b1a83ab2a9eb9321fbdd@syzkaller.appspotmail.com>
Closes: https://syzkaller.appspot.com/bug?extid=b1a83ab2a9eb9321fbdd
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Tested-by: syzbot <syzbot+b1a83ab2a9eb9321fbdd@syzkaller.appspotmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-07-29  sched/core: Fix unbalance set_rq_online/offline() in sched_cpu_deactivate()  (Yang Yingliang, 1 file, -0/+1)

If cpuset_cpu_inactive() fails, set_rq_online() needs to be called to roll back.

Fixes: 120455c514f7 ("sched: Fix hotplug vs CPU bandwidth control")
Cc: stable@kernel.org
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240703031610.587047-5-yangyingliang@huaweicloud.com
2024-07-29  sched/core: Introduce sched_set_rq_on/offline() helper  (Yang Yingliang, 1 file, -14/+26)

Introduce a sched_set_rq_on/offline() helper, so it can be called from the normal or error path simply. No functional change.

Cc: stable@kernel.org
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240703031610.587047-4-yangyingliang@huaweicloud.com
2024-07-29  sched/smt: Fix unbalance sched_smt_present dec/inc  (Yang Yingliang, 1 file, -0/+1)

I got the following warning report while doing a stress test:

  jump label: negative count!
  WARNING: CPU: 3 PID: 38 at kernel/jump_label.c:263 static_key_slow_try_dec+0x9d/0xb0
  Call Trace:
  <TASK>
   __static_key_slow_dec_cpuslocked+0x16/0x70
   sched_cpu_deactivate+0x26e/0x2a0
   cpuhp_invoke_callback+0x3ad/0x10d0
   cpuhp_thread_fun+0x3f5/0x680
   smpboot_thread_fn+0x56d/0x8d0
   kthread+0x309/0x400
   ret_from_fork+0x41/0x70
   ret_from_fork_asm+0x1b/0x30
  </TASK>

When cpuset_cpu_inactive() fails in sched_cpu_deactivate(), the CPU offline fails, but sched_smt_present has already been decremented earlier in sched_cpu_deactivate(). This leads to an unbalanced dec/inc, so fix it by incrementing sched_smt_present in the error path.

Fixes: c5511d03ec09 ("sched/smt: Make sched_smt_present track topology")
Cc: stable@kernel.org
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Link: https://lore.kernel.org/r/20240703031610.587047-3-yangyingliang@huaweicloud.com
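A hedged sketch of the rebalanced error path described above, using the helper names introduced in the companion patches of this series; 'rq' and 'cpu' are assumed locals of sched_cpu_deactivate() and the surrounding logic is paraphrased:

    ret = cpuset_cpu_inactive(cpu);
    if (ret) {
            /* CPU offline failed: undo the earlier decrement so the
             * SMT counter stays balanced, and restore the runqueue. */
            sched_smt_present_inc(cpu);
            sched_set_rq_online(rq, cpu);
            return ret;
    }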
2024-07-29  sched/smt: Introduce sched_smt_present_inc/dec() helper  (Yang Yingliang, 1 file, -7/+19)

Introduce sched_smt_present_inc/dec() helpers, so they can be called from the normal or error path simply. No functional change.

Cc: stable@kernel.org
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240703031610.587047-2-yangyingliang@huaweicloud.com
2024-07-29  sched/cputime: Fix mul_u64_u64_div_u64() precision for cputime  (Zheng Zucheng, 1 file, -0/+6)

In extreme test scenarios the 14th field (utime) in /proc/xx/stat is greater than sum_exec_runtime:

  utime = 18446744073709518790 ns, rtime = 135989749728000 ns

In the cputime_adjust() process, stime becomes greater than rtime due to a mul_u64_u64_div_u64() precision problem:

  before calling mul_u64_u64_div_u64():
    stime = 175136586720000, rtime = 135989749728000, utime = 1416780000
  after calling mul_u64_u64_div_u64():
    stime = 135989949653530

An unsigned underflow occurs because rtime is less than stime:

  utime = rtime - stime = 135989749728000 - 135989949653530 = -199925530
        = (u64)18446744073709518790

Trigger conditions:
  1) The user task runs in kernel mode most of the time
  2) ARM64 architecture
  3) TICK_CPU_ACCOUNTING=y, CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is not set

Fix the mul_u64_u64_div_u64() precision issue by resetting stime to rtime when stime exceeds rtime.

Fixes: 3dc167ba5729 ("sched/cputime: Improve cputime_adjust()")
Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20240726023235.217771-1-zhengzucheng@huawei.com
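A hedged sketch of the guard in cputime_adjust() described above; the surrounding scaling call follows the commit text:

    stime = mul_u64_u64_div_u64(stime, rtime, stime + utime);
    /*
     * mul_u64_u64_div_u64() can approximate on some architectures, so
     * enforce a*b/(b+c) <= a: stime must never exceed rtime, otherwise
     * utime = rtime - stime wraps to a huge u64 value.
     */
    if (stime > rtime)
            stime = rtime;
    utime = rtime - stime;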
2024-07-29  locking/pvqspinlock: Correct the type of "old" variable in pv_kick_node()  (Uros Bizjak, 1 file, -1/+1)

"enum vcpu_state" is not compatible with "u8" type for all targets, resulting in:

  error: initialization of 'u8 *' {aka 'unsigned char *'} from incompatible pointer type 'enum vcpu_state *'

for LoongArch. Correct the type of "old" variable to "u8".

Fixes: fea0e1820b51 ("locking/pvqspinlock: Use try_cmpxchg() in qspinlock_paravirt.h")
Closes: https://lore.kernel.org/lkml/20240719024010.3296488-1-maobibo@loongson.cn/
Reported-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/20240721164552.50175-1-ubizjak@gmail.com
2024-07-28  minmax: make generic MIN() and MAX() macros available everywhere  (Linus Torvalds, 1 file, -2/+0)

This just standardizes the use of MIN() and MAX() macros, with the very traditional semantics. The goal is to use these for C constant expressions and for top-level / static initializers, and so be able to simplify the min()/max() macros.

These macro names were used by various kernel code - they are very traditional, after all - and all such users have been fixed up, with a few different approaches:

 - trivial duplicated macro definitions have been removed

   Note that 'trivial' here means that it's obviously kernel code that
   already included all the major kernel headers, and thus gets the new
   generic MIN/MAX macros automatically.

 - non-trivial duplicated macro definitions are guarded with #ifndef

   This is the "yes, they define their own versions, but no, the
   include situation is not entirely obvious, and maybe they don't get
   the generic version automatically" case.

 - strange use case #1:

   A couple of drivers decided that the way they want to describe their
   versioning is with

        #define MAJ 1
        #define MIN 2
        #define DRV_VERSION __stringify(MAJ) "." __stringify(MIN)

   which adds zero value and I just did my Alexander the Great
   impersonation, and rewrote that pointless Gordian knot as

        #define DRV_VERSION "1.2"

   instead.

 - strange use case #2:

   A couple of drivers thought that it's a good idea to have a random
   'MIN' or 'MAX' define for a value or index into a table, rather than
   the traditional macro that takes arguments.

   These values were re-written as C enum's instead. The new
   function-like macros only expand when followed by an open
   parenthesis, and thus don't clash with enum use.

Happily, there weren't really all that many of these cases, and a lot of users already had the pattern of using '#ifndef' guarding (or in one case just using '#undef MIN') before defining their own private version that does the same thing. I left such cases alone.

Cc: David Laight <David.Laight@aculab.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
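For reference, the "very traditional semantics" being standardized look like this; an illustrative sketch, not a quote of the actual minmax.h implementation:

    /* Classic, constant-expression-friendly forms: no type checking,
     * arguments may be evaluated twice, usable in static initializers. */
    #define MIN(a, b) ((a) < (b) ? (a) : (b))
    #define MAX(a, b) ((a) > (b) ? (a) : (b))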
2024-07-27  Merge tag 'timers-urgent-2024-07-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds, 2 files, -205/+215)

Pull timer migration updates from Thomas Gleixner:
 "Fixes and minor updates for the timer migration code:

  - Stop testing the group->parent pointer as it is not guaranteed to
    be stable over a chain of operations by design. This includes a
    warning which would be nice to have but it produces false positives
    due to the racy nature of the check.

  - Plug a race between CPUs going in and out of idle and a CPU hotplug
    operation. The latter can create and connect a new hierarchy level
    which is missed in the concurrent updates of CPUs which go into
    idle. As a result the events of such a CPU might not be processed
    and timers go stale.

    Cure it by splitting the hotplug operation into a prepare and
    online callback. The prepare callback is guaranteed to run on an
    online and therefore active CPU. This CPU updates the hierarchy and
    being online ensures that there is always at least one migrator
    active which handles the modified hierarchy correctly when going
    idle. The online callback which runs on the incoming CPU then just
    marks the CPU active and brings it into operation.

  - Improve tracing and polish the code further so it is more obvious
    what's going on"

* tag 'timers-urgent-2024-07-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timers/migration: Fix grammar in comment
  timers/migration: Spare write when nothing changed
  timers/migration: Rename childmask by groupmask to make naming more obvious
  timers/migration: Read childmask and parent pointer in a single place
  timers/migration: Use a single struct for hierarchy walk data
  timers/migration: Improve tracing
  timers/migration: Move hierarchy setup into cpuhotplug prepare callback
  timers/migration: Do not rely always on group->parent
2024-07-25  Merge tag 'net-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Linus Torvalds, 1 file, -15/+13)

Pull networking fixes from Jakub Kicinski:
 "Including fixes from bpf and netfilter.

  A lot of networking people were at a conference last week, busy
  catching COVID, so relatively short PR.

  Current release - regressions:

   - tcp: process the 3rd ACK with sk_socket for TFO and MPTCP

  Current release - new code bugs:

   - l2tp: protect session IDR and tunnel session list with one lock,
     make sure the state is coherent to avoid a warning

   - eth: bnxt_en: update xdp_rxq_info in queue restart logic

   - eth: airoha: fix location of the MBI_RX_AGE_SEL_MASK field

  Previous releases - regressions:

   - xsk: require XDP_UMEM_TX_METADATA_LEN to actuate tx_metadata_len,
     the field reuses previously un-validated pad

  Previous releases - always broken:

   - tap/tun: drop short frames to prevent crashes later in the stack

   - eth: ice: add a per-VF limit on number of FDIR filters

   - af_unix: disable MSG_OOB handling for sockets in sockmap/sockhash"

* tag 'net-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (34 commits)
  tun: add missing verification for short frame
  tap: add missing verification for short frame
  mISDN: Fix a use after free in hfcmulti_tx()
  gve: Fix an edge case for TSO skb validity check
  bnxt_en: update xdp_rxq_info in queue restart logic
  tcp: process the 3rd ACK with sk_socket for TFO and MPTCP
  ...