summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2024-11-08fork: do not invoke uffd on fork if error occursLorenzo Stoakes3-1/+37
[ Upstream commit f64e67e5d3a45a4a04286c47afade4b518acd47b ] Patch series "fork: do not expose incomplete mm on fork". During fork we may place the virtual memory address space into an inconsistent state before the fork operation is complete. In addition, we may encounter an error during the fork operation that indicates that the virtual memory address space is invalidated. As a result, we should not be exposing it in any way to external machinery that might interact with the mm or VMAs, machinery that is not designed to deal with incomplete state. We specifically update the fork logic to defer khugepaged and ksm to the end of the operation and only to be invoked if no error arose, and disallow uffd from observing fork events should an error have occurred. This patch (of 2): Currently on fork we expose the virtual address space of a process to userland unconditionally if uffd is registered in VMAs, regardless of whether an error arose in the fork. This is performed in dup_userfaultfd_complete() which is invoked unconditionally, and performs two duties - invoking registered handlers for the UFFD_EVENT_FORK event via dup_fctx(), and clearing down userfaultfd_fork_ctx objects established in dup_userfaultfd(). This is problematic, because the virtual address space may not yet be correctly initialised if an error arose. The change in commit d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()") makes this more pertinent as we may be in a state where entries in the maple tree are not yet consistent. We address this by, on fork error, ensuring that we roll back state that we would otherwise expect to clean up through the event being handled by userland and perform the memory freeing duty otherwise performed by dup_userfaultfd_complete(). We do this by implementing a new function, dup_userfaultfd_fail(), which performs the same loop, only decrementing reference counts. Note that we perform mmgrab() on the parent and child mm's, however userfaultfd_ctx_put() will mmdrop() this once the reference count drops to zero, so we will avoid memory leaks correctly here. Link: https://lkml.kernel.org/r/cover.1729014377.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/d3691d58bb58712b6fb3df2be441d175bd3cdf07.1729014377.git.lorenzo.stoakes@oracle.com Fixes: d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reported-by: Jann Horn <jannh@google.com> Reviewed-by: Jann Horn <jannh@google.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Linus Torvalds <torvalds@linuxfoundation.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-11-08mei: use kvmalloc for read bufferAlexander Usyskin1-2/+2
[ Upstream commit 4adf613e01bf99e1739f6ff3e162ad5b7d578d1a ] Read buffer is allocated according to max message size, reported by the firmware and may reach 64K in systems with pxp client. Contiguous 64k allocation may fail under memory pressure. Read buffer is used as in-driver message storage and not required to be contiguous. Use kvmalloc to allow kernel to allocate non-contiguous memory. Fixes: 3030dc056459 ("mei: add wrapper for queuing control commands.") Cc: stable <stable@kernel.org> Reported-by: Rohit Agarwal <rohiagar@chromium.org> Closes: https://lore.kernel.org/all/20240813084542.2921300-1-rohiagar@chromium.org/ Tested-by: Brian Geffon <bgeffon@google.com> Signed-off-by: Alexander Usyskin <alexander.usyskin@intel.com> Acked-by: Tomas Winkler <tomasw@gmail.com> Link: https://lore.kernel.org/r/20241015123157.2337026-1-alexander.usyskin@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-11-08mptcp: init: protect sched with rcu_read_lockMatthieu Baerts (NGI0)1-0/+2
[ Upstream commit 3deb12c788c385e17142ce6ec50f769852fcec65 ] Enabling CONFIG_PROVE_RCU_LIST with its dependence CONFIG_RCU_EXPERT creates this splat when an MPTCP socket is created: ============================= WARNING: suspicious RCU usage 6.12.0-rc2+ #11 Not tainted ----------------------------- net/mptcp/sched.c:44 RCU-list traversed in non-reader section!! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 no locks held by mptcp_connect/176. stack backtrace: CPU: 0 UID: 0 PID: 176 Comm: mptcp_connect Not tainted 6.12.0-rc2+ #11 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 Call Trace: <TASK> dump_stack_lvl (lib/dump_stack.c:123) lockdep_rcu_suspicious (kernel/locking/lockdep.c:6822) mptcp_sched_find (net/mptcp/sched.c:44 (discriminator 7)) mptcp_init_sock (net/mptcp/protocol.c:2867 (discriminator 1)) ? sock_init_data_uid (arch/x86/include/asm/atomic.h:28) inet_create.part.0.constprop.0 (net/ipv4/af_inet.c:386) ? __sock_create (include/linux/rcupdate.h:347 (discriminator 1)) __sock_create (net/socket.c:1576) __sys_socket (net/socket.c:1671) ? __pfx___sys_socket (net/socket.c:1712) ? do_user_addr_fault (arch/x86/mm/fault.c:1419 (discriminator 1)) __x64_sys_socket (net/socket.c:1728) do_syscall_64 (arch/x86/entry/common.c:52 (discriminator 1)) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) That's because when the socket is initialised, rcu_read_lock() is not used despite the explicit comment written above the declaration of mptcp_sched_find() in sched.c. Adding the missing lock/unlock avoids the warning. Fixes: 1730b2b2c5a5 ("mptcp: add sched in mptcp_sock") Cc: stable@vger.kernel.org Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/523 Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20241021-net-mptcp-sched-lock-v1-1-637759cf061c@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-11-08tpm: Lazily flush the auth sessionJarkko Sakkinen4-20/+44
[ Upstream commit df745e25098dcb2f706399c0d06dd8d1bab6b6ec ] Move the allocation of chip->auth to tpm2_start_auth_session() so that this field can be used as flag to tell whether auth session is active or not. Instead of flushing and reloading the auth session for every transaction separately, keep the session open unless /dev/tpm0 is used. Reported-by: Pengyu Ma <mapengyu@gmail.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219229 Cc: stable@vger.kernel.org # v6.10+ Fixes: 7ca110f2679b ("tpm: Address !chip->auth in tpm_buf_append_hmac_session*()") Tested-by: Pengyu Ma <mapengyu@gmail.com> Tested-by: Stefan Berger <stefanb@linux.ibm.com> Reviewed-by: Stefan Berger <stefanb@linux.ibm.com> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-11-08drm/amdgpu/smu13: fix profile reportingAlex Deucher1-3/+3
[ Upstream commit 935abb86a95def8c20dbb184ce30051db168e541 ] The following 3 commits landed in parallel: commit d7d2688bf4ea ("drm/amd/pm: update workload mask after the setting") commit 7a1613e47e65 ("drm/amdgpu/smu13: always apply the powersave optimization") commit 7c210ca5a2d7 ("drm/amdgpu: handle default profile on on devices without fullscreen 3D") While everything is set correctly, this caused the profile to be reported incorrectly because both the powersave and fullscreen3d bits were set in the mask and when the driver prints the profile, it looks for the first bit set. Fixes: d7d2688bf4ea ("drm/amd/pm: update workload mask after the setting") Reviewed-by: Kenneth Feng <kenneth.feng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit ecfe9b237687a55d596fff0650ccc8cc455edd3f) Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-11-08drm/amd/pm: Vangogh: Fix kernel memory out of bounds writeTvrtko Ursulin1-1/+3
[ Upstream commit 4aa923a6e6406b43566ef6ac35a3d9a3197fa3e8 ] KASAN reports that the GPU metrics table allocated in vangogh_tables_init() is not large enough for the memset done in smu_cmn_init_soft_gpu_metrics(). Condensed report follows: [ 33.861314] BUG: KASAN: slab-out-of-bounds in smu_cmn_init_soft_gpu_metrics+0x73/0x200 [amdgpu] [ 33.861799] Write of size 168 at addr ffff888129f59500 by task mangoapp/1067 ... [ 33.861808] CPU: 6 UID: 1000 PID: 1067 Comm: mangoapp Tainted: G W 6.12.0-rc4 #356 1a56f59a8b5182eeaf67eb7cb8b13594dd23b544 [ 33.861816] Tainted: [W]=WARN [ 33.861818] Hardware name: Valve Galileo/Galileo, BIOS F7G0107 12/01/2023 [ 33.861822] Call Trace: [ 33.861826] <TASK> [ 33.861829] dump_stack_lvl+0x66/0x90 [ 33.861838] print_report+0xce/0x620 [ 33.861853] kasan_report+0xda/0x110 [ 33.862794] kasan_check_range+0xfd/0x1a0 [ 33.862799] __asan_memset+0x23/0x40 [ 33.862803] smu_cmn_init_soft_gpu_metrics+0x73/0x200 [amdgpu 13b1bc364ec578808f676eba412c20eaab792779] [ 33.863306] vangogh_get_gpu_metrics_v2_4+0x123/0xad0 [amdgpu 13b1bc364ec578808f676eba412c20eaab792779] [ 33.864257] vangogh_common_get_gpu_metrics+0xb0c/0xbc0 [amdgpu 13b1bc364ec578808f676eba412c20eaab792779] [ 33.865682] amdgpu_dpm_get_gpu_metrics+0xcc/0x110 [amdgpu 13b1bc364ec578808f676eba412c20eaab792779] [ 33.866160] amdgpu_get_gpu_metrics+0x154/0x2d0 [amdgpu 13b1bc364ec578808f676eba412c20eaab792779] [ 33.867135] dev_attr_show+0x43/0xc0 [ 33.867147] sysfs_kf_seq_show+0x1f1/0x3b0 [ 33.867155] seq_read_iter+0x3f8/0x1140 [ 33.867173] vfs_read+0x76c/0xc50 [ 33.867198] ksys_read+0xfb/0x1d0 [ 33.867214] do_syscall_64+0x90/0x160 ... [ 33.867353] Allocated by task 378 on cpu 7 at 22.794876s: [ 33.867358] kasan_save_stack+0x33/0x50 [ 33.867364] kasan_save_track+0x17/0x60 [ 33.867367] __kasan_kmalloc+0x87/0x90 [ 33.867371] vangogh_init_smc_tables+0x3f9/0x840 [amdgpu] [ 33.867835] smu_sw_init+0xa32/0x1850 [amdgpu] [ 33.868299] amdgpu_device_init+0x467b/0x8d90 [amdgpu] [ 33.868733] amdgpu_driver_load_kms+0x19/0xf0 [amdgpu] [ 33.869167] amdgpu_pci_probe+0x2d6/0xcd0 [amdgpu] [ 33.869608] local_pci_probe+0xda/0x180 [ 33.869614] pci_device_probe+0x43f/0x6b0 Empirically we can confirm that the former allocates 152 bytes for the table, while the latter memsets the 168 large block. Root cause appears that when GPU metrics tables for v2_4 parts were added it was not considered to enlarge the table to fit. The fix in this patch is rather "brute force" and perhaps later should be done in a smarter way, by extracting and consolidating the part version to size logic to a common helper, instead of brute forcing the largest possible allocation. Nevertheless, for now this works and fixes the out of bounds write. v2: * Drop impossible v3_0 case. (Mario) Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> Fixes: 41cec40bc9ba ("drm/amd/pm: Vangogh: Add new gpu_metrics_v2_4 to acquire gpu_metrics") Cc: Mario Limonciello <mario.limonciello@amd.com> Cc: Evan Quan <evan.quan@amd.com> Cc: Wenyou Yang <WenYou.Yang@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Link: https://lore.kernel.org/r/20241025145639.19124-1-tursulin@igalia.com Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 0880f58f9609f0200483a49429af0f050d281703) Cc: stable@vger.kernel.org # v6.6+ Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-11-08tpm: Rollback tpm2_load_null()Jarkko Sakkinen1-20/+24
[ Upstream commit cc7d8594342a25693d40fe96f97e5c6c29ee609c ] Do not continue on tpm2_create_primary() failure in tpm2_load_null(). Cc: stable@vger.kernel.org # v6.10+ Fixes: eb24c9788cd9 ("tpm: disable the TPM if NULL name changes") Reviewed-by: Stefan Berger <stefanb@linux.ibm.com> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-11-08tpm: Return tpm2_sessions_init() when null key creation failsJarkko Sakkinen1-2/+9
[ Upstream commit d658d59471ed80c4a8aaf082ccc3e83cdf5ae4c1 ] Do not continue tpm2_sessions_init() further if the null key pair creation fails. Cc: stable@vger.kernel.org # v6.10+ Fixes: d2add27cf2b8 ("tpm: Add NULL primary creation") Reviewed-by: Stefan Berger <stefanb@linux.ibm.com> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-11-08iov_iter: fix copy_page_from_iter_atomic() if KMAP_LOCAL_FORCE_MAPHugh Dickins1-2/+4
[ Upstream commit c749d9b7ebbc5716af7a95f7768634b30d9446ec ] generic/077 on x86_32 CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP=y with highmem, on huge=always tmpfs, issues a warning and then hangs (interruptibly): WARNING: CPU: 5 PID: 3517 at mm/highmem.c:622 kunmap_local_indexed+0x62/0xc9 CPU: 5 UID: 0 PID: 3517 Comm: cp Not tainted 6.12.0-rc4 #2 ... copy_page_from_iter_atomic+0xa6/0x5ec generic_perform_write+0xf6/0x1b4 shmem_file_write_iter+0x54/0x67 Fix copy_page_from_iter_atomic() by limiting it in that case (include/linux/skbuff.h skb_frag_must_loop() does similar). But going forward, perhaps CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP is too surprising, has outlived its usefulness, and should just be removed? Fixes: 908a1ad89466 ("iov_iter: Handle compound highmem pages in copy_page_from_iter_atomic()") Signed-off-by: Hugh Dickins <hughd@google.com> Link: https://lore.kernel.org/r/dd5f0c89-186e-18e1-4f43-19a60f5a9774@google.com Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: stable@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-11-08posix-cpu-timers: Clear TICK_DEP_BIT_POSIX_TIMER on cloneBenjamin Segall2-0/+10
[ Upstream commit b5413156bad91dc2995a5c4eab1b05e56914638a ] When cloning a new thread, its posix_cputimers are not inherited, and are cleared by posix_cputimers_init(). However, this does not clear the tick dependency it creates in tsk->tick_dep_mask, and the handler does not reach the code to clear the dependency if there were no timers to begin with. Thus if a thread has a cputimer running before clone/fork, all descendants will prevent nohz_full unless they create a cputimer of their own. Fix this by entirely clearing the tick_dep_mask in copy_process(). (There is currently no inherited state that needs a tick dependency) Process-wide timers do not have this problem because fork does not copy signal_struct as a baseline, it creates one from scratch. Fixes: b78783000d5c ("posix-cpu-timers: Migrate to use new tick dependency mask model") Signed-off-by: Ben Segall <bsegall@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/xm26o737bq8o.fsf@google.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-11-08sched/numa: Fix the potential null pointer dereference in task_numa_work()Shawn Wang1-2/+2
[ Upstream commit 9c70b2a33cd2aa6a5a59c5523ef053bd42265209 ] When running stress-ng-vm-segv test, we found a null pointer dereference error in task_numa_work(). Here is the backtrace: [323676.066985] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020 ...... [323676.067108] CPU: 35 PID: 2694524 Comm: stress-ng-vm-se ...... [323676.067113] pstate: 23401009 (nzCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--) [323676.067115] pc : vma_migratable+0x1c/0xd0 [323676.067122] lr : task_numa_work+0x1ec/0x4e0 [323676.067127] sp : ffff8000ada73d20 [323676.067128] x29: ffff8000ada73d20 x28: 0000000000000000 x27: 000000003e89f010 [323676.067130] x26: 0000000000080000 x25: ffff800081b5c0d8 x24: ffff800081b27000 [323676.067133] x23: 0000000000010000 x22: 0000000104d18cc0 x21: ffff0009f7158000 [323676.067135] x20: 0000000000000000 x19: 0000000000000000 x18: ffff8000ada73db8 [323676.067138] x17: 0001400000000000 x16: ffff800080df40b0 x15: 0000000000000035 [323676.067140] x14: ffff8000ada73cc8 x13: 1fffe0017cc72001 x12: ffff8000ada73cc8 [323676.067142] x11: ffff80008001160c x10: ffff000be639000c x9 : ffff8000800f4ba4 [323676.067145] x8 : ffff000810375000 x7 : ffff8000ada73974 x6 : 0000000000000001 [323676.067147] x5 : 0068000b33e26707 x4 : 0000000000000001 x3 : ffff0009f7158000 [323676.067149] x2 : 0000000000000041 x1 : 0000000000004400 x0 : 0000000000000000 [323676.067152] Call trace: [323676.067153] vma_migratable+0x1c/0xd0 [323676.067155] task_numa_work+0x1ec/0x4e0 [323676.067157] task_work_run+0x78/0xd8 [323676.067161] do_notify_resume+0x1ec/0x290 [323676.067163] el0_svc+0x150/0x160 [323676.067167] el0t_64_sync_handler+0xf8/0x128 [323676.067170] el0t_64_sync+0x17c/0x180 [323676.067173] Code: d2888001 910003fd f9000bf3 aa0003f3 (f9401000) [323676.067177] SMP: stopping secondary CPUs [323676.070184] Starting crashdump kernel... stress-ng-vm-segv in stress-ng is used to stress test the SIGSEGV error handling function of the system, which tries to cause a SIGSEGV error on return from unmapping the whole address space of the child process. Normally this program will not cause kernel crashes. But before the munmap system call returns to user mode, a potential task_numa_work() for numa balancing could be added and executed. In this scenario, since the child process has no vma after munmap, the vma_next() in task_numa_work() will return a null pointer even if the vma iterator restarts from 0. Recheck the vma pointer before dereferencing it in task_numa_work(). Fixes: 214dbc428137 ("sched: convert to vma iterator") Signed-off-by: Shawn Wang <shawnwang@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org # v6.2+ Link: https://lkml.kernel.org/r/20241025022208.125527-1-shawnwang@linux.alibaba.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-11-08cxl/acpi: Ensure ports ready at cxl_acpi_probe() returnDan Williams1-0/+7
[ Upstream commit 48f62d38a07d464a499fa834638afcfd2b68f852 ] In order to ensure root CXL ports are enabled upon cxl_acpi_probe() when the 'cxl_port' driver is built as a module, arrange for the module to be pre-loaded or built-in. The "Fixes:" but no "Cc: stable" on this patch reflects that the issue is merely by inspection since the bug that triggered the discovery of this potential problem [1] is fixed by other means. However, a stable backport should do no harm. Fixes: 8dd2bc0f8e02 ("cxl/mem: Add the cxl_mem driver") Link: http://lore.kernel.org/20241004212504.1246-1-gourry@gourry.net [1] Signed-off-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Gregory Price <gourry@gourry.net> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Link: https://patch.msgid.link/172964781969.81806.17276352414854540808.stgit@dwillia2-xfh.jf.intel.com Signed-off-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-11-08cxl/port: Fix cxl_bus_rescan() vs bus_rescan_devices()Dan Williams1-3/+10
[ Upstream commit 3d6ebf16438de5d712030fefbb4182b46373d677 ] It turns out since its original introduction, pre-2.6.12, bus_rescan_devices() has skipped devices that might be in the process of attaching or detaching from their driver. For CXL this behavior is unwanted and expects that cxl_bus_rescan() is a probe barrier. That behavior is simple enough to achieve with bus_for_each_dev() paired with call to device_attach(), and it is unclear why bus_rescan_devices() took the position of lockless consumption of dev->driver which is racy. The "Fixes:" but no "Cc: stable" on this patch reflects that the issue is merely by inspection since the bug that triggered the discovery of this potential problem [1] is fixed by other means. However, a stable backport should do no harm. Fixes: 8dd2bc0f8e02 ("cxl/mem: Add the cxl_mem driver") Link: http://lore.kernel.org/20241004212504.1246-1-gourry@gourry.net [1] Signed-off-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Gregory Price <gourry@gourry.net> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Link: https://patch.msgid.link/172964781104.81806.4277549800082443769.stgit@dwillia2-xfh.jf.intel.com Signed-off-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-11-08scsi: ufs: core: Fix another deadlock during RTC updatePeter Wang1-1/+1
[ Upstream commit cb7e509c4e0197f63717fee54fb41c4990ba8d3a ] If ufshcd_rtc_work calls ufshcd_rpm_put_sync() and the pm's usage_count is 0, we will enter the runtime suspend callback. However, the runtime suspend callback will wait to flush ufshcd_rtc_work, causing a deadlock. Replace ufshcd_rpm_put_sync() with ufshcd_rpm_put() to avoid the deadlock. Fixes: 6bf999e0eb41 ("scsi: ufs: core: Add UFS RTC support") Cc: stable@vger.kernel.org #6.11.x Signed-off-by: Peter Wang <peter.wang@mediatek.com> Link: https://lore.kernel.org/r/20241024015453.21684-1-peter.wang@mediatek.com Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-11-08riscv: Remove duplicated GET_RMChunyan Zhang1-2/+0
[ Upstream commit 164f66de6bb6ef454893f193c898dc8f1da6d18b ] The macro GET_RM defined twice in this file, one can be removed. Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Signed-off-by: Chunyan Zhang <zhangchunyan@iscas.ac.cn> Fixes: 956d705dd279 ("riscv: Unaligned load/store handling for M_MODE") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20241008094141.549248-3-zhangchunyan@iscas.ac.cn Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-11-08riscv: Remove unused GENERATING_ASM_OFFSETSChunyan Zhang1-2/+0
[ Upstream commit 46d4e5ac6f2f801f97bcd0ec82365969197dc9b1 ] The macro is not used in the current version of kernel, it looks like can be removed to avoid a build warning: ../arch/riscv/kernel/asm-offsets.c: At top level: ../arch/riscv/kernel/asm-offsets.c:7: warning: macro "GENERATING_ASM_OFFSETS" is not used [-Wunused-macros] 7 | #define GENERATING_ASM_OFFSETS Fixes: 9639a44394b9 ("RISC-V: Provide a cleaner raw_smp_processor_id()") Cc: stable@vger.kernel.org Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Tested-by: Alexandre Ghiti <alexghiti@rivosinc.com> Signed-off-by: Chunyan Zhang <zhangchunyan@iscas.ac.cn> Link: https://lore.kernel.org/r/20241008094141.549248-2-zhangchunyan@iscas.ac.cn Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-11-08riscv: Use '%u' to format the output of 'cpu'WangYuli1-1/+1
[ Upstream commit e0872ab72630dada3ae055bfa410bf463ff1d1e0 ] 'cpu' is an unsigned integer, so its conversion specifier should be %u, not %d. Suggested-by: Wentao Guan <guanwentao@uniontech.com> Suggested-by: Maciej W. Rozycki <macro@orcam.me.uk> Link: https://lore.kernel.org/all/alpine.DEB.2.21.2409122309090.40372@angie.orcam.me.uk/ Signed-off-by: WangYuli <wangyuli@uniontech.com> Reviewed-by: Charlie Jenkins <charlie@rivosinc.com> Tested-by: Charlie Jenkins <charlie@rivosinc.com> Fixes: f1e58583b9c7 ("RISC-V: Support cpu hotplug") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/4C127DEECDA287C8+20241017032010.96772-1-wangyuli@uniontech.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-11-08riscv: Prevent a bad reference count on CPU nodesMiquel Sabaté Solà1-2/+5
[ Upstream commit 37233169a6ea912020c572f870075a63293b786a ] When populating cache leaves we previously fetched the CPU device node at the very beginning. But when ACPI is enabled we go through a specific branch which returns early and does not call 'of_node_put' for the node that was acquired. Since we are not using a CPU device node for the ACPI code anyways, we can simply move the initialization of it just passed the ACPI block, and we are guaranteed to have an 'of_node_put' call for the acquired node. This prevents a bad reference count of the CPU device node. Moreover, the previous function did not check for errors when acquiring the device node, so a return -ENOENT has been added for that case. Signed-off-by: Miquel Sabaté Solà <mikisabate@gmail.com> Reviewed-by: Sudeep Holla <sudeep.holla@arm.com> Reviewed-by: Sunil V L <sunilvl@ventanamicro.com> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Fixes: 604f32ea6909 ("riscv: cacheinfo: initialize cacheinfo's level and type from ACPI PPTT") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240913080053.36636-1-mikisabate@gmail.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-11-08riscv: efi: Set NX compat flag in PE/COFF headerHeinrich Schuchardt1-1/+1
[ Upstream commit d41373a4b910961df5a5e3527d7bde6ad45ca438 ] The IMAGE_DLLCHARACTERISTICS_NX_COMPAT informs the firmware that the EFI binary does not rely on pages that are both executable and writable. The flag is used by some distro versions of GRUB to decide if the EFI binary may be executed. As the Linux kernel neither has RWX sections nor needs RWX pages for relocation we should set the flag. Cc: Ard Biesheuvel <ardb@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Heinrich Schuchardt <heinrich.schuchardt@canonical.com> Reviewed-by: Emil Renner Berthing <emil.renner.berthing@canonical.com> Fixes: cb7d2dd5612a ("RISC-V: Add PE/COFF header for EFI stub") Acked-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20240929140233.211800-1-heinrich.schuchardt@canonical.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-11-08ALSA: hda/realtek: Limit internal Mic boost on Dell platformKailang Yang1-3/+18
[ Upstream commit 78e7be018784934081afec77f96d49a2483f9188 ] Dell want to limit internal Mic boost on all Dell platform. Signed-off-by: Kailang Yang <kailang@realtek.com> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/561fc5f5eff04b6cbd79ed173cd1c1db@realtek.com Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-11-08Input: edt-ft5x06 - fix regmap leak when probe failsDmitry Torokhov1-1/+18
[ Upstream commit bffdf9d7e51a7be8eeaac2ccf9e54a5fde01ff65 ] The driver neglects to free the instance of I2C regmap constructed at the beginning of the edt_ft5x06_ts_probe() method when probe fails. Additionally edt_ft5x06_ts_remove() is freeing the regmap too early, before the rest of the device resources that are managed by devm are released. Fix this by installing a custom devm action that will ensure that the regmap is released at the right time during normal teardown as well as in case of probe failure. Note that devm_regmap_init_i2c() could not be used because the driver may replace the original regmap with a regmap specific for M06 devices in the middle of the probe, and using devm_regmap_init_i2c() would result in releasing the M06 regmap too early. Reported-by: Li Zetao <lizetao1@huawei.com> Fixes: 9dfd9708ffba ("Input: edt-ft5x06 - convert to use regmap API") Cc: stable@vger.kernel.org Reviewed-by: Oliver Graute <oliver.graute@kococonnector.com> Link: https://lore.kernel.org/r/ZxL6rIlVlgsAu-Jv@google.com Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-11-08riscv: vdso: Prevent the compiler from inserting calls to memset()Alexandre Ghiti1-0/+1
[ Upstream commit bf40167d54d55d4b54d0103713d86a8638fb9290 ] The compiler is smart enough to insert a call to memset() in riscv_vdso_get_cpus(), which generates a dynamic relocation. So prevent this by using -fno-builtin option. Fixes: e2c0cdfba7f6 ("RISC-V: User-facing API") Cc: stable@vger.kernel.org Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com> Reviewed-by: Guo Ren <guoren@kernel.org> Link: https://lore.kernel.org/r/20241016083625.136311-2-alexghiti@rivosinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-11-08spi: spi-fsl-dspi: Fix crash when not using GPIO chip selectFrank Li1-1/+5
[ Upstream commit 25f00a13dccf8e45441265768de46c8bf58e08f6 ] Add check for the return value of spi_get_csgpiod() to avoid passing a NULL pointer to gpiod_direction_output(), preventing a crash when GPIO chip select is not used. Fix below crash: [ 4.251960] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000 [ 4.260762] Mem abort info: [ 4.263556] ESR = 0x0000000096000004 [ 4.267308] EC = 0x25: DABT (current EL), IL = 32 bits [ 4.272624] SET = 0, FnV = 0 [ 4.275681] EA = 0, S1PTW = 0 [ 4.278822] FSC = 0x04: level 0 translation fault [ 4.283704] Data abort info: [ 4.286583] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000 [ 4.292074] CM = 0, WnR = 0, TnD = 0, TagAccess = 0 [ 4.297130] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [ 4.302445] [0000000000000000] user address but active_mm is swapper [ 4.308805] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP [ 4.315072] Modules linked in: [ 4.318124] CPU: 2 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.12.0-rc4-next-20241023-00008-ga20ec42c5fc1 #359 [ 4.328130] Hardware name: LS1046A QDS Board (DT) [ 4.332832] pstate: 40000005 (nZcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 4.339794] pc : gpiod_direction_output+0x34/0x5c [ 4.344505] lr : gpiod_direction_output+0x18/0x5c [ 4.349208] sp : ffff80008003b8f0 [ 4.352517] x29: ffff80008003b8f0 x28: 0000000000000000 x27: ffffc96bcc7e9068 [ 4.359659] x26: ffffc96bcc6e00b0 x25: ffffc96bcc598398 x24: ffff447400132810 [ 4.366800] x23: 0000000000000000 x22: 0000000011e1a300 x21: 0000000000020002 [ 4.373940] x20: 0000000000000000 x19: 0000000000000000 x18: ffffffffffffffff [ 4.381081] x17: ffff44740016e600 x16: 0000000500000003 x15: 0000000000000007 [ 4.388221] x14: 0000000000989680 x13: 0000000000020000 x12: 000000000000001e [ 4.395362] x11: 0044b82fa09b5a53 x10: 0000000000000019 x9 : 0000000000000008 [ 4.402502] x8 : 0000000000000002 x7 : 0000000000000007 x6 : 0000000000000000 [ 4.409641] x5 : 0000000000000200 x4 : 0000000002000000 x3 : 0000000000000000 [ 4.416781] x2 : 0000000000022202 x1 : 0000000000000000 x0 : 0000000000000000 [ 4.423921] Call trace: [ 4.426362] gpiod_direction_output+0x34/0x5c (P) [ 4.431067] gpiod_direction_output+0x18/0x5c (L) [ 4.435771] dspi_setup+0x220/0x334 Fixes: 9e264f3f85a5 ("spi: Replace all spi->chip_select and spi->cs_gpiod references with function call") Cc: stable@vger.kernel.org Signed-off-by: Frank Li <Frank.Li@nxp.com> Link: https://patch.msgid.link/20241023203032.1388491-1-Frank.Li@nxp.com Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-11-08btrfs: fix error propagation of split biosNaohiro Aota2-24/+16
[ Upstream commit d48e1dea3931de64c26717adc2b89743c7ab6594 ] The purpose of btrfs_bbio_propagate_error() shall be propagating an error of split bio to its original btrfs_bio, and tell the error to the upper layer. However, it's not working well on some cases. * Case 1. Immediate (or quick) end_bio with an error When btrfs sends btrfs_bio to mirrored devices, btrfs calls btrfs_bio_end_io() when all the mirroring bios are completed. If that btrfs_bio was split, it is from btrfs_clone_bioset and its end_io function is btrfs_orig_write_end_io. For this case, btrfs_bbio_propagate_error() accesses the orig_bbio's bio context to increase the error count. That works well in most cases. However, if the end_io is called enough fast, orig_bbio's (remaining part after split) bio context may not be properly set at that time. Since the bio context is set when the orig_bbio (the last btrfs_bio) is sent to devices, that might be too late for earlier split btrfs_bio's completion. That will result in NULL pointer dereference. That bug is easily reproducible by running btrfs/146 on zoned devices [1] and it shows the following trace. [1] You need raid-stripe-tree feature as it create "-d raid0 -m raid1" FS. BUG: kernel NULL pointer dereference, address: 0000000000000020 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: Oops: 0000 [#1] PREEMPT SMP PTI CPU: 1 UID: 0 PID: 13 Comm: kworker/u32:1 Not tainted 6.11.0-rc7-BTRFS-ZNS+ #474 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 Workqueue: writeback wb_workfn (flush-btrfs-5) RIP: 0010:btrfs_bio_end_io+0xae/0xc0 [btrfs] BTRFS error (device dm-0): bdev /dev/mapper/error-test errs: wr 2, rd 0, flush 0, corrupt 0, gen 0 RSP: 0018:ffffc9000006f248 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff888005a7f080 RCX: ffffc9000006f1dc RDX: 0000000000000000 RSI: 000000000000000a RDI: ffff888005a7f080 RBP: ffff888011dfc540 R08: 0000000000000000 R09: 0000000000000001 R10: ffffffff82e508e0 R11: 0000000000000005 R12: ffff88800ddfbe58 R13: ffff888005a7f080 R14: ffff888005a7f158 R15: ffff888005a7f158 FS: 0000000000000000(0000) GS:ffff88803ea80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000020 CR3: 0000000002e22006 CR4: 0000000000370ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ? __die_body.cold+0x19/0x26 ? page_fault_oops+0x13e/0x2b0 ? _printk+0x58/0x73 ? do_user_addr_fault+0x5f/0x750 ? exc_page_fault+0x76/0x240 ? asm_exc_page_fault+0x22/0x30 ? btrfs_bio_end_io+0xae/0xc0 [btrfs] ? btrfs_log_dev_io_error+0x7f/0x90 [btrfs] btrfs_orig_write_end_io+0x51/0x90 [btrfs] dm_submit_bio+0x5c2/0xa50 [dm_mod] ? find_held_lock+0x2b/0x80 ? blk_try_enter_queue+0x90/0x1e0 __submit_bio+0xe0/0x130 ? ktime_get+0x10a/0x160 ? lockdep_hardirqs_on+0x74/0x100 submit_bio_noacct_nocheck+0x199/0x410 btrfs_submit_bio+0x7d/0x150 [btrfs] btrfs_submit_chunk+0x1a1/0x6d0 [btrfs] ? lockdep_hardirqs_on+0x74/0x100 ? __folio_start_writeback+0x10/0x2c0 btrfs_submit_bbio+0x1c/0x40 [btrfs] submit_one_bio+0x44/0x60 [btrfs] submit_extent_folio+0x13f/0x330 [btrfs] ? btrfs_set_range_writeback+0xa3/0xd0 [btrfs] extent_writepage_io+0x18b/0x360 [btrfs] extent_write_locked_range+0x17c/0x340 [btrfs] ? __pfx_end_bbio_data_write+0x10/0x10 [btrfs] run_delalloc_cow+0x71/0xd0 [btrfs] btrfs_run_delalloc_range+0x176/0x500 [btrfs] ? find_lock_delalloc_range+0x119/0x260 [btrfs] writepage_delalloc+0x2ab/0x480 [btrfs] extent_write_cache_pages+0x236/0x7d0 [btrfs] btrfs_writepages+0x72/0x130 [btrfs] do_writepages+0xd4/0x240 ? find_held_lock+0x2b/0x80 ? wbc_attach_and_unlock_inode+0x12c/0x290 ? wbc_attach_and_unlock_inode+0x12c/0x290 __writeback_single_inode+0x5c/0x4c0 ? do_raw_spin_unlock+0x49/0xb0 writeback_sb_inodes+0x22c/0x560 __writeback_inodes_wb+0x4c/0xe0 wb_writeback+0x1d6/0x3f0 wb_workfn+0x334/0x520 process_one_work+0x1ee/0x570 ? lock_is_held_type+0xc6/0x130 worker_thread+0x1d1/0x3b0 ? __pfx_worker_thread+0x10/0x10 kthread+0xee/0x120 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x30/0x50 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> Modules linked in: dm_mod btrfs blake2b_generic xor raid6_pq rapl CR2: 0000000000000020 * Case 2. Earlier completion of orig_bbio for mirrored btrfs_bios btrfs_bbio_propagate_error() assumes the end_io function for orig_bbio is called last among split bios. In that case, btrfs_orig_write_end_io() sets the bio->bi_status to BLK_STS_IOERR by seeing the bioc->error [2]. Otherwise, the increased orig_bio's bioc->error is not checked by anyone and return BLK_STS_OK to the upper layer. [2] Actually, this is not true. Because we only increases orig_bioc->errors by max_errors, the condition "atomic_read(&bioc->error) > bioc->max_errors" is still not met if only one split btrfs_bio fails. * Case 3. Later completion of orig_bbio for un-mirrored btrfs_bios In contrast to the above case, btrfs_bbio_propagate_error() is not working well if un-mirrored orig_bbio is completed last. It sets orig_bbio->bio.bi_status to the btrfs_bio's error. But, that is easily over-written by orig_bbio's completion status. If the status is BLK_STS_OK, the upper layer would not know the failure. * Solution Considering the above cases, we can only save the error status in the orig_bbio (remaining part after split) itself as it is always available. Also, the saved error status should be propagated when all the split btrfs_bios are finished (i.e, bbio->pending_ios == 0). This commit introduces "status" to btrfs_bbio and saves the first error of split bios to original btrfs_bio's "status" variable. When all the split bios are finished, the saved status is loaded into original btrfs_bio's status. With this commit, btrfs/146 on zoned devices does not hit the NULL pointer dereference anymore. Fixes: 852eee62d31a ("btrfs: allow btrfs_submit_bio to split bios") CC: stable@vger.kernel.org # 6.6+ Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-11-08btrfs: merge btrfs_orig_bbio_end_io() into btrfs_bio_end_io()Qu Wenruo1-18/+11
[ Upstream commit 9ca0e58cb752b09816f56f7a3147a39773d5e831 ] There are only two differences between the two functions: - btrfs_orig_bbio_end_io() does extra error propagation This is mostly to allow tolerance for write errors. - btrfs_orig_bbio_end_io() does extra pending_ios check This check can handle both the original bio, or the cloned one. (All accounting happens in the original one). This makes btrfs_orig_bbio_end_io() a much safer call. In fact we already had a double freeing error due to usage of btrfs_bio_end_io() in the error path of btrfs_submit_chunk(). So just move the whole content of btrfs_orig_bbio_end_io() into btrfs_bio_end_io(). For normal paths this brings no change, because they are already calling btrfs_orig_bbio_end_io() in the first place. For error paths (not only inside bio.c but also external callers), this change will introduce extra checks, especially for external callers, as they will error out without submitting the btrfs bio. But considering it's already in the error path, such slower but much safer checks are still an overall win. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: d48e1dea3931 ("btrfs: fix error propagation of split bios") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-11-08phy: freescale: imx8m-pcie: Do CMN_RST just before PHY PLL lock checkRichard Zhu1-5/+5
[ Upstream commit f89263b69731e0144d275fff777ee0dd92069200 ] When enable initcall_debug together with higher debug level below. CONFIG_CONSOLE_LOGLEVEL_DEFAULT=9 CONFIG_CONSOLE_LOGLEVEL_QUIET=9 CONFIG_MESSAGE_LOGLEVEL_DEFAULT=7 The initialization of i.MX8MP PCIe PHY might be timeout failed randomly. To fix this issue, adjust the sequence of the resets refer to the power up sequence listed below. i.MX8MP PCIe PHY power up sequence: /--------------------------------------------- 1.8v supply ---------/ /--------------------------------------------------- 0.8v supply ---/ ---\ /-------------------------------------------------- X REFCLK Valid Reference Clock ---/ \-------------------------------------------------- ------------------------------------------- | i_init_restn -------------- ------------------------------------ | i_cmn_rstn --------------------- ------------------------------- | o_pll_lock_done -------------------------- Logs: imx6q-pcie 33800000.pcie: host bridge /soc@0/pcie@33800000 ranges: imx6q-pcie 33800000.pcie: IO 0x001ff80000..0x001ff8ffff -> 0x0000000000 imx6q-pcie 33800000.pcie: MEM 0x0018000000..0x001fefffff -> 0x0018000000 probe of clk_imx8mp_audiomix.reset.0 returned 0 after 1052 usecs probe of 30e20000.clock-controller returned 0 after 32971 usecs phy phy-32f00000.pcie-phy.4: phy poweron failed --> -110 probe of 30e10000.dma-controller returned 0 after 10235 usecs imx6q-pcie 33800000.pcie: waiting for PHY ready timeout! dwhdmi-imx 32fd8000.hdmi: Detected HDMI TX controller v2.13a with HDCP (samsung_dw_hdmi_phy2) imx6q-pcie 33800000.pcie: probe with driver imx6q-pcie failed with error -110 Fixes: dce9edff16ee ("phy: freescale: imx8m-pcie: Add i.MX8MP PCIe PHY support") Cc: stable@vger.kernel.org Signed-off-by: Richard Zhu <hongxing.zhu@nxp.com> Signed-off-by: Frank Li <Frank.Li@nxp.com> v2 changes: - Rebase to latest fixes branch of linux-phy git repo. - Richard's environment have problem and can't sent out patch. So I help post this fix patch. Link: https://lore.kernel.org/r/20241021155241.943665-1-Frank.Li@nxp.com Signed-off-by: Vinod Koul <vkoul@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-11-08cgroup/bpf: use a dedicated workqueue for cgroup bpf destructionChen Ridong1-1/+18
[ Upstream commit 117932eea99b729ee5d12783601a4f7f5fd58a23 ] A hung_task problem shown below was found: INFO: task kworker/0:0:8 blocked for more than 327 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Workqueue: events cgroup_bpf_release Call Trace: <TASK> __schedule+0x5a2/0x2050 ? find_held_lock+0x33/0x100 ? wq_worker_sleeping+0x9e/0xe0 schedule+0x9f/0x180 schedule_preempt_disabled+0x25/0x50 __mutex_lock+0x512/0x740 ? cgroup_bpf_release+0x1e/0x4d0 ? cgroup_bpf_release+0xcf/0x4d0 ? process_scheduled_works+0x161/0x8a0 ? cgroup_bpf_release+0x1e/0x4d0 ? mutex_lock_nested+0x2b/0x40 ? __pfx_delay_tsc+0x10/0x10 mutex_lock_nested+0x2b/0x40 cgroup_bpf_release+0xcf/0x4d0 ? process_scheduled_works+0x161/0x8a0 ? trace_event_raw_event_workqueue_execute_start+0x64/0xd0 ? process_scheduled_works+0x161/0x8a0 process_scheduled_works+0x23a/0x8a0 worker_thread+0x231/0x5b0 ? __pfx_worker_thread+0x10/0x10 kthread+0x14d/0x1c0 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x59/0x70 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1b/0x30 </TASK> This issue can be reproduced by the following pressuse test: 1. A large number of cpuset cgroups are deleted. 2. Set cpu on and off repeatly. 3. Set watchdog_thresh repeatly. The scripts can be obtained at LINK mentioned above the signature. The reason for this issue is cgroup_mutex and cpu_hotplug_lock are acquired in different tasks, which may lead to deadlock. It can lead to a deadlock through the following steps: 1. A large number of cpusets are deleted asynchronously, which puts a large number of cgroup_bpf_release works into system_wq. The max_active of system_wq is WQ_DFL_ACTIVE(256). Consequently, all active works are cgroup_bpf_release works, and many cgroup_bpf_release works will be put into inactive queue. As illustrated in the diagram, there are 256 (in the acvtive queue) + n (in the inactive queue) works. 2. Setting watchdog_thresh will hold cpu_hotplug_lock.read and put smp_call_on_cpu work into system_wq. However step 1 has already filled system_wq, 'sscs.work' is put into inactive queue. 'sscs.work' has to wait until the works that were put into the inacvtive queue earlier have executed (n cgroup_bpf_release), so it will be blocked for a while. 3. Cpu offline requires cpu_hotplug_lock.write, which is blocked by step 2. 4. Cpusets that were deleted at step 1 put cgroup_release works into cgroup_destroy_wq. They are competing to get cgroup_mutex all the time. When cgroup_metux is acqured by work at css_killed_work_fn, it will call cpuset_css_offline, which needs to acqure cpu_hotplug_lock.read. However, cpuset_css_offline will be blocked for step 3. 5. At this moment, there are 256 works in active queue that are cgroup_bpf_release, they are attempting to acquire cgroup_mutex, and as a result, all of them are blocked. Consequently, sscs.work can not be executed. Ultimately, this situation leads to four processes being blocked, forming a deadlock. system_wq(step1) WatchDog(step2) cpu offline(step3) cgroup_destroy_wq(step4) ... 2000+ cgroups deleted asyn 256 actives + n inactives __lockup_detector_reconfigure P(cpu_hotplug_lock.read) put sscs.work into system_wq 256 + n + 1(sscs.work) sscs.work wait to be executed warting sscs.work finish percpu_down_write P(cpu_hotplug_lock.write) ...blocking... css_killed_work_fn P(cgroup_mutex) cpuset_css_offline P(cpu_hotplug_lock.read) ...blocking... 256 cgroup_bpf_release mutex_lock(&cgroup_mutex); ..blocking... To fix the problem, place cgroup_bpf_release works on a dedicated workqueue which can break the loop and solve the problem. System wqs are for misc things which shouldn't create a large number of concurrent work items. If something is going to generate >WQ_DFL_ACTIVE(256) concurrent work items, it should use its own dedicated workqueue. Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself") Cc: stable@vger.kernel.org # v5.3+ Link: https://lore.kernel.org/cgroups/e90c32d2-2a85-4f28-9154-09c7d320cb60@huawei.com/T/#t Tested-by: Vishal Chourasia <vishalc@linux.ibm.com> Signed-off-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-11-08block: fix sanity checks in blk_rq_map_user_bvecXinyu Zhang1-3/+1
[ Upstream commit 2ff949441802a8d076d9013c7761f63e8ae5a9bd ] blk_rq_map_user_bvec contains a check bytes + bv->bv_len > nr_iter which causes unnecessary failures in NVMe passthrough I/O, reproducible as follows: - register a 2 page, page-aligned buffer against a ring - use that buffer to do a 1 page io_uring NVMe passthrough read The second (i = 1) iteration of the loop in blk_rq_map_user_bvec will then have nr_iter == 1 page, bytes == 1 page, bv->bv_len == 1 page, so the check bytes + bv->bv_len > nr_iter will succeed, causing the I/O to fail. This failure is unnecessary, as when the check succeeds, it means we've checked the entire buffer that will be used by the request - i.e. blk_rq_map_user_bvec should complete successfully. Therefore, terminate the loop early and return successfully when the check bytes + bv->bv_len > nr_iter succeeds. While we're at it, also remove the check that all segments in the bvec are single-page. While this seems to be true for all users of the function, it doesn't appear to be required anywhere downstream. CC: stable@vger.kernel.org Signed-off-by: Xinyu Zhang <xizhang@purestorage.com> Co-developed-by: Uday Shankar <ushankar@purestorage.com> Signed-off-by: Uday Shankar <ushankar@purestorage.com> Fixes: 37987547932c ("block: extend functionality to map bvec iterator") Link: https://lore.kernel.org/r/20241023211519.4177873-1-ushankar@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-11-08mmc: sdhci-pci-gli: GL9767: Fix low power mode in the SD Express processBen Chuang1-0/+3
commit c4dedaaeb3f78d3718e9c1b1e4d972a6b99073cd upstream. When starting the SD Express process, the low power negotiation mode will be disabled, so we need to re-enable it after switching back to SD mode. Fixes: 0e92aec2efa0 ("mmc: sdhci-pci-gli: Add support SD Express card for GL9767") Signed-off-by: Ben Chuang <ben.chuang@genesyslogic.com.tw> Cc: stable@vger.kernel.org Message-ID: <20241025060017.1663697-2-benchuanggli@gmail.com> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-11-08mmc: sdhci-pci-gli: GL9767: Fix low power mode on the set clock functionBen Chuang1-14/+21
commit 8c68b5656e55e9324875881f1000eb4ee3603a87 upstream. On sdhci_gl9767_set_clock(), the vendor header space(VHS) is read-only after calling gl9767_disable_ssc_pll() and gl9767_set_ssc_pll_205mhz(). So the low power negotiation mode cannot be enabled again. Introduce gl9767_set_low_power_negotiation() function to fix it. The explanation process is as below. static void sdhci_gl9767_set_clock() { ... gl9767_vhs_write(); ... value |= PCIE_GLI_9767_CFG_LOW_PWR_OFF; pci_write_config_dword(pdev, PCIE_GLI_9767_CFG, value); <--- (a) gl9767_disable_ssc_pll(); <--- (b) sdhci_writew(host, 0, SDHCI_CLOCK_CONTROL); if (clock == 0) return; <-- (I) ... if (clock == 200000000 && ios->timing == MMC_TIMING_UHS_SDR104) { ... gl9767_set_ssc_pll_205mhz(); <--- (c) } ... value &= ~PCIE_GLI_9767_CFG_LOW_PWR_OFF; pci_write_config_dword(pdev, PCIE_GLI_9767_CFG, value); <-- (II) gl9767_vhs_read(); } (a) disable low power negotiation mode. When return on (I), the low power mode is disab