summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
5 daysmm/kmemleak: avoid deadlock by moving pr_warn() outside kmemleak_lockBreno Leitao1-1/+4
commit 47b0f6d8f0d2be4d311a49e13d2fd5f152f492b2 upstream. When netpoll is enabled, calling pr_warn_once() while holding kmemleak_lock in mem_pool_alloc() can cause a deadlock due to lock inversion with the netconsole subsystem. This occurs because pr_warn_once() may trigger netpoll, which eventually leads to __alloc_skb() and back into kmemleak code, attempting to reacquire kmemleak_lock. This is the path for the deadlock. mem_pool_alloc() -> raw_spin_lock_irqsave(&kmemleak_lock, flags); -> pr_warn_once() -> netconsole subsystem -> netpoll -> __alloc_skb -> __create_object -> raw_spin_lock_irqsave(&kmemleak_lock, flags); Fix this by setting a flag and issuing the pr_warn_once() after kmemleak_lock is released. Link: https://lkml.kernel.org/r/20250731-kmemleak_lock-v1-1-728fd470198f@debian.org Fixes: c5665868183f ("mm: kmemleak: use the memory pool for early allocations") Signed-off-by: Breno Leitao <leitao@debian.org> Reported-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 daysmm/kmemleak: avoid soft lockup in __kmemleak_do_cleanup()Waiman Long1-0/+5
commit d1534ae23c2b6be350c8ab060803fbf6e9682adc upstream. A soft lockup warning was observed on a relative small system x86-64 system with 16 GB of memory when running a debug kernel with kmemleak enabled. watchdog: BUG: soft lockup - CPU#8 stuck for 33s! [kworker/8:1:134] The test system was running a workload with hot unplug happening in parallel. Then kemleak decided to disable itself due to its inability to allocate more kmemleak objects. The debug kernel has its CONFIG_DEBUG_KMEMLEAK_MEM_POOL_SIZE set to 40,000. The soft lockup happened in kmemleak_do_cleanup() when the existing kmemleak objects were being removed and deleted one-by-one in a loop via a workqueue. In this particular case, there are at least 40,000 objects that need to be processed and given the slowness of a debug kernel and the fact that a raw_spinlock has to be acquired and released in __delete_object(), it could take a while to properly handle all these objects. As kmemleak has been disabled in this case, the object removal and deletion process can be further optimized as locking isn't really needed. However, it is probably not worth the effort to optimize for such an edge case that should rarely happen. So the simple solution is to call cond_resched() at periodic interval in the iteration loop to avoid soft lockup. Link: https://lkml.kernel.org/r/20250728190248.605750-1-longman@redhat.com Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 daysmm/shmem, swap: improve cached mTHP handling and fix potential hangKairui Song1-9/+30
commit 5c241ed8d031693dadf33dd98ed2e7cc363e9b66 upstream. The current swap-in code assumes that, when a swap entry in shmem mapping is order 0, its cached folios (if present) must be order 0 too, which turns out not always correct. The problem is shmem_split_large_entry is called before verifying the folio will eventually be swapped in, one possible race is: CPU1 CPU2 shmem_swapin_folio /* swap in of order > 0 swap entry S1 */ folio = swap_cache_get_folio /* folio = NULL */ order = xa_get_order /* order > 0 */ folio = shmem_swap_alloc_folio /* mTHP alloc failure, folio = NULL */ <... Interrupted ...> shmem_swapin_folio /* S1 is swapped in */ shmem_writeout /* S1 is swapped out, folio cached */ shmem_split_large_entry(..., S1) /* S1 is split, but the folio covering it has order > 0 now */ Now any following swapin of S1 will hang: `xa_get_order` returns 0, and folio lookup will return a folio with order > 0. The `xa_get_order(&mapping->i_pages, index) != folio_order(folio)` will always return false causing swap-in to return -EEXIST. And this looks fragile. So fix this up by allowing seeing a larger folio in swap cache, and check the whole shmem mapping range covered by the swapin have the right swap value upon inserting the folio. And drop the redundant tree walks before the insertion. This will actually improve performance, as it avoids two redundant Xarray tree walks in the hot path, and the only side effect is that in the failure path, shmem may redundantly reallocate a few folios causing temporary slight memory pressure. And worth noting, it may seems the order and value check before inserting might help reducing the lock contention, which is not true. The swap cache layer ensures raced swapin will either see a swap cache folio or failed to do a swapin (we have SWAP_HAS_CACHE bit even if swap cache is bypassed), so holding the folio lock and checking the folio flag is already good enough for avoiding the lock contention. The chance that a folio passes the swap entry value check but the shmem mapping slot has changed should be very low. Link: https://lkml.kernel.org/r/20250728075306.12704-1-ryncsn@gmail.com Link: https://lkml.kernel.org/r/20250728075306.12704-2-ryncsn@gmail.com Fixes: 809bc86517cc ("mm: shmem: support large folio swap out") Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Dev Jain <dev.jain@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 daysmm/ptdump: take the memory hotplug lock inside ptdump_walk_pgd()Anshuman Khandual1-0/+2
commit 59305202c67fea50378dcad0cc199dbc13a0e99a upstream. Memory hot remove unmaps and tears down various kernel page table regions as required. The ptdump code can race with concurrent modifications of the kernel page tables. When leaf entries are modified concurrently, the dump code may log stale or inconsistent information for a VA range, but this is otherwise not harmful. But when intermediate levels of kernel page table are freed, the dump code will continue to use memory that has been freed and potentially reallocated for another purpose. In such cases, the ptdump code may dereference bogus addresses, leading to a number of potential problems. To avoid the above mentioned race condition, platforms such as arm64, riscv and s390 take memory hotplug lock, while dumping kernel page table via the sysfs interface /sys/kernel/debug/kernel_page_tables. Similar race condition exists while checking for pages that might have been marked W+X via /sys/kernel/debug/kernel_page_tables/check_wx_pages which in turn calls ptdump_check_wx(). Instead of solving this race condition again, let's just move the memory hotplug lock inside generic ptdump_check_wx() which will benefit both the scenarios. Drop get_online_mems() and put_online_mems() combination from all existing platform ptdump code paths. Link: https://lkml.kernel.org/r/20250620052427.2092093-1-anshuman.khandual@arm.com Fixes: bbd6ec605c0f ("arm64/mm: Enable memory hot remove") Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> [s390] Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 daysmm/huge_memory: don't ignore queried cachemode in vmf_insert_pfn_pud()David Hildenbrand1-4/+3
commit 09fefdca80aebd1023e827cb0ee174983d829d18 upstream. Patch series "mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes", v3. While working on improving vm_normal_page() and friends, I stumbled over this issues: refcounted "normal" folios must not be marked using pmd_special() / pud_special(). Otherwise, we're effectively telling the system that these folios are no "normal", violating the rules we documented for vm_normal_page(). Fortunately, there are not many pmd_special()/pud_special() users yet. So far there doesn't seem to be serious damage. Tested using the ndctl tests ("ndctl:dax" suite). This patch (of 3): We set up the cache mode but ... don't forward the updated pgprot to insert_pfn_pud(). Only a problem on x86-64 PAT when mapping PFNs using PUDs that require a special cachemode. Fix it by using the proper pgprot where the cachemode was setup. It is unclear in which configurations we would get the cachemode wrong: through vfio seems possible. Getting cachemodes wrong is usually ... bad. As the fix is easy, let's backport it to stable. Identified by code inspection. Link: https://lkml.kernel.org/r/20250613092702.1943533-1-david@redhat.com Link: https://lkml.kernel.org/r/20250613092702.1943533-2-david@redhat.com Fixes: 7b806d229ef1 ("mm: remove vmf_insert_pfn_xxx_prot() for huge page-table entries") Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Tested-by: Dan Williams <dan.j.williams@intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 daysmm, slab: restore NUMA policy support for large kmallocVlastimil Babka1-1/+6
commit e2d18cbf178775ad377ad88ee55e6e183c38d262 upstream. The slab allocator observes the task's NUMA policy in various places such as allocating slab pages. Large kmalloc() allocations used to do that too, until an unintended change by c4cab557521a ("mm/slab_common: cleanup kmalloc_large()") resulted in ignoring mempolicy and just preferring the local node. Restore the NUMA policy support. Fixes: c4cab557521a ("mm/slab_common: cleanup kmalloc_large()") Cc: <stable@vger.kernel.org> Acked-by: Christoph Lameter (Ampere) <cl@gentwo.org> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 daysuserfaultfd: fix a crash in UFFDIO_MOVE when PMD is a migration entrySuren Baghdasaryan1-7/+10
commit aba6faec0103ed8f169be8dce2ead41fcb689446 upstream. When UFFDIO_MOVE encounters a migration PMD entry, it proceeds with obtaining a folio and accessing it even though the entry is swp_entry_t. Add the missing check and let split_huge_pmd() handle migration entries. While at it also remove unnecessary folio check. [surenb@google.com: remove extra folio check, per David] Link: https://lkml.kernel.org/r/20250807200418.1963585-1-surenb@google.com Link: https://lkml.kernel.org/r/20250806220022.926763-1-surenb@google.com Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI") Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reported-by: syzbot+b446dbe27035ef6bd6c2@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/68794b5c.a70a0220.693ce.0050.GAE@google.com/ Reviewed-by: Peter Xu <peterx@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5 daysmm/damon/core: commit damos->target_nidBijan Tabatabai1-0/+1
commit 579bd5006fe7f4a7abb32da0160d376476cab67d upstream. When committing new scheme parameters from the sysfs, the target_nid field of the damos struct would not be copied. This would result in the target_nid field to retain its original value, despite being updated in the sysfs interface. This patch fixes this issue by copying target_nid in damos_commit(). Link: https://lkml.kernel.org/r/20250709004729.17252-1-bijan311@gmail.com Fixes: 83dc7bbaecae ("mm/damon/sysfs: use damon_commit_ctx()") Signed-off-by: Bijan Tabatabai <bijantabatab@micron.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ravi Shankar Jonnalagadda <ravis.opensrc@micron.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 daysmm: fix a UAF when vma->mm is freed after vma->vm_refcnt got droppedSuren Baghdasaryan1-2/+1
commit 9bbffee67ffd16360179327b57f3b1245579ef08 upstream. By inducing delays in the right places, Jann Horn created a reproducer for a hard to hit UAF issue that became possible after VMAs were allowed to be recycled by adding SLAB_TYPESAFE_BY_RCU to their cache. Race description is borrowed from Jann's discovery report: lock_vma_under_rcu() looks up a VMA locklessly with mas_walk() under rcu_read_lock(). At that point, the VMA may be concurrently freed, and it can be recycled by another process. vma_start_read() then increments the vma->vm_refcnt (if it is in an acceptable range), and if this succeeds, vma_start_read() can return a recycled VMA. In this scenario where the VMA has been recycled, lock_vma_under_rcu() will then detect the mismatching ->vm_mm pointer and drop the VMA through vma_end_read(), which calls vma_refcount_put(). vma_refcount_put() drops the refcount and then calls rcuwait_wake_up() using a copy of vma->vm_mm. This is wrong: It implicitly assumes that the caller is keeping the VMA's mm alive, but in this scenario the caller has no relation to the VMA's mm, so the rcuwait_wake_up() can cause UAF. The diagram depicting the race: T1 T2 T3 == == == lock_vma_under_rcu mas_walk <VMA gets removed from mm> mmap <the same VMA is reallocated> vma_start_read __refcount_inc_not_zero_limited_acquire munmap __vma_enter_locked refcount_add_not_zero vma_end_read vma_refcount_put __refcount_dec_and_test rcuwait_wait_event <finish operation> rcuwait_wake_up [UAF] Note that rcuwait_wait_event() in T3 does not block because refcount was already dropped by T1. At this point T3 can exit and free the mm causing UAF in T1. To avoid this we move vma->vm_mm verification into vma_start_read() and grab vma->vm_mm to stabilize it before vma_refcount_put() operation. [surenb@google.com: v3] Link: https://lkml.kernel.org/r/20250729145709.2731370-1-surenb@google.com Link: https://lkml.kernel.org/r/20250728175355.2282375-1-surenb@google.com Fixes: 3104138517fc ("mm: make vma cache SLAB_TYPESAFE_BY_RCU") Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reported-by: Jann Horn <jannh@google.com> Closes: https://lore.kernel.org/all/CAG48ez0-deFbVH=E3jbkWx=X3uVbd8nWeo6kbJPQ0KoUD+m2tA@mail.gmail.com/ Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
10 daysmm: shmem: fix the shmem large folio allocation for the i915 driverBaolin Wang1-2/+2
commit 8d58d65621118fdca3ed6a0b3d658ba7e0e5153c upstream. After commit acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs"), we extend the 'huge=' option to allow any sized large folios for tmpfs, which means tmpfs will allow getting a highest order hint based on the size of write() and fallocate() paths, and then will try each allowable large order. However, when the i915 driver allocates shmem memory, it doesn't provide hint information about the size of the large folio to be allocated, resulting in the inability to allocate PMD-sized shmem, which in turn affects GPU performance. Patryk added: : In my tests, the performance drop ranges from a few percent up to 13% : in Unigine Superposition under heavy memory usage on the CPU Core Ultra : 155H with the Xe 128 EU GPU. Other users have reported performance : impact up to 30% on certain workloads. Please find more in the : regressions reports: : https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/14645 : https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/13845 : : I believe the change should be backported to all active kernel branches : after version 6.12. To fix this issue, we can use the inode's size as a write size hint in shmem_read_folio_gfp() to help allocate PMD-sized large folios. Link: https://lkml.kernel.org/r/f7e64e99a3a87a8144cc6b2f1dddf7a89c12ce44.1753926601.git.baolin.wang@linux.alibaba.com Fixes: acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs") Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reported-by: Patryk Kowalczyk <patryk@kowalczyk.ws> Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Tested-by: Patryk Kowalczyk <patryk@kowalczyk.ws> Suggested-by: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 daysmm: swap: move nr_swap_pages counter decrement from folio_alloc_swap() to ↵Kemeng Shi1-1/+1
swap_range_alloc() commit 4f78252da887ee7e9d1875dd6e07d9baa936c04f upstream. Patch series "Some randome fixes and cleanups to swapfile". Patch 0-3 are some random fixes. Patch 4 is a cleanup. More details can be found in respective patches. This patch (of 4): When folio_alloc_swap() encounters a failure in either mem_cgroup_try_charge_swap() or add_to_swap_cache(), nr_swap_pages counter is not decremented for allocated entry. However, the following put_swap_folio() will increase nr_swap_pages counter unpairly and lead to an imbalance. Move nr_swap_pages decrement from folio_alloc_swap() to swap_range_alloc() to pair the nr_swap_pages counting. Link: https://lkml.kernel.org/r/20250522122554.12209-1-shikemeng@huaweicloud.com Link: https://lkml.kernel.org/r/20250522122554.12209-2-shikemeng@huaweicloud.com Fixes: 0ff67f990bd4 ("mm, swap: remove swap slot cache") Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 daysmm: swap: fix potential buffer overflow in setup_clusters()Kemeng Shi1-3/+7
commit 152c1339dc13ad46f1b136e8693de15980750835 upstream. In setup_swap_map(), we only ensure badpages are in range (0, last_page]. As maxpages might be < last_page, setup_clusters() will encounter a buffer overflow when a badpage is >= maxpages. Only call inc_cluster_info_page() for badpage which is < maxpages to fix the issue. Link: https://lkml.kernel.org/r/20250522122554.12209-4-shikemeng@huaweicloud.com Fixes: b843786b0bd0 ("mm: swapfile: fix SSD detection with swapfile on btrfs") Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 daysmm: swap: correctly use maxpages in swapon syscall to avoid potential deadloopKemeng Shi1-27/+26
commit 255116c5b0fa2145ede28c2f7b248df5e73834d1 upstream. We use maxpages from read_swap_header() to initialize swap_info_struct, however the maxpages might be reduced in setup_swap_extents() and the si->max is assigned with the reduced maxpages from the setup_swap_extents(). Obviously, this could lead to memory waste as we allocated memory based on larger maxpages, besides, this could lead to a potential deadloop as following: 1) When calling setup_clusters() with larger maxpages, unavailable pages within range [si->max, larger maxpages) are not accounted with inc_cluster_info_page(). As a result, these pages are assumed available but can not be allocated. The cluster contains these pages can be moved to frag_clusters list after it's all available pages were allocated. 2) When the cluster mentioned in 1) is the only cluster in frag_clusters list, cluster_alloc_swap_entry() assume order 0 allocation will never failed and will enter a deadloop by keep trying to allocate page from the only cluster in frag_clusters which contains no actually available page. Call setup_swap_extents() to get the final maxpages before swap_info_struct initialization to fix the issue. After this change, span will include badblocks and will become large value which I think is correct value: In summary, there are two kinds of swapfile_activate operations. 1. Filesystem style: Treat all blocks logical continuity and find usable physical extents in logical range. In this way, si->pages will be actual usable physical blocks and span will be "1 + highest_block - lowest_block". 2. Block device style: Treat all blocks physically continue and only one single extent is added. In this way, si->pages will be si->max and span will be "si->pages - 1". Actually, si->pages and si->max is only used in block device style and span value is set with si->pages. As a result, span value in block device style will become a larger value as you mentioned. I think larger value is correct based on: 1. Span value in filesystem style is "1 + highest_block - lowest_block" which is the range cover all possible phisical blocks including the badblocks. 2. For block device style, si->pages is the actual usable block number and is already in pr_info. The original span value before this patch is also refer to usable block number which is redundant in pr_info. [shikemeng@huaweicloud.com: ensure si->pages == si->max - 1 after setup_swap_extents()] Link: https://lkml.kernel.org/r/20250522122554.12209-3-shikemeng@huaweicloud.com Link: https://lkml.kernel.org/r/20250718065139.61989-1-shikemeng@huaweicloud.com Link: https://lkml.kernel.org/r/20250522122554.12209-3-shikemeng@huaweicloud.com Fixes: 661383c6111a ("mm: swap: relaim the cached parts that got scanned") Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 daysmm/hmm: move pmd_to_hmm_pfn_flags() to the respective #ifdefferyAndy Shevchenko1-1/+1
commit 188cb385bbf04d486df3e52f28c47b3961f5f0c0 upstream. When pmd_to_hmm_pfn_flags() is unused, it prevents kernel builds with clang, `make W=1` and CONFIG_TRANSPARENT_HUGEPAGE=n: mm/hmm.c:186:29: warning: unused function 'pmd_to_hmm_pfn_flags' [-Wunused-function] Fix this by moving the function to the respective existing ifdeffery for its the only user. See also: 6863f5643dd7 ("kbuild: allow Clang to find unused static inline functions for W=1 build") Link: https://lkml.kernel.org/r/20250710082403.664093-1-andriy.shevchenko@linux.intel.com Fixes: 992de9a8b751 ("mm/hmm: allow to mirror vma of a file on a DAX backed filesystem") Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Cc: Andriy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Bill Wendling <morbo@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Justin Stitt <justinstitt@google.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
10 daysslub: Fix a documentation build error for krealloc()Jonathan Corbet1-5/+5
[ Upstream commit e8a45f198e3ae2434108f815bc28f37f6fe6742b ] The kerneldoc comment for krealloc() contains an unmarked literal block, leading to these warnings in the docs build: ./mm/slub.c:4936: WARNING: Block quote ends without a blank line; unexpected unindent. [docutils] ./mm/slub.c:4936: ERROR: Undefined substitution referenced: "--------". [docutils] Mark up and indent the block properly to bring a bit of peace to our build logs. Fixes: 489a744e5fb1 (mm: krealloc: clarify valid usage of __GFP_ZERO) Signed-off-by: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://patch.msgid.link/20250611155916.2579160-6-willy@infradead.org Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-01mm/zsmalloc: do not pass __GFP_MOVABLE if CONFIG_COMPACTION=nHarry Yoo1-0/+3
commit 694d6b99923eb05a8fd188be44e26077d19f0e21 upstream. Commit 48b4800a1c6a ("zsmalloc: page migration support") added support for migrating zsmalloc pages using the movable_operations migration framework. However, the commit did not take into account that zsmalloc supports migration only when CONFIG_COMPACTION is enabled. Tracing shows that zsmalloc was still passing the __GFP_MOVABLE flag even when compaction is not supported. This can result in unmovable pages being allocated from movable page blocks (even without stealing page blocks), ZONE_MOVABLE and CMA area. Possible user visible effects: - Some ZONE_MOVABLE memory can be not actually movable - CMA allocation can fail because of this - Increased memory fragmentation due to ignoring the page mobility grouping feature I'm not really sure who uses kernels without compaction support, though :( To fix this, clear the __GFP_MOVABLE flag when !IS_ENABLED(CONFIG_COMPACTION). Link: https://lkml.kernel.org/r/20250704103053.6913-1-harry.yoo@oracle.com Fixes: 48b4800a1c6a ("zsmalloc: page migration support") Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-01mm/vmscan: fix hwpoisoned large folio handling in shrink_folio_listJinjiang Tu2-0/+12
commit 9f1e8cd0b7c4c944e9921b52a6661b5eda2705ab upstream. In shrink_folio_list(), the hwpoisoned folio may be large folio, which can't be handled by unmap_poisoned_folio(). For THP, try_to_unmap_one() must be passed with TTU_SPLIT_HUGE_PMD to split huge PMD first and then retry. Without TTU_SPLIT_HUGE_PMD, we will trigger null-ptr deref of pvmw.pte. Even we passed TTU_SPLIT_HUGE_PMD, we will trigger a WARN_ON_ONCE due to the page isn't in swapcache. Since UCE is rare in real world, and race with reclaimation is more rare, just skipping the hwpoisoned large folio is enough. memory_failure() will handle it if the UCE is triggered again. This happens when memory reclaim for large folio races with memory_failure(), and will lead to kernel panic. The race is as follows: cpu0 cpu1 shrink_folio_list memory_failure TestSetPageHWPoison unmap_poisoned_folio --> trigger BUG_ON due to unmap_poisoned_folio couldn't handle large folio [tujinjiang@huawei.com: add comment to unmap_poisoned_folio()] Link: https://lkml.kernel.org/r/69fd4e00-1b13-d5f7-1c82-705c7d977ea4@huawei.com Link: https://lkml.kernel.org/r/20250627125747.3094074-2-tujinjiang@huawei.com Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com> Fixes: 1b0449544c64 ("mm/vmscan: don't try to reclaim hwpoison folio") Reported-by: syzbot+3b220254df55d8ca8a61@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/68412d57.050a0220.2461cf.000e.GAE@google.com/ Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-01kasan: use vmalloc_dump_obj() for vmalloc error reportsMarco Elver1-1/+3
commit 6ade153349c6bb990d170cecc3e8bdd8628119ab upstream. Since 6ee9b3d84775 ("kasan: remove kasan_find_vm_area() to prevent possible deadlock"), more detailed info about the vmalloc mapping and the origin was dropped due to potential deadlocks. While fixing the deadlock is necessary, that patch was too quick in killing an otherwise useful feature, and did no due-diligence in understanding if an alternative option is available. Restore printing more helpful vmalloc allocation info in KASAN reports with the help of vmalloc_dump_obj(). Example report: | BUG: KASAN: vmalloc-out-of-bounds in vmalloc_oob+0x4c9/0x610 | Read of size 1 at addr ffffc900002fd7f3 by task kunit_try_catch/493 | | CPU: [...] | Call Trace: | <TASK> | dump_stack_lvl+0xa8/0xf0 | print_report+0x17e/0x810 | kasan_report+0x155/0x190 | vmalloc_oob+0x4c9/0x610 | [...] | | The buggy address belongs to a 1-page vmalloc region starting at 0xffffc900002fd000 allocated at vmalloc_oob+0x36/0x610 | The buggy address belongs to the physical page: | page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x126364 | flags: 0x200000000000000(node=0|zone=2) | raw: 0200000000000000 0000000000000000 dead000000000122 0000000000000000 | raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000 | page dumped because: kasan: bad access detected | | [..] Link: https://lkml.kernel.org/r/20250716152448.3877201-1-elver@google.com Fixes: 6ee9b3d84775 ("kasan: remove kasan_find_vm_area() to prevent possible deadlock") Signed-off-by: Marco Elver <elver@google.com> Suggested-by: Uladzislau Rezki <urezki@gmail.com> Acked-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Yeoreum Yun <yeoreum.yun@arm.com> Cc: Yunseong Kim <ysk@kzalloc.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-01mm/ksm: fix -Wsometimes-uninitialized from clang-21 in advisor_mode_show()Nathan Chancellor1-3/+3
commit 153ad566724fe6f57b14f66e9726d295d22e576d upstream. After a recent change in clang to expose uninitialized warnings from const variables [1], there is a false positive warning from the if statement in advisor_mode_show(). mm/ksm.c:3687:11: error: variable 'output' is used uninitialized whenever 'if' condition is false [-Werror,-Wsometimes-uninitialized] 3687 | else if (ksm_advisor == KSM_ADVISOR_SCAN_TIME) | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ mm/ksm.c:3690:33: note: uninitialized use occurs here 3690 | return sysfs_emit(buf, "%s\n", output); | ^~~~~~ Rewrite the if statement to implicitly make KSM_ADVISOR_NONE the else branch so that it is obvious to the compiler that ksm_advisor can only be KSM_ADVISOR_NONE or KSM_ADVISOR_SCAN_TIME due to the assignments in advisor_mode_store(). Link: https://lkml.kernel.org/r/20250715-ksm-fix-clang-21-uninit-warning-v1-1-f443feb4bfc4@kernel.org Fixes: 66790e9a735b ("mm/ksm: add sysfs knobs for advisor") Signed-off-by: Nathan Chancellor <nathan@kernel.org> Closes: https://github.com/ClangBuiltLinux/linux/issues/2100 Link: https://github.com/llvm/llvm-project/commit/2464313eef01c5b1edf0eccf57a32cdee01472c7 [1] Acked-by: David Hildenbrand <david@redhat.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Stefan Roesch <shr@devkernel.io> Cc: xu xin <xu.xin16@zte.com.cn> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-07-17mm/damon: fix divide by zero in damon_get_intervals_score()Honggyu Kim1-0/+1
commit bd225b9591442065beb876da72656f4a2d627d03 upstream. The current implementation allows having zero size regions with no special reasons, but damon_get_intervals_score() gets crashed by divide by zero when the region size is zero. [ 29.403950] Oops: divide error: 0000 [#1] SMP NOPTI This patch fixes the bug, but does not disallow zero size regions to keep the backward compatibility since disallowing zero size regions might be a breaking change for some users. In addition, the same crash can happen when intervals_goal.access_bp is zero so this should be fixed in stable trees as well. Link: https://lkml.kernel.org/r/20250702000205.1921-5-honggyu.kim@sk.com Fixes: f04b0fedbe71 ("mm/damon/core: implement intervals auto-tuning") Signed-off-by: Honggyu Kim <honggyu.kim@sk.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-07-17mm/damon/core: handle damon_call_control as normal under kdmond deactivationSeongJae Park1-4/+3
commit bb1b5929b4279b136816f95ce1e8f1fa689bf4a1 upstream. DAMON sysfs interface internally uses damon_call() to update DAMON parameters as users requested, online. However, DAMON core cancels any damon_call() requests when it is deactivated by DAMOS watermarks. As a result, users cannot change DAMON parameters online while DAMON is deactivated. Note that users can turn DAMON off and on with different watermarks to work around. Since deactivated DAMON is nearly same to stopped DAMON, the work around should have no big problem. Anyway, a bug is a bug. There is no real good reason to cancel the damon_call() request under DAMOS deactivation. Fix it by simply handling the request as normal, rather than cancelling under the situation. Link: https://lkml.kernel.org/r/20250629204914.54114-1-sj@kernel.org Fixes: 42b7491af14c ("mm/damon/core: introduce damon_call()") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> [6.14+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-07-17mm/rmap: fix potential out-of-bounds page table access during batched unmapLance Yang1-18/+28
commit ddd05742b45b083975a0855ef6ebbf88cf1f532a upstream. As pointed out by David[1], the batched unmap logic in try_to_unmap_one() may read past the end of a PTE table when a large folio's PTE mappings are not fully contained within a single page table. While this scenario might be rare, an issue triggerable from userspace must be fixed regardless of its likelihood. This patch fixes the out-of-bounds access by refactoring the logic into a new helper, folio_unmap_pte_batch(). The new helper correctly calculates the safe batch size by capping the scan at both the VMA and PMD boundaries. To simplify the code, it also supports partial batching (i.e., any number of pages from 1 up to the calculated safe maximum), as there is no strong reason to special-case for fully mapped folios. Link: https://lkml.kernel.org/r/20250701143100.6970-1-lance.yang@linux.dev Link: https://lkml.kernel.org/r/20250630011305.23754-1-lance.yang@linux.dev Link: https://lkml.kernel.org/r/20250627062319.84936-1-lance.yang@linux.dev Link: https://lore.kernel.org/linux-mm/a694398c-9f03-4737-81b9-7e49c857fcbe@redhat.com [1] Fixes: 354dffd29575 ("mm: support batched unmap for lazyfree large folios during reclamation") Signed-off-by: Lance Yang <lance.yang@linux.dev> Suggested-by: David Hildenbrand <david@redhat.com> Reported-by: David Hildenbrand <david@redhat.com> Closes: https://lore.kernel.org/linux-mm/a694398c-9f03-4737-81b9-7e49c857fcbe@redhat.com Suggested-by: Barry Song <baohua@kernel.org> Acked-by: Barry Song <baohua@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chris Li <chrisl@kernel.org> Cc: "Huang, Ying" <huang.ying.caritas@gmail.com> Cc: Kairui Song <kasong@tencent.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mingzhe Yang <mingzhe.yang@ly.com> Cc: Rik van Riel <riel@surriel.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Tangquan Zheng <zhengtangquan@oppo.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-07-17mm/vmalloc: leave lazy MMU mode on PTE mapping errorAlexander Gordeev1-7/+15
commit fea18c686320a53fce7ad62a87a3e1d10ad02f31 upstream. vmap_pages_pte_range() enters the lazy MMU mode, but fails to leave it in case an error is encountered. Link: https://lkml.kernel.org/r/20250623075721.2817094-1-agordeev@linux.ibm.com Fixes: 2ba3e6947aed ("mm/vmalloc: track which page-table levels were modified") Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/202506132017.T1l1l6ME-lkp@intel.com/ Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-07-17kasan: remove kasan_find_vm_area() to prevent possible deadlockYeoreum Yun1-43/+2
commit 6ee9b3d84775944fb8c8a447961cd01274ac671c upstream. find_vm_area() couldn't be called in atomic_context. If find_vm_area() is called to reports vm area information, kasan can trigger deadlock like: CPU0 CPU1 vmalloc(); alloc_vmap_area(); spin_lock(&vn->busy.lock) spin_lock_bh(&some_lock); <interrupt occurs> <in softirq> spin_lock(&some_lock); <access invalid address> kasan_report(); print_report(); print_address_description(); kasan_find_vm_area(); find_vm_area(); spin_lock(&vn->busy.lock) // deadlock! To prevent possible deadlock while kasan reports, remove kasan_find_vm_area(). Link: https://lkml.kernel.org/r/20250703181018.580833-1-yeoreum.yun@arm.com Fixes: c056a364e954 ("kasan: print virtual mapping info in reports") Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com> Reported-by: Yunseong Kim <ysk@kzalloc.com> Reviewed-by: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-07-10mm/vmalloc: fix data race in show_numa_info()Jeongjun Park1-28/+35
commit 5c5f0468d172ddec2e333d738d2a1f85402cf0bc upstream. The following data-race was found in show_numa_info(): ================================================================== BUG: KCSAN: data-race in vmalloc_info_show / vmalloc_info_show read to 0xffff88800971fe30 of 4 bytes by task 8289 on cpu 0: show_numa_info mm/vmalloc.c:4936 [inline] vmalloc_info_show+0x5a8/0x7e0 mm/vmalloc.c:5016 seq_read_iter+0x373/0xb40 fs/seq_file.c:230 proc_reg_read_iter+0x11e/0x170 fs/proc/inode.c:299 .... write to 0xffff88800971fe30 of 4 bytes by task 8287 on cpu 1: show_numa_info mm/vmalloc.c:4934 [inline] vmalloc_info_show+0x38f/0x7e0 mm/vmalloc.c:5016 seq_read_iter+0x373/0xb40 fs/seq_file.c:230 proc_reg_read_iter+0x11e/0x170 fs/proc/inode.c:299 .... value changed: 0x0000008f -> 0x00000000 ================================================================== According to this report,there is a read/write data-race because m->private is accessible to multiple CPUs. To fix this, instead of allocating the heap in proc_vmalloc_init() and passing the heap address to m->private, vmalloc_info_show() should allocate the heap. Link: https://lkml.kernel.org/r/20250508165620.15321-1-aha310510@gmail.com Fixes: 8e1d743f2c26 ("mm: vmalloc: support multiple nodes in vmallocinfo") Signed-off-by: Jeongjun Park <aha310510@gmail.com> Suggested-by: Eric Dumazet <edumazet@google.com> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-07-10fs: export anon_inode_make_secure_inode() and fix secretmem LSM bypassShivank Garg1-8/+1
[ Upstream commit cbe4134ea4bc493239786220bd69cb8a13493190 ] Export anon_inode_make_secure_inode() to allow KVM guest_memfd to create anonymous inodes with proper security context. This replaces the current pattern of calling alloc_anon_inode() followed by inode_init_security_anon() for creating security context manually. This change also fixes a security regression in secretmem where the S_PRIVATE flag was not cleared after alloc_anon_inode(), causing LSM/SELinux checks to be bypassed for secretmem file descriptors. As guest_memfd currently resides in the KVM module, we need to export this symbol for use outside the core kernel. In the future, guest_memfd might be moved to core-mm, at which point the symbols no longer would have to be exported. When/if that happens is still unclear. Fixes: 2bfe15c52612 ("mm: create security context for memfd_secret inodes") Suggested-by: David Hildenbrand <david@redhat.com> Suggested-by: Mike Rapoport <rppt@kernel.org> Signed-off-by: Shivank Garg <shivankg@amd.com> Link: https://lore.kernel.org/20250620070328.803704-3-shivankg@amd.com Acked-by: "Mike Rapoport (Microsoft)" <rppt@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-07-06mm/gup: revert "mm: gup: fix infinite loop within __get_longterm_locked"David Hildenbrand1-4/+10
commit 517f496e1e61bd169d585dab4dd77e7147506322 upstream. After commit 1aaf8c122918 ("mm: gup: fix infinite loop within __get_longterm_locked") we are able to longterm pin folios that are not supposed to get longterm pinned, simply because they temporarily have the LRU flag cleared (esp. temporarily isolated). For example, two __get_longterm_locked() callers can race, or __get_longterm_locked() can race with anything else that temporarily isolates folios. The introducing commit mentions the use case of a driver that uses vm_ops->fault to insert pages allocated through cma_alloc() into the page tables, assuming they can later get longterm pinned. These pages/ folios would never have the LRU flag set and consequently cannot get isolated. There is no known in-tree user making use of that so far, fortunately. To handle that in the future -- and avoid retrying forever to isolate/migrate them -- we will need a different mechanism for the CMA area *owner* to indicate that it actually already allocated the page and is fine with longterm pinning it. The LRU flag is not suitable for that. Probably we can lookup the relevant CMA area and query the bitmap; we only have have to care about some races, probably. If already allocated, we could just allow longterm pinning) Anyhow, let's fix the "must not be longterm pinned" problem first by reverting the original commit. Link: https://lkml.kernel.org/r/20250611131314.594529-1-david@redhat.com Fixes: 1aaf8c122918 ("mm: gup: fix infinite loop within __get_longterm_locked") Signed-off-by: David Hildenbrand <david@redhat.com> Closes: https://lore.kernel.org/all/20250522092755.GA3277597@tiffany/ Reported-by: Hyesoo Yu <hyesoo.yu@samsung.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Peter Xu <peterx@redhat.com> Cc: Zhaoyang Huang <zhaoyang.huang@unisoc.com> Cc: Aijun Sun <aijun.sun@unisoc.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-07-06mm/shmem, swap: fix softlockup with mTHP swapinKairui Song3-21/+28
commit a05dd8ae5cbb1cb45f349922cfea4f548a5e5d6f upstream. Following softlockup can be easily reproduced on my test machine with: echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled swapon /dev/zram0 # zram0 is a 48G swap device mkdir -p /sys/fs/cgroup/memory/test echo 1G > /sys/fs/cgroup/test/memory.max echo $BASHPID > /sys/fs/cgroup/test/cgroup.procs while true; do dd if=/dev/zero of=/tmp/test.img bs=1M count=5120 cat /tmp/test.img > /dev/null rm /tmp/test.img done Then after a while: watchdog: BUG: soft lockup - CPU#0 stuck for 763s! [cat:5787] Modules linked in: zram virtiofs CPU: 0 UID: 0 PID: 5787 Comm: cat Kdump: loaded Tainted: G L 6.15.0.orig-gf3021d9246bc-dirty #118 PREEMPT(voluntary)· Tainted: [L]=SOFTLOCKUP Hardware name: Red Hat KVM/RHEL-AV, BIOS 0.0.0 02/06/2015 RIP: 0010:mpol_shared_policy_lookup+0xd/0x70 Code: e9 b8 b4 ff ff 31 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 41 54 55 53 <48> 8b 1f 48 85 db 74 41 4c 8d 67 08 48 89 fb 48 89 f5 4c 89 e7 e8 RSP: 0018:ffffc90002b1fc28 EFLAGS: 00000202 RAX: 00000000001c20ca RBX: 0000000000724e1e RCX: 0000000000000001 RDX: ffff888118e214c8 RSI: 0000000000057d42 RDI: ffff888118e21518 RBP: 000000000002bec8 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000bf4 R11: 0000000000000000 R12: 0000000000000001 R13: 00000000001c20ca R14: 00000000001c20ca R15: 0000000000000000 FS: 00007f03f995c740(0000) GS:ffff88a07ad9a000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f03f98f1000 CR3: 0000000144626004 CR4: 0000000000770eb0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> shmem_alloc_folio+0x31/0xc0 shmem_swapin_folio+0x309/0xcf0 ? filemap_get_entry+0x117/0x1e0 ? xas_load+0xd/0xb0 ? filemap_get_entry+0x101/0x1e0 shmem_get_folio_gfp+0x2ed/0x5b0 shmem_file_read_iter+0x7f/0x2e0 vfs_read+0x252/0x330 ksys_read+0x68/0xf0 do_syscall_64+0x4c/0x1c0 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f03f9a46991 Code: 00 48 8b 15 81 14 10 00 f7 d8 64 89 02 b8 ff ff ff ff eb bd e8 20 ad 01 00 f3 0f 1e fa 80 3d 35 97 10 00 00 74 13 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 4f c3 66 0f 1f 44 00 00 55 48 89 e5 48 83 ec RSP: 002b:00007fff3c52bd28 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 RAX: ffffffffffffffda RBX: 0000000000040000 RCX: 00007f03f9a46991 RDX: 0000000000040000 RSI: 00007f03f98ba000 RDI: 0000000000000003 RBP: 00007fff3c52bd50 R08: 0000000000000000 R09: 00007f03f9b9a380 R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000040000 R13: 00007f03f98ba000 R14: 0000000000000003 R15: 0000000000000000 </TASK> The reason is simple, readahead brought some order 0 folio in swap cache, and the swapin