linux.git/mm, branch v6.12.81

mm/memory: fix PMD/PUD checks in follow_pfnmap_start()

2026-04-11T12:24:53+00:00

[ Upstream commit ffef67b93aa352b34e6aeba3d52c19a63885409a ]

follow_pfnmap_start() suffers from two problems:

(1) We are not re-fetching the pmd/pud after taking the PTL

Therefore, we are not properly stabilizing what the lock actually
protects.  If there is concurrent zapping, we would indicate to the
caller that we found an entry; however, that entry might already have
been invalidated, or might contain a different PFN, after taking the lock.

Properly use pmdp_get() / pudp_get() after taking the lock.

(2) pmd_leaf() / pud_leaf() are not well defined on non-present entries

pmd_leaf()/pud_leaf() could wrongly trigger on non-present entries.

There is no real guarantee that pmd_leaf()/pud_leaf() returns something
reasonable on non-present entries.  Most architectures indeed either
perform a present check or make it work by smart use of flags.

However, loongarch, for example, checks the _PAGE_HUGE flag in pmd_leaf(),
and always sets the _PAGE_HUGE flag in __swp_entry_to_pmd().  Whereas
pmd_trans_huge() explicitly checks pmd_present(), pmd_leaf() does not do
that.

Let's check pmd_present()/pud_present() before assuming "this is a present
PMD leaf" when spotting pmd_leaf()/pud_leaf(), like other page table
handling code that traverses user page tables does.

Given that non-present PMD entries are likely rare in VM_IO|VM_PFNMAP, (1)
is likely more relevant than (2).  It is questionable how often (1) would
actually trigger, but let's CC stable to be sure.

This was found by code inspection.
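
The two rules above (re-fetch the entry after taking the lock, and check
"present" before trusting "leaf") can be illustrated with a small userspace
model.  This is only a sketch under stated assumptions: the toy_* names, the
bit layout and the pthread mutex standing in for the PTL are all
hypothetical and are not the kernel implementation.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit layout for a toy "PMD" entry; not any real architecture. */
#define TOY_PRESENT 0x1u
#define TOY_LEAF    0x2u	/* may also be set on non-present entries */

struct toy_table {
	pthread_mutex_t ptl;	/* stands in for the page table lock */
	_Atomic uint64_t pmd;	/* the entry the lock protects */
};

static bool toy_present(uint64_t e) { return e & TOY_PRESENT; }
static bool toy_leaf(uint64_t e)    { return e & TOY_LEAF; }

/*
 * Returns true with the lock held and *out set to the stabilized entry,
 * or false with the lock released.
 */
static bool toy_follow_start(struct toy_table *t, uint64_t *out)
{
	uint64_t e = atomic_load(&t->pmd);	/* lockless peek, may be stale */

	if (!toy_present(e) || !toy_leaf(e))
		return false;

	pthread_mutex_lock(&t->ptl);
	e = atomic_load(&t->pmd);		/* re-fetch under the lock */
	if (!toy_present(e) || !toy_leaf(e)) {	/* present before leaf */
		pthread_mutex_unlock(&t->ptl);
		return false;
	}
	*out = e;
	return true;
}

static void toy_follow_end(struct toy_table *t)
{
	pthread_mutex_unlock(&t->ptl);
}
```

A caller only uses the value returned through *out, never the one it may
have peeked at before the lock was taken, and pairs every successful
toy_follow_start() with toy_follow_end().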

Link: https://lkml.kernel.org/r/20260323-follow_pfnmap_fix-v1-1-5b0ec10872b3@kernel.org
Fixes: 6da8e9634bb7 ("mm: new follow_pfnmap API")
Signed-off-by: David Hildenbrand (Arm) 
Acked-by: Mike Rapoport (Microsoft) 
Reviewed-by: Lorenzo Stoakes (Oracle) 
Cc: Liam Howlett 
Cc: Michal Hocko 
Cc: Peter Xu 
Cc: Suren Baghdasaryan 
Cc: Vlastimil Babka 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman

mm: replace READ_ONCE() with standard page table accessors

2026-04-11T12:24:52+00:00

[ Upstream commit c0efdb373c3aaacb32db59cadb0710cac13e44ae ]

Replace all READ_ONCE() uses with the standard page table accessors, i.e.
pxdp_get(), which default to READ_ONCE() in cases where the platform does
not override them.
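
As a rough illustration of "accessor that defaults to READ_ONCE() unless
overridden", the sketch below uses the common #ifndef override idiom in
userspace.  All toy_* names are hypothetical, and TOY_READ_ONCE() is only a
simplified stand-in for the kernel macro.

```c
#include <stdint.h>

typedef uint64_t toy_pmd_t;

/* Simplified stand-in for READ_ONCE(): a single volatile load. */
#define TOY_READ_ONCE(x) (*(volatile __typeof__(x) *)&(x))

/*
 * Generic fallback.  A platform that needs something special would define
 * toy_pmdp_get before this point, and the fallback would be skipped.
 */
#ifndef toy_pmdp_get
static inline toy_pmd_t toy_pmdp_get(const toy_pmd_t *pmdp)
{
	return TOY_READ_ONCE(*pmdp);
}
#define toy_pmdp_get toy_pmdp_get
#endif

static int toy_entry_is_none(const toy_pmd_t *pmdp)
{
	/* Before the conversion: toy_pmd_t pmd = TOY_READ_ONCE(*pmdp); */
	toy_pmd_t pmd = toy_pmdp_get(pmdp);

	return pmd == 0;
}
```

Callers that go through the accessor pick up a platform override
automatically, which an open-coded load cannot do.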

Link: https://lkml.kernel.org/r/20251007063100.2396936-1-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual 
Acked-by: David Hildenbrand 
Reviewed-by: Lance Yang 
Reviewed-by: Wei Yang 
Reviewed-by: Dev Jain 
Signed-off-by: Andrew Morton 
Stable-dep-of: ffef67b93aa3 ("mm/memory: fix PMD/PUD checks in follow_pfnmap_start()")
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman

mm/damon/sysfs: check contexts->nr before accessing contexts_arr[0]

2026-04-02T11:09:50+00:00

commit 1bfe9fb5ed2667fb075682408b776b5273162615 upstream.

Multiple sysfs command paths dereference contexts_arr[0] without first
verifying that kdamond->contexts->nr == 1.  A user can set nr_contexts to
0 via sysfs while DAMON is running, causing NULL pointer dereferences.

In more detail, privileged users can trigger the issue as shown
below.

First, start DAMON and make contexts directory empty
(kdamond->contexts->nr == 0).

    # damo start
    # cd /sys/kernel/mm/damon/admin/kdamonds/0
    # echo 0 > contexts/nr_contexts

Then, each of the commands below will cause the NULL pointer dereference.

    # echo update_schemes_stats > state
    # echo update_schemes_tried_regions > state
    # echo update_schemes_tried_bytes > state
    # echo update_schemes_effective_quotas > state
    # echo update_tuned_intervals > state

Guard all commands (except OFF) at the entry point of
damon_sysfs_handle_cmd().
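
A minimal sketch of such an entry-point guard follows.  The toy_* types,
the command names and the -EINVAL return value are assumptions made for
illustration; they are not taken from the DAMON sysfs code.

```c
#include <errno.h>
#include <stddef.h>

enum toy_cmd { TOY_CMD_ON, TOY_CMD_OFF, TOY_CMD_UPDATE_SCHEMES_STATS };

struct toy_context { int nr_schemes; };
struct toy_contexts { int nr; struct toy_context **contexts_arr; };
struct toy_kdamond { struct toy_contexts *contexts; };

static int toy_handle_cmd(enum toy_cmd cmd, struct toy_kdamond *kdamond)
{
	/* Every command except OFF goes on to touch contexts_arr[0]. */
	if (cmd != TOY_CMD_OFF && kdamond->contexts->nr != 1)
		return -EINVAL;	/* assumed error code */

	switch (cmd) {
	case TOY_CMD_OFF:
		return 0;
	case TOY_CMD_ON:
	case TOY_CMD_UPDATE_SCHEMES_STATS:
		/* Safe: the guard above established nr == 1. */
		return kdamond->contexts->contexts_arr[0]->nr_schemes;
	}
	return -EINVAL;
}
```

Checking once at the entry point covers every command path, instead of
repeating the check in each handler that dereferences contexts_arr[0].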

Link: https://lkml.kernel.org/r/20260321175427.86000-3-sj@kernel.org
Fixes: 0ac32b8affb5 ("mm/damon/sysfs: support DAMOS stats")
Signed-off-by: Josh Law 
Reviewed-by: SeongJae Park 
Signed-off-by: SeongJae Park 
Cc: 	[5.18+]
Signed-off-by: Andrew Morton 
Signed-off-by: SeongJae Park 
Signed-off-by: Greg Kroah-Hartman

mm/shmem, swap: avoid redundant Xarray lookup during swapin

2026-03-25T10:08:56+00:00

commit 0cfc0e7e3d062b93e9eec6828de000981cdfb152 upstream.

Currently shmem calls xa_get_order to get the order of the swap radix
entry, which requires a full tree walk.  This can easily be combined with
the swap entry value check (shmem_confirm_swap) to avoid the duplicated
lookup and to abort early if the entry is already gone, which should
improve performance.
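
The idea of folding the value check and the order lookup into one walk can
be modelled as below.  This is a toy sketch: a flat array and a walk
counter stand in for the XArray, and all toy_* names and the -1 return
convention are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

struct toy_slot { uint64_t swap_val; int order; };

static int toy_walks;	/* counts simulated tree walks */

static const struct toy_slot *toy_lookup(const struct toy_slot *map,
					 size_t n, size_t index)
{
	toy_walks++;
	return index < n ? &map[index] : NULL;
}

/* Before: the value check and the order lookup each walk the tree. */
static int toy_check_then_order(const struct toy_slot *map, size_t n,
				size_t index, uint64_t swap_val)
{
	const struct toy_slot *s = toy_lookup(map, n, index);

	if (!s || s->swap_val != swap_val)
		return -1;
	s = toy_lookup(map, n, index);	/* second, redundant walk */
	return s ? s->order : -1;
}

/* After: one walk confirms the value and reports the order. */
static int toy_confirm_swap(const struct toy_slot *map, size_t n,
			    size_t index, uint64_t swap_val)
{
	const struct toy_slot *s = toy_lookup(map, n, index);

	if (!s || s->swap_val != swap_val)
		return -1;	/* entry is gone or changed: abort early */
	return s->order;
}
```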

Link: https://lkml.kernel.org/r/20250728075306.12704-1-ryncsn@gmail.com
Link: https://lkml.kernel.org/r/20250728075306.12704-3-ryncsn@gmail.com
Signed-off-by: Kairui Song 
Reviewed-by: Kemeng Shi 
Reviewed-by: Dev Jain 
Reviewed-by: Baolin Wang 
Cc: Baoquan He 
Cc: Barry Song 
Cc: Chris Li 
Cc: Hugh Dickins 
Cc: Matthew Wilcox (Oracle) 
Cc: Nhat Pham 
Signed-off-by: Andrew Morton 

Stable-dep-of: 8a1968bd997f ("mm/shmem, swap: fix race of truncate and swap entry split")
[ hughd: removed series cover letter and skip_swapcache dependencies ]
Signed-off-by: Hugh Dickins 
Signed-off-by: Greg Kroah-Hartman

mm/shmem, swap: improve cached mTHP handling and fix potential hang

2026-03-25T10:08:56+00:00

commit 5c241ed8d031693dadf33dd98ed2e7cc363e9b66 upstream.

The current swap-in code assumes that, when a swap entry in a shmem mapping
is order 0, its cached folios (if present) must be order 0 too, which
turns out not to always be correct.

The problem is that shmem_split_large_entry is called before verifying that
the folio will eventually be swapped in.  One possible race is:

    CPU1                          CPU2
shmem_swapin_folio
/* swap in of order > 0 swap entry S1 */
  folio = swap_cache_get_folio
  /* folio = NULL */
  order = xa_get_order
  /* order > 0 */
  folio = shmem_swap_alloc_folio
  /* mTHP alloc failure, folio = NULL */
  <... Interrupted ...>
                                 shmem_swapin_folio
                                 /* S1 is swapped in */
                                 shmem_writeout
                                 /* S1 is swapped out, folio cached */
  shmem_split_large_entry(..., S1)
  /* S1 is split, but the folio covering it has order > 0 now */

Now any following swapin of S1 will hang: `xa_get_order` returns 0, and
folio lookup will return a folio with order > 0.  The check
`xa_get_order(&mapping->i_pages, index) != folio_order(folio)` will always
be true, causing swap-in to return -EEXIST.

This also looks fragile.  So fix it by allowing a larger folio to be seen
in the swap cache, and by checking that the whole shmem mapping range
covered by the swapin has the right swap values upon inserting the folio.
Also drop the redundant tree walks before the insertion.

This will actually improve performance, as it avoids two redundant Xarray
tree walks in the hot path, and the only side effect is that in the
failure path, shmem may redundantly reallocate a few folios causing
temporary slight memory pressure.

It is also worth noting that the order and value check before insertion
may seem to help reduce lock contention, but it does not.  The swap
cache layer ensures that a racing swapin will either see a swap cache folio
or fail to do a swapin (we have the SWAP_HAS_CACHE bit even if the swap
cache is bypassed), so holding the folio lock and checking the folio flag is
already good enough for avoiding the lock contention.  The chance that a
folio passes the swap entry value check but the shmem mapping slot has
changed should be very low.
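
The "check the whole range upon insertion" step can be sketched as follows.
This is a simplification under stated assumptions: a flat per-page array of
swap values replaces the multi-order XArray entries, and toy_range_matches()
is a hypothetical helper, not a function from mm/shmem.c.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Returns true if the 2^order slots starting at index hold the contiguous
 * swap values swap_val, swap_val + 1, ... expected for a folio of that
 * order.  The range must be naturally aligned and within the mapping.
 */
static bool toy_range_matches(const uint64_t *slots, size_t nslots,
			      size_t index, unsigned int order,
			      uint64_t swap_val)
{
	size_t nr = (size_t)1 << order;

	if (index % nr || index + nr > nslots)
		return false;

	for (size_t i = 0; i < nr; i++)
		if (slots[index + i] != swap_val + i)
			return false;	/* slot changed: do not insert */

	return true;
}
```

Doing this check at insertion time, with the folio in hand, is what makes
the earlier separate order and value lookups unnecessary in this model.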

Link: https://lkml.kernel.org/r/20250728075306.12704-1-ryncsn@gmail.com
Link: https://lkml.kernel.org/r/20250728075306.12704-2-ryncsn@gmail.com
Fixes: 809bc86517cc ("mm: shmem: support large folio swap out")
Signed-off-by: Kairui Song 
Reviewed-by: Kemeng Shi 
Reviewed-by: Baolin Wang 
Tested-by: Baolin Wang 
Cc: Baoquan He 
Cc: Barry Song 
Cc: Chris Li 
Cc: Hugh Dickins 
Cc: Matthew Wilcox (Oracle) 
Cc: Nhat Pham 
Cc: Dev Jain 
Cc: 
Signed-off-by: Andrew Morton 

[ hughd: removed skip_swapcache dependencies ]
Signed-off-by: Hugh Dickins 
Signed-off-by: Greg Kroah-Hartman

mm: shmem: avoid unpaired folio_unlock() in shmem_swapin_folio()

2026-03-25T10:08:56+00:00

commit e08d5f515613a9860bfee7312461a19f422adb5e upstream.

If we get a folio from swap_cache_get_folio() successfully but encounter a
failure before the folio is locked, we will unlock the folio which was not
previously locked.

Put the folio and set it to NULL when a failure occurs before the folio is
locked to fix the issue.
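
The error-path rule can be modelled with a toy folio that counts unpaired
unlocks.  Everything here (toy_folio, the fail_* switches, the -1 error
value) is hypothetical and only mirrors the shape of the fix: drop the
reference and clear the pointer when failing before the lock is taken.

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_folio { int refcount; bool locked; };

static int toy_unpaired_unlocks;	/* unlocks of a folio that was not locked */

static void toy_folio_put(struct toy_folio *f) { f->refcount--; }
static void toy_folio_lock(struct toy_folio *f) { f->locked = true; }
static void toy_folio_unlock(struct toy_folio *f)
{
	if (!f->locked)
		toy_unpaired_unlocks++;
	f->locked = false;
}

static int toy_swapin(struct toy_folio *cached, bool fail_before_lock,
		      bool fail_after_lock)
{
	struct toy_folio *folio = cached;
	int error = 0;

	folio->refcount++;		/* the cache lookup took a reference */

	if (fail_before_lock) {
		error = -1;
		toy_folio_put(folio);	/* drop the reference here ... */
		folio = NULL;		/* ... so the exit path skips the unlock */
		goto unlock;
	}

	toy_folio_lock(folio);
	if (fail_after_lock) {
		error = -1;
		goto unlock;
	}
	toy_folio_unlock(folio);
	return 0;			/* success: caller keeps the reference */

unlock:
	if (folio) {
		toy_folio_unlock(folio);
		toy_folio_put(folio);
	}
	return error;
}
```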

Link: https://lkml.kernel.org/r/20250516170939.965736-1-shikemeng@huaweicloud.com
Link: https://lkml.kernel.org/r/20250516170939.965736-2-shikemeng@huaweicloud.com
Fixes: 058313515d5a ("mm: shmem: fix potential data corruption during shmem swapin")
Signed-off-by: Kemeng Shi 
Reviewed-by: Baolin Wang 
Reviewed-by: Kairui Song 
Cc: Hugh Dickins 
Cc: kernel test robot 
Signed-off-by: Andrew Morton 

[ hughd: removed series cover letter comments ]
Signed-off-by: Hugh Dickins 
Signed-off-by: Greg Kroah-Hartman

mm: shmem: fix potential data corruption during shmem swapin

2026-03-25T10:08:56+00:00

commit 058313515d5aab10d0a01dd634f92ed4a4e71d4c upstream.

Alex and Kairui reported some issues (system hang or data corruption) when
swapping out or swapping in large shmem folios.  This is especially easy
to reproduce when the tmpfs is mounted with the 'huge=within_size'
parameter.  Thanks to Kairui's reproducer, the issue can be easily
replicated.

The root cause of the problem is that swap readahead may asynchronously
swap in order 0 folios into the swap cache, while the shmem mapping can
still store large swap entries.  Then an order 0 folio is inserted into
the shmem mapping without splitting the large swap entry, which overwrites
the original large swap entry, leading to data corruption.

When getting a folio from the swap cache, we should split the large swap
entry stored in the shmem mapping if the orders do not match, to fix this
issue.

Link: https://lkml.kernel.org/r/2fe47c557e74e9df5fe2437ccdc6c9115fa1bf70.1740476943.git.baolin.wang@linux.alibaba.com
Fixes: 809bc86517cc ("mm: shmem: support large folio swap out")
Signed-off-by: Baolin Wang 
Reported-by: Alex Xu (Hello71) 
Reported-by: Kairui Song 
Closes: https://lore.kernel.org/all/1738717785.im3r5g2vxc.none@localhost/
Tested-by: Kairui Song 
Cc: David Hildenbrand 
Cc: Lance Yang 
Cc: Matthew Wilcox 
Cc: Hugh Dickins 
Cc: 
Signed-off-by: Andrew Morton 

[ hughd: removed skip_swapcache dependencies ]
Signed-off-by: Hugh Dickins 
Signed-off-by: Greg Kroah-Hartman

mm/kfence: fix KASAN hardware tag faults during late enablement

2026-03-25T10:08:42+00:00

[ Upstream commit d155aab90fffa00f93cea1f107aef0a3d548b2ff ]

When KASAN hardware tags are enabled, re-enabling KFENCE late (via
/sys/module/kfence/parameters/sample_interval) causes KASAN faults.

This happens because the KFENCE pool and metadata are allocated via the
page allocator, which tags the memory, while KFENCE continues to access it
using untagged pointers during initialization.

Use __GFP_SKIP_KASAN for late KFENCE pool and metadata allocations to
ensure the memory remains untagged, consistent with early allocations from
memblock.  To support this, add __GFP_SKIP_KASAN to the allowlist in
__alloc_contig_verify_gfp_mask().

Link: https://lkml.kernel.org/r/20260220144940.2779209-1-glider@google.com
Fixes: 0ce20dd84089 ("mm: add Kernel Electric-Fence infrastructure")
Signed-off-by: Alexander Potapenko 
Suggested-by: Ernesto Martinez Garcia 
Cc: Andrey Konovalov 
Cc: Andrey Ryabinin 
Cc: Dmitry Vyukov 
Cc: Greg KH 
Cc: Kees Cook 
Cc: Marco Elver 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman

mm/page_alloc: forward the gfp flags from alloc_contig_range() to post_alloc_hook()

2026-03-25T10:08:42+00:00

[ Upstream commit 7b755570064fcb9cde37afd48f6bc65151097ba7 ]

In the __GFP_COMP case, we already pass the gfp_flags to
prep_new_page()->post_alloc_hook().  However, in the !__GFP_COMP case, we
essentially pass only hardcoded __GFP_MOVABLE to post_alloc_hook(),
preventing some action modifiers from being effective.

Let's pass our now properly adjusted gfp flags there as well.

This way, we can now support __GFP_ZERO for alloc_contig_*().

As a side effect, we now also support __GFP_SKIP_ZERO and __GFP_ZEROTAGS;
but we'll keep the more special stuff (KASAN, NOLOCKDEP) disabled for now.

It's worth noting that with __GFP_ZERO, we might unnecessarily zero pages
when we have to release part of our range using free_contig_range() again.
This can be optimized in the future, if ever required; the caller we'll
be converting (powernv/memtrace) next won't trigger this.

Link: https://lkml.kernel.org/r/20241203094732.200195-6-david@redhat.com
Signed-off-by: David Hildenbrand 
Reviewed-by: Vlastimil Babka 
Reviewed-by: Oscar Salvador 
Cc: Christophe Leroy 
Cc: Madhavan Srinivasan 
Cc: Michael Ellerman 
Cc: Naveen N Rao 
Cc: Nicholas Piggin 
Cc: Vishal Moola (Oracle) 
Cc: Zi Yan 
Signed-off-by: Andrew Morton 
Stable-dep-of: d155aab90fff ("mm/kfence: fix KASAN hardware tag faults during late enablement")
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman

mm/page_alloc: sort out the alloc_contig_range() gfp flags mess

2026-03-25T10:08:42+00:00

[ Upstream commit f6037a4a686523dee1967ef7620349822e019ff8 ]

It's all a bit complicated for alloc_contig_range().  For example, we
don't support many flags, so let's start bailing out on unsupported ones
-- ignoring the placement hints, as we are already given the range to
allocate.

While we currently set cc.gfp_mask, in __alloc_contig_migrate_range() we
simply create yet another GFP mask whereby we ignore the reclaim flags
specified by the caller.  That looks very inconsistent.

Let's clean it up, constructing the gfp flags used for
compaction/migration exactly once.  Update the documentation of the
gfp_mask parameter for alloc_contig_range() and alloc_contig_pages().
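
The shape of such a check (drop placement hints, bail out on anything not
on an allowlist, and build the compaction/migration mask exactly once) can
be sketched as below.  The TOY_* flag bits and toy_verify_gfp_mask() are
hypothetical; the real GFP bits and the exact set of accepted flags differ.

```c
#include <stdbool.h>

/* Hypothetical flag bits for illustration only. */
#define TOY_RECLAIM	0x01u	/* reclaim behaviour requested by the caller */
#define TOY_NOWARN	0x02u	/* an accepted action modifier */
#define TOY_PLACEMENT	0x04u	/* placement hint: ignored, the range is given */
#define TOY_ZERO	0x08u	/* not on the allowlist in this sketch */

static bool toy_verify_gfp_mask(unsigned int gfp, unsigned int *cc_gfp)
{
	const unsigned int allowed = TOY_RECLAIM | TOY_NOWARN;

	gfp &= ~TOY_PLACEMENT;	/* placement hints are dropped, not rejected */
	if (gfp & ~allowed)
		return false;	/* unsupported flag: bail out */

	*cc_gfp = gfp;		/* built once, reused for compaction/migration */
	return true;
}
```

In this model the caller's reclaim flags survive into the single mask that
is used later, rather than being replaced by a second, independently built
one.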

Link: https://lkml.kernel.org/r/20241203094732.200195-5-david@redhat.com
Signed-off-by: David Hildenbrand 
Acked-by: Zi Yan 
Reviewed-by: Vlastimil Babka 
Reviewed-by: Oscar Salvador 
Cc: Christophe Leroy 
Cc: Madhavan Srinivasan 
Cc: Michael Ellerman 
Cc: Naveen N Rao 
Cc: Nicholas Piggin 
Cc: Vishal Moola (Oracle) 
Signed-off-by: Andrew Morton 
Stable-dep-of: d155aab90fff ("mm/kfence: fix KASAN hardware tag faults during late enablement")
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman