linux.git/mm, branch v6.10-rc6

mm/memory: don't require head page for do_set_pmd()

2024-06-25T03:52:11+00:00

The requirement that the head page be passed to do_set_pmd() was added in
commit ef37b2ea08ac ("mm/memory: page_add_file_rmap() ->
folio_add_file_rmap_[pte|pmd]()") and prevents pmd-mapping in the
finish_fault() and filemap_map_pages() paths if the page to be inserted is
anything but the head page for an otherwise suitable vma and pmd-sized
page.

Matthew said:

: We're going to stop using PMDs to map large folios unless the fault is
: within the first 4KiB of the PMD.  No idea how many workloads that
: affects, but it only needs to be backported as far as v6.8, so we may
: as well backport it.

Link: https://lkml.kernel.org/r/20240611153216.2794513-1-abrestic@rivosinc.com
Fixes: ef37b2ea08ac ("mm/memory: page_add_file_rmap() -> folio_add_file_rmap_[pte|pmd]()")
Signed-off-by: Andrew Bresticker 
Acked-by: David Hildenbrand 
Acked-by: Hugh Dickins 
Cc: 
Signed-off-by: Andrew Morton

mm/page_alloc: Separate THP PCP into movable and non-movable categories

2024-06-25T03:52:11+00:00

Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
THP-sized allocations") no longer differentiates the migration type of
pages in THP-sized PCP list, it's possible that non-movable allocation
requests may get a CMA page from the list, in some cases, it's not
acceptable.

If a large number of CMA memory are configured in system (for example, the
CMA memory accounts for 50% of the system memory), starting a virtual
machine with device passthrough will get stuck.  During starting the
virtual machine, it will call pin_user_pages_remote(..., FOLL_LONGTERM,
...) to pin memory.  Normally if a page is present and in CMA area,
pin_user_pages_remote() will migrate the page from CMA area to non-CMA
area because of FOLL_LONGTERM flag.  But if non-movable allocation
requests return CMA memory, migrate_longterm_unpinnable_pages() will
migrate a CMA page to another CMA page, which will fail to pass the check
in check_and_migrate_movable_pages() and cause migration endless.

Call trace:
pin_user_pages_remote
--__gup_longterm_locked // endless loops in this function
----_get_user_pages_locked
----check_and_migrate_movable_pages
------migrate_longterm_unpinnable_pages
--------alloc_migration_target

This problem will also have a negative impact on CMA itself.  For example,
when CMA is borrowed by THP, and we need to reclaim it through cma_alloc()
or dma_alloc_coherent(), we must move those pages out to ensure CMA's
users can retrieve that contigous memory.  Currently, CMA's memory is
occupied by non-movable pages, meaning we can't relocate them.  As a
result, cma_alloc() is more likely to fail.

To fix the problem above, we add one PCP list for THP, which will not
introduce a new cacheline for struct per_cpu_pages.  THP will have 2 PCP
lists, one PCP list is used by MOVABLE allocation, and the other PCP list
is used by UNMOVABLE allocation.  MOVABLE allocation contains GPF_MOVABLE,
and UNMOVABLE allocation contains GFP_UNMOVABLE and GFP_RECLAIMABLE.

Link: https://lkml.kernel.org/r/1718845190-4456-1-git-send-email-yangge1116@126.com
Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for THP-sized allocations")
Signed-off-by: yangge 
Cc: Baolin Wang 
Cc: Barry Song <21cnbao@gmail.com>
Cc: Mel Gorman 
Cc: 
Signed-off-by: Andrew Morton

mm/migrate: make migrate_pages_batch() stats consistent

2024-06-25T03:52:10+00:00

As Ying pointed out in [1], stats->nr_thp_failed needs to be updated to
avoid stats inconsistency between MIGRATE_SYNC and MIGRATE_ASYNC when
calling migrate_pages_batch().

Because if not, when migrate_pages_batch() is called via
migrate_pages(MIGRATE_ASYNC), nr_thp_failed will not be increased and when
migrate_pages_batch() is called via migrate_pages(MIGRATE_SYNC*),
nr_thp_failed will be increase in migrate_pages_sync() by
stats->nr_thp_failed += astats.nr_thp_split.

[1] https://lore.kernel.org/linux-mm/87msnq7key.fsf@yhuang6-desk2.ccr.corp.intel.com/

Link: https://lkml.kernel.org/r/20240620012712.19804-1-zi.yan@sent.com
Link: https://lkml.kernel.org/r/20240618134151.29214-1-zi.yan@sent.com
Fixes: 7262f208ca68 ("mm/migrate: split source folio if it is on deferred split list")
Signed-off-by: Zi Yan 
Suggested-by: "Huang, Ying" 
Reviewed-by: "Huang, Ying" 
Cc: David Hildenbrand 
Cc: Hugh Dickins 
Cc: Matthew Wilcox (Oracle) 
Cc: Yang Shi 
Cc: Yin Fengwei 
Signed-off-by: Andrew Morton

kasan: fix bad call to unpoison_slab_object

2024-06-25T03:52:09+00:00

Commit 29d7355a9d05 ("kasan: save alloc stack traces for mempool") messed
up one of the calls to unpoison_slab_object: the last two arguments are
supposed to be GFP flags and whether to init the object memory.

Fix the call.

Without this fix, __kasan_mempool_unpoison_object provides the object's
size as GFP flags to unpoison_slab_object, which can cause LOCKDEP reports
(and probably other issues).

Link: https://lkml.kernel.org/r/20240614143238.60323-1-andrey.konovalov@linux.dev
Fixes: 29d7355a9d05 ("kasan: save alloc stack traces for mempool")
Signed-off-by: Andrey Konovalov 
Reported-by: Brad Spengler 
Acked-by: Marco Elver 
Cc: 
Signed-off-by: Andrew Morton

mm: handle profiling for fake memory allocations during compaction

2024-06-25T03:52:09+00:00

During compaction isolated free pages are marked allocated so that they
can be split and/or freed.  For that, post_alloc_hook() is used inside
split_map_pages() and release_free_list().  split_map_pages() marks free
pages allocated, splits the pages and then lets
alloc_contig_range_noprof() free those pages.  release_free_list() marks
free pages and immediately frees them.  This usage of post_alloc_hook()
affect memory allocation profiling because these functions might not be
called from an instrumented allocator, therefore current->alloc_tag is
NULL and when debugging is enabled (CONFIG_MEM_ALLOC_PROFILING_DEBUG=y)
that causes warnings.  To avoid that, wrap such post_alloc_hook() calls
into an instrumented function which acts as an allocator which will be
charged for these fake allocations.  Note that these allocations are very
short lived until they are freed, therefore the associated counters should
usually read 0.

Link: https://lkml.kernel.org/r/20240614230504.3849136-1-surenb@google.com
Signed-off-by: Suren Baghdasaryan 
Acked-by: Vlastimil Babka 
Cc: Kees Cook 
Cc: Kent Overstreet 
Cc: Pasha Tatashin 
Cc: Sourav Panda 
Signed-off-by: Andrew Morton

mm/slab: fix 'variable obj_exts set but not used' warning

2024-06-25T03:52:09+00:00

slab_post_alloc_hook() uses prepare_slab_obj_exts_hook() to obtain
slabobj_ext object.  Currently the only user of slabobj_ext object in this
path is memory allocation profiling, therefore when it's not enabled this
object is not needed.  This also generates a warning when compiling with
CONFIG_MEM_ALLOC_PROFILING=n.  Move the code under this configuration to
fix the warning.  If more slabobj_ext users appear in the future, the code
will have to be changed back to call prepare_slab_obj_exts_hook().

Link: https://lkml.kernel.org/r/20240614225951.3845577-1-surenb@google.com
Fixes: 4b8736964640 ("mm/slab: add allocation accounting into slab allocation and free paths")
Signed-off-by: Suren Baghdasaryan 
Reported-by: kernel test robot 
Closes: https://lore.kernel.org/oe-kbuild-all/202406150444.F6neSaiy-lkp@intel.com/
Cc: Kent Overstreet 
Cc: Kees Cook 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton

/proc/pid/smaps: add mseal info for vma

2024-06-25T03:52:09+00:00

Add sl in /proc/pid/smaps to indicate vma is sealed

Link: https://lkml.kernel.org/r/20240614232014.806352-2-jeffxu@google.com
Fixes: 8be7258aad44 ("mseal: add mseal syscall")
Signed-off-by: Jeff Xu 
Acked-by: David Hildenbrand 
Cc: Adhemerval Zanella 
Cc: Jann Horn 
Cc: Jorge Lucangeli Obes 
Cc: Kees Cook 
Cc: Randy Dunlap 
Cc: Stephen Röttger 
Signed-off-by: Andrew Morton

mm: fix incorrect vbq reference in purge_fragmented_block

2024-06-25T03:52:08+00:00

xa_for_each() in _vm_unmap_aliases() loops through all vbs.  However,
since commit 062eacf57ad9 ("mm: vmalloc: remove a global vmap_blocks
xarray") the vb from xarray may not be on the corresponding CPU
vmap_block_queue.  Consequently, purge_fragmented_block() might use the
wrong vbq->lock to protect the free list, leading to vbq->free breakage.

Incorrect lock protection can exhaust all vmalloc space as follows:
CPU0                                            CPU1
+--------------------------------------------+
|    +--------------------+     +-----+      |
+--> |                    |---->|     |------+
     | CPU1:vbq free_list |     | vb1 |
+--- |                    |<----|     |<-----+
|    +--------------------+     +-----+      |
+--------------------------------------------+

_vm_unmap_aliases()                             vb_alloc()
                                                new_vmap_block()
xa_for_each(&vbq->vmap_blocks, idx, vb)
--> vb in CPU1:vbq->freelist

purge_fragmented_block(vb)
spin_lock(&vbq->lock)                           spin_lock(&vbq->lock)
--> use CPU0:vbq->lock                          --> use CPU1:vbq->lock

list_del_rcu(&vb->free_list)                    list_add_tail_rcu(&vb->free_list, &vbq->free)
    __list_del(vb->prev, vb->next)
        next->prev = prev
    +--------------------+
    |                    |
    | CPU1:vbq free_list |
+---|                    |<--+
|   +--------------------+   |
+----------------------------+
                                                __list_add(new, head->prev, head)
+--------------------------------------------+
|    +--------------------+     +-----+      |
+--> |                    |---->|     |------+
     | CPU1:vbq free_list |     | vb2 |
+--- |                    |<----|     |<-----+
|    +--------------------+     +-----+      |
+--------------------------------------------+

        prev->next = next
+--------------------------------------------+
|----------------------------+               |
|    +--------------------+  |  +-----+      |
+--> |                    |--+  |     |------+
     | CPU1:vbq free_list |     | vb2 |
+--- |                    |<----|     |<-----+
|    +--------------------+     +-----+      |
+--------------------------------------------+
Here’s a list breakdown. All vbs, which were to be added to
‘prev’, cannot be used by list_for_each_entry_rcu(vb, &vbq->free,
free_list) in vb_alloc(). Thus, vmalloc space is exhausted.

This issue affects both erofs and f2fs, the stacktrace is as follows:
erofs:
[] __switch_to+0x174
[] __schedule+0x624
[] schedule+0x7c
[] schedule_preempt_disabled+0x24
[] __mutex_lock+0x374
[] __mutex_lock_slowpath+0x14
[] mutex_lock+0x24
[] reclaim_and_purge_vmap_areas+0x44
[] alloc_vmap_area+0x2e0
[] vm_map_ram+0x1b0
[] z_erofs_lz4_decompress+0x278
[] z_erofs_decompress_queue+0x650
[] z_erofs_runqueue+0x7f4
[] z_erofs_read_folio+0x104
[] filemap_read_folio+0x6c
[] filemap_fault+0x300
[] __do_fault+0xc8
[] handle_mm_fault+0xb38
[] do_page_fault+0x288
[] do_translation_fault[jt]+0x40
[] do_mem_abort+0x58
[] el0_ia+0x70
[] el0t_64_sync_handler[jt]+0xb0
[] ret_to_user[jt]+0x0

f2fs:
[] __switch_to+0x174
[] __schedule+0x624
[] schedule+0x7c
[] schedule_preempt_disabled+0x24
[] __mutex_lock+0x374
[] __mutex_lock_slowpath+0x14
[] mutex_lock+0x24
[] reclaim_and_purge_vmap_areas+0x44
[] alloc_vmap_area+0x2e0
[] vm_map_ram+0x1b0
[] f2fs_prepare_decomp_mem+0x144
[] f2fs_alloc_dic+0x264
[] f2fs_read_multi_pages+0x428
[] f2fs_mpage_readpages+0x314
[] f2fs_readahead+0x50
[] read_pages+0x80
[] page_cache_ra_unbounded+0x1a0
[] page_cache_ra_order+0x274
[] do_sync_mmap_readahead+0x11c
[] filemap_fault+0x1a0
[] f2fs_filemap_fault+0x28
[] __do_fault+0xc8
[] handle_mm_fault+0xb38
[] do_page_fault+0x288
[] do_translation_fault[jt]+0x40
[] do_mem_abort+0x58
[] el0_ia+0x70
[] el0t_64_sync_handler[jt]+0xb0
[] ret_to_user[jt]+0x0

To fix this, introducee cpu within vmap_block to record which this vb
belongs to.

Link: https://lkml.kernel.org/r/20240614021352.1822225-1-zhaoyang.huang@unisoc.com
Link: https://lkml.kernel.org/r/20240607023116.1720640-1-zhaoyang.huang@unisoc.com
Fixes: fc1e0d980037 ("mm/vmalloc: prevent stale TLBs in fully utilized blocks")
Signed-off-by: Zhaoyang Huang 
Suggested-by: Hailong.Liu 
Reviewed-by: Uladzislau Rezki (Sony) 
Cc: Baoquan He 
Cc: Christoph Hellwig 
Cc: Lorenzo Stoakes 
Cc: Thomas Gleixner 
Cc: 
Signed-off-by: Andrew Morton

Merge tag 'fixes-2024-06-23' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock

2024-06-23T14:32:24+00:00

Pull memblock fix from Mike Rapoport:
 "Fix fragility in checks for unset node ID.

  Use numa_valid_node() function to verify that nid is a valid node
  ID instead of inconsistent comparisons with either NUMA_NO_NODE or
  MAX_NUMNODES"

* tag 'fixes-2024-06-23' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
  memblock: use numa_valid_node() helper to check for invalid node ID

Merge tag 'mm-hotfixes-stable-2024-06-17-11-43' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

2024-06-17T19:30:07+00:00

Pull misc fixes from Andrew Morton:
 "Mainly MM singleton fixes. And a couple of ocfs2 regression fixes"

* tag 'mm-hotfixes-stable-2024-06-17-11-43' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  kcov: don't lose track of remote references during softirqs
  mm: shmem: fix getting incorrect lruvec when replacing a shmem folio
  mm/debug_vm_pgtable: drop RANDOM_ORVALUE trick
  mm: fix possible OOB in numa_rebuild_large_mapping()
  mm/migrate: fix kernel BUG at mm/compaction.c:2761!
  selftests: mm: make map_fixed_noreplace test names stable
  mm/memfd: add documentation for MFD_NOEXEC_SEAL MFD_EXEC
  mm: mmap: allow for the maximum number of bits for randomizing mmap_base by default
  gcov: add support for GCC 14
  zap_pid_ns_processes: clear TIF_NOTIFY_SIGNAL along with TIF_SIGPENDING
  mm: huge_memory: fix misused mapping_large_folio_support() for anon folios
  lib/alloc_tag: fix RCU imbalance in pgalloc_tag_get()
  lib/alloc_tag: do not register sysctl interface when CONFIG_SYSCTL=n
  MAINTAINERS: remove Lorenzo as vmalloc reviewer
  Revert "mm: init_mlocked_on_free_v3"
  mm/page_table_check: fix crash on ZONE_DEVICE
  gcc: disable '-Warray-bounds' for gcc-9
  ocfs2: fix NULL pointer dereference in ocfs2_abort_trigger()
  ocfs2: fix NULL pointer dereference in ocfs2_journal_dirty()