linux.git/mm/swap.c, branch v5.18

mm/munlock: protect the per-CPU pagevec by a local_lock_t

2022-04-01T18:46:09+00:00

The access to mlock_pvec is protected by disabling preemption via
get_cpu_var() or implicitly by having preemption disabled by the caller
(in the mlock_page_drain() case).  This breaks on PREEMPT_RT since
folio_lruvec_lock_irq() acquires a sleeping lock in this section.

Create struct mlock_pvec which consists of the local_lock_t and the
pagevec.  Acquire the local_lock() before accessing the per-CPU pagevec.
Replace mlock_page_drain() with a _local() version which is invoked on
the local CPU and acquires the local_lock_t and a _remote() version
which uses the pagevec from a remote CPU which is offline.
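
The split between a locked local drain and an unlocked remote drain can
be modelled in plain userspace C.  This is only a sketch under stated
assumptions: every model_ name, the CPU count and the batch size are
hypothetical, integers stand in for pages, and a simple "held" flag
stands in for local_lock_t.  It is not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS   4   /* hypothetical CPU count */
#define MODEL_PVEC_SIZE 4   /* hypothetical pagevec capacity */

/* Stand-in for struct pagevec: a small batch of page identifiers. */
struct model_pagevec {
	unsigned int nr;
	int pages[MODEL_PVEC_SIZE];
};

/* Mirrors the idea of struct mlock_pvec: the lock lives next to the data. */
struct model_mlock_pvec {
	bool lock_held;             /* stand-in for local_lock_t */
	struct model_pagevec vec;
};

static struct model_mlock_pvec model_pvecs[MODEL_NR_CPUS];
static bool model_cpu_online[MODEL_NR_CPUS];

static void model_lock(struct model_mlock_pvec *p)
{
	assert(!p->lock_held);      /* no recursion in this model */
	p->lock_held = true;
}

static void model_unlock(struct model_mlock_pvec *p)
{
	assert(p->lock_held);
	p->lock_held = false;
}

/* Process and empty a batch; returns how many pages were handled. */
static unsigned int model_flush(struct model_pagevec *vec)
{
	unsigned int n = vec->nr;

	vec->nr = 0;
	return n;
}

void model_init(void)
{
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		model_pvecs[cpu].lock_held = false;
		model_pvecs[cpu].vec.nr = 0;
		model_cpu_online[cpu] = true;
	}
}

void model_cpu_offline(int cpu)
{
	model_cpu_online[cpu] = false;
}

/* Queue a page on this CPU's batch; returns pages flushed (0 if none). */
unsigned int model_mlock_add(int cpu, int page)
{
	struct model_mlock_pvec *p = &model_pvecs[cpu];
	unsigned int flushed = 0;

	model_lock(p);
	p->vec.pages[p->vec.nr++] = page;
	if (p->vec.nr == MODEL_PVEC_SIZE)
		flushed = model_flush(&p->vec);
	model_unlock(p);
	return flushed;
}

/* _local() flavour: runs on the owning CPU, so it takes the lock. */
unsigned int model_drain_local(int cpu)
{
	struct model_mlock_pvec *p = &model_pvecs[cpu];
	unsigned int flushed;

	model_lock(p);
	flushed = model_flush(&p->vec);
	model_unlock(p);
	return flushed;
}

/* _remote() flavour: the target CPU is offline, so nothing can race. */
unsigned int model_drain_remote(int cpu)
{
	assert(!model_cpu_online[cpu]);
	return model_flush(&model_pvecs[cpu].vec);
}
```

In this model the remote drain skips the lock because it asserts that the
owning CPU is offline, which is the same reasoning the commit message
gives for the _remote() version.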

Link: https://lkml.kernel.org/r/YjizWi9IY0mpvIfb@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior 
Acked-by: Hugh Dickins 
Cc: Vlastimil Babka 
Cc: Matthew Wilcox 
Cc: Thomas Gleixner 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: delete __ClearPageWaiters()

2022-03-25T02:06:45+00:00

The PG_waiters bit is not included in PAGE_FLAGS_CHECK_AT_FREE, and
vmscan.c's free_unref_page_list() callers rely on that not to generate
bad_page() alerts.  So it is redundant and misleading for
__page_cache_release(), put_pages_list() and release_pages() (and
presumably the copy-and-pasted free_zone_device_page()) to make a special
point of clearing it (as the "__" implies, it could only safely be used
on the freeing path).

Delete __ClearPageWaiters().  Remark on this in one of the "possible"
comments in folio_wake_bit(), and delete the superfluous comments.
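
The argument rests on a mask check at free time.  The following is a
minimal userspace sketch of that idea, assuming hypothetical bit
positions and a reduced flag set (the MODEL_ names are invented for
illustration and do not match the kernel's enum pageflags):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit positions; the kernel's enum pageflags differs. */
enum model_pageflags {
	MODEL_PG_locked,
	MODEL_PG_waiters,
	MODEL_PG_lru,
	MODEL_PG_writeback,
};

#define MODEL_BIT(nr) (1UL << (nr))

/* Flags that must be clear when a page is freed; waiters is left out. */
#define MODEL_FLAGS_CHECK_AT_FREE \
	(MODEL_BIT(MODEL_PG_locked) | MODEL_BIT(MODEL_PG_lru) | \
	 MODEL_BIT(MODEL_PG_writeback))

/* Returns true if the free-time check would flag this page as bad. */
bool model_bad_page_at_free(unsigned long flags)
{
	return (flags & MODEL_FLAGS_CHECK_AT_FREE) != 0;
}
```

Because the waiters bit is not part of the mask, a page freed with that
bit still set passes the check in this model, so clearing it beforehand
buys nothing.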

Link: https://lkml.kernel.org/r/3eafa969-5b1a-accf-88fe-318784c791a@google.com
Signed-off-by: Hugh Dickins 
Tested-by: Yu Zhao 
Reviewed-by: Yang Shi 
Reviewed-by: David Hildenbrand 
Cc: Matthew Wilcox 
Cc: Nicholas Piggin 
Cc: Yu Zhao 
Cc: Michal Hocko 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

Merge tag 'folio-5.18c' of git://git.infradead.org/users/willy/pagecache

2022-03-23T00:03:12+00:00

Pull folio updates from Matthew Wilcox:

 - Rewrite how munlock works to massively reduce the contention on
   i_mmap_rwsem (Hugh Dickins):

     https://lore.kernel.org/linux-mm/8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com/

 - Sort out the page refcount mess for ZONE_DEVICE pages (Christoph
   Hellwig):

     https://lore.kernel.org/linux-mm/20220210072828.2930359-1-hch@lst.de/

 - Convert GUP to use folios and make pincount available for order-1
   pages. (Matthew Wilcox)

 - Convert a few more truncation functions to use folios (Matthew
   Wilcox)

 - Convert page_vma_mapped_walk to use PFNs instead of pages (Matthew
   Wilcox)

 - Convert rmap_walk to use folios (Matthew Wilcox)

 - Convert most of shrink_page_list() to use a folio (Matthew Wilcox)

 - Add support for creating large folios in readahead (Matthew Wilcox)

* tag 'folio-5.18c' of git://git.infradead.org/users/willy/pagecache: (114 commits)
  mm/damon: minor cleanup for damon_pa_young
  selftests/vm/transhuge-stress: Support file-backed PMD folios
  mm/filemap: Support VM_HUGEPAGE for file mappings
  mm/readahead: Switch to page_cache_ra_order
  mm/readahead: Align file mappings for non-DAX
  mm/readahead: Add large folio readahead
  mm: Support arbitrary THP sizes
  mm: Make large folios depend on THP
  mm: Fix READ_ONLY_THP warning
  mm/filemap: Allow large folios to be added to the page cache
  mm: Turn can_split_huge_page() into can_split_folio()
  mm/vmscan: Convert pageout() to take a folio
  mm/vmscan: Turn page_check_references() into folio_check_references()
  mm/vmscan: Account large folios correctly
  mm/vmscan: Optimise shrink_page_list for non-PMD-sized folios
  mm/vmscan: Free non-shmem folios without splitting them
  mm/rmap: Constify the rmap_walk_control argument
  mm/rmap: Convert rmap_walk() to take a folio
  mm: Turn page_anon_vma() into folio_anon_vma()
  mm/rmap: Turn page_lock_anon_vma_read() into folio_lock_anon_vma_read()
  ...

mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu

2022-03-22T22:57:08+00:00

On systems that run FIFO:1 applications that busy loop, any SCHED_OTHER
task that attempts to execute on such a CPU (such as work threads) will
not be scheduled, which leads to system hangs.

Commit d479960e44f27e0e5 ("mm: disable LRU pagevec during the migration
temporarily") relies on queueing work items on all online CPUs to ensure
visibility of lru_disable_count.

To fix this, replace the usage of work items with synchronize_rcu,
which provides the same guarantees.

Readers of lru_disable_count are protected by either disabling
preemption or rcu_read_lock:

  preempt_disable, local_irq_disable  [bh_lru_lock()]
  rcu_read_lock                       [rt_spin_lock CONFIG_PREEMPT_RT]
  preempt_disable                     [local_lock !CONFIG_PREEMPT_RT]

Since the v5.1 kernel, synchronize_rcu() is guaranteed to wait on
preempt_disable() regions of code.  So any CPU which sees
lru_disable_count = 0 will have exited the critical section when
synchronize_rcu() returns.
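
The ordering argument can be walked through with a single-threaded
userspace model.  This is a sketch under stated assumptions, not the
kernel code: all model_ names are hypothetical, each "CPU" is just an
array slot, and the grace period is modelled conservatively as "no CPU
is inside a protected section" (a real grace period only waits for
readers that started before it).

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 2   /* hypothetical CPU count */

static int  model_lru_disable_count;
static bool model_in_section[MODEL_NR_CPUS]; /* inside a protected section */
static bool model_saw_zero[MODEL_NR_CPUS];   /* counter sampled on entry was 0 */
static int  model_cached[MODEL_NR_CPUS];     /* pages parked in the per-CPU cache */

/* Reader: enter the protected section and sample the counter once. */
void model_reader_enter(int cpu)
{
	model_in_section[cpu] = true;
	model_saw_zero[cpu] = (model_lru_disable_count == 0);
}

/*
 * Reader: add one page and leave the section.  Returns true if the page
 * was parked in the per-CPU cache, false if caching was bypassed.
 */
bool model_reader_add_and_exit(int cpu)
{
	bool cached = model_saw_zero[cpu];

	assert(model_in_section[cpu]);
	if (cached)
		model_cached[cpu]++;
	model_in_section[cpu] = false;
	return cached;
}

/* Conservative grace period: true once no CPU is inside a section. */
bool model_grace_period_done(void)
{
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		if (model_in_section[cpu])
			return false;
	return true;
}

/* Writer, step 1: publish the new counter value. */
void model_disable_begin(void)
{
	model_lru_disable_count++;
}

/* Writer, step 2: after the grace period, drain every per-CPU cache. */
int model_disable_finish(void)
{
	int drained = 0;

	assert(model_grace_period_done());
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		drained += model_cached[cpu];
		model_cached[cpu] = 0;
	}
	return drained;
}
```

A reader that sampled 0 before the increment has left its section by the
time the drain runs, so its page is drained; a reader that enters
afterwards sees a non-zero count and bypasses the cache.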

Link: https://lkml.kernel.org/r/Yin7hDxdt0s/x+fp@fuller.cnet
Signed-off-by: Marcelo Tosatti 
Reviewed-by: Nicolas Saenz Julienne 
Acked-by: Minchan Kim 
Cc: Matthew Wilcox 
Cc: Mel Gorman 
Cc: Juri Lelli 
Cc: Thomas Gleixner 
Cc: Sebastian Andrzej Siewior 
Cc: Paul E. McKenney 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/swap: fix confusing comment in folio_mark_accessed

2022-03-22T22:57:01+00:00

For unevictable pages, we don't need to mark them.

Link: https://lkml.kernel.org/r/20220311141519.59948-1-libang.linuxer@gmail.com
Signed-off-by: Bang Li 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: Turn deactivate_file_page() into deactivate_file_folio()

2022-03-21T16:59:02+00:00

This function has one caller which already has a reference to the
page, so we don't need to use get_page_unless_zero().  Also move the
prototype to mm/internal.h.

Signed-off-by: Matthew Wilcox (Oracle) 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Miaohe Lin

mm: remove the extra ZONE_DEVICE struct page refcount

2022-03-03T17:47:33+00:00

ZONE_DEVICE struct pages have an extra reference count that complicates
the code for put_page() and several places in the kernel that need to
check the reference count to see that a page is not being used (gup,
compaction, migration, etc.). Clean up the code so the reference count
doesn't need to be treated specially for ZONE_DEVICE pages.

Note that this excludes the special idle page wakeup for fsdax pages,
which still happens at refcount 1.  This is a separate issue and will
be sorted out later.  Given that only fsdax pages require the
notification when the refcount hits 1 now, the PAGEMAP_OPS Kconfig
symbol can go away and be replaced with a FS_DAX check for this hook
in the put_page fastpath.

Based on an earlier patch from Ralph Campbell.

Link: https://lkml.kernel.org/r/20220210072828.2930359-8-hch@lst.de
Signed-off-by: Christoph Hellwig 
Reviewed-by: Logan Gunthorpe 
Reviewed-by: Ralph Campbell 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Dan Williams 
Acked-by: Felix Kuehling 
Tested-by: "Sierra Guiza, Alejandro (Alex)" 

Cc: Alex Deucher 
Cc: Alistair Popple 
Cc: Ben Skeggs 
Cc: Chaitanya Kulkarni 
Cc: Christian König 
Cc: Karol Herbst 
Cc: Lyude Paul 
Cc: Miaohe Lin 
Cc: Muchun Song 
Cc: "Pan, Xinhui" 
Signed-off-by: Andrew Morton 
Signed-off-by: Matthew Wilcox (Oracle)

mm: simplify freeing of devmap managed pages

2022-03-03T17:47:33+00:00

Make put_devmap_managed_page() return whether it took charge of the
page, and remove the separate page_is_devmap_managed() helper.
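
The resulting calling convention can be sketched in userspace C.  The
struct, the devmap_managed flag and the model_ function names below are
hypothetical stand-ins chosen for illustration; only the shape of the
call (one helper whose boolean result tells the caller whether it still
has work to do) is taken from the description above.

```c
#include <assert.h>
#include <stdbool.h>

struct model_page {
	int  refcount;
	bool devmap_managed;   /* hypothetical flag for this model */
	bool freed;
};

/*
 * Drops one reference if the page is devmap managed.
 * Returns true if it took charge of the put, false if the caller must.
 */
bool model_put_devmap_managed_page(struct model_page *page)
{
	if (!page->devmap_managed)
		return false;
	assert(page->refcount > 0);
	page->refcount--;
	/* Device-specific handling of the new count would go here. */
	return true;
}

/* Generic put path: one call replaces a separate "is it managed?" check. */
void model_put_page(struct model_page *page)
{
	if (model_put_devmap_managed_page(page))
		return;
	assert(page->refcount > 0);
	if (--page->refcount == 0)
		page->freed = true;
}
```

Folding the test into the helper keeps the generic put path to a single
branch and removes the need for callers to know how a page is classified.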

Link: https://lkml.kernel.org/r/20220210072828.2930359-6-hch@lst.de
Signed-off-by: Christoph Hellwig 
Reviewed-by: Logan Gunthorpe 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Chaitanya Kulkarni 
Reviewed-by: Dan Williams 
Tested-by: "Sierra Guiza, Alejandro (Alex)" 

Cc: Alex Deucher 
Cc: Alistair Popple 
Cc: Ben Skeggs 
Cc: Christian König 
Cc: Felix Kuehling 
Cc: Karol Herbst 
Cc: Lyude Paul 
Cc: Miaohe Lin 
Cc: Muchun Song 
Cc: "Pan, Xinhui" 
Cc: Ralph Campbell 
Signed-off-by: Andrew Morton 
Signed-off-by: Matthew Wilcox (Oracle)

mm: move free_devmap_managed_page to memremap.c

2022-03-03T17:47:33+00:00

free_devmap_managed_page has nothing to do with the code in swap.c,
move it to live with the rest of the code for devmap handling.

Link: https://lkml.kernel.org/r/20220210072828.2930359-5-hch@lst.de
Signed-off-by: Christoph Hellwig 
Reviewed-by: Logan Gunthorpe 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Chaitanya Kulkarni 
Reviewed-by: Muchun Song 
Reviewed-by: Dan Williams 
Tested-by: "Sierra Guiza, Alejandro (Alex)" 

Cc: Alex Deucher 
Cc: Alistair Popple 
Cc: Ben Skeggs 
Cc: Christian König 
Cc: Felix Kuehling 
Cc: Karol Herbst 
Cc: Lyude Paul 
Cc: Miaohe Lin 
Cc: "Pan, Xinhui" 
Cc: Ralph Campbell 
Signed-off-by: Andrew Morton 
Signed-off-by: Matthew Wilcox (Oracle)

mm/munlock: mlock_page() munlock_page() batch by pagevec

2022-02-17T16:59:22+00:00

A weakness of the page->mlock_count approach is the need for lruvec lock
while holding page table lock.  That is not an overhead we would allow on
normal pages, but I think acceptable just for pages in an mlocked area.
But let's try to amortize the extra cost by gathering on per-cpu pagevec
before acquiring the lruvec lock.

I have an unverified conjecture that the mlock pagevec might work out
well for delaying the mlock processing of new file pages until they have
got off lru_cache_add()'s pagevec and on to LRU.

The initialization of page->mlock_count is subject to races and awkward:
0 or !!PageMlocked or 1?  Was it wrong even in the implementation before
this commit, which just widens the window?  I haven't gone back to think
it through.  Maybe someone can point out a better way to initialize it.

Bringing lru_cache_add_inactive_or_unevictable()'s mlock initialization
into mm/mlock.c has helped: mlock_new_page(), using the mlock pagevec,
rather than lru_cache_add()'s pagevec.

Experimented with various orderings: the right thing seems to be for
mlock_page() and mlock_new_page() to TestSetPageMlocked before adding to
pagevec, but munlock_page() to leave TestClearPageMlocked to the later
pagevec processing.
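
The batching and the flag ordering described above can be illustrated
with a small userspace model.  It is a sketch under stated assumptions:
the model_ names and the batch size are hypothetical, a counter stands in
for taking the lruvec lock, and the per-page bookkeeping is much simpler
than what mm/mlock.c does.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_BATCH 4   /* hypothetical batch capacity */

struct model_page {
	bool mlocked;        /* stand-in for PG_mlocked */
	int  mlock_count;    /* stand-in for page->mlock_count */
};

enum model_op { MODEL_MLOCK, MODEL_MUNLOCK };

struct model_entry {
	struct model_page *page;
	enum model_op op;
};

static struct model_entry model_batch[MODEL_BATCH];
static unsigned int model_batch_nr;
static unsigned int model_lruvec_lock_acquisitions;

/* Process the whole batch under one (modelled) lruvec lock acquisition. */
void model_batch_flush(void)
{
	if (!model_batch_nr)
		return;
	model_lruvec_lock_acquisitions++;
	for (unsigned int i = 0; i < model_batch_nr; i++) {
		struct model_page *page = model_batch[i].page;

		if (model_batch[i].op == MODEL_MLOCK) {
			page->mlock_count++;
		} else {
			if (page->mlock_count)
				page->mlock_count--;
			/* The flag is cleared here, at processing time. */
			if (!page->mlock_count)
				page->mlocked = false;
		}
	}
	model_batch_nr = 0;
}

static void model_batch_add(struct model_page *page, enum model_op op)
{
	model_batch[model_batch_nr].page = page;
	model_batch[model_batch_nr].op = op;
	if (++model_batch_nr == MODEL_BATCH)
		model_batch_flush();
}

void model_mlock_page(struct model_page *page)
{
	page->mlocked = true;                   /* flag set before queueing */
	model_batch_add(page, MODEL_MLOCK);
}

void model_munlock_page(struct model_page *page)
{
	model_batch_add(page, MODEL_MUNLOCK);   /* flag left for the flush */
}
```

Three queued operations cost one modelled lock acquisition instead of
three, and a page that is munlocked keeps its flag until the batch is
actually processed.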

Dropped the VM_BUG_ON_PAGE(PageTail)s this time around: they have made
their point, and the thp_nr_pages() calls already contain a VM_BUG_ON_PGFLAGS()
for that.

This still leaves acquiring lruvec locks under page table lock each time
the pagevec fills (or a THP is added): which I suppose is rather silly,
since they sit on pagevec waiting to be processed long after page table
lock has been dropped; but I'm disinclined to uglify the calling sequence
until some load shows an actual problem with it (nothing wrong with
taking lruvec lock under page table lock, just "nicer" to do it less).

Signed-off-by: Hugh Dickins 
Acked-by: Vlastimil Babka 
Signed-off-by: Matthew Wilcox (Oracle)