linux.git/fs/userfaultfd.c, branch v6.10-rc6

Merge tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

2024-05-19T16:21:03+00:00

Pull mm updates from Andrew Morton:
"The usual shower of singleton fixes and minor series all over MM,
documented (hopefully adequately) in the respective changelogs.
Notable series include:

- Lucas Stach has provided some page-mapping cleanup/consolidation/
maintainability work in the series "mm/treewide: Remove pXd_huge()
API".

- In the series "Allow migrate on protnone reference with
MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's
MPOL_PREFERRED_MANY mode, yielding almost doubled performance in
one test.

- In their series "Memory allocation profiling" Kent Overstreet and
Suren Baghdasaryan have contributed a means of determining (via
/proc/allocinfo) whereabouts in the kernel memory is being
allocated: number of calls and amount of memory.

- Matthew Wilcox has provided the series "Various significant MM
patches" which does a number of rather unrelated things, but in
largely similar code sites.

- In his series "mm: page_alloc: freelist migratetype hygiene"
Johannes Weiner has fixed the page allocator's handling of
migratetype requests, with resulting improvements in compaction
efficiency.

- In the series "make the hugetlb migration strategy consistent"
Baolin Wang has fixed a hugetlb migration issue, which should
improve hugetlb allocation reliability.

- Liu Shixin has hit an I/O meltdown caused by readahead in a
memory-tight memcg. Addressed in the series "Fix I/O high when
memory almost met memcg limit".

- In the series "mm/filemap: optimize folio adding and splitting"
Kairui Song has optimized pagecache insertion, yielding ~10%
performance improvement in one test.

- Baoquan He has cleaned up and consolidated the early zone
initialization code in the series "mm/mm_init.c: refactor
free_area_init_core()".

- Baoquan has also redone some MM initializatio code in the series
"mm/init: minor clean up and improvement".

- MM helper cleanups from Christoph Hellwig in his series "remove
follow_pfn".

- More cleanups from Matthew Wilcox in the series "Various
page->flags cleanups".

- Vlastimil Babka has contributed maintainability improvements in the
series "memcg_kmem hooks refactoring".

- More folio conversions and cleanups in Matthew Wilcox's series:
"Convert huge_zero_page to huge_zero_folio"
"khugepaged folio conversions"
"Remove page_idle and page_young wrappers"
"Use folio APIs in procfs"
"Clean up __folio_put()"
"Some cleanups for memory-failure"
"Remove page_mapping()"
"More folio compat code removal"

- David Hildenbrand chipped in with "fs/proc/task_mmu: convert
hugetlb functions to work on folis".

- Code consolidation and cleanup work related to GUP's handling of
hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2".

- Rick Edgecombe has developed some fixes to stack guard gaps in the
series "Cover a guard gap corner case".

- Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the
series "mm/ksm: fix ksm exec support for prctl".

- Baolin Wang has implemented NUMA balancing for multi-size THPs.
This is a simple first-cut implementation for now. The series is
"support multi-size THP numa balancing".

- Cleanups to vma handling helper functions from Matthew Wilcox in
the series "Unify vma_address and vma_pgoff_address".

- Some selftests maintenance work from Dev Jain in the series
"selftests/mm: mremap_test: Optimizations and style fixes".

- Improvements to the swapping of multi-size THPs from Ryan Roberts
in the series "Swap-out mTHP without splitting".

- Kefeng Wang has significantly optimized the handling of arm64's
permission page faults in the series
"arch/mm/fault: accelerate pagefault when badaccess"
"mm: remove arch's private VM_FAULT_BADMAP/BADACCESS"

- GUP cleanups from David Hildenbrand in "mm/gup: consistently call
it GUP-fast".

- hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault
path to use struct vm_fault".

- selftests build fixes from John Hubbard in the series "Fix
selftests/mm build without requiring "make headers"".

- Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the
series "Improved Memory Tier Creation for CPUless NUMA Nodes".
Fixes the initialization code so that migration between different
memory types works as intended.

- David Hildenbrand has improved follow_pte() and fixed an errant
driver in the series "mm: follow_pte() improvements and acrn
follow_pte() fixes".

- David also did some cleanup work on large folio mapcounts in his
series "mm: mapcount for large folios + page_mapcount() cleanups".

- Folio conversions in KSM in Alex Shi's series "transfer page to
folio in KSM".

- Barry Song has added some sysfs stats for monitoring multi-size
THP's in the series "mm: add per-order mTHP alloc and swpout
counters".

- Some zswap cleanups from Yosry Ahmed in the series "zswap
same-filled and limit checking cleanups".

- Matthew Wilcox has been looking at buffer_head code and found the
documentation to be lacking. The series is "Improve buffer head
documentation".

- Multi-size THPs get more work, this time from Lance Yang. His
series "mm/madvise: enhance lazyfreeing with mTHP in madvise_free"
optimizes the freeing of these things.

- Kemeng Shi has added more userspace-visible writeback
instrumentation in the series "Improve visibility of writeback".

- Kemeng Shi then sent some maintenance work on top in the series
"Fix and cleanups to page-writeback".

- Matthew Wilcox reduces mmap_lock traffic in the anon vma code in
the series "Improve anon_vma scalability for anon VMAs". Intel's
test bot reported an improbable 3x improvement in one test.

- SeongJae Park adds some DAMON feature work in the series
"mm/damon: add a DAMOS filter type for page granularity access recheck"
"selftests/damon: add DAMOS quota goal test"

- Also some maintenance work in the series
"mm/damon/paddr: simplify page level access re-check for pageout"
"mm/damon: misc fixes and improvements"

- David Hildenbrand has disabled some known-to-fail selftests ni the
series "selftests: mm: cow: flag vmsplice() hugetlb tests as
XFAIL".

- memcg metadata storage optimizations from Shakeel Butt in "memcg:
reduce memory consumption by memcg stats".

- DAX fixes and maintenance work from Vishal Verma in the series
"dax/bus.c: Fixups for dax-bus locking""

* tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (426 commits)
memcg, oom: cleanup unused memcg_oom_gfp_mask and memcg_oom_order
selftests/mm: hugetlb_madv_vs_map: avoid test skipping by querying hugepage size at runtime
mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_wp
mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_fault
selftests: cgroup: add tests to verify the zswap writeback path
mm: memcg: make alloc_mem_cgroup_per_node_info() return bool
mm/damon/core: fix return value from damos_wmark_metric_value
mm: do not update memcg stats for NR_{FILE/SHMEM}_PMDMAPPED
selftests: cgroup: remove redundant enabling of memory controller
Docs/mm/damon/maintainer-profile: allow posting patches based on damon/next tree
Docs/mm/damon/maintainer-profile: change the maintainer's timezone from PST to PT
Docs/mm/damon/design: use a list for supported filters
Docs/admin-guide/mm/damon/usage: fix wrong schemes effective quota update command
Docs/admin-guide/mm/damon/usage: fix wrong example of DAMOS filter matching sysfs file
selftests/damon: classify tests for functionalities and regressions
selftests/damon/_damon_sysfs: use 'is' instead of '==' for 'None'
selftests/damon/_damon_sysfs: find sysfs mount point from /proc/mounts
selftests/damon/_damon_sysfs: check errors from nr_schemes file reads
mm/damon/core: initialize ->esz_bp from damos_quota_init_priv()
selftests/damon: add a test for DAMOS quota goal
...

Merge tag 'vfs-6.10.rw' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

2024-05-13T19:23:17+00:00

Pull vfs rw iterator updates from Christian Brauner:
 "The core fs signalfd, userfaultfd, and timerfd subsystems did still
  use f_op->read() instead of f_op->read_iter(). Convert them over since
  we should aim to get rid of f_op->read() at some point.

  Aside from that io_uring and others want to mark files as FMODE_NOWAIT
  so it can make use of per-IO nonblocking hints to enable more
  efficient IO. Converting those users to f_op->read_iter() allows them
  to be marked with FMODE_NOWAIT"

* tag 'vfs-6.10.rw' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  signalfd: convert to ->read_iter()
  userfaultfd: convert to ->read_iter()
  timerfd: convert to ->read_iter()
  new helper: copy_to_iter_full()

mm/userfaultfd: reset ptes when close() for wr-protected ones

2024-05-06T00:28:04+00:00

Userfaultfd unregister includes a step to remove wr-protect bits from all
the relevant pgtable entries, but that only covered an explicit
UFFDIO_UNREGISTER ioctl, not a close() on the userfaultfd itself.  Cover
that too.  This fixes a WARN trace.

The only user visible side effect is the user can observe leftover
wr-protect bits even if the user close()ed on an userfaultfd when
releasing the last reference of it.  However hopefully that should be
harmless, and nothing bad should happen even if so.

This change is now more important after the recent page-table-check
patch we merged in mm-unstable (446dd9ad37d0 ("mm/page_table_check:
support userfault wr-protect entries")), as we'll do sanity check on
uffd-wp bits without vma context.  So it's better if we can 100%
guarantee no uffd-wp bit leftovers, to make sure each report will be
valid.

Link: https://lore.kernel.org/all/000000000000ca4df20616a0fe16@google.com/
Fixes: f369b07c8614 ("mm/uffd: reset write protection when unregister with wp-mode")
Analyzed-by: David Hildenbrand 
Link: https://lkml.kernel.org/r/20240422133311.2987675-1-peterx@redhat.com
Reported-by: syzbot+d8426b591c36b21c750e@syzkaller.appspotmail.com
Signed-off-by: Peter Xu 
Reviewed-by: David Hildenbrand 
Cc: Nadav Amit 
Cc: 
Signed-off-by: Andrew Morton

userfaultfd: early return in dup_userfaultfd()

2024-04-26T03:56:25+00:00

When vma->vm_userfaultfd_ctx.ctx is NULL, vma->vm_flags should have
cleared __VM_UFFD_FLAGS. Therefore, there is no need to down_write or
clear the flag, which will affect fork performance. Fix this by
returning early if octx is NULL in dup_userfaultfd().

By applying this patch we can get a 1.3% performance improvement for
lmbench fork_prot. Results are as follows:
                   base      early return
Process fork+exit: 419.1106  413.4804

Link: https://lkml.kernel.org/r/20240327090835.3232629-1-zhangpeng362@huawei.com
Signed-off-by: ZhangPeng 
Reviewed-by: Peter Xu 
Cc: Axel Rasmussen 
Cc: David Hildenbrand 
Cc: Kefeng Wang 
Cc: Liam R. Howlett 
Cc: Lokesh Gidra 
Cc: Nanyong Sun 
Cc: Suren Baghdasaryan 
Signed-off-by: Andrew Morton

userfaultfd: convert to ->read_iter()

2024-04-10T22:23:04+00:00

Rather than use the older style ->read() hook, use ->read_iter() so that
userfaultfd can support both O_NONBLOCK and IOCB_NOWAIT for non-blocking
read attempts.

Split the fd setup into two parts, so that userfaultfd can mark the file
mode with FMODE_NOWAIT before installing it into the process table. With
that, we can also defer grabbing the mm until we know the rest will
succeed, as the fd isn't visible before then.

Signed-off-by: Jens Axboe

userfaultfd: use per-vma locks in userfaultfd operations

2024-02-22T23:27:20+00:00

All userfaultfd operations, except write-protect, opportunistically use
per-vma locks to lock vmas.  On failure, attempt again inside mmap_lock
critical section.

Write-protect operation requires mmap_lock as it iterates over multiple
vmas.

Link: https://lkml.kernel.org/r/20240215182756.3448972-5-lokeshgidra@google.com
Signed-off-by: Lokesh Gidra 
Reviewed-by: Liam R. Howlett 
Cc: Andrea Arcangeli 
Cc: Axel Rasmussen 
Cc: Brian Geffon 
Cc: David Hildenbrand 
Cc: Jann Horn 
Cc: Kalesh Singh 
Cc: Matthew Wilcox (Oracle) 
Cc: Mike Rapoport (IBM) 
Cc: Nicolas Geoffray 
Cc: Peter Xu 
Cc: Ryan Roberts 
Cc: Suren Baghdasaryan 
Cc: Tim Murray 
Signed-off-by: Andrew Morton

userfaultfd: protect mmap_changing with rw_sem in userfaulfd_ctx

2024-02-22T23:27:20+00:00

Increments and loads to mmap_changing are always in mmap_lock critical
section.  This ensures that if userspace requests event notification for
non-cooperative operations (e.g.  mremap), userfaultfd operations don't
occur concurrently.

This can be achieved by using a separate read-write semaphore in
userfaultfd_ctx such that increments are done in write-mode and loads in
read-mode, thereby eliminating the dependency on mmap_lock for this
purpose.

This is a preparatory step before we replace mmap_lock usage with per-vma
locks in fill/move ioctls.

Link: https://lkml.kernel.org/r/20240215182756.3448972-3-lokeshgidra@google.com
Signed-off-by: Lokesh Gidra 
Reviewed-by: Mike Rapoport (IBM) 
Reviewed-by: Liam R. Howlett 
Cc: Andrea Arcangeli 
Cc: Axel Rasmussen 
Cc: Brian Geffon 
Cc: David Hildenbrand 
Cc: Jann Horn 
Cc: Kalesh Singh 
Cc: Matthew Wilcox (Oracle) 
Cc: Nicolas Geoffray 
Cc: Peter Xu 
Cc: Ryan Roberts 
Cc: Suren Baghdasaryan 
Cc: Tim Murray 
Signed-off-by: Andrew Morton

userfaultfd: move userfaultfd_ctx struct to header file

2024-02-22T23:27:20+00:00

Patch series "per-vma locks in userfaultfd", v7.

Performing userfaultfd operations (like copy/move etc.) in critical
section of mmap_lock (read-mode) causes significant contention on the lock
when operations requiring the lock in write-mode are taking place
concurrently.  We can use per-vma locks instead to significantly reduce
the contention issue.

Android runtime's Garbage Collector uses userfaultfd for concurrent
compaction.  mmap-lock contention during compaction potentially causes
jittery experience for the user.  During one such reproducible scenario,
we observed the following improvements with this patch-set:

- Wall clock time of compaction phase came down from ~3s to <500ms
- Uninterruptible sleep time (across all threads in the process) was
  ~10ms (none in mmap_lock) during compaction, instead of >20s


This patch (of 4):

Move the struct to userfaultfd_k.h to be accessible from mm/userfaultfd.c.
There are no other changes in the struct.

This is required to prepare for using per-vma locks in userfaultfd
operations.

Link: https://lkml.kernel.org/r/20240215182756.3448972-1-lokeshgidra@google.com
Link: https://lkml.kernel.org/r/20240215182756.3448972-2-lokeshgidra@google.com
Signed-off-by: Lokesh Gidra 
Reviewed-by: Mike Rapoport (IBM) 
Reviewed-by: Liam R. Howlett 
Cc: Andrea Arcangeli 
Cc: Axel Rasmussen 
Cc: Brian Geffon 
Cc: David Hildenbrand 
Cc: Jann Horn 
Cc: Kalesh Singh 
Cc: Matthew Wilcox (Oracle) 
Cc: Nicolas Geoffray 
Cc: Peter Xu 
Cc: Ryan Roberts 
Cc: Suren Baghdasaryan 
Cc: Tim Murray 
Signed-off-by: Andrew Morton

userfaultfd: fix return error if mmap_changing is non-zero in MOVE ioctl

2024-02-22T18:24:38+00:00

To be consistent with other uffd ioctl's returning EAGAIN when
mmap_changing is detected, we should change UFFDIO_MOVE to do the same.

Link: https://lkml.kernel.org/r/20240117223922.1445327-1-lokeshgidra@google.com
Signed-off-by: Lokesh Gidra 
Acked-by: Suren Baghdasaryan 
Acked-by: Mike Rapoport (IBM) 
Cc: Andrea Arcangeli 
Cc: Axel Rasmussen 
Cc: Brian Geffon 
Cc: David Hildenbrand 
Cc: Jann Horn 
Cc: Kalesh Singh 
Cc: Matthew Wilcox (Oracle) 
Cc: Nicolas Geoffray 
Cc: Peter Xu 
Signed-off-by: Andrew Morton

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

2024-01-17T21:03:37+00:00

Pull kvm updates from Paolo Bonzini:
 "Generic:

   - Use memdup_array_user() to harden against overflow.

   - Unconditionally advertise KVM_CAP_DEVICE_CTRL for all
     architectures.

   - Clean up Kconfigs that all KVM architectures were selecting

   - New functionality around "guest_memfd", a new userspace API that
     creates an anonymous file and returns a file descriptor that refers
     to it. guest_memfd files are bound to their owning virtual machine,
     cannot be mapped, read, or written by userspace, and cannot be
     resized. guest_memfd files do however support PUNCH_HOLE, which can
     be used to switch a memory area between guest_memfd and regular
     anonymous memory.

   - New ioctl KVM_SET_MEMORY_ATTRIBUTES allowing userspace to specify
     per-page attributes for a given page of guest memory; right now the
     only attribute is whether the guest expects to access memory via
     guest_memfd or not, which in Confidential SVMs backed by SEV-SNP,
     TDX or ARM64 pKVM is checked by firmware or hypervisor that
     guarantees confidentiality (AMD PSP, Intel TDX module, or EL2 in
     the case of pKVM).

  x86:

   - Support for "software-protected VMs" that can use the new
     guest_memfd and page attributes infrastructure. This is mostly
     useful for testing, since there is no pKVM-like infrastructure to
     provide a meaningfully reduced TCB.

   - Fix a relatively benign off-by-one error when splitting huge pages
     during CLEAR_DIRTY_LOG.

   - Fix a bug where KVM could incorrectly test-and-clear dirty bits in
     non-leaf TDP MMU SPTEs if a racing thread replaces a huge SPTE with
     a non-huge SPTE.

   - Use more generic lockdep assertions in paths that don't actually
     care about whether the caller is a reader or a writer.

   - let Xen guests opt out of having PV clock reported as "based on a
     stable TSC", because some of them don't expect the "TSC stable" bit
     (added to the pvclock ABI by KVM, but never set by Xen) to be set.

   - Revert a bogus, made-up nested SVM consistency check for
     TLB_CONTROL.

   - Advertise flush-by-ASID support for nSVM unconditionally, as KVM
     always flushes on nested transitions, i.e. always satisfies flush
     requests. This allows running bleeding edge versions of VMware
     Workstation on top of KVM.

   - Sanity check that the CPU supports flush-by-ASID when enabling SEV
     support.

   - On AMD machines with vNMI, always rely on hardware instead of
     intercepting IRET in some cases to detect unmasking of NMIs

   - Support for virtualizing Linear Address Masking (LAM)

   - Fix a variety of vPMU bugs where KVM fail to stop/reset counters
     and other state prior to refreshing the vPMU model.

   - Fix a double-overflow PMU bug by tracking emulated counter events
     using a dedicated field instead of snapshotting the "previous"
     counter. If the hardware PMC count triggers overflow that is
     recognized in the same VM-Exit that KVM manually bumps an event
     count, KVM would pend PMIs for both the hardware-triggered overflow
     and for KVM-triggered overflow.

   - Turn off KVM_WERROR by default for all configs so that it's not
     inadvertantly enabled by non-KVM developers, which can be
     problematic for subsystems that require no regressions for W=1
     builds.

   - Advertise all of the host-supported CPUID bits that enumerate
     IA32_SPEC_CTRL "features".

   - Don't force a masterclock update when a vCPU synchronizes to the
     current TSC generation, as updating the masterclock can cause
     kvmclock's time to "jump" unexpectedly, e.g. when userspace
     hotplugs a pre-created vCPU.

   - Use RIP-relative address to read kvm_rebooting in the VM-Enter
     fault paths, partly as a super minor optimization, but mostly to
     make KVM play nice with position independent executable builds.

   - Guard KVM-on-HyperV's range-based TLB flush hooks with an #ifdef on
     CONFIG_HYPERV as a minor optimization, and to self-document the
     code.

   - Add CONFIG_KVM_HYPERV to allow disabling KVM support for HyperV
     "emulation" at build time.

  ARM64:

   - LPA2 support, adding 52bit IPA/PA capability for 4kB and 16kB base
     granule sizes. Branch shared with the arm64 tree.

   - Large Fine-Grained Trap rework, bringing some sanity to the
     feature, although there is more to come. This comes with a prefix
     branch shared with the arm64 tree.

   - Some additional Nested Virtualization groundwork, mostly
     introducing the NV2 VNCR support and retargetting the NV support to
     that version of the architecture.

   - A small set of vgic fixes and associated cleanups.

  Loongarch:

   - Optimization for memslot hugepage checking

   - Cleanup and fix some HW/SW timer issues

   - Add LSX/LASX (128bit/256bit SIMD) support

  RISC-V:

   - KVM_GET_REG_LIST improvement for vector registers

   - Generate ISA extension reg_list using macros in get-reg-list
     selftest

   - Support for reporting steal time along with selftest

  s390:

   - Bugfixes

  Selftests:

   - Fix an annoying goof where the NX hugepage test prints out garbage
     instead of the magic token needed to run the test.

   - Fix build errors when a header is delete/moved due to a missing
     flag in the Makefile.

   - Detect if KVM bugged/killed a selftest's VM and print out a helpful
     message instead of complaining that a random ioctl() failed.

   - Annotate the guest printf/assert helpers with __printf(), and fix
     the various bugs that were lurking due to lack of said annotation"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (185 commits)
  x86/kvm: Do not try to disable kvmclock if it was not enabled
  KVM: x86: add missing "depends on KVM"
  KVM: fix direction of dependency on MMU notifiers
  KVM: introduce CONFIG_KVM_COMMON
  KVM: arm64: Add missing memory barriers when switching to pKVM's hyp pgd
  KVM: arm64: vgic-its: Avoid potential UAF in LPI translation cache
  RISC-V: KVM: selftests: Add get-reg-list test for STA registers
  RISC-V: KVM: selftests: Add steal_time test support
  RISC-V: KVM: selftests: Add guest_sbi_probe_extension
  RISC-V: KVM: selftests: Move sbi_ecall to processor.c
  RISC-V: KVM: Implement SBI STA extension
  RISC-V: KVM: Add support for SBI STA registers
  RISC-V: KVM: Add support for SBI extension registers
  RISC-V: KVM: Add SBI STA info to vcpu_arch
  RISC-V: KVM: Add steal-update vcpu request
  RISC-V: KVM: Add SBI STA extension skeleton
  RISC-V: paravirt: Implement steal-time support
  RISC-V: Add SBI STA extension definitions
  RISC-V: paravirt: Add skeleton for pv-time support
  RISC-V: KVM: Fix indentation in kvm_riscv_vcpu_set_reg_csr()
  ...