linux.git/fs/userfaultfd.c, branch v6.6.131

mm/userfaultfd: fix release hang over concurrent GUP

2025-04-25T08:45:31+00:00

commit fe4cdc2c4e248f48de23bc778870fd71e772a274 upstream.

This patch should fix a possible userfaultfd release() hang during
concurrent GUP.

This problem was initially reported by Dimitris Siakavaras in July 2023
[1] in a firecracker use case.  Firecracker has a separate process
handling page faults remotely, and when the process releases the
userfaultfd it can race with a concurrent GUP from KVM trying to fault in
a guest page during the secondary MMU page fault process.

A similar problem was reported recently again by Jinjiang Tu in March 2025
[2], even though the race happened this time with a mlockall() operation,
which does GUP in a similar fashion.

In 2017, commit 656710a60e36 ("userfaultfd: non-cooperative: closing the
uffd without triggering SIGBUS") tried to fix this issue.  AFAIU, that
fix works well for the fault paths but may not yet work for GUP.  In GUP,
the issue is that faultin_page() treats NOPAGE almost the same as "page
fault resolved", so GUP follows the page again, sees that it is still
missing, and keeps looping, ending up in the live lock as reported.

This change makes core mm return RETRY instead of NOPAGE for both the GUP
and fault paths, proactively releasing the mmap read lock.  This should
guarantee that the releasing thread makes progress on taking the write
lock, and avoid the live lock even for GUP.
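
The difference between the two return values can be sketched with a toy,
single-threaded model.  This is not kernel code: toy_fault,
toy_gup_attempts() and the two flags are hypothetical names, and the
locking is reduced to booleans purely to show why dropping the read lock
lets release() make progress.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for the two fault outcomes; not kernel types. */
enum toy_fault { TOY_NOPAGE, TOY_RETRY };

/*
 * Hypothetical model of a GUP loop racing with a userfaultfd release().
 * Returns the number of fault attempts GUP needed, or -1 if it was still
 * looping after max_attempts (the live lock).
 */
int toy_gup_attempts(enum toy_fault result_after_release, int max_attempts)
{
    bool uffd_registered = true;  /* release() has not cleaned up the VMA */
    bool read_lock_held = true;   /* GUP holds the mmap read lock */

    assert(max_attempts > 0);

    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        if (!uffd_registered)
            return attempt;       /* ordinary fault path resolves the page */

        if (result_after_release == TOY_RETRY)
            read_lock_held = false;   /* RETRY drops the read lock */

        if (!read_lock_held) {
            uffd_registered = false;  /* release() takes the write lock */
            read_lock_held = true;    /* GUP re-takes the lock and retries */
        }
        /* TOY_NOPAGE: lock never dropped, release() stays blocked */
    }
    return -1;
}
```

In this model, toy_gup_attempts(TOY_RETRY, 1000) finishes on the second
attempt, while toy_gup_attempts(TOY_NOPAGE, 1000) never finishes.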

While at it, rearrange the comments to make sure they are up to date.

[1] https://lore.kernel.org/r/79375b71-db2e-3e66-346b-254c90d915e2@cslab.ece.ntua.gr
[2] https://lore.kernel.org/r/20250307072133.3522652-1-tujinjiang@huawei.com

Link: https://lkml.kernel.org/r/20250312145131.1143062-1-peterx@redhat.com
Signed-off-by: Peter Xu 
Cc: Andrea Arcangeli 
Cc: Mike Rapoport (IBM) 
Cc: Axel Rasmussen 
Cc: Jinjiang Tu 
Cc: Dimitris Siakavaras 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Greg Kroah-Hartman

Fix userfaultfd_api to return EINVAL as expected

2024-07-18T11:21:22+00:00

commit 1723f04caacb32cadc4e063725d836a0c4450694 upstream.

Currently, if we request a feature that is not enabled in the kernel
config, we fail silently and return all the available features.  However,
the man page indicates that we should return EINVAL.

We need to fix this, since we can end up with a kernel warning if a
program requests UFFD_FEATURE_WP_UNPOPULATED on a kernel whose config
does not enable this feature.

 [  200.812896] WARNING: CPU: 91 PID: 13634 at mm/memory.c:1660 zap_pte_range+0x43d/0x660
 [  200.820738] Modules linked in:
 [  200.869387] CPU: 91 PID: 13634 Comm: userfaultfd Kdump: loaded Not tainted 6.9.0-rc5+ #8
 [  200.877477] Hardware name: Dell Inc. PowerEdge R6525/0N7YGH, BIOS 2.7.3 03/30/2022
 [  200.885052] RIP: 0010:zap_pte_range+0x43d/0x660

Link: https://lkml.kernel.org/r/20240626130513.120193-1-audra@redhat.com
Fixes: e06f1e1dd499 ("userfaultfd: wp: enabled write protection in userfaultfd API")
Signed-off-by: Audra Mitchell 
Cc: Al Viro 
Cc: Andrea Arcangeli 
Cc: Christian Brauner 
Cc: Jan Kara 
Cc: Mike Rapoport 
Cc: Peter Xu 
Cc: Rafael Aquini 
Cc: Shaohua Li 
Cc: Shuah Khan 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Greg Kroah-Hartman

mm/userfaultfd: reset ptes when close() for wr-protected ones

2024-05-17T10:02:36+00:00

commit c88033efe9a391e72ba6b5df4b01d6e628f4e734 upstream.

Userfaultfd unregister includes a step to remove wr-protect bits from all
the relevant pgtable entries, but that only covered an explicit
UFFDIO_UNREGISTER ioctl, not a close() on the userfaultfd itself.  Cover
that too.  This fixes a WARN trace.

The only user-visible side effect is that the user can observe leftover
wr-protect bits even after close()ing a userfaultfd and releasing the
last reference to it.  Hopefully that is harmless, and nothing bad
should happen even so.

This change is now more important after the recent page-table-check
patch we merged in mm-unstable (446dd9ad37d0 ("mm/page_table_check:
support userfault wr-protect entries")), as we'll do sanity checks on
uffd-wp bits without vma context.  So it's better if we can 100%
guarantee that no uffd-wp bits are left over, to make sure each report
is valid.

Link: https://lore.kernel.org/all/000000000000ca4df20616a0fe16@google.com/
Fixes: f369b07c8614 ("mm/uffd: reset write protection when unregister with wp-mode")
Analyzed-by: David Hildenbrand 
Link: https://lkml.kernel.org/r/20240422133311.2987675-1-peterx@redhat.com
Reported-by: syzbot+d8426b591c36b21c750e@syzkaller.appspotmail.com
Signed-off-by: Peter Xu 
Reviewed-by: David Hildenbrand 
Cc: Nadav Amit 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Greg Kroah-Hartman

mm: userfaultfd: remove stale comment about core dump locking

2023-08-24T23:20:27+00:00

Since commit 7f3bfab52cab ("mm/gup: take mmap_lock in get_dump_page()"),
which landed in v5.10, core dumping doesn't enter fault handling without
holding the mmap_lock anymore.  Remove the stale parts of the comments,
but leave the behavior as-is - letting core dumping block on userfault
handling would be a bad idea and could lead to deadlocks if the dumping
process was handling its own userfaults.

Link: https://lkml.kernel.org/r/20230815212216.264445-1-jannh@google.com
Signed-off-by: Jann Horn 
Signed-off-by: Andrew Morton

mm: handle userfaults under VMA lock

2023-08-24T23:20:17+00:00

Enable handle_userfault() to operate under the VMA lock by releasing the
VMA lock instead of mmap_lock and retrying.  Note that
FAULT_FLAG_RETRY_NOWAIT should never be used when handling faults under
per-VMA lock protection, because that would break the assumption that the
lock is dropped on retry.

[surenb@google.com: fix a lockdep issue in vma_assert_write_locked]
  Link: https://lkml.kernel.org/r/20230712195652.969194-1-surenb@google.com
Link: https://lkml.kernel.org/r/20230630211957.1341547-7-surenb@google.com
Signed-off-by: Suren Baghdasaryan 
Acked-by: Peter Xu 
Cc: Alistair Popple 
Cc: Al Viro 
Cc: Christian Brauner 
Cc: Christoph Hellwig 
Cc: David Hildenbrand 
Cc: David Howells 
Cc: Davidlohr Bueso 
Cc: Hillf Danton 
Cc: "Huang, Ying" 
Cc: Hugh Dickins 
Cc: Jan Kara 
Cc: Johannes Weiner 
Cc: Josef Bacik 
Cc: Laurent Dufour 
Cc: Liam R. Howlett 
Cc: Lorenzo Stoakes 
Cc: Matthew Wilcox 
Cc: Michal Hocko 
Cc: Michel Lespinasse 
Cc: Minchan Kim 
Cc: Pavel Tatashin 
Cc: Punit Agrawal 
Cc: Vlastimil Babka 
Cc: Yu Zhao 
Signed-off-by: Andrew Morton

mm: lock vma explicitly before doing vm_flags_reset and vm_flags_reset_once

2023-08-21T20:37:46+00:00

Implicit vma locking inside vm_flags_reset() and vm_flags_reset_once() is
not obvious and makes it hard to understand where vma locking is happening.
Also, in some cases (like in dup_userfaultfd()) the vma should be locked
earlier than the vm_flags modification. To make locking more visible,
change these functions to assert that the vma write lock is taken, and
explicitly lock the vma beforehand. Fix the userfaultfd functions which
should lock the vma earlier.

Link: https://lkml.kernel.org/r/20230804152724.3090321-5-surenb@google.com
Suggested-by: Linus Torvalds 
Signed-off-by: Suren Baghdasaryan 
Cc: Jann Horn 
Cc: Liam R. Howlett 
Signed-off-by: Andrew Morton

mm: userfaultfd: add new UFFDIO_POISON ioctl

2023-08-18T17:12:16+00:00

The basic idea here is to "simulate" memory poisoning for VMs.  A VM
running on some host might encounter a memory error, after which some
page(s) are poisoned (i.e., future accesses SIGBUS).  Guests expect that
once poisoned, pages can never become "un-poisoned".  So, when we live
migrate the VM, we need to preserve the poisoned status of these pages.

When live migrating, we try to get the guest running on its new host as
quickly as possible.  So, we start it running before all memory has been
copied, and before we're certain which pages should be poisoned or not.

So the basic way to use this new feature is:

- On the new host, the guest's memory is registered with userfaultfd, in
  either MISSING or MINOR mode (doesn't really matter for this purpose).
- On any first access, we get a userfaultfd event. At this point we can
  communicate with the old host to find out if the page was poisoned.
- If so, we can respond with a UFFDIO_POISON - this places a swap marker
  so any future accesses will SIGBUS. Because the pte is now "present",
  future accesses won't generate more userfaultfd events, they'll just
  SIGBUS directly.

UFFDIO_POISON does not handle unmapping previously-present PTEs.  This
isn't needed, because during live migration we want to intercept all
accesses with userfaultfd (not just writes, so WP mode isn't useful for
this).  So whether minor or missing mode is being used (or both), the PTE
won't be present in any case, so handling that case isn't needed.

Similarly, UFFDIO_POISON won't replace existing PTE markers.  This might
be okay to do, but it seems to be safer to just refuse to overwrite any
existing entry (like a UFFD_WP PTE marker).
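
As a rough illustration of the last step in the flow above, a fault
handler might issue the ioctl like this.  This is a minimal sketch:
demo_poison_range() is a hypothetical helper, and it assumes the range is
already registered on uffd and that the installed UAPI headers are new
enough to define UFFDIO_POISON (otherwise it just reports ENOSYS).

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/*
 * Hypothetical helper: mark [start, start + len) as poisoned on a
 * userfaultfd that already has the range registered.
 * Returns 0 on success, -1 with errno set on failure.
 */
int demo_poison_range(int uffd, uint64_t start, uint64_t len)
{
    assert(len > 0);

#ifdef UFFDIO_POISON
    struct uffdio_poison poison = {
        .range = { .start = start, .len = len },
        .mode = 0,   /* 0: also wake any thread waiting on this range */
    };

    /* After this succeeds, accesses to the range SIGBUS directly. */
    return ioctl(uffd, UFFDIO_POISON, &poison) == -1 ? -1 : 0;
#else
    /* Headers predate UFFDIO_POISON; nothing to call in this sketch. */
    (void)uffd;
    (void)start;
    errno = ENOSYS;
    return -1;
#endif
}
```

A handler would call this only after the old host confirms the page was
poisoned, and would resolve the fault normally otherwise.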

Link: https://lkml.kernel.org/r/20230707215540.2324998-5-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen 
Acked-by: Peter Xu 
Cc: Al Viro 
Cc: Brian Geffon 
Cc: Christian Brauner 
Cc: David Hildenbrand 
Cc: Gaosheng Cui 
Cc: Huang, Ying 
Cc: Hugh Dickins 
Cc: James Houghton 
Cc: Jan Alexander Steffens (heftig) 
Cc: Jiaqi Yan 
Cc: Jonathan Corbet 
Cc: Kefeng Wang 
Cc: Liam R. Howlett 
Cc: Miaohe Lin 
Cc: Mike Kravetz 
Cc: Mike Rapoport (IBM) 
Cc: Muchun Song 
Cc: Nadav Amit 
Cc: Naoya Horiguchi 
Cc: Ryan Roberts 
Cc: Shuah Khan 
Cc: Suleiman Souhlal 
Cc: Suren Baghdasaryan 
Cc: T.J. Alumbaugh 
Cc: Yu Zhao 
Cc: ZhangPeng 
Signed-off-by: Andrew Morton

mm: userfaultfd: check for start + len overflow in validate_range

2023-08-18T17:12:16+00:00

Most userfaultfd ioctls take a `start + len` range as an argument.  We
have the validate_range helper to check that such ranges are valid. 
However, some (but not all!) ioctls *also* check that `start + len`
doesn't wrap around (overflow).

Just check for this in validate_range.  This saves some repetitive code,
and adds the check to some ioctls which weren't bothering to check for it
before.
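
A userspace sketch of such a consolidated check might look like the
following.  demo_validate_range() is a hypothetical model, with page_size
and task_size passed in explicitly; the kernel helper's exact checks may
differ.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical model of a range validator for a `start + len` argument.
 * page_size must be a power of two; task_size is the upper address limit.
 * Returns 0 if the range is acceptable, -EINVAL otherwise.
 */
int demo_validate_range(uint64_t start, uint64_t len,
                        uint64_t page_size, uint64_t task_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    if ((start | len) & (page_size - 1))
        return -EINVAL;   /* start or len not page aligned */
    if (len == 0)
        return -EINVAL;   /* empty range */
    if (start + len <= start)
        return -EINVAL;   /* start + len wrapped around */
    if (start + len > task_size)
        return -EINVAL;   /* range extends past the address limit */
    return 0;
}
```

Doing the wrap-around check before the upper-bound comparison matters
here: a wrapped sum is small, so it would otherwise pass the bound check.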

[axelrasmussen@google.com: call validate_range() on the src range too]
  Link: https://lkml.kernel.org/r/20230714182932.2608735-1-axelrasmussen@google.com
[axelrasmussen@google.com: fix src/dst validation]
  Link: https://lkml.kernel.org/r/20230810192128.1855570-1-axelrasmussen@google.com
Link: https://lkml.kernel.org/r/20230707215540.2324998-3-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen 
Reviewed-by: Peter Xu 
Cc: Al Viro 
Cc: Brian Geffon 
Cc: Christian Brauner 
Cc: David Hildenbrand 
Cc: Gaosheng Cui 
Cc: Huang, Ying 
Cc: Hugh Dickins 
Cc: James Houghton 
Cc: Jan Alexander Steffens (heftig) 
Cc: Jiaqi Yan 
Cc: Jonathan Corbet 
Cc: Kefeng Wang 
Cc: Liam R. Howlett 
Cc: Miaohe Lin 
Cc: Mike Kravetz 
Cc: Mike Rapoport (IBM) 
Cc: Muchun Song 
Cc: Nadav Amit 
Cc: Naoya Horiguchi 
Cc: Ryan Roberts 
Cc: Shuah Khan 
Cc: Suleiman Souhlal 
Cc: Suren Baghdasaryan 
Cc: T.J. Alumbaugh 
Cc: Yu Zhao 
Cc: ZhangPeng 
Signed-off-by: Andrew Morton

mm/gup: retire follow_hugetlb_page()

2023-08-18T17:12:04+00:00

Now __get_user_pages() should be well prepared to handle thp completely,
as well as hugetlb gup requests, even without hugetlb's special path.

Time to retire follow_hugetlb_page().

Tweak misc comments to reflect reality of follow_hugetlb_page()'s removal.

Link: https://lkml.kernel.org/r/20230628215310.73782-7-peterx@redhat.com
Signed-off-by: Peter Xu 
Acked-by: David Hildenbrand 
Cc: Andrea Arcangeli 
Cc: Hugh Dickins 
Cc: James Houghton 
Cc: Jason Gunthorpe 
Cc: John Hubbard 
Cc: Kirill A . Shutemov 
Cc: Lorenzo Stoakes 
Cc: Matthew Wilcox 
Cc: Mike Kravetz 
Cc: Mike Rapoport (IBM) 
Cc: Vlastimil Babka 
Cc: Yang Shi 
Signed-off-by: Andrew Morton

Merge mm-hotfixes-stable into mm-stable to pick up depended-upon changes.

2023-06-23T23:58:19+00:00