diff options
| author | Linus Torvalds <torvalds@linux-foundation.org> | 2023-04-27 19:42:02 -0700 |
|---|---|---|
| committer | Linus Torvalds <torvalds@linux-foundation.org> | 2023-04-27 19:42:02 -0700 |
| commit | 7fa8a8ee9400fe8ec188426e40e481717bc5e924 (patch) | |
| tree | cc8fd6b4f936ec01e73238643757451e20478c07 /arch | |
| parent | 91ec4b0d11fe115581ce2835300558802ce55e6c (diff) | |
| parent | 4d4b6d66db63ceed399f1fb1a4b24081d2590eb1 (diff) | |
| download | linux-7fa8a8ee9400fe8ec188426e40e481717bc5e924.tar.gz linux-7fa8a8ee9400fe8ec188426e40e481717bc5e924.tar.bz2 linux-7fa8a8ee9400fe8ec188426e40e481717bc5e924.zip | |
Merge tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of
switching from a user process to a kernel thread.
- More folio conversions from Kefeng Wang, Zhang Peng and Pankaj
Raghav.
- zsmalloc performance improvements from Sergey Senozhatsky.
- Yue Zhao has found and fixed some data race issues around the
alteration of memcg userspace tunables.
- VFS rationalizations from Christoph Hellwig:
- removal of most of the callers of write_one_page()
- make __filemap_get_folio()'s return value more useful
- Luis Chamberlain has changed tmpfs so it no longer requires swap
backing. Use `mount -o noswap'.
- Qi Zheng has made the slab shrinkers operate locklessly, providing
some scalability benefits.
- Keith Busch has improved dmapool's performance, making part of its
operations O(1) rather than O(n).
- Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd,
permitting userspace to wr-protect anon memory unpopulated ptes.
- Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive
rather than exclusive, and has fixed a bunch of errors which were
caused by its unintuitive meaning.
- Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature,
which causes minor faults to install a write-protected pte.
- Vlastimil Babka has done some maintenance work on vma_merge():
cleanups to the kernel code and improvements to our userspace test
harness.
- Cleanups to do_fault_around() by Lorenzo Stoakes.
- Mike Rapoport has moved a lot of initialization code out of various
mm/ files and into mm/mm_init.c.
- Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for
DRM, but DRM doesn't use it any more.
- Lorenzo has also coverted read_kcore() and vread() to use iterators
and has thereby removed the use of bounce buffers in some cases.
- Lorenzo has also contributed further cleanups of vma_merge().
- Chaitanya Prakash provides some fixes to the mmap selftesting code.
- Matthew Wilcox changes xfs and afs so they no longer take sleeping
locks in ->map_page(), a step towards RCUification of pagefaults.
- Suren Baghdasaryan has improved mmap_lock scalability by switching to
per-VMA locking.
- Frederic Weisbecker has reworked the percpu cache draining so that it
no longer causes latency glitches on cpu isolated workloads.
- Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig
logic.
- Liu Shixin has changed zswap's initialization so we no longer waste a
chunk of memory if zswap is not being used.
- Yosry Ahmed has improved the performance of memcg statistics
flushing.
- David Stevens has fixed several issues involving khugepaged,
userfaultfd and shmem.
- Christoph Hellwig has provided some cleanup work to zram's IO-related
code paths.
- David Hildenbrand has fixed up some issues in the selftest code's
testing of our pte state changing.
- Pankaj Raghav has made page_endio() unneeded and has removed it.
- Peter Xu contributed some rationalizations of the userfaultfd
selftests.
- Yosry Ahmed has fixed an issue around memcg's page recalim
accounting.
- Chaitanya Prakash has fixed some arm-related issues in the
selftests/mm code.
- Longlong Xia has improved the way in which KSM handles hwpoisoned
pages.
- Peter Xu fixes a few issues with uffd-wp at fork() time.
- Stefan Roesch has changed KSM so that it may now be used on a
per-process and per-cgroup basis.
* tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits)
mm,unmap: avoid flushing TLB in batch if PTE is inaccessible
shmem: restrict noswap option to initial user namespace
mm/khugepaged: fix conflicting mods to collapse_file()
sparse: remove unnecessary 0 values from rc
mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area()
hugetlb: pte_alloc_huge() to replace huge pte_alloc_map()
maple_tree: fix allocation in mas_sparse_area()
mm: do not increment pgfault stats when page fault handler retries
zsmalloc: allow only one active pool compaction context
selftests/mm: add new selftests for KSM
mm: add new KSM process and sysfs knobs
mm: add new api to enable ksm per process
mm: shrinkers: fix debugfs file permissions
mm: don't check VMA write permissions if the PTE/PMD indicates write permissions
migrate_pages_batch: fix statistics for longterm pin retry
userfaultfd: use helper function range_in_vma()
lib/show_mem.c: use for_each_populated_zone() simplify code
mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list()
fs/buffer: convert create_page_buffers to folio_create_buffers
fs/buffer: add folio_create_empty_buffers helper
...
Diffstat (limited to 'arch')
63 files changed, 521 insertions, 350 deletions
diff --git a/arch/Kconfig b/arch/Kconfig index e3511afbb7f2..205fd23e0cad 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -465,6 +465,38 @@ config ARCH_WANT_IRQS_OFF_ACTIVATE_MM irqs disabled over activate_mm. Architectures that do IPI based TLB shootdowns should enable this. +# Use normal mm refcounting for MMU_LAZY_TLB kernel thread references. +# MMU_LAZY_TLB_REFCOUNT=n can improve the scalability of context switching +# to/from kernel threads when the same mm is running on a lot of CPUs (a large +# multi-threaded application), by reducing contention on the mm refcount. +# +# This can be disabled if the architecture ensures no CPUs are using an mm as a +# "lazy tlb" beyond its final refcount (i.e., by the time __mmdrop frees the mm +# or its kernel page tables). This could be arranged by arch_exit_mmap(), or +# final exit(2) TLB flush, for example. +# +# To implement this, an arch *must*: +# Ensure the _lazy_tlb variants of mmgrab/mmdrop are used when manipulating +# the lazy tlb reference of a kthread's ->active_mm (non-arch code has been +# converted already). +config MMU_LAZY_TLB_REFCOUNT + def_bool y + depends on !MMU_LAZY_TLB_SHOOTDOWN + +# This option allows MMU_LAZY_TLB_REFCOUNT=n. It ensures no CPUs are using an +# mm as a lazy tlb beyond its last reference count, by shooting down these +# users before the mm is deallocated. __mmdrop() first IPIs all CPUs that may +# be using the mm as a lazy tlb, so that they may switch themselves to using +# init_mm for their active mm. mm_cpumask(mm) is used to determine which CPUs +# may be using mm as a lazy tlb mm. +# +# To implement this, an arch *must*: +# - At the time of the final mmdrop of the mm, ensure mm_cpumask(mm) contains +# at least all possible CPUs in which the mm is lazy. +# - It must meet the requirements for MMU_LAZY_TLB_REFCOUNT=n (see above). +config MMU_LAZY_TLB_SHOOTDOWN + bool + config ARCH_HAVE_NMI_SAFE_CMPXCHG bool diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig index d9a13ccf89a3..ab6d701365bb 100644 --- a/arch/arc/Kconfig +++ b/arch/arc/Kconfig @@ -556,7 +556,7 @@ endmenu # "ARC Architecture Configuration" config ARCH_FORCE_MAX_ORDER int "Maximum zone order" - default "12" if ARC_HUGEPAGE_16M - default "11" + default "11" if ARC_HUGEPAGE_16M + default "10" source "kernel/power/Kconfig" diff --git a/arch/arc/mm/init.c b/arch/arc/mm/init.c index ce4e939a7f07..2b89b6c53801 100644 --- a/arch/arc/mm/init.c +++ b/arch/arc/mm/init.c @@ -74,11 +74,6 @@ void __init early_init_dt_add_memory_arch(u64 base, u64 size) base, TO_MB(size), !in_use ? "Not used":""); } -bool arch_has_descending_max_zone_pfns(void) -{ - return !IS_ENABLED(CONFIG_ARC_HAS_PAE40); -} - /* * First memory setup routine called from setup_arch() * 1. setup swapper's mm @init_mm diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index a8b82c2f333b..0fb4b218f665 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -1352,20 +1352,19 @@ config ARM_MODULE_PLTS configurations. If unsure, say y. config ARCH_FORCE_MAX_ORDER - int "Maximum zone order" - default "12" if SOC_AM33XX - default "9" if SA1111 - default "11" - help - The kernel memory allocator divides physically contiguous memory - blocks into "zones", where each zone is a power of two number of - pages. This option selects the largest power of two that the kernel - keeps in the memory allocator. If you need to allocate very large - blocks of physically contiguous memory, then you may need to - increase this value. - - This config option is actually maximum order plus one. For example, - a value of 11 means that the largest free memory block is 2^10 pages. + int "Order of maximal physically contiguous allocations" + default "11" if SOC_AM33XX + default "8" if SA1111 + default "10" + help + The kernel page allocator limits the size of maximal physically + contiguous allocations. The limit is called MAX_ORDER and it + defines the maximal power of two of number of pages that can be + allocated as a single contiguous block. This option allows + overriding the default setting when ability to allocate very + large blocks of physically contiguous memory is required. + + Don't change if unsure. config ALIGNMENT_TRAP def_bool CPU_CP15_MMU diff --git a/arch/arm/configs/imx_v6_v7_defconfig b/arch/arm/configs/imx_v6_v7_defconfig index 62f2b8972160..4de293da4789 100644 --- a/arch/arm/configs/imx_v6_v7_defconfig +++ b/arch/arm/configs/imx_v6_v7_defconfig @@ -31,7 +31,7 @@ CONFIG_SOC_VF610=y CONFIG_SMP=y CONFIG_ARM_PSCI=y CONFIG_HIGHMEM=y -CONFIG_ARCH_FORCE_MAX_ORDER=14 +CONFIG_ARCH_FORCE_MAX_ORDER=13 CONFIG_CMDLINE="noinitrd console=ttymxc0,115200" CONFIG_KEXEC=y CONFIG_CPU_FREQ=y diff --git a/arch/arm/configs/milbeaut_m10v_defconfig b/arch/arm/configs/milbeaut_m10v_defconfig index bd29e5012cb0..385ad0f391a8 100644 --- a/arch/arm/configs/milbeaut_m10v_defconfig +++ b/arch/arm/configs/milbeaut_m10v_defconfig @@ -26,7 +26,7 @@ CONFIG_THUMB2_KERNEL=y # CONFIG_THUMB2_AVOID_R_ARM_THM_JUMP11 is not set # CONFIG_ARM_PATCH_IDIV is not set CONFIG_HIGHMEM=y -CONFIG_ARCH_FORCE_MAX_ORDER=12 +CONFIG_ARCH_FORCE_MAX_ORDER=11 CONFIG_SECCOMP=y CONFIG_KEXEC=y CONFIG_EFI=y diff --git a/arch/arm/configs/pxa_defconfig b/arch/arm/configs/pxa_defconfig index e656d3af2266..b46e39369dbb 100644 --- a/arch/arm/configs/pxa_defconfig +++ b/arch/arm/configs/pxa_defconfig @@ -20,7 +20,7 @@ CONFIG_PXA_SHARPSL=y CONFIG_MACH_AKITA=y CONFIG_MACH_BORZOI=y CONFIG_AEABI=y -CONFIG_ARCH_FORCE_MAX_ORDER=9 +CONFIG_ARCH_FORCE_MAX_ORDER=8 CONFIG_CMDLINE="root=/dev/ram0 ro" CONFIG_KEXEC=y CONFIG_CPU_FREQ=y diff --git a/arch/arm/configs/sama7_defconfig b/arch/arm/configs/sama7_defconfig index 0d964c613d71..954112041403 100644 --- a/arch/arm/configs/sama7_defconfig +++ b/arch/arm/configs/sama7_defconfig @@ -19,7 +19,7 @@ CONFIG_ATMEL_CLOCKSOURCE_TCB=y # CONFIG_CACHE_L2X0 is not set # CONFIG_ARM_PATCH_IDIV is not set # CONFIG_CPU_SW_DOMAIN_PAN is not set -CONFIG_ARCH_FORCE_MAX_ORDER=15 +CONFIG_ARCH_FORCE_MAX_ORDER=14 CONFIG_UACCESS_WITH_MEMCPY=y # CONFIG_ATAGS is not set CONFIG_CMDLINE="console=ttyS0,115200 earlyprintk ignore_loglevel" diff --git a/arch/arm/configs/sp7021_defconfig b/arch/arm/configs/sp7021_defconfig index 5bca2eb59b86..c6448ac860b6 100644 --- a/arch/arm/configs/sp7021_defconfig +++ b/arch/arm/configs/sp7021_defconfig @@ -17,7 +17,7 @@ CONFIG_ARCH_SUNPLUS=y # CONFIG_VDSO is not set CONFIG_SMP=y CONFIG_THUMB2_KERNEL=y -CONFIG_ARCH_FORCE_MAX_ORDER=12 +CONFIG_ARCH_FORCE_MAX_ORDER=11 CONFIG_VFP=y CONFIG_NEON=y CONFIG_MODULES=y diff --git a/arch/arm/mach-rpc/ecard.c b/arch/arm/mach-rpc/ecard.c index 53813f9464a2..c30df1097c52 100644 --- a/arch/arm/mach-rpc/ecard.c +++ b/arch/arm/mach-rpc/ecard.c @@ -253,7 +253,7 @@ static int ecard_init_mm(void) current->mm = mm; current->active_mm = mm; activate_mm(active_mm, mm); - mmdrop(active_mm); + mmdrop_lazy_tlb(active_mm); ecard_init_pgtables(mm); return 0; } diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 3f5bf55050e8..b1201d25a8a4 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -95,6 +95,7 @@ config ARM64 select ARCH_SUPPORTS_INT128 if CC_HAS_INT128 select ARCH_SUPPORTS_NUMA_BALANCING select ARCH_SUPPORTS_PAGE_TABLE_CHECK + select ARCH_SUPPORTS_PER_VMA_LOCK select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT select ARCH_WANT_DEFAULT_BPF_JIT select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT @@ -1505,39 +1506,34 @@ config XEN # include/linux/mmzone.h requires the following to be true: # -# MAX_ORDER - 1 + PAGE_SHIFT <= SECTION_SIZE_BITS +# MAX_ORDER + PAGE_SHIFT <= SECTION_SIZE_BITS # -# so the maximum value of MAX_ORDER is SECTION_SIZE_BITS + 1 - PAGE_SHIFT: +# so the maximum value of MAX_ORDER is SECTION_SIZE_BITS - PAGE_SHIFT: # # | SECTION_SIZE_BITS | PAGE_SHIFT | max MAX_ORDER | default MAX_ORDER | # ----+-------------------+--------------+-----------------+--------------------+ -# 4K | 27 | 12 | 16 | 11 | -# 16K | 27 | 14 | 14 | 12 | -# 64K | 29 | 16 | 14 | 14 | +# 4K | 27 | 12 | 15 | 10 | +# 16K | 27 | 14 | 13 | 11 | +# 64K | 29 | 16 | 13 | 13 | config ARCH_FORCE_MAX_ORDER - int "Maximum zone order" if ARM64_4K_PAGES || ARM64_16K_PAGES - default "14" if ARM64_64K_PAGES - range 12 14 if ARM64_16K_PAGES - default "12" if ARM64_16K_PAGES - range 11 16 if ARM64_4K_PAGES - default "11" - help - The kernel memory allocator divides physically contiguous memory - blocks into "zones", where each zone is a power of two number of - pages. This option selects the largest power of two that the kernel - keeps in the memory allocator. If you need to allocate very large - blocks of physically contiguous memory, then you may need to - increase this value. - - This config option is actually maximum order plus one. For example, - a value of 11 means that the largest free memory block is 2^10 pages. - - We make sure that we can allocate up to a HugePage size for each configuration. - Hence we have : - MAX_ORDER = (PMD_SHIFT - PAGE_SHIFT) + 1 => PAGE_SHIFT - 2 - - However for 4K, we choose a higher default value, 11 as opposed to 10, giving us - 4M allocations matching the default size used by generic code. + int "Order of maximal physically contiguous allocations" if EXPERT && (ARM64_4K_PAGES || ARM64_16K_PAGES) + default "13" if ARM64_64K_PAGES + default "11" if ARM64_16K_PAGES + default "10" + help + The kernel page allocator limits the size of maximal physically + contiguous allocations. The limit is called MAX_ORDER and it + defines the maximal power of two of number of pages that can be + allocated as a single contiguous block. This option allows + overriding the default setting when ability to allocate very + large blocks of physically contiguous memory is required. + + The maximal size of allocation cannot exceed the size of the + section, so the value of MAX_ORDER should satisfy + + MAX_ORDER + PAGE_SHIFT <= SECTION_SIZE_BITS + + Don't change if unsure. config UNMAP_KERNEL_AT_EL0 bool "Unmap kernel when running in userspace (aka \"KAISER\")" if EXPERT diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h index efcd68154a3a..c735afdf639b 100644 --- a/arch/arm64/include/asm/memory.h +++ b/arch/arm64/include/asm/memory.h @@ -261,9 +261,11 @@ static inline const void *__tag_set(const void *addr, u8 tag) } #ifdef CONFIG_KASAN_HW_TAGS -#define arch_enable_tagging_sync() mte_enable_kernel_sync() -#define arch_enable_tagging_async() mte_enable_kernel_async() -#define arch_enable_tagging_asymm() mte_enable_kernel_asymm() +#define arch_enable_tag_checks_sync() mte_enable_kernel_sync() +#define arch_enable_tag_checks_async() mte_enable_kernel_async() +#define arch_enable_tag_checks_asymm() mte_enable_kernel_asymm() +#define arch_suppress_tag_checks_start() mte_enable_tco() +#define arch_suppress_tag_checks_stop() mte_disable_tco() #define arch_force_async_tag_fault() mte_check_tfsr_exit() #define arch_get_random_tag() mte_get_random_tag() #define arch_get_mem_tag(addr) mte_get_mem_tag(addr) diff --git a/arch/arm64/include/asm/mte-kasan.h b/arch/arm64/include/asm/mte-kasan.h index 9f79425fc65a..2e98028c1965 100644 --- a/arch/arm64/include/asm/mte-kasan.h +++ b/arch/arm64/include/asm/mte-kasan.h @@ -13,9 +13,74 @@ #include <linux/types.h> +#ifdef CONFIG_KASAN_HW_TAGS + +/* Whether the MTE asynchronous mode is enabled. */ +DECLARE_STATIC_KEY_FALSE(mte_async_or_asymm_mode); + +static inline bool system_uses_mte_async_or_asymm_mode(void) +{ + return static_branch_unlikely(&mte_async_or_asymm_mode); +} + +#else /* CONFIG_KASAN_HW_TAGS */ + +static inline bool system_uses_mte_async_or_asymm_mode(void) +{ + return false; +} + +#endif /* CONFIG_KASAN_HW_TAGS */ + #ifdef CONFIG_ARM64_MTE /* + * The Tag Check Flag (TCF) mode for MTE is per EL, hence TCF0 + * affects EL0 and TCF affects EL1 irrespective of which TTBR is + * used. + * The kernel accesses TTBR0 usually with LDTR/STTR instructions + * when UAO is available, so these would act as EL0 accesses using + * TCF0. + * However futex.h code uses exclusives which would be executed as + * EL1, this can potentially cause a tag check fault even if the + * user disables TCF0. + * + * To address the problem we set the PSTATE.TCO bit in uaccess_enable() + * and reset it in uaccess_disable(). + * + * The Tag check override (TCO) bit disables temporarily the tag checking + * preventing the issue. + */ +static inline void mte_disable_tco(void) +{ + asm volatile(ALTERNATIVE("nop", SET_PSTATE_TCO(0), + ARM64_MTE, CONFIG_KASAN_HW_TAGS)); +} + +static inline void mte_enable_tco(void) +{ + asm volatile(ALTERNATIVE("nop", SET_PSTATE_TCO(1), + ARM64_MTE, CONFIG_KASAN_HW_TAGS)); +} + +/* + * These functions disable tag checking only if in MTE async mode + * since the sync mode generates exceptions synchronously and the + * nofault or load_unaligned_zeropad can handle them. + */ +static inline void __mte_disable_tco_async(void) +{ + if (system_uses_mte_async_or_asymm_mode()) + mte_disable_tco(); +} + +static inline void __mte_enable_tco_async(void) +{ + if (system_uses_mte_async_or_asymm_mode()) + mte_enable_tco(); +} + +/* * These functions are meant to be only used from KASAN runtime through * the arch_*() interface defined in asm/memory.h. * These functions don't include system_supports_mte() checks, @@ -138,6 +203,22 @@ void mte_enable_kernel_asymm(void); #else /* CONFIG_ARM64_MTE */ +static inline void mte_disable_tco(void) +{ +} + +static inline void mte_enable_tco(void) +{ +} + +static inline void __mte_disable_tco_async(void) +{ +} + +static inline void __mte_enable_tco_async(void) +{ +} + static inline u8 mte_get_ptr_tag(void *ptr) { return 0xFF; diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h index 20dd06d70af5..c028afb1cd0b 100644 --- a/arch/arm64/include/asm/mte.h +++ b/arch/arm64/include/asm/mte.h @@ -178,14 +178,6 @@ static inline void mte_disable_tco_entry(struct task_struct *task) } #ifdef CONFIG_KASAN_HW_TAGS -/* Whether the MTE asynchronous mode is enabled. */ -DECLARE_STATIC_KEY_FALSE(mte_async_or_asymm_mode); - -static inline bool system_uses_mte_async_or_asymm_mode(void) -{ - return static_branch_unlikely(&mte_async_or_asymm_mode); -} - void mte_check_tfsr_el1(void); static inline void mte_check_tfsr_entry(void) @@ -212,10 +204,6 @@ static inline void mte_check_tfsr_exit(void) mte_check_tfsr_el1(); } #else -static inline bool system_uses_mte_async_or_asymm_mode(void) -{ - return false; -} static inline void mte_check_tfsr_el1(void) { } diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index b6ba466e2e8a..0bd18de9fd97 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/ |
