Age | Commit message (Collapse) | Author | Files | Lines |
|
[ Upstream commit cd3c93167da0e760b5819246eae7a4ea30fd014b ]
Since we are about to stash some more information into the pp_magic
field, let's move the magic signature checks into a pair of helper
functions so it can be changed in one place.
Reviewed-by: Mina Almasry <almasrymina@google.com>
Tested-by: Yonglong Liu <liuyonglong@huawei.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20250409-page-pool-track-dma-v9-1-6a9ef2e0cba8@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Stable-dep-of: ee62ce7a1d90 ("page_pool: Track DMA-mapped pages and unmap them when destroying the pool")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
__alloc_pages_slowpath has no change detection for ac->nodemask in the
part of retry path, while cpuset can modify it in parallel. For some
processes that set mempolicy as MPOL_BIND, this results ac->nodemask
changes, and then the should_reclaim_retry will judge based on the latest
nodemask and jump to retry, while the get_page_from_freelist only
traverses the zonelist from ac->preferred_zoneref, which selected by a
expired nodemask and may cause infinite retries in some cases
cpu 64:
__alloc_pages_slowpath {
/* ..... */
retry:
/* ac->nodemask = 0x1, ac->preferred->zone->nid = 1 */
if (alloc_flags & ALLOC_KSWAPD)
wake_all_kswapds(order, gfp_mask, ac);
/* cpu 1:
cpuset_write_resmask
update_nodemask
update_nodemasks_hier
update_tasks_nodemask
mpol_rebind_task
mpol_rebind_policy
mpol_rebind_nodemask
// mempolicy->nodes has been modified,
// which ac->nodemask point to
*/
/* ac->nodemask = 0x3, ac->preferred->zone->nid = 1 */
if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
did_some_progress > 0, &no_progress_loops))
goto retry;
}
Simultaneously starting multiple cpuset01 from LTP can quickly reproduce
this issue on a multi node server when the maximum memory pressure is
reached and the swap is enabled
Link: https://lkml.kernel.org/r/20250416082405.20988-1-zhangtianyang@loongson.cn
Fixes: c33d6c06f60f ("mm, page_alloc: avoid looking up the first zone in a zonelist twice")
Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The page allocator tracks the number of zones that have unaccepted memory
using static_branch_enc/dec() and uses that static branch in hot paths to
determine if it needs to deal with unaccepted memory.
Borislav and Thomas pointed out that the tracking is racy: operations on
static_branch are not serialized against adding/removing unaccepted pages
to/from the zone.
Sanity checks inside static_branch machinery detects it:
WARNING: CPU: 0 PID: 10 at kernel/jump_label.c:276 __static_key_slow_dec_cpuslocked+0x8e/0xa0
The comment around the WARN() explains the problem:
/*
* Warn about the '-1' case though; since that means a
* decrement is concurrent with a first (0->1) increment. IOW
* people are trying to disable something that wasn't yet fully
* enabled. This suggests an ordering problem on the user side.
*/
The effect of this static_branch optimization is only visible on
microbenchmark.
Instead of adding more complexity around it, remove it altogether.
Link: https://lkml.kernel.org/r/20250506133207.1009676-1-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Fixes: dcdfdd40fa82 ("mm: Add support for unaccepted memory")
Link: https://lore.kernel.org/all/20250506092445.GBaBnVXXyvnazly6iF@fat_crate.local
Reported-by: Borislav Petkov <bp@alien8.de>
Tested-by: Borislav Petkov (AMD) <bp@alien8.de>
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org> [6.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
try_alloc_pages() will not attempt to allocate memory if the system has
*any* unaccepted memory. Memory is accepted as needed and can remain in
the system indefinitely, causing the interface to always fail.
Rather than immediately giving up, attempt to use already accepted memory
on free lists.
Pass 'alloc_flags' to cond_accept_memory() and do not accept new memory
for ALLOC_TRYLOCK requests.
Found via code inspection - only BPF uses this at present and the
runtime effects are unclear.
Link: https://lkml.kernel.org/r/20250506112509.905147-2-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Fixes: 97769a53f117 ("mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation")
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Commit 51ff4d7486f0 ("mm: avoid extra mem_alloc_profiling_enabled()
checks") introduces a possible use-after-free scenario, when page
is non-compound, page[0] could be released by other thread right
after put_page_testzero failed in current thread, pgalloc_tag_sub_pages
afterwards would manipulate an invalid page for accounting remaining
pages:
[timeline] [thread1] [thread2]
| alloc_page non-compound
V
| get_page, rf counter inc
V
| in ___free_pages
| put_page_testzero fails
V
| put_page, page released
V
| in ___free_pages,
| pgalloc_tag_sub_pages
| manipulate an invalid page
V
Restore __free_pages() to its state before, retrieve alloc tag
beforehand.
Link: https://lkml.kernel.org/r/20250505193034.91682-1-00107082@163.com
Fixes: 51ff4d7486f0 ("mm: avoid extra mem_alloc_profiling_enabled() checks")
Signed-off-by: David Wang <00107082@163.com>
Acked-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Vlastimil points out that commit a211c6550efc ("mm: page_alloc:
defrag_mode kswapd/kcompactd watermarks") switched kswapd from
zone_watermark_ok_safe() to the standard, percpu-cached version of reading
free pages, thus dropping the watermark safety precautions for systems
with high CPU counts (e.g. >212 cpus on 64G). Restore them.
Since zone_watermark_ok_safe() is no longer the right interface, and this
was the last caller of the function anyway, open-code the
zone_page_state_snapshot() conditional and delete the function.
Link: https://lkml.kernel.org/r/20250416135142.778933-2-hannes@cmpxchg.org
Fixes: a211c6550efc ("mm: page_alloc: defrag_mode kswapd/kcompactd watermarks")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When the last page in the zone is accepted, __accept_page() calls
static_branch_dec(). This function takes cpu_hotplug_lock, which can lead
to a deadlock if the allocation occurs during CPU bringup path as
_cpu_up() also takes the lock.
To prevent this deadlock, defer static_branch_dec() to a workqueue.
Call static_branch_dec() only when the workqueue is not yet initialized.
Workqueues are initialized before CPU bring up, so this will not conflict
with the first scenario.
Link: https://lkml.kernel.org/r/20250329171030.3942298-1-kirill.shutemov@linux.intel.com
Fixes: 55ad43e8ba0f ("mm: add a helper to accept page")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Srikanth Aithal <sraithal@amd.com>
Tested-by: Srikanth Aithal <sraithal@amd.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Ashish Kalra <ashish.kalra@amd.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
Cc: Thomas Lendacky <thomas.lendacky@amd.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The test robot identified c2f6ea38fc1b ("mm: page_alloc: don't steal
single pages from biggest buddy") as the root cause of a 56.4% regression
in vm-scalability::lru-file-mmap-read.
Carlos reports an earlier patch, c0cd6f557b90 ("mm: page_alloc: fix
freelist movement during block conversion"), as the root cause for a
regression in worst-case zone->lock+irqoff hold times.
Both of these patches modify the page allocator's fallback path to be less
greedy in an effort to stave off fragmentation. The flip side of this is
that fallbacks are also less productive each time around, which means the
fallback search can run much more frequently.
Carlos' traces point to rmqueue_bulk() specifically, which tries to refill
the percpu cache by allocating a large batch of pages in a loop. It
highlights how once the native freelists are exhausted, the fallback code
first scans orders top-down for whole blocks to claim, then falls back to
a bottom-up search for the smallest buddy to steal. For the next batch
page, it goes through the same thing again.
This can be made more efficient. Since rmqueue_bulk() holds the
zone->lock over the entire batch, the freelists are not subject to outside
changes; when the search for a block to claim has already failed, there is
no point in trying again for the next page.
Modify __rmqueue() to remember the last successful fallback mode, and
restart directly from there on the next rmqueue_bulk() iteration.
Oliver confirms that this improves beyond the regression that the test
robot reported against c2f6ea38fc1b:
commit:
f3b92176f4 ("tools/selftests: add guard region test for /proc/$pid/pagemap")
c2f6ea38fc ("mm: page_alloc: don't steal single pages from biggest buddy")
acc4d5ff0b ("Merge tag 'net-6.15-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net")
2c847f27c3 ("mm: page_alloc: speed up fallbacks in rmqueue_bulk()") <--- your patch
f3b92176f4f7100f c2f6ea38fc1b640aa7a2e155cc1 acc4d5ff0b61eb1715c498b6536 2c847f27c37da65a93d23c237c5
---------------- --------------------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev %change %stddev
\ | \ | \ | \
25525364 ± 3% -56.4% 11135467 -57.8% 10779336 +31.6% 33581409 vm-scalability.throughput
Carlos confirms that worst-case times are almost fully recovered
compared to before the earlier culprit patch:
2dd482ba627d (before freelist hygiene): 1ms
c0cd6f557b90 (after freelist hygiene): 90ms
next-20250319 (steal smallest buddy): 280ms
this patch : 8ms
[jackmanb@google.com: comment updates]
Link: https://lkml.kernel.org/r/D92AC0P9594X.3BML64MUKTF8Z@google.com
[hannes@cmpxchg.org: reset rmqueue_mode in rmqueue_buddy() error loop, per Yunsheng Lin]
Link: https://lkml.kernel.org/r/20250409140023.GA2313@cmpxchg.org
Link: https://lkml.kernel.org/r/20250407180154.63348-1-hannes@cmpxchg.org
Fixes: c0cd6f557b90 ("mm: page_alloc: fix freelist movement during block conversion")
Fixes: c2f6ea38fc1b ("mm: page_alloc: don't steal single pages from biggest buddy")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Brendan Jackman <jackmanb@google.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Reported-by: Carlos Song <carlos.song@nxp.com>
Tested-by: Carlos Song <carlos.song@nxp.com>
Tested-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202503271547.fc08b188-lkp@intel.com
Reviewed-by: Brendan Jackman <jackmanb@google.com>
Tested-by: Shivank Garg <shivankg@amd.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org> [6.10+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
spin_trylock followed by spin_lock will cause extra write cache access.
If the lock is contended it may cause unnecessary cache line bouncing and
will execute redundant irq restore/save pair. Therefore, check
alloc/fpi_flags first and use spin_trylock or spin_lock.
Link: https://lkml.kernel.org/r/20250331002809.94758-1-alexei.starovoitov@gmail.com
Fixes: 97769a53f117 ("mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Daniel Borkman <daniel@iogearbox.net>
Cc: Martin KaFai Lau <martin.lau@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull more MM updates from Andrew Morton:
- The series "mm: fixes for fallouts from mem_init() cleanup" from Mike
Rapoport fixes a couple of issues with the just-merged "arch, mm:
reduce code duplication in mem_init()" series
- The series "MAINTAINERS: add my isub-entries to MM part." from Mike
Rapoport does some maintenance on MAINTAINERS
- The series "remove tlb_remove_page_ptdesc()" from Qi Zheng does some
cleanup work to the page mapping code
- The series "mseal system mappings" from Jeff Xu permits sealing of
"system mappings", such as vdso, vvar, vvar_vclock, vectors (arm
compat-mode), sigpage (arm compat-mode)
- Plus the usual shower of singleton patches
* tag 'mm-stable-2025-04-02-22-07' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (31 commits)
mseal sysmap: add arch-support txt
mseal sysmap: enable s390
selftest: test system mappings are sealed
mseal sysmap: update mseal.rst
mseal sysmap: uprobe mapping
mseal sysmap: enable arm64
mseal sysmap: enable x86-64
mseal sysmap: generic vdso vvar mapping
selftests: x86: test_mremap_vdso: skip if vdso is msealed
mseal sysmap: kernel config and header change
mm: pgtable: remove tlb_remove_page_ptdesc()
x86: pgtable: convert to use tlb_remove_ptdesc()
riscv: pgtable: unconditionally use tlb_remove_ptdesc()
mm: pgtable: convert some architectures to use tlb_remove_ptdesc()
mm: pgtable: change pt parameter of tlb_remove_ptdesc() to struct ptdesc*
mm: pgtable: make generic tlb_remove_table() use struct ptdesc
microblaze/mm: put mm_cmdline_setup() in .init.text section
mm/memory_hotplug: fix call folio_test_large with tail page in do_migrate_range
MAINTAINERS: mm: add entry for secretmem
MAINTAINERS: mm: add entry for numa memblocks and numa emulation
...
|
|
Fix an obvious bug. try_alloc_pages() should set_page_refcounted.
[ Not so obvious: it was probably correct at the time it was written but
was at some point then rebased on top of v6.14-rc1.
And at that point there was a semantic conflict with commit
efabfe1420f5 ("mm/page_alloc: move set_page_refcounted() to callers
of get_page_from_freelist()") and became buggy.
- Linus ]
Fixes: 97769a53f117 ("mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil BAbka <vbabka@suse.cz>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This patch replaces the direct check for the __PG_HWPOISON flag with the
PageHWPoison() macro, improving code readability and maintaining
consistency with other parts of the memory management code.
Link: https://lkml.kernel.org/r/20250320063346.489030-1-ye.liu@linux.dev
Signed-off-by: Ye Liu <liuye@kylinos.cn>
Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Brendan points out that defrag_mode doesn't properly clear
ALLOC_NOFRAGMENT on its last-ditch attempt to allocate. But looking
closer, the problem is actually more severe: it doesn't actually *check*
whether it's already retried, and keeps looping. This means the OOM path
is never taken, and the thread can loop indefinitely.
This is verified with an intentional OOM test on defrag_mode=1, which
results in the machine hanging. After this patch, it triggers the OOM
kill reliably and recovers.
Clear ALLOC_NOFRAGMENT properly, and only retry once.
Link: https://lkml.kernel.org/r/20250401041231.GA2117727@cmpxchg.org
Fixes: e3aa7df331bc ("mm: page_alloc: defrag_mode")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- The series "Enable strict percpu address space checks" from Uros
Bizjak uses x86 named address space qualifiers to provide
compile-time checking of percpu area accesses.
This has caused a small amount of fallout - two or three issues were
reported. In all cases the calling code was found to be incorrect.
- The series "Some cleanup for memcg" from Chen Ridong implements some
relatively monir cleanups for the memcontrol code.
- The series "mm: fixes for device-exclusive entries (hmm)" from David
Hildenbrand fixes a boatload of issues which David found then using
device-exclusive PTE entries when THP is enabled. More work is
needed, but this makes thins better - our own HMM selftests now
succeed.
- The series "mm: zswap: remove z3fold and zbud" from Yosry Ahmed
remove the z3fold and zbud implementations. They have been deprecated
for half a year and nobody has complained.
- The series "mm: further simplify VMA merge operation" from Lorenzo
Stoakes implements numerous simplifications in this area. No runtime
effects are anticipated.
- The series "mm/madvise: remove redundant mmap_lock operations from
process_madvise()" from SeongJae Park rationalizes the locking in the
madvise() implementation. Performance gains of 20-25% were observed
in one MADV_DONTNEED microbenchmark.
- The series "Tiny cleanup and improvements about SWAP code" from
Baoquan He contains a number of touchups to issues which Baoquan
noticed when working on the swap code.
- The series "mm: kmemleak: Usability improvements" from Catalin
Marinas implements a couple of improvements to the kmemleak
user-visible output.
- The series "mm/damon/paddr: fix large folios access and schemes
handling" from Usama Arif provides a couple of fixes for DAMON's
handling of large folios.
- The series "mm/damon/core: fix wrong and/or useless damos_walk()
behaviors" from SeongJae Park fixes a few issues with the accuracy of
kdamond's walking of DAMON regions.
- The series "expose mapping wrprotect, fix fb_defio use" from Lorenzo
Stoakes changes the interaction between framebuffer deferred-io and
core MM. No functional changes are anticipated - this is preparatory
work for the future removal of page structure fields.
- The series "mm/damon: add support for hugepage_size DAMOS filter"
from Usama Arif adds a DAMOS filter which permits the filtering by
huge page sizes.
- The series "mm: permit guard regions for file-backed/shmem mappings"
from Lorenzo Stoakes extends the guard region feature from its
present "anon mappings only" state. The feature now covers shmem and
file-backed mappings.
- The series "mm: batched unmap lazyfree large folios during
reclamation" from Barry Song cleans up and speeds up the unmapping
for pte-mapped large folios.
- The series "reimplement per-vma lock as a refcount" from Suren
Baghdasaryan puts the vm_lock back into the vma. Our reasons for
pulling it out were largely bogus and that change made the code more
messy. This patchset provides small (0-10%) improvements on one
microbenchmark.
- The series "Docs/mm/damon: misc DAMOS filters documentation fixes and
improves" from SeongJae Park does some maintenance work on the DAMON
docs.
- The series "hugetlb/CMA improvements for large systems" from Frank
van der Linden addresses a pile of issues which have been observed
when using CMA on large machines.
- The series "mm/damon: introduce DAMOS filter type for unmapped pages"
from SeongJae Park enables users of DMAON/DAMOS to filter my the
page's mapped/unmapped status.
- The series "zsmalloc/zram: there be preemption" from Sergey
Senozhatsky teaches zram to run its compression and decompression
operations preemptibly.
- The series "selftests/mm: Some cleanups from trying to run them" from
Brendan Jackman fixes a pile of unrelated issues which Brendan
encountered while runnimg our selftests.
- The series "fs/proc/task_mmu: add guard region bit to pagemap" from
Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to
determine whether a particular page is a guard page.
- The series "mm, swap: remove swap slot cache" from Kairui Song
removes the swap slot cache from the allocation path - it simply
wasn't being effective.
- The series "mm: cleanups for device-exclusive entries (hmm)" from
David Hildenbrand implements a number of unrelated cleanups in this
code.
- The series "mm: Rework generic PTDUMP configs" from Anshuman Khandual
implements a number of preparatoty cleanups to the GENERIC_PTDUMP
Kconfig logic.
- The series "mm/damon: auto-tune aggregation interval" from SeongJae
Park implements a feedback-driven automatic tuning feature for
DAMON's aggregation interval tuning.
- The series "Fix lazy mmu mode" from Ryan Roberts fixes some issues in
powerpc, sparc and x86 lazy MMU implementations. Ryan did this in
preparation for implementing lazy mmu mode for arm64 to optimize
vmalloc.
- The series "mm/page_alloc: Some clarifications for migratetype
fallback" from Brendan Jackman reworks some commentary to make the
code easier to follow.
- The series "page_counter cleanup and size reduction" from Shakeel
Butt cleans up the page_counter code and fixes a size increase which
we accidentally added late last year.
- The series "Add a command line option that enables control of how
many threads should be used to allocate huge pages" from Thomas
Prescher does that. It allows the careful operator to significantly
reduce boot time by tuning the parallalization of huge page
initialization.
- The series "Fix calculations in trace_balance_dirty_pages() for cgwb"
from Tang Yizhou fixes the tracing output from the dirty page
balancing code.
- The series "mm/damon: make allow filters after reject filters useful
and intuitive" from SeongJae Park improves the handling of allow and
reject filters. Behaviour is made more consistent and the documention
is updated accordingly.
- The series "Switch zswap to object read/write APIs" from Yosry Ahmed
updates zswap to the new object read/write APIs and thus permits the
removal of some legacy code from zpool and zsmalloc.
- The series "Some trivial cleanups for shmem" from Baolin Wang does as
it claims.
- The series "fs/dax: Fix ZONE_DEVICE page reference counts" from
Alistair Popple regularizes the weird ZONE_DEVICE page refcount
handling in DAX, permittig the removal of a number of special-case
checks.
- The series "refactor mremap and fix bug" from Lorenzo Stoakes is a
preparatoty refactoring and cleanup of the mremap() code.
- The series "mm: MM owner tracking for large folios (!hugetlb) +
CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in
which we determine whether a large folio is known to be mapped
exclusively into a single MM.
- The series "mm/damon: add sysfs dirs for managing DAMOS filters based
on handling layers" from SeongJae Park adds a couple of new sysfs
directories to ease the management of DAMON/DAMOS filters.
- The series "arch, mm: reduce code duplication in mem_init()" from
Mike Rapoport consolidates many per-arch implementations of
mem_init() into code generic code, where that is practical.
- The series "mm/damon/sysfs: commit parameters online via
damon_call()" from SeongJae Park continues the cleaning up of sysfs
access to DAMON internal data.
- The series "mm: page_ext: Introduce new iteration API" from Luiz
Capitulino reworks the page_ext initialization to fix a boot-time
crash which was observed with an unusual combination of compile and
cmdline options.
- The series "Buddy allocator like (or non-uniform) folio split" from
Zi Yan reworks the code to split a folio into smaller folios. The
main benefit is lessened memory consumption: fewer post-split folios
are generated.
- The series "Minimize xa_node allocation during xarry split" from Zi
Yan reduces the number of xarray xa_nodes which are generated during
an xarray split.
- The series "drivers/base/memory: Two cleanups" from Gavin Shan
performs some maintenance work on the drivers/base/memory code.
- The series "Add tracepoints for lowmem reserves, watermarks and
totalreserve_pages" from Martin Liu adds some more tracepoints to the
page allocator code.
- The series "mm/madvise: cleanup requests validations and
classifications" from SeongJae Park cleans up some warts which
SeongJae observed during his earlier madvise work.
- The series "mm/hwpoison: Fix regressions in memory failure handling"
from Shuai Xue addresses two quite serious regressions which Shuai
has observed in the memory-failure implementation.
- The series "mm: reliable huge page allocator" from Johannes Weiner
makes huge page allocations cheaper and more reliable by reducing
fragmentation.
- The series "Minor memcg cleanups & prep for memdescs" from Matthew
Wilcox is preparatory work for the future implementation of memdescs.
- The series "track memory used by balloon drivers" from Nico Pache
introduces a way to track memory used by our various balloon drivers.
- The series "mm/damon: introduce DAMOS filter type for active pages"
from Nhat Pham permits users to filter for active/inactive pages,
separately for file and anon pages.
- The series "Adding Proactive Memory Reclaim Statistics" from Hao Jia
separates the proactive reclaim statistics from the direct reclaim
statistics.
- The series "mm/vmscan: don't try to reclaim hwpoison folio" from
Jinjiang Tu fixes our handling of hwpoisoned pages within the reclaim
code.
* tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (431 commits)
mm/page_alloc: remove unnecessary __maybe_unused in order_to_pindex()
x86/mm: restore early initialization of high_memory for 32-bits
mm/vmscan: don't try to reclaim hwpoison folio
mm/hwpoison: introduce folio_contain_hwpoisoned_page() helper
cgroup: docs: add pswpin and pswpout items in cgroup v2 doc
mm: vmscan: split proactive reclaim statistics from direct reclaim statistics
selftests/mm: speed up split_huge_page_test
selftests/mm: uffd-unit-tests support for hugepages > 2M
docs/mm/damon/design: document active DAMOS filter type
mm/damon: implement a new DAMOS filter type for active pages
fs/dax: don't disassociate zero page entries
MM documentation: add "Unaccepted" meminfo entry
selftests/mm: add commentary about 9pfs bugs
fork: use __vmalloc_node() for stack allocation
docs/mm: Physical Memory: Populate the "Zones" section
xen: balloon: update the NR_BALLOON_PAGES state
hv_balloon: update the NR_BALLOON_PAGES state
balloon_compaction: update the NR_BALLOON_PAGES state
meminfo: add a per node counter for balloon drivers
mm: remove references to folio in __memcg_kmem_uncharge_page()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Pull bpf try_alloc_pages() support from Alexei Starovoitov:
"The pull includes work from Sebastian, Vlastimil and myself with a lot
of help from Michal and Shakeel.
This is a first step towards making kmalloc reentrant to get rid of
slab wrappers: bpf_mem_alloc, kretprobe's objpool, etc. These patches
make page allocator safe from any context.
Vlastimil kicked off this effort at LSFMM 2024:
https://lwn.net/Articles/974138/
and we continued at LSFMM 2025:
https://lore.kernel.org/all/CAADnVQKfkGxudNUkcPJgwe3nTZ=xohnRshx9kLZBTmR_E1DFEg@mail.gmail.com/
Why:
SLAB wrappers bind memory to a particular subsystem making it
unavailable to the rest of the kernel. Some BPF maps in production
consume Gbytes of preallocated memory. Top 5 in Meta: 1.5G, 1.2G,
1.1G, 300M, 200M. Once we have kmalloc that works in any context BPF
map preallocation won't be necessary.
How:
Synchronous kmalloc/page alloc stack has multiple stages going from
fast to slow: cmpxchg16 -> slab_alloc -> new_slab -> alloc_pages ->
rmqueue_pcplist -> __rmqueue, where rmqueue_pcplist was already
relying on trylock.
This set changes rmqueue_bulk/rmqueue_buddy to attempt a trylock and
return ENOMEM if alloc_flags & ALLOC_TRYLOCK. It then wraps this
functionality into try_alloc_pages() helper. We make sure that the
logic is sane in PREEMPT_RT.
End result: try_alloc_pages()/free_pages_nolock() are safe to call
from any context.
try_kmalloc() for any context with similar trylock approach will
follow. It will use try_alloc_pages() when slab needs a new page.
Though such try_kmalloc/page_alloc() is an opportunistic allocator,
this design ensures that the probability of successful allocation of
small objects (up to one page in size) is high.
Even before we have try_kmalloc(), we already use try_alloc_pages() in
BPF arena implementation and it's going to be used more extensively in
BPF"
* tag 'bpf_try_alloc_pages' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
mm: Fix the flipped condition in gfpflags_allow_spinning()
bpf: Use try_alloc_pages() to allocate pages for bpf needs.
mm, bpf: Use memcg in try_alloc_pages().
memcg: Use trylock to access memcg stock_lock.
mm, bpf: Introduce free_pages_nolock()
mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
locking/local_lock: Introduce localtry_lock_t
|
|
The `movable` variable is always used when `CONFIG_TRANSPARENT_HUGEPAGE`
is enabled, so the `__maybe_unused` attribute is not necessary. This
patch removes it and keeps the variable declaration within the `#ifdef`
block for better clarity.
Link: https://lkml.kernel.org/r/20250319091726.401158-1-liuyerd@163.com
Signed-off-by: Liu Ye<liuye@kylinos.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The last argument to split_page_memcg() is now always 0, so remove it,
effectively reverting commit b8791381d7ed.
Link: https://lkml.kernel.org/r/20250314133617.138071-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Zi Yan <ziy@nvidia.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The previous patch added pageblock_order reclaim to kswapd/kcompactd,
which helps, but produces only one block at a time. Allocation stalls and
THP failure rates are still higher than they could be.
To adequately reflect ALLOC_NOFRAGMENT demand for pageblocks, change the
watermarking for kswapd & kcompactd: instead of targeting the high
watermark in order-0 pages and checking for one suitable block, simply
require that the high watermark is entirely met in pageblocks.
To this end, track the number of free pages within contiguous pageblocks,
then change pgdat_balanced() and compact_finished() to check watermarks
against this new value.
This further reduces THP latencies and allocation stalls, and improves THP
success rates against the previous patch:
DEFRAGMODE-ASYNC DEFRAGMODE-ASYNC-WMARKS
Hugealloc Time mean 34300.36 ( +0.00%) 28904.00 ( -15.73%)
Hugealloc Time stddev 36390.42 ( +0.00%) 33464.37 ( -8.04%)
Kbuild Real time 196.13 ( +0.00%) 196.59 ( +0.23%)
Kbuild User time 1234.74 ( +0.00%) 1231.67 ( -0.25%)
Kbuild System time 62.62 ( +0.00%) 59.10 ( -5.54%)
THP fault alloc 57054.53 ( +0.00%) 63223.67 ( +10.81%)
THP fault fallback 11581.40 ( +0.00%) 5412.47 ( -53.26%)
Direct compact fail 107.80 ( +0.00%) 59.07 ( -44.79%)
Direct compact success 4.53 ( +0.00%) 2.80 ( -31.33%)
Direct compact success rate % 3.20 ( +0.00%) 3.99 ( +18.66%)
Compact daemon scanned migrate 5461033.93 ( +0.00%) 2267500.33 ( -58.48%)
Compact daemon scanned free 5824897.93 ( +0.00%) 2339773.00 ( -59.83%)
Compact direct scanned migrate 58336.93 ( +0.00%) 47659.93 ( -18.30%)
Compact direct scanned free 32791.87 ( +0.00%) 40729.67 ( +24.21%)
Compact total migrate scanned 5519370.87 ( +0.00%) 2315160.27 ( -58.05%)
Compact total free scanned 5857689.80 ( +0.00%) 2380502.67 ( -59.36%)
Alloc stall 2424.60 ( +0.00%) 638.87 ( -73.62%)
Pages kswapd scanned 2657018.33 ( +0.00%) 4002186.33 ( +50.63%)
Pages kswapd reclaimed 559583.07 ( +0.00%) 718577.80 ( +28.41%)
Pages direct scanned 722094.07 ( +0.00%) 355172.73 ( -50.81%)
Pages direct reclaimed 107257.80 ( +0.00%) 31162.80 ( -70.95%)
Pages total scanned 3379112.40 ( +0.00%) 4357359.07 ( +28.95%)
Pages total reclaimed 666840.87 ( +0.00%) 749740.60 ( +12.43%)
Swap out 77238.20 ( +0.00%) 110084.33 ( +42.53%)
Swap in 11712.80 ( +0.00%) 24457.00 ( +108.80%)
File refaults 143438.80 ( +0.00%) 188226.93 ( +31.22%)
Also of note is that compaction work overall is reduced. The reason for
this is that when free pageblocks are more readily available, allocations
are also much more likely to get physically placed in LRU order, instead
of being forced to scavenge free space here and there. This means that
reclaim by itself has better chances of freeing up whole blocks, and the
system relies less on compaction.
Comparing all changes to the vanilla kernel:
VANILLA DEFRAGMODE-ASYNC-WMARKS
Hugealloc Time mean 52739.45 ( +0.00%) 28904.00 ( -45.19%)
Hugealloc Time stddev 56541.26 ( +0.00%) 33464.37 ( -40.81%)
Kbuild Real time 197.47 ( +0.00%) 196.59 ( -0.44%)
Kbuild User time 1240.49 ( +0.00%) 1231.67 ( -0.71%)
Kbuild System time 70.08 ( +0.00%) 59.10 ( -15.45%)
THP fault alloc 46727.07 ( +0.00%) 63223.67 ( +35.30%)
THP fault fallback 21910.60 ( +0.00%) 5412.47 ( -75.29%)
Direct compact fail 195.80 ( +0.00%) 59.07 ( -69.48%)
Direct compact success 7.93 ( +0.00%) 2.80 ( -57.46%)
Direct compact success rate % 3.51 ( +0.00%) 3.99 ( +10.49%)
Compact daemon scanned migrate 3369601.27 ( +0.00%) 2267500.33 ( -32.71%)
Compact daemon scanned free 5075474.47 ( +0.00%) 2339773.00 ( -53.90%)
Compact direct scanned migrate 161787.27 ( +0.00%) 47659.93 ( -70.54%)
Compact direct scanned free 163467.53 ( +0.00%) 40729.67 ( -75.08%)
Compact total migrate scanned 3531388.53 ( +0.00%) 2315160.27 ( -34.44%)
Compact total free scanned 5238942.00 ( +0.00%) 2380502.67 ( -54.56%)
Alloc stall 2371.07 ( +0.00%) 638.87 ( -73.02%)
Pages kswapd scanned 2160926.73 ( +0.00%) 4002186.33 ( +85.21%)
Pages kswapd reclaimed 533191.07 ( +0.00%) 718577.80 ( +34.77%)
Pages direct scanned 400450.33 ( +0.00%) 355172.73 ( -11.31%)
Pages direct reclaimed 94441.73 ( +0.00%) 31162.80 ( -67.00%)
Pages total scanned 2561377.07 ( +0.00%) 4357359.07 ( +70.12%)
Pages total reclaimed 627632.80 ( +0.00%) 749740.60 ( +19.46%)
Swap out 47959.53 ( +0.00%) 110084.33 ( +129.53%)
Swap in 7276.00 ( +0.00%) 24457.00 ( +236.10%)
File refaults 138043.00 ( +0.00%) 188226.93 ( +36.35%)
THP allocation latencies and %sys time are down dramatically.
THP allocation failures are down from nearly 50% to 8.5%. And to recall
previous data points, the success rates are steady and reliable without
the cumulative deterioration of fragmentation events.
Compaction work is down overall. Direct compaction work especially is
drastically reduced. As an aside, its success rate of 4% indicates there
is room for improvement. For now it's good to rely on it less.
Reclaim work is up overall, however direct reclaim work is down. Part of
the increase can be attributed to a higher use of THPs, which due to
internal fragmentation increase the memory footprint. This is not
necessarily an unexpected side-effect for users of THP.
However, taken both points together, there may well be some opportunities
for fine tuning in the reclaim/compaction coordination.
[hannes@cmpxchg.org: fix squawks from rebasing]
Link: https://lkml.kernel.org/r/20250314210558.GD1316033@cmpxchg.org
Link: https://lkml.kernel.org/r/20250313210647.1314586-6-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When defrag_mode is enabled, allocation fallbacks strongly prefer whole
block conversions instead of polluting or stealing partially used blocks.
This means there is a demand for pageblocks even from sub-block requests.
Let kswapd/kcompactd help produce them.
By the time kswapd gets woken up, normal rmqueue and block conversion
fallbacks have been attempted and failed. So always wake kswapd with the
block order; it will take care of producing a suitable compaction gap and
then chain-wake kcompactd with the block order when its done.
VANILLA DEFRAGMODE-ASYNC
Hugealloc Time mean 52739.45 ( +0.00%) 34300.36 ( -34.96%)
Hugealloc Time stddev 56541.26 ( +0.00%) 36390.42 ( -35.64%)
Kbuild Real time 197.47 ( +0.00%) 196.13 ( -0.67%)
Kbuild User time 1240.49 ( +0.00%) 1234.74 ( -0.46%)
Kbuild System time 70.08 ( +0.00%) 62.62 ( -10.50%)
THP fault alloc 46727.07 ( +0.00%) 57054.53 ( +22.10%)
THP fault fallback 21910.60 ( +0.00%) 11581.40 ( -47.14%)
Direct compact fail 195.80 ( +0.00%) 107.80 ( -44.72%)
Direct compact success 7.93 ( +0.00%) 4.53 ( -38.06%)
Direct compact success rate % 3.51 ( +0.00%) 3.20 ( -6.89%)
Compact daemon scanned migrate 3369601.27 ( +0.00%) 5461033.93 ( +62.07%)
Compact daemon scanned free 5075474.47 ( +0.00%) 5824897.93 ( +14.77%)
Compact direct scanned migrate 161787.27 ( +0.00%) 58336.93 ( -63.94%)
Compact direct scanned free 163467.53 ( +0.00%) 32791.87 ( -79.94%)
Compact total migrate scanned 3531388.53 ( +0.00%) 5519370.87 ( +56.29%)
Compact total free scanned 5238942.00 ( +0.00%) 5857689.80 ( +11.81%)
Alloc stall 2371.07 ( +0.00%) 2424.60 ( +2.26%)
Pages kswapd scanned 2160926.73 ( +0.00%) 2657018.33 ( +22.96%)
Pages kswapd reclaimed 533191.07 ( +0.00%) 559583.07 ( +4.95%)
Pages direct scanned 400450.33 ( +0.00%) 722094.07 ( +80.32%)
Pages direct reclaimed 94441.73 ( +0.00%) 107257.80 ( +13.57%)
Pages total scanned 2561377.07 ( +0.00%) 3379112.40 ( +31.93%)
Pages total reclaimed 627632.80 ( +0.00%) 666840.87 ( +6.25%)
Swap out 47959.53 ( +0.00%) 77238.20 ( +61.05%)
Swap in 7276.00 ( +0.00%) 11712.80 ( +60.97%)
File refaults 138043.00 ( +0.00%) 143438.80 ( +3.91%)
With this patch, defrag_mode=1 beats the vanilla kernel in THP success
rates and allocation latencies. The trend holds over time:
thp_fault_alloc
VANILLA DEFRAGMODE-ASYNC
61988 52066
56474 58844
57258 58233
50187 58476
52388 54516
55409 59938
52925 57204
47648 60238
43669 55733
40621 56211
36077 59861
41721 57771
36685 58579
34641 51868
33215 56280
DEFRAGMODE-ASYNC also wins on %sys as ~3/4 of the direct compaction work
is shifted to kcompactd.
Reclaim activity is higher. Part of that is simply due to the increased
memory footprint from higher THP use. The other aspect is that *direct*
reclaim/compaction are still going for requested orders rather than
targeting the page blocks required for fallbacks, which is less efficient
than it could be. However, this is already a useful tradeoff to make, as
in many environments peak periods are short and retaining the ability to
produce THP through them is more important.
Link: https://lkml.kernel.org/r/20250313210647.1314586-5-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The page allocator groups requests by migratetype to stave off
fragmentation. However, in practice this is routinely defeated by the
fact that it gives up *before* invoking reclaim and compaction - which may
well produce suitable pages. As a result, fragmentation of physical
memory is a common ongoing process in many load scenarios.
Fragmentation deteriorates compaction's ability to produce huge pages.
Depending on the lifetime of the fragmenting allocations, those effects
can be long-lasting or even permanent, requiring drastic measures like
forcible idle states or even reboots as the only reliable ways to recover
the address space for THP production.
In a kernel build test with supplemental THP pressure, the THP allocation
rate steadily declines over 15 runs:
thp_fault_alloc
61988
56474
57258
50187
52388
55409
52925
47648
43669
40621
36077
41721
36685
34641
33215
This is a hurdle in adopting THP in any environment where hosts are shared
between multiple overlapping workloads (cloud environments), and rarely
experience true idle periods. To make THP a reliable and predictable
optimization, there needs to be a stronger guarantee to avoid such
fragmentation.
Introduce defrag_mode. When enabled, reclaim/compaction is invoked to its
full extent *before* falling back. Specifically, ALLOC_NOFRAGMENT is
enforced on the allocator fastpath and the reclaiming slowpath.
For now, fallbacks are permitted to avert OOMs. There is a plan to add
defrag_mode=2 to prefer OOMs over fragmentation, but this requires
additional prep work in compaction and the reserve management to make it
ready for all possible allocation contexts.
The following test results are from a kernel build with periodic bursts of
THP allocations, over 15 runs:
vanilla defrag_mode=1
@claimer[unmovable]: 189 103
@claimer[movable]: 92 103
@claimer[reclaimable]: 207 61
@pollute[unmovable from movable]: 25 0
@pollute[unmovable from reclaimable]: 28 0
@pollute[movable from unmovable]: 38835 0
@pollute[movable from reclaimable]: 147136 0
@pollute[reclaimable from unmovable]: 178 0
@pollute[reclaimable from movable]: 33 0
@steal[unmovable from movable]: 11 0
@steal[unmovable from reclaimable]: 5 0
@steal[reclaimable from unmovable]: 107 0
@steal[reclaimable from movable]: 90 0
@steal[movable from reclaimable]: 354 0
@steal[movable from unmovable]: 130 0
Both types of polluting fallbacks are eliminated in this workload.
Interestingly, whole block conversions are reduced as well. This is
because once a block is claimed for a type, its empty space remains
available for future allocations, instead of being padded with fallbacks;
this allows the native type to group up instead of spreading out to new
blocks. The assumption in the allocator has been that pollution from
movable allocations is less harmful than from other types, since they can
be reclaimed or migrated out should the space be needed. However, since
fallbacks occur *before* reclaim/compaction is invoked, movable pollution
will still cause non-movable allocations to spread out and claim more
blocks.
Without fragmentation, THP rates hold steady with defrag_mode=1:
thp_fault_alloc
32478
20725
45045
32130
14018
21711
40791
29134
34458
45381
28305
17265
22584
28454
30850
While the downward trend is eliminated, the keen reader will of course
notice that the baseline rate is much smaller than the vanilla kernel's to
begin with. This is due to deficiencies in how reclaim and compaction are
currently driven: ALLOC_NOFRAGMENT increases the extent to which smaller
allocations are competing with THPs for pageblocks, while making no effort
themselves to reclaim or compact beyond their own request size. This
effect already exists with the current usage of ALLOC_NOFRAGMENT, but is
amplified by defrag_mode insisting on whole block stealing much more
strongly.
Subsequent patches will address defrag_mode reclaim strategy to raise the
THP success baseline above the vanilla kernel.
Link: https://lkml.kernel.org/r/20250313210647.1314586-4-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When the page allocator places pages of a certain migratetype into blocks
of another type, it has lasting effects on the ability to compact and
defragment down the line. For improving placement and compaction,
visibility into such events is crucial.
The most common case, allocator fallbacks, is already annotated, but
compaction capturing is also allowed to grab pages of a different type.
Extend the tracepoint to cover this case.
Link: https://lkml.kernel.org/r/20250313210647.1314586-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Zi Yan <ziy@nvidia.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This commit introduces a new trace event,
`mm_calculate_totalreserve_pages`, which reports the new reserve value at
the exact time when it takes effect.
The `totalreserve_pages` value represents the total amount of memory
reserved across all zones and nodes in the system. This reserved memory
is crucial for ensuring that critical kernel operations have access to
sufficient memory, even under memory pressure.
By tracing the `totalreserve_pages` value, developers can gain insights
that how the total reserved memory changes over time.
Link: https://lkml.kernel.org/r/20250308034606.2036033-4-liumartin@google.com
Signed-off-by: Martin Liu <liumarti |