summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
2024-05-11mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_wpOscar Salvador1-1/+1
commit 1cb9dc4b475c ("mm: hwpoison: support recovery from HugePage copy-on-write faults") added support to use the mc variants when coping hugetlb pages on CoW faults. Add the missing VM_FAULT_SET_HINDEX, so the right si_addr_lsb will be passed to userspace to report the extension of the faulty area. Link: https://lkml.kernel.org/r/20240509100148.22384-3-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Peter Xu <peterx@redhat.com> Acked-by: Axel Rasmussen <axelrasmussen@google.com> Cc: Liu Shixin <liushixin2@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-11mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_faultOscar Salvador1-1/+2
Patch series "Minor fixups for hugetlb fault path". This series contains a couple of fixups for hugetlb_fault and hugetlb_wp respectively, where a VM_FAULT_SET_HINDEX call was missing. I did not bother with a Fixes tag because the missing piece here is that we will not report to userspace the right extension of the faulty area by adjusting struct kernel_siginfo.si_addr_lsb, but I do not consider that to be a big issue because I assume that userspace already knows the size of the mapping anyway. This patch (of 2): commit af19487f00f3 ("mm: make PTE_MARKER_SWAPIN_ERROR more general") added the code to handle pte_markers in hugetlb faulting path. In case of an UFFD_POISON event, a PTE_MARKER_POISONED will be created and we will return VM_FAULT_HWPOISON_LARGE upon detecting that in the fault path. Add the missing VM_FAULT_SET_HINDEX, so the right si_addr_lsb will be passed to userspace to report the extension of the faulty area. Link: https://lkml.kernel.org/r/20240509100148.22384-1-osalvador@suse.de Link: https://lkml.kernel.org/r/20240509100148.22384-2-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Peter Xu <peterx@redhat.com> Acked-by: Axel Rasmussen <axelrasmussen@google.com> Cc: Liu Shixin <liushixin2@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-11mm: memcg: make alloc_mem_cgroup_per_node_info() return boolXiu Jianfeng1-5/+5
alloc_mem_cgroup_per_node_info() returns int that doesn't map to any errno error code. The only existing caller doesn't really need an error code so change the function to return bool (true on success) because this is slightly less confusing and more consistent with the other code. Link: https://lkml.kernel.org/r/20240507132324.1158510-1-xiujianfeng@huawei.com Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-11mm/damon/core: fix return value from damos_wmark_metric_valueAlex Rusuf1-4/+5
damos_wmark_metric_value's return value is 'unsigned long', so returning -EINVAL as 'unsigned long' may turn out to be very different from the expected one (using 2's complement) and treat as usual matric's value. So, fix that, checking if returned value is not 0. Link: https://lkml.kernel.org/r/20240506180238.53842-1-sj@kernel.org Fixes: ee801b7dd782 ("mm/damon/schemes: activate schemes based on a watermarks mechanism") Signed-off-by: Alex Rusuf <yorha.op@gmail.com> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-11mm: do not update memcg stats for NR_{FILE/SHMEM}_PMDMAPPEDYosry Ahmed1-6/+9
Previously, all NR_VM_EVENT_ITEMS stats were maintained per-memcg, although some of those fields are not exposed anywhere. Commit 14e0f6c957e39 ("memcg: reduce memory for the lruvec and memcg stats") changed this such that we only maintain the stats we actually expose per-memcg via a translation table. Additionally, commit 514462bbe927b ("memcg: warn for unexpected events and stats") added a warning if a per-memcg stat update is attempted for a stat that is not in the translation table. The warning started firing for the NR_{FILE/SHMEM}_PMDMAPPED stat updates in the rmap code. These stats are not maintained per-memcg, and hence are not in the translation table. Do not use __lruvec_stat_mod_folio() when updating NR_FILE_PMDMAPPED and NR_SHMEM_PMDMAPPED. Use __mod_node_page_state() instead, which updates the global per-node stats only. Link: https://lkml.kernel.org/r/20240506192924.271999-1-yosryahmed@google.com Fixes: 514462bbe927 ("memcg: warn for unexpected events and stats") Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Reported-by: syzbot+9319a4268a640e26b72b@syzkaller.appspotmail.com Closes: https://lore.kernel.org/lkml/0000000000001b9d500617c8b23c@google.com Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-11mm/damon/core: initialize ->esz_bp from damos_quota_init_priv()SeongJae Park1-0/+1
Patch series "mm/damon: misc fixes and improvements". Add miscelleneous and non-urgent fixes and improvements for DAMON code, selftests, and documents. This patch (of 10): damos_quota_init_priv() function should initialize all private fields of struct damos_quota. However, it is not initializing ->esz_bp field. This could result in use of uninitialized variable from damon_feed_loop_next_input() function. There is no such issue at the moment because every caller of the function is passing damos_quota object that already having the field zero value. But we cannot guarantee the future, and the function is not doing what it is promising. A bug is a bug. This fix is for preventing possible future issues. Link: https://lkml.kernel.org/r/20240503180318.72798-1-sj@kernel.org Link: https://lkml.kernel.org/r/20240503180318.72798-2-sj@kernel.org Fixes: 9294a037c015 ("mm/damon/core: implement goal-oriented feedback-driven quota auto-tuning") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-10Merge tag 'mm-hotfixes-stable-2024-05-10-13-14' of ↵Linus Torvalds3-3/+7
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM fixes from Andrew Morton: "18 hotfixes, 7 of which are cc:stable. More fixups for this cycle's page_owner updates. And a few userfaultfd fixes. Otherwise, random singletons - see the individual changelogs for details" * tag 'mm-hotfixes-stable-2024-05-10-13-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mailmap: add entry for Barry Song selftests/mm: fix powerpc ARCH check mailmap: add entry for John Garry XArray: set the marks correctly when splitting an entry selftests/vDSO: fix runtime errors on LoongArch selftests/vDSO: fix building errors on LoongArch mm,page_owner: don't remove __GFP_NOLOCKDEP in add_stack_record_to_list fs/proc/task_mmu: fix uffd-wp confusion in pagemap_scan_pmd_entry() fs/proc/task_mmu: fix loss of young/dirty bits during pagemap scan mm/vmalloc: fix return value of vb_alloc if size is 0 mm: use memalloc_nofs_save() in page_cache_ra_order() kmsan: compiler_types: declare __no_sanitize_or_inline lib/test_xarray.c: fix error assumptions on check_xa_multi_store_adv_add() tools: fix userspace compilation with new test_xarray changes MAINTAINERS: update URL's for KEYS/KEYRINGS_INTEGRITY and TPM DEVICE DRIVER mm: page_owner: fix wrong information in dump_page_owner maple_tree: fix mas_empty_area_rev() null pointer dereference mm/userfaultfd: reset ptes when close() for wr-protected ones
2024-05-10Merge tag 'loongarch-kvm-6.10' of ↵Paolo Bonzini9-170/+206
git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD LoongArch KVM changes for v6.10 1. Add ParaVirt IPI support. 2. Add software breakpoint support. 3. Add mmio trace events support.
2024-05-07thp: remove HPAGE_PMD_ORDER minimum assertionMatthew Wilcox (Oracle)1-5/+0
We now handle order-1 folios correctly, so we don't need this assertion any more. Link: https://lkml.kernel.org/r/20240429190114.3126789-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07mm/vmscan: remove ignore_references argument of reclaim_folio_list()SeongJae Park1-6/+4
All reclaim_folio_list() callers are passing 'true' for 'ignore_references' parameter. In other words, the parameter is not really being used. Simplify the code by removing the parameter. Link: https://lkml.kernel.org/r/20240429224451.67081-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07mm/vmscan: remove ignore_references argument of reclaim_pages()SeongJae Park4-7/+7
All reclaim_pages() callers are setting 'ignore_references' parameter 'true'. In other words, the parameter is not really being used. Remove the argument to make it simple. Link: https://lkml.kernel.org/r/20240429224451.67081-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07mm/damon/paddr: do page level access check for pageout DAMOS action on its ownSeongJae Park1-4/+12
'pageout' DAMOS action implementation of 'paddr' DAMON operations set asks reclaim_pages() to do page level access check if the user is not asking DAMOS to do that on its own. Simplify the logic by making the check always be done by 'paddr'. Link: https://lkml.kernel.org/r/20240429224451.67081-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07mm/damon/paddr: avoid unnecessary page level access check for pageout DAMOS ↵SeongJae Park1-1/+11
action Patch series "mm/damon/paddr: simplify page level access re-check for pageout. The 'pageout' DAMOS action implementation of 'paddr' asks reclaim_pages() to do page level access check again. But the user can ask 'paddr' to do the page level access check on its own, using DAMOS filter of 'young page' type. Meanwhile, 'paddr' is the only user of reclaim_pages() that asks the page level access check. Make 'paddr' does the page level access check on its own always, and simplify reclaim_pages() by removing the page level access check request handling logic. As a result of the change for reclaim_pages(), reclaim_folio_list(), which is called by reclaim_pages(), also no more need to do the page level access check. Simplify the function, too. This patch (of 4): 'pageout' DAMOS action implementation of 'paddr' asks reclaim_pages() to do the page level access check. User could ask DAMOS to do the page level access check on its own using 'young page' type DAMOS filter. In the case, pageout DAMOS action unnecessarily asks reclaim_pages() to do the check again. Ask the page level access check only if the scheme is not having the filter. Link: https://lkml.kernel.org/r/20240429224451.67081-1-sj@kernel.org Link: https://lkml.kernel.org/r/20240429224451.67081-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07mm/gup: fix hugepd handling in hugetlb reworkPeter Xu1-25/+39
Commit a12083d721d7 added hugepd handling for gup-slow, reusing gup-fast functions. follow_hugepd() correctly took the vma pointer in, however didn't pass it over into the lower functions, which was overlooked. The issue is gup_fast_hugepte() uses the vma pointer to make the correct decision on whether an unshare is needed for a FOLL_PIN|FOLL_LONGTERM. Now without vma ponter it will constantly return "true" (needs an unshare) for a page cache, even though in the SHARED case it will be wrong to unshare. The other problem is, even if an unshare is needed, it now returns 0 rather than -EMLINK, which will not trigger a follow up FAULT_FLAG_UNSHARE fault. That will need to be fixed too when the unshare is wanted. gup_longterm test didn't expose this issue in the past because it didn't yet test R/O unshare in this case, another separate patch will enable that in future tests. Fix it by passing vma correctly to the bottom, rename gup_fast_hugepte() back to gup_hugepte() as it is shared between the fast/slow paths, and also allow -EMLINK to be returned properly by gup_hugepte() even though gup-fast will take it the same as zero. Link: https://lkml.kernel.org/r/20240430131303.264331-1-peterx@redhat.com Fixes: a12083d721d7 ("mm/gup: handle hugepd for follow_page()") Signed-off-by: Peter Xu <peterx@redhat.com> Reported-by: David Hildenbrand <david@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07mm/hugetlb: align cma on allocation order, not demotion orderFrank van der Linden1-1/+1
Align the CMA area for hugetlb gigantic pages to their size, not the size that they can be demoted to. Otherwise there might be misaligned sections at the start and end of the CMA area that will never be used for hugetlb page allocations. Link: https://lkml.kernel.org/r/20240430161437.2100295-1-fvdl@google.com Fixes: a01f43901cfb ("hugetlb: be sure to free demoted CMA pages to CMA") Signed-off-by: Frank van der Linden <fvdl@google.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07mm: memcg: use READ_ONCE()/WRITE_ONCE() to access stock->nr_pagesBreno Leitao1-9/+15
A memcg pointer in the per-cpu stock can be accessed by drain_all_stock() and consume_stock() in parallel, causing a potential race, which is believed to e harmless. KCSAN shows this data-race clearly in the splat below: BUG: KCSAN: data-race in drain_all_stock.part.0 / try_charge_memcg write to 0xffff88903f8b0788 of 4 bytes by task 35901 on cpu 2: try_charge_memcg (mm/memcontrol.c:2323 mm/memcontrol.c:2746) __mem_cgroup_charge (mm/memcontrol.c:7287 mm/memcontrol.c:7301) do_anonymous_page (mm/memory.c:1054 mm/memory.c:4375 mm/memory.c:4433) __handle_mm_fault (mm/memory.c:3878 mm/memory.c:5300 mm/memory.c:5441) handle_mm_fault (mm/memory.c:5606) do_user_addr_fault (arch/x86/mm/fault.c:1363) exc_page_fault (./arch/x86/include/asm/irqflags.h:37 ./arch/x86/include/asm/irqflags.h:72 arch/x86/mm/fault.c:1513 arch/x86/mm/fault.c:1563) asm_exc_page_fault (./arch/x86/include/asm/idtentry.h:623) read to 0xffff88903f8b0788 of 4 bytes by task 287 on cpu 27: drain_all_stock.part.0 (mm/memcontrol.c:2433) mem_cgroup_css_offline (mm/memcontrol.c:5398 mm/memcontrol.c:5687) css_killed_work_fn (kernel/cgroup/cgroup.c:5521 kernel/cgroup/cgroup.c:5794) process_one_work (kernel/workqueue.c:3254) worker_thread (kernel/workqueue.c:3329 kernel/workqueue.c:3416) kthread (kernel/kthread.c:388) ret_from_fork (arch/x86/kernel/process.c:147) ret_from_fork_asm (arch/x86/entry/entry_64.S:257) value changed: 0x00000014 -> 0x00000013 This happens because drain_all_stock() is reading stock->nr_pages, while consume_stock() might be updating the same address, causing a potential data-race. Make the shared addresses bulletproof regarding to reads and writes, similarly to what stock->cached_objcg and stock->cached. Annotate all accesses to stock->nr_pages with READ_ONCE()/WRITE_ONCE(). Link: https://lkml.kernel.org/r/20240501095420.679208-1-leitao@debian.org Signed-off-by: Breno Leitao <leitao@debian.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07mm: fix race between __split_huge_pmd_locked() and GUP-fastRyan Roberts2-23/+28
__split_huge_pmd_locked() can be called for a present THP, devmap or (non-present) migration entry. It calls pmdp_invalidate() unconditionally on the pmdp and only determines if it is present or not based on the returned old pmd. This is a problem for the migration entry case because pmd_mkinvalid(), called by pmdp_invalidate() must only be called for a present pmd. On arm64 at least, pmd_mkinvalid() will mark the pmd such that any future call to pmd_present() will return true. And therefore any lockless pgtable walker could see the migration entry pmd in this state and start interpretting the fields as if it were present, leading to BadThings (TM). GUP-fast appears to be one such lockless pgtable walker. x86 does not suffer the above problem, but instead pmd_mkinvalid() will corrupt the offset field of the swap entry within the swap pte. See link below for discussion of that problem. Fix all of this by only calling pmdp_invalidate() for a present pmd. And for good measure let's add a warning to all implementations of pmdp_invalidate[_ad](). I've manually reviewed all other pmdp_invalidate[_ad]() call sites and believe all others to be conformant. This is a theoretical bug found during code review. I don't have any test case to trigger it in practice. Link: https://lkml.kernel.org/r/20240501143310.1381675-1-ryan.roberts@arm.com Link: https://lore.kernel.org/all/0dd7827a-6334-439a-8fd0-43c98e6af22b@arm.com/ Fixes: 84c3fc4e9c56 ("mm: thp: check pmd migration entry in common path") Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07mm/debug_vm_pgtable: test pmd_leaf() behavior with pmd_mkinvalid()Ryan Roberts1-0/+1
An invalidated pmd should still cause pmd_leaf() to return true. Let's test for that to ensure all arches remain consistent. Link: https://lkml.kernel.org/r/20240501144439.1389048-1-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07memcg: use proper type for mod_memcg_stateShakeel Butt1-1/+2
The memcg stats update functions can take arbitrary integer but the only input which make sense is enum memcg_stat_item and we don't want these functions to be called with arbitrary integer, so replace the parameter type with enum memcg_stat_item and compiler will be able to warn if memcg stat update functions are called with incorrect index value. Link: https://lkml.kernel.org/r/20240501172617.678560-9-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: T.J. Mercier <tjmercier@google.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07memcg: warn for unexpected events and statsShakeel Butt1-16/+23
To reduce memory usage by the memcg events and stats, the kernel uses indirection table and only allocate stats and events which are being used by the memcg code. To make this more robust, let's add warnings where unexpected stats and events indexes are used. Link: https://lkml.kernel.org/r/20240501172617.678560-8-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: T.J. Mercier <tjmercier@google.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07mm: cleanup WORKINGSET_NODES in workingsetShakeel Butt1-3/+4
WORKINGSET_NODES is not exposed in the memcg stats and thus there is no need to use the memcg specific stat update functions for it. In future if we decide to expose WORKINGSET_NODES in the memcg stats, we can revert this patch. Link: https://lkml.kernel.org/r/20240501172617.678560-7-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: T.J. Mercier <tjmercier@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07memcg: cleanup __mod_memcg_lruvec_stateShakeel Butt1-2/+0
There are no memcg specific stats for NR_SHMEM_PMDMAPPED and NR_FILE_PMDMAPPED. Let's remove them. Link: https://lkml.kernel.org/r/20240501172617.678560-6-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: T.J. Mercier <tjmercier@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07memcg: reduce memory for the lruvec and memcg statsShakeel Butt1-20/+114
At the moment, the amount of memory allocated for stats related structs in the mem_cgroup corresponds to the size of enum node_stat_item. However not all fields in enum node_stat_item have corresponding memcg stats. So, let's use indirection mechanism similar to the one used for memcg vmstats management. For a given x86_64 config, the size of stats with and without patch is: structs size in bytes w/o with struct lruvec_stats 1128 648 struct lruvec_stats_percpu 752 432 struct memcg_vmstats 1832 1352 struct memcg_vmstats_percpu 1280 960 The memory savings are further compounded by the fact that these structs are allocated for each cpu and for each node. To be precise, for each memcg the memory saved would be: Memory saved = ((21 * 3 * NR_NODES) + (21 * 2 * NR_NODES * NR_CPUS) + (21 * 3) + (21 * 2 * NR_CPUS)) * sizeof(long) Where 21 is the number of fields eliminated. Link: https://lkml.kernel.org/r/20240501172617.678560-5-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: T.J. Mercier <tjmercier@google.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07mm: memcg: account memory used for memcg vmstats and lruvec statsRoman Gushchin1-3/+4
The percpu memory used by memcg's memory statistics is already accounted. For consistency, let's enable accounting for vmstats and lruvec stats as well. Link: https://lkml.kernel.org/r/20240501172617.678560-4-shakeel.butt@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: T.J. Mercier <tjmercier@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07memcg: dynamically allocate lruvec_statsShakeel Butt1-12/+75
To decouple the dependency of lruvec_stats on NR_VM_NODE_STAT_ITEMS, we need to dynamically allocate lruvec_stats in the mem_cgroup_per_node structure. Also move the definition of lruvec_stats_percpu and lruvec_stats and related functions to the memcontrol.c to facilitate later patches. No functional changes in the patch. Link: https://lkml.kernel.org/r/20240501172617.678560-3-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: T.J. Mercier <tjmercier@google.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07memcg: reduce memory size of mem_cgroup_events_indexShakeel Butt1-2/+4
Patch series "memcg: reduce memory consumption by memcg stats", v4. Most of the memory overhead of a memcg object is due to memcg stats maintained by the kernel. Since stats updates happen in performance critical codepaths, the stats are maintained per-cpu and numa specific stats are maintained per-node * per-cpu. This drastically increase the overhead on large machines i.e. large of CPUs and multiple numa nodes. This patch series tries to reduce the overhead by at least not allocating the memory for stats which are not memcg specific. This patch (of 8): mem_cgroup_events_index is a translation table to get the right index of the memcg relevant entry for the general vm_event_item. At the moment, it is defined as integer array. However on a typical system the max entry of vm_event_item (NR_VM_EVENT_ITEMS) is 113, so we don't need to use int as storage type of the array. For now just use int8_t as type and add a BUILD_BUG_ON(). Another benefit of this change is that the translation table fits in 2 cachelines while previously it would require 8 cachelines (assuming 64 bytes cacheline). Link: https://lkml.kernel.org/r/20240501172617.678560-1-shakeel.butt@linux.dev Link: https://lkml.kernel.org/r/20240501172617.678560-2-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: T.J. Mercier <tjmercier@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-07mm/hugetlb: document why hugetlb uses folio_mapcount() for COW reuse decisionsDavid Hildenbrand1-0/+7
Let's document why hugetlb still uses folio_mapcount() and is prone to leaking memory between processes, for example using vmsplice() that still uses FOLL_GET. More details can be found in [1], especially around how hugetlb pages cannot really be overcommitted, and why we don't particularly care about these vmsplice() leaks for hugetlb -- in contrast to ordinary memory. [1] https://lore.kernel.org/all/8b42a24d-caf0-46ef-9e15-0f88d47d2f21@redhat.com/ Link: https://lkml.kernel.org/r/20240502085259.103784-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Suggested-by: Peter Xu <peterx@redhat.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05mm/swapfile: mark racy access on si->highest_bitlinke li1-1/+1
In scan_swap_map_slots(), si->highest_bit can by changed by swap_range_alloc() concurrently. All reads on si->highest_bit except one is either protected by lock or read using READ_ONCE. So mark the one racy read on si->highest_bit as benign using READ_ONCE. This patch is aimed at reducing the number of benign races reported by KCSAN in order to focus future debugging effort on harmful races. Link: https://lkml.kernel.org/r/tencent_912BC3E8B0291DA4A0028AB424076375DA07@qq.com Signed-off-by: linke li <lilinke99@qq.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05mm/rmap: change the type of we_locked from int to boolHao Ge1-1/+1
Change the type of we_locked from int to bool because folio_trylock return bool Link: https://lkml.kernel.org/r/20240428012049.8182-1-gehao@kylinos.cn Signed-off-by: Hao Ge <gehao@kylinos.cn> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05mm/rmap: do not add fully unmapped large folio to deferred split listZi Yan1-3/+10
In __folio_remove_rmap(), a large folio is added to deferred split list if any page in a folio loses its final mapping. But it is possible that the folio is fully unmapped and adding it to deferred split list is unnecessary. For PMD-mapped THPs, that was not really an issue, because removing the last PMD mapping in the absence of PTE mappings would not have added the folio to the deferred split queue. However, for PTE-mapped THPs, which are now more prominent due to mTHP, they are always added to the deferred split queue. One side effect is that the THP_DEFERRED_SPLIT_PAGE stat for a PTE-mapped folio can be unintentionally increased, making it look like there are many partially mapped folios -- although the whole folio is fully unmapped stepwise. Core-mm now tries batch-unmapping consecutive PTEs of PTE-mapped THPs where possible starting from commit b06dc281aa99 ("mm/rmap: introduce folio_remove_rmap_[pte|ptes|pmd]()"). When it happens, a whole PTE-mapped folio is unmapped in one go and can avoid being added to deferred split list, reducing the THP_DEFERRED_SPLIT_PAGE noise. But there will still be noise when we cannot batch-unmap a complete PTE-mapped folio in one go -- or where this type of batching is not implemented yet, e.g., migration. To avoid the unnecessary addition, folio->_nr_pages_mapped is checked to tell if the whole folio is unmapped. If the folio is already on deferred split list, it will be skipped, too. Note: commit 98046944a159 ("mm: huge_memory: add the missing folio_test_pmd_mappable() for THP split statistics") tried to exclude mTHP deferred split stats from THP_DEFERRED_SPLIT_PAGE, but it does not fix the above issue. A fully unmapped PTE-mapped order-9 THP was still added to deferred split list and counted as THP_DEFERRED_SPLIT_PAGE, since nr is 512 (non zero), level is RMAP_LEVEL_PTE, and inside deferred_split_folio() the order-9 folio is folio_test_pmd_mappable(). Link: https://lkml.kernel.org/r/20240502132852.862138-1-zi.yan@sent.com Signed-off-by: Zi Yan <ziy@nvidia.com> Suggested-by: David Hildenbrand <david@redhat.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Lance Yang <ioworker0@gmail.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05mm/damon/paddr: implement DAMOS filter type YOUNGSeongJae Park1-0/+5
DAMOS filter of type YOUNG is defined, but not yet implemented by any DAMON operations set. Add the implementation on 'paddr', the DAMON operations set for the physical address space. Link: https://lkml.kernel.org/r/20240426195247.100306-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Tested-by: Honggyu Kim <honggyu.kim@sk.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05mm/damon: add DAMOS filter type YOUNGSeongJae Park1-0/+1
Define yet another DAMOS filter type, YOUNG. Like anon and memcg, the type of filter will be applied to each page in the memory region, and see if the page is accessed since the last check. Based on the 'matching' parameter, the page is filtered out or in. Note that this commit is adding only the type definition. The implementation should be made by DAMON operations sets. A commit for the implementation on 'paddr' DAMON operations set will follow. Link: https://lkml.kernel.org/r/20240426195247.100306-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Tested-by: Honggyu Kim <honggyu.kim@sk.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05mm/damon/paddr: implement damon_folio_mkold()SeongJae Park1-11/+16
damon_pa_mkold() receives physical address, get the folio covering the address, and makes the folio as old. A following commit will reuse the internal logic for marking a given folio as old. To avoid duplication of the code, split the internal logic. Also, change the rmap walker function's name from __damon_pa_mkold() to damon_folio_mkold_one(), following the change of the caller's name and the naming rule that more commonly used by other rmap walkers. Link: https://lkml.kernel.org/r/20240426195247.100306-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Tested-by: Honggyu Kim <honggyu.kim@sk.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05mm/damon/paddr: implement damon_folio_young()SeongJae Park1-13/+19
Patch series "mm/damon: add a DAMOS filter type for page granularity access recheck". DAMON provides its best-effort accuracy-overhead tradeoff under the user-defined ranges of acceptable level of the monitoring accuracy and overhead. A recent discussion for tiered memory management support from DAMON[1] concluded that finding memory regions of specific access pattern with low overhead despite of low accuracy via DAMON first, and then double checking the access of the region again in a finer (e.g., page) granularity could be a useful strategy for some DAMOS schemes. Add a new type of DAMOS filter, namely 'young' for such a case. It checks each page of DAMOS target region is accessed since the last check, and filters it out or in if 'matching' parameter is 'true' or 'false', respectively. Because this is a filter type that applied in page granularity, the support depends on DAMON operations set, similar to 'anon' and 'memcg' DAMOS filter types. Implement the support on the DAMON operations set for the physical address space, 'paddr', since one of the expected usages[1] is based on the physical address space. [1] https://lore.kernel.org/r/20240227235121.153277-1-sj@kernel.org This patch (of 7): damon_pa_young() receives physical address, get the folio covering the address, and show if the folio is accessed since the last check. A following commit will reuse the internal logic for checking access to a given folio. To avoid duplication of the code, split the internal logic. Also, change the rmap walker function's name from __damon_pa_young() to damon_folio_young_one(), following the change of the caller's name and the naming rule that more commonly used by other rmap walkers. Link: https://lkml.kernel.org/r/20240426195247.100306-1-sj@kernel.org Link: https://lkml.kernel.org/r/20240426195247.100306-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Tested-by: Honggyu Kim <honggyu.kim@sk.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05mm: optimise vmf_anon_prepare() for VMAs without an anon_vmaMatthew Wilcox (Oracle)1-4/+9
If the mmap_lock can be taken for read, we can call __anon_vma_prepare() while holding it, saving ourselves a trip back through the fault handler. Link: https://lkml.kernel.org/r/20240426144506.1290619-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jann Horn <jannh@google.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05mm: fix some minor per-VMA lock issues in userfaultfdMatthew Wilcox (Oracle)1-11/+9
Rename lock_vma() to uffd_lock_vma() because it really is uffd specific. Remove comment referencing unlock_vma() which doesn't exist. Fix the comment about lock_vma_under_rcu() which I just made incorrect. Link: https://lkml.kernel.org/r/20240426144506.1290619-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05mm: delay the check for a NULL anon_vmaMatthew Wilcox (Oracle)2-13/+22
Instead of checking the anon_vma early in the fault path where all page faults pay the cost, delay it until we know we're going to need the anon_vma to be filled in. This will have a slight negative effect on the first fault in an anonymous VMA, but it shortens every other page fault. It also makes the code slightly cleaner as the anon and file backed fault handling look more similar. The Intel kernel test bot reports a 3x improvement in vm-scalability throughput with the small-allocs-mt test. This is clearly an extreme situation that won't be replicated in any real-world workload, but it's a nice win. https://lore.kernel.org/all/202404261055.c5e24608-oliver.sang@intel.com/ Link: https://lkml.kernel.org/r/20240426144506.1290619-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05mm: assert the mmap_lock is held in __anon_vma_prepare()Matthew Wilcox (Oracle)1-2/+1
Patch series "Improve anon_vma scalability for anon VMAs". We have a 3x throughput improvement reported by Intel's kernel test robot: https://lore.kernel.org/all/202404261055.c5e24608-oliver.sang@intel.com/ This is from delaying taking the mmap_lock for page faults until we actually need the mmap_lock in order to assign an anon_vma to the vma. It cleans up the page fault path a little by making the anon fault handler more similar to the file fault handler. This patch (of 4): Convert the comment into an assertion. Link: https://lkml.kernel.org/r/20240426144506.1290619-1-willy@infradead.org Link: https://lkml.kernel.org/r/20240426144506.1290619-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05mm: simplify thp_vma_allowable_orderMatthew Wilcox3-15/+18
Combine the three boolean arguments into one flags argument for readability. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05mm: remove stale comment __folio_mark_dirtyKemeng Shi1-2/+1
The __folio_mark_dirty will not mark inode dirty any longer. Remove the stale comment of it. Link: https://lkml.kernel.org/r/20240425131724.36778-5-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Howard Cochran <hcochran@kernelspring.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05mm: call __wb_calc_thresh instead of wb_calc_thresh in wb_over_bg_threshKemeng Shi1-1/+1
Call __wb_calc_thresh to calculate wb bg_thresh of gdtc in wb_over_bg_thresh to remove unnecessary wrap in wb_calc_thresh. Link: https://lkml.kernel.org/r/20240425131724.36778-4-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Howard Cochran <hcochran@kernelspring.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05mm: correct calculation of wb's bg_thresh in cgroup domainKemeng Shi1-1/+1
wb_calc_thresh() is calculating wb's share of bg_thresh in the global domain. However in case of cgroup writeback this is not the right thing to do. Consider the following domain hierarchy: global domain (> 20G) / \ cgroup1 (10G) cgroup2 (10G) | | bdi wb1 wb2 and assume wb1 and wb2 have the same bandwidth and the background threshold is set at 10%. The bg_thresh of cgroup1 and cgroup2 is going to be 1G. Now because wb_calc_thresh(mdtc->wb, mdtc->bg_thresh) calculates per-wb threshold in the global domain as (wb bandwidth) / (domain bandwidth) it returns bg_thresh for wb1 as 0.5G although it has nobody to compete against in cgroup1. Fix the problem by calculating wb's share of bg_thresh in the cgroup domain. Test as following: /* make it easier to observe the issue */ echo 300000 > /proc/sys/vm/dirty_expire_centisecs echo 100 > /proc/sys/vm/dirty_writeback_centisecs /* run fio in wb1 */ cd /sys/fs/cgroup echo "+memory +io" > cgroup.subtree_control mkdir group1 cd group1 echo 10G > memory.high echo 10G > memory.max echo $$ > cgroup.procs mkfs.ext4 -F /dev/vdb mount /dev/vdb /bdi1/ fio -name test -filename=/bdi1/file -size=600M -ioengine=libaio -bs=4K \ -iodepth=1 -rw=write -direct=0 --time_based -runtime=600 -invalidate=0 /* run fio in wb2 with a new shell */ cd /sys/fs/cgroup mkdir group2 cd group2 echo 10G > memory.high echo 10G > memory.max echo $$ > cgroup.procs mkfs.ext4 -F /dev/vdc mount /dev/vdc /bdi2/ fio -name test -filename=/bdi2/file -size=600M -ioengine=libaio -bs=4K \ -iodepth=1 -rw=write -direct=0 --time_based -runtime=600 -invalidate=0 Before fix, the wrttien pages of wb1 and wb2 reported from toos/writeback/wb_monitor.py keep growing. After fix, rare written pages are accumulated. There is no obvious change in fio result. [jack@suse.cz: changelog rewording] Link: https://lkml.kernel.org/r/20240425131724.36778-3-shikemeng@huaweicloud.com Fixes: 74d369443325 ("writeback: Fix performance regression in wb_over_bg_thresh()") Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Howard Cochran <hcochran@kernelspring.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05mm: enable __wb_calc_thresh to calculate dirty background thresholdKemeng Shi1-15/+18
Patch series "Fix and cleanups to page-writeback", v2. This series contains some random cleanups and a fix to correct calculation of wb's bg_