summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
2024-11-08mm/page_alloc: rename ALLOC_HIGH to ALLOC_MIN_RESERVEMel Gorman2-5/+7
[ Upstream commit 524c48072e5673f4511f1ad81493e2485863fd65 ] Patch series "Discard __GFP_ATOMIC", v3. Neil's patch has been residing in mm-unstable as commit 2fafb4fe8f7a ("mm: discard __GFP_ATOMIC") for a long time and recently brought up again. Most recently, I was worried that __GFP_HIGH allocations could use high-order atomic reserves which is unintentional but there was no response so lets revisit -- this series reworks how min reserves are used, protects highorder reserves and then finishes with Neil's patch with very minor modifications so it fits on top. There was a review discussion on renaming __GFP_DIRECT_RECLAIM to __GFP_ALLOW_BLOCKING but I didn't think it was that big an issue and is orthogonal to the removal of __GFP_ATOMIC. There were some concerns about how the gfp flags affect the min reserves but it never reached a solid conclusion so I made my own attempt. The series tries to iron out some of the details on how reserves are used. ALLOC_HIGH becomes ALLOC_MIN_RESERVE and ALLOC_HARDER becomes ALLOC_NON_BLOCK and documents how the reserves are affected. For example, ALLOC_NON_BLOCK (no direct reclaim) on its own allows 25% of the min reserve. ALLOC_MIN_RESERVE (__GFP_HIGH) allows 50% and both combined allows deeper access again. ALLOC_OOM allows access to 75%. High-order atomic allocations are explicitly handled with the caveat that no __GFP_ATOMIC flag means that any high-order allocation that specifies GFP_HIGH and cannot enter direct reclaim will be treated as if it was GFP_ATOMIC. This patch (of 6): __GFP_HIGH aliases to ALLOC_HIGH but the name does not really hint what it means. As ALLOC_HIGH is internal to the allocator, rename it to ALLOC_MIN_RESERVE to document that the min reserves can be depleted. Link: https://lkml.kernel.org/r/20230113111217.14134-1-mgorman@techsingularity.net Link: https://lkml.kernel.org/r/20230113111217.14134-2-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: NeilBrown <neilb@suse.de> Cc: Thierry Reding <thierry.reding@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Stable-dep-of: 281dd25c1a01 ("mm/page_alloc: let GFP_ATOMIC order-0 allocs access highatomic reserves") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-10-22mm/swapfile: skip HugeTLB pages for unuse_vmaLiu Shixin1-1/+1
commit 7528c4fb1237512ee18049f852f014eba80bbe8d upstream. I got a bad pud error and lost a 1GB HugeTLB when calling swapoff. The problem can be reproduced by the following steps: 1. Allocate an anonymous 1GB HugeTLB and some other anonymous memory. 2. Swapout the above anonymous memory. 3. run swapoff and we will get a bad pud error in kernel message: mm/pgtable-generic.c:42: bad pud 00000000743d215d(84000001400000e7) We can tell that pud_clear_bad is called by pud_none_or_clear_bad in unuse_pud_range() by ftrace. And therefore the HugeTLB pages will never be freed because we lost it from page table. We can skip HugeTLB pages for unuse_vma to fix it. Link: https://lkml.kernel.org/r/20241015014521.570237-1-liushixin2@huawei.com Fixes: 0fe6e20b9c4c ("hugetlb, rmap: add reverse mapping for hugepage") Signed-off-by: Liu Shixin <liushixin2@huawei.com> Acked-by: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-10-17secretmem: disable memfd_secret() if arch cannot set direct mapPatrick Roy1-2/+2
commit 532b53cebe58f34ce1c0f34d866f5c0e335c53c6 upstream. Return -ENOSYS from memfd_secret() syscall if !can_set_direct_map(). This is the case for example on some arm64 configurations, where marking 4k PTEs in the direct map not present can only be done if the direct map is set up at 4k granularity in the first place (as ARM's break-before-make semantics do not easily allow breaking apart large/gigantic pages). More precisely, on arm64 systems with !can_set_direct_map(), set_direct_map_invalid_noflush() is a no-op, however it returns success (0) instead of an error. This means that memfd_secret will seemingly "work" (e.g. syscall succeeds, you can mmap the fd and fault in pages), but it does not actually achieve its goal of removing its memory from the direct map. Note that with this patch, memfd_secret() will start erroring on systems where can_set_direct_map() returns false (arm64 with CONFIG_RODATA_FULL_DEFAULT_ENABLED=n, CONFIG_DEBUG_PAGEALLOC=n and CONFIG_KFENCE=n), but that still seems better than the current silent failure. Since CONFIG_RODATA_FULL_DEFAULT_ENABLED defaults to 'y', most arm64 systems actually have a working memfd_secret() and aren't be affected. From going through the iterations of the original memfd_secret patch series, it seems that disabling the syscall in these scenarios was the intended behavior [1] (preferred over having set_direct_map_invalid_noflush return an error as that would result in SIGBUSes at page-fault time), however the check for it got dropped between v16 [2] and v17 [3], when secretmem moved away from CMA allocations. [1]: https://lore.kernel.org/lkml/20201124164930.GK8537@kernel.org/ [2]: https://lore.kernel.org/lkml/20210121122723.3446-11-rppt@kernel.org/#t [3]: https://lore.kernel.org/lkml/20201125092208.12544-10-rppt@kernel.org/ Link: https://lkml.kernel.org/r/20241001080056.784735-1-roypat@amazon.co.uk Fixes: 1507f51255c9 ("mm: introduce memfd_secret system call to create "secret" memory areas") Signed-off-by: Patrick Roy <roypat@amazon.co.uk> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: David Hildenbrand <david@redhat.com> Cc: James Gowans <jgowans@amazon.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-10-17mm: z3fold: deprecate CONFIG_Z3FOLDYosry Ahmed1-6/+19
[ Upstream commit 7a2369b74abf76cd3e54c45b30f6addb497f831b ] The z3fold compressed pages allocator is rarely used, most users use zsmalloc. The only disadvantage of zsmalloc in comparison is the dependency on MMU, and zbud is a more common option for !MMU as it was the default zswap allocator for a long time. Historically, zsmalloc had worse latency than zbud and z3fold but offered better memory savings. This is no longer the case as shown by a simple recent analysis [1]. That analysis showed that z3fold does not have any advantage over zsmalloc or zbud considering both performance and memory usage. In a kernel build test on tmpfs in a limited cgroup, z3fold took 3% more time and used 1.8% more memory. The latency of zswap_load() was 7% higher, and that of zswap_store() was 10% higher. Zsmalloc is better in all metrics. Moreover, z3fold apparently has latent bugs, which was made noticeable by a recent soft lockup bug report with z3fold [2]. Switching to zsmalloc not only fixed the problem, but also reduced the swap usage from 6~8G to 1~2G. Other users have also reported being bitten by mistakenly enabling z3fold. Other than hurting users, z3fold is repeatedly causing wasted engineering effort. Apart from investigating the above bug, it came up in multiple development discussions (e.g. [3]) as something we need to handle, when there aren't any legit users (at least not intentionally). The natural course of action is to deprecate z3fold, and remove in a few cycles if no objections are raised from active users. Next on the list should be zbud, as it offers marginal latency gains at the cost of huge memory waste when compared to zsmalloc. That one will need to wait until zsmalloc does not depend on MMU. Rename the user-visible config option from CONFIG_Z3FOLD to CONFIG_Z3FOLD_DEPRECATED so that users with CONFIG_Z3FOLD=y get a new prompt with explanation during make oldconfig. Also, remove CONFIG_Z3FOLD=y from defconfigs. [1]https://lore.kernel.org/lkml/CAJD7tkbRF6od-2x_L8-A1QL3=2Ww13sCj4S3i4bNndqF+3+_Vg@mail.gmail.com/ [2]https://lore.kernel.org/lkml/EF0ABD3E-A239-4111-A8AB-5C442E759CF3@gmail.com/ [3]https://lore.kernel.org/lkml/CAJD7tkbnmeVugfunffSovJf9FAgy9rhBVt_tx=nxUveLUfqVsA@mail.gmail.com/ [arnd@arndb.de: deprecate ZSWAP_ZPOOL_DEFAULT_Z3FOLD as well] Link: https://lkml.kernel.org/r/20240909202625.1054880-1-arnd@kernel.org Link: https://lkml.kernel.org/r/20240904233343.933462-1-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Chris Down <chris@chrisdown.name> Acked-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vitaly Wool <vitaly.wool@konsulko.com> Acked-by: Christoph Hellwig <hch@lst.de> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: WANG Xuerui <kernel@xen0n.name> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> (cherry picked from commit 7a2369b74abf76cd3e54c45b30f6addb497f831b) Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-10-17mm: krealloc: consider spare memory for __GFP_ZERODanilo Krummrich1-0/+7
commit 1a83a716ec233990e1fd5b6fbb1200ade63bf450 upstream. As long as krealloc() is called with __GFP_ZERO consistently, starting with the initial memory allocation, __GFP_ZERO should be fully honored. However, if for an existing allocation krealloc() is called with a decreased size, it is not ensured that the spare portion the allocation is zeroed. Thus, if krealloc() is subsequently called with a larger size again, __GFP_ZERO can't be fully honored, since we don't know the previous size, but only the bucket size. Example: buf = kzalloc(64, GFP_KERNEL); memset(buf, 0xff, 64); buf = krealloc(buf, 48, GFP_KERNEL | __GFP_ZERO); /* After this call the last 16 bytes are still 0xff. */ buf = krealloc(buf, 64, GFP_KERNEL | __GFP_ZERO); Fix this, by explicitly setting spare memory to zero, when shrinking an allocation with __GFP_ZERO flag set or init_on_alloc enabled. Link: https://lkml.kernel.org/r/20240812223707.32049-1-dakr@kernel.org Signed-off-by: Danilo Krummrich <dakr@kernel.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-10-17mm/damon/vaddr: protect vma traversal in __damon_va_thre_regions() with rcu ↵Liam R. Howlett1-0/+2
read lock commit fb497d6db7c19c797cbd694b52d1af87c4eebcc6 upstream. Traversing VMAs of a given maple tree should be protected by rcu read lock. However, __damon_va_three_regions() is not doing the protection. Hold the lock. Link: https://lkml.kernel.org/r/20240905001204.1481-1-sj@kernel.org Fixes: d0cf3dd47f0d ("damon: convert __damon_va_three_regions to use the VMA iterator") Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: SeongJae Park <sj@kernel.org> Reported-by: Guenter Roeck <linux@roeck-us.net> Closes: https://lore.kernel.org/b83651a0-5b24-4206-b860-cb54ffdf209b@roeck-us.net Tested-by: Guenter Roeck <linux@roeck-us.net> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-10-17mm: only enforce minimum stack gap size if it's sensibleDavid Gow1-1/+1
commit 69b50d4351ed924f29e3d46b159e28f70dfc707f upstream. The generic mmap_base code tries to leave a gap between the top of the stack and the mmap base address, but enforces a minimum gap size (MIN_GAP) of 128MB, which is too large on some setups. In particular, on arm tasks without ADDR_LIMIT_32BIT, the STACK_TOP value is less than 128MB, so it's impossible to fit such a gap in. Only enforce this minimum if MIN_GAP < MAX_GAP, as we'd prefer to honour MAX_GAP, which is defined proportionally, so scales better and always leaves us with both _some_ stack space and some room for mmap. This fixes the usercopy KUnit test suite on 32-bit arm, as it doesn't set any personality flags so gets the default (in this case 26-bit) task size. This test can be run with: ./tools/testing/kunit/kunit.py run --arch arm usercopy --make_options LLVM=1 Link: https://lkml.kernel.org/r/20240803074642.1849623-2-davidgow@google.com Fixes: dba79c3df4a2 ("arm: use generic mmap top-down layout and brk randomization") Signed-off-by: David Gow <davidgow@google.com> Reviewed-by: Kees Cook <kees@kernel.org> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Russell King <linux@armlinux.org.uk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-10-17mm/filemap: optimize filemap folio addingKairui Song1-14/+39
commit 6758c1128ceb45d1a35298912b974eb4895b7dd9 upstream. Instead of doing multiple tree walks, do one optimism range check with lock hold, and exit if raced with another insertion. If a shadow exists, check it with a new xas_get_order helper before releasing the lock to avoid redundant tree walks for getting its order. Drop the lock and do the allocation only if a split is needed. In the best case, it only need to walk the tree once. If it needs to alloc and split, 3 walks are issued (One for first ranged conflict check and order retrieving, one for the second check after allocation, one for the insert after split). Testing with 4K pages, in an 8G cgroup, with 16G brd as block device: echo 3 > /proc/sys/vm/drop_caches fio -name=cached --numjobs=16 --filename=/mnt/test.img \ --buffered=1 --ioengine=mmap --rw=randread --time_based \ --ramp_time=30s --runtime=5m --group_reporting Before: bw ( MiB/s): min= 1027, max= 3520, per=100.00%, avg=2445.02, stdev=18.90, samples=8691 iops : min=263001, max=901288, avg=625924.36, stdev=4837.28, samples=8691 After (+7.3%): bw ( MiB/s): min= 493, max= 3947, per=100.00%, avg=2625.56, stdev=25.74, samples=8651 iops : min=126454, max=1010681, avg=672142.61, stdev=6590.48, samples=8651 Test result with THP (do a THP randread then switch to 4K page in hope it issues a lot of splitting): echo 3 > /proc/sys/vm/drop_caches fio -name=cached --numjobs=16 --filename=/mnt/test.img \ --buffered=1 --ioengine=mmap -thp=1 --readonly \ --rw=randread --time_based --ramp_time=30s --runtime=10m \ --group_reporting fio -name=cached --numjobs=16 --filename=/mnt/test.img \ --buffered=1 --ioengine=mmap \ --rw=randread --time_based --runtime=5s --group_reporting Before: bw ( KiB/s): min= 4141, max=14202, per=100.00%, avg=7935.51, stdev=96.85, samples=18976 iops : min= 1029, max= 3548, avg=1979.52, stdev=24.23, samples=18976· READ: bw=4545B/s (4545B/s), 4545B/s-4545B/s (4545B/s-4545B/s), io=64.0KiB (65.5kB), run=14419-14419msec After (+12.5%): bw ( KiB/s): min= 4611, max=15370, per=100.00%, avg=8928.74, stdev=105.17, samples=19146 iops : min= 1151, max= 3842, avg=2231.27, stdev=26.29, samples=19146 READ: bw=4635B/s (4635B/s), 4635B/s-4635B/s (4635B/s-4635B/s), io=64.0KiB (65.5kB), run=14137-14137msec The performance is better for both 4K (+7.5%) and THP (+12.5%) cached read. Link: https://lkml.kernel.org/r/20240415171857.19244-5-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Closes: https://lore.kernel.org/linux-mm/A5A976CB-DB57-4513-A700-656580488AB6@flyingcircus.io/ [ kasong@tencent.com: minor adjustment of variable declarations ] Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-10-17mm/filemap: return early if failed to allocate memory for splitKairui Song1-1/+4
commit de60fd8ddeda2b41fbe11df11733838c5f684616 upstream. xas_split_alloc could fail with NOMEM, and in such case, it should abort early instead of keep going and fail the xas_split below. Link: https://lkml.kernel.org/r/20240416071722.45997-1-ryncsn@gmail.com Link: https://lkml.kernel.org/r/20240415171857.19244-1-ryncsn@gmail.com Link: https://lkml.kernel.org/r/20240415171857.19244-2-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Stable-dep-of: 6758c1128ceb ("mm/filemap: optimize filemap folio adding") Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-18mm: avoid leaving partial pfn mappings around in error caseLinus Torvalds1-5/+22
commit 79a61cc3fc0466ad2b7b89618a6157785f0293b3 upstream. As Jann points out, PFN mappings are special, because unlike normal memory mappings, there is no lifetime information associated with the mapping - it is just a raw mapping of PFNs with no reference counting of a 'struct page'. That's all very much intentional, but it does mean that it's easy to mess up the cleanup in case of errors. Yes, a failed mmap() will always eventually clean up any partial mappings, but without any explicit lifetime in the page table mapping itself, it's very easy to do the error handling in the wrong order. In particular, it's easy to mistakenly free the physical backing store before the page tables are actually cleaned up and (temporarily) have stale dangling PTE entries. To make this situation less error-prone, just make sure that any partial pfn mapping is torn down early, before any other error handling. Reported-and-tested-by: Jann Horn <jannh@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Simona Vetter <simona.vetter@ffwll.ch> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-12memcg: protect concurrent access to mem_cgroup_idrShakeel Butt1-2/+20
commit 9972605a238339b85bd16b084eed5f18414d22db upstream. Commit 73f576c04b94 ("mm: memcontrol: fix cgroup creation failure after many small jobs") decoupled the memcg IDs from the CSS ID space to fix the cgroup creation failures. It introduced IDR to maintain the memcg ID space. The IDR depends on external synchronization mechanisms for modifications. For the mem_cgroup_idr, the idr_alloc() and idr_replace() happen within css callback and thus are protected through cgroup_mutex from concurrent modifications. However idr_remove() for mem_cgroup_idr was not protected against concurrency and can be run concurrently for different memcgs when they hit their refcnt to zero. Fix that. We have been seeing list_lru based kernel crashes at a low frequency in our fleet for a long time. These crashes were in different part of list_lru code including list_lru_add(), list_lru_del() and reparenting code. Upon further inspection, it looked like for a given object (dentry and inode), the super_block's list_lru didn't have list_lru_one for the memcg of that object. The initial suspicions were either the object is not allocated through kmem_cache_alloc_lru() or somehow memcg_list_lru_alloc() failed to allocate list_lru_one() for a memcg but returned success. No evidence were found for these cases. Looking more deeply, we started seeing situations where valid memcg's id is not present in mem_cgroup_idr and in some cases multiple valid memcgs have same id and mem_cgroup_idr is pointing to one of them. So, the most reasonable explanation is that these situations can happen due to race between multiple idr_remove() calls or race between idr_alloc()/idr_replace() and idr_remove(). These races are causing multiple memcgs to acquire the same ID and then offlining of one of them would cleanup list_lrus on the system for all of them. Later access from other memcgs to the list_lru cause crashes due to missing list_lru_one. Link: https://lkml.kernel.org/r/20240802235822.1830976-1-shakeel.butt@linux.dev Fixes: 73f576c04b94 ("mm: memcontrol: fix cgroup creation failure after many small jobs") Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Muchun Song <muchun.song@linux.dev> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [ Adapted over commit 6f0df8e16eb5 ("memcontrol: ensure memcg acquired by id is properly set up") not in the tree ] Signed-off-by: Tomas Krcka <krckatom@amazon.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-12x86/kaslr: Expose and use the end of the physical memory address spaceThomas Gleixner2-2/+2
commit ea72ce5da22806d5713f3ffb39a6d5ae73841f93 upstream. iounmap() on x86 occasionally fails to unmap because the provided valid ioremap address is not below high_memory. It turned out that this happens due to KASLR. KASLR uses the full address space between PAGE_OFFSET and vaddr_end to randomize the starting points of the direct map, vmalloc and vmemmap regions. It thereby limits the size of the direct map by using the installed memory size plus an extra configurable margin for hot-plug memory. This limitation is done to gain more randomization space because otherwise only the holes between the direct map, vmalloc, vmemmap and vaddr_end would be usable for randomizing. The limited direct map size is not exposed to the rest of the kernel, so the memory hot-plug and resource management related code paths still operate under the assumption that the available address space can be determined with MAX_PHYSMEM_BITS. request_free_mem_region() allocates from (1 << MAX_PHYSMEM_BITS) - 1 downwards. That means the first allocation happens past the end of the direct map and if unlucky this address is in the vmalloc space, which causes high_memory to become greater than VMALLOC_START and consequently causes iounmap() to fail for valid ioremap addresses. MAX_PHYSMEM_BITS cannot be changed for that because the randomization does not align with address bit boundaries and there are other places which actually require to know the maximum number of address bits. All remaining usage sites of MAX_PHYSMEM_BITS have been analyzed and found to be correct. Cure this by exposing the end of the direct map via PHYSMEM_END and use that for the memory hot-plug and resource management related places instead of relying on MAX_PHYSMEM_BITS. In the KASLR case PHYSMEM_END maps to a variable which is initialized by the KASLR initialization and otherwise it is based on MAX_PHYSMEM_BITS as before. To prevent future hickups add a check into add_pages() to catch callers trying to add memory above PHYSMEM_END. Fixes: 0483e1fa6e09 ("x86/mm: Implement ASLR for kernel memory regions") Reported-by: Max Ramanouski <max8rr8@gmail.com> Reported-by: Alistair Popple <apopple@nvidia.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-By: Max Ramanouski <max8rr8@gmail.com> Tested-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Kees Cook <kees@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/87ed6soy3z.ffs@tglx Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-04mm: Fix missing folio invalidation calls during truncationDavid Howells1-2/+2
[ Upstream commit 0aa2e1b2fb7a75aa4b5b4347055ccfea6f091769 ] When AS_RELEASE_ALWAYS is set on a mapping, the ->release_folio() and ->invalidate_folio() calls should be invoked even if PG_private and PG_private_2 aren't set. This is used by netfslib to keep track of the point above which reads can be skipped in favour of just zeroing pagecache locally. There are a couple of places in truncation in which invalidation is only called when folio_has_private() is true. Fix these to check folio_needs_release() instead. Without this, the generic/075 and generic/112 xfstests (both fsx-based tests) fail with minimum folio size patches applied[1]. Fixes: b4fa966f03b7 ("mm, netfs, fscache: stop read optimisation when folio removed from pagecache") Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20240815090849.972355-1-kernel@pankajraghav.com/ [1] Link: https://lore.kernel.org/r/20240823200819.532106-2-dhowells@redhat.com Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> cc: Matthew Wilcox (Oracle) <willy@infradead.org> cc: Pankaj Raghav <p.raghav@samsung.com> cc: Jeff Layton <jlayton@kernel.org> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: netfs@lists.linux.dev cc: linux-mm@kvack.org cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-29mm/numa: no task_numa_fault() call if PTE is changedZi Yan1-15/+14
commit 40b760cfd44566bca791c80e0720d70d75382b84 upstream. When handling a numa page fault, task_numa_fault() should be called by a process that restores the page table of the faulted folio to avoid duplicated stats counting. Commit b99a342d4f11 ("NUMA balancing: reduce TLB flush via delaying mapping on hint page fault") restructured do_numa_page() and did not avoid task_numa_fault() call in the second page table check after a numa migration failure. Fix it by making all !pte_same() return immediately. This issue can cause task_numa_fault() being called more than necessary and lead to unexpected numa balancing results (It is hard to tell whether the issue will cause positive or negative performance impact due to duplicated numa fault counting). Link: https://lkml.kernel.org/r/20240809145906.1513458-2-ziy@nvidia.com Fixes: b99a342d4f11 ("NUMA balancing: reduce TLB flush via delaying mapping on hint page fault") Signed-off-by: Zi Yan <ziy@nvidia.com> Reported-by: "Huang, Ying" <ying.huang@intel.com> Closes: https://lore.kernel.org/linux-mm/87zfqfw0yw.fsf@yhuang6-desk2.ccr.corp.intel.com/ Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Yang Shi <shy828301@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-08-29mm/numa: no task_numa_fault() call if PMD is changedZi Yan1-17/+13
commit fd8c35a92910f4829b7c99841f39b1b952c259d5 upstream. When handling a numa page fault, task_numa_fault() should be called by a process that restores the page table of the faulted folio to avoid duplicated stats counting. Commit c5b5a3dd2c1f ("mm: thp: refactor NUMA fault handling") restructured do_huge_pmd_numa_page() and did not avoid task_numa_fault() call in the second page table check after a numa migration failure. Fix it by making all !pmd_same() return immediately. This issue can cause task_numa_fault() being called more than necessary and lead to unexpected numa balancing results (It is hard to tell whether the issue will cause positive or negative performance impact due to duplicated numa fault counting). Link: https://lkml.kernel.org/r/20240809145906.1513458-3-ziy@nvidia.com Fixes: c5b5a3dd2c1f ("mm: thp: refactor NUMA fault handling") Reported-by: "Huang, Ying" <ying.huang@intel.com> Closes: https://lore.kernel.org/linux-mm/87zfqfw0yw.fsf@yhuang6-desk2.ccr.corp.intel.com/ Signed-off-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Yang Shi <shy828301@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-08-29mm/vmalloc: fix page mapping if vm_area_alloc_pages() with high order ↵Hailong Liu1-9/+2
fallback to order 0 commit 61ebe5a747da649057c37be1c37eb934b4af79ca upstream. The __vmap_pages_range_noflush() assumes its argument pages** contains pages with the same page shift. However, since commit e9c3cda4d86e ("mm, vmalloc: fix high order __GFP_NOFAIL allocations"), if gfp_flags includes __GFP_NOFAIL with high order in vm_area_alloc_pages() and page allocation failed for high order, the pages** may contain two different page shifts (high order and order-0). This could lead __vmap_pages_range_noflush() to perform incorrect mappings, potentially resulting in memory corruption. Users might encounter this as follows (vmap_allow_huge = true, 2M is for PMD_SIZE): kvmalloc(2M, __GFP_NOFAIL|GFP_X) __vmalloc_node_range_noprof(vm_flags=VM_ALLOW_HUGE_VMAP) vm_area_alloc_pages(order=9) ---> order-9 allocation failed and fallback to order-0 vmap_pages_range() vmap_pages_range_noflush() __vmap_pages_range_noflush(page_shift = 21) ----> wrong mapping happens We can remove the fallback code because if a high-order allocation fails, __vmalloc_node_range_noprof() will retry with order-0. Therefore, it is unnecessary to fallback to order-0 here. Therefore, fix this by removing the fallback code. Link: https://lkml.kernel.org/r/20240808122019.3361-1-hailong.liu@oppo.com Fixes: e9c3cda4d86e ("mm, vmalloc: fix high order __GFP_NOFAIL allocations") Signed-off-by: Hailong Liu <hailong.liu@oppo.com> Reported-by: Tangquan Zheng <zhengtangquan@oppo.com> Reviewed-by: Baoquan He <bhe@redhat.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Acked-by: Barry Song <baohua@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-08-29mm: Remove kmem_valid_obj()Zhen Lei2-33/+12
commit 6e284c55fc0bef7d25fd34d29db11f483da60ea4 upstream. Function kmem_dump_obj() will splat if passed a pointer to a non-slab object. So nothing calls it directly, instead calling kmem_valid_obj() first to determine whether the passed pointer to a valid slab object. This means that merging kmem_valid_obj() into kmem_dump_obj() will make the code more concise. Therefore, convert kmem_dump_obj() to work the same way as vmalloc_dump_obj(), removing the need for the kmem_dump_obj() caller to check kmem_valid_obj(). After this, there are no remaining calls to kmem_valid_obj() anymore, and it can be safely removed. Suggested-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-08-29mm: khugepaged: fix kernel BUG in hpage_collapse_scan_file()Ivan Orlov1-0/+20
[ Upstream commit 2ce0bdfebc74f6cbd4e97a4e767d505a81c38cf2 ] Syzkaller reported the following issue: kernel BUG at mm/khugepaged.c:1823! invalid opcode: 0000 [#1] PREEMPT SMP KASAN CPU: 1 PID: 5097 Comm: syz-executor220 Not tainted 6.2.0-syzkaller-13154-g857f1268a591 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/16/2023 RIP: 0010:collapse_file mm/khugepaged.c:1823 [inline] RIP: 0010:hpage_collapse_scan_file+0x67c8/0x7580 mm/khugepaged.c:2233 Code: 00 00 89 de e8 c9 66 a3 ff 31 ff 89 de e8 c0 66 a3 ff 45 84 f6 0f 85 28 0d 00 00 e8 22 64 a3 ff e9 dc f7 ff ff e8 18 64 a3 ff <0f> 0b f3 0f 1e fa e8 0d 64 a3 ff e9 93 f6 ff ff f3 0f 1e fa 4c 89 RSP: 0018:ffffc90003dff4e0 EFLAGS: 00010093 RAX: ffffffff81e95988 RBX: 00000000000001c1 RCX: ffff8880205b3a80 RDX: 0000000000000000 RSI: 00000000000001c0 RDI: 00000000000001c1 RBP: ffffc90003dff830 R08: ffffffff81e90e67 R09: fffffbfff1a433c3 R10: 0000000000000000 R11: dffffc0000000001 R12: 0000000000000000 R13: ffffc90003dff6c0 R14: 00000000000001c0 R15: 0000000000000000 FS: 00007fdbae5ee700(0000) GS:ffff8880b9900000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fdbae6901e0 CR3: 000000007b2dd000 CR4: 00000000003506e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> madvise_collapse+0x721/0xf50 mm/khugepaged.c:2693 madvise_vma_behavior mm/madvise.c:1086 [inline] madvise_walk_vmas mm/madvise.c:1260 [inline] do_madvise+0x9e5/0x4680 mm/madvise.c:1439 __do_sys_madvise mm/madvise.c:1452 [inline] __se_sys_madvise mm/madvise.c:1450 [inline] __x64_sys_madvise+0xa5/0xb0 mm/madvise.c:1450 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd The xas_store() call during page cache scanning can potentially translate 'xas' into the error state (with the reproducer provided by the syzkaller the error code is -ENOMEM). However, there are no further checks after the 'xas_store', and the next call of 'xas_next' at the start of the scanning cycle doesn't increase the xa_index, and the issue occurs. This patch will add the xarray state error checking after the xas_store() and the corresponding result error code. Tested via syzbot. [akpm@linux-foundation.org: update include/trace/events/huge_memory.h's SCAN_STATUS] Link: https://lkml.kernel.org/r/20230329145330.23191-1-ivan.orlov0322@gmail.com Link: https://syzkaller.appspot.com/bug?id=7d6bb3760e026ece7524500fe44fb024a0e959fc Signed-off-by: Ivan Orlov <ivan.orlov0322@gmail.com> Reported-by: syzbot+9578faa5475acb35fa50@syzkaller.appspotmail.com Tested-by: Zach O'Keefe <zokeefe@google.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Himadri Pandya <himadrispandya@gmail.com> Cc: Ivan Orlov <ivan.orlov0322@gmail.com> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Song Liu <songliubraving@fb.com> Cc: Rik van Riel <riel@surriel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-29memcg_write_event_control(): fix a user-triggerable oopsAl Viro1-2/+5
commit 046667c4d3196938e992fba0dfcde570aa85cd0e upstream. we are *not* guaranteed that anything past the terminating NUL is mapped (let alone initialized with anything sane). Fixes: 0dea116876ee ("cgroup: implement eventfd-based generic API for notifications") Cc: stable@vger.kernel.org Cc: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-08-29mm/memory-failure: use raw_spinlock_t in struct memory_failure_cpuWaiman Long1-9/+11
commit d75abd0d0bc29e6ebfebbf76d11b4067b35844af upstream. The memory_failure_cpu structure is a per-cpu structure. Access to its content requires the use of get_cpu_var() to lock in the current CPU and disable preemption. The use of a regular spinlock_t for locking purpose is fine for a non-RT kernel. Since the integration of RT spinlock support into the v5.15 kernel, a spinlock_t in a RT kernel becomes a sleeping lock and taking a sleeping lock in a preemption disabled context is illegal resulting in the following kind of warning. [12135.732244] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48 [12135.732248] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 270076, name: kworker/0:0 [12135.732252] preempt_count: 1, expected: 0 [12135.732255] RCU nest depth: 2, expected: 2 : [12135.732420] Hardware name: Dell Inc. PowerEdge R640/0HG0J8, BIOS 2.10.2 02/24/2021 [12135.732423] Workqueue: kacpi_notify acpi_os_execute_deferred [12135.732433] Call Trace: [12135.732436] <TASK> [12135.732450] dump_stack_lvl+0x57/0x81 [12135.732461] __might_resched.cold+0xf4/0x12f [12135.732479] rt_spin_lock+0x4c/0x100 [12135.732491] memory_failure_queue+0x40/0xe0 [12135.732503] ghes_do_memory_failure+0x53/0x390 [12135.732516] ghes_do_proc.constprop.0+0x229/0x3e0 [12135.732575] ghes_proc+0xf9/0x1a0 [12135.732591] ghes_notify_hed+0x6a/0x150 [12135.732602] notifier_call_chain+0x43/0xb0 [12135.732626] blocking_notifier_call_chain+0x43/0x60 [12135.732637] acpi_ev_notify_dispatch+0x47/0x70 [12135.732648] acpi_os_execute_deferred+0x13/0x20 [12135.732654] process_one_work+0x41f/0x500 [12135.732695] worker_thread+0x192/0x360 [12135.732715] kthread+0x111/0x140 [12135.732733] ret_from_fork+0x29/0x50 [12135.732779] </TASK> Fix it by using a raw_spinlock_t for locking instead. Also move the pr_err() out of the lock critical section and after put_cpu_ptr() to avoid indeterminate latency and the possibility of sleep with this call. [longman@redhat.com: don't hold percpu ref across pr_err(), per Miaohe] Link: https://lkml.kernel.org/r/20240807181130.1122660-1-longman@redhat.com Link: https://lkml.kernel.org/r/20240806164107.1044956-1-longman@redhat.com Fixes: 0f383b6dc96e ("locking/spinlock: Provide RT variant") Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Len Brown <len.brown@intel.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-08-14mm/hugetlb: fix potential race in __update_and_free_hugetlb_folio()Miaohe Lin1-7/+7
commit 5596d9e8b553dacb0ac34bcf873cbbfb16c3ba3e upstream. There is a potential race between __update_and_free_hugetlb_folio() and try_memory_failure_hugetlb(): CPU1 CPU2 __update_and_free_hugetlb_folio try_memory_failure_hugetlb folio_test_hugetlb -- It's still hugetlb folio. folio_clear_hugetlb_hwpoison spin_lock_irq(&hugetlb_lock); __get_huge_page_for_hwpoison folio_set_hugetlb_hwpoison spin_unlock_irq(&hugetlb_lock); spin_lock_irq(&hugetlb_lock); __folio_clear_hugetlb(folio); -- Hugetlb flag is cleared but too late. spin_unlock_irq(&hugetlb_lock); When the above race occurs, raw error page info will be leaked. Even worse, raw error pages won't have hwpoisoned flag set and hit pcplists/buddy. Fix this issue by deferring folio_clear_hugetlb_hwpoison() until __folio_clear_hugetlb() is done. So all raw error pages will have hwpoisoned flag set. Link: https://lkml.kernel.org/r/20240708025127.107713-1-linmiaohe@huawei.com Fixes: 32c877191e02 ("hugetlb: do not clear hugetlb dtor until allocating vmemmap") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Muchun Song <muchun.song@linux.dev> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-08-14mm: huge_memory: use !CONFIG_64BIT to relax huge page alignment on 32 bit ↵Yang Shi1-1/+1
machines commit d9592025000b3cf26c742f3505da7b83aedc26d5 upstream. Yves-Alexis Perez reported commit 4ef9ad19e176 ("mm: huge_memory: don't force huge page alignment on 32 bit") didn't work for x86_32 [1]. It is because x86_32 uses CONFIG_X86_32 instead of CONFIG_32BIT. !CONFIG_64BIT should cover all 32 bit machines. [1] https://lore.kernel.org/linux-mm/CAHbLzkr1LwH3pcTgM+aGQ31ip2bKqiqEQ8=FQB+t2c3dhNKNHA@mail.gmail.com/ Link: https://lkml.kernel.org/r/20240712155855.1130330-1-yang@os.amperecomputing.com Fixes: 4ef9ad19e176 ("mm: huge_memory: don't force huge page alignment on 32 bit") Signed-off-by: Yang Shi <yang@os.amperecomputing.com> Reported-by: Yves-Alexis Perez <corsac@debian.org> Tested-by: Yves-Alexis Perez <corsac@debian.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Christoph Lameter <cl@linux.com> Cc: Jiri Slaby <jirislaby@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Salvatore Bonaccorso <carnil@debian.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: <stable@vger.kernel.org> [6.8+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-08-11mm/page_alloc: fix pcp->count race between drain_pages_zone() vs ↵Li Zhijian1-7/+11
__rmqueue_pcplist() [ Upstream commit 66eca1021a42856d6af2a9802c99e160278aed91 ] It's expected that no page should be left in pcp_list after calling zone_pcp_disable() in offline_pages(). Previously, it's observed that offline_pages() gets stuck [1] due to some pages remaining in pcp_list. Cause: There is a race condition between drain_pages_zone() and __rmqueue_pcplist() involving the pcp->count variable. See below scenario: CPU0 CPU1 ---------------- --------------- spin_lock(&pcp->lock); __rmqueue_pcplist() { zone_pcp_disable() { /* list is empty */ if (list_empty(list)) { /* add pages to pcp_list */ alloced = rmqueue_bulk() mutex_lock(&pcp_batch_high_lock) ... __drain_all_pages() { drain_pages_zone() { /* read pcp->count, it's 0 here */ count = READ_ONCE(pcp->count) /* 0 means nothing to drain */ /* update pcp->count */ pcp->count += alloced << order; ... ... spin_unlock(&pcp->lock); In this case, after calling zone_pcp_disable() though, there are still some pages in pcp_list. And these pages in pcp_list are neither movable nor isolated, offline_pages() gets stuck as a result. Solution: Expand the scope of the pcp->lock to also protect pcp->count in drain_pages_zone(), to ensure no pages are left in the pcp list after zone_pcp_disable() [1] https://lore.kernel.org/linux-mm/6a07125f-e720-404c-b2f9-e55f3f166e85@fujitsu.com/ Link: https://lkml.kernel.org/r/20240723064428.1179519-1-lizhijian@fujitsu.com Fixes: 4b23a68f9536 ("mm/page_alloc: protect PCP lists with a spinlock") Signed-off-by: Li Zhijian <lizhijian@fujitsu.com> Reported-by: Yao Xingtao <yaoxt.fnst@fujitsu.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-11mm: page_alloc: control latency caused by zone PCP drainingLucas Stach1-4/+7
[ Upstream commit 55f77df7d715110299f12c27f4365bd6332d1adb ] Patch series "mm/treewide: Remove pXd_huge() API", v2. In previous work [1], we removed the pXd_large() API, which is arch specific. This patchset further removes the hugetlb pXd_huge() API. Hugetlb was never special on creating huge mappings when compared with other huge mappings. Having a standalone API just to detect such pgtable entries is more or less redundant, especially after the pXd_leaf() API set is introduced with/without CONFIG_HUGETLB_PAGE. When looking at this problem, a few issues are also exposed that we don't have a clear definition of the *_huge() variance API. This patchset started by cleaning these issues first, then replace all *_huge() users to use *_leaf(), then drop all *_huge() code. On x86/sparc, swap entries will be reported "true" in pXd_huge(), while for all the rest archs they're reported "false" instead. This part is done in patch 1-5, in which I suspect patch 1 can be seen as a bug fix, but I'll leave that to hmm experts to decide. Besides, there are three archs (arm, arm64, powerpc) that have slightly different definitions between the *_huge() v.s. *_leaf() variances. I tackled them separately so that it'll be easier for arch experts to chim in when necessary. This part is done in patch 6-9. The final patches 10-14 do the rest on the final removal, since *_leaf() will be the ultimate API in the future, and we seem to have quite some confusions on how *_huge() APIs can be defined, provide a rich comment for *_leaf() API set to define them properly to avoid future misuse, and hopefully that'll also help new archs to start support huge mappings and avoid traps (like either swap entries, or PROT_NONE entry checks). [1] https://lore.kernel.org/r/20240305043750.93762-1-peterx@redhat.com This patch (of 14): When the complete PCP is drained a much larger number of pages than the usual batch size might be freed at once, causing large IRQ and preemption latency spikes, as they are all freed while holding the pcp and zone spinlocks. To avoid those latency spikes, limit the number of pages freed in a single bulk operation to common batch limits. Link: https://lkml.kernel.org/r/20240318200404.448346-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20240318200736.2835502-1-l.stach@pengutronix.de Signed-off-by: Lucas Stach <l.stach@pengutronix.de> Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bjorn Andersson <andersson@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Fabio Estevam <festevam@denx.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Konrad Dybcio <konrad.dybcio@linaro.org> Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Cc: Mark Salter <msalter@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Shawn Guo <shawnguo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Stable-dep-of: 66eca1021a42 ("mm/page_alloc: fix pcp->count race between drain_pages_zone() vs __rmqueue_pcplist()") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-11mm: restrict the pcp batch scale factor to avoid too long latencyHuang Ying2-1/+12
[ Upstream commit 52166607ecc980391b1fffbce0be3074a96d0c7b ] In page allocator, PCP (Per-CPU Pageset) is refilled and drained in batches to increase page allocation throughput, reduce page allocation/freeing latency per page, and reduce zone lock contention. But too large batch size will cause too long maximal allocation/freeing latency, which may punish arbitrary users. So the default batch size is chosen carefully (in zone_batchsize(), the value is 63 for zone > 1GB) to avoid that. In commit 3b12e7e97938 ("mm/page_alloc: scale the number of pages that are batch freed"), the batch size will be scaled for large number of page freeing to improve page freeing performance and reduce zone lock contention. Similar optimization can be used for large number of pages allocation too. To find out a suitable max batch scale factor (that is, max effective batch size), some tests and measurement on some machines were done as follows. A set of debug patches are implemented as follows, - Set PCP high to be 2 * batch to reduce the effect of PCP high - Disable free batch size scaling to get the raw performance. - The code with zone lock held is extracted from rmqueue_bulk() and free_pcppages_bulk() to 2 separate functions to make it easy to measure the function run time with ftrace function_graph tracer. - The batch size is hard coded to be 63 (default), 127, 255, 511, 1023, 2047, 4095. Then will-it-scale/page_fault1 is used to generate the page allocation/freeing workload. The page allocation/freeing throughput (page/s) is measured via will-it-scale. The page allocation/freeing average latency (alloc/free latency avg, in us) and allocation/freeing latency at 99 percentile (alloc/free latency 99%, in us) are measured with ftrace function_graph tracer. The test results are as follows, Sapphire Rapids Server ====================== Batch throughput free latency free latency alloc latency alloc latency page/s avg / us 99% / us avg / us 99% / us ----- ---------- ------------ ------------ ------------- ------------- 63 513633.4 2.33 3.57 2.67 6.83 127 517616.7 4.35 6.65 4.22 13.03 255 520822.8 8.29 13.32 7.52 25.24 511 524122.0 15.79 23.42 14.02 49.35 1023 525980.5 30.25 44.19 25.36 94.88 2047 526793.6 59.39 84.50 45.22 140.81 Ice Lake Server =============== Batch throughput free latency free latency alloc latency alloc latency page/s avg / us 99% / us avg / us 99% / us ----- ---------- ------------ ------------ ------------- ------------- 63 620210.3 2.21 3.68 2.02 4.35 127 627003.0 4.09 6.86 3.51 8.28 255 630777.5 7.70 13.50 6.17 15.97 511 633651.5 14.85 22.62 11.66 31.08 1023 637071.1 28.55 42.02 20.81 54.36 2047 638089.7 56.54 84.06 39.28 91