summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
2025-12-01mm/truncate: unmap large folio on split failureKiryl Shutsemau1-6/+29
commit fa04f5b60fda62c98a53a60de3a1e763f11feb41 upstream. Accesses within VMA, but beyond i_size rounded up to PAGE_SIZE are supposed to generate SIGBUS. This behavior might not be respected on truncation. During truncation, the kernel splits a large folio in order to reclaim memory. As a side effect, it unmaps the folio and destroys PMD mappings of the folio. The folio will be refaulted as PTEs and SIGBUS semantics are preserved. However, if the split fails, PMD mappings are preserved and the user will not receive SIGBUS on any accesses within the PMD. Unmap the folio on split failure. It will lead to refault as PTEs and preserve SIGBUS semantics. Make an exception for shmem/tmpfs that for long time intentionally mapped with PMDs across i_size. Link: https://lkml.kernel.org/r/20251027115636.82382-3-kirill@shutemov.name Fixes: b9a8a4195c7d ("truncate,shmem: Handle truncates that split large folios") Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Christian Brauner <brauner@kernel.org> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-12-01mm/mempool: fix poisoning order>0 pages with HIGHMEMVlastimil Babka1-6/+26
commit ec33b59542d96830e3c89845ff833cf7b25ef172 upstream. The kernel test has reported: BUG: unable to handle page fault for address: fffba000 #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page *pde = 03171067 *pte = 00000000 Oops: Oops: 0002 [#1] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Tainted: G T 6.18.0-rc2-00031-gec7f31b2a2d3 #1 NONE a1d066dfe789f54bc7645c7989957d2bdee593ca Tainted: [T]=RANDSTRUCT Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 EIP: memset (arch/x86/include/asm/string_32.h:168 arch/x86/lib/memcpy_32.c:17) Code: a5 8b 4d f4 83 e1 03 74 02 f3 a4 83 c4 04 5e 5f 5d 2e e9 73 41 01 00 90 90 90 3e 8d 74 26 00 55 89 e5 57 56 89 c6 89 d0 89 f7 <f3> aa 89 f0 5e 5f 5d 2e e9 53 41 01 00 cc cc cc 55 89 e5 53 57 56 EAX: 0000006b EBX: 00000015 ECX: 001fefff EDX: 0000006b ESI: fffb9000 EDI: fffba000 EBP: c611fbf0 ESP: c611fbe8 DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068 EFLAGS: 00010287 CR0: 80050033 CR2: fffba000 CR3: 0316e000 CR4: 00040690 Call Trace: poison_element (mm/mempool.c:83 mm/mempool.c:102) mempool_init_node (mm/mempool.c:142 mm/mempool.c:226) mempool_init_noprof (mm/mempool.c:250 (discriminator 1)) ? mempool_alloc_pages (mm/mempool.c:640) bio_integrity_initfn (block/bio-integrity.c:483 (discriminator 8)) ? mempool_alloc_pages (mm/mempool.c:640) do_one_initcall (init/main.c:1283) Christoph found out this is due to the poisoning code not dealing properly with CONFIG_HIGHMEM because only the first page is mapped but then the whole potentially high-order page is accessed. We could give up on HIGHMEM here, but it's straightforward to fix this with a loop that's mapping, poisoning or checking and unmapping individual pages. Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202511111411.9ebfa1ba-lkp@intel.com Analyzed-by: Christoph Hellwig <hch@lst.de> Fixes: bdfedb76f4f5 ("mm, mempool: poison elements backed by slab allocator") Cc: stable@vger.kernel.org Tested-by: kernel test robot <oliver.sang@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20251113-mempool-poison-v1-1-233b3ef984c3@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-12-01shmem: fix tmpfs reconfiguration (remount) when noswap is setMike Yuan1-8/+7
commit 3cd1548a278c7d6a9bdef1f1866e7cf66bfd3518 upstream. In systemd we're trying to switch the internal credentials setup logic to new mount API [1], and I noticed fsconfig(FSCONFIG_CMD_RECONFIGURE) consistently fails on tmpfs with noswap option. This can be trivially reproduced with the following: ``` int fs_fd = fsopen("tmpfs", 0); fsconfig(fs_fd, FSCONFIG_SET_FLAG, "noswap", NULL, 0); fsconfig(fs_fd, FSCONFIG_CMD_CREATE, NULL, NULL, 0); fsmount(fs_fd, 0, 0); fsconfig(fs_fd, FSCONFIG_CMD_RECONFIGURE, NULL, NULL, 0); <------ EINVAL ``` After some digging the culprit is shmem_reconfigure() rejecting !(ctx->seen & SHMEM_SEEN_NOSWAP) && sbinfo->noswap, which is bogus as ctx->seen serves as a mask for whether certain options are touched at all. On top of that, noswap option doesn't use fsparam_flag_no, hence it's not really possible to "reenable" swap to begin with. Drop the check and redundant SHMEM_SEEN_NOSWAP flag. [1] https://github.com/systemd/systemd/pull/39637 Fixes: 2c6efe9cf2d7 ("shmem: add support to ignore swap") Signed-off-by: Mike Yuan <me@yhndnzj.com> Link: https://patch.msgid.link/20251108190930.440685-1-me@yhndnzj.com Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: stable@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-24mm/memory: do not populate page table entries beyond i_sizeKiryl Shutsemau2-6/+34
commit 74207de2ba10c2973334906822dc94d2e859ffc5 upstream. Patch series "Fix SIGBUS semantics with large folios", v3. Accessing memory within a VMA, but beyond i_size rounded up to the next page size, is supposed to generate SIGBUS. Darrick reported[1] an xfstests regression in v6.18-rc1. generic/749 failed due to missing SIGBUS. This was caused by my recent changes that try to fault in the whole folio where possible: 19773df031bc ("mm/fault: try to map the entire file folio in finish_fault()") 357b92761d94 ("mm/filemap: map entire large folio faultaround") These changes did not consider i_size when setting up PTEs, leading to xfstest breakage. However, the problem has been present in the kernel for a long time - since huge tmpfs was introduced in 2016. The kernel happily maps PMD-sized folios as PMD without checking i_size. And huge=always tmpfs allocates PMD-size folios on any writes. I considered this corner case when I implemented a large tmpfs, and my conclusion was that no one in their right mind should rely on receiving a SIGBUS signal when accessing beyond i_size. I cannot imagine how it could be useful for the workload. But apparently filesystem folks care a lot about preserving strict SIGBUS semantics. Generic/749 was introduced last year with reference to POSIX, but no real workloads were mentioned. It also acknowledged the tmpfs deviation from the test case. POSIX indeed says[3]: References within the address range starting at pa and continuing for len bytes to whole pages following the end of an object shall result in delivery of a SIGBUS signal. The patchset fixes the regression introduced by recent changes as well as more subtle SIGBUS breakage due to split failure on truncation. This patch (of 2): Accesses within VMA, but beyond i_size rounded up to PAGE_SIZE are supposed to generate SIGBUS. Recent changes attempted to fault in full folio where possible. They did not respect i_size, which led to populating PTEs beyond i_size and breaking SIGBUS semantics. Darrick reported generic/749 breakage because of this. However, the problem existed before the recent changes. With huge=always tmpfs, any write to a file leads to PMD-size allocation. Following the fault-in of the folio will install PMD mapping regardless of i_size. Fix filemap_map_pages() and finish_fault() to not install: - PTEs beyond i_size; - PMD mappings across i_size; Make an exception for shmem/tmpfs that for long time intentionally mapped with PMDs across i_size. Link: https://lkml.kernel.org/r/20251027115636.82382-1-kirill@shutemov.name Link: https://lkml.kernel.org/r/20251027115636.82382-2-kirill@shutemov.name Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Fixes: 6795801366da ("xfs: Support large folios") Reported-by: "Darrick J. Wong" <djwong@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-24mm/huge_memory: do not change split_huge_page*() target order silentlyZi Yan2-10/+5
commit 77008e1b2ef73249bceb078a321a3ff6bc087afb upstream. Page cache folios from a file system that support large block size (LBS) can have minimal folio order greater than 0, thus a high order folio might not be able to be split down to order-0. Commit e220917fa507 ("mm: split a folio in minimum folio order chunks") bumps the target order of split_huge_page*() to the minimum allowed order when splitting a LBS folio. This causes confusion for some split_huge_page*() callers like memory failure handling code, since they expect after-split folios all have order-0 when split succeeds but in reality get min_order_for_split() order folios and give warnings. Fix it by failing a split if the folio cannot be split to the target order. Rename try_folio_split() to try_folio_split_to_order() to reflect the added new_order parameter. Remove its unused list parameter. [The test poisons LBS folios, which cannot be split to order-0 folios, and also tries to poison all memory. The non split LBS folios take more memory than the test anticipated, leading to OOM. The patch fixed the kernel warning and the test needs some change to avoid OOM.] Link: https://lkml.kernel.org/r/20251017013630.139907-1-ziy@nvidia.com Fixes: e220917fa507 ("mm: split a folio in minimum folio order chunks") Signed-off-by: Zi Yan <ziy@nvidia.com> Reported-by: syzbot+e6367ea2fdab6ed46056@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/68d2c943.a70a0220.1b52b.02b3.GAE@google.com/ Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Pankaj Raghav <p.raghav@samsung.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-24codetag: debug: handle existing CODETAG_EMPTY in mark_objexts_empty for ↵Hao Ge1-1/+5
slabobj_ext commit 1abbdf3d57aa964e572940d67c9ec5dc87710738 upstream. When alloc_slab_obj_exts() fails and then later succeeds in allocating a slab extension vector, it calls handle_failed_objexts_alloc() to mark all objects in the vector as empty. As a result all objects in this slab (slabA) will have their extensions set to CODETAG_EMPTY. Later on if this slabA is used to allocate a slabobj_ext vector for another slab (slabB), we end up with the slabB->obj_exts pointing to a slabobj_ext vector that itself has a non-NULL slabobj_ext equal to CODETAG_EMPTY. When slabB gets freed, free_slab_obj_exts() is called to free slabB->obj_exts vector. free_slab_obj_exts() calls mark_objexts_empty(slabB->obj_exts) which will generate a warning because it expects slabobj_ext vectors to have a NULL obj_ext, not CODETAG_EMPTY. Modify mark_objexts_empty() to skip the warning and setting the obj_ext value if it's already set to CODETAG_EMPTY. To quickly detect this WARN, I modified the code from WARN_ON(slab_exts[offs].ref.ct) to BUG_ON(slab_exts[offs].ref.ct == 1); We then obtained this message: [21630.898561] ------------[ cut here ]------------ [21630.898596] kernel BUG at mm/slub.c:2050! [21630.898611] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP [21630.900372] Modules linked in: squashfs isofs vfio_iommu_type1 vhost_vsock vfio vhost_net vmw_vsock_virtio_transport_common vhost tap vhost_iotlb iommufd vsock binfmt_misc nfsv3 nfs_acl nfs lockd grace netfs tls rds dns_resolver tun brd overlay ntfs3 exfat btrfs blake2b_generic xor xor_neon raid6_pq loop sctp ip6_udp_tunnel udp_tunnel nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables rfkill ip_set sunrpc vfat fat joydev sg sch_fq_codel nfnetlink virtio_gpu sr_mod cdrom drm_client_lib virtio_dma_buf drm_shmem_helper drm_kms_helper drm ghash_ce backlight virtio_net virtio_blk virtio_scsi net_failover virtio_console failover virtio_mmio dm_mirror dm_region_hash dm_log dm_multipath dm_mod fuse i2c_dev virtio_pci virtio_pci_legacy_dev virtio_pci_modern_dev virtio virtio_ring autofs4 aes_neon_bs aes_ce_blk [last unloaded: hwpoison_inject] [21630.909177] CPU: 3 UID: 0 PID: 3787 Comm: kylin-process-m Kdump: loaded Tainted: G        W           6.18.0-rc1+ #74 PREEMPT(voluntary) [21630.910495] Tainted: [W]=WARN [21630.910867] Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022 [21630.911625] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [21630.912392] pc : __free_slab+0x228/0x250 [21630.912868] lr : __free_slab+0x18c/0x250[21630.913334] sp : ffff8000a02f73e0 [21630.913830] x29: ffff8000a02f73e0 x28: fffffdffc43fc800 x27: ffff0000c0011c40 [21630.914677] x26: ffff0000c000cac0 x25: ffff00010fe5e5f0 x24: ffff000102199b40 [21630.915469] x23: 0000000000000003 x22: 0000000000000003 x21: ffff0000c0011c40 [21630.916259] x20: fffffdffc4086600 x19: fffffdffc43fc800 x18: 0000000000000000 [21630.917048] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 [21630.917837] x14: 0000000000000000 x13: 0000000000000000 x12: ffff70001405ee66 [21630.918640] x11: 1ffff0001405ee65 x10: ffff70001405ee65 x9 : ffff800080a295dc [21630.919442] x8 : ffff8000a02f7330 x7 : 0000000000000000 x6 : 0000000000003000 [21630.920232] x5 : 0000000024924925 x4 : 0000000000000001 x3 : 0000000000000007 [21630.921021] x2 : 0000000000001b40 x1 : 000000000000001f x0 : 0000000000000001 [21630.921810] Call trace: [21630.922130]  __free_slab+0x228/0x250 (P) [21630.922669]  free_slab+0x38/0x118 [21630.923079]  free_to_partial_list+0x1d4/0x340 [21630.923591]  __slab_free+0x24c/0x348 [21630.924024]  ___cache_free+0xf0/0x110 [21630.924468]  qlist_free_all+0x78/0x130 [21630.924922]  kasan_quarantine_reduce+0x114/0x148 [21630.925525]  __kasan_slab_alloc+0x7c/0xb0 [21630.926006]  kmem_cache_alloc_noprof+0x164/0x5c8 [21630.926699]  __alloc_object+0x44/0x1f8 [21630.927153]  __create_object+0x34/0xc8 [21630.927604]  kmemleak_alloc+0xb8/0xd8 [21630.928052]  kmem_cache_alloc_noprof+0x368/0x5c8 [21630.928606]  getname_flags.part.0+0xa4/0x610 [21630.929112]  getname_flags+0x80/0xd8 [21630.929557]  vfs_fstatat+0xc8/0xe0 [21630.929975]  __do_sys_newfstatat+0xa0/0x100 [21630.930469]  __arm64_sys_newfstatat+0x90/0xd8 [21630.931046]  invoke_syscall+0xd4/0x258 [21630.931685]  el0_svc_common.constprop.0+0xb4/0x240 [21630.932467]  do_el0_svc+0x48/0x68 [21630.932972]  el0_svc+0x40/0xe0 [21630.933472]  el0t_64_sync_handler+0xa0/0xe8 [21630.934151]  el0t_64_sync+0x1ac/0x1b0 [21630.934923] Code: aa1803e0 97ffef2b a9446bf9 17ffff9c (d4210000) [21630.936461] SMP: stopping secondary CPUs [21630.939550] Starting crashdump kernel... [21630.940108] Bye! Link: https://lkml.kernel.org/r/20251029014317.1533488-1-hao.ge@linux.dev Fixes: 09c46563ff6d ("codetag: debug: introduce OBJEXTS_ALLOC_FAIL to mark failed slab_ext allocations") Signed-off-by: Hao Ge <gehao@kylinos.cn> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Christoph Lameter (Ampere) <cl@gentwo.org> Cc: David Rientjes <rientjes@google.com> Cc: gehao <gehao@kylinos.cn> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-24mm/huge_memory: fix folio split check for anon folios in swapcacheZi Yan1-2/+4
commit f1d47cafe513b5552a5b20a7af0936d9070a8a78 upstream. Both uniform and non uniform split check missed the check to prevent splitting anon folios in swapcache to non-zero order. Splitting anon folios in swapcache to non-zero order can cause data corruption since swapcache only support PMD order and order-0 entries. This can happen when one use split_huge_pages under debugfs to split anon folios in swapcache. In-tree callers do not perform such an illegal operation. Only debugfs interface could trigger it. I will put adding a test case on my TODO list. Fix the check. Link: https://lkml.kernel.org/r/20251105162910.752266-1-ziy@nvidia.com Fixes: 58729c04cf10 ("mm/huge_memory: add buddy allocator like (non-uniform) folio_split()") Signed-off-by: Zi Yan <ziy@nvidia.com> Reported-by: "David Hildenbrand (Red Hat)" <david@kernel.org> Closes: https://lore.kernel.org/all/dc0ecc2c-4089-484f-917f-920fdca4c898@kernel.org/ Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-24mm, swap: fix potential UAF issue for VMA readaheadKairui Song1-0/+13
commit 1c2a936edd71e133f2806e68324ec81a4eb07588 upstream. Since commit 78524b05f1a3 ("mm, swap: avoid redundant swap device pinning"), the common helper for allocating and preparing a folio in the swap cache layer no longer tries to get a swap device reference internally, because all callers of __read_swap_cache_async are already holding a swap entry reference. The repeated swap device pinning isn't needed on the same swap device. Caller of VMA readahead is also holding a reference to the target entry's swap device, but VMA readahead walks the page table, so it might encounter swap entries from other devices, and call __read_swap_cache_async on another device without holding a reference to it. So it is possible to cause a UAF when swapoff of device A raced with swapin on device B, and VMA readahead tries to read swap entries from device A. It's not easy to trigger, but in theory, it could cause real issues. Make VMA readahead try to get the device reference first if the swap device is a different one from the target entry. Link: https://lkml.kernel.org/r/20251111-swap-fix-vma-uaf-v1-1-41c660e58562@tencent.com Fixes: 78524b05f1a3 ("mm, swap: avoid redundant swap device pinning") Suggested-by: Huang Ying <ying.huang@linux.alibaba.com> Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-24mm/mremap: honour writable bit in mremap pte batchingDev Jain1-1/+1
commit 04d1c9d60c6ec4c0003d433572eaa45f8b217788 upstream. Currently mremap folio pte batch ignores the writable bit during figuring out a set of similar ptes mapping the same folio. Suppose that the first pte of the batch is writable while the others are not - set_ptes will end up setting the writable bit on the other ptes, which is a violation of mremap semantics. Therefore, use FPB_RESPECT_WRITE to check the writable bit while determining the pte batch. Link: https://lkml.kernel.org/r/20251028063952.90313-1-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Fixes: f822a9a81a31 ("mm: optimize mremap() by PTE batching") Reported-by: David Hildenbrand <david@redhat.com> Debugged-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Barry Song <baohua@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> [6.17+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-24mm/shmem: fix THP allocation and fallback loopKairui Song1-3/+6
commit fc745ff317566ec299e16346ebb9eacc8fe5b9d2 upstream. The order check and fallback loop is updating the index value on every loop. This will cause the index to be wrongly aligned by a larger value while the loop shrinks the order. This may result in inserting and returning a folio of the wrong index and cause data corruption with some userspace workloads [1]. [kasong@tencent.com: introduce a temporary variable to improve code] Link: https://lkml.kernel.org/r/20251023065913.36925-1-ryncsn@gmail.com Link: https://lore.kernel.org/linux-mm/CAMgjq7DqgAmj25nDUwwu1U2cSGSn8n4-Hqpgottedy0S6YYeUw@mail.gmail.com/ [1] Link: https://lkml.kernel.org/r/20251022105719.18321-1-ryncsn@gmail.com Link: https://lore.kernel.org/linux-mm/CAMgjq7DqgAmj25nDUwwu1U2cSGSn8n4-Hqpgottedy0S6YYeUw@mail.gmail.com/ [1] Fixes: e7a2ab7b3bb5 ("mm: shmem: add mTHP support for anonymous shmem") Closes: https://lore.kernel.org/linux-mm/CAMgjq7DqgAmj25nDUwwu1U2cSGSn8n4-Hqpgottedy0S6YYeUw@mail.gmail.com/ Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-24mm/kmsan: fix kmsan kmalloc hook when no stack depots are allocated yetAleksei Nikiforov3-6/+5
commit 7e76b75e5ab3339bebab3a4738226cd9b27d8c42 upstream. If no stack depot is allocated yet, due to masking out __GFP_RECLAIM flags kmsan called from kmalloc cannot allocate stack depot. kmsan fails to record origin and report issues. This may result in KMSAN failing to report issues. Reusing flags from kmalloc without modifying them should be safe for kmsan. For example, such chain of calls is possible: test_uninit_kmalloc -> kmalloc -> __kmalloc_cache_noprof -> slab_alloc_node -> slab_post_alloc_hook -> kmsan_slab_alloc -> kmsan_internal_poison_memory. Only when it is called in a context without flags present should __GFP_RECLAIM flags be masked. With this change all kmsan tests start working reliably. Eric reported: : Yes, KMSAN seems to be at least partially broken currently. Besides the : fact that the kmsan KUnit test is currently failing (which I reported at : https://lore.kernel.org/r/20250911175145.GA1376@sol), I've confirmed that : the poly1305 KUnit test causes a KMSAN warning with Aleksei's patch : applied but does not cause a warning without it. The warning did get : reached via syzbot somehow : (https://lore.kernel.org/r/751b3d80293a6f599bb07770afcef24f623c7da0.1761026343.git.xiaopei01@kylinos.cn/), : so KMSAN must still work in some cases. But it didn't work for me. Link: https://lkml.kernel.org/r/20250930115600.709776-2-aleksei.nikiforov@linux.ibm.com Link: https://lkml.kernel.org/r/20251022030213.GA35717@sol Fixes: 97769a53f117 ("mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation") Signed-off-by: Aleksei Nikiforov <aleksei.nikiforov@linux.ibm.com> Reviewed-by: Alexander Potapenko <glider@google.com> Tested-by: Eric Biggers <ebiggers@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Ilya Leoshkevich <iii@linux.ibm.com> Cc: Marco Elver <elver@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-24mm/damon/stat: change last_refresh_jiffies to a global variableQuanmin Yan1-3/+6
commit 2f6ce7e714ef842e43120ecd6a7ed287b502026d upstream. Patch series "mm/damon: fixes for the jiffies-related issues", v2. On 32-bit systems, the kernel initializes jiffies to "-5 minutes" to make jiffies wrap bugs appear earlier. However, this may cause the time_before() series of functions to return unexpected values, resulting in DAMON not functioning as intended. Meanwhile, similar issues exist in some specific user operation scenarios. This patchset addresses these issues. The first patch is about the DAMON_STAT module, and the second patch is about the core layer's sysfs. This patch (of 2): In DAMON_STAT's damon_stat_damon_call_fn(), time_before_eq() is used to avoid unnecessarily frequent stat update. On 32-bit systems, the kernel initializes jiffies to "-5 minutes" to make jiffies wrap bugs appear earlier. However, this causes time_before_eq() in DAMON_STAT to unexpectedly return true during the first 5 minutes after boot on 32-bit systems (see [1] for more explanation, which fixes another jiffies-related issue before). As a result, DAMON_STAT does not update any monitoring results during that period, which becomes more confusing when DAMON_STAT_ENABLED_DEFAULT is enabled. There is also an issue unrelated to the system's word size[2]: if the user stops DAMON_STAT just after last_refresh_jiffies is updated and restarts it after 5 seconds or a longer delay, last_refresh_jiffies will retain an older value, causing time_before_eq() to return false and the update to happen earlier than expected. Fix these issues by making last_refresh_jiffies a global variable and initializing it each time DAMON_STAT is started. Link: https://lkml.kernel.org/r/20251030020746.967174-2-yanquanmin1@huawei.com Link: https://lkml.kernel.org/r/20250822025057.1740854-1-ekffu200098@gmail.com [1] Link: https://lore.kernel.org/all/20251028143250.50144-1-sj@kernel.org/ [2] Fixes: fabdd1e911da ("mm/damon/stat: calculate and expose estimated memory bandwidth") Signed-off-by: Quanmin Yan <yanquanmin1@huawei.com> Suggested-by: SeongJae Park <sj@kernel.org> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: ze zuo <zuoze1@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-24mm/mm_init: fix hash table order logging in alloc_large_system_hash()Isaac J. Manjarres1-1/+1
commit 0d6c356dd6547adac2b06b461528e3573f52d953 upstream. When emitting the order of the allocation for a hash table, alloc_large_system_hash() unconditionally subtracts PAGE_SHIFT from log base 2 of the allocation size. This is not correct if the allocation size is smaller than a page, and yields a negative value for the order as seen below: TCP established hash table entries: 32 (order: -4, 256 bytes, linear) TCP bind hash table entries: 32 (order: -2, 1024 bytes, linear) Use get_order() to compute the order when emitting the hash table information to correctly handle cases where the allocation size is smaller than a page: TCP established hash table entries: 32 (order: 0, 256 bytes, linear) TCP bind hash table entries: 32 (order: 0, 1024 bytes, linear) Link: https://lkml.kernel.org/r/20251028191020.413002-1-isaacmanjarres@google.com Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Isaac J. Manjarres <isaacmanjarres@google.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-24mm/huge_memory: preserve PG_has_hwpoisoned if a folio is split to >0 orderZi Yan1-3/+20
commit fa5a061700364bc28ee1cb1095372f8033645dcb upstream. folio split clears PG_has_hwpoisoned, but the flag should be preserved in after-split folios containing pages with PG_hwpoisoned flag if the folio is split to >0 order folios. Scan all pages in a to-be-split folio to determine which after-split folios need the flag. An alternatives is to change PG_has_hwpoisoned to PG_maybe_hwpoisoned to avoid the scan and set it on all after-split folios, but resulting false positive has undesirable negative impact. To remove false positive, caller of folio_test_has_hwpoisoned() and folio_contain_hwpoisoned_page() needs to do the scan. That might be causing a hassle for current and future callers and more costly than doing the scan in the split code. More details are discussed in [1]. This issue can be exposed via: 1. splitting a has_hwpoisoned folio to >0 order from debugfs interface; 2. truncating part of a has_hwpoisoned folio in truncate_inode_partial_folio(). And later accesses to a hwpoisoned page could be possible due to the missing has_hwpoisoned folio flag. This will lead to MCE errors. Link: https://lore.kernel.org/all/CAHbLzkoOZm0PXxE9qwtF4gKR=cpRXrSrJ9V9Pm2DJexs985q4g@mail.gmail.com/ [1] Link: https://lkml.kernel.org/r/20251023030521.473097-1-ziy@nvidia.com Fixes: c010d47f107f ("mm: thp: split huge page to any lower order pages") Signed-off-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Yang Shi <yang@os.amperecomputing.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Cc: Pankaj Raghav <kernel@pankajraghav.com> Cc: Barry Song <baohua@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Luis Chamberalin <mcgrof@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-24mm/secretmem: fix use-after-free race in fault handlerLance Yang1-1/+1
commit 6f86d0534fddfbd08687fa0f01479d4226bc3c3d upstream. When a page fault occurs in a secret memory file created with `memfd_secret(2)`, the kernel will allocate a new folio for it, mark the underlying page as not-present in the direct map, and add it to the file mapping. If two tasks cause a fault in the same page concurrently, both could end up allocating a folio and removing the page from the direct map, but only one would succeed in adding the folio to the file mapping. The task that failed undoes the effects of its attempt by (a) freeing the folio again and (b) putting the page back into the direct map. However, by doing these two operations in this order, the page becomes available to the allocator again before it is placed back in the direct mapping. If another task attempts to allocate the page between (a) and (b), and the kernel tries to access it via the direct map, it would result in a supervisor not-present page fault. Fix the ordering to restore the direct map before the folio is freed. Link: https://lkml.kernel.org/r/20251031120955.92116-1-lance.yang@linux.dev Fixes: 1507f51255c9 ("mm: introduce memfd_secret system call to create "secret" memory areas") Signed-off-by: Lance Yang <lance.yang@linux.dev> Reported-by: Google Big Sleep <big-sleep-vuln-reports@google.com> Closes: https://lore.kernel.org/linux-mm/CAEXGt5QeDpiHTu3K9tvjUTPqo+d-=wuCNYPa+6sWKrdQJ-ATdg@mail.gmail.com/ Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-24mm/damon/sysfs: change next_update_jiffies to a global variableQuanmin Yan1-3/+7
commit 9fd7bb5083d1e1027b8ac1e365c29921ab88b177 upstream. In DAMON's damon_sysfs_repeat_call_fn(), time_before() is used to compare the current jiffies with next_update_jiffies to determine whether to update the sysfs files at this moment. On 32-bit systems, the kernel initializes jiffies to "-5 minutes" to make jiffies wrap bugs appear earlier. However, this causes time_before() in damon_sysfs_repeat_call_fn() to unexpectedly return true during the first 5 minutes after boot on 32-bit systems (see [1] for more explanation, which fixes another jiffies-related issue before). As a result, DAMON does not update sysfs files during that period. There is also an issue unrelated to the system's word size[2]: if the user stops DAMON just after next_update_jiffies is updated and restarts it after 'refresh_ms' or a longer delay, next_update_jiffies will retain an older value, causing time_before() to return false and the update to happen earlier than expected. Fix these issues by making next_update_jiffies a global variable and initializing it each time DAMON is started. Link: https://lkml.kernel.org/r/20251030020746.967174-3-yanquanmin1@huawei.com Link: https://lkml.kernel.org/r/20250822025057.1740854-1-ekffu200098@gmail.com [1] Link: https://lore.kernel.org/all/20251029013038.66625-1-sj@kernel.org/ [2] Fixes: d809a7c64ba8 ("mm/damon/sysfs: implement refresh_ms file internal work") Suggested-by: SeongJae Park <sj@kernel.org> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Quanmin Yan <yanquanmin1@huawei.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: ze zuo <zuoze1@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-24ksm: use range-walk function to jump over holes in scan_get_next_rmap_itemPedro Demarchi Gomes1-9/+104
commit f5548c318d6520d4fa3c5ed6003eeb710763cbc5 upstream. Currently, scan_get_next_rmap_item() walks every page address in a VMA to locate mergeable pages. This becomes highly inefficient when scanning large virtual memory areas that contain mostly unmapped regions, causing ksmd to use large amount of cpu without deduplicating much pages. This patch replaces the per-address lookup with a range walk using walk_page_range(). The range walker allows KSM to skip over entire unmapped holes in a VMA, avoiding unnecessary lookups. This problem was previously discussed in [1]. Consider the following test program which creates a 32 TiB mapping in the virtual address space but only populates a single page: #include <unistd.h> #include <stdio.h> #include <sys/mman.h> /* 32 TiB */ const size_t size = 32ul * 1024 * 1024 * 1024 * 1024; int main() { char *area = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_NORESERVE | MAP_PRIVATE | MAP_ANON, -1, 0); if (area == MAP_FAILED) { perror("mmap() failed\n"); return -1; } /* Populate a single page such that we get an anon_vma. */ *area = 0; /* Enable KSM. */ madvise(area, size, MADV_MERGEABLE); pause(); return 0; } $ ./ksm-sparse & $ echo 1 > /sys/kernel/mm/ksm/run Without this patch ksmd uses 100% of the cpu for a long time (more then 1 hour in my test machine) scanning all the 32 TiB virtual address space that contain only one mapped page. This makes ksmd essentially deadlocked not able to deduplicate anything of value. With this patch ksmd walks only the one mapped page and skips the rest of the 32 TiB virtual address space, making the scan fast using little cpu. Link: https://lkml.kernel.org/r/20251023035841.41406-1-pedrodemargomes@gmail.com Link: https://lkml.kernel.org/r/20251022153059.22763-1-pedrodemargomes@gmail.com Link: https://lore.kernel.org/linux-mm/423de7a3-1c62-4e72-8e79-19a6413e420c@redhat.com/ [1] Fixes: 31dbd01f3143 ("ksm: Kernel SamePage Merging") Signed-off-by: Pedro Demarchi Gomes <pedrodemargomes@gmail.com> Co-developed-by: David Hildenbrand <david@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: craftfever <craftfever@airmail.cc> Closes: https://lkml.kernel.org/r/020cf8de6e773bb78ba7614ef250129f11a63781@murena.io Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: xu xin <xu.xin16@zte.com.cn> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-29treewide: remove MIGRATEPAGE_SUCCESSDavid Hildenbrand3-24/+22
[ Upstream commit fb49a4425cfa163faccd91f913773d3401d3a7d4 ] At this point MIGRATEPAGE_SUCCESS is misnamed for all folio users, and now that we remove MIGRATEPAGE_UNMAP, it's really the only "success" return value that the code uses and expects. Let's just get rid of MIGRATEPAGE_SUCCESS completely and just use "0" for success. Link: https://lkml.kernel.org/r/20250811143949.1117439-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> [mm] Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com> [jfs] Acked-by: David Sterba <dsterba@suse.com> [btrfs] Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Byungchul Park <byungchul@sk.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Chris Mason <clm@fb.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Kleikamp <shaggy@kernel.org> Cc: Eugenio Pé rez <eperezma@redhat.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Wang <jasowang@redhat.com> Cc: Jerrin Shaji George <jerrin.shaji-george@broadcom.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: Lance Yang <lance.yang@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Stable-dep-of: 4ba5a8a7faa6 ("vmw_balloon: indicate success when effectively deflating during migration") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-29mm/migrate: remove MIGRATEPAGE_UNMAPDavid Hildenbrand1-23/+22
[ Upstream commit 95c2908f1a4fd608b1cdbb5acef3572e5d769e1c ] migrate_folio_unmap() is the only user of MIGRATEPAGE_UNMAP. We want to remove MIGRATEPAGE_* completely. It's rather weird to have a generic MIGRATEPAGE_UNMAP, documented to be returned from address-space callbacks, when it's only used for an internal helper. Let's start by having only a single "success" return value for migrate_folio_unmap() -- 0 -- by moving the "folio was already freed" check into the single caller. There is a remaining comment for PG_isolated, which we renamed to PG_movable_ops_isolated recently and forgot to update. While we might still run into that case with zsmalloc, it's something we want to get rid of soon. So let's just focus that optimization on real folios only for now by excluding movable_ops pages. Note that concurrent freeing can happen at any time and this "already freed" check is not relevant for correctness. [david@redhat.com: no need to pass "reason" to migrate_folio_unmap(), per Lance] Link: https://lkml.kernel.org/r/3bb725f8-28d7-4aa2-b75f-af40d5cab280@redhat.com Link: https://lkml.kernel.org/r/20250811143949.1117439-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Byungchul Park <byungchul@sk.com> Cc: Chris Mason <clm@fb.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Kleikamp <shaggy@kernel.org> Cc: David Sterba <dsterba@suse.com> Cc: Eugenio Pé rez <eperezma@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Wang <jasowang@redhat.com> Cc: Jerrin Shaji George <jerrin.shaji-george@broadcom.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Stable-dep-of: 4ba5a8a7faa6 ("vmw_balloon: indicate success when effectively deflating during migration") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-29mm/damon/sysfs: dealloc commit test ctx alwaysSeongJae Park1-3/+2
commit 139e7a572af0b45f558b5e502121a768dc328ba8 upstream. The damon_ctx for testing online DAMON parameters commit inputs is deallocated only when the test fails. This means memory is leaked for every successful online DAMON parameters commit. Fix the leak by always deallocating it. Link: https://lkml.kernel.org/r/20251003201455.41448-3-sj@kernel.org Fixes: 4c9ea539ad59 ("mm/damon/sysfs: validate user inputs from damon_sysfs_commit_input()") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> [6.15+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-29mm/damon/sysfs: catch commit test ctx alloc failureSeongJae Park1-0/+2
commit f0c5118ebb0eb7e4fd6f0d2ace3315ca141b317f upstream. Patch series "mm/damon/sysfs: fix commit test damon_ctx [de]allocation". DAMON sysfs interface dynamically allocates and uses a damon_ctx object for testing if given inputs for online DAMON parameters update is valid. The object is being used without an allocation failure check, and leaked when the test succeeds. Fix the two bugs. This patch (of 2): The damon_ctx for testing online DAMON parameters commit inputs is used without its allocation failure check. This could result in an invalid memory access. Fix it by directly returning an error when the allocation failed. Link: https://lkml.kernel.org/r/20251003201455.41448-1-sj@kernel.org Link: https://lkml.kernel.org/r/20251003201455.41448-2-sj@kernel.org Fixes: 4c9ea539ad59 ("mm/damon/sysfs: validate user inputs from damon_sysfs_commit_input()") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> [6.15+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-29mm/damon/core: fix potential memory leak by cleaning ops_filter in ↵Enze Li1-0/+3
damon_destroy_scheme commit 7071537159be845a5c4ed5fb7d3db25aa4bd04a3 upstream. Currently, damon_destroy_scheme() only cleans up the filter list but leaves ops_filter untouched, which could lead to memory leaks when a scheme is destroyed. This patch ensures both filter and ops_filter are properly freed in damon_destroy_scheme(), preventing potential memory leaks. Link: https://lkml.kernel.org/r/20251014084225.313313-1-lienze@kylinos.cn Fixes: ab82e57981d0 ("mm/damon/core: introduce damos->ops_filters") Signed-off-by: Enze Li <lienze@kylinos.cn> Reviewed-by: SeongJae Park <sj@kernel.org> Tested-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-29mm/damon/core: fix list_add_tail() call on damon_call()SeongJae Park1-1/+1
commit c3fa5b1bfd8380d935fa961f2ac166bdf000f418 upstream. Each damon_ctx maintains callback requests using a linked list (damon_ctx->call_controls). When a new callback request is received via damon_call(), the new request should be added to the list. However, the function is making a mistake at list_add_tail() invocation: putting the new item to add and the list head to add it before, in the opposite order. Because of the linked list manipulation implementation, the new request can still be reached from the context's list head. But the list items that were added before the new request are dropped from the list. As a result, the callbacks are unexpectedly not invocated. Worse yet, if the dropped callback requests were dynamically allocated, the memory is leaked. Actually DAMON sysfs interface is using a dynamically allocated repeat-mode callback request for automatic essential stats update. And because the online DAMON parameters commit is using a non-repeat-mode callback request, the issue can easily be reproduced, like below. # damo start --damos_action stat --refresh_stat 1s # damo tune --damos_action stat --refresh_stat 1s The first command dynamically allocates the repeat-mode callback request for automatic essential stat update. Users can see the essential stats are automatically updated for every second, using the sysfs interface. The second command calls damon_commit() with a new callback request that was made for the commit. As a result, the previously added repeat-mode callback request is dropped from the list. The automatic stats refresh stops working, and the memory for the repeat-mode callback request is leaked. It can be confirmed using kmemleak. Fix the mistake on the list_add_tail() call. Link: https://lkml.kernel.org/r/20251014205939.1206-1-sj@kernel.org Fixes: 004ded6bee11 ("mm/damon: accept parallel damon_call() requests") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> [6.17+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-29mm/damon/core: use damos_commit_quota_goal() for new goal commitSeongJae Park1-1/+1
commit 7eca961dd7188f20fdf8ce9ed5018280f79b2438 upstream. When damos_commit_quota_goals() is called for adding new DAMOS quota goals of DAMOS_QUOTA_USER_INPUT metric, current_value fields of the new goals should be also set as requested. However, damos_commit_quota_goals() is not updating the field for the case, since it is setting only metrics and target values using damos_new_quota_goal(), and metric-optional union fields using damos_commit_quota_goal_union(). As a result, users could see the first current_value parameter that committed online with a new quota goal is ignored. Users are assumed to commit the current_value for DAMOS_QUOTA_USER_INPUT quota goals, since it is being used as a feedback. Hence the real impact would be subtle. That said, this is obviously not intended behavior. Fix the issue by using damos_commit_quota_goal() which sets all quota goal parameters, instead of damos_commit_quota_goal_union(), which sets only the union fields. Link: https://lkml.kernel.org/r/20251014001846.279282-1-sj@kernel.org Fixes: 1aef9df0ee90 ("mm/damon/core: commit damos_quota_goal->nid") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> [6.16+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <greg