summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
2022-10-15mm: pagewalk: Fix race between unmap and page walkerSteven Price1-5/+8
commit 8782fb61cc848364e1e1599d76d3c9dd58a1cc06 upstream. The mmap lock protects the page walker from changes to the page tables during the walk. However a read lock is insufficient to protect those areas which don't have a VMA as munmap() detaches the VMAs before downgrading to a read lock and actually tearing down PTEs/page tables. For users of walk_page_range() the solution is to simply call pte_hole() immediately without checking the actual page tables when a VMA is not present. We now never call __walk_page_range() without a valid vma. For walk_page_range_novma() the locking requirements are tightened to require the mmap write lock to be taken, and then walking the pgd directly with 'no_vma' set. This in turn means that all page walkers either have a valid vma, or it's that special 'novma' case for page table debugging. As a result, all the odd '(!walk->vma && !walk->no_vma)' tests can be removed. Fixes: dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Steven Price <steven.price@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [manually backported. backport note: walk_page_range_novma() does not exist in 5.4, so I'm omitting it from the backport] Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-05mm: fix madivse_pageout mishandling on non-LRU pageMinchan Kim1-2/+5
commit 58d426a7ba92870d489686dfdb9d06b66815a2ab upstream. MADV_PAGEOUT tries to isolate non-LRU pages and gets a warning from isolate_lru_page below. Fix it by checking PageLRU in advance. ------------[ cut here ]------------ trying to isolate tail page WARNING: CPU: 0 PID: 6175 at mm/folio-compat.c:158 isolate_lru_page+0x130/0x140 Modules linked in: CPU: 0 PID: 6175 Comm: syz-executor.0 Not tainted 5.18.12 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 RIP: 0010:isolate_lru_page+0x130/0x140 Link: https://lore.kernel.org/linux-mm/485f8c33.2471b.182d5726afb.Coremail.hantianshuo@iie.ac.cn/ Link: https://lkml.kernel.org/r/20220908151204.762596-1-minchan@kernel.org Fixes: 1a4e58cce84e ("mm: introduce MADV_PAGEOUT") Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: 韩天ç`• <hantianshuo@iie.ac.cn> Suggested-by: Yang Shi <shy828301@gmail.com> Acked-by: Yang Shi <shy828301@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-05mm/migrate_device.c: flush TLB while holding PTLAlistair Popple1-2/+3
commit 60bae73708963de4a17231077285bd9ff2f41c44 upstream. When clearing a PTE the TLB should be flushed whilst still holding the PTL to avoid a potential race with madvise/munmap/etc. For example consider the following sequence: CPU0 CPU1 ---- ---- migrate_vma_collect_pmd() pte_unmap_unlock() madvise(MADV_DONTNEED) -> zap_pte_range() pte_offset_map_lock() [ PTE not present, TLB not flushed ] pte_unmap_unlock() [ page is still accessible via stale TLB ] flush_tlb_range() In this case the page may still be accessed via the stale TLB entry after madvise returns. Fix this by flushing the TLB while holding the PTL. Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages") Link: https://lkml.kernel.org/r/9f801e9d8d830408f2ca27821f606e09aa856899.1662078528.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reported-by: Nadav Amit <nadav.amit@gmail.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: Alex Sierra <alex.sierra@amd.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: huang ying <huang.ying.caritas@gmail.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Karol Herbst <kherbst@redhat.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Lyude Paul <lyude@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-05mm: prevent page_frag_alloc() from corrupting the memoryMaurizio Lombardi1-0/+12
commit dac22531bbd4af2426c4e29e05594415ccfa365d upstream. A number of drivers call page_frag_alloc() with a fragment's size > PAGE_SIZE. In low memory conditions, __page_frag_cache_refill() may fail the order 3 cache allocation and fall back to order 0; In this case, the cache will be smaller than the fragment, causing memory corruptions. Prevent this from happening by checking if the newly allocated cache is large enough for the fragment; if not, the allocation will fail and page_frag_alloc() will return NULL. Link: https://lkml.kernel.org/r/20220715125013.247085-1-mlombard@redhat.com Fixes: b63ae8ca096d ("mm/net: Rename and move page fragment handling from net/ to mm/") Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Cc: Chen Lin <chen45464546@163.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-05mm/page_alloc: fix race condition between build_all_zonelists and page ↵Mel Gorman1-10/+43
allocation commit 3d36424b3b5850bd92f3e89b953a430d7cfc88ef upstream. Patrick Daly reported the following problem; NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - before offline operation [0] - ZONE_MOVABLE [1] - ZONE_NORMAL [2] - NULL For a GFP_KERNEL allocation, alloc_pages_slowpath() will save the offset of ZONE_NORMAL in ac->preferred_zoneref. If a concurrent memory_offline operation removes the last page from ZONE_MOVABLE, build_all_zonelists() & build_zonerefs_node() will update node_zonelists as shown below. Only populated zones are added. NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - after offline operation [0] - ZONE_NORMAL [1] - NULL [2] - NULL The race is simple -- page allocation could be in progress when a memory hot-remove operation triggers a zonelist rebuild that removes zones. The allocation request will still have a valid ac->preferred_zoneref that is now pointing to NULL and triggers an OOM kill. This problem probably always existed but may be slightly easier to trigger due to 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator") which distinguishes between zones that are completely unpopulated versus zones that have valid pages not managed by the buddy allocator (e.g. reserved, memblock, ballooning etc). Memory hotplug had multiple stages with timing considerations around managed/present page updates, the zonelist rebuild and the zone span updates. As David Hildenbrand puts it memory offlining adjusts managed+present pages of the zone essentially in one go. If after the adjustments, the zone is no longer populated (present==0), we rebuild the zone lists. Once that's done, we try shrinking the zone (start+spanned pages) -- which results in zone_start_pfn == 0 if there are no more pages. That happens *after* rebuilding the zonelists via remove_pfn_range_from_zone(). The only requirement to fix the race is that a page allocation request identifies when a zonelist rebuild has happened since the allocation request started and no page has yet been allocated. Use a seqlock_t to track zonelist updates with a lockless read-side of the zonelist and protecting the rebuild and update of the counter with a spinlock. [akpm@linux-foundation.org: make zonelist_update_seq static] Link: https://lkml.kernel.org/r/20220824110900.vh674ltxmzb3proq@techsingularity.net Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator") Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reported-by: Patrick Daly <quic_pdaly@quicinc.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: <stable@vger.kernel.org> [4.9+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-09-28mm/slub: fix to return errno if kmalloc() failsChao Yu1-1/+4
commit 7e9c323c52b379d261a72dc7bd38120a761a93cd upstream. In create_unique_id(), kmalloc(, GFP_KERNEL) can fail due to out-of-memory, if it fails, return errno correctly rather than triggering panic via BUG_ON(); kernel BUG at mm/slub.c:5893! Internal error: Oops - BUG: 0 [#1] PREEMPT SMP Call trace: sysfs_slab_add+0x258/0x260 mm/slub.c:5973 __kmem_cache_create+0x60/0x118 mm/slub.c:4899 create_cache mm/slab_common.c:229 [inline] kmem_cache_create_usercopy+0x19c/0x31c mm/slab_common.c:335 kmem_cache_create+0x1c/0x28 mm/slab_common.c:390 f2fs_kmem_cache_create fs/f2fs/f2fs.h:2766 [inline] f2fs_init_xattr_caches+0x78/0xb4 fs/f2fs/xattr.c:808 f2fs_fill_super+0x1050/0x1e0c fs/f2fs/super.c:4149 mount_bdev+0x1b8/0x210 fs/super.c:1400 f2fs_mount+0x44/0x58 fs/f2fs/super.c:4512 legacy_get_tree+0x30/0x74 fs/fs_context.c:610 vfs_get_tree+0x40/0x140 fs/super.c:1530 do_new_mount+0x1dc/0x4e4 fs/namespace.c:3040 path_mount+0x358/0x914 fs/namespace.c:3370 do_mount fs/namespace.c:3383 [inline] __do_sys_mount fs/namespace.c:3591 [inline] __se_sys_mount fs/namespace.c:3568 [inline] __arm64_sys_mount+0x2f8/0x408 fs/namespace.c:3568 Cc: <stable@kernel.org> Fixes: 81819f0fc8285 ("SLUB core") Reported-by: syzbot+81684812ea68216e08c5@syzkaller.appspotmail.com Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Chao Yu <chao.yu@oppo.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-09-20mm: Fix TLB flush for not-first PFNMAP mappings in unmap_region()Jann Horn1-2/+7
This is a stable-specific patch. I botched the stable-specific rewrite of commit b67fbebd4cf98 ("mmu_gather: Force tlb-flush VM_PFNMAP vmas"): As Hugh pointed out, unmap_region() actually operates on a list of VMAs, and the variable "vma" merely points to the first VMA in that list. So if we want to check whether any of the VMAs we're operating on is PFNMAP or MIXEDMAP, we have to iterate through the list and check each VMA. Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-09-15Revert "mm: kmemleak: take a full lowmem check in kmemleak_*_phys()"Yee Lee1-4/+4
This reverts commit 23c2d497de21f25898fbea70aeb292ab8acc8c94. Commit 23c2d497de21 ("mm: kmemleak: take a full lowmem check in kmemleak_*_phys()") brought false leak alarms on some archs like arm64 that does not init pfn boundary in early booting. The final solution lands on linux-6.0: commit 0c24e061196c ("mm: kmemleak: add rbtree and store physical address for objects allocated with PA"). Revert this commit before linux-6.0. The original issue of invalid PA can be mitigated by additional check in devicetree. The false alarm report is as following: Kmemleak output: (Qemu/arm64) unreferenced object 0xffff0000c0170a00 (size 128): comm "swapper/0", pid 1, jiffies 4294892404 (age 126.208s) hex dump (first 32 bytes): 62 61 73 65 00 00 00 00 00 00 00 00 00 00 00 00 base............ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<(____ptrval____)>] __kmalloc_track_caller+0x1b0/0x2e4 [<(____ptrval____)>] kstrdup_const+0x8c/0xc4 [<(____ptrval____)>] kvasprintf_const+0xbc/0xec [<(____ptrval____)>] kobject_set_name_vargs+0x58/0xe4 [<(____ptrval____)>] kobject_add+0x84/0x100 [<(____ptrval____)>] __of_attach_node_sysfs+0x78/0xec [<(____ptrval____)>] of_core_init+0x68/0x104 [<(____ptrval____)>] driver_init+0x28/0x48 [<(____ptrval____)>] do_basic_setup+0x14/0x28 [<(____ptrval____)>] kernel_init_freeable+0x110/0x178 [<(____ptrval____)>] kernel_init+0x20/0x1a0 [<(____ptrval____)>] ret_from_fork+0x10/0x20 This pacth is also applicable to linux-5.17.y/linux-5.18.y/linux-5.19.y Cc: <stable@vger.kernel.org> Signed-off-by: Yee Lee <yee.lee@mediatek.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-09-05mm/rmap: Fix anon_vma->degree ambiguity leading to double-reuseJann Horn1-14/+17
commit 2555283eb40df89945557273121e9393ef9b542b upstream. anon_vma->degree tracks the combined number of child anon_vmas and VMAs that use the anon_vma as their ->anon_vma. anon_vma_clone() then assumes that for any anon_vma attached to src->anon_vma_chain other than src->anon_vma, it is impossible for it to be a leaf node of the VMA tree, meaning that for such VMAs ->degree is elevated by 1 because of a child anon_vma, meaning that if ->degree equals 1 there are no VMAs that use the anon_vma as their ->anon_vma. This assumption is wrong because the ->degree optimization leads to leaf nodes being abandoned on anon_vma_clone() - an existing anon_vma is reused and no new parent-child relationship is created. So it is possible to reuse an anon_vma for one VMA while it is still tied to another VMA. This is an issue because is_mergeable_anon_vma() and its callers assume that if two VMAs have the same ->anon_vma, the list of anon_vmas attached to the VMAs is guaranteed to be the same. When this assumption is violated, vma_merge() can merge pages into a VMA that is not attached to the corresponding anon_vma, leading to dangling page->mapping pointers that will be dereferenced during rmap walks. Fix it by separately tracking the number of child anon_vmas and the number of VMAs using the anon_vma as their ->anon_vma. Fixes: 7a3ef208e662 ("mm: prevent endless growth of anon_vma hierarchy") Cc: stable@kernel.org Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-09-05mm: Force TLB flush for PFNMAP mappings before unlink_file_vma()Jann Horn1-0/+12
commit b67fbebd4cf980aecbcc750e1462128bffe8ae15 upstream. Some drivers rely on having all VMAs through which a PFN might be accessible listed in the rmap for correctness. However, on X86, it was possible for a VMA with stale TLB entries to not be listed in the rmap. This was fixed in mainline with commit b67fbebd4cf9 ("mmu_gather: Force tlb-flush VM_PFNMAP vmas"), but that commit relies on preceding refactoring in commit 18ba064e42df3 ("mmu_gather: Let there be one tlb_{start,end}_vma() implementation") and commit 1e9fdf21a4339 ("mmu_gather: Remove per arch tlb_{start,end}_vma()"). This patch provides equivalent protection without needing that refactoring, by forcing a TLB flush between removing PTEs in unmap_vmas() and the call to unlink_file_vma() in free_pgtables(). [This is a stable-specific rewrite of the upstream commit!] Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-09-05mm/hugetlb: fix hugetlb not supporting softdirty trackingDavid Hildenbrand1-2/+6
commit f96f7a40874d7c746680c0b9f57cef2262ae551f upstream. Patch series "mm/hugetlb: fix write-fault handling for shared mappings", v2. I observed that hugetlb does not support/expect write-faults in shared mappings that would have to map the R/O-mapped page writable -- and I found two case where we could currently get such faults and would erroneously map an anon page into a shared mapping. Reproducers part of the patches. I propose to backport both fixes to stable trees. The first fix needs a small adjustment. This patch (of 2): Staring at hugetlb_wp(), one might wonder where all the logic for shared mappings is when stumbling over a write-protected page in a shared mapping. In fact, there is none, and so far we thought we could get away with that because e.g., mprotect() should always do the right thing and map all pages directly writable. Looks like we were wrong: -------------------------------------------------------------------------- #include <stdio.h> #include <stdlib.h> #include <string.h> #include <fcntl.h> #include <unistd.h> #include <errno.h> #include <sys/mman.h> #define HUGETLB_SIZE (2 * 1024 * 1024u) static void clear_softdirty(void) { int fd = open("/proc/self/clear_refs", O_WRONLY); const char *ctrl = "4"; int ret; if (fd < 0) { fprintf(stderr, "open(clear_refs) failed\n"); exit(1); } ret = write(fd, ctrl, strlen(ctrl)); if (ret != strlen(ctrl)) { fprintf(stderr, "write(clear_refs) failed\n"); exit(1); } close(fd); } int main(int argc, char **argv) { char *map; int fd; fd = open("/dev/hugepages/tmp", O_RDWR | O_CREAT); if (!fd) { fprintf(stderr, "open() failed\n"); return -errno; } if (ftruncate(fd, HUGETLB_SIZE)) { fprintf(stderr, "ftruncate() failed\n"); return -errno; } map = mmap(NULL, HUGETLB_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); if (map == MAP_FAILED) { fprintf(stderr, "mmap() failed\n"); return -errno; } *map = 0; if (mprotect(map, HUGETLB_SIZE, PROT_READ)) { fprintf(stderr, "mmprotect() failed\n"); return -errno; } clear_softdirty(); if (mprotect(map, HUGETLB_SIZE, PROT_READ|PROT_WRITE)) { fprintf(stderr, "mmprotect() failed\n"); return -errno; } *map = 0; return 0; } -------------------------------------------------------------------------- Above test fails with SIGBUS when there is only a single free hugetlb page. # echo 1 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages # ./test Bus error (core dumped) And worse, with sufficient free hugetlb pages it will map an anonymous page into a shared mapping, for example, messing up accounting during unmap and breaking MAP_SHARED semantics: # echo 2 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages # ./test # cat /proc/meminfo | grep HugePages_ HugePages_Total: 2 HugePages_Free: 1 HugePages_Rsvd: 18446744073709551615 HugePages_Surp: 0 Reason in this particular case is that vma_wants_writenotify() will return "true", removing VM_SHARED in vma_set_page_prot() to map pages write-protected. Let's teach vma_wants_writenotify() that hugetlb does not support softdirty tracking. Link: https://lkml.kernel.org/r/20220811103435.188481-1-david@redhat.com Link: https://lkml.kernel.org/r/20220811103435.188481-2-david@redhat.com Fixes: 64e455079e1b ("mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared") Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Peter Feiner <pfeiner@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Jamie Liu <jamieliu@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> [3.18+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-25mm/mmap.c: fix missing call to vm_unacct_memory in mmap_regionMiaohe Lin1-1/+0
[ Upstream commit 7f82f922319ede486540e8746769865b9508d2c2 ] Since the beginning, charged is set to 0 to avoid calling vm_unacct_memory twice because vm_unacct_memory will be called by above unmap_region. But since commit 4f74d2c8e827 ("vm: remove 'nr_accounted' calculations from the unmap_vmas() interfaces"), unmap_region doesn't call vm_unacct_memory anymore. So charged shouldn't be set to 0 now otherwise the calling to paired vm_unacct_memory will be missed and leads to imbalanced account. Link: https://lkml.kernel.org/r/20220618082027.43391-1-linmiaohe@huawei.com Fixes: 4f74d2c8e827 ("vm: remove 'nr_accounted' calculations from the unmap_vmas() interfaces") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-25mm/mremap: hold the rmap lock in write mode when moving page table entries.Aneesh Kumar K.V1-4/+2
commit 97113eb39fa7972722ff490b947d8af023e1f6a2 upstream. To avoid a race between rmap walk and mremap, mremap does take_rmap_locks(). The lock was taken to ensure that rmap walk don't miss a page table entry due to PTE moves via move_pagetables(). The kernel does further optimization of this lock such that if we are going to find the newly added vma after the old vma, the rmap lock is not taken. This is because rmap walk would find the vmas in the same order and if we don't find the page table attached to older vma we would find it with the new vma which we would iterate later. As explained in commit eb66ae030829 ("mremap: properly flush TLB before releasing the page") mremap is special in that it doesn't take ownership of the page. The optimized version for PUD/PMD aligned mremap also doesn't hold the ptl lock. This can result in stale TLB entries as show below. This patch updates the rmap locking requirement in mremap to handle the race condition explained below with optimized mremap:: Optmized PMD move CPU 1 CPU 2 CPU 3 mremap(old_addr, new_addr) page_shrinker/try_to_unmap_one mmap_write_lock_killable() addr = old_addr lock(pte_ptl) lock(pmd_ptl) pmd = *old_pmd pmd_clear(old_pmd) flush_tlb_range(old_addr) *new_pmd = pmd *new_addr = 10; and fills TLB with new addr and old pfn unlock(pmd_ptl) ptep_clear_flush() old pfn is free. Stale TLB entry Optimized PUD move also suffers from a similar race. Both the above race condition can be fixed if we force mremap path to take rmap lock. Link: https://lkml.kernel.org/r/20210616045239.370802-7-aneesh.kumar@linux.ibm.com Fixes: 2c91bd4a4e2e ("mm: speed up mremap by 20x on large regions") Fixes: c49dd3401802 ("mm: speedup mremap on 1GB or larger regions") Link: https://lore.kernel.org/linux-mm/CAHk-=wgXVR04eBNtxQfevontWnP6FDm+oj5vauQXP3S-huwbPw@mail.gmail.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [patch rewritten for backport since the code was refactored since] Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-07-29mm/mempolicy: fix uninit-value in mpol_rebind_policy()Wang Cheng1-1/+1
commit 018160ad314d75b1409129b2247b614a9f35894c upstream. mpol_set_nodemask()(mm/mempolicy.c) does not set up nodemask when pol->mode is MPOL_LOCAL. Check pol->mode before access pol->w.cpuset_mems_allowed in mpol_rebind_policy()(mm/mempolicy.c). BUG: KMSAN: uninit-value in mpol_rebind_policy mm/mempolicy.c:352 [inline] BUG: KMSAN: uninit-value in mpol_rebind_task+0x2ac/0x2c0 mm/mempolicy.c:368 mpol_rebind_policy mm/mempolicy.c:352 [inline] mpol_rebind_task+0x2ac/0x2c0 mm/mempolicy.c:368 cpuset_change_task_nodemask kernel/cgroup/cpuset.c:1711 [inline] cpuset_attach+0x787/0x15e0 kernel/cgroup/cpuset.c:2278 cgroup_migrate_execute+0x1023/0x1d20 kernel/cgroup/cgroup.c:2515 cgroup_migrate kernel/cgroup/cgroup.c:2771 [inline] cgroup_attach_task+0x540/0x8b0 kernel/cgroup/cgroup.c:2804 __cgroup1_procs_write+0x5cc/0x7a0 kernel/cgroup/cgroup-v1.c:520 cgroup1_tasks_write+0x94/0xb0 kernel/cgroup/cgroup-v1.c:539 cgroup_file_write+0x4c2/0x9e0 kernel/cgroup/cgroup.c:3852 kernfs_fop_write_iter+0x66a/0x9f0 fs/kernfs/file.c:296 call_write_iter include/linux/fs.h:2162 [inline] new_sync_write fs/read_write.c:503 [inline] vfs_write+0x1318/0x2030 fs/read_write.c:590 ksys_write+0x28b/0x510 fs/read_write.c:643 __do_sys_write fs/read_write.c:655 [inline] __se_sys_write fs/read_write.c:652 [inline] __x64_sys_write+0xdb/0x120 fs/read_write.c:652 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82 entry_SYSCALL_64_after_hwframe+0x44/0xae Uninit was created at: slab_post_alloc_hook mm/slab.h:524 [inline] slab_alloc_node mm/slub.c:3251 [inline] slab_alloc mm/slub.c:3259 [inline] kmem_cache_alloc+0x902/0x11c0 mm/slub.c:3264 mpol_new mm/mempolicy.c:293 [inline] do_set_mempolicy+0x421/0xb70 mm/mempolicy.c:853 kernel_set_mempolicy mm/mempolicy.c:1504 [inline] __do_sys_set_mempolicy mm/mempolicy.c:1510 [inline] __se_sys_set_mempolicy+0x44c/0xb60 mm/mempolicy.c:1507 __x64_sys_set_mempolicy+0xd8/0x110 mm/mempolicy.c:1507 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82 entry_SYSCALL_64_after_hwframe+0x44/0xae KMSAN: uninit-value in mpol_rebind_task (2) https://syzkaller.appspot.com/bug?id=d6eb90f952c2a5de9ea718a1b873c55cb13b59dc This patch seems to fix below bug too. KMSAN: uninit-value in mpol_rebind_mm (2) https://syzkaller.appspot.com/bug?id=f2fecd0d7013f54ec4162f60743a2b28df40926b The uninit-value is pol->w.cpuset_mems_allowed in mpol_rebind_policy(). When syzkaller reproducer runs to the beginning of mpol_new(), mpol_new() mm/mempolicy.c do_mbind() mm/mempolicy.c kernel_mbind() mm/mempolicy.c `mode` is 1(MPOL_PREFERRED), nodes_empty(*nodes) is `true` and `flags` is 0. Then mode = MPOL_LOCAL; ... policy->mode = mode; policy->flags = flags; will be executed. So in mpol_set_nodemask(), mpol_set_nodemask() mm/mempolicy.c do_mbind() kernel_mbind() pol->mode is 4 (MPOL_LOCAL), that `nodemask` in `pol` is not initialized, which will be accessed in mpol_rebind_policy(). Link: https://lkml.kernel.org/r/20220512123428.fq3wofedp6oiotd4@ppc.localdomain Signed-off-by: Wang Cheng <wanngchenng@gmail.com> Reported-by: <syzbot+217f792c92599518a2ab@syzkaller.appspotmail.com> Tested-by: <syzbot+217f792c92599518a2ab@syzkaller.appspotmail.com> Cc: David Rientjes <rientjes@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-07-12mm/slub: add missing TID updates on slab deactivationJann Horn1-2/+2
commit eeaa345e128515135ccb864c04482180c08e3259 upstream. The fastpath in slab_alloc_node() assumes that c->slab is stable as long as the TID stays the same. However, two places in __slab_alloc() currently don't update the TID when deactivating the CPU slab. If multiple operations race the right way, this could lead to an object getting lost; or, in an even more unlikely situation, it could even lead to an object being freed onto the wrong slab's freelist, messing up the `inuse` counter and eventually causing a page to be freed to the page allocator while it still contains slab objects. (I haven't actually tested these cases though, this is just based on looking at the code. Writing testcases for this stuff seems like it'd be a pain...) The race leading to state inconsistency is (all operations on the same CPU and kmem_cache): - task A: begin do_slab_free(): - read TID - read pcpu freelist (==NULL) - check `slab == c->slab` (true) - [PREEMPT A->B] - task B: begin slab_alloc_node(): - fastpath fails (`c->freelist` is NULL) - enter __slab_alloc() - slub_get_cpu_ptr() (disables preemption) - enter ___slab_alloc() - take local_lock_irqsave() - read c->freelist as NULL - get_freelist() returns NULL - write `c->slab = NULL` - drop local_unlock_irqrestore() - goto new_slab - slub_percpu_partial() is NULL - get_partial() returns NULL - slub_put_cpu_ptr() (enables preemption) - [PREEMPT B->A] - task A: finish do_slab_free(): - this_cpu_cmpxchg_double() succeeds() - [CORRUPT STATE: c->slab==NULL, c->freelist!=NULL] From there, the object on c->freelist will get lost if task B is allowed to continue from here: It will proceed to the retry_load_slab label, set c->slab, then jump to load_freelist, which clobbers c->freelist. But if we instead continue as follows, we get worse corruption: - task A: run __slab_free() on object from other struct slab: - CPU_PARTIAL_FREE case (slab was on no list, is now on pcpu partial) - task A: run slab_alloc_node() with NUMA node constraint: - fastpath fails (c->slab is NULL) - call __slab_alloc() - slub_get_cpu_ptr() (disables preemption) - enter ___slab_alloc() - c->slab is NULL: goto new_slab - slub_percpu_partial() is non-NULL - set c->slab to slub_percpu_partial(c) - [CORRUPT STATE: c->slab points to slab-1, c->freelist has objects from slab-2] - goto redo - node_match() fails - goto deactivate_slab - existing c->freelist is passed into deactivate_slab() - inuse count of slab-1 is decremented to account for object from slab-2 At this point, the inuse count of slab-1 is 1 lower than it should be. This means that if we free all allocated objects in slab-1 except for one, SLUB will think that slab-1 is completely unused, and may free its page, leading to use-after-free. Fixes: c17dda40a6a4e ("slub: Separate out kmem_cache_cpu processing from deactivate_slab") Fixes: 03e404af26dc2 ("slub: fast release on full slab") Cc: stable@vger.kernel.org Signed-off-by: Jann Horn <jannh@google.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Link: https://lore.kernel.org/r/20220608182205.2945720-1-jannh@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-22random: move randomize_page() into mm where it belongsJason A. Donenfeld1-0/+32
commit 5ad7dd882e45d7fe432c32e896e2aaa0b21746ea upstream. randomize_page is an mm function. It is documented like one. It contains the history of one. It has the naming convention of one. It looks just like another very similar function in mm, randomize_stack_top(). And it has always been maintained and updated by mm people. There is no need for it to be in random.c. In the "which shape does not look like the other ones" test, pointing to randomize_page() is correct. So move randomize_page() into mm/util.c, right next to the similar randomize_stack_top() function. This commit contains no actual code changes. Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-14hugetlb: fix huge_pmd_unshare address updateMike Kravetz1-1/+8
commit 48381273f8734d28ef56a5bdf1966dd8530111bc upstream. The routine huge_pmd_unshare() is passed a pointer to an address associated with an area which may be unshared. If unshare is successful this address is updated to 'optimize' callers iterating over huge page addresses. For the optimization to work correctly, address should be updated to the last huge page in the unmapped/unshared area. However, in the common case where the passed address is PUD_SIZE aligned, the address is incorrectly updated to the address of the preceding huge page. That wastes CPU cycles as the unmapped/unshared range is scanned twice. Link: https://lkml.kernel.org/r/20220524205003.126184-1-mike.kravetz@oracle.com Fixes: 39dde65c9940 ("shared page table for hugetlb page") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Muchun Song <songmuchun@bytedance.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-14mm, compaction: fast_find_migrateblock() should return pfn in the target zoneRei Yamamoto1-0/+2
commit bbe832b9db2e1ad21522f8f0bf02775fff8a0e0e upstream. At present, pages not in the target zone are added to cc->migratepages list in isolate_migratepages_block(). As a result, pages may migrate between nodes unintentionally. This would be a serious problem for older kernels without commit a984226f457f849e ("mm: memcontrol: remove the pgdata parameter of mem_cgroup_page_lruvec"), because it can corrupt the lru list by handling pages in list without holding proper lru_lock. Avoid returning a pfn outside the target zone in the case that it is not aligned with a pageblock boundary. Otherwise isolate_migratepages_block() will handle pages not in the target zone. Link: https://lkml.kernel.org/r/20220511044300.4069-1-yamamoto.rei@jp.fujitsu.com Fixes: 70b44595eafe ("mm, compaction: use free lists to quickly locate a migration source") Signed-off-by: Rei Yamamoto <yamamoto.rei@jp.fujitsu.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Don Dutile <ddutile@redhat.com> Cc: Wonhyuk Yang <vvghjk1234@gmail.com> Cc: Rei Yamamoto <yamamoto.rei@jp.fujitsu.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-06zsmalloc: fix races between asynchronous zspage free and page migrationSultan Alsawaf1-4/+33
commit 2505a981114dcb715f8977b8433f7540854851d8 upstream. The asynchronous zspage free worker tries to lock a zspage's entire page list without defending against page migration. Since pages which haven't yet been locked can concurrently migrate off the zspage page list while lock_zspage() churns away, lock_zspage() can suffer from a few different lethal races. It can lock a page which no longer belongs to the zspage and unsafely dereference page_private(), it can unsafely dereference a torn pointer to the next page (since there's a data race), and it can observe a spurious NULL pointer to the next page and thus not lock all of the zspage's pages (since a single page migration will reconstruct the entire page list, and create_page_chain() unconditionally zeroes out each list pointer in the process). Fix the races by using migrate_read_lock() in lock_zspage() to synchronize with page migration. Link: https://lkml.kernel.org/r/20220509024703.243847-1-sultan@kerneltoast.com Fixes: 77ff465799c602 ("zsmalloc: zs_page_migrate: skip unnecessary loops but not return -EBUSY if zspage is not inuse") Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-15mm: userfaultfd: fix missing cache flush in mcopy_atomic_pte() and ↵Muchun Song1-0/+3
__mcopy_atomic() commit 7c25a0b89a487878b0691e6524fb5a8827322194 upstream. userfaultfd calls mcopy_atomic_pte() and __mcopy_atomic() which do not do any cache flushing for the target page. Then the target page will be mapped to the user space with a different address (user address), which might have an alias issue with the kernel address used to copy the data from the user to. Fix this by insert flush_dcache_page() after copy_from_user() succeeds. Link: https://lkml.kernel.org/r/20220210123058.79206-7-songmuchun@bytedance.com Fixes: b6ebaedb4cb1 ("userfaultfd: avoid mmap_sem read recursion in mcopy_atomic") Fixes: c1a4de99fada ("userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lars Persson <lars.persson@axis.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Peter Xu <peterx@redhat.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-15mm: hugetlb: fix missing cache flush in copy_huge_page_from_user()Muchun Song1-0/+2
commit e763243cc6cb1fcc720ec58cfd6e7c35ae90a479 upstream. userfaultfd calls copy_huge_page_from_user() which does not do any cache flushing for the target page. Then the target page will be mapped to the user space with a different address (user address), which might have an alias issue with the kernel address used to copy the data from the user to. Fix this issue by flushing dcache in copy_huge_page_from_user(). Link: https://lkml.kernel.org/r/20220210123058.79206-4-songmuchun@bytedance.com Fixes: fa4d75c1de13 ("userfaultfd: hugetlbfs: add copy_huge_page_from_user for hugetlb userfaultfd support") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lars Persson <lars.persson@axis.com> Cc: Peter Xu <peterx@redhat.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-15mm: fix missing cache flush for all tail pages of compound pageMuchun Song1-2/+5
commit 2771739a7162782c0aa6424b2e3dd874e884a15d upstream. The D-cache maintenance inside move_to_new_page() only consider one page, there is still D-cache maintenance issue for tail pages of compound page (e.g. THP or HugeTLB). THP migration is only enabled on x86_64, ARM64 and powerpc, while powerpc and arm64 need to maintain the consistency between I-Cache and D-Cache, which depends on flush_dcache_page() to maintain the consistency between I-Cache and D-Cache. But there is no issues on arm64 and powerpc since they already considers the compound page cache flushing in their icache flush function. HugeTLB migration is enabled on arm, arm64, mips, parisc, powerpc, riscv, s390 and sh, while arm has handled the compound page cache flush in flush_dcache_page(), but most others do not. In theory, the issue exists on many architectures. Fix this by not using flush_dcache_folio() since it is not backportable. Link: https://lkml.kernel.org/r/20220210123058.79206-3-songmuchun@bytedance.com Fixes: 290408d4a250 ("hugetlb: hugepage migration core") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lars Persson <lars.persson@axis.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Peter Xu <peterx@redhat.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-15arm: remove CONFIG_ARCH_HAS_HOLES_MEMORYMODELMike Rapoport2-18/+0
commit 5e545df3292fbd3d5963c68980f1527ead2a2b3f upstream. ARM is the only architecture that defines CONFIG_ARCH_HAS_HOLES_MEMORYMODEL which in turn enables memmap_valid_within() function that is intended to verify existence of struct page associated with a pfn when there are holes in the memory map. However, the ARCH_HAS_HOLES_MEMORYMODEL also enables HAVE_ARCH_PFN_VALID and arch-specific pfn_valid() implementation that also deals with the holes in the memory map. The only two users of memmap_valid_within() call this function after a call to pfn_valid() so the memmap_valid_within() check becomes redundant. Remove CONFIG_ARCH_HAS_HOLES_MEMORYMODEL and memmap_valid_within() and rely entirely on ARM's implementation of pfn_valid() that is now enabled unconditionally. Link: https://lkml.kernel.org/r/20201101170454.9567-9-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greg Ungerer <gerg@linux-m68k.org> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matt Turner <mattst88@gmail.com> Cc: Meelis Roos <mroos@linux.ee> Cc: Michael Schmitz <schmitzmic@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Reported-by: kernel test robot <lkp@intel.com> Fixes: 8dd559d53b3b ("arm: ioremap: don't abuse pfn_valid() to check if pfn is in RAM") Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-12mm: fix unexpected zeroed page mapping with zram swapMinchan Kim1-54/+0
commit e914d8f00391520ecc4495dd0ca0124538ab7119 upstream. Two processes under CLONE_VM cloning, user process can be corrupted by seeing zeroed page unexpectedly. CPU A CPU B do_swap_page do_swap_page SWP_SYNCHRONOUS_IO path SWP_SYNCHRONOUS_IO path swap_readpage valid data swap_slot_free_notify delete zram entry swap_readpage zeroed(invalid) data pte_lock map the *zero data* to userspace pte_unlock pte_lock if (!pte_same) goto out_nomap; pte_unlock return and next refault will read zeroed data The swap_slot_free_notify is bogus for CLONE_VM case since it doesn't increase the refcount of swap slot at copy_mm so it couldn't catch up whether it's safe or not to discard data from backing device. In the case, only the lock it could rely on to synchronize swap slot freeing is page table lock. Thus, this patch gets rid of the swap_slot_free_notify function. With this patch, CPU A will see correct data. CPU A CPU B do_swap_page do_swap_page SWP_SYNCHRONOUS_IO path SWP_SYNCHRONOUS_IO path swap_readpage original data pte_lock map the original data swap_free swap_range_free bd_disk->fops->swap_slot_free_notify swap_readpage read zeroed data pte_unlock pte_lock if (!pte_same) goto out_nomap; pte_unlock return on next refault will see mapped data by CPU B The concern of the patch would increase memory consumption since it could keep wasted memory with compressed form in zram as well as uncompressed form in address space. However, most of cases of zram uses no readahead and do_swap_page is followed by swap_free so it will free the compressed form from in zram quickly. Link: https://lkml.kernel.org/r/YjTVVxIAsnKAXjTd@google.com Fixes: 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous device") Reported-by: Ivan Babrou <ivan@cloudflare.com> Tested-by: Ivan Babrou <ivan@cloudflare.com> Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: David Hildenbrand <david@redhat.com> Cc: <stable@vger.kernel.org> [4.14+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-09mm, hugetlb: allow for "high" userspace addressesChristophe Leroy1-8/+0
commit 5f24d5a579d1eace79d505b148808a850b417d4c upstream. This is a fix for commit f6795053dac8 ("mm: mmap: Allow for "high" userspace addresses") for hugetlb. This patch adds support for "high" userspace addresses that are optionally supported on the system and have to be requested via a hint mechanism ("high" addr parameter to mmap). Architectures such as powerpc and x86 achieve this by making changes to their architectural versions of hugetlb_get_unmapped_area() function. However, arm64 uses the generic version of that function. So take into account arch_get_mmap_base() and arch_get_mmap_end() in hugetlb_get_unmapped_area(). To allow that, move those two macros out of mm/mmap.c into include/linux/sched/mm.h If these macros are not defined in architectural code then they default to (TASK_SIZE) and (base) so should not introduce any behavioural changes to architectures that do not define them. For the time being, only ARM64 is affected by this change. Catalin (ARM64) said "We should have fixed hugetlb_get_unmapped_area() as well when we added support for 52-bit VA. The reason for commit f6795053dac8 was to prevent normal mmap() from returning addresses above 48-bit by default as some user-space had hard assumptions about this. It's a slight ABI change if you do this for hugetlb_get_unmapped_area() but I doubt anyone would notice. It's more likely that the current behaviour would cause issues, so I'd rather have them consistent. Basically when arm64 gained support for 52-bit addresses we did not want user-space calling mmap() to suddenly get such high addresses, otherwise we could have inadvertently broken some programs (similar behaviour to x86 here). Hence we added commit f6795053dac8. But we missed hugetlbfs which could still get such high mmap() addresses. So in theory that's a potential regression that should have bee addressed at the same time as commit f6795053dac8 (and before arm64 enabled 52-bit addresses)" Link: https://lkml.kernel.org/r/ab847b6edb197bffdfe189e70fb4ac76bfe79e0d.1650033747.git.christophe.leroy@csgroup.eu Fixes: f6795053dac8 ("mm: mmap: Allow for "high" userspace addresses") Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Steve Capper <steve.capper@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: <stable@vger.kernel.org> [5.0.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-04-27oom_kill.c: futex: delay the OOM reaper to allow time for proper futex cleanupNico Pache1-14/+40
commit e4a38402c36e42df28eb1a5394be87e6571fb48a upstream. The pthread struct is allocated on PRIVATE|ANONYMOUS memory [1] which can be targeted by the oom reaper. This mapping is used to store the futex robust list head; the kernel does not keep a copy of the robust list and instead references a userspace address to maintain the robustness during a process death. A race can occur between exit_mm and the oom reaper that allows the oom reaper to free the memory of the futex robust list before the exit path has handled the futex death: CPU1 CPU2 -------------------------------------------------------------------- page_fault do_exit "signal" wake_oom_reaper oom_reaper oom_reap_task_mm (invalidates mm) exit_mm exit_mm_release futex_exit_release futex_cl