path: root/mm/zsmalloc.c
2024-09-21  Merge tag 'mm-stable-2024-09-20-02-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm  (Linus Torvalds, 1 file, -11/+4)
Pull MM updates from Andrew Morton:
 "Along with the usual shower of singleton patches, notable patch series in this pull request are:

  - "Align kvrealloc() with krealloc()" from Danilo Krummrich. Adds consistency to the APIs and behaviour of these two core allocation functions. This also simplifies/enables Rustification.
  - "Some cleanups for shmem" from Baolin Wang. No functional changes - more code reuse, better function naming, logic simplifications.
  - "mm: some small page fault cleanups" from Josef Bacik. No functional changes - code cleanups only.
  - "Various memory tiering fixes" from Zi Yan. A small fix and a little cleanup.
  - "mm/swap: remove boilerplate" from Yu Zhao. Code cleanups and simplifications and .text shrinkage.
  - "Kernel stack usage histogram" from Pasha Tatashin and Shakeel Butt. This is a feature; it adds new fields to /proc/vmstat such as

        $ grep kstack /proc/vmstat
        kstack_1k 3
        kstack_2k 188
        kstack_4k 11391
        kstack_8k 243
        kstack_16k 0

    which tells us that 11391 processes used 4k of stack while none at all used 16k. Useful for some system tuning things, but particularly useful for "the dynamic kernel stack project".
  - "kmemleak: support for percpu memory leak detect" from Pavel Tikhomirov. Teaches kmemleak to detect leakage of percpu memory.
  - "mm: memcg: page counters optimizations" from Roman Gushchin. "3 independent small optimizations of page counters".
  - "mm: split PTE/PMD PT table Kconfig cleanups+clarifications" from David Hildenbrand. Improves PTE/PMD splitlock detection, makes powerpc/8xx work correctly by design rather than by accident.
  - "mm: remove arch_make_page_accessible()" from David Hildenbrand. Some folio conversions which make arch_make_page_accessible() unneeded.
  - "mm, memcg: cg2 memory{.swap,}.peak write handlers" from David Finkel. Cleans up and fixes our handling of the resetting of the cgroup/process peak-memory-use detector.
  - "Make core VMA operations internal and testable" from Lorenzo Stoakes. Rationalization and encapsulation of the VMA manipulation APIs, with a view to better enable testing of the VMA functions, even from a userspace-only harness.
  - "mm: zswap: fixes for global shrinker" from Takero Funaki. Fix issues in the zswap global shrinker, resulting in improved performance.
  - "mm: print the promo watermark in zoneinfo" from Kaiyang Zhao. Fill in some missing info in /proc/zoneinfo.
  - "mm: replace follow_page() by folio_walk" from David Hildenbrand. Code cleanups and rationalizations (conversion to folio_walk()) resulting in the removal of follow_page().
  - "improving dynamic zswap shrinker protection scheme" from Nhat Pham. Some tuning to improve zswap's dynamic shrinker. Significant reductions in swapin and improvements in performance are shown.
  - "mm: Fix several issues with unaccepted memory" from Kirill Shutemov. Improvements to the new unaccepted memory feature.
  - "mm/mprotect: Fix dax puds" from Peter Xu. Implements mprotect on DAX PUDs. This was missing, although nobody seems to have noticed yet.
  - "Introduce a store type enum for the Maple tree" from Sidhartha Kumar. Cleanups and modest performance improvements for the maple tree library code.
  - "memcg: further decouple v1 code from v2" from Shakeel Butt. Move more cgroup v1 remnants away from the v2 memcg code.
  - "memcg: initiate deprecation of v1 features" from Shakeel Butt. Adds various warnings telling users that memcg v1 features are deprecated.
  - "mm: swap: mTHP swap allocator base on swap cluster order" from Chris Li. Greatly improves the success rate of the mTHP swap allocation.
  - "mm: introduce numa_memblks" from Mike Rapoport. Moves various disparate per-arch implementations of numa_memblk code into generic code.
  - "mm: batch free swaps for zap_pte_range()" from Barry Song. Greatly improves the performance of munmap() of swap-filled ptes.
  - "support large folio swap-out and swap-in for shmem" from Baolin Wang. With this series we no longer split shmem large folios into single-page folios when swapping out shmem.
  - "mm/hugetlb: alloc/free gigantic folios" from Yu Zhao. Nice performance improvements and code reductions for gigantic folios.
  - "support shmem mTHP collapse" from Baolin Wang. Adds support for khugepaged's collapsing of shmem mTHP folios.
  - "mm: Optimize mseal checks" from Pedro Falcato. Fixes an mprotect() performance regression due to the addition of mseal().
  - "Increase the number of bits available in page_type" from Matthew Wilcox. Increases the number of bits available in page_type!
  - "Simplify the page flags a little" from Matthew Wilcox. Many legacy page flags are now folio flags, so the page-based flags and their accessors/mutators can be removed.
  - "mm: store zero pages to be swapped out in a bitmap" from Usama Arif. An optimization which permits us to avoid writing/reading zero-filled zswap pages to backing store.
  - "Avoid MAP_FIXED gap exposure" from Liam Howlett. Fixes a race window which occurs when a MAP_FIXED operation is occurring during an unrelated vma tree walk.
  - "mm: remove vma_merge()" from Lorenzo Stoakes. Major rotorooting of the vma_merge() functionality, making it cleaner, more testable and better tested.
  - "misc fixups for DAMON {self,kunit} tests" from SeongJae Park. Minor fixups of DAMON selftests and kunit tests.
  - "mm: memory_hotplug: improve do_migrate_range()" from Kefeng Wang. Code cleanups and folio conversions.
  - "Shmem mTHP controls and stats improvements" from Ryan Roberts. Cleanups for shmem controls and stats.
  - "mm: count the number of anonymous THPs per size" from Barry Song. Expose additional anon THP stats to userspace for improved tuning.
  - "mm: finish isolate/putback_lru_page()" from Kefeng Wang: more folio conversions and removal of now-unused page-based APIs.
  - "replace per-quota region priorities histogram buffer with per-context one" from SeongJae Park. DAMON histogram rationalization.
  - "Docs/damon: update GitHub repo URLs and maintainer-profile" from SeongJae Park. DAMON documentation updates.
  - "mm/vdpa: correct misuse of non-direct-reclaim __GFP_NOFAIL and improve related doc and warn" from Jason Wang: fixes usage of page allocator __GFP_NOFAIL and GFP_ATOMIC flags.
  - "mm: split underused THPs" from Yu Zhao. Improve THP=always policy. This was overprovisioning THPs in sparsely accessed memory areas.
  - "zram: introduce custom comp backends API" from Sergey Senozhatsky. Add support for zram run-time compression algorithm tuning.
  - "mm: Care about shadow stack guard gap when getting an unmapped area" from Mark Brown. Fix up the various arch_get_unmapped_area() implementations to better respect guard areas.
  - "Improve mem_cgroup_iter()" from Kinsey Ho. Improve the reliability of mem_cgroup_iter() and various code cleanups.
  - "mm: Support huge pfnmaps" from Peter Xu. Extends the usage of huge pfnmap support.
  - "resource: Fix region_intersects() vs add_memory_driver_managed()" from Huang Ying. Fix a bug in region_intersects() for systems with CXL memory.
  - "mm: hwpoison: two more poison recovery" from Kefeng Wang. Teaches a couple more code paths to correctly recover from encountering poisoned memory.
  - "mm: enable large folios swap-in support" from Barry Song. Support the swapin of mTHP memory into appropriately-sized folios, rather than into single-page folios"

* tag 'mm-stable-2024-09-20-02-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (416 commits)
  zram: free secondary algorithms names
  uprobes: turn xol_area->pages[2] into xol_area->page
  uprobes: introduce the global struct vm_special_mapping xol_mapping
  Revert "uprobes: use vm_special_mapping close() functionality"
  mm: support large folios swap-in for sync io devices
  mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios
  mm: fix swap_read_folio_zeromap() for large folios with partial zeromap
  mm/debug_vm_pgtable: Use pxdp_get() for accessing page table entries
  set_memory: add __must_check to generic stubs
  mm/vma: return the exact errno in vms_gather_munmap_vmas()
  memcg: cleanup with !CONFIG_MEMCG_V1
  mm/show_mem.c: report alloc tags in human readable units
  mm: support poison recovery from copy_present_page()
  mm: support poison recovery from do_cow_fault()
  resource, kunit: add test case for region_intersects()
  resource: make alloc_free_mem_region() works for iomem_resource
  mm: z3fold: deprecate CONFIG_Z3FOLD
  vfio/pci: implement huge_fault support
  mm/arm64: support large pfn mappings
  mm/x86: support large pfn mappings
  ...
2024-09-17  zsmalloc: use unique zsmalloc caches names  (Sergey Senozhatsky, 1 file, -6/+17)
Each zsmalloc pool maintains several named kmem caches for zs_handle-s and zspage-s. On a system with multiple zsmalloc pools and CONFIG_DEBUG_VM this triggers kmem_cache_sanity_check():

  kmem_cache of name 'zspage' already exists
  WARNING: at mm/slab_common.c:108 do_kmem_cache_create_usercopy+0xb5/0x310
  ...
  kmem_cache of name 'zs_handle' already exists
  WARNING: at mm/slab_common.c:108 do_kmem_cache_create_usercopy+0xb5/0x310
  ...

We provide the zram device name when initializing its zsmalloc pool, so we can use that same name for the zsmalloc caches and, hence, create unique names that can easily be linked to the zram device that created them.

So instead of having this:

  cat /proc/slabinfo
  slabinfo - version: 2.1
  zspage         46     46 ...
  zs_handle     128    128 ...
  zspage      34270  34270 ...
  zs_handle   34816  34816 ...
  zspage          0      0 ...
  zs_handle       0      0 ...

we now have this:

  cat /proc/slabinfo
  slabinfo - version: 2.1
  zspage-zram2        46     46 ...
  zs_handle-zram2    128    128 ...
  zspage-zram0     34270  34270 ...
  zs_handle-zram0  34816  34816 ...
  zspage-zram1         0      0 ...
  zs_handle-zram1      0      0 ...

Link: https://lkml.kernel.org/r/20240906035103.2435557-1-senozhatsky@chromium.org Fixes: 2e40e163a25a ("zsmalloc: decouple handle and object") Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
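A minimal sketch of the approach, assuming the pool name is available as pool->name when the caches are created (names, constants and error handling are illustrative, not the exact upstream diff):

    /*
     * Illustrative sketch: derive per-pool kmem_cache names from the
     * pool (zram device) name so multiple pools no longer register
     * duplicate "zs_handle"/"zspage" caches.
     */
    static int create_cache(struct zs_pool *pool)
    {
    	char *name;

    	name = kasprintf(GFP_KERNEL, "zs_handle-%s", pool->name);
    	if (!name)
    		return -ENOMEM;
    	pool->handle_cachep = kmem_cache_create(name, ZS_HANDLE_SIZE,
    						0, 0, NULL);
    	kfree(name);	/* kmem_cache_create() keeps its own copy of the name */
    	if (!pool->handle_cachep)
    		return -EINVAL;

    	name = kasprintf(GFP_KERNEL, "zspage-%s", pool->name);
    	if (!name)
    		return -ENOMEM;
    	pool->zspage_cachep = kmem_cache_create(name, sizeof(struct zspage),
    						0, 0, NULL);
    	kfree(name);
    	if (!pool->zspage_cachep)
    		return -EINVAL;

    	return 0;
    }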
2024-09-03  zsmalloc: use all available 24 bits of page_type  (Matthew Wilcox (Oracle), 1 file, -11/+4)
Now that we have an extra 8 bits, we don't need to limit ourselves to supporting a 64KiB page size. I'm sure both Hexagon users are grateful, but it does reduce complexity a little. We can also remove reset_first_obj_offset() as calling __ClearPageZsmalloc() will now reset all 32 bits of page_type. Link: https://lkml.kernel.org/r/20240821173914.2270383-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-28  minmax: make generic MIN() and MAX() macros available everywhere  (Linus Torvalds, 1 file, -2/+0)
This just standardizes the use of MIN() and MAX() macros, with the very traditional semantics. The goal is to use these for C constant expressions and for top-level / static initializers, and so be able to simplify the min()/max() macros.

These macro names were used by various kernel code - they are very traditional, after all - and all such users have been fixed up, with a few different approaches:

 - trivial duplicated macro definitions have been removed

   Note that 'trivial' here means that it's obviously kernel code that already included all the major kernel headers, and thus gets the new generic MIN/MAX macros automatically.

 - non-trivial duplicated macro definitions are guarded with #ifndef

   This is the "yes, they define their own versions, but no, the include situation is not entirely obvious, and maybe they don't get the generic version automatically" case.

 - strange use case #1

   A couple of drivers decided that the way they want to describe their versioning is with

        #define MAJ 1
        #define MIN 2
        #define DRV_VERSION __stringify(MAJ) "." __stringify(MIN)

   which adds zero value and I just did my Alexander the Great impersonation, and rewrote that pointless Gordian knot as

        #define DRV_VERSION "1.2"

   instead.

 - strange use case #2

   A couple of drivers thought that it's a good idea to have a random 'MIN' or 'MAX' define for a value or index into a table, rather than the traditional macro that takes arguments. These values were re-written as C enums instead. The new function-like macros only expand when followed by an open parenthesis, and thus don't clash with enum use.

Happily, there weren't really all that many of these cases, and a lot of users already had the pattern of using '#ifndef' guarding (or in one case just using '#undef MIN') before defining their own private version that does the same thing. I left such cases alone.

Cc: David Laight <David.Laight@aculab.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
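For reference, the "very traditional semantics" are the classic double-evaluation forms; a hedged illustration of the idea, not necessarily the exact minmax.h implementation:

    /* Classic forms: no type checking, arguments may be evaluated twice. */
    #define MIN(a, b) ((a) < (b) ? (a) : (b))
    #define MAX(a, b) ((a) > (b) ? (a) : (b))

    /* Being plain constant expressions, they work in static initializers: */
    static char scratch[MAX(64, 128)];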
2024-07-12  zsmalloc: rename class stat mutators  (Sergey Senozhatsky, 1 file, -19/+19)
A cosmetic change.

 o Rename class_stat_inc() and class_stat_dec() to class_stat_add() and class_stat_sub() correspondingly. inc/dec are usually associated with +1/-1 modifications, while zsmalloc can modify stats by up to ->objs_per_zspage. Use add/sub (following the atomics naming).

 o Rename zs_stat_get() to class_stat_read(). get() is usually associated with ref-counting and is paired with put(). zs_stat_get() simply reads a class stat, so rename it to reflect that. (This also follows the atomics naming.)

Link: https://lkml.kernel.org/r/20240701031140.3756345-1-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-12  mm/zsmalloc: move record_obj() into obj_malloc()  (Chengming Zhou, 1 file, -9/+6)
We always call record_obj() to make the handle point to the object after obj_malloc(), so simplify the code by moving record_obj() into obj_malloc(). There should be no functional change. Link: https://lkml.kernel.org/r/20240627075959.611783-2-chengming.zhou@linux.dev Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-12  mm/zsmalloc: clarify class per-fullness zspage counts  (Chengming Zhou, 1 file, -0/+1)
We always use insert_zspage() and remove_zspage() to update a zspage's fullness location, which accounts correctly. But this special async free path uses "splice" instead of remove_zspage(), so the per-fullness zspage count for ZS_INUSE_RATIO_0 won't decrease. Clean things up by decreasing it while iterating over the zspage free list. This doesn't actually fix anything: ZS_INUSE_RATIO_0 is just a "placeholder" which is never used anywhere. Link: https://lkml.kernel.org/r/20240627075959.611783-1-chengming.zhou@linux.dev Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev> Cc: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-12  mm/zsmalloc: change back to per-size_class lock  (Chengming Zhou, 1 file, -35/+50)
Patch series "mm/zsmalloc: change back to per-size_class lock, v2". Commit c0547d0b6a4b ("zsmalloc: consolidate zs_pool's migrate_lock and size_class's locks") changed per-size_class lock to pool spinlock to prepare reclaim support in zsmalloc. Then reclaim support in zsmalloc had been dropped in favor of LRU reclaim in zswap, but this locking change had been left there. Obviously, the scalability of pool spinlock is worse than per-size_class. And we have a workaround that using 32 pools in zswap to avoid this scalability problem, which brings its own problems like memory waste and more memory fragmentation. So this series changes back to use per-size_class lock and using testing data in much stressed situation to verify that we can use only one pool in zswap. Note we only test and care about the zsmalloc backend, which makes sense now since zsmalloc became a lot more popular than other backends. Testing kernel build (make bzImage -j32) on tmpfs with memory.max=1GB, and zswap shrinker enabled with 10GB swapfile on ext4. real user sys 6.10.0-rc3 138.18 1241.38 1452.73 6.10.0-rc3-onepool 149.45 1240.45 1844.69 6.10.0-rc3-onepool-perclass 138.23 1242.37 1469.71 We can see from "sys" column that per-size_class locking with only one pool in zswap can have near performance with the current 32 pools. This patch (of 2): This patch is almost the revert of the commit c0547d0b6a4b ("zsmalloc: consolidate zs_pool's migrate_lock and size_class's locks"), which changed to use a global pool->lock instead of per-size_class lock and pool->migrate_lock, was preparation for suppporting reclaim in zsmalloc. Then reclaim in zsmalloc had been dropped in favor of LRU reclaim in zswap. In theory, per-size_class is more fine-grained than the pool->lock, since a pool can have many size_classes. As for the additional pool->migrate_lock, only free() and map() need to grab it to access stable handle to get zspage, and only in read lock mode. Link: https://lkml.kernel.org/r/20240625-zsmalloc-lock-mm-everything-v3-0-ad941699cb61@linux.dev Link: https://lkml.kernel.org/r/20240621-zsmalloc-lock-mm-everything-v2-0-d30e9cd2b793@linux.dev Link: https://lkml.kernel.org/r/20240617-zsmalloc-lock-mm-everything-v1-0-5e5081ea11b3@linux.dev Link: https://lkml.kernel.org/r/20240617-zsmalloc-lock-mm-everything-v1-1-5e5081ea11b3@linux.dev Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03  mm/zsmalloc: use a proper page type  (David Hildenbrand, 1 file, -4/+25)
Let's clean it up: use a proper page type and store our data (the offset into a page) in the lower 16 bits, as documented. We won't be able to support 256 KiB base pages, which is acceptable. Teach Kconfig to handle that cleanly using a new CONFIG_HAVE_ZSMALLOC. Based on this, we should do a proper "struct zsdesc" conversion, as proposed in [1]. This removes the last _mapcount/page_type offender. [1] https://lore.kernel.org/all/20231130101242.2590384-1-42.hyeyoo@gmail.com/ Link: https://lkml.kernel.org/r/20240529111904.2069608-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Tested-by: Sergey Senozhatsky <senozhatsky@chromium.org> [zram/zsmalloc workloads] Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
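A hedged sketch of how a small offset can live in the low 16 bits of page->page_type (mask and helper names are illustrative):

    /* Illustrative only: keep the first-object offset in the low 16 bits. */
    #define FIRST_OBJ_PAGE_TYPE_MASK	0xffffU

    static inline unsigned int get_first_obj_offset(struct page *page)
    {
    	return page->page_type & FIRST_OBJ_PAGE_TYPE_MASK;
    }

    static inline void set_first_obj_offset(struct page *page, unsigned int offset)
    {
    	/*
    	 * A 16-bit field only holds offsets below 64KiB, hence the
    	 * "no 256KiB base pages" limitation mentioned above.
    	 */
    	VM_WARN_ON_ONCE(offset & ~FIRST_OBJ_PAGE_TYPE_MASK);
    	page->page_type &= ~FIRST_OBJ_PAGE_TYPE_MASK;
    	page->page_type |= offset & FIRST_OBJ_PAGE_TYPE_MASK;
    }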
2024-07-03  mm: remove MIGRATE_SYNC_NO_COPY mode  (Kefeng Wang, 1 file, -8/+0)
Commit 2916ecc0f9d4 ("mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPY") introduced a new MIGRATE_SYNC_NO_COPY mode to allow offloading the copy to a device DMA engine. It is only used by __migrate_device_pages() to decide whether or not to copy the old page, and MIGRATE_SYNC_NO_COPY is only set in hmm. Since that setting was removed by the previous cleanup, we can remove the now-unnecessary MIGRATE_SYNC_NO_COPY. Link: https://lkml.kernel.org/r/20240524052843.182275-6-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jiaqi Yan <jiaqiyan@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03  mm/zsmalloc: add MODULE_DESCRIPTION()  (Jeff Johnson, 1 file, -0/+1)
Fix the 'make W=1' warning: WARNING: modpost: missing MODULE_DESCRIPTION() in mm/zsmalloc.o Link: https://lkml.kernel.org/r/20240513-mm-md-v1-4-8c20e7d26842@quicinc.com Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25  mm: zpool: return pool size in pages  (Johannes Weiner, 1 file, -3/+3)
All zswap backends track their pool sizes in pages. Currently they multiply by PAGE_SIZE for zswap, only for zswap to divide again in order to do limit math. Report pages directly. Link: https://lkml.kernel.org/r/20240312153901.3441-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-04  mm/zsmalloc: don't need to reserve LSB in handle  (Chengming Zhou, 1 file, -4/+1)
We save the allocated tag in the object header to indicate that it's allocated:

  handle |= OBJ_ALLOCATED_TAG;

So the object header needs to reserve the LSB for this tag bit. But the handle itself doesn't need to reserve the LSB for the tag, since it's only used to find the position of the object, via (pfn + obj_idx). So remove the LSB reservation from the handle; one more bit can be used for obj_idx. Link: https://lkml.kernel.org/r/20240228023854.3511239-1-chengming.zhou@linux.dev Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
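A sketch of the object-location encoding being discussed; OBJ_INDEX_BITS and OBJ_INDEX_MASK stand in for the real bit-width constants, so treat this as an illustration rather than the exact layout:

    /*
     * Illustration: the stored location combines the page frame number
     * with the object index inside the zspage. Only the object header
     * carries OBJ_ALLOCATED_TAG; the handle itself does not.
     */
    static unsigned long location_to_obj(struct page *page, unsigned int obj_idx)
    {
    	return ((unsigned long)page_to_pfn(page) << OBJ_INDEX_BITS) |
    		(obj_idx & OBJ_INDEX_MASK);
    }

    static void obj_to_location(unsigned long obj, struct page **page,
    			    unsigned int *obj_idx)
    {
    	*page = pfn_to_page(obj >> OBJ_INDEX_BITS);
    	*obj_idx = obj & OBJ_INDEX_MASK;
    }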
2024-02-23  mm/zsmalloc: remove get_zspage_mapping()  (Chengming Zhou, 1 file, -24/+4)
Actually we seldom use the class_idx returned from get_zspage_mapping(); only zspage->fullness is useful, so just use zspage->fullness directly and remove this helper. Note that zspage->fullness is not stable outside pool->lock; remove the redundant "VM_BUG_ON(fullness != ZS_INUSE_RATIO_0)" in async_free_zspage() since we already have the same VM_BUG_ON() in __free_zspage(), where it is safe to access zspage->fullness with pool->lock held. Link: https://lkml.kernel.org/r/20240220-b4-zsmalloc-cleanup-v1-3-5c5ee4ccdd87@bytedance.com Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-23  mm/zsmalloc: remove_zspage() don't need fullness parameter  (Chengming Zhou, 1 file, -7/+7)
We must remove_zspage() from its current fullness list, then use insert_zspage() to update its fullness and insert it into the new fullness list. Obviously, remove_zspage() doesn't need the fullness parameter. Link: https://lkml.kernel.org/r/20240220-b4-zsmalloc-cleanup-v1-2-5c5ee4ccdd87@bytedance.com Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-23  mm/zsmalloc: remove set_zspage_mapping()  (Chengming Zhou, 1 file, -11/+2)
Patch series "mm/zsmalloc: some cleanup for get/set_zspage_mapping()". The discussion[1] with Sergey shows there are some cleanup works to do in get/set_zspage_mapping(): - the fullness returned from get_zspage_mapping() is not stable outside pool->lock, this usage pattern is confusing, but should be ok in this free_zspage path. - we seldom use the class_idx returned from get_zspage_mapping(), only free_zspage path use to get its class. - set_zspage_mapping() always set the zspage->class, but it's never changed after zspage allocated. [1] https://lore.kernel.org/all/a6c22e30-cf10-4122-91bc-ceb9fb57a5d6@bytedance.com/ This patch (of 3): We only need to update zspage->fullness when insert_zspage(), since zspage->class is never changed after allocated. Link: https://lkml.kernel.org/r/20240220-b4-zsmalloc-cleanup-v1-0-5c5ee4ccdd87@bytedance.com Link: https://lkml.kernel.org/r/20240220-b4-zsmalloc-cleanup-v1-1-5c5ee4ccdd87@bytedance.com Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-23  mm/zsmalloc: remove unused zspage->isolated  (Chengming Zhou, 1 file, -32/+0)
zspage->isolated is not used anywhere, so we don't need to maintain it (which required holding the heavy pool lock just to update it); just remove it. Link: https://lkml.kernel.org/r/20240219-b4-szmalloc-migrate-v1-3-34cd49c6545b@bytedance.com Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-23  mm/zsmalloc: remove migrate_write_lock_nested()  (Chengming Zhou, 1 file, -17/+5)
The migrate write lock is to protect the race between zspage migration and zspage objects' map users. We only need to lock out the map users of the src zspage, not the dst zspage, which is safe for users to map concurrently, since we only need to do obj_malloc() from the dst zspage. So we can remove the migrate_write_lock_nested() use case. While we are here, clean up __zs_compact() by moving putback_zspage() outside of migrate_write_unlock(): since we hold the pool lock, no malloc or free users can come in. Link: https://lkml.kernel.org/r/20240219-b4-szmalloc-migrate-v1-2-34cd49c6545b@bytedance.com Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-23  mm/zsmalloc: fix migrate_write_lock() when !CONFIG_COMPACTION  (Chengming Zhou, 1 file, -6/+3)
Patch series "mm/zsmalloc: fix and optimize objects/page migration". This series is to fix and optimize the zsmalloc objects/page migration. This patch (of 3): migrate_write_lock() is a empty function when !CONFIG_COMPACTION, in which case zs_compact() can be triggered from shrinker reclaim context. (Maybe it's better to rename it to zs_shrink()?) And zspage map object users rely on this migrate_read_lock() so object won't be migrated elsewhere. Fix it by always implementing the migrate_write_lock() related functions. Link: https://lkml.kernel.org/r/20240219-b4-szmalloc-migrate-v1-0-34cd49c6545b@bytedance.com Link: https://lkml.kernel.org/r/20240219-b4-szmalloc-migrate-v1-1-34cd49c6545b@bytedance.com Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-05  mm: zsmalloc: return -ENOSPC rather than -EINVAL in zs_malloc while size is too large  (Barry Song, 1 file, -1/+4)
This is the case where the "compressed" data is larger than the original data; it is better to return -ENOSPC, which can help zswap record a poor compression rather than an invalid request. Then we get friendlier counting for reject_compress_poor in debugfs.

  bool zswap_store(struct folio *folio)
  {
  	...
  	ret = zpool_malloc(zpool, dlen, gfp, &handle);
  	if (ret == -ENOSPC) {
  		zswap_reject_compress_poor++;
  		goto put_dstmem;
  	}
  	if (ret) {
  		zswap_reject_alloc_fail++;
  		goto put_dstmem;
  	}
  	...
  }

Also, zbud_alloc() and z3fold_alloc() return -ENOSPC in the same case, e.g.

  static int z3fold_alloc(struct z3fold_pool *pool, size_t size, gfp_t gfp,
  			unsigned long *handle)
  {
  	...
  	if (!size || (gfp & __GFP_HIGHMEM))
  		return -EINVAL;
  	if (size > PAGE_SIZE)
  		return -ENOSPC;
  	...
  }

Link: https://lkml.kernel.org/r/20231228061802.25280-1-v-songbaohua@oppo.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Chris Li <chrisl@kernel.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
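A minimal sketch of the zs_malloc() side of the change (a hedged illustration, not the literal diff):

    unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
    {
    	if (unlikely(!size))
    		return (unsigned long)ERR_PTR(-EINVAL);

    	/*
    	 * "Compressed" data larger than the maximum allocation size:
    	 * report no space rather than an invalid request, so zswap can
    	 * count it as reject_compress_poor.
    	 */
    	if (unlikely(size > ZS_MAX_ALLOC_SIZE))
    		return (unsigned long)ERR_PTR(-ENOSPC);

    	/* ... normal allocation path ... */
    }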
2023-10-18  zsmalloc: use copy_page for full page copy  (Mark-PK Tsai, 1 file, -1/+1)
Some architectures have implemented an optimized copy_page for full-page copying, such as arm. On my arm platform, using the copy_page helper for single-page copying is about 10 percent faster than memcpy. Link: https://lkml.kernel.org/r/20231006060245.7411-1-mark-pk.tsai@mediatek.com Signed-off-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> Cc: Matthias Brugger <matthias.bgg@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: YJ Chiang <yj.chiang@mediatek.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
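The change itself is essentially of this shape (variable names illustrative):

    /* Before: generic memcpy of one whole page. */
    memcpy(d_addr, s_addr, PAGE_SIZE);

    /* After: let the architecture's optimized full-page copy do the work. */
    copy_page(d_addr, s_addr);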
2023-10-04  zsmalloc: dynamically allocate the mm-zspool shrinker  (Qi Zheng, 1 file, -12/+15)
In preparation for implementing lockless slab shrink, use new APIs to dynamically allocate the mm-zspool shrinker, so that it can be freed asynchronously via RCU. Then it doesn't need to wait for RCU read-side critical section when releasing the struct zs_pool. Link: https://lkml.kernel.org/r/20230911094444.68966-38-zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Abhinav Kumar <quic_abhinavk@quicinc.com> Cc: Alasdair Kergon <agk@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Carlos Llamas <cmllamas@google.com> Cc: Chandan Babu R <chandan.babu@oracle.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christian Koenig <christian.koenig@amd.com> Cc: Chuck Lever <cel@kernel.org> Cc: Coly Li <colyli@suse.de> Cc: Dai Ngo <Dai.Ngo@oracle.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Airlie <airlied@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Sterba <dsterba@suse.com> Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Cc: Gao Xiang <hsiangkao@linux.alibaba.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Huang Rui <ray.huang@amd.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Wang <jasowang@redhat.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Jeffle Xu <jefflexu@linux.alibaba.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Kirill Tkhai <tkhai@ya.ru> Cc: Marijn Suijten <marijn.suijten@somainline.org> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nadav Amit <namit@vmware.com> Cc: Neil Brown <neilb@suse.de> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Olga Kornievskaia <kolga@netapp.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rob Clark <robdclark@gmail.com> Cc: Rob Herring <robh@kernel.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Sean Paul <sean@poorly.run> Cc: Song Liu <song@kernel.org> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com> Cc: Tom Talpey <tom@talpey.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: Yue Hu <huyue2@coolpad.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
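For context, the new-style shrinker setup this converts to looks roughly like the following, assuming pool->shrinker becomes a struct shrinker pointer (details may differ from the actual patch):

    static int zs_register_shrinker(struct zs_pool *pool)
    {
    	pool->shrinker = shrinker_alloc(0, "mm-zspool:%s", pool->name);
    	if (!pool->shrinker)
    		return -ENOMEM;

    	pool->shrinker->scan_objects = zs_shrinker_scan;
    	pool->shrinker->count_objects = zs_shrinker_count;
    	pool->shrinker->batch = 0;
    	pool->shrinker->private_data = pool;

    	shrinker_register(pool->shrinker);
    	return 0;
    }

    /* ...and teardown becomes a single call that frees via RCU: */
    shrinker_free(pool->shrinker);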
2023-08-21  merge mm-hotfixes-stable into mm-stable to pick up depended-upon changes  (Andrew Morton, 1 file, -5/+9)
2023-08-18  zsmalloc: remove obj_tagged()  (Sergey Senozhatsky, 1 file, -22/+7)
obj_tagged() is not needed at this point, because objects can only have one tag: OBJ_ALLOCATED_TAG. We needed obj_tagged() for the zsmalloc LRU implementation, which has now been removed. Simplify zsmalloc code and revert to the previous implementation that was in place before the zsmalloc LRU series. Link: https://lkml.kernel.org/r/20230709025817.3842416-1-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18  zsmalloc: remove zs_compact_control  (Minchan Kim, 1 file, -28/+9)
__zs_compact() always puts src_zspage back into the class list after migrate_zspage(). Thus, we don't need to keep the last position of src_zspage any more. Let's remove it. Link: https://lkml.kernel.org/r/20230624053120.643409-4-senozhatsky@chromium.org Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Alexey Romanov <avromanov@sberdevices.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18  zsmalloc: move migration destination zspage inuse check  (Sergey Senozhatsky, 1 file, -4/+4)
The destination zspage fullness check needs to be done after zs_object_copy(), because that's where source and destination zspage fullness changes. Checking destination zspage fullness before zs_object_copy() may cause migration to loop through the source zspage's sub-pages scanning for allocated objects just to find out at the end that the destination zspage is full. Link: https://lkml.kernel.org/r/20230624053120.643409-3-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Alexey Romanov <avromanov@sberdevices.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18  zsmalloc: do not scan for allocated objects in empty zspage  (Sergey Senozhatsky, 1 file, -0/+9)
Patch series "zsmalloc: small compaction improvements", v2. A tiny series that can reduce the number of find_alloced_obj() invocations (which perform a linear scan of sub-page) during compaction. Inspired by Alexey Romanov's findings. This patch (of 3): zspage migration can terminate as soon as it moves the last allocated object from the source zspage. Add a simple helper zspage_empty() that tests zspage ->inuse on each migration iteration. Link: https://lkml.kernel.org/r/20230624053120.643409-2-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Suggested-by: Alexey Romanov <AVRomanov@sberdevices.ru> Reviewed-by: Alexey Romanov <avromanov@sberdevices.ru> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-04  zsmalloc: fix races between modifications of fullness and isolated  (Andrew Yang, 1 file, -5/+9)
We encountered many kernel exceptions of VM_BUG_ON(zspage->isolated == 0) in dec_zspage_isolation() and BUG_ON(!pages[1]) in zs_unmap_object() lately. This issue only occurs when migration and reclamation occur at the same time. With our memory stress test, we can reproduce this issue several times a day. We have no idea why no one else encountered this issue. BTW, we switched to the new kernel version with this defect a few months ago. Since fullness and isolated share the same unsigned int, modifications of them should be protected by the same lock. [andrew.yang@mediatek.com: move comment] Link: https://lkml.kernel.org/r/20230727062910.6337-1-andrew.yang@mediatek.com Link: https://lkml.kernel.org/r/20230721063705.11455-1-andrew.yang@mediatek.com Fixes: c4549b871102 ("zsmalloc: remove zspage isolation for migration") Signed-off-by: Andrew Yang <andrew.yang@mediatek.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> Cc: Matthias Brugger <matthias.bgg@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
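The structural reason is visible in a sketch of struct zspage (bit widths are illustrative, not the exact kernel values): fullness and isolated are bitfields packed into the same underlying word, so an unlocked read-modify-write of one can clobber a concurrent update of the other, which is why both must be modified under the same lock.

    struct zspage {
    	struct {
    		unsigned int huge:1;
    		unsigned int fullness:4;	/* packed with isolated below */
    		unsigned int class:9;
    		unsigned int isolated:5;	/* same word as fullness */
    		unsigned int magic:8;
    	};
    	unsigned int inuse;
    	unsigned int freeobj;
    	struct page *first_page;
    	struct list_head list;
    	/* ... */
    };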
2023-06-19  mm: zswap: remove shrink from zpool interface  (Domenico Cerasuolo, 1 file, -3/+1)
Now that all three zswap backends have removed their shrink code, it is no longer necessary for the zpool interface to include shrink/writeback endpoints. Link: https://lkml.kernel.org/r/20230612093815.133504-6-cerasuolodomenico@gmail.com Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19  mm: zswap: remove page reclaim logic from zsmalloc  (Domenico Cerasuolo, 1 file, -380/+12)
Switch zsmalloc to the new generic zswap LRU and remove its custom implementation. Link: https://lkml.kernel.org/r/20230612093815.133504-5-cerasuolodomenico@gmail.com Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Tested-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09  mm/zsmalloc: get rid of PAGE_MASK  (Alexey Romanov, 1 file, -6/+6)
Use offset_in_page() macro instead of 'val & ~PAGE_MASK' Link: https://lkml.kernel.org/r/20230516095029.49036-2-avromanov@sberdevices.ru Signed-off-by: Alexey Romanov <avromanov@sberdevices.ru> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
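For reference, offset_in_page() (from <linux/mm.h>) is essentially the same bit operation with a name; the usage example is illustrative:

    #define offset_in_page(p)	((unsigned long)(p) & ~PAGE_MASK)

    /* Instead of open-coding the mask... */
    off = (class->size * obj_idx) & ~PAGE_MASK;
    /* ...the intent reads directly: */
    off = offset_in_page(class->size * obj_idx);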
2023-05-17  zsmalloc: move LRU update from zs_map_object() to zs_malloc()  (Nhat Pham, 1 file, -27/+9)
Under memory pressure, we sometimes observe the following crash:

  [ 5694.832838] ------------[ cut here ]------------
  [ 5694.842093] list_del corruption, ffff888014b6a448->next is LIST_POISON1 (dead000000000100)
  [ 5694.858677] WARNING: CPU: 33 PID: 418824 at lib/list_debug.c:47 __list_del_entry_valid+0x42/0x80
  [ 5694.961820] CPU: 33 PID: 418824 Comm: fuse_counters.s Kdump: loaded Tainted: G S 5.19.0-0_fbk3_rc3_hoangnhatpzsdynshrv41_10870_g85a9558a25de #1
  [ 5694.990194] Hardware name: Wiwynn Twin Lakes MP/Twin Lakes Passive MP, BIOS YMM16 05/24/2021
  [ 5695.007072] RIP: 0010:__list_del_entry_valid+0x42/0x80
  [ 5695.017351] Code: 08 48 83 c2 22 48 39 d0 74 24 48 8b 10 48 39 f2 75 2c 48 8b 51 08 b0 01 48 39 f2 75 34 c3 48 c7 c7 55 d7 78 82 e8 4e 45 3b 00 <0f> 0b eb 31 48 c7 c7 27 a8 70 82 e8 3e 45 3b 00 0f 0b eb 21 48 c7
  [ 5695.054919] RSP: 0018:ffffc90027aef4f0 EFLAGS: 00010246
  [ 5695.065366] RAX: 41fe484987275300 RBX: ffff888008988180 RCX: 0000000000000000
  [ 5695.079636] RDX: ffff88886006c280 RSI: ffff888860060480 RDI: ffff888860060480
  [ 5695.093904] RBP: 0000000000000002 R08: 0000000000000000 R09: ffffc90027aef370
  [ 5695.108175] R10: 0000000000000000 R11: ffffffff82fdf1c0 R12: 0000000010000002
  [ 5695.122447] R13: ffff888014b6a448 R14: ffff888014b6a420 R15: 00000000138dc240
  [ 5695.136717] FS: 00007f23a7d3f740(0000) GS:ffff888860040000(0000) knlGS:0000000000000000
  [ 5695.152899] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [ 5695.164388] CR2: 0000560ceaab6ac0 CR3: 000000001c06c001 CR4: 00000000007706e0
  [ 5695.178659] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [ 5695.192927] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [ 5695.207197] PKRU: 55555554
  [ 5695.212602] Call Trace:
  [ 5695.217486]  <TASK>
  [ 5695.221674]  zs_map_object+0x91/0x270
  [ 5695.229000]  zswap_frontswap_store+0x33d/0x870
  [ 5695.237885]  ? do_raw_spin_lock+0x5d/0xa0
  [ 5695.245899]  __frontswap_store+0x51/0xb0
  [ 5695.253742]  swap_writepage+0x3c/0x60
  [ 5695.261063]  shrink_page_list+0x738/0x1230
  [ 5695.269255]  shrink_lruvec+0x5ec/0xcd0
  [ 5695.276749]  ? shrink_slab+0x187/0x5f0
  [ 5695.284240]  ? mem_cgroup_iter+0x6e/0x120
  [ 5695.292255]  shrink_node+0x293/0x7b0
  [ 5695.299402]  do_try_to_free_pages+0xea/0x550
  [ 5695.307940]  try_to_free_pages+0x19a/0x490
  [ 5695.316126]  __folio_alloc+0x19ff/0x3e40
  [ 5695.323971]  ? __filemap_get_folio+0x8a/0x4e0
  [ 5695.332681]  ? walk_component+0x2a8/0xb50
  [ 5695.340697]  ? generic_permission+0xda/0x2a0
  [ 5695.349231]  ? __filemap_get_folio+0x8a/0x4e0
  [ 5695.357940]  ? walk_component+0x2a8/0xb50
  [ 5695.365955]  vma_alloc_folio+0x10e/0x570
  [ 5695.373796]  ? walk_component+0x52/0xb50
  [ 5695.381634]  wp_page_copy+0x38c/0xc10
  [ 5695.388953]  ? filename_lookup+0x378/0xbc0
  [ 5695.397140]  handle_mm_fault+0x87f/0x1800
  [ 5695.405157]  do_user_addr_fault+0x1bd/0x570
  [ 5695.413520]  exc_page_fault+0x5d/0x110
  [ 5695.421017]  asm_exc_page_fault+0x22/0x30

After some investigation, I have found the following issue: unlike other zswap backends, zsmalloc performs the LRU list update at the object mapping time, rather than when the slot for the object is allocated.
This deviation was discussed and agreed upon during the review process of the zsmalloc writeback patch series: https://lore.kernel.org/lkml/Y3flcAXNxxrvy3ZH@cmpxchg.org/

Unfortunately, this introduces a subtle bug that occurs when there is a concurrent store and reclaim, which interleave as follows:

  zswap_frontswap_store()            shrink_worker()
    zs_malloc()                        zs_zpool_shrink()
      spin_lock(&pool->lock)             zs_reclaim_page()
      zspage = find_get_zspage()
      spin_unlock(&pool->lock)
                                           spin_lock(&pool->lock)
                                           zspage = list_first_entry(&pool->lru)
                                           list_del(&zspage->lru)
                                             zspage->lru.next = LIST_POISON1
                                             zspage->lru.prev = LIST_POISON2
                                           spin_unlock(&pool->lock)
    zs_map_object()
      spin_lock(&pool->lock)
      if (!list_empty(&zspage->lru))
        list_del(&zspage->lru)
          CHECK_DATA_CORRUPTION(next == LIST_POISON1) /* BOOM */

With the current upstream code, this issue rarely happens. zswap only triggers writeback when the pool is already full, at which point all further store attempts are short-circuited. This creates an implicit pseudo-serialization between reclaim and store. I am working on a new zswap shrinking mechanism, which makes interleaving reclaim and store more likely, exposing this bug.

zbud and z3fold do not have this problem, because they perform the LRU list update in the alloc function, while still holding the pool's lock.

This patch fixes the aforementioned bug by moving the LRU update back to zs_malloc(), analogous to zbud and z3fold.

Link: https://lkml.kernel.org/r/20230505185054.2417128-1-nphamcs@gmail.com Fixes: 64f768c6b32e ("zsmalloc: add a LRU to zs_pool to keep track of zspages in LRU order") Signed-off-by: Nhat Pham <nphamcs@gmail.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
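A hedged sketch of the shape of the fix: do the LRU bookkeeping while the slot is being allocated and pool->lock is already held, mirroring zbud/z3fold. The helper name zs_update_lru() is hypothetical and used only for illustration:

    /* Called from the allocation path with pool->lock held. */
    static void zs_update_lru(struct zs_pool *pool, struct zspage *zspage)
    {
    	lockdep_assert_held(&pool->lock);

    	/* The freshly used zspage becomes most-recently-used. */
    	if (list_empty(&zspage->lru))
    		list_add(&zspage->lru, &pool->lru);
    	else
    		list_move(&zspage->lru, &pool->lru);
    }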
2023-04-21  zsmalloc: allow only one active pool compaction context  (Sergey Senozhatsky, 1 file, -0/+12)
zsmalloc pool can be compacted concurrently by many contexts, e.g.

  cc1
    handle_mm_fault()
     do_anonymous_page()
      __alloc_pages_slowpath()
       try_to_free_pages()
        do_try_to_free_pages()
         lru_gen_shrink_node()
          shrink_slab()
           do_shrink_slab()
            zs_shrinker_scan()
             zs_compact()

Pool compaction is currently (basically) single-threaded as it is performed under pool->lock. Having multiple compaction threads results in unnecessary contention, as each thread competes for pool->lock. This, in turn, affects all zsmalloc operations such as zs_malloc(), zs_map_object(), zs_free(), etc.

Introduce the pool->compaction_in_progress atomic variable, which ensures that only one compaction context can run at a time. This reduces overall pool->lock contention in (corner) cases when many contexts attempt to shrink zspool simultaneously.

Link: https://lkml.kernel.org/r/20230418074639.1903197-1-senozhatsky@chromium.org Fixes: c0547d0b6a4b ("zsmalloc: