From 122e093c1734361dedb64f65c99b93e28e4624f4 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Mon, 28 Jun 2021 19:33:26 -0700 Subject: mm/page_alloc: fix memory map initialization for descending nodes On systems with memory nodes sorted in descending order, for instance Dell Precision WorkStation T5500, the struct pages for higher PFNs and respectively lower nodes, could be overwritten by the initialization of struct pages corresponding to the holes in the memory sections. For example for the below memory layout [ 0.245624] Early memory node ranges [ 0.248496] node 1: [mem 0x0000000000001000-0x0000000000090fff] [ 0.251376] node 1: [mem 0x0000000000100000-0x00000000dbdf8fff] [ 0.254256] node 1: [mem 0x0000000100000000-0x0000001423ffffff] [ 0.257144] node 0: [mem 0x0000001424000000-0x0000002023ffffff] the range 0x1424000000 - 0x1428000000 in the beginning of node 0 starts in the middle of a section and will be considered as a hole during the initialization of the last section in node 1. The wrong initialization of the memory map causes panic on boot when CONFIG_DEBUG_VM is enabled. Reorder loop order of the memory map initialization so that the outer loop will always iterate over populated memory regions in the ascending order and the inner loop will select the zone corresponding to the PFN range. This way initialization of the struct pages for the memory holes will be always done for the ranges that are actually not populated. [akpm@linux-foundation.org: coding style fixes] Link: https://lkml.kernel.org/r/YNXlMqBbL+tBG7yq@kernel.org Link: https://bugzilla.kernel.org/show_bug.cgi?id=213073 Link: https://lkml.kernel.org/r/20210624062305.10940-1-rppt@kernel.org Fixes: 0740a50b9baa ("mm/page_alloc.c: refactor initialization of struct page for holes in memory layout") Signed-off-by: Mike Rapoport Cc: Boris Petkov Cc: Robert Shteynfeld Cc: Baoquan He Cc: Vlastimil Babka Cc: David Hildenbrand Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mm.h | 1 - 1 file changed, 1 deletion(-) (limited to 'include/linux') diff --git a/include/linux/mm.h b/include/linux/mm.h index 8ae31622deef..9afb8998e7e5 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2474,7 +2474,6 @@ extern void set_dma_reserve(unsigned long new_dma_reserve); extern void memmap_init_range(unsigned long, int, unsigned long, unsigned long, unsigned long, enum meminit_context, struct vmem_altmap *, int migratetype); -extern void memmap_init_zone(struct zone *zone); extern void setup_per_zone_wmarks(void); extern int __meminit init_per_zone_wmark_min(void); extern void mem_init(void); -- cgit v1.2.3 From 20ce0c2d5a303c41c0e02ceb596837868e290dcc Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Jonathan=20Neusch=C3=A4fer?= Date: Mon, 28 Jun 2021 19:33:32 -0700 Subject: kthread: switch to new kerneldoc syntax for named variable macro argument MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The syntax without dots is available since commit 43756e347f21 ("scripts/kernel-doc: Add support for named variable macro arguments"). The same HTML output is produced with and without this patch. Link: https://lkml.kernel.org/r/20210513161702.1721039-1-j.neuschaefer@gmx.net Signed-off-by: Jonathan Neuschäfer Cc: Jens Axboe Cc: Felix Kuehling Cc: Valentin Schneider Cc: Peter Zijlstra Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/kthread.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'include/linux') diff --git a/include/linux/kthread.h b/include/linux/kthread.h index 2484ed97e72f..db3eafea168f 100644 --- a/include/linux/kthread.h +++ b/include/linux/kthread.h @@ -18,7 +18,7 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data), * @threadfn: the function to run in the thread * @data: data pointer for @threadfn() * @namefmt: printf-style format string for the thread name - * @arg...: arguments for @namefmt. + * @arg: arguments for @namefmt. * * This macro will create a kthread on the current node, leaving it in * the stopped state. This is just a helper for kthread_create_on_node(); -- cgit v1.2.3 From 588c7fa022d7b2361500ead5660d9a1a2ecd9b7d Mon Sep 17 00:00:00 2001 From: Hyeonggon Yoo <42.hyeyoo@gmail.com> Date: Mon, 28 Jun 2021 19:34:39 -0700 Subject: mm, slub: change run-time assertion in kmalloc_index() to compile-time Currently when size is not supported by kmalloc_index, compiler will generate a run-time BUG() while compile-time error is also possible, and better. So change BUG to BUILD_BUG_ON_MSG to make compile-time check possible. Also remove code that allocates more than 32MB because current implementation supports only up to 32MB. [42.hyeyoo@gmail.com: fix support for clang 10] Link: https://lkml.kernel.org/r/20210518181247.GA10062@hyeyoo [vbabka@suse.cz: fix false-positive assert in kernel/bpf/local_storage.c] Link: https://lkml.kernel.org/r/bea97388-01df-8eac-091b-a3c89b4a4a09@suse.czLink: https://lkml.kernel.org/r/20210511173448.GA54466@hyeyoo [elver@google.com: kfence fix] Link: https://lkml.kernel.org/r/20210512195227.245000695c9014242e9a00e5@linux-foundation.org Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka Reviewed-by: Vlastimil Babka Signed-off-by: Marco Elver Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Marco Elver Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/slab.h | 17 ++++++++++++++--- 1 file changed, 14 insertions(+), 3 deletions(-) (limited to 'include/linux') diff --git a/include/linux/slab.h b/include/linux/slab.h index 0c97d788762c..bc9ab3a5a017 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -346,8 +346,14 @@ static __always_inline enum kmalloc_cache_type kmalloc_type(gfp_t flags) * 1 = 65 .. 96 bytes * 2 = 129 .. 192 bytes * n = 2^(n-1)+1 .. 2^n + * + * Note: __kmalloc_index() is compile-time optimized, and not runtime optimized; + * typical usage is via kmalloc_index() and therefore evaluated at compile-time. + * Callers where !size_is_constant should only be test modules, where runtime + * overheads of __kmalloc_index() can be tolerated. Also see kmalloc_slab(). */ -static __always_inline unsigned int kmalloc_index(size_t size) +static __always_inline unsigned int __kmalloc_index(size_t size, + bool size_is_constant) { if (!size) return 0; @@ -382,12 +388,17 @@ static __always_inline unsigned int kmalloc_index(size_t size) if (size <= 8 * 1024 * 1024) return 23; if (size <= 16 * 1024 * 1024) return 24; if (size <= 32 * 1024 * 1024) return 25; - if (size <= 64 * 1024 * 1024) return 26; - BUG(); + + if ((IS_ENABLED(CONFIG_CC_IS_GCC) || CONFIG_CLANG_VERSION >= 110000) + && !IS_ENABLED(CONFIG_PROFILE_ALL_BRANCHES) && size_is_constant) + BUILD_BUG_ON_MSG(1, "unexpected size in kmalloc_index()"); + else + BUG(); /* Will never be reached. Needed because the compiler may complain */ return -1; } +#define kmalloc_index(s) __kmalloc_index(s, true) #endif /* !CONFIG_SLOB */ void *__kmalloc(size_t size, gfp_t flags) __assume_kmalloc_alignment __malloc; -- cgit v1.2.3 From 792702911f581f7793962fbeb99d5c3a1b28f4c3 Mon Sep 17 00:00:00 2001 From: Stephen Boyd Date: Mon, 28 Jun 2021 19:34:52 -0700 Subject: slub: force on no_hash_pointers when slub_debug is enabled Obscuring the pointers that slub shows when debugging makes for some confusing slub debug messages: Padding overwritten. 0x0000000079f0674a-0x000000000d4dce17 Those addresses are hashed for kernel security reasons. If we're trying to be secure with slub_debug on the commandline we have some big problems given that we dump whole chunks of kernel memory to the kernel logs. Let's force on the no_hash_pointers commandline flag when slub_debug is on the commandline. This makes slub debug messages more meaningful and if by chance a kernel address is in some slub debug object dump we will have a better chance of figuring out what went wrong. Note that we don't use %px in the slub code because we want to reduce the number of places that %px is used in the kernel. This also nicely prints a big fat warning at kernel boot if slub_debug is on the commandline so that we know that this kernel shouldn't be used on production systems. [akpm@linux-foundation.org: fix build with CONFIG_SLUB_DEBUG=n] Link: https://lkml.kernel.org/r/20210601182202.3011020-5-swboyd@chromium.org Signed-off-by: Stephen Boyd Acked-by: Vlastimil Babka Acked-by: Petr Mladek Cc: Joe Perches Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Muchun Song Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/kernel.h | 2 ++ 1 file changed, 2 insertions(+) (limited to 'include/linux') diff --git a/include/linux/kernel.h b/include/linux/kernel.h index 15d8bad3d2f2..bf950621febf 100644 --- a/include/linux/kernel.h +++ b/include/linux/kernel.h @@ -357,6 +357,8 @@ int sscanf(const char *, const char *, ...); extern __scanf(2, 0) int vsscanf(const char *, const char *, va_list); +extern int no_hash_pointers_enable(char *str); + extern int get_option(char **str, int *pint); extern char *get_options(const char *str, int nints, int *ints); extern unsigned long long memparse(const char *ptr, char **retptr); -- cgit v1.2.3 From 9f849c6f9572d8cef407f55928d3dc68fc42ad3e Mon Sep 17 00:00:00 2001 From: Gavin Shan Date: Mon, 28 Jun 2021 19:35:22 -0700 Subject: mm/page_reporting: allow driver to specify reporting order The page reporting order (threshold) is sticky to @pageblock_order by default. The page reporting can never be triggered because the freeing page can't come up with a free area like that huge. The situation becomes worse when the system memory becomes heavily fragmented. For example, the following configurations are used on ARM64 when 64KB base page size is enabled. In this specific case, the page reporting won't be triggered until the freeing page comes up with a 512MB free area. That's hard to be met, especially when the system memory becomes heavily fragmented. PAGE_SIZE: 64KB HPAGE_SIZE: 512MB pageblock_order: 13 (512MB) MAX_ORDER: 14 This allows the drivers to specify the page reporting order when the page reporting device is registered. It falls back to @pageblock_order if it's not specified by the driver. The existing users (hv_balloon and virtio_balloon) don't specify it and @pageblock_order is still taken as their page reporting order. So this shouldn't introduce any functional changes. Link: https://lkml.kernel.org/r/20210625014710.42954-4-gshan@redhat.com Signed-off-by: Gavin Shan Reviewed-by: Alexander Duyck Cc: Anshuman Khandual Cc: Catalin Marinas Cc: David Hildenbrand Cc: "Michael S. Tsirkin" Cc: Will Deacon Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/page_reporting.h | 3 +++ 1 file changed, 3 insertions(+) (limited to 'include/linux') diff --git a/include/linux/page_reporting.h b/include/linux/page_reporting.h index 3b99e0ec24f2..fe648dfa3a7c 100644 --- a/include/linux/page_reporting.h +++ b/include/linux/page_reporting.h @@ -18,6 +18,9 @@ struct page_reporting_dev_info { /* Current state of page reporting */ atomic_t state; + + /* Minimal order of page reporting */ + unsigned int order; }; /* Tear-down and bring-up for page reporting devices */ -- cgit v1.2.3 From f3b6a6df38aa514d97e8c6fcc748be1d4142bec9 Mon Sep 17 00:00:00 2001 From: Roman Gushchin Date: Mon, 28 Jun 2021 19:35:53 -0700 Subject: writeback, cgroup: keep list of inodes attached to bdi_writeback Currently there is no way to iterate over inodes attached to a specific cgwb structure. It limits the ability to efficiently reclaim the writeback structure itself and associated memory and block cgroup structures without scanning all inodes belonging to a sb, which can be prohibitively expensive. While dirty/in-active-writeback an inode belongs to one of the bdi_writeback's io lists: b_dirty, b_io, b_more_io and b_dirty_time. Once cleaned up, it's removed from all io lists. So the inode->i_io_list can be reused to maintain the list of inodes, attached to a bdi_writeback structure. This patch introduces a new wb->b_attached list, which contains all inodes which were dirty at least once and are attached to the given cgwb. Inodes attached to the root bdi_writeback structures are never placed on such list. The following patch will use this list to try to release cgwbs structures more efficiently. Link: https://lkml.kernel.org/r/20210608230225.2078447-6-guro@fb.com Signed-off-by: Roman Gushchin Suggested-by: Jan Kara Reviewed-by: Jan Kara Acked-by: Tejun Heo Acked-by: Dennis Zhou Cc: Alexander Viro Cc: Dave Chinner Cc: Jan Kara Cc: Jens Axboe Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/backing-dev-defs.h | 1 + 1 file changed, 1 insertion(+) (limited to 'include/linux') diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h index fff9367a6348..e5dc238ebe4f 100644 --- a/include/linux/backing-dev-defs.h +++ b/include/linux/backing-dev-defs.h @@ -154,6 +154,7 @@ struct bdi_writeback { struct cgroup_subsys_state *blkcg_css; /* and blkcg */ struct list_head memcg_node; /* anchored at memcg->cgwb_list */ struct list_head blkcg_node; /* anchored at blkcg->cgwb_list */ + struct list_head b_attached; /* attached inodes, protected by list_lock */ union { struct work_struct release_work; -- cgit v1.2.3 From f5fbe6b7ad6ef1fbdf8074a6ca9fdab739bf86d4 Mon Sep 17 00:00:00 2001 From: Roman Gushchin Date: Mon, 28 Jun 2021 19:35:59 -0700 Subject: writeback, cgroup: support switching multiple inodes at once Currently only a single inode can be switched to another writeback structure at once. That means to switch an inode a separate inode_switch_wbs_context structure must be allocated, and a separate rcu callback and work must be scheduled. It's fine for the existing ad-hoc switching, which is not happening that often, but sub-optimal for massive switching required in order to release a writeback structure. To prepare for it, let's add a support for switching multiple inodes at once. Instead of containing a single inode pointer, inode_switch_wbs_context will contain a NULL-terminated array of inode pointers. inode_do_switch_wbs() will be called for each inode. To optimize the locking bdi->wb_switch_rwsem, old_wb's and new_wb's list_locks will be acquired and released only once altogether for all inodes. wb_wakeup() will be also be called only once. Instead of calling wb_put(old_wb) after each successful switch, wb_put_many() is introduced and used. Link: https://lkml.kernel.org/r/20210608230225.2078447-8-guro@fb.com Signed-off-by: Roman Gushchin Acked-by: Tejun Heo Reviewed-by: Jan Kara Acked-by: Dennis Zhou Cc: Alexander Viro Cc: Dave Chinner Cc: Jan Kara Cc: Jens Axboe Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/backing-dev-defs.h | 18 ++++++++++++++++-- 1 file changed, 16 insertions(+), 2 deletions(-) (limited to 'include/linux') diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h index e5dc238ebe4f..63f52ad2ce7a 100644 --- a/include/linux/backing-dev-defs.h +++ b/include/linux/backing-dev-defs.h @@ -240,8 +240,9 @@ static inline void wb_get(struct bdi_writeback *wb) /** * wb_put - decrement a wb's refcount * @wb: bdi_writeback to put + * @nr: number of references to put */ -static inline void wb_put(struct bdi_writeback *wb) +static inline void wb_put_many(struct bdi_writeback *wb, unsigned long nr) { if (WARN_ON_ONCE(!wb->bdi)) { /* @@ -252,7 +253,16 @@ static inline void wb_put(struct bdi_writeback *wb) } if (wb != &wb->bdi->wb) - percpu_ref_put(&wb->refcnt); + percpu_ref_put_many(&wb->refcnt, nr); +} + +/** + * wb_put - decrement a wb's refcount + * @wb: bdi_writeback to put + */ +static inline void wb_put(struct bdi_writeback *wb) +{ + wb_put_many(wb, 1); } /** @@ -281,6 +291,10 @@ static inline void wb_put(struct bdi_writeback *wb) { } +static inline void wb_put_many(struct bdi_writeback *wb, unsigned long nr) +{ +} + static inline bool wb_dying(struct bdi_writeback *wb) { return false; -- cgit v1.2.3 From c22d70a162d3cc177282c4487be4d54876ca55c8 Mon Sep 17 00:00:00 2001 From: Roman Gushchin Date: Mon, 28 Jun 2021 19:36:03 -0700 Subject: writeback, cgroup: release dying cgwbs by switching attached inodes Asynchronously try to release dying cgwbs by switching attached inodes to the nearest living ancestor wb. It helps to get rid of per-cgroup writeback structures themselves and of pinned memory and block cgroups, which are significantly larger structures (mostly due to large per-cpu statistics data). This prevents memory waste and helps to avoid different scalability problems caused by large piles of dying cgroups. Reuse the existing mechanism of inode switching used for foreign inode detection. To speed things up batch up to 115 inode switching in a single operation (the maximum number is selected so that the resulting struct inode_switch_wbs_context can fit into 1024 bytes). Because every switching consists of two steps divided by an RCU grace period, it would be too slow without batching. Please note that the whole batch counts as a single operation (when increasing/decreasing isw_nr_in_flight). This allows to keep umounting working (flush the switching queue), however prevents cleanups from consuming the whole switching quota and effectively blocking the frn switching. A cgwb cleanup operation can fail due to different reasons (e.g. not enough memory, the cgwb has an in-flight/pending io, an attached inode in a wrong state, etc). In this case the next scheduled cleanup will make a new attempt. An attempt is made each time a new cgwb is offlined (in other words a memcg and/or a blkcg is deleted by a user). In the future an additional attempt scheduled by a timer can be implemented. [guro@fb.com: replace open-coded "115" with arithmetic] Link: https://lkml.kernel.org/r/YMEcSBcq/VXMiPPO@carbon.dhcp.thefacebook.com [guro@fb.com: add smp_mb() to inode_prepare_wbs_switch()] Link: https://lkml.kernel.org/r/YMFa+guFw7OFjf3X@carbon.dhcp.thefacebook.com [willy@infradead.org: fix documentation] Link: https://lkml.kernel.org/r/20210615200242.1716568-2-willy@infradead.org Link: https://lkml.kernel.org/r/20210608230225.2078447-9-guro@fb.com Signed-off-by: Roman Gushchin Signed-off-by: Matthew Wilcox (Oracle) Acked-by: Tejun Heo Acked-by: Dennis Zhou Reviewed-by: Jan Kara Cc: Alexander Viro Cc: Dave Chinner Cc: Jan Kara Cc: Jens Axboe Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/backing-dev-defs.h | 1 + include/linux/writeback.h | 1 + 2 files changed, 2 insertions(+) (limited to 'include/linux') diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h index 63f52ad2ce7a..1d7edad9914f 100644 --- a/include/linux/backing-dev-defs.h +++ b/include/linux/backing-dev-defs.h @@ -155,6 +155,7 @@ struct bdi_writeback { struct list_head memcg_node; /* anchored at memcg->cgwb_list */ struct list_head blkcg_node; /* anchored at blkcg->cgwb_list */ struct list_head b_attached; /* attached inodes, protected by list_lock */ + struct list_head offline_node; /* anchored at offline_cgwbs */ union { struct work_struct release_work; diff --git a/include/linux/writeback.h b/include/linux/writeback.h index 8e5c5bb16e2d..95de51c10248 100644 --- a/include/linux/writeback.h +++ b/include/linux/writeback.h @@ -221,6 +221,7 @@ void wbc_account_cgroup_owner(struct writeback_control *wbc, struct page *page, int cgroup_writeback_by_id(u64 bdi_id, int memcg_id, unsigned long nr_pages, enum wb_reason reason, struct wb_completion *done); void cgroup_writeback_umount(void); +bool cleanup_offline_cgwb(struct bdi_writeback *wb); /** * inode_attach_wb - associate an inode with its wb -- cgit v1.2.3 From c1e3dbe9818e3caa4e467255a348df56912ca549 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Mon, 28 Jun 2021 19:36:09 -0700 Subject: fs: move ramfs_aops to libfs Move the ramfs aops to libfs and reuse them for kernfs and configfs. Thosw two did not wire up ->set_page_dirty before and now get __set_page_dirty_no_writeback, which is the right one for no-writeback address_space usage. Drop the now unused exports of the libfs helpers only used for ramfs-style pagecache usage. Link: https://lkml.kernel.org/r/20210614061512.3966143-3-hch@lst.de Signed-off-by: Christoph Hellwig Reviewed-by: Greg Kroah-Hartman Reviewed-by: Jan Kara Cc: Al Viro Cc: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/fs.h | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) (limited to 'include/linux') diff --git a/include/linux/fs.h b/include/linux/fs.h index c3c88fdb9b2a..869909345420 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -3422,13 +3422,10 @@ extern void noop_invalidatepage(struct page *page, unsigned int offset, unsigned int length); extern ssize_t noop_direct_IO(struct kiocb *iocb, struct iov_iter *iter); extern int simple_empty(struct dentry *); -extern int simple_readpage(struct file *file, struct page *page); extern int simple_write_begin(struct file *file, struct address_space *mapping, loff_t pos, unsigned len, unsigned flags, struct page **pagep, void **fsdata); -extern int simple_write_end(struct file *file, struct address_space *mapping, - loff_t pos, unsigned len, unsigned copied, - struct page *page, void *fsdata); +extern const struct address_space_operations ram_aops; extern int always_delete_dentry(const struct dentry *); extern struct inode *alloc_anon_inode(struct super_block *); extern int simple_nosetlease(struct file *, long, struct file_lock **, void **); -- cgit v1.2.3 From 6e1cae881a0646f31fe2bda90297d820da1137eb Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Mon, 28 Jun 2021 19:36:15 -0700 Subject: mm/writeback: move __set_page_dirty() to core mm Patch series "Further set_page_dirty cleanups". Prompted by Christoph's recent patches, here are some more patches to improve the state of set_page_dirty(). They're all from the folio tree, so they've been tested to a certain extent. This patch (of 6): Nothing in __set_page_dirty() is specific to buffer_head, so move it to mm/page-writeback.c. That removes the only caller of account_page_dirtied() outside of page-writeback.c, so make it static. Link: https://lkml.kernel.org/r/20210615162342.1669332-1-willy@infradead.org Link: https://lkml.kernel.org/r/20210615162342.1669332-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Reviewed-by: Christoph Hellwig Reviewed-by: Greg Kroah-Hartman Cc: Jan Kara Cc: Al Viro Cc: Dan Williams Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mm.h | 1 - 1 file changed, 1 deletion(-) (limited to 'include/linux') diff --git a/include/linux/mm.h b/include/linux/mm.h index 9afb8998e7e5..12589b811555 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1855,7 +1855,6 @@ int __set_page_dirty_nobuffers(struct page *page); int __set_page_dirty_no_writeback(struct page *page); int redirty_page_for_writepage(struct writeback_control *wbc, struct page *page); -void account_page_dirtied(struct page *page, struct address_space *mapping); void account_page_cleaned(struct page *page, struct address_space *mapping, struct bdi_writeback *wb); int set_page_dirty(struct page *page); -- cgit v1.2.3 From fd7353f88bde80d557b6d74a5351979fc8b1b8db Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Mon, 28 Jun 2021 19:36:21 -0700 Subject: iomap: use __set_page_dirty_nobuffers The only difference between iomap_set_page_dirty() and __set_page_dirty_nobuffers() is that the latter includes a debugging check that a !Uptodate page has private data. Link: https://lkml.kernel.org/r/20210615162342.1669332-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Reviewed-by: Christoph Hellwig Cc: Al Viro Cc: Dan Williams Cc: Greg Kroah-Hartman Cc: Jan Kara Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/iomap.h | 1 - 1 file changed, 1 deletion(-) (limited to 'include/linux') diff --git a/include/linux/iomap.h b/include/linux/iomap.h index c87d0cb0de6d..479c1da3e221 100644 --- a/include/linux/iomap.h +++ b/include/linux/iomap.h @@ -159,7 +159,6 @@ ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from, const struct iomap_ops *ops); int iomap_readpage(struct page *page, const struct iomap_ops *ops); void iomap_readahead(struct readahead_control *, const struct iomap_ops *ops); -int iomap_set_page_dirty(struct page *page); int iomap_is_partially_uptodate(struct page *page, unsigned long from, unsigned long count); int iomap_releasepage(struct page *page, gfp_t gfp_mask); -- cgit v1.2.3 From b82a96c9253333a8834b2df5f262a39cccf4f6c7 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Mon, 28 Jun 2021 19:36:27 -0700 Subject: fs: remove noop_set_page_dirty() Use __set_page_dirty_no_writeback() instead. This will set the dirty bit on the page, which will be used to avoid calling set_page_dirty() in the future. It will have no effect on actually writing the page back, as the pages are not on any LRU lists. [akpm@linux-foundation.org: export __set_page_dirty_no_writeback() to modules] Link: https://lkml.kernel.org/r/20210615162342.1669332-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Cc: Al Viro Cc: Christoph Hellwig Cc: Dan Williams Cc: Greg Kroah-Hartman Cc: Jan Kara Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/fs.h | 1 - 1 file changed, 1 deletion(-) (limited to 'include/linux') diff --git a/include/linux/fs.h b/include/linux/fs.h index 869909345420..fad6663cd1b0 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -3417,7 +3417,6 @@ extern int simple_rename(struct user_namespace *, struct inode *, extern void simple_recursive_removal(struct dentry *, void (*callback)(struct dentry *)); extern int noop_fsync(struct file *, loff_t, loff_t, int); -extern int noop_set_page_dirty(struct page *page); extern void noop_invalidatepage(struct page *page, unsigned int offset, unsigned int length); extern ssize_t noop_direct_IO(struct kiocb *iocb, struct iov_iter *iter); -- cgit v1.2.3 From 3a6b2162005f24c7caa10d7f10dba487629787f2 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Mon, 28 Jun 2021 19:36:30 -0700 Subject: mm: move page dirtying prototypes from mm.h These functions implement the address_space ->set_page_dirty operation and should live in pagemap.h, not mm.h so that the rest of the kernel doesn't get funny ideas about calling them directly. Link: https://lkml.kernel.org/r/20210615162342.1669332-7-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Reviewed-by: Christoph Hellwig Cc: Al Viro Cc: Dan Williams Cc: Greg Kroah-Hartman Cc: Jan Kara Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mm.h | 3 --- include/linux/pagemap.h | 4 ++++ 2 files changed, 4 insertions(+), 3 deletions(-) (limited to 'include/linux') diff --git a/include/linux/mm.h b/include/linux/mm.h index 12589b811555..e39ed497578b 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1850,9 +1850,6 @@ extern int try_to_release_page(struct page * page, gfp_t gfp_mask); extern void do_invalidatepage(struct page *page, unsigned int offset, unsigned int length); -void __set_page_dirty(struct page *, struct address_space *, int warn); -int __set_page_dirty_nobuffers(struct page *page); -int __set_page_dirty_no_writeback(struct page *page); int redirty_page_for_writepage(struct writeback_control *wbc, struct page *page); void account_page_cleaned(struct page *page, struct address_space *mapping, diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index 0f1b34dbf3a2..ed02aa522263 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -702,6 +702,10 @@ int wait_on_page_writeback_killable(struct page *page); extern void end_page_writeback(struct page *page); void wait_for_stable_page(struct page *page); +void __set_page_dirty(struct page *, struct address_space *, int warn); +int __set_page_dirty_nobuffers(struct page *page); +int __set_page_dirty_no_writeback(struct page *page); + void page_endio(struct page *page, bool is_write, int err); /** -- cgit v1.2.3 From a458b76a4171f893efa7657dc079924580a8746a Mon Sep 17 00:00:00 2001 From: Andrea Arcangeli Date: Mon, 28 Jun 2021 19:36:40 -0700 Subject: mm: gup: pack has_pinned in MMF_HAS_PINNED has_pinned 32bit can be packed in the MMF_HAS_PINNED bit as a noop cleanup. Any atomic_inc/dec to the mm cacheline shared by all threads in pin-fast would reintroduce a loss of SMP scalability to pin-fast, so there's no future potential usefulness to keep an atomic in the mm for this. set_bit(MMF_HAS_PINNED) will be theoretically a bit slower than WRITE_ONCE (atomic_set is equivalent to WRITE_ONCE), but the set_bit (just like atomic_set after this commit) has to be still issued only once per "mm", so the difference between the two will be lost in the noise. will-it-scale "mmap2" shows no change in performance with enterprise config as expected. will-it-scale "pin_fast" retains the > 4000% SMP scalability performance improvement against upstream as expected. This is a noop as far as overall performance and SMP scalability are concerned. [peterx@redhat.com: pack has_pinned in MMF_HAS_PINNED] Link: https://lkml.kernel.org/r/YJqWESqyxa8OZA+2@t490s [akpm@linux-foundation.org: coding style fixes] [peterx@redhat.com: fix build for task_mmu.c, introduce mm_set_has_pinned_flag, fix comments] Link: https://lkml.kernel.org/r/20210507150553.208763-4-peterx@redhat.com Signed-off-by: Andrea Arcangeli Signed-off-by: Peter Xu Reviewed-by: John Hubbard Cc: Hugh Dickins Cc: Jan Kara Cc: Jann Horn Cc: Jason Gunthorpe Cc: Kirill Shutemov Cc: Kirill Tkhai Cc: Matthew Wilcox Cc: Michal Hocko Cc: Oleg Nesterov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mm.h | 2 +- include/linux/mm_types.h | 10 ---------- include/linux/sched/coredump.h | 8 ++++++++ 3 files changed, 9 insertions(+), 11 deletions(-) (limited to 'include/linux') diff --git a/include/linux/mm.h b/include/linux/mm.h index e39ed497578b..79f32962d7ae 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1341,7 +1341,7 @@ static inline bool page_needs_cow_for_dma(struct vm_area_struct *vma, if (!is_cow_mapping(vma->vm_flags)) return false; - if (!atomic_read(&vma->vm_mm->has_pinned)) + if (!test_bit(MMF_HAS_PINNED, &vma->vm_mm->flags)) return false; return page_maybe_dma_pinned(page); diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 8f0fb62e8975..b66d0225414e 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -435,16 +435,6 @@ struct mm_struct { */ atomic_t mm_count; - /** - * @has_pinned: Whether this mm has pinned any pages. This can - * be either replaced in the future by @pinned_vm when it - * becomes stable, or grow into a counter on its own. We're - * aggresive on this bit now - even if the pinned pages were - * unpinned later on, we'll still keep this bit set for the - * lifecycle of this mm just for simplicity. - */ - atomic_t has_pinned; - #ifdef CONFIG_MMU atomic_long_t pgtables_bytes; /* PTE page table pages */ #endif diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h index dfd82eab2902..4d9e3a656875 100644 --- a/include/linux/sched/coredump.h +++ b/include/linux/sched/coredump.h @@ -73,6 +73,14 @@ static inline int get_dumpable(struct mm_struct *mm) #define MMF_OOM_VICTIM 25 /* mm is the oom victim */ #define MMF_OOM_REAP_QUEUED 26 /* mm was queued for oom_reaper */ #define MMF_MULTIPROCESS 27 /* mm is shared between processes */ +/* + * MMF_HAS_PINNED: Whether this mm has pinned any pages. This can be either + * replaced in the future by mm.pinned_vm when it becomes stable, or grow into + * a counter on its own. We're aggresive on this bit for now: even if the + * pinned pages were unpinned later on, we'll still keep this bit set for the + * lifecycle of this mm, just for simplicity. + */ +#define MMF_HAS_PINNED 28 /* FOLL_PIN has run, never cleared */ #define MMF_DISABLE_THP_MASK (1 << MMF_DISABLE_THP) #define MMF_INIT_MASK (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\ -- cgit v1.2.3 From 63d8620ecf93b5d8d0a254471184d08f8e8f538d Mon Sep 17 00:00:00 2001 From: Miaohe Lin Date: Mon, 28 Jun 2021 19:36:46 -0700 Subject: mm/swapfile: use percpu_ref to serialize against concurrent swapoff Patch series "close various race windows for swap", v6. When I was investigating the swap code, I found some possible race windows. This series aims to fix all these races. But using current get/put_swap_device() to guard against concurrent swapoff for swap_readpage() looks terrible because swap_readpage() may take really long time. And to reduce the performance overhead on the hot-path as much as possible, it appears we can use the percpu_ref to close this race window(as suggested by Huang, Ying). The patch 1 adds percpu_ref support for swap and most of the remaining patches try to use this to close various race windows. More details can be found in the respective changelogs. This patch (of 4): Using current get/put_swap_device() to guard against concurrent swapoff for some swap ops, e.g. swap_readpage(), looks terrible because they might take really long time. This patch adds the percpu_ref support to serialize against concurrent swapoff(as suggested by Huang, Ying). Also we remove the SWP_VALID flag because it's used together with RCU solution. Link: https://lkml.kernel.org/r/20210426123316.806267-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20210426123316.806267-2-linmiaohe@huawei.com Signed-off-by: Miaohe Lin Reviewed-by: "Huang, Ying" Cc: Alex Shi Cc: David Hildenbrand Cc: Dennis Zhou Cc: Hugh Dickins Cc: Johannes Weiner Cc: Joonsoo Kim Cc: Matthew Wilcox Cc: Michal Hocko Cc: Minchan Kim Cc: Tim Chen Cc: Wei Yang Cc: Yang Shi Cc: Yu Zhao Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/swap.h | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) (limited to 'include/linux') diff --git a/include/linux/swap.h b/include/linux/swap.h index 144727041e78..c9e7fea10b83 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -177,7 +177,6 @@ enum { SWP_PAGE_DISCARD = (1 << 10), /* freed swap page-cluster discards */ SWP_STABLE_WRITES = (1 << 11), /* no overwrite PG_writeback pages */ SWP_SYNCHRONOUS_IO = (1 << 12), /* synchronous IO is efficient */ - SWP_VALID = (1 << 13), /* swap is valid to be operated on? */ /* add others here before... */ SWP_SCANNING = (1 << 14), /* refcount in scan_swap_map */ }; @@ -240,6 +239,7 @@ struct swap_cluster_list { * The in-memory structure used to track swap areas. */ struct swap_info_struct { + struct percpu_ref users; /* indicate and keep swap device valid. */ unsigned long flags; /* SWP_USED etc: see above */ signed short prio; /* swap priority of this type */ struct plist_node list; /* entry in swap_active_head */ @@ -260,6 +260,7 @@ struct swap_info_struct { struct block_device *bdev; /* swap device or bdev of swap file */ struct file *swap_file; /* seldom referenced */ unsigned int old_block_size; /* seldom referenced */ + struct completion comp; /* seldom referenced */ #ifdef CONFIG_FRONTSWAP unsigned long *frontswap_map; /* frontswap in-use, one bit per page */ atomic_t frontswap_pages; /* frontswap pages in-use counter */ @@ -511,7 +512,7 @@ sector_t swap_page_sector(struct page *page); static inline void put_swap_device(struct swap_info_struct *si) { - rcu_read_unlock(); + percpu_ref_put(&si->users); } #else /* CONFIG_SWAP */ -- cgit v1.2.3 From 2799e77529c2a25492a4395db93996e3dacd762d Mon Sep 17 00:00:00 2001 From: Miaohe Lin Date: Mon, 28 Jun 2021 19:36:50 -0700 Subject: swap: fix do_swap_page() race with swapoff When I was investigating the swap code, I found the below possible race window: CPU 1 CPU 2 ----- ----- do_swap_page if (data_race(si->flags & SWP_SYNCHRONOUS_IO) swap_readpage if (data_race(sis->flags & SWP_FS_OPS)) { swapoff .. p->swap_file = NULL; .. struct file *swap_file = sis->swap_file; struct address_space *mapping = swap_file->f_mapping;[oops!] Note that for the pages that are swapped in through swap cache, this isn't an issue. Because the page is locked, and the swap entry will be marked with SWAP_HAS_CACHE, so swapoff() can not proceed until the page has been unlocked. Fix this race by using get/put_swap_device() to guard against concurrent swapoff. Link: https://lkml.kernel.org/r/20210426123316.806267-3-linmiaohe@huawei.com Fixes: 0bcac06f27d7 ("mm,swap: skip swapcache for swapin of synchronous device") Signed-off-by: Miaohe Lin Reviewed-by: "Huang, Ying" Cc: Alex Shi Cc: David Hildenbrand Cc: Dennis Zhou Cc: Hugh Dickins Cc: Johannes Weiner Cc: Joonsoo Kim Cc: Matthew Wilcox Cc: Michal Hocko Cc: Minchan Kim Cc: Tim Chen Cc: Wei Yang Cc: Yang Shi Cc: Yu Zhao Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/swap.h | 9 +++++++++ 1 file changed, 9 insertions(+) (limited to 'include/linux') diff --git a/include/linux/swap.h b/include/linux/swap.h index c9e7fea10b83..46d51d058d05 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -527,6 +527,15 @@ static inline struct swap_info_struct *swp_swap_info(swp_entry_t entry) return NULL; } +static inline struct swap_info_struct *get_swap_device(swp_entry_t entry) +{ + return NULL; +} + +static inline void put_swap_device(struct swap_info_struct *si) +{ +} + #define swap_address_space(entry) (NULL) #define get_nr_swap_pages() 0L #define total_swap_pages 0L -- cgit v1.2.3 From f4c4a3f48480730214c4f02ffa480f6bf5b0718f Mon Sep 17 00:00:00 2001 From: Huang Ying Date: Mon, 28 Jun 2021 19:37:12 -0700 Subject: mm: free idle swap cache page after COW With commit 09854ba94c6a ("mm: do_wp_page() simplification"), after COW, the idle swap cache page (neither the page nor the corresponding swap entry is mapped by any process) will be left in the LRU list, even if it's in the active list or the head of the inactive list. So, the page reclaimer may take quite some overhead to reclaim these actually unused pages. To help the page reclaiming, in this patch, after COW, the idle swap cache page will be tried to be freed. To avoid to introduce much overhead to the hot COW code path, a) there's almost zero overhead for non-swap case via checking PageSwapCache() firstly. b) the page lock is acquired via trylock only. To test the patch, we used pmbench memory accessing benchmark with working-set larger than available memory on a 2-socket Intel server with a NVMe SSD as swap device. Test results shows that the pmbench score increases up to 23.8% with the decreased size of swap cache and swapin throughput. Link: https://lkml.kernel.org/r/20210601053143.1380078-1-ying.huang@intel.com Signed-off-by: "Huang, Ying" Suggested-by: Johannes Weiner [use free_swap_cache()] Acked-by: Johannes Weiner Cc: Hugh Dickins Cc: Matthew Wilcox Cc: Peter Xu Cc: Mel Gorman Cc: Rik van Riel Cc: Andrea Arcangeli Cc: Michal Hocko Cc: Dave Hansen Cc: Tim Chen Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/swap.h | 5 +++++ 1 file changed, 5 insertions(+) (limited to 'include/linux') diff --git a/include/linux/swap.h b/include/linux/swap.h index 46d51d058d05..49b1dd2c100b 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -446,6 +446,7 @@ extern void __delete_from_swap_cache(struct page *page, extern void delete_from_swap_cache(struct page *); extern void clear_shadow_from_swap_cache(int type, unsigned long begin, unsigned long end); +extern void free_swap_cache(struct page *); extern void free_page_and_swap_cache(struct page *); extern void free_pages_and_swap_cache(struct page **, int); extern struct page *lookup_swap_cache(swp_entry_t entry, @@ -551,6 +552,10 @@ static inline void put_swap_device(struct swap_info_struct *si) #define free_pages_and_swap_cache(pages, nr) \ release_pages((pages), (nr)); +static inline void free_swap_cache(struct page *page) +{ +} + static inline void show_swap_cache_info(void) { } -- cgit v1.2.3 From 494c1dfe855ec1f70f89552fce5eadf4a1717552 Mon Sep 17 00:00:00 2001 From: Waiman Long Date: Mon, 28 Jun 2021 19:37:38 -0700 Subject: mm: memcg/slab: create a new set of kmalloc-cg- caches There are currently two problems in the way the objcg pointer array (memcg_data) in the page structure is being allocated and freed. On its allocation, it is possible that the allocated objcg pointer array comes from the same slab that requires memory accounting. If this happens, the slab will never become empty again as there is at least one object left (the obj_cgroup array) in the slab. When it is freed, the objcg pointer array object may be the last one in its slab and hence causes kfree() to be called again. With the right workload, the slab cache may be set up in a way that allows the recursive kfree() calling loop to nest deep enough to cause a kernel stack overflow and panic the system. One way to solve this problem is to split the kmalloc- caches (KMALLOC_NORMAL) into two separate sets - a new set of kmalloc- (KMALLOC_NORMAL) caches for unaccounted objects only and a new set of kmalloc-cg- (KMALLOC_CGROUP) caches for accounted objects only. All the other caches can still allow a mix of accounted and unaccounted objects. With this change, all the objcg pointer array objects will come from KMALLOC_NORMAL caches which won't have their objcg pointer arrays. So both the recursive kfree() problem and non-freeable slab problem are gone. Since both the KMALLOC_NORMAL and KMALLOC_CGROUP caches no longer have mixed accounted and unaccounted objects, this will slightly reduce the number of objcg pointer arrays that need to be allocated and save a bit of memory. On the other hand, creating a new set of kmalloc caches does have the effect of reducing cache utilization. So it is properly a wash. The new KMALLOC_CGROUP is added between KMALLOC_NORMAL and KMALLOC_RECLAIM so that the first for loop in create_kmalloc_caches() will include the newly added caches without change. [vbabka@suse.cz: don't create kmalloc-cg caches with cgroup.memory=nokmem] Link: https://lkml.kernel.org/r/20210512145107.6208-1-longman@redhat.com [akpm@linux-foundation.org: un-fat-finger v5 delta creation] [longman@redhat.com: disable cache merging for KMALLOC_NORMAL caches] Link: https://lkml.kernel.org/r/20210505200610.13943-4-longman@redhat.com Link: https://lkml.kernel.org/r/20210512145107.6208-1-longman@redhat.com Link: https://lkml.kernel.org/r/20210505200610.13943-3-longman@redhat.com Signed-off-by: Waiman Long Signed-off-by: Vlastimil Babka Suggested-by: Vlastimil Babka Reviewed-by: Shakeel Butt Acked-by: Roman Gushchin Cc: Christoph Lameter Cc: David Rientjes Cc: Johannes Weiner Cc: Joonsoo Kim Cc: Michal Hocko Cc: Pekka Enberg Cc: Vladimir Davydov [longman@redhat.com: fix for CONFIG_ZONE_DMA=n] Suggested-by: Roman Gushchin Reviewed-by: Vlastimil Babka Cc: Vladimir Davydov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/slab.h | 42 +++++++++++++++++++++++++++++++++--------- 1 file changed, 33 insertions(+), 9 deletions(-) (limited to 'include/linux') diff --git a/include/linux/slab.h b/include/linux/slab.h index bc9ab3a5a017..083f3ce550bc 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -305,9 +305,21 @@ static inline void __check_heap_object(const void *ptr, unsigned long n, /* * Whenever changing this, take care of that kmalloc_type() and * create_kmalloc_caches() still work as intended. + * + * KMALLOC_NORMAL can contain only unaccounted objects whereas KMALLOC_CGROUP + * is for accounted but unreclaimable and non-dma objects. All the other + * kmem caches can have both accounted and unaccounted objects. */ enum kmalloc_cache_type { KMALLOC_NORMAL = 0, +#ifndef CONFIG_ZONE_DMA + KMALLOC_DMA = KMALLOC_NORMAL, +#endif +#ifndef CONFIG_MEMCG_KMEM + KMALLOC_CGROUP = KMALLOC_NORMAL, +#else + KMALLOC_CGROUP, +#endif KMALLOC_RECLAIM, #ifdef CONFIG_ZONE_DMA KMALLOC_DMA, @@ -319,24 +331,36 @@ enum kmalloc_cache_type { extern struct kmem_cache * kmalloc_caches[NR_KMALLOC_TYPES][KMALLOC_SHIFT_HIGH + 1]; +/* + * Define gfp bits that should not be set for KMALLOC_NORMAL. + */ +#define KMALLOC_NOT_NORMAL_BITS \ + (__GFP_RECLAIMABLE | \ + (IS_ENABLED(CONFIG_ZONE_DMA) ? __GFP_DMA : 0) | \ + (IS_ENABLED(CONFIG_MEMCG_KMEM) ? __GFP_ACCOUNT : 0)) + static __always_inline enum kmalloc_cache_type kmalloc_type(gfp_t flags) { -#ifdef CONFIG_ZONE_DMA /* * The most common case is KMALLOC_NORMAL, so test for it - * with a single branch for both flags. + * with a single branch for all the relevant flags. */ - if (likely((flags & (__GFP_DMA | __GFP_RECLAIMABLE)) == 0)) + if (likely((flags & KMALLOC_NOT_NORMAL_BITS) == 0)) return KMALLOC_NORMAL; /* - * At least one of the flags has to be set. If both are, __GFP_DMA - * is more important. + * At least one of the flags has to be set. Their priorities in + * decreasing order are: + * 1) __GFP_DMA + * 2) __GFP_RECLAIMABLE + * 3) __GFP_ACCOUNT */ - return flags & __GFP_DMA ? KMALLOC_DMA : KMALLOC_RECLAIM; -#else - return flags & __GFP_RECLAIMABLE ? KMALLOC_RECLAIM : KMALLOC_NORMAL; -#endif + if (IS_ENABLED(CONFIG_ZONE_DMA) && (flags & __GFP_DMA)) + return KMALLOC_DMA; + if (!IS_ENABLED(CONFIG_MEMCG_KMEM) || (flags & __GFP_RECLAIMABLE)) + return KMALLOC_RECLAIM; + else + return KMALLOC_CGROUP; } /* -- cgit v1.2.3 From a984226f457f849eb9c4ce727eeaa3b5080597d8 Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Mon, 28 Jun 2021 19:37:53 -0700 Subject: mm: memcontrol: remove the pgdata parameter of mem_cgroup_page_lruvec All the callers of mem_cgroup_page_lruvec() just pass page_pgdat(page) as the 2nd parameter to it (except isolate_migratepages_block()). But for isolate_migratepages_block(), the page_pgdat(page) is also equal to the local variable of @pgdat. So mem_cgroup_page_lruvec() do not need the pgdat parameter. Just remove it to simplify the code. Link: https://lkml.kernel.org/r/20210417043538.9793-4-songmuchun@bytedance.com Signed-off-by: Muchun Song Acked-by: Johannes Weiner Reviewed-by: Shakeel Butt Acked-by: Roman Gushchin Acked-by: Michal Hocko Cc: Vladimir Davydov Cc: Xiongchun Duan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memcontrol.h | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) (limited to 'include/linux') diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index c193be760709..f2a5aaba3577 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -743,13 +743,12 @@ out: /** * mem_cgroup_page_lruvec - return lruvec for isolating/putting an LRU page * @page: the page - * @pgdat: pgdat of the page * * This function relies on page->mem_cgroup being stable. */ -static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page, - struct pglist_data *pgdat) +static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page) { + pg_data_t *pgdat = page_pgdat(page); struct mem_cgroup *memcg = page_memcg(page); VM_WARN_ON_ONCE_PAGE(!memcg && !mem_cgroup_disabled(), page); @@ -1221,9 +1220,10 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg, return &pgdat->__lruvec; } -static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page, - struct pglist_data *pgdat) +static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page) { + pg_data_t *pgdat = page_pgdat(page); + return &pgdat->__lruvec; } -- cgit v1.2.3 From f2e4d28dd9f6478dd54d47b91edc3fe62c019968 Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Mon, 28 Jun 2021 19:37:56 -0700 Subject: mm: memcontrol: simplify lruvec_holds_page_lru_lock We already have a helper lruvec_memcg() to get the memcg from lruvec, we do not need to do it ourselves in the lruvec_holds_page_lru_lock(). So use lruvec_memcg() instead. And if mem_cgroup_disabled() returns false, the page_memcg(page) (the LRU pages) cannot be NULL. So remove the odd logic of "memcg = page_memcg(page) ? : root_mem_cgroup". And use lruvec_pgdat to simplify the code. We can have a single definition for this function that works for !CONFIG_MEMCG, CONFIG_MEMCG + mem_cgroup_disabled() and CONFIG_MEMCG. Link: https://lkml.kernel.org/r/20210417043538.9793-5-songmuchun@bytedance.com Signed-off-by: Muchun Song Acked-by: Johannes Weiner Reviewed-by: Shakeel Butt Acked-by: Roman Gushchin Acked-by: Michal Hocko Cc: Vladimir Davydov Cc: Xiongchun Duan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memcontrol.h | 31 +++++++------------------------ 1 file changed, 7 insertions(+), 24 deletions(-) (limited to 'include/linux') diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index f2a5aaba3577..2fc728492c9b 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -755,22 +755,6 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page) return mem_cgroup_lruvec(memcg, pgdat); } -static inline bool lruvec_holds_page_lru_lock(struct page *page, - struct lruvec *lruvec) -{ - pg_data_t *pgdat = page_pgdat(page); - const struct mem_cgroup *memcg; - struct mem_cgroup_per_node *mz; - - if (mem_cgroup_disabled()) - return lruvec == &pgdat->__lruvec; - - mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec); - memcg = page_memcg(page) ? : root_mem_cgroup; - - return lruvec->pgdat == pgdat && mz->memcg == memcg; -} - struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p); struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm); @@ -1227,14 +1211,6 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page) return &pgdat->__lruvec; } -static inline bool lruvec_holds_page_lru_lock(struct page *page, - struct lruvec *lruvec) -{ - pg_data_t *pgdat = page_pgdat(page); - - return lruvec == &pgdat->__lruvec; -} - static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page) { } @@ -1516,6 +1492,13 @@ static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec, spin_unlock_irqrestore(&lruvec->lru_lock, flags); } +static inline bool lruvec_holds_page_lru_lock(struct page *page, + struct lruvec *lruvec) +{ + return lruvec_pgdat(lruvec) == page_pgdat(page) && + lruvec_memcg(lruvec) == page_memcg(page); +} + /* Don't lock again iff page's lruvec locked */ static inline struct lruvec *relock_page_lruvec_irq(struct page *page, struct lruvec *locked_lruvec) -- cgit v1.2.3 From 7467c39128bda1d58af08aaeb0c7ba54d0ec87ae Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Mon, 28 Jun 2021 19:37:59 -0700 Subject: mm: memcontrol: rename lruvec_holds_page_lru_lock to page_matches_lruvec lruvec_holds_page_lru_lock() doesn't check anything about locking and is used to check whether the page belongs to the lruvec. So rename it to page_matches_lruvec(). Link: https://lkml.kernel.org/r/20210417043538.9793-6-songmuchun@bytedance.com Signed-off-by: Muchun Song Acked-by: Michal Hocko Acked-by: Johannes Weiner Reviewed-by: Shakeel Butt Cc: Roman Gushchin Cc: Vladimir Davydov Cc: Xiongchun Duan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memcontrol.h | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) (limited to 'include/linux') diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 2fc728492c9b..0ce97eff79e2 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -1492,8 +1492,8 @@ static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec, spin_unlock_irqrestore(&lruvec->lru_lock, flags); } -static inline bool lruvec_holds_page_lru_lock(struct page *page, - struct lruvec *lruvec) +/* Test requires a stable page->memcg binding, see page_memcg() */ +static inline bool page_matches_lruvec(struct page *page, struct lruvec *lruvec) { return lruvec_pgdat(lruvec) == page_pgdat(page) && lruvec_memcg(lruvec) == page_memcg(page); @@ -1504,7 +1504,7 @@ static inline struct lruvec *relock_page_lruvec_irq(struct page *page, struct lruvec *locked_lruvec) { if (locked_lruvec) { - if (lruvec_holds_page_lru_lock(page, locked_lruvec)) + if (page_matches_lruvec(page, locked_lruvec)) return locked_lruvec; unlock_page_lruvec_irq(locked_lruvec); @@ -1518,7 +1518,7 @@ static inline struct lruvec *relock_page_lruvec_irqsave(struct page *page, struct lruvec *locked_lruvec, unsigned long *flags) { if (locked_lruvec) { - if (lruvec_holds_page_lru_lock(page, locked_lruvec)) + if (page_matches_lruvec(page, locked_lruvec)) return locked_lruvec; unlock_page_lruvec_irqrestore(locked_lruvec, *flags); -- cgit v1.2.3 From b51478a0b3c7040bfcadf6e2e04df5ddde59fd98 Mon Sep 17 00:00:00 2001 From: wenhuizhang Date: Mon, 28 Jun 2021 19:38:12 -0700 Subject: memcontrol: use flexible-array member Change deprecated zero-length-and-one-element-arrays into flexible array member.Zero-length and one-element arrays detected by Lukas's CodeChecker. Zero/one element arrays cause undefined behaviours if sizeof() used. Link: https://lkml.kernel.org/r/20210518200910.29912-1-wenhui@gwmail.gwu.edu Signed-off-by: wenhuizhang Reviewed-by: Muchun Song Acked-by: Michal Hocko Cc: Shakeel Butt Cc: Johannes Weiner Cc: Roman Gushchin Cc: Muchun Song Cc: Yang Shi Cc: Alex Shi Cc: Alexander Duyck Cc: Wei Yang Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memcontrol.h | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) (limited to 'include/linux') diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 0ce97eff79e2..3cc18c2176e7 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -349,8 +349,7 @@ struct mem_cgroup { struct deferred_split deferred_split_queue; #endif - struct mem_cgroup_per_node *nodeinfo[0]; - /* WARNING: nodeinfo must be the last member here */ + struct mem_cgroup_per_node *nodeinfo[]; }; /* -- cgit v1.2.3 From c74d40e8b5e2ac5eee1ca45b12d3e174915f1d88 Mon Sep 17 00:00:00 2001 From: Dan Schatzberg Date: Mon, 28 Jun 2021 19:38:21 -0700 Subject: loop: charge i/o to mem and blk cg The current code only associates with the existing blkcg when aio is used to access the backing file. This patch covers all types of i/o to the backing file and also associates the memcg so if the backing file is on tmpfs, memory is charged appropriately. This patch also exports cgroup_get_e_css and int_active_memcg so it can be used by the loop module. Link: https://lkml.kernel.org/r/20210610173944.1203706-4-schatzberg.dan@gmail.com Signed-off-by: Dan Schatzberg Acked-by: Johannes Weiner Acked-by: Jens Axboe Cc: Chris Down Cc: Michal Hocko Cc: Ming Lei Cc: Shakeel Butt Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memcontrol.h | 6 ++++++ 1 file changed, 6 insertions(+) (limited to 'include/linux') diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 3cc18c2176e7..1de3859233a6 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -1230,6 +1230,12 @@ static inline struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm) return NULL; } +static inline +struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css) +{ + return NULL; +} + static inline void mem_cgroup_put(struct mem_cgroup *memcg) { } -- cgit v1.2.3 From 6a1803bb582c50909a7f6cc4153360eaf5ae8fc8 Mon Sep 17 00:00:00 2001 From: Huilong Deng Date: Mon, 28 Jun 2021 19:38:24 -0700 Subject: mm: memcontrol: remove trailing semicolon in macros Macros should not use a trailing semicolon. Link: https://lkml.kernel.org/r/20210614091530.22117-1-denghuilong@cdjrlc.com Signed-off-by: Huilong Deng Reviewed-by: Andrew Morton Reviewed-by: Shakeel Butt Cc: Roman Gushchin Cc: Michal Hocko Cc: Johannes Weiner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memcontrol.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'include/linux') diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 1de3859233a6..6d66037be646 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -192,7 +192,7 @@ enum memcg_kmem_state { struct memcg_padding { char x[0]; } ____cacheline_internodealigned_in_smp; -#define MEMCG_PADDING(name) struct memcg_padding name; +#define MEMCG_PADDING(name) struct memcg_padding name #else #define MEMCG_PADDING(name) #endif -- cgit v1.2.3 From 3b8db39fad98cbb1d36e079236a446fad710daea Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Mon, 28 Jun 2021 19:38:35 -0700 Subject: mm: ignore MAP_EXECUTABLE in ksys_mmap_pgoff() Let's also remove masking off MAP_EXECUTABLE from ksys_mmap_pgoff(): the last in-tree occurrence of MAP_EXECUTABLE is now in LEGACY_MAP_MASK, which accepts the flag e.g., for MAP_SHARED_VALIDATE; however, the flag is ignored throughout the kernel now. Add a comment to LEGACY_MAP_MASK stating that MAP_EXECUTABLE is ignored. Link: https://lkml.kernel.org/r/20210421093453.6904-4-david@redhat.com Signed-off-by: David Hildenbrand Acked-by: "Eric W. Biederman" Reviewed-by: Kees Cook Cc: Alexander Shishkin Cc: Alexander Viro Cc: Arnaldo Carvalho de Melo Cc: Borislav Petkov Cc: Catalin Marinas Cc: Don Zickus Cc: Feng Tang Cc: Greg Ungerer Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Jiri Olsa Cc: Kevin Brodsky Cc: Mark Rutland Cc: Michal Hocko