summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
-rw-r--r--Documentation/admin-guide/cgroup-v1/memcg_test.rst15
-rw-r--r--Documentation/admin-guide/cgroup-v1/memory.rst21
-rw-r--r--Documentation/trace/events-kmem.rst2
-rw-r--r--Documentation/vm/unevictable-lru.rst22
-rw-r--r--include/linux/memcontrol.h110
-rw-r--r--include/linux/mm_types.h2
-rw-r--r--include/linux/mmzone.h6
-rw-r--r--include/linux/page-flags.h1
-rw-r--r--include/linux/swap.h4
-rw-r--r--mm/compaction.c94
-rw-r--r--mm/filemap.c4
-rw-r--r--mm/huge_memory.c45
-rw-r--r--mm/memcontrol.c84
-rw-r--r--mm/mlock.c63
-rw-r--r--mm/mmzone.c1
-rw-r--r--mm/page_alloc.c1
-rw-r--r--mm/page_idle.c4
-rw-r--r--mm/rmap.c12
-rw-r--r--mm/swap.c208
-rw-r--r--mm/vmscan.c207
-rw-r--r--mm/workingset.c2
21 files changed, 536 insertions, 372 deletions
diff --git a/Documentation/admin-guide/cgroup-v1/memcg_test.rst b/Documentation/admin-guide/cgroup-v1/memcg_test.rst
index 4f83de2dab6e..45b94f7b3beb 100644
--- a/Documentation/admin-guide/cgroup-v1/memcg_test.rst
+++ b/Documentation/admin-guide/cgroup-v1/memcg_test.rst
@@ -133,18 +133,9 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
8. LRU
======
- Each memcg has its own private LRU. Now, its handling is under global
- VM's control (means that it's handled under global pgdat->lru_lock).
- Almost all routines around memcg's LRU is called by global LRU's
- list management functions under pgdat->lru_lock.
-
- A special function is mem_cgroup_isolate_pages(). This scans
- memcg's private LRU and call __isolate_lru_page() to extract a page
- from LRU.
-
- (By __isolate_lru_page(), the page is removed from both of global and
- private LRU.)
-
+ Each memcg has its own vector of LRUs (inactive anon, active anon,
+ inactive file, active file, unevictable) of pages from each node,
+ each LRU handled under a single lru_lock for that memcg and node.
9. Typical Tests.
=================
diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
index a44cd467d218..52688ae34461 100644
--- a/Documentation/admin-guide/cgroup-v1/memory.rst
+++ b/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -287,20 +287,17 @@ When oom event notifier is registered, event will be delivered.
2.6 Locking
-----------
- lock_page_cgroup()/unlock_page_cgroup() should not be called under
- the i_pages lock.
+Lock order is as follows:
- Other lock order is following:
+ Page lock (PG_locked bit of page->flags)
+ mm->page_table_lock or split pte_lock
+ lock_page_memcg (memcg->move_lock)
+ mapping->i_pages lock
+ lruvec->lru_lock.
- PG_locked.
- mm->page_table_lock
- pgdat->lru_lock
- lock_page_cgroup.
-
- In many cases, just lock_page_cgroup() is called.
-
- per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
- pgdat->lru_lock, it has no lock of its own.
+Per-node-per-memcgroup LRU (cgroup's private LRU) is guarded by
+lruvec->lru_lock; PG_lru bit of page->flags is cleared before
+isolating a page from its LRU under lruvec->lru_lock.
2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
-----------------------------------------------
diff --git a/Documentation/trace/events-kmem.rst b/Documentation/trace/events-kmem.rst
index 555484110e36..68fa75247488 100644
--- a/Documentation/trace/events-kmem.rst
+++ b/Documentation/trace/events-kmem.rst
@@ -69,7 +69,7 @@ When pages are freed in batch, the also mm_page_free_batched is triggered.
Broadly speaking, pages are taken off the LRU lock in bulk and
freed in batch with a page list. Significant amounts of activity here could
indicate that the system is under memory pressure and can also indicate
-contention on the zone->lru_lock.
+contention on the lruvec->lru_lock.
4. Per-CPU Allocator Activity
=============================
diff --git a/Documentation/vm/unevictable-lru.rst b/Documentation/vm/unevictable-lru.rst
index 17d0861b0f1d..0e1490524f53 100644
--- a/Documentation/vm/unevictable-lru.rst
+++ b/Documentation/vm/unevictable-lru.rst
@@ -33,7 +33,7 @@ reclaim in Linux. The problems have been observed at customer sites on large
memory x86_64 systems.
To illustrate this with an example, a non-NUMA x86_64 platform with 128GB of
-main memory will have over 32 million 4k pages in a single zone. When a large
+main memory will have over 32 million 4k pages in a single node. When a large
fraction of these pages are not evictable for any reason [see below], vmscan
will spend a lot of time scanning the LRU lists looking for the small fraction
of pages that are evictable. This can result in a situation where all CPUs are
@@ -55,7 +55,7 @@ unevictable, either by definition or by circumstance, in the future.
The Unevictable Page List
-------------------------
-The Unevictable LRU infrastructure consists of an additional, per-zone, LRU list
+The Unevictable LRU infrastructure consists of an additional, per-node, LRU list
called the "unevictable" list and an associated page flag, PG_unevictable, to
indicate that the page is being managed on the unevictable list.
@@ -84,15 +84,9 @@ The unevictable list does not differentiate between file-backed and anonymous,
swap-backed pages. This differentiation is only important while the pages are,
in fact, evictable.
-The unevictable list benefits from the "arrayification" of the per-zone LRU
+The unevictable list benefits from the "arrayification" of the per-node LRU
lists and statistics originally proposed and posted by Christoph Lameter.
-The unevictable list does not use the LRU pagevec mechanism. Rather,
-unevictable pages are placed directly on the page's zone's unevictable list
-under the zone lru_lock. This allows us to prevent the stranding of pages on
-the unevictable list when one task has the page isolated from the LRU and other
-tasks are changing the "evictability" state of the page.
-
Memory Control Group Interaction
--------------------------------
@@ -101,8 +95,8 @@ The unevictable LRU facility interacts with the memory control group [aka
memory controller; see Documentation/admin-guide/cgroup-v1/memory.rst] by extending the
lru_list enum.
-The memory controller data structure automatically gets a per-zone unevictable
-list as a result of the "arrayification" of the per-zone LRU lists (one per
+The memory controller data structure automatically gets a per-node unevictable
+list as a result of the "arrayification" of the per-node LRU lists (one per
lru_list enum element). The memory controller tracks the movement of pages to
and from the unevictable list.
@@ -196,7 +190,7 @@ for the sake of expediency, to leave a unevictable page on one of the regular
active/inactive LRU lists for vmscan to deal with. vmscan checks for such
pages in all of the shrink_{active|inactive|page}_list() functions and will
"cull" such pages that it encounters: that is, it diverts those pages to the
-unevictable list for the zone being scanned.
+unevictable list for the node being scanned.
There may be situations where a page is mapped into a VM_LOCKED VMA, but the
page is not marked as PG_mlocked. Such pages will make it all the way to
@@ -328,7 +322,7 @@ If the page was NOT already mlocked, mlock_vma_page() attempts to isolate the
page from the LRU, as it is likely on the appropriate active or inactive list
at that time. If the isolate_lru_page() succeeds, mlock_vma_page() will put
back the page - by calling putback_lru_page() - which will notice that the page
-is now mlocked and divert the page to the zone's unevictable list. If
+is now mlocked and divert the page to the node's unevictable list. If
mlock_vma_page() is unable to isolate the page from the LRU, vmscan will handle
it later if and when it attempts to reclaim the page.
@@ -603,7 +597,7 @@ Some examples of these unevictable pages on the LRU lists are:
unevictable list in mlock_vma_page().
shrink_inactive_list() also diverts any unevictable pages that it finds on the
-inactive lists to the appropriate zone's unevictable list.
+inactive lists to the appropriate node's unevictable list.
shrink_inactive_list() should only see SHM_LOCK'd pages that became SHM_LOCK'd
after shrink_active_list() had moved them to the inactive list, or pages mapped
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f5b4d710f099..08ed57e02b73 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -654,12 +654,41 @@ out:
struct lruvec *mem_cgroup_page_lruvec(struct page *, struct pglist_data *);
+static inline bool lruvec_holds_page_lru_lock(struct page *page,
+ struct lruvec *lruvec)
+{
+ pg_data_t *pgdat = page_pgdat(page);
+ const struct mem_cgroup *memcg;
+ struct mem_cgroup_per_node *mz;
+
+ if (mem_cgroup_disabled())
+ return lruvec == &pgdat->__lruvec;
+
+ mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
+ memcg = page_memcg(page) ? : root_mem_cgroup;
+
+ return lruvec->pgdat == pgdat && mz->memcg == memcg;
+}
+
struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm);
struct mem_cgroup *get_mem_cgroup_from_page(struct page *page);
+struct lruvec *lock_page_lruvec(struct page *page);
+struct lruvec *lock_page_lruvec_irq(struct page *page);
+struct lruvec *lock_page_lruvec_irqsave(struct page *page,
+ unsigned long *flags);
+
+#ifdef CONFIG_DEBUG_VM
+void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page);
+#else
+static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+{
+}
+#endif
+
static inline
struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
return css ? container_of(css, struct mem_cgroup, css) : NULL;
@@ -1167,6 +1196,14 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
return &pgdat->__lruvec;
}
+static inline bool lruvec_holds_page_lru_lock(struct page *page,
+ struct lruvec *lruvec)
+{
+ pg_data_t *pgdat = page_pgdat(page);
+
+ return lruvec == &pgdat->__lruvec;
+}
+
static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
{
return NULL;
@@ -1192,6 +1229,31 @@ static inline void mem_cgroup_put(struct mem_cgroup *memcg)
{
}
+static inline struct lruvec *lock_page_lruvec(struct page *page)
+{
+ struct pglist_data *pgdat = page_pgdat(page);
+
+ spin_lock(&pgdat->__lruvec.lru_lock);
+ return &pgdat->__lruvec;
+}
+
+static inline struct lruvec *lock_page_lruvec_irq(struct page *page)
+{
+ struct pglist_data *pgdat = page_pgdat(page);
+
+ spin_lock_irq(&pgdat->__lruvec.lru_lock);
+ return &pgdat->__lruvec;
+}
+
+static inline struct lruvec *lock_page_lruvec_irqsave(struct page *page,
+ unsigned long *flagsp)
+{
+ struct pglist_data *pgdat = page_pgdat(page);
+
+ spin_lock_irqsave(&pgdat->__lruvec.lru_lock, *flagsp);
+ return &pgdat->__lruvec;
+}
+
static inline struct mem_cgroup *
mem_cgroup_iter(struct mem_cgroup *root,
struct mem_cgroup *prev,
@@ -1411,6 +1473,10 @@ static inline
void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
{
}
+
+static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+{
+}
#endif /* CONFIG_MEMCG */
/* idx can be of type enum memcg_stat_item or node_stat_item */
@@ -1492,6 +1558,50 @@ static inline struct lruvec *parent_lruvec(struct lruvec *lruvec)
return mem_cgroup_lruvec(memcg, lruvec_pgdat(lruvec));
}
+static inline void unlock_page_lruvec(struct lruvec *lruvec)
+{
+ spin_unlock(&lruvec->lru_lock);
+}
+
+static inline void unlock_page_lruvec_irq(struct lruvec *lruvec)
+{
+ spin_unlock_irq(&lruvec->lru_lock);
+}
+
+static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec,
+ unsigned long flags)
+{
+ spin_unlock_irqrestore(&lruvec->lru_lock, flags);
+}
+
+/* Don't lock again iff page's lruvec locked */
+static inline struct lruvec *relock_page_lruvec_irq(struct page *page,
+ struct lruvec *locked_lruvec)
+{
+ if (locked_lruvec) {
+ if (lruvec_holds_page_lru_lock(page, locked_lruvec))
+ return locked_lruvec;
+
+ unlock_page_lruvec_irq(locked_lruvec);
+ }
+
+ return lock_page_lruvec_irq(page);
+}
+
+/* Don't lock again iff page's lruvec locked */
+static inline struct lruvec *relock_page_lruvec_irqsave(struct page *page,
+ struct lruvec *locked_lruvec, unsigned long *flags)
+{
+ if (locked_lruvec) {
+ if (lruvec_holds_page_lru_lock(page, locked_lruvec))
+ return locked_lruvec;
+
+ unlock_page_lruvec_irqrestore(locked_lruvec, *flags);
+ }
+
+ return lock_page_lruvec_irqsave(page, flags);
+}
+
#ifdef CONFIG_CGROUP_WRITEBACK
struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index e7de072ade03..07d9acb5b19c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -79,7 +79,7 @@ struct page {
struct { /* Page cache and anonymous pages */
/**
* @lru: Pageout list, eg. active_list protected by
- * pgdat->lru_lock. Sometimes used as a generic list
+ * lruvec->lru_lock. Sometimes used as a generic list
* by the page owner.
*/
struct list_head lru;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 98a80c01d150..b593316bff3d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -113,8 +113,7 @@ static inline bool free_area_empty(struct free_area *area, int migratetype)
struct pglist_data;
/*
- * zone->lock and the zone lru_lock are two of the hottest locks in the kernel.
- * So add a wild amount of padding here to ensure that they fall into separate
+ * Add a wild amount of padding here to ensure datas fall into separate
* cachelines. There are very few zone structures in the machine, so space
* consumption is not a concern here.
*/
@@ -276,6 +275,8 @@ enum lruvec_flags {
struct lruvec {
struct list_head lists[NR_LRU_LISTS];
+ /* per lruvec lru_lock for memcg */
+ spinlock_t lru_lock;
/*
* These track the cost of reclaiming one LRU - file or anon -
* over the other. As the observed cost of reclaiming one LRU
@@ -782,7 +783,6 @@ typedef struct pglist_data {
/* Write-intensive fields used by page reclaim */
ZONE_PADDING(_pad1_)
- spinlock_t lru_lock;
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
/*
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index b5eb0fc15053..ec5d0290e0ee 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -334,6 +334,7 @@ PAGEFLAG(Referenced, referenced, PF_HEAD)
PAGEFLAG(Dirty, dirty, PF_HEAD) TESTSCFLAG(Dirty, dirty, PF_HEAD)
__CLEARPAGEFLAG(Dirty, dirty, PF_HEAD)
PAGEFLAG(LRU, lru, PF_HEAD) __CLEARPAGEFLAG(LRU, lru, PF_HEAD)
+ TESTCLEARFLAG(LRU, lru, PF_HEAD)
PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, active, PF_HEAD)
TESTCLEARFLAG(Active, active, PF_HEAD)
PAGEFLAG(Workingset, workingset, PF_HEAD)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 667935c0dbd4..596bc2f4d9b0 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -338,8 +338,6 @@ extern void lru_note_cost(struct lruvec *lruvec, bool file,
unsigned int nr_pages);
extern void lru_note_cost_page(struct page *);
extern void lru_cache_add(struct page *);
-extern void lru_add_page_tail(struct page *page, struct page *page_tail,
- struct lruvec *lruvec, struct list_head *head);
extern void mark_page_accessed(struct page *);
extern void lru_add_drain(void);
extern void lru_add_drain_cpu(int cpu);
@@ -358,7 +356,7 @@ extern void lru_cache_add_inactive_or_unevictable(struct page *page,
extern unsigned long zone_reclaimable_pages(struct zone *zone);
extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *mask);
-extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
+extern int __isolate_lru_page_prepare(struct page *page, isolate_mode_t mode);
extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
unsigned long nr_pages,
gfp_t gfp_mask,
diff --git a/mm/compaction.c b/mm/compaction.c
index dbcfdfce1b82..e5acb9714436 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -804,7 +804,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
unsigned long nr_scanned = 0, nr_isolated = 0;
struct lruvec *lruvec;
unsigned long flags = 0;
- bool locked = false;
+ struct lruvec *locked = NULL;
struct page *page = NULL, *valid_page = NULL;
unsigned long start_pfn = low_pfn;
bool skip_on_failure = false;
@@ -868,11 +868,20 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
* contention, to give chance to IRQs. Abort completely if
* a fatal signal is pending.
*/
- if (!(low_pfn % SWAP_CLUSTER_MAX)
- && compact_unlock_should_abort(&pgdat->lru_lock,
- flags, &locked, cc)) {
- low_pfn = 0;
- goto fatal_pending;
+ if (!(low_pfn % SWAP_CLUSTER_MAX)) {
+ if (locked) {
+ unlock_page_lruvec_irqrestore(locked, flags);
+ locked = NULL;
+ }
+
+ if (fatal_signal_pending(current)) {
+ cc->contended = true;
+
+ low_pfn = 0;
+ goto fatal_pending;
+ }
+
+ cond_resched();
}
if (!pfn_valid_within(low_pfn))
@@ -890,6 +899,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
if (!valid_page && IS_ALIGNED(low_pfn, pageblock_nr_pages)) {
if (!cc->ignore_skip_hint && get_pageblock_skip(page)) {
low_pfn = end_pfn;
+ page = NULL;
goto isolate_abort;
}
valid_page = page;
@@ -943,9 +953,8 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
if (unlikely(__PageMovable(page)) &&
!PageIsolated(page)) {
if (locked) {
- spin_unlock_irqrestore(&pgdat->lru_lock,
- flags);
- locked = false;
+ unlock_page_lruvec_irqrestore(locked, flags);
+ locked = NULL;
}
if (!isolate_movable_page(page, isolate_mode))
@@ -971,10 +980,34 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page))
goto isolate_fail;
+ /*
+ * Be careful not to clear PageLRU until after we're
+ * sure the page is not being freed elsewhere -- the
+ * page release code relies on it.
+ */
+ if (unlikely(!get_page_unless_zero(page)))
+ goto isolate_fail;
+
+ if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
+ goto isolate_fail_put;
+
+ /* Try isolate the page */
+ if (!TestClearPageLRU(page))
+ goto isolate_fail_put;
+
+ rcu_read_lock();
+ lruvec = mem_cgroup_page_lruvec(page, pgdat);
+
/* If we already hold the lock, we can skip some rechecking */
- if (!locked) {
- locked = compact_lock_irqsave(&pgdat->lru_lock,
- &flags, cc);
+ if (lruvec != locked) {
+ if (locked)
+ unlock_page_lruvec_irqrestore(locked, flags);
+
+ compact_lock_irqsave(&lruvec->lru_lock, &flags, cc);
+ locked = lruvec;
+ rcu_read_unlock();
+
+ lruvec_memcg_debug(lruvec, page);
/* Try get exclusive access under lock */
if (!skip_updated) {
@@ -983,10 +1016,6 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
goto isolate_abort;
}
- /* Recheck PageLRU and PageCompound under lock */
- if (!PageLRU(page))
- goto isolate_fail;
-
/*
* Page become compound since the non-locked check,
* and it's on LRU. It can only be a THP so the order
@@ -994,15 +1023,11 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
*/
if (unlikely(PageCompound(page) && !cc->alloc_contig)) {
low_pfn += compound_nr(page) - 1;
- goto isolate_fail;
+ SetPageLRU(page);
+ goto isolate_fail_put;
}
- }
-
- lruvec = mem_cgroup_page_lruvec(page, pgdat);
-
- /* Try isolate the page */
- if (__isolate_lru_page(page, isolate_mode) != 0)
- goto isolate_fail;
+ } else
+ rcu_read_unlock();
/* The whole page is taken off the LRU; skip the tail pages. */
if (PageCompound(page))
@@ -1032,6 +1057,15 @@ isolate_success:
}
continue;
+
+isolate_fail_put:
+ /* Avoid potential deadlock in freeing page under lru_lock */
+ if (locked) {
+ unlock_page_lruvec_irqrestore(locked, flags);
+ locked = NULL;
+ }
+ put_page(page);
+
isolate_fail:
if (!skip_on_failure)
continue;
@@ -1043,8 +1077,8 @@ isolate_fail:
*/
if (nr_isolated) {
if (locked) {
- spin_unlock_irqrestore(&pgdat->lru_lock, flags);
- locked = false;
+ unlock_page_lruvec_irqrestore(locked, flags);
+ locked = NULL;
}
putback_movable_pages(&cc->migratepages);
cc->nr_migratepages = 0;
@@ -1068,9 +1102,15 @@ isolate_fail:
if (unlikely(low_pfn > end_pfn))
low_pfn = end_pfn;
+ page = NULL;
+
isolate_abort:
if (locked)
- spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+ unlock_page_lruvec_irqrestore(locked, flags);
+ if (page) {
+ SetPageLRU(page);
+ put_page(page);
+ }
/*
* Updated the cached scanner pfn once the pageblock has been scanned
diff --git a/mm/filemap.c b/mm/filemap.c
index 39bb88140680..c178022d7893 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -102,8 +102,8 @@
* ->swap_lock (try_to_unmap_one)
* ->private_lock (try_to_unmap_one)
* ->i_pages lock (try_to_unmap_one)
- * ->pgdat->lru_lock (follow_page->mark_page_accessed)
- * ->pgdat->lru_lock (check_pte_range->isolate_lru_page)
+ * ->lruvec->lru_lock (follow_page->mark_page_accessed)
+ * ->lruvec->lru_lock (check_pte_range->isolate_lru_page)
* ->private_lock (page_remove_rmap->set_page_dirty)
* ->i_pages lock (page_remove_rmap->set_page_dirty)
* bdi.wb->list_lock (page_remove_rmap->set_page_dirty)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e425bbf6711a..1efe2b5ad59a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2359,6 +2359,27 @@ static void remap_page(struct page *page, unsigned int nr)
}
}
+static void lru_add_page_tail(struct page *head, struct page *tail,
+ struct lruvec *lruvec, struct list_head *list)
+{
+ VM_BUG_ON_PAGE(!PageHead(head), head);
+ VM_BUG_ON_PAGE(PageCompound(tail), head);
+ VM_BUG_ON_PAGE(PageLRU(tail), head);
+ lockdep_assert_held(&lruvec->lru_lock);
+
+ if (list) {
+ /* page reclaim is reclaiming a huge page */
+ VM_WARN_ON(PageLRU(head));
+ get_page(tail);
+ list_add_tail(&tail->lru, list);
+ } else {
+ /* head is still on lru (and we have it frozen) */
+ VM_WARN_ON(!PageLRU(head));
+ SetPageLRU(tail);
+ list_add_tail(&tail->lru, &head->lru);
+ }
+}
+
static void __split_huge_page_tail(struct page *head, int tail,
struct lruvec *lruvec, struct list_head *list)
{
@@ -2425,18 +2446,15 @@ static void __split_huge_page_tail(struct page *head, int tail,
}
static void __split_huge_page(struct page *page, struct list_head *list,
- pgoff_t end, unsigned long flags)
+ pgoff_t end)
{
struct page *head = compound_head(page);
- pg_data_t *pgdat = page_pgdat(head);
struct lruvec *lruvec;
struct address_space *swap_cache = NULL;
unsigned long offset = 0;
unsigned int nr = thp_nr_pages(head);
int i;
- lruvec = mem_cgroup_page_lruvec(head, pgdat);
-
/* complete memcg works before add pages to LRU */
mem_cgroup_split_huge_fixup(head);
@@ -2448,6 +2466,9 @@ static void __split_huge_page(struct page *page, struct list_head *list,
xa_lock(&swap_cache->i_pages);
}
+ /* lock lru list/PageCompound, ref freezed by page_ref_freeze */
+ lruvec = lock_page_lruvec(head);
+
for (i = nr - 1; i >= 1; i--) {
__split_huge_page_tail(head, i, lruvec, list);
/* Some pages can be beyond i_size: drop them from page cache */
@@ -2467,6 +2488,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
}
ClearPageCompound(head);
+ unlock_page_lruvec(lruvec);
+ /* Caller disabled irqs, so they are still disabled here */
split_page_owner(head, nr);
@@ -2484,8 +2507,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
page_ref_add(head, 2);
xa_unlock(&head->mapping->i_pages);
}
-
- spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+ local_irq_enable();
remap_page(head, nr);
@@ -2631,12 +2653,10 @@ bool can_split_huge_page(struct page *page, int *pextra_pins)
int split_huge_page_to_list(struct page *page, struct list_head *list)
{
struct page *head = compound_head(page);
- struct pglist_data *pgdata = NODE_DATA(page_to_nid(head));
struct deferred_split *ds_queue = get_deferred_split_queue(head);
struct anon_vma *anon_vma = NULL;
struct address_space *mapping = NULL;
int count, mapcount, extra_pins, ret;
- unsigned long flags;
pgoff_t end;
VM_BUG_ON_PAGE(is_huge_zero_page(head), head);
@@ -2697,9 +2717,8 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
unmap_page(head);
VM_BUG_ON_PAGE(compound_mapcount(head), head);
- /* prevent PageLRU to go away from under us, and freeze lru stats */
- spin_lock_irqsave(&pgdata->lru_lock, flags);
-
+ /* block interrupt reentry in xa_lock and spinlock */
+ local_irq_disable();
if (mapping) {
XA_STATE(xas, &mapping->i_pages, page_index(head));
@@ -2729,7 +2748,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
__dec_lruvec_page_state(head, NR_FILE_THPS);
}
- __split_huge_page(page, list, end, flags);
+ __split_huge_page(page, list, end);
ret = 0;
} else {
if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) {
@@ -2743,7 +2762,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
spin_unlock(&ds_queue->split_queue_lock);
fail: if (mapping)
xa_unlock(&mapping->i_pages);
- spin_unlock_irqrestore(&pgdata->lru_lock, flags);
+ local_irq_enable();
remap_page(head, thp_nr_pages(head));
ret = -EBUSY;
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9c5b14fe360e..e3c7ca7dc174 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -20,6 +20,9 @@
* Lockless page tracking & accounting
* Unified hierarchy configuration model
* Copyright (C) 2015 Red Hat, Inc., Johannes Weiner
+ *
+ * Per memcg lru locking
+ * Copyright (C) 2020 Alibaba, Inc, Alex Shi
*/
#include <linux/page_counter.h>
@@ -1322,6 +1325,23 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
return ret;
}
+#ifdef CONFIG_DEBUG_VM
+void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+{
+ struct mem_cgroup *memcg;
+
+ if (mem_cgroup_disabled())
+ return;
+
+ memcg = page_memcg(page);
+
+ if (!memcg)
+ VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != root_mem_cgroup, page);
+ else
+ VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != memcg, page);
+}
+#endif
+
/**
* mem_cgroup_page_lruvec - return lruvec for isolating/putting an LRU page
* @page: the page
@@ -1363,6 +1383,60 @@ out:
}
/**
+ * lock_page_lruvec - lock and return lruvec for a given page.
+ * @page: the page
+ *
+ * This series functions should be used in either conditions:
+ * PageLRU is cleared or unset
+ * or page->_refcount is zero
+ * or page is locked.
+ */
+struct lruvec *lock_page_lruvec(struct page *page)
+{
+ struct lruvec *lruvec;
+ struct pglist_data *pgdat = page_pgdat(page);
+
+ rcu_read_lock();
+ lruvec = mem_cgroup_page_lruvec(page, pgdat);
+ spin_lock(&lruvec->lru_lock);
+ rcu_read_unlock();
+
+ lruvec_memcg_debug(lruvec, page);
+
+ return lruvec;
+}
+
+struct lruvec *lock_page_lruvec_irq(struct page *page)
+{
+ struct lruvec *lruvec;
+ struct pglist_data *pgdat = page_pgdat(page);
+
+ rcu_read_lock();
+ lruvec = mem_cgroup_page_lruvec(page, pgdat);
+ spin_lock_irq(&lruvec->lru_lock);
+ rcu_read_unlock();
+
+ lruvec_memcg_debug(lruvec, page);
+
+ return lruvec;
+}
+
+struct lruvec *lock_page_lruvec_irqsave(struct page *page, unsigned long *flags)
+{
+ struct lruvec *lruvec;
+ struct pglist_data *pgdat = page_pgdat(page);
+
+ rcu_read_lock();
+ lruvec = mem_cgroup_page_lruvec(page, pgdat);
+ spin_lock_irqsave(&lruvec->lru_lock, *flags);
+ rcu_read_unlock();
+
+ lruvec_memcg_debug(lruvec, page);
+
+ return lruvec;
+}
+
+/**
* mem_cgroup_update_lru_size - account for adding or removing an lru page
* @lruvec: mem_cgroup per zone lru vector
* @lru: index of lru list the page is sitting on
@@ -2142,6 +2216,12 @@ again:
if (unlikely(!memcg))
return NULL;
+#ifdef CONFIG_PROVE_LOCKING
+ local_irq_save(flags);
+ might_lock(&memcg->move_lock);
+ local_irq_restore(flags);
+#endif
+
if (atomic_read(&memcg->moving_account) <= 0)
return memcg;
@@ -3263,10 +3343,8 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
#endif /* CONFIG_MEMCG_KMEM */
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-
/*
- * Because tail pages are not marked as "used", set it. We're under
- * pgdat->lru_lock and migration entries setup in all page mappings.
+ * Because page_memcg(head) is not set on compound tails, set it now.
*/
void mem_cgroup_split_huge_fixup(struct page *head)
{
diff --git a/mm/mlock.c b/mm/mlock.c
index 884b1216da6a..55b3b3672977 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -106,26 +106,6 @@ void mlock_vma_page(struct page *page)
}
/*
- * Isolate a page from LRU with optional get_page() pin.
- * Assumes lru_lock already held and page already pinned.
- */
-static bool __munlock_isolate_lru_page(struct page *page, bool getpage)
-{
- if (PageLRU(page)) {
- struct lruvec *lruvec;
-
- lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
- if (getpage)
- get_page(page);
- ClearPageLRU(page);
- del_page_from_lru_list(page, lruvec, page_lru(page));
- return true;
- }
-
- return false;
-}
-
-/*
* Finish munlock after successful page isolation
*
* Page must be locked. This is a wrapper for try_to_munlock()
@@ -187,40 +167,24 @@ static void __munlock_isolation_failed(struct page *page)
unsigned int munlock_vma_page(struct page *page)
{
int nr_pages;
- pg_data_t *pgdat = page_pgdat(page);
/* For try_to_munlock() and to serialize with page migration */
BUG_ON(!PageLocked(page));
-
VM_BUG_ON_PAGE(PageTail(page), page);