linux.git/mm/mlock.c, branch v3.18.72

mlock: fix mlock count can not decrease in race condition

2017-06-07T10:01:52+00:00

commit 70feee0e1ef331b22cc51f383d532a0d043fbdcc upstream.

Kefeng reported that when running the follow test, the mlock count in
meminfo will increase permanently:

 [1] testcase
 linux:~ # cat test_mlockal
 grep Mlocked /proc/meminfo
  for j in `seq 0 10`
  do
 	for i in `seq 4 15`
 	do
 		./p_mlockall >> log &
 	done
 	sleep 0.2
 done
 # wait some time to let mlock counter decrease and 5s may not enough
 sleep 5
 grep Mlocked /proc/meminfo

 linux:~ # cat p_mlockall.c
 #include 
 #include 
 #include 

 #define SPACE_LEN	4096

 int main(int argc, char ** argv)
 {
	 	int ret;
	 	void *adr = malloc(SPACE_LEN);
	 	if (!adr)
	 		return -1;

	 	ret = mlockall(MCL_CURRENT | MCL_FUTURE);
	 	printf("mlcokall ret = %d\n", ret);

	 	ret = munlockall();
	 	printf("munlcokall ret = %d\n", ret);

	 	free(adr);
	 	return 0;
	 }

In __munlock_pagevec() we should decrement NR_MLOCK for each page where
we clear the PageMlocked flag.  Commit 1ebb7cc6a583 ("mm: munlock: batch
NR_MLOCK zone state updates") has introduced a bug where we don't
decrement NR_MLOCK for pages where we clear the flag, but fail to
isolate them from the lru list (e.g.  when the pages are on some other
cpu's percpu pagevec).  Since PageMlocked stays cleared, the NR_MLOCK
accounting gets permanently disrupted by this.

Fix it by counting the number of page whose PageMlock flag is cleared.

Fixes: 1ebb7cc6a583 (" mm: munlock: batch NR_MLOCK zone state updates")
Link: http://lkml.kernel.org/r/1495678405-54569-1-git-send-email-xieyisheng1@huawei.com
Signed-off-by: Yisheng Xie 
Reported-by: Kefeng Wang 
Tested-by: Kefeng Wang 
Cc: Vlastimil Babka 
Cc: Joern Engel 
Cc: Mel Gorman 
Cc: Michel Lespinasse 
Cc: Hugh Dickins 
Cc: Rik van Riel 
Cc: Johannes Weiner 
Cc: Michal Hocko 
Cc: Xishi Qiu 
Cc: zhongjiang 
Cc: Hanjun Guo 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2014-10-13T13:44:12+00:00

Pull RCU updates from Ingo Molnar:
 "The main changes in this cycle were:

   - changes related to No-CBs CPUs and NO_HZ_FULL

   - RCU-tasks implementation

   - torture-test updates

   - miscellaneous fixes

   - locktorture updates

   - RCU documentation updates"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (81 commits)
  workqueue: Use cond_resched_rcu_qs macro
  workqueue: Add quiescent state between work items
  locktorture: Cleanup header usage
  locktorture: Cannot hold read and write lock
  locktorture: Fix __acquire annotation for spinlock irq
  locktorture: Support rwlocks
  rcu: Eliminate deadlock between CPU hotplug and expedited grace periods
  locktorture: Document boot/module parameters
  rcutorture: Rename rcutorture_runnable parameter
  locktorture: Add test scenario for rwsem_lock
  locktorture: Add test scenario for mutex_lock
  locktorture: Make torture scripting account for new _runnable name
  locktorture: Introduce torture context
  locktorture: Support rwsems
  locktorture: Add infrastructure for torturing read locks
  torture: Address race in module cleanup
  locktorture: Make statistics generic
  locktorture: Teach about lock debugging
  locktorture: Support mutexes
  locktorture: Add documentation
  ...

mm: use VM_BUG_ON_MM where possible

2014-10-10T02:25:58+00:00

Dump the contents of the relevant struct_mm when we hit the bug condition.

Signed-off-by: Sasha Levin 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: convert a few VM_BUG_ON callers to VM_BUG_ON_VMA

2014-10-10T02:25:57+00:00

Trivially convert a few VM_BUG_ON calls to VM_BUG_ON_VMA to extract
more information when they trigger.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Sasha Levin 
Reviewed-by: Naoya Horiguchi 
Cc: Kirill A. Shutemov 
Cc: Konstantin Khlebnikov 
Cc: Rik van Riel 
Cc: Mel Gorman 
Cc: Michal Hocko 
Cc: Hugh Dickins 
Cc: Vlastimil Babka 
Cc: Michel Lespinasse 
Cc: Minchan Kim 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

rcu: Provide cond_resched_rcu_qs() to force quiescent states in long loops

2014-09-07T23:27:20+00:00

RCU-tasks requires the occasional voluntary context switch
from CPU-bound in-kernel tasks.  In some cases, this requires
instrumenting cond_resched().  However, there is some reluctance
to countenance unconditionally instrumenting cond_resched() (see
http://lwn.net/Articles/603252/), so this commit creates a separate
cond_resched_rcu_qs() that may be used in place of cond_resched() in
locations prone to long-duration in-kernel looping.

This commit currently instruments only RCU-tasks.  Future possibilities
include also instrumenting RCU, RCU-bh, and RCU-sched in order to reduce
IPI usage.

Signed-off-by: Paul E. McKenney

mm: describe mmap_sem rules for __lock_page_or_retry() and callers

2014-08-07T01:01:20+00:00

Add a comment describing the circumstances in which
__lock_page_or_retry() will or will not release the mmap_sem when
returning 0.

Add comments to lock_page_or_retry()'s callers (filemap_fault(),
do_swap_page()) noting the impact on VM_FAULT_RETRY returns.

Add comments on up the call tree, particularly replacing the false "We
return with mmap_sem still held" comments.

Signed-off-by: Paul Cassella 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: try_to_unmap_cluster() should lock_page() before mlocking

2014-04-07T23:35:57+00:00

A BUG_ON(!PageLocked) was triggered in mlock_vma_page() by Sasha Levin
fuzzing with trinity.  The call site try_to_unmap_cluster() does not lock
the pages other than its check_page parameter (which is already locked).

The BUG_ON in mlock_vma_page() is not documented and its purpose is
somewhat unclear, but apparently it serializes against page migration,
which could otherwise fail to transfer the PG_mlocked flag.  This would
not be fatal, as the page would be eventually encountered again, but
NR_MLOCK accounting would become distorted nevertheless.  This patch adds
a comment to the BUG_ON in mlock_vma_page() and munlock_vma_page() to that
effect.

The call site try_to_unmap_cluster() is fixed so that for page !=
check_page, trylock_page() is attempted (to avoid possible deadlocks as we
already have check_page locked) and mlock_vma_page() is performed only
upon success.  If the page lock cannot be obtained, the page is left
without PG_mlocked, which is again not a problem in the whole unevictable
memory design.

Signed-off-by: Vlastimil Babka 
Signed-off-by: Bob Liu 
Reported-by: Sasha Levin 
Cc: Wanpeng Li 
Cc: Michel Lespinasse 
Cc: KOSAKI Motohiro 
Acked-by: Rik van Riel 
Cc: David Rientjes 
Cc: Mel Gorman 
Cc: Hugh Dickins 
Cc: Joonsoo Kim 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: dump page when hitting a VM_BUG_ON using VM_BUG_ON_PAGE

2014-01-24T00:36:50+00:00

Most of the VM_BUG_ON assertions are performed on a page.  Usually, when
one of these assertions fails we'll get a BUG_ON with a call stack and
the registers.

I've recently noticed based on the requests to add a small piece of code
that dumps the page to various VM_BUG_ON sites that the page dump is
quite useful to people debugging issues in mm.

This patch adds a VM_BUG_ON_PAGE(cond, page) which beyond doing what
VM_BUG_ON() does, also dumps the page before executing the actual
BUG_ON.

[akpm@linux-foundation.org: fix up includes]
Signed-off-by: Sasha Levin 
Cc: "Kirill A. Shutemov" 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: munlock: fix potential race with THP page split

2014-01-24T00:36:50+00:00

Since commit ff6a6da60b89 ("mm: accelerate munlock() treatment of THP
pages") munlock skips tail pages of a munlocked THP page.  There is some
attempt to prevent bad consequences of racing with a THP page split, but
code inspection indicates that there are two problems that may lead to a
non-fatal, yet wrong outcome.

First, __split_huge_page_refcount() copies flags including PageMlocked
from the head page to the tail pages.  Clearing PageMlocked by
munlock_vma_page() in the middle of this operation might result in part
of tail pages left with PageMlocked flag.  As the head page still
appears to be a THP page until all tail pages are processed,
munlock_vma_page() might think it munlocked the whole THP page and skip
all the former tail pages.  Before ff6a6da60, those pages would be
cleared in further iterations of munlock_vma_pages_range(), but NR_MLOCK
would still become undercounted (related the next point).

Second, NR_MLOCK accounting is based on call to hpage_nr_pages() after
the PageMlocked is cleared.  The accounting might also become
inconsistent due to race with __split_huge_page_refcount()

- undercount when HUGE_PMD_NR is subtracted, but some tail pages are
  left with PageMlocked set and counted again (only possible before
  ff6a6da60)

- overcount when hpage_nr_pages() sees a normal page (split has already
  finished), but the parallel split has meanwhile cleared PageMlocked from
  additional tail pages

This patch prevents both problems via extending the scope of lru_lock in
munlock_vma_page().  This is convenient because:

- __split_huge_page_refcount() takes lru_lock for its whole operation

- munlock_vma_page() typically takes lru_lock anyway for page isolation

As this becomes a second function where page isolation is done with
lru_lock already held, factor this out to a new
__munlock_isolate_lru_page() function and clean up the code around.

[akpm@linux-foundation.org: avoid a coding-style ugly]
Signed-off-by: Vlastimil Babka 
Cc: Sasha Levin 
Cc: Michel Lespinasse 
Cc: Andrea Arcangeli 
Cc: Rik van Riel 
Cc: Mel Gorman 
Cc: Hugh Dickins 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/mlock: prepare params outside critical region

2014-01-22T00:19:44+00:00

All mlock related syscalls prepare lock limits, lengths and start
parameters with the mmap_sem held.  Move this logic outside of the
critical region.  For the case of mlock, continue incrementing the
amount already locked by mm->locked_vm with the rwsem taken.

Signed-off-by: Davidlohr Bueso 
Cc: Rik van Riel 
Reviewed-by: Michel Lespinasse 
Acked-by: Vlastimil Babka 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds