linux.git/mm/rmap.c, branch v2.6.26-rc7

mm: remove nopage

2008-04-28T15:58:18+00:00

Nothing in the tree uses nopage any more.  Remove support for it in the
core mm code and documentation (and a few stray references to it in
comments).

Signed-off-by: Nick Piggin 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

s390: KVM preparation: host memory management changes for s390 kvm

2008-04-27T09:00:40+00:00

This patch changes the s390 memory management definitions to use the pgste field
for dirty and reference bit tracking of host and guest code. Usually on s390,
dirty and referenced are tracked in storage keys, which belong to the physical
page. This changes with virtualization: The guest and host dirty/reference bits
are defined to be the logical OR of the values for the mapping and the physical
page. This patch implements the necessary changes in pgtable.h for s390.

There is a common code change in mm/rmap.c: the call to
page_test_and_clear_young must be moved. This is a no-op for all
architectures but s390. page_referenced checks the referenced bits for
the physical page and for all mappings:
o The physical page is checked with page_test_and_clear_young.
o The mappings are checked with ptep_test_and_clear_young and friends.

Without pgstes (the current implementation on Linux s390) the physical page
check is implemented but the mapping callbacks are no-ops because dirty
and referenced are not tracked in the s390 page tables. The pgstes introduce
guest and host dirty and reference bits for s390 in the host mapping. These
mappings must be checked before page_test_and_clear_young resets the reference
bit.
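
The ordering requirement can be illustrated with a small user-space model.
This is a sketch only: the model_* names and structs are hypothetical
stand-ins, not the s390 or mm/rmap.c code, and it assumes (as described
above) that a mapping-level check reports the logical OR of its own bit and
the physical page's bit.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not kernel types. */
struct model_page {
	bool key_referenced;	/* models the storage-key reference bit */
};

struct model_mapping {
	bool pgste_referenced;	/* models the per-mapping pgste bit */
};

/* Models page_test_and_clear_young(): test and reset the physical bit. */
static bool model_page_test_and_clear_young(struct model_page *page)
{
	bool young = page->key_referenced;

	page->key_referenced = false;
	return young;
}

/*
 * Models a mapping-level check. Assumption: the value seen through a
 * mapping is the OR of the mapping's bit and the physical page's bit.
 */
static bool model_ptep_test_and_clear_young(const struct model_page *page,
					    struct model_mapping *map)
{
	bool young = map->pgste_referenced || page->key_referenced;

	map->pgste_referenced = false;
	return young;
}

/* Check every mapping first, then the physical page. */
static int model_page_referenced(struct model_page *page,
				 struct model_mapping *maps, size_t nr)
{
	int referenced = 0;
	size_t i;

	for (i = 0; i < nr; i++)
		if (model_ptep_test_and_clear_young(page, &maps[i]))
			referenced++;

	/* Only now is it safe to reset the physical reference bit. */
	if (model_page_test_and_clear_young(page))
		referenced++;

	return referenced;
}
```

In this model, calling model_page_test_and_clear_young() before the loop
would hide a reference recorded only in the storage key from every mapping
check.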

Signed-off-by: Heiko Carstens 
Signed-off-by: Christian Borntraeger 
Acked-by: Martin Schwidefsky 
Acked-by: Andrew Morton 
Signed-off-by: Carsten Otte 
Signed-off-by: Avi Kivity

mm: rmap kernel-doc fixes

2008-03-20T01:53:35+00:00

Correct kernel-doc function names and parameters in rmap.c.

Signed-off-by: Randy Dunlap 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

memcg: mm_match_cgroup not vm_match_cgroup

2008-03-05T00:35:14+00:00

vm_match_cgroup is a perverse name for a macro to match mm with cgroup: rename
it mm_match_cgroup, matching mm_init_cgroup and mm_free_cgroup.

Signed-off-by: Hugh Dickins 
Acked-by: David Rientjes 
Acked-by: Balbir Singh 
Acked-by: KAMEZAWA Hiroyuki 
Cc: Hirokazu Takahashi 
Cc: YAMAMOTO Takashi 
Cc: Paul Menage 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

memcontrol: add vm_match_cgroup()

2008-02-09T19:08:33+00:00

mm_cgroup() is exclusively used to test whether an mm's mem_cgroup pointer
is pointing to a specific cgroup.  Instead of returning the pointer, we can
just do the test itself in a new macro:

	vm_match_cgroup(mm, cgroup)

returns non-zero if the mm's mem_cgroup points to cgroup.  Otherwise it
returns zero.
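
A minimal user-space sketch of such a macro follows. The struct names are
hypothetical stand-ins for illustration, and the in-tree macro may differ
(for example in how it reads the pointer); only the "test instead of return
the pointer" idea is taken from the text above.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures. */
struct model_mem_cgroup {
	int id;
};

struct model_mm {
	struct model_mem_cgroup *mem_cgroup;
};

/*
 * Non-zero if the mm's mem_cgroup pointer is the given cgroup, zero
 * otherwise. Simplified: a plain pointer comparison with no locking.
 */
#define vm_match_cgroup(mm, cgroup) ((mm)->mem_cgroup == (cgroup))
```

A caller that previously compared the result of mm_cgroup(mm) against a
cgroup would instead write "if (vm_match_cgroup(mm, cgroup))".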

Signed-off-by: David Rientjes 
Cc: Balbir Singh 
Cc: Adrian Bunk 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

Memory controller: make page_referenced() cgroup aware

2008-02-07T16:42:19+00:00

Make page_referenced() cgroup aware.  Without this patch, page_referenced()
can cause a page to be skipped while reclaiming pages.  This patch ensures
that other cgroups do not hold pages in a particular cgroup hostage.  It
is required to ensure that shared pages are freed from a cgroup when they
are not actively referenced from the cgroup that brought them in.

Signed-off-by: Balbir Singh 
Cc: Pavel Emelianov 
Cc: Paul Menage 
Cc: Peter Zijlstra 
Cc: "Eric W. Biederman" 
Cc: Nick Piggin 
Cc: Kirill Korotaev 
Cc: Herbert Poetzl 
Cc: David Rientjes 
Cc: Vaidyanathan Srinivasan 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

Memory controller: memory accounting

2008-02-07T16:42:18+00:00

Add the accounting hooks.  The accounting is carried out for RSS and Page
Cache (unmapped) pages.  There is now a common limit and accounting for both.
RSS is accounted at page_add_*_rmap() and page_remove_rmap() time.  Page
cache is accounted at add_to_page_cache() and __delete_from_page_cache().
Swap cache is also accounted for.

Each page's page_cgroup is protected with the last bit of the
page_cgroup pointer; this makes handling of race conditions involving
simultaneous mappings of a page easier.  A reference count is kept in the
page_cgroup to deal with cases where a page might be unmapped from the RSS
of all tasks, but still lives in the page cache.
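
The idea of protecting a pointer with one of its own bits can be sketched in
user space with C11 atomics. This is an illustration under assumptions, not
the memory controller's code: it takes "last bit" to mean the least
significant bit, the model_* names are hypothetical, and it shows a trylock
only rather than a spinning lock.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: pointees are at least 2-byte aligned, so bit 0 is free. */
#define MODEL_PC_LOCK_BIT ((uintptr_t)1)

struct model_page_cgroup {
	int ref_cnt;
};

struct model_page {
	atomic_uintptr_t page_cgroup;	/* pointer value plus lock bit */
};

/* Try to take the lock; returns true if this caller set the bit. */
static bool model_trylock_page_cgroup(struct model_page *page)
{
	uintptr_t old = atomic_fetch_or_explicit(&page->page_cgroup,
						 MODEL_PC_LOCK_BIT,
						 memory_order_acquire);

	return !(old & MODEL_PC_LOCK_BIT);
}

/* Drop the lock bit, leaving the pointer bits untouched. */
static void model_unlock_page_cgroup(struct model_page *page)
{
	atomic_fetch_and_explicit(&page->page_cgroup, ~MODEL_PC_LOCK_BIT,
				  memory_order_release);
}

/* Read the pointer with the lock bit masked off. */
static struct model_page_cgroup *model_page_get_cgroup(struct model_page *page)
{
	uintptr_t v = atomic_load(&page->page_cgroup);

	return (struct model_page_cgroup *)(v & ~MODEL_PC_LOCK_BIT);
}

/* Store a new pointer; the caller must hold the lock, which stays held. */
static void model_page_assign_cgroup(struct model_page *page,
				     struct model_page_cgroup *pc)
{
	atomic_store(&page->page_cgroup, (uintptr_t)pc | MODEL_PC_LOCK_BIT);
}
```

Because the lock lives in the same word as the pointer, two tasks mapping
the same page at once serialize on that word without a separate lock field.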

Credits go to Vaidyanathan Srinivasan for helping with reference counting work
of the page cgroup.  Almost all of the page cache accounting code was written
with help from Vaidyanathan Srinivasan.

[hugh@veritas.com: fix swapoff breakage]
[akpm@linux-foundation.org: fix locking]
Signed-off-by: Vaidyanathan Srinivasan 
Signed-off-by: Balbir Singh 
Cc: Pavel Emelianov 
Cc: Paul Menage 
Cc: Peter Zijlstra 
Cc: "Eric W. Biederman" 
Cc: Nick Piggin 
Cc: Kirill Korotaev 
Cc: Herbert Poetzl 
Cc: David Rientjes 
Cc: 
Signed-off-by: Hugh Dickins 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: don't waste swap on locked pages

2008-02-05T17:44:18+00:00

try_to_unmap always fails on a page found in a VM_LOCKED vma (unless
migrating), and recycles it back to the active list.  But if it's an
anonymous page, we've already allocated swap to it: just wasting swap.
Spot locked pages in page_referenced_one and treat them as referenced.
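
A simplified user-space model of that check follows. The model_* names and
the flag value are hypothetical, and the real page_referenced_one takes
different arguments; the sketch only shows a locked vma being reported as
referenced without consulting the PTE.

```c
#include <stdbool.h>

#define MODEL_VM_LOCKED 0x1u	/* hypothetical stand-in for VM_LOCKED */

struct model_vma {
	unsigned int flags;
	bool pte_young;		/* models the PTE's young/accessed bit */
};

/* Returns 1 if the page counts as referenced through this vma. */
static int model_page_referenced_one(struct model_vma *vma)
{
	/* Locked: treat as referenced so reclaim does not allocate swap. */
	if (vma->flags & MODEL_VM_LOCKED)
		return 1;

	/* Otherwise test and clear the young bit as usual. */
	if (vma->pte_young) {
		vma->pte_young = false;
		return 1;
	}

	return 0;
}
```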

Signed-off-by: Hugh Dickins 
Tested-by: KAMEZAWA Hiroyuki 
Cc: Ethan Solomita 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

radix-tree: avoid atomic allocations for preloaded insertions

2008-02-05T17:44:17+00:00

Most pagecache (and some other) radix tree insertions have the great
opportunity to preallocate a few nodes with relaxed gfp flags.  But the
preallocation is squandered when it comes time to allocate a node: we
default to first attempting a GFP_ATOMIC allocation, which doesn't
normally fail but can eat into atomic memory reserves that we don't
need to be using.

Another upshot of this is that it removes the sometimes highly contended
zone->lock from underneath tree_lock.  Pagecache insertions are always
performed with a radix tree preload, and after this change, such a
situation will never fall back to kmem_cache_alloc within
radix_tree_node_alloc.
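
The "use the preload first" ordering can be modelled in user space as below.
This is a sketch with hypothetical model_* names and malloc() standing in
for both the relaxed and the atomic allocation; it is not the lib/radix-tree.c
code.

```c
#include <stddef.h>
#include <stdlib.h>

#define MODEL_PRELOAD_SIZE 4

struct model_node {
	int unused;
};

/* Models the pool of preallocated nodes. */
struct model_preload {
	int nr;
	struct model_node *nodes[MODEL_PRELOAD_SIZE];
};

/* Counts how often allocation had to bypass the preload pool. */
static int model_fallback_allocs;

/* Fill the pool up front, where a relaxed allocation is allowed. */
static int model_preload(struct model_preload *p, int want)
{
	if (want > MODEL_PRELOAD_SIZE)
		want = MODEL_PRELOAD_SIZE;

	while (p->nr < want) {
		struct model_node *n = malloc(sizeof(*n));

		if (!n)
			return -1;
		p->nodes[p->nr++] = n;
	}
	return 0;
}

/* Take a preloaded node if one exists; only then fall back. */
static struct model_node *model_node_alloc(struct model_preload *p)
{
	if (p->nr > 0) {
		struct model_node *n = p->nodes[--p->nr];

		p->nodes[p->nr] = NULL;
		return n;
	}

	/* Stands in for the allocation that could dip into reserves. */
	model_fallback_allocs++;
	return malloc(sizeof(struct model_node));
}
```

As long as the caller preloaded enough nodes, model_node_alloc() never
reaches the fallback path, which is the property the change is after.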

David Miller reports seeing this allocation fail on a highly threaded
sparc64 system:

[527319.459981] dd: page allocation failure. order:0, mode:0x20
[527319.460403] Call Trace:
[527319.460568]  [00000000004b71e0] __slab_alloc+0x1b0/0x6a8
[527319.460636]  [00000000004b7bbc] kmem_cache_alloc+0x4c/0xa8
[527319.460698]  [000000000055309c] radix_tree_node_alloc+0x20/0x90
[527319.460763]  [0000000000553238] radix_tree_insert+0x12c/0x260
[527319.460830]  [0000000000495cd0] add_to_page_cache+0x38/0xb0
[527319.460893]  [00000000004e4794] mpage_readpages+0x6c/0x134
[527319.460955]  [000000000049c7fc] __do_page_cache_readahead+0x170/0x280
[527319.461028]  [000000000049cc88] ondemand_readahead+0x208/0x214
[527319.461094]  [0000000000496018] do_generic_mapping_read+0xe8/0x428
[527319.461152]  [0000000000497948] generic_file_aio_read+0x108/0x170
[527319.461217]  [00000000004badac] do_sync_read+0x88/0xd0
[527319.461292]  [00000000004bb5cc] vfs_read+0x78/0x10c
[527319.461361]  [00000000004bb920] sys_read+0x34/0x60
[527319.461424]  [0000000000406294] linux_sparc_syscall32+0x3c/0x40

The calltrace is significant: __do_page_cache_readahead allocates a number
of pages with GFP_KERNEL, and hence it should have reclaimed sufficient
memory to satisfy GFP_ATOMIC allocations.  However after the list of pages
goes to mpage_readpages, there can be significant intervals (including disk
IO) before all the pages are inserted into the radix-tree.  So the reserves
can easily be depleted at that point.  The patch is confirmed to fix the
problem.

Signed-off-by: Nick Piggin 
Cc: "David S. Miller" 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

[S390] Optimize storage key handling for anonymous pages

2007-11-20T10:13:46+00:00

page_mkclean used to call page_clear_dirty for every given page. This
differs from all other architectures, where the dirty bit in the
PTEs is reset only if page_mapping() returns a non-NULL pointer.
We can move the page_test_dirty/page_clear_dirty sequence into the
2nd if to avoid unnecessary iske/sske sequences, which are expensive.

This change also helps kvm for s390 as the host must transfer the
dirty bit into the guest status bits. By moving the page_clear_dirty
operation into the 2nd if, the vm will only call page_clear_dirty
for pages where it walks the mapping anyway. There it calls
ptep_clear_flush for writable ptes, so we can transfer the dirty bit
to the guest.
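
The effect of moving the sequence into the 2nd if can be shown with a
user-space model. The model_* names and fields are hypothetical, and the
counter merely stands in for the cost of storage-key instructions; this is
not the mm/rmap.c code.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the page state that matters here. */
struct model_page {
	bool mapped;		/* models page_mapped() */
	bool has_mapping;	/* models page_mapping() != NULL */
	bool pte_dirty;		/* models a dirty, writable PTE */
	bool key_dirty;		/* models the storage-key dirty bit */
};

/* Counts modelled storage-key accesses (the expensive part). */
static int model_key_ops;

static bool model_page_test_dirty(struct model_page *page)
{
	model_key_ops++;
	return page->key_dirty;
}

static void model_page_clear_dirty(struct model_page *page)
{
	model_key_ops++;
	page->key_dirty = false;
}

/* Models the walk over file mappings that cleans dirty PTEs. */
static int model_page_mkclean_file(struct model_page *page)
{
	int ret = page->pte_dirty;

	page->pte_dirty = false;
	return ret;
}

static int model_page_mkclean(struct model_page *page)
{
	int ret = 0;

	if (page->mapped) {			/* 1st if */
		if (page->has_mapping) {	/* 2nd if */
			ret = model_page_mkclean_file(page);
			/* Key handling happens only for pages we walk. */
			if (model_page_test_dirty(page)) {
				model_page_clear_dirty(page);
				ret = 1;
			}
		}
	}
	return ret;
}
```

In this model an anonymous page (no mapping) passes through
model_page_mkclean() without any storage-key access.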

Signed-off-by: Christian Borntraeger 
Signed-off-by: Martin Schwidefsky