linux.git/mm, branch v2.6.17.10

pdflush: handle resume wakeups

2006-07-25T03:35:28+00:00

2.6.16 needs this. It was merged into 2.6.18-rc1 in
http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d616e09ab33aa4d013a93c9b393efd5cebf78521 .

pdflush is carefully designed to ensure that all wakeups have some
corresponding work to do - if a woken-up pdflush thread discovers that
it hasn't been given any work to do then this is considered an error.

That all broke when swsusp came along - because a timer-delivered
wakeup to a frozen pdflush thread will just get lost.  This causes the
pdflush thread to get lost as well: the writeback timer is supposed to
be re-armed by pdflush in process context, but pdflush doesn't execute
the callout which does this.

Fix that up by ignoring the return value from try_to_freeze(): just
proceed, see if we have any work pending and only go back to sleep if
that is not the case.
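
As a rough illustration only (a user-space model, not the kernel code), the
"check for work, otherwise go back to sleep" idea could look like the sketch
below.  struct worker, handle_wakeup(), sample_writeback() and flushed_pages
are hypothetical names invented for this example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one pdflush worker; not the kernel's struct. */
struct worker {
    void (*fn)(unsigned long);  /* NULL means no work was handed over */
    unsigned long arg;
    unsigned long runs;         /* work items executed so far */
    unsigned long spurious;     /* wakeups that carried no work */
};

/* Hypothetical work function, only here so the sketch can be exercised. */
static unsigned long flushed_pages;

static void sample_writeback(unsigned long nr_pages)
{
    flushed_pages += nr_pages;
}

/*
 * Handle one wakeup.  Returns true if work ran, false if the worker
 * should simply go back to sleep.  A wakeup without work (for example
 * one that arrives around suspend/resume) is tolerated instead of
 * being treated as an error.
 */
static bool handle_wakeup(struct worker *w)
{
    if (w->fn == NULL) {
        w->spurious++;
        return false;
    }

    void (*fn)(unsigned long) = w->fn;
    unsigned long arg = w->arg;

    w->fn = NULL;               /* consume the work item before running it */
    fn(arg);
    w->runs++;
    return true;
}
```

A caller would loop on handle_wakeup() and sleep again whenever it returns
false, rather than complaining about a wakeup that carried no work.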

Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Pavel Machek 
Signed-off-by: Greg Kroah-Hartman

generic_file_buffered_write(): handle zero-length iovec segments

2006-07-25T03:35:25+00:00

The recent generic_file_write() deadlock fix caused
generic_file_buffered_write() to loop infinitely when presented with a
zero-length iovec segment.  Fix.
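
A minimal user-space sketch of the failure mode and one way around it, under
the assumption that the loop copies segment by segment: if an empty segment
is never stepped over, the loop stops making progress.  copy_iovec() is a
hypothetical helper written for this example, not a kernel function.

```c
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/*
 * Copy the bytes described by iov[0..nr_segs) into dst (capacity cap).
 * Returns the number of bytes copied.  Empty or exhausted segments are
 * stepped over before any copy is attempted, so every iteration either
 * advances to the next segment or copies at least one byte.
 */
static size_t copy_iovec(char *dst, size_t cap,
                         const struct iovec *iov, size_t nr_segs)
{
    size_t seg = 0, off = 0, written = 0;

    while (seg < nr_segs && written < cap) {
        size_t left = iov[seg].iov_len - off;

        if (left == 0) {        /* zero-length or finished segment */
            seg++;
            off = 0;
            continue;
        }

        size_t room = cap - written;
        size_t bytes = left < room ? left : room;

        memcpy(dst + written, (const char *)iov[seg].iov_base + off, bytes);
        written += bytes;
        off += bytes;
    }
    return written;
}
```

Note that the empty segment is skipped without calling any per-chunk
callbacks at all, which mirrors the cautious approach described below.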

Note that this fix deliberately avoids calling ->prepare_write(),
->commit_write() etc with a zero-length write.  This is because I don't trust
all filesystems to get that right.

This is a cautious approach, for 2.6.17.x.  For 2.6.18 we should just go ahead
and call ->prepare_write() and ->commit_write() with the zero length and fix
any broken filesystems.  So I'll make that change once this code is stabilised
and backported into 2.6.17.x.

The reason for preferring to call ->prepare_write() and ->commit_write() with
the zero-length segment: a zero-length segment _should_ be sufficiently
uncommon that this is the correct way of handling it.  We don't want to
optimise for poorly-written userspace at the expense of well-written
userspace.

Cc: "Vladimir V. Saveliev" 
Cc: Neil Brown 
Cc: Martin Schwidefsky 
Cc: Chris Wright 
Cc: Greg KH 
Cc: walt 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Chris Wright

generic_file_buffered_write(): deadlock on vectored write

2006-07-25T03:35:25+00:00

generic_file_buffered_write() prefaults in user pages in order to avoid
a deadlock when copying from the same page that the write goes to.

However, there appears to be a problem when the write is vectored:
fault_in_pages_readable() brings in the current segment or part of it
(maxlen).  OTOH, filemap_copy_from_user_iovec() is called to copy a number
of bytes (bytes) which may exceed the current segment, so
filemap_copy_from_user_iovec() switches to the next segment, which has not
been brought in yet.  A pagefault is generated.  That causes the deadlock
if the pagefault is for the same page the write goes to: the page being
written is locked and not uptodate, so the pagefault will deadlock trying
to lock the already-locked page.
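
One way to keep the prefault and the copy in agreement, sketched here in
user space under stated assumptions, is to never let a single copy step
exceed what is left of the current segment.  copy_chunk(), min_size() and
PAGE_SIZE_SKETCH are hypothetical names for this illustration; whether the
actual patch is structured this way is not stated above.

```c
#include <stddef.h>

#define PAGE_SIZE_SKETCH 4096UL   /* assumed page size, for illustration */

static size_t min_size(size_t a, size_t b)
{
    return a < b ? a : b;
}

/*
 * How many bytes may one copy step cover?
 *   pos      - file offset of the write
 *   count    - bytes still to be written overall
 *   seg_left - bytes remaining in the current iovec segment
 * The result never crosses the destination page and never runs past
 * the current segment, so prefaulting that many bytes of the current
 * segment covers everything the copy step will touch.
 */
static size_t copy_chunk(unsigned long pos, size_t count, size_t seg_left)
{
    size_t bytes = PAGE_SIZE_SKETCH - (pos & (PAGE_SIZE_SKETCH - 1));

    bytes = min_size(bytes, count);
    bytes = min_size(bytes, seg_left);
    return bytes;
}
```

With a 100-byte segment left and 10000 bytes still to write, the step is
limited to 100 bytes, so the copy cannot wander into a segment that was not
prefaulted.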

[akpm@osdl.org: somewhat rewritten]
Cc: Neil Brown 
Cc: Martin Schwidefsky 
Signed-off-by: Andrew Morton 
Signed-off-by: Chris Wright

memory hotplug: solve config broken: undefined reference to `online_page'

2006-07-25T03:35:20+00:00

The i386 memory hotplug code adds memory only to highmem.  So, if
CONFIG_HIGHMEM is not set, CONFIG_MEMORY_HOTPLUG shouldn't be set either.
Otherwise, it causes a compile error.

In addition, many architectures can't use the memory hotplug feature yet.
So, I introduce CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG.

Signed-off-by: Yasunori Goto 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Chris Wright 
Signed-off-by: Greg Kroah-Hartman

[PATCH] tmpfs: Decrement i_nlink correctly in shmem_rmdir()

2006-06-12T21:29:04+00:00

shmem_rmdir() must undo the increment of i_nlink done in
shmem_get_inode() for directories, otherwise at least
IN_DELETE_SELF inotify event generation is broken.
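
A toy model of the link counting, to show why the extra decrement matters.
Only the increment for directories in shmem_get_inode() comes from the
description above; the parent increment on mkdir and the final decrement for
the name are assumptions modelled on usual directory link counting, and all
names ending in _sketch are hypothetical.

```c
/* Hypothetical stand-in for an inode; only the link count is modelled. */
struct inode_sketch {
    unsigned int i_nlink;
};

/* A new directory starts with two links: its name and its own ".". */
static void mkdir_sketch(struct inode_sketch *parent,
                         struct inode_sketch *child)
{
    child->i_nlink = 1;     /* the name in the parent */
    child->i_nlink++;       /* the extra increment done for directories */
    parent->i_nlink++;      /* assumed: the child's ".." entry */
}

/* Removing the directory has to undo every increment made above. */
static void rmdir_sketch(struct inode_sketch *parent,
                         struct inode_sketch *child)
{
    child->i_nlink--;       /* undo the directory-specific increment */
    parent->i_nlink--;      /* assumed: ".." is gone */
    child->i_nlink--;       /* the name itself is unlinked */
}
```

Without the first decrement in rmdir_sketch() the removed directory would be
left with a link count of 1 instead of 0.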

Signed-off-by: Sergey Vlasov 
Signed-off-by: Hugh Dickins 
Signed-off-by: Linus Torvalds

[PATCH] tmpfs: time granularity fix for [acm]time going backwards

2006-06-12T20:55:52+00:00

I noticed a strange behavior in a tmpfs file system the other day, while
building packages - occasionally, and seemingly at random, make decided to
rebuild a target. However, only on tmpfs.

A file would be created, and if checked, it had a sub-second timestamp.
However, after a utimes-related call where sub-seconds should be set, they
were zeroed instead. If a file was created and utimes(...,NULL) was then
used on it within the same second, the timestamp on the file moved backwards.

After some digging, I found that this was being caused by tmpfs not having a
time granularity set, thus inheriting the default 1 second granularity.

Hugh adds: yes, we missed tmpfs when the s_time_gran mods went into 2.6.11.
Unfortunately, the granularity of CURRENT_TIME, often used in filesystems,
does not match the default granularity set by alloc_super.  A few more such
discrepancies have been found, but this is the most important to fix now.
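
The effect can be reproduced with a small user-space sketch.  trunc_to_gran()
and ts_before() are hypothetical helpers that only mimic rounding a
timestamp down to a granularity given in nanoseconds; they are not the
kernel routines.

```c
#include <time.h>

/* Round t down to a multiple of gran_ns nanoseconds within its second. */
static struct timespec trunc_to_gran(struct timespec t, long gran_ns)
{
    if (gran_ns > 1)
        t.tv_nsec -= t.tv_nsec % gran_ns;
    return t;
}

/* Returns non-zero if a is strictly earlier than b. */
static int ts_before(struct timespec a, struct timespec b)
{
    return a.tv_sec < b.tv_sec ||
           (a.tv_sec == b.tv_sec && a.tv_nsec < b.tv_nsec);
}
```

Suppose a file is stamped at 100.75s without truncation and later, at
100.90s, a timestamp is stored after truncation.  With a granularity of
1000000000 (one second) the second timestamp becomes 100.00s, which is
earlier than the first; with a granularity of 1 it stays at 100.90s.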

Signed-off-by: Robin H. Johnson 
Acked-by: Andi Kleen 
Signed-off-by: Hugh Dickins 
Signed-off-by: Linus Torvalds

[PATCH] typo in vmscan.c

2006-06-11T22:27:37+00:00

From: Christoph Lameter 

Looks like a comma was left from the conversion from a struct to an
assignment.

Signed-off-by: Christoph Lameter 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

[PATCH] slab.c: fix offslab_limit bug

2006-06-02T18:21:10+00:00

mm/slab.c's offslab_limit logic is totally broken.

Firstly, "offslab_limit" is a global variable while it should either be
calculated in situ or should be passed in as a parameter.

Secondly, the more serious problem with it is that the condition for
calculating it:

               if (!(OFF_SLAB(sizes->cs_cachep))) {
                       offslab_limit = sizes->cs_size - sizeof(struct slab);
                       offslab_limit /= sizeof(kmem_bufctl_t);

is in total disconnect with the condition that makes use of it:

               /* More than offslab_limit objects will cause problems */
               if ((flags & CFLGS_OFF_SLAB) && num > offslab_limit)
                       break;

but due to offslab_limit being a global variable this breakage was
hidden.

Up until lockdep came along and perturbed the slab sizes sufficiently so
that the first off-slab cache would still see a (non-calculated) zero
value for offslab_limit and would panic with:

  kmem_cache_create: couldn't create cache size-512.

  Call Trace:
   [] show_trace+0x96/0x1c8
   [] dump_stack+0x13/0x15
   [] panic+0x39/0x21a
   [] kmem_cache_create+0x5a0/0x5d0
   [] kmem_cache_init+0x193/0x379
   [] start_kernel+0x17f/0x218
   [] _sinittext+0x263/0x26a

  Kernel panic - not syncing: kmem_cache_create(): failed to create slab `size-512'

Paolo Ornati's config on x86_64 managed to trigger it.

The fix is to move the calculation to the place that makes use of it.
This also makes slab.o 54 bytes smaller.
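
A sketch of what "calculate it where it is used" could look like, written
as a stand-alone function.  too_many_objects() and its parameters are
hypothetical: hdr_size stands in for sizeof(struct slab) and bufctl_size for
sizeof(kmem_bufctl_t), passed in so the example does not depend on the real
structure layouts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Would num objects per slab exceed what the off-slab management area
 * can describe?  The limit is derived right here from the sizes in
 * play, instead of being read from a global that an unrelated code
 * path may or may not have initialised.
 */
static bool too_many_objects(bool off_slab, size_t num, size_t size,
                             size_t hdr_size, size_t bufctl_size)
{
    if (!off_slab)
        return false;       /* the limit only applies to off-slab caches */

    assert(size > hdr_size && bufctl_size > 0);

    size_t limit = (size - hdr_size) / bufctl_size;

    return num > limit;
}
```

Because the limit is a local value, the first off-slab cache can no longer
see a stale zero.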

Btw., the check itself is quite silly. Its intention is to test whether
the number of objects per slab would be higher than the number of slab
control pointers possible. In theory it could be triggered: if someone
tried to create a cache of 4-byte objects and explicitly requested
CFLGS_OFF_SLAB. So i kept the check.

Out of historic interest i checked how old this bug was and it's
ancient, 10 years old! It is the oldest hidden and then truly triggering
bug i ever saw being fixed in the kernel!

Signed-off-by: Ingo Molnar 
Signed-off-by: Linus Torvalds

[PATCH] spanned_pages is not updated at a case of memory hot-add

2006-05-31T23:27:10+00:00

From: Yasunori Goto 

If the hot-added memory's address is lower than the old area, spanned_pages
will not be updated.  This must be fixed.

example) Old zone_start_pfn = 0x60000, and spanned_pages = 0x10000
         Added new memory's start_pfn = 0x50000, and end_pfn = 0x60000

  With the old code, the new spanned_pages will still be 0x10000.
  (It should be updated to 0x20000.)  This is because old_zone_end_pfn
  is 0x70000 and end_pfn is not larger than it, so spanned_pages is not
  updated.

In the current code, spanned_pages is updated only when end_pfn is updated.
Instead, it should be computed as the difference between the larger end_pfn
and the new zone_start_pfn.
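
The arithmetic can be checked with a small stand-alone sketch.  struct
zone_sketch and grow_zone_span_sketch() are hypothetical names for this
illustration and are not the kernel's definitions.

```c
/* Hypothetical stand-in for the two zone fields discussed above. */
struct zone_sketch {
    unsigned long zone_start_pfn;
    unsigned long spanned_pages;
};

/*
 * Extend the zone so that it covers [start_pfn, end_pfn) as well.
 * spanned_pages is always recomputed from the larger end and the
 * (possibly lowered) start, not only when the end moves.
 */
static void grow_zone_span_sketch(struct zone_sketch *z,
                                  unsigned long start_pfn,
                                  unsigned long end_pfn)
{
    unsigned long old_end = z->zone_start_pfn + z->spanned_pages;
    unsigned long new_end = end_pfn > old_end ? end_pfn : old_end;

    if (start_pfn < z->zone_start_pfn)
        z->zone_start_pfn = start_pfn;

    z->spanned_pages = new_end - z->zone_start_pfn;
}
```

For the example above (zone_start_pfn = 0x60000, spanned_pages = 0x10000,
new memory from 0x50000 to 0x60000) this gives zone_start_pfn = 0x50000 and
spanned_pages = 0x70000 - 0x50000 = 0x20000.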

Signed-off-by: Yasunori Goto 
Signed-off-by: Dave Hansen 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

[PATCH] Align the node_mem_map endpoints to a MAX_ORDER boundary

2006-05-21T19:59:22+00:00

Andy added code to the buddy allocator which does not require the zone's
endpoints to be aligned to MAX_ORDER.  An issue is that the buddy allocator
requires the node_mem_map's endpoints to be MAX_ORDER aligned.  Otherwise
__page_find_buddy could compute a buddy not in node_mem_map for partial
MAX_ORDER regions at the zone's endpoints.  page_is_buddy will detect that
these pages at the endpoints are not PG_buddy (they were zeroed out by the
bootmem allocator and are not part of the zone).  Of course the negative
here is we could waste a little memory but the positive is eliminating all
the old checks for zone boundary conditions.
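
The alignment itself is plain power-of-two rounding, sketched below.  The
value 11 for MAX_ORDER_SKETCH is an assumption chosen for illustration, and
the macro and helper names are hypothetical.

```c
/* Assumed for illustration: MAX_ORDER of 11, i.e. 1024-page blocks. */
#define MAX_ORDER_SKETCH          11
#define MAX_ORDER_NR_PAGES_SKETCH (1UL << (MAX_ORDER_SKETCH - 1))

/* Round the first pfn of the map down to a MAX_ORDER block boundary. */
static unsigned long map_start_sketch(unsigned long node_start_pfn)
{
    return node_start_pfn & ~(MAX_ORDER_NR_PAGES_SKETCH - 1);
}

/* Round the end pfn of the map up to a MAX_ORDER block boundary. */
static unsigned long map_end_sketch(unsigned long node_end_pfn)
{
    return (node_end_pfn + MAX_ORDER_NR_PAGES_SKETCH - 1) &
           ~(MAX_ORDER_NR_PAGES_SKETCH - 1);
}
```

A node spanning pfns 1500 to 5000 would get a map covering 1024 to 5120
under these assumptions; the entries outside the node are the small amount
of memory traded for dropping the boundary checks.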

SPARSEMEM won't encounter this issue because of MAX_ORDER size constraint
when SPARSEMEM is configured.  ia64 VIRTUAL_MEM_MAP doesn't need the logic
either because the holes and endpoints are handled differently.  This
leaves checking alloc_remap and other arches which privately allocate for
node_mem_map.

Signed-off-by: Bob Picco 
Acked-by: Mel Gorman 
Cc: Dave Hansen 
Cc: Andy Whitcroft 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds