linux.git/drivers/md/md.c, branch v3.10.76

md: flush writes before starting a recovery.

2014-07-09T18:14:02+00:00

commit 133d4527eab8d199a62eee6bd433f0776842df2e upstream.

When we write to a degraded array which has a bitmap, we
make sure the relevant bit in the bitmap remains set when
the write completes (so a 're-add' can quickly rebuild a
temporarily-missing device).

If, immediately after such a write starts, we incorporate a spare,
commence recovery, and skip over the region where the write is
happening (because the 'needs recovery' flag isn't set yet),
then that write will not get to the new device.

Once the recovery finishes the new device will be trusted, but will
have incorrect data, leading to possible corruption.

We cannot set the 'needs recovery' flag when we start the write as we
do not know easily if the write will be "degraded" or not.  That
depends on details of the particular raid level and particular write
request.
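The race can be sketched as a small user-space C model. This is not md code: the array, its regions and every toy_* name are hypothetical. The only behaviour taken from the text is that the 'needs recovery' flag is set when a degraded write completes, not when it starts. Passing flush_first = true models waiting for in-flight writes before the recovery scan, which is the idea in the patch title. How the real patch performs the flush is not shown here.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model, not kernel code: a mirror whose second device is a spare
 * being recovered.  All names are hypothetical. */
#define TOY_REGIONS 4

struct toy_array {
    int  old_dev[TOY_REGIONS];          /* surviving device */
    int  new_dev[TOY_REGIONS];          /* spare being recovered */
    bool needs_recovery[TOY_REGIONS];   /* 'needs recovery' bits */
    int  inflight;                      /* region with a write in flight, or -1 */
};

static void toy_init(struct toy_array *a)
{
    for (int r = 0; r < TOY_REGIONS; r++) {
        a->old_dev[r] = 0;
        a->new_dev[r] = 0;
        a->needs_recovery[r] = false;
    }
    a->inflight = -1;
}

/* A degraded write: the data reaches the surviving device at once, but
 * the region is only flagged when the write completes. */
static void toy_start_write(struct toy_array *a, int region, int value)
{
    a->old_dev[region] = value;
    a->inflight = region;
}

static void toy_complete_write(struct toy_array *a)
{
    if (a->inflight >= 0)
        a->needs_recovery[a->inflight] = true;
    a->inflight = -1;
}

/* Bitmap-driven recovery copies only flagged regions.  flush_first
 * models waiting for in-flight writes before scanning. */
static void toy_recover(struct toy_array *a, bool flush_first)
{
    if (flush_first)
        toy_complete_write(a);
    for (int r = 0; r < TOY_REGIONS; r++) {
        if (a->needs_recovery[r]) {
            a->new_dev[r] = a->old_dev[r];
            a->needs_recovery[r] = false;
        }
    }
}
```

With flush_first = false, the recovery pass skips region 2 and the spare never receives the value. The flag is set only afterwards, when the write completes, and by then the new device would already be trusted.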

This patch fixes a long-standing corruption issue and so it is
suitable for any -stable kernel.  It applies correctly to 3.0 at
least and will apply with minor editing to earlier kernels.

Reported-by: Bill 
Tested-by: Bill 
Link: http://lkml.kernel.org/r/53A518BB.60709@sbcglobal.net
Signed-off-by: NeilBrown 
Signed-off-by: Greg Kroah-Hartman

md: always set MD_RECOVERY_INTR when interrupting a reshape thread.

2014-06-11T19:03:24+00:00

commit 2ac295a544dcae9299cba13ce250419117ae7fd1 upstream.

Commit 8313b8e57f55b15e5b7f7fc5d1630bbf686a9a97
   md: fix problem when adding device to read-only array with bitmap.

added a call to md_reap_sync_thread() which causes a reshape thread
to be interrupted (in particular, it could cause md_thread() to never even
call md_do_sync()).
However it didn't set MD_RECOVERY_INTR so ->finish_reshape() would not
know that the reshape didn't complete.

This only happens when mddev->ro is set and normally reshape threads
don't run in that situation.  But raid5 and raid10 can start a reshape
thread during "run" if the array is in the middle of a reshape.
They do this even if ->ro is set.

So it is best to set MD_RECOVERY_INTR before aborting the
sync thread, just in case.
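A minimal sketch of why the flag matters, assuming a simplified flag word in place of mddev->recovery. The toy_* names and bit numbers are hypothetical. The only behaviour taken from the text is that ->finish_reshape() is still called when the sync thread is reaped, and it relies on MD_RECOVERY_INTR to know that the reshape did not complete.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit numbers; the kernel has its own MD_RECOVERY_* enum. */
enum { TOY_RECOVERY_INTR, TOY_RECOVERY_RESHAPE };

struct toy_mddev {
    unsigned long recovery;     /* flag word, like mddev->recovery */
    bool new_geometry_active;   /* what ->finish_reshape() would commit */
};

static void toy_set_bit(int nr, unsigned long *word) { *word |= 1UL << nr; }
static bool toy_test_bit(int nr, unsigned long word) { return (word >> nr) & 1UL; }

/* Models ->finish_reshape(): it trusts INTR to say whether the reshape
 * actually ran to completion. */
static void toy_finish_reshape(struct toy_mddev *m)
{
    if (!toy_test_bit(TOY_RECOVERY_INTR, m->recovery))
        m->new_geometry_active = true;
}

/* Models reaping the sync thread: for a reshape, finish_reshape() is
 * called whether or not the thread ever did any work. */
static void toy_reap_sync_thread(struct toy_mddev *m)
{
    if (toy_test_bit(TOY_RECOVERY_RESHAPE, m->recovery))
        toy_finish_reshape(m);
    m->recovery = 0;
}

/* Abort on a read-only array; set_intr = true is the fixed behaviour. */
static void toy_abort_readonly(struct toy_mddev *m, bool set_intr)
{
    if (set_intr)
        toy_set_bit(TOY_RECOVERY_INTR, &m->recovery);
    toy_reap_sync_thread(m);
}
```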

Though it is rare for this to trigger a problem, it can cause data corruption
because the reshape isn't finished properly.
So it is suitable for any -stable kernel to which the offending commit was
applied (3.2 or later).

Fixes: 8313b8e57f55b15e5b7f7fc5d1630bbf686a9a97
Signed-off-by: NeilBrown 
Signed-off-by: Greg Kroah-Hartman

md: always set MD_RECOVERY_INTR when aborting a reshape or other "resync".

2014-06-11T19:03:24+00:00

commit 3991b31ea072b070081ca3bfa860a077eda67de5 upstream.

If mddev->ro is set, md_do_sync will (correctly) abort.
However in that case MD_RECOVERY_INTR isn't set.

If a RESHAPE had been requested, then ->finish_reshape() will be
called and it will think the reshape was successful even though
nothing happened.

Normally a resync will not be requested if ->ro is set, but if an
array is stopped while a reshape is on-going, then when the array is
started, the reshape will be restarted.  If the array is also set
read-only at this point, the reshape will instantly appear to succeed,
resulting in data corruption.

Consequently, this patch is suitable for any -stable kernel.

Signed-off-by: NeilBrown 
Signed-off-by: Greg Kroah-Hartman

md: avoid possible spinning md thread at shutdown.

2014-06-07T20:25:32+00:00

commit 0f62fb220aa4ebabe8547d3a9ce4a16d3c045f21 upstream.

If an md array with externally managed metadata (e.g. DDF or IMSM)
is in use, then we should not set safemode==2 at shutdown because:

1/ this is ineffective: user-space needs to be involved in any 'safemode' handling,
2/ the safemode management code doesn't cope with safemode==2 on external metadata
   and md_check_recovery enters an infinite loop.

Even at shutdown, an infinite-looping process can be problematic, so this
could cause shutdown to hang.
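A loose user-space model of the shutdown decision, assuming a hypothetical external flag for user-space-managed metadata. The bounded loop in toy_check_passes() only stands in for the md_check_recovery() behaviour described above. It reports -1 where the kernel would keep looping, and it is not how md actually tracks clean/dirty state.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model; all names are hypothetical. */
struct toy_mddev {
    bool external;   /* metadata managed by user-space (e.g. DDF, IMSM) */
    int  safemode;
    bool clean;
};

/* Shutdown path: only request safemode==2 when the kernel manages the
 * metadata itself. */
static void toy_shutdown(struct toy_mddev *m)
{
    if (!m->external)
        m->safemode = 2;
}

/* Bounded stand-in for repeated recovery checks.  Returns the pass on
 * which safemode handling settled, or -1 if it never did.  With
 * external metadata only user-space can mark the array clean, so with
 * safemode==2 the kernel keeps finding work it cannot complete. */
static int toy_check_passes(struct toy_mddev *m, int max_passes)
{
    for (int pass = 1; pass <= max_passes; pass++) {
        if (m->safemode != 2 || m->clean)
            return pass;
        if (!m->external)
            m->clean = true;   /* kernel writes clean metadata itself */
    }
    return -1;
}
```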

Signed-off-by: NeilBrown 
Signed-off-by: Greg Kroah-Hartman

md: fix problem when adding device to read-only array with bitmap.

2014-01-25T16:27:12+00:00

commit 8313b8e57f55b15e5b7f7fc5d1630bbf686a9a97 upstream.

If an array is started degraded, and then the missing device
is found it can be re-added and a minimal bitmap-based recovery
will bring it fully up-to-date.

If the array is read-only a recovery would not be allowed.
But also if the array is read-only and the missing device was
present very recently, then there could be no need for any
recovery at all, so we simply include the device in the read-only
array without any recovery.

However... if the missing device was removed a little longer ago
it could be missing some updates, but if a bitmap is present it will
be conditionally accepted pending a bitmap-based update.  We don't
currently detect this case properly and will include that old
device into the read-only array with no recovery even though it really
needs a recovery.

This patch keeps track of whether a bitmap-based-recovery is really
needed or not in the new Bitmap_sync rdev flag.  If that is set,
then the device will not be added to a read-only array.
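The decision can be sketched as follows. This is a simplified model, not the kernel's struct md_rdev. The two booleans only mirror the In_sync and Bitmap_sync flags named above, and the toy_classify() inputs are hypothetical stand-ins for the checks done when a re-added device is validated.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified per-device state; hypothetical, modelled on the In_sync
 * and Bitmap_sync rdev flags. */
struct toy_rdev {
    bool in_sync;       /* may be read from */
    bool bitmap_sync;   /* accepted only pending a bitmap-based recovery */
};

/* Classify a re-added device.  up_to_date: it missed no writes.
 * covered_by_bitmap: it missed writes, but all are recorded in the bitmap. */
static struct toy_rdev toy_classify(bool up_to_date, bool covered_by_bitmap)
{
    struct toy_rdev r = { false, false };
    if (up_to_date) {
        r.in_sync = true;
    } else if (covered_by_bitmap) {
        r.in_sync = true;
        r.bitmap_sync = true;   /* remember that a recovery is still owed */
    }
    return r;                   /* otherwise: needs a full recovery */
}

/* A read-only array cannot run recovery, so only devices that owe
 * nothing may join it. */
static bool toy_can_join_readonly(struct toy_rdev r)
{
    return r.in_sync && !r.bitmap_sync;
}
```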

Cc: Andrei Warkentin 
Fixes: d70ed2e4fafdbef0800e73942482bb075c21578b
Signed-off-by: NeilBrown 
Signed-off-by: Greg Kroah-Hartman

md: fix calculation of stacking limits on level change.

2013-12-04T18:57:15+00:00

commit 02e5f5c0a0f726e66e3d8506ea1691e344277969 upstream.

The various ->run routines of md personalities assume that the 'queue'
has been initialised by the blk_set_stacking_limits() call in
md_alloc().

However when the level is changed (by level_store()) the ->run routine
for the new level is called for an array which has already had the
stacking limits modified.  This can result in incorrect final
settings.

So call blk_set_stacking_limits() before ->run in level_store().

A specific consequence of this bug is that it causes
discard_granularity to be set incorrectly when reshaping a RAID4 to a
RAID0.
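A minimal sketch of the failure. It tracks only discard_granularity and assumes that stacking keeps the larger of the array's and the member's values, which simplifies what blk_stack_limits() does. The toy_* functions and the byte values are illustrative, not taken from the commit.

```c
#include <assert.h>
#include <stdbool.h>

/* Only the field the commit mentions; the real struct queue_limits has
 * many more, and blk_stack_limits() does far more than this. */
struct toy_limits {
    unsigned int discard_granularity;   /* bytes; 0 = no constraint yet */
};

/* Stand-in for blk_set_stacking_limits(): a neutral starting point. */
static void toy_set_stacking_limits(struct toy_limits *t)
{
    t->discard_granularity = 0;
}

/* Stand-in for stacking one member device: keep the larger value. */
static void toy_stack(struct toy_limits *t, unsigned int member)
{
    if (member > t->discard_granularity)
        t->discard_granularity = member;
}

/* Toy RAID0 "->run": assumes *t starts from freshly set stacking limits. */
static void toy_raid0_run(struct toy_limits *t,
                          const unsigned int *members, int n)
{
    for (int i = 0; i < n; i++)
        toy_stack(t, members[i]);
}

/* Toy level_store() switching to RAID0; reset = true models the fix. */
static void toy_level_store(struct toy_limits *t, const unsigned int *members,
                            int n, bool reset)
{
    if (reset)
        toy_set_stacking_limits(t);
    toy_raid0_run(t, members, n);
}
```

Without the reset, a large granularity left behind by the previous RAID4 personality survives into the RAID0 array, because a max() rule can never lower it.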

This is suitable for any -stable kernel since 3.3, in which
blk_set_stacking_limits() was introduced.

Reported-and-tested-by: "Baldysiak, Pawel" 
Signed-off-by: NeilBrown 
Signed-off-by: Greg Kroah-Hartman

md: avoid deadlock when md_set_badblocks.

2013-11-13T03:05:32+00:00

commit 905b0297a9533d7a6ee00a01a990456636877dd6 upstream.

When a hard disk hits errors, md_set_badblocks() can be called from
scsi_restart_operations() with interrupts already disabled.  But
md_set_badblocks() calls write_sequnlock_irq(), which re-enables interrupts,
so a softirq can preempt the current thread and that may cause a deadlock.
I think this situation should use write_seqlock_irqsave()/write_sequnlock_irqrestore()
instead.
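The difference between the two unlock styles can be modelled in user space with a single global standing in for the CPU's interrupt-enable state. Everything here is hypothetical and single-threaded. In the kernel, the corresponding calls are write_seqlock_irq()/write_sequnlock_irq() and write_seqlock_irqsave()/write_sequnlock_irqrestore().

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the local CPU's interrupt-enable state; not kernel code. */
static bool toy_irqs_enabled = true;

/* Like the *_irq variants: unlock re-enables interrupts unconditionally. */
static void toy_lock_irq(void)   { toy_irqs_enabled = false; }
static void toy_unlock_irq(void) { toy_irqs_enabled = true; }

/* Like the *_irqsave/*_irqrestore variants: the caller's previous
 * state is saved and later restored. */
static unsigned long toy_lock_irqsave(void)
{
    unsigned long flags = toy_irqs_enabled ? 1UL : 0UL;
    toy_irqs_enabled = false;
    return flags;
}

static void toy_unlock_irqrestore(unsigned long flags)
{
    toy_irqs_enabled = (flags != 0);
}

/* A bad-block table update done both ways. */
static void toy_update_irq(void)
{
    toy_lock_irq();
    /* ... modify the bad-block table ... */
    toy_unlock_irq();
}

static void toy_update_irqsave(void)
{
    unsigned long flags = toy_lock_irqsave();
    /* ... modify the bad-block table ... */
    toy_unlock_irqrestore(flags);
}
```

If the caller already has interrupts disabled, toy_update_irq() leaves them enabled behind its back, while toy_update_irqsave() returns with the caller's state intact.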

I hit this situation; the call trace is below:
[  638.919974] BUG: spinlock recursion on CPU#0, scsi_eh_13/1010
[  638.921923]  lock: 0xffff8800d4d51fc8, .magic: dead4ead, .owner: scsi_eh_13/1010, .owner_cpu: 0
[  638.923890] CPU: 0 PID: 1010 Comm: scsi_eh_13 Not tainted 3.12.0-rc5+ #37
[  638.925844] Hardware name: To be filled by O.E.M. To be filled by O.E.M./MAHOBAY, BIOS 4.6.5 03/05/2013
[  638.927816]  ffff880037ad4640 ffff880118c03d50 ffffffff8172ff85 0000000000000007
[  638.929829]  ffff8800d4d51fc8 ffff880118c03d70 ffffffff81730030 ffff8800d4d51fc8
[  638.931848]  ffffffff81a72eb0 ffff880118c03d90 ffffffff81730056 ffff8800d4d51fc8
[  638.933884] Call Trace:
[  638.935867]    [] dump_stack+0x55/0x76
[  638.937878]  [] spin_dump+0x8a/0x8f
[  638.939861]  [] spin_bug+0x21/0x26
[  638.941836]  [] do_raw_spin_lock+0xa4/0xc0
[  638.943801]  [] _raw_spin_lock+0x66/0x80
[  638.945747]  [] ? scsi_device_unbusy+0x9d/0xd0
[  638.947672]  [] ? _raw_spin_unlock+0x2b/0x50
[  638.949595]  [] scsi_device_unbusy+0x9d/0xd0
[  638.951504]  [] scsi_finish_command+0x37/0xe0
[  638.953388]  [] scsi_softirq_done+0xa8/0x140
[  638.955248]  [] blk_done_softirq+0x7b/0x90
[  638.957116]  [] __do_softirq+0xfd/0x330
[  638.958987]  [] ? __lock_release+0x6f/0x100
[  638.960861]  [] call_softirq+0x1c/0x30
[  638.962724]  [] do_softirq+0x8d/0xc0
[  638.964565]  [] irq_exit+0x10e/0x150
[  638.966390]  [] smp_apic_timer_interrupt+0x4a/0x60
[  638.968223]  [] apic_timer_interrupt+0x6f/0x80
[  638.970079]    [] ? __lock_release+0x6f/0x100
[  638.971899]  [] ? _raw_spin_unlock_irq+0x3a/0x50
[  638.973691]  [] ? _raw_spin_unlock_irq+0x30/0x50
[  638.975475]  [] md_set_badblocks+0x1f3/0x4a0
[  638.977243]  [] rdev_set_badblocks+0x27/0x80
[  638.978988]  [] raid5_end_read_request+0x36b/0x4e0 [raid456]
[  638.980723]  [] bio_endio+0x1d/0x40
[  638.982463]  [] req_bio_endio.isra.65+0x83/0xa0
[  638.984214]  [] blk_update_request+0x7f/0x350
[  638.985967]  [] blk_update_bidi_request+0x31/0x90
[  638.987710]  [] __blk_end_bidi_request+0x20/0x50
[  638.989439]  [] __blk_end_request_all+0x1f/0x30
[  638.991149]  [] blk_peek_request+0x106/0x250
[  638.992861]  [] ? scsi_kill_request.isra.32+0xe9/0x130
[  638.994561]  [] scsi_request_fn+0x4a/0x3d0
[  638.996251]  [] __blk_run_queue+0x37/0x50
[  638.997900]  [] blk_run_queue+0x2f/0x50
[  638.999553]  [] scsi_run_queue+0xe0/0x1c0
[  639.001185]  [] scsi_run_host_queues+0x21/0x40
[  639.002798]  [] scsi_restart_operations+0x177/0x200
[  639.004391]  [] scsi_error_handler+0xc9/0xe0
[  639.005996]  [] ? scsi_unjam_host+0xd0/0xd0
[  639.007600]  [] kthread+0xdb/0xe0
[  639.009205]  [] ? flush_kthread_worker+0x170/0x170
[  639.010821]  [] ret_from_fork+0x7c/0xb0
[  639.012437]  [] ? flush_kthread_worker+0x170/0x170

This bug was introduced in commit 2e8ac30312973dd20e68073653
(the first time rdev_set_badblocks was called from interrupt context),
so this patch is appropriate for 3.5 and subsequent kernels.

Signed-off-by: Bian Yu 
Reviewed-by: Jianpeng Ma 
Signed-off-by: NeilBrown 
Signed-off-by: Greg Kroah-Hartman

md: Remove recent change which allows devices to skip recovery.

2013-08-04T08:50:53+00:00

commit 5024c298311f3b97c85cb034f9edaa333fdb9338 upstream.

commit 7ceb17e87bde79d285a8b988cfed9eaeebe60b86
    md: Allow devices to be re-added to a read-only array.

allowed a bit more than just that.  It also allows devices to be added
to a read-write array and to end up skipping recovery.

This patch removes the offending piece of code pending a rewrite for a
subsequent release.

More specifically:
 If the array has a bitmap, then the device will still need a bitmap
 based resync ('saved_raid_disk' is set under different conditions
 if a bitmap is present).
 If the array doesn't have a bitmap, then this is correct as long as
 nothing has been written to the array since the metadata was checked
 by ->validate_super.  However there is no locking to ensure that there
 was no write.
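The missing-locking problem is a check-then-act race. Here is a minimal sketch, assuming a single hypothetical generation counter in place of the real event counts and metadata checks done by ->validate_super. The toy_* names are illustrative only.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of an array without a bitmap: 'gen' counts writes, and a
 * device is current only if its gen matches.  Names are hypothetical. */
struct toy_array  { unsigned long gen; };
struct toy_device { unsigned long gen; bool in_sync; };

/* The check, standing in for what ->validate_super concluded. */
static bool toy_looks_current(const struct toy_array *a,
                              const struct toy_device *d)
{
    return d->gen == a->gen;
}

static void toy_write(struct toy_array *a) { a->gen++; }

/* Check-then-act with nothing holding writes off in between.
 * write_slips_in models a write landing after the check. */
static void toy_add_device(struct toy_array *a, struct toy_device *d,
                           bool write_slips_in)
{
    bool ok = toy_looks_current(a, d);
    if (write_slips_in)
        toy_write(a);   /* the not-yet-added device misses this write */
    d->in_sync = ok;    /* acts on a possibly stale answer */
}
```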

The bug was introduced in 3.10 and causes data corruption, so this
patch is suitable for 3.10-stable.

Reported-by: Joe Lawrence 
Signed-off-by: NeilBrown 
Signed-off-by: Greg Kroah-Hartman

Merge tag 'md-3.10-fixes' of git://neil.brown.name/md

2013-06-13T17:13:29+00:00

Pull md bugfixes from Neil Brown:
 "A few bugfixes for md

  Some tagged for -stable"

* tag 'md-3.10-fixes' of git://neil.brown.name/md:
  md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in place
  md/raid1,raid10: use freeze_array in place of raise_barrier in various places.
  md/raid1: consider WRITE as successful only if at least one non-Faulty and non-rebuilding drive completed it.
  md: md_stop_writes() should always freeze recovery.

md: md_stop_writes() should always freeze recovery.

2013-06-13T03:18:15+00:00

__md_stop_writes() will currently sometimes freeze recovery.
So any caller must be ready for that to happen, and indeed they are.

However, if __md_stop_writes() doesn't freeze recovery, then
a recovery could start before mddev_suspend() is called, which
could be awkward.  This can particularly cause problems for dm-raid.

So change __md_stop_writes() to always freeze recovery.  This is safe
and more predictable.
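Here is a sketch of the window. It assumes that the old code froze recovery only when a sync thread was already running, which is an assumption standing in for "sometimes". The toy_* names are hypothetical. toy_check_recovery() stands in for md_check_recovery() being able to start a recovery at any point before mddev_suspend().

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model; fields loosely mirror MD_RECOVERY_FROZEN and a spare
 * waiting to be recovered.  Names are hypothetical. */
struct toy_mddev {
    bool frozen;
    bool spare_waiting;
    bool recovery_running;
};

/* May start recovery at any time unless recovery is frozen. */
static void toy_check_recovery(struct toy_mddev *m)
{
    if (!m->frozen && m->spare_waiting)
        m->recovery_running = true;
}

/* Assumed old behaviour: freeze only if a sync was running.
 * always_freeze = true models the fix. */
static void toy_stop_writes(struct toy_mddev *m, bool always_freeze)
{
    if (always_freeze || m->recovery_running)
        m->frozen = true;
    m->recovery_running = false;   /* any running sync is interrupted and reaped */
}
```

In the first case below, recovery starts in the gap between stopping writes and suspending the array. In the second case it cannot.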

Reported-by: Brassow Jonathan 
Tested-by: Brassow Jonathan 
Signed-off-by: NeilBrown