linux.git/fs/aio.c, branch v4.9.262

aio: fix spectre gadget in lookup_ioctx

2018-12-21T13:11:31+00:00

commit a538e3ff9dabcdf6c3f477a373c629213d1c3066 upstream.

Matthew pointed out that the ioctx_table is susceptible to spectre v1,
because the index can be controlled by an attacker.  The below patch
should mitigate the attack for all of the aio system calls.
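
As an illustration only: the kernel provides array_index_nospec() in
<linux/nospec.h> for clamping an index after a bounds check.  The userspace
sketch below mimics the generic branch-free mask computation; the names
index_mask_nospec() and table_lookup() are hypothetical, and a userspace
compiler gives none of the guarantees the kernel helper is built to provide.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/*
 * Branch-free mask: all ones when index < size, zero otherwise.
 * Assumes size <= LONG_MAX and an arithmetic right shift of negative
 * signed values (true for GCC and Clang).
 */
static unsigned long index_mask_nospec(unsigned long index, unsigned long size)
{
	return ~(long)(index | (size - 1UL - index)) >> (sizeof(long) * CHAR_BIT - 1);
}

/* Hypothetical table lookup: bounds check first, then clamp the index. */
static void *table_lookup(void *const *slots, unsigned long nr, unsigned long id)
{
	assert(slots != NULL);
	if (id >= nr)
		return NULL;
	/* Even if the branch above is mispredicted, id is forced into range. */
	id &= index_mask_nospec(id, nr);
	return slots[id];
}
```

The point of the mask is that an out-of-range id becomes 0 by data dependency
rather than by a branch the CPU may speculate past.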

Cc: stable@vger.kernel.org
Reported-by: Matthew Wilcox 
Reported-by: Dan Carpenter 
Signed-off-by: Jeff Moyer 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

fix io_destroy()/aio_complete() race

2018-06-06T14:44:38+00:00

commit 4faa99965e027cc057c5145ce45fa772caa04e8d upstream.

If io_destroy() gets to cancelling everything that can be cancelled and
gets to kiocb_cancel() calling the function the driver has left in ->ki_cancel,
it becomes vulnerable to a race with IO completion.  At that point req
is already taken off the list and aio_complete() does *NOT* spin until
we (in free_ioctx_users()) release ->ctx_lock.  As a result, it proceeds
to kiocb_free(), freeing req just as it gets passed to ->ki_cancel().

The fix is simple - remove from the list after the call of kiocb_cancel().  All
instances of ->ki_cancel() already have to cope with being called with the
iocb still on the list - that's what happens in io_cancel(2).
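
A minimal userspace sketch of that ordering, assuming a hand-rolled circular
list; struct request, record_cancel() and cancel_all() are hypothetical
stand-ins, not the kernel code.  The cancel callback runs while the request is
still linked, and only then is the request taken off the list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct node { struct node *prev, *next; };

struct request {
	struct node link;                    /* must stay the first member */
	void (*cancel)(struct request *req);
	bool was_linked_at_cancel;
};

static void list_init(struct node *n) { n->prev = n->next = n; }
static bool list_is_empty(const struct node *n) { return n->next == n; }

static void list_add_tail(struct node *head, struct node *n)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static void list_unlink(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	list_init(n);
}

/* Example callback: records whether the request was still on a list. */
static void record_cancel(struct request *req)
{
	req->was_linked_at_cancel = !list_is_empty(&req->link);
}

/* Cancel first, unlink second, so the request stays visible on the list. */
static void cancel_all(struct node *active)
{
	while (!list_is_empty(active)) {
		struct request *req = (struct request *)active->next;

		assert(req->cancel != NULL);
		req->cancel(req);
		list_unlink(&req->link);
	}
}
```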

Cc: stable@kernel.org
Fixes: 0460fef2a921 "aio: use cancellation list lazily"
Signed-off-by: Al Viro 
Signed-off-by: Greg Kroah-Hartman

aio: fix io_destroy(2) vs. lookup_ioctx() race

2018-05-30T05:50:16+00:00

commit baf10564fbb66ea222cae66fbff11c444590ffd9 upstream.

kill_ioctx() used to have an explicit RCU delay between removing the
reference from ->ioctx_table and percpu_ref_kill() dropping the refcount.
At some point that delay had been removed, on the theory that
percpu_ref_kill() itself contained an RCU delay.  Unfortunately, that was
the wrong kind of RCU delay and it didn't care about rcu_read_lock() used
by lookup_ioctx().  As a result, we could get ctx freed right under
lookup_ioctx().  Tejun has fixed that in a6d7cff472e ("fs/aio: Add explicit
RCU grace period when freeing kioctx"); however, that fix is not enough.

Suppose io_destroy() from one thread races with e.g. io_setup() from another;
CPU1 removes the reference from current->mm->ioctx_table[...] just as CPU2
has picked it (under rcu_read_lock()).  Then CPU1 proceeds to drop the
refcount, getting it to 0 and triggering a call of free_ioctx_users(),
which proceeds to drop the secondary refcount and once that reaches zero
calls free_ioctx_reqs().  That does
        INIT_RCU_WORK(&ctx->free_rwork, free_ioctx);
        queue_rcu_work(system_wq, &ctx->free_rwork);
and schedules freeing the whole thing after RCU delay.

Meanwhile, CPU2 has gotten around to percpu_ref_get(), bumping the
refcount from 0 to 1, and has returned the reference to io_setup().

Tejun's fix (that queue_rcu_work() in there) guarantees that ctx won't get
freed until after percpu_ref_get().  Sure, we'd increment the counter before
ctx can be freed.  Now we are out of rcu_read_lock() and there's nothing to
stop freeing of the whole thing.  Unfortunately, CPU2 assumes that since it
has grabbed the reference, ctx is *NOT* going away until it gets around to
dropping that reference.

The fix is obvious - use percpu_ref_tryget_live() and treat failure as a miss.
It's not costlier than what we currently do in the normal case, it's safe to
call since freeing *is* delayed and it closes the race window - either
lookup_ioctx() comes before percpu_ref_kill() (in which case ctx->users
won't reach 0 until the caller of lookup_ioctx() drops it) or lookup_ioctx()
fails, ctx->users is unaffected and the caller of lookup_ioctx() doesn't see
the object in question at all.
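
To illustrate the "tryget live" semantics, here is a userspace analog built on
C11 atomics.  It is only a sketch: percpu_ref is implemented very differently
(per-CPU counters), and struct ref, REF_DEAD and the ref_*() helpers are
hypothetical names.  The dead flag and the count share one word, so a getter
either sees the object alive and takes a reference, or fails without touching
the count.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define REF_DEAD (1u << 31)	/* set once the object has been killed */

struct ref { atomic_uint state; };	/* low bits: count, top bit: dead */

/* Start with one reference, owned by whoever will later kill the object. */
static void ref_init(struct ref *r) { atomic_init(&r->state, 1); }

/* Take a reference only while the object is still live. */
static bool ref_tryget_live(struct ref *r)
{
	unsigned int old = atomic_load(&r->state);

	do {
		if (old & REF_DEAD)
			return false;	/* treat as a lookup miss */
	} while (!atomic_compare_exchange_weak(&r->state, &old, old + 1));
	return true;
}

/* Mark the object dead; new tryget_live() callers fail from here on. */
static void ref_kill(struct ref *r) { atomic_fetch_or(&r->state, REF_DEAD); }

/* Drop one reference; returns true when the last one is gone. */
static bool ref_put(struct ref *r)
{
	unsigned int old = atomic_fetch_sub(&r->state, 1);

	assert((old & ~REF_DEAD) != 0);	/* catch underflow */
	return (old & ~REF_DEAD) == 1;
}
```

A plain "get" in this scheme would happily bump a dead object from 0 back to
1, which is the failure mode described above.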

Cc: stable@kernel.org
Fixes: a6d7cff472e "fs/aio: Add explicit RCU grace period when freeing kioctx"
Signed-off-by: Al Viro 
Signed-off-by: Greg Kroah-Hartman

fs/aio: Use RCU accessors for kioctx_table->table[]

2018-03-22T08:18:00+00:00

commit d0264c01e7587001a8c4608a5d1818dba9a4c11a upstream.

While converting ioctx index from a list to a table, db446a08c23d
("aio: convert the ioctx list to table lookup v3") missed tagging
kioctx_table->table[] as an array of RCU pointers and using the
appropriate RCU accessors.  This introduces a small window in the
lookup path where init and access may race.

Mark kioctx_table->table[] with __rcu and use the appropriate RCU
accessors when using the field.
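
For readers unfamiliar with the accessors, a rough userspace analog using C11
atomics follows.  This is an assumption-laden sketch, not RCU: the release
store stands in for rcu_assign_pointer() and the acquire load is a stronger
stand-in for rcu_dereference(); struct ctx_table, TABLE_SLOTS and the table_*()
helpers are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define TABLE_SLOTS 4

struct ctx { unsigned long user_id; };

struct ctx_table {
	unsigned int nr;
	_Atomic(struct ctx *) slots[TABLE_SLOTS];
};

static void table_init(struct ctx_table *t)
{
	t->nr = TABLE_SLOTS;
	for (unsigned int i = 0; i < TABLE_SLOTS; i++)
		atomic_init(&t->slots[i], NULL);
}

/* Publish a fully initialised ctx; earlier writes to *c become visible. */
static void table_publish(struct ctx_table *t, unsigned int i, struct ctx *c)
{
	assert(i < TABLE_SLOTS);
	atomic_store_explicit(&t->slots[i], c, memory_order_release);
}

/* Reader side: never observes the pointer before the ctx initialisation. */
static struct ctx *table_peek(struct ctx_table *t, unsigned int i)
{
	if (i >= t->nr)
		return NULL;
	return atomic_load_explicit(&t->slots[i], memory_order_acquire);
}
```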

Signed-off-by: Tejun Heo 
Reported-by: Jann Horn 
Fixes: db446a08c23d ("aio: convert the ioctx list to table lookup v3")
Cc: Benjamin LaHaise 
Cc: Linus Torvalds 
Cc: stable@vger.kernel.org # v3.12+
Signed-off-by: Greg Kroah-Hartman

fs/aio: Add explicit RCU grace period when freeing kioctx

2018-03-22T08:18:00+00:00

commit a6d7cff472eea87d96899a20fa718d2bab7109f3 upstream.

While fixing refcounting, e34ecee2ae79 ("aio: Fix a trinity splat")
incorrectly removed the explicit RCU grace period before freeing kioctx.
The intention seems to have been to rely on the internal RCU grace periods
of percpu_ref; however, percpu_ref uses a different flavor of RCU,
sched-RCU.  This can lead to kioctx being freed while RCU read
protected dereferences are still in progress.

Fix it by updating free_ioctx() to go through call_rcu() explicitly.

v2: Comment added to explain double bouncing.

Signed-off-by: Tejun Heo 
Reported-by: Jann Horn 
Fixes: e34ecee2ae79 ("aio: Fix a trinity splat")
Cc: Kent Overstreet 
Cc: Linus Torvalds 
Cc: stable@vger.kernel.org # v3.13+
Signed-off-by: Greg Kroah-Hartman

aio: fix lock dep warning

2017-07-05T12:40:26+00:00

[ Upstream commit a12f1ae61c489076a9aeb90bddca7722bf330df3 ]

lockdep reports a warning. file_start_write/file_end_write only
acquire/release the lock for regular files, so check for regular files
on the aio side too.

[  453.532141] ------------[ cut here ]------------
[  453.533011] WARNING: CPU: 1 PID: 1298 at ../kernel/locking/lockdep.c:3514 lock_release+0x434/0x670
[  453.533011] DEBUG_LOCKS_WARN_ON(depth <= 0)
[  453.533011] Modules linked in:
[  453.533011] CPU: 1 PID: 1298 Comm: fio Not tainted 4.9.0+ #964
[  453.533011] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 04/01/2014
[  453.533011]  ffff8803a24b7a70 ffffffff8196cffb ffff8803a24b7ae8 0000000000000000
[  453.533011]  ffff8803a24b7ab8 ffffffff81091ee1 ffff8803a5dba700 00000dba00000008
[  453.533011]  ffffed0074496f59 ffff8803a5dbaf54 ffff8803ae0f8488 fffffffffffffdef
[  453.533011] Call Trace:
[  453.533011]  [] dump_stack+0x67/0x9c
[  453.533011]  [] __warn+0x111/0x130
[  453.533011]  [] warn_slowpath_fmt+0x97/0xb0
[  453.533011]  [] ? __warn+0x130/0x130
[  453.533011]  [] ? blk_finish_plug+0x29/0x60
[  453.533011]  [] lock_release+0x434/0x670
[  453.533011]  [] ? import_single_range+0xd4/0x110
[  453.533011]  [] ? rw_verify_area+0x65/0x140
[  453.533011]  [] ? aio_write+0x1f6/0x280
[  453.533011]  [] aio_write+0x229/0x280
[  453.533011]  [] ? aio_complete+0x640/0x640
[  453.533011]  [] ? debug_check_no_locks_freed+0x1a0/0x1a0
[  453.533011]  [] ? debug_lockdep_rcu_enabled.part.2+0x1a/0x30
[  453.533011]  [] ? debug_lockdep_rcu_enabled+0x35/0x40
[  453.533011]  [] ? __might_fault+0x7e/0xf0
[  453.533011]  [] do_io_submit+0x94c/0xb10
[  453.533011]  [] ? do_io_submit+0x23e/0xb10
[  453.533011]  [] ? SyS_io_destroy+0x270/0x270
[  453.533011]  [] ? mark_held_locks+0x23/0xc0
[  453.533011]  [] ? trace_hardirqs_on_thunk+0x1a/0x1c
[  453.533011]  [] SyS_io_submit+0x10/0x20
[  453.533011]  [] entry_SYSCALL_64_fastpath+0x18/0xad
[  453.533011]  [] ? trace_hardirqs_off_caller+0xc0/0x110
[  453.533011] ---[ end trace b2fbe664d1cc0082 ]---
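
The imbalance behind the warning above can be sketched in userspace as
follows.  All names here (sb_stub, wfile_stub, stub_start_write(),
stub_aio_release()) are hypothetical; the only point is that the release side
must apply the same regular-file check as the acquire side, otherwise the
depth drops below zero for non-regular files.

```c
#include <assert.h>
#include <stdbool.h>

struct sb_stub { int depth; };	/* stand-in for the lockdep depth */

struct wfile_stub {
	bool regular;		/* would be S_ISREG(inode->i_mode) */
	struct sb_stub *sb;
};

/* Like file_start_write(): only regular files take the lock. */
static void stub_start_write(struct wfile_stub *f)
{
	if (f->regular)
		f->sb->depth++;
}

/* aio-side release, guarded by the same regular-file check. */
static void stub_aio_release(struct wfile_stub *f)
{
	if (!f->regular)
		return;
	/* Analogous to DEBUG_LOCKS_WARN_ON(depth <= 0). */
	assert(f->sb->depth > 0);
	f->sb->depth--;
}
```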

Cc: Dmitry Monakhov 
Cc: Jan Kara 
Cc: Christoph Hellwig 
Cc: Al Viro 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Shaohua Li 
Signed-off-by: Al Viro 
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman

aio: fix freeze protection of aio writes

2016-10-30T17:09:42+00:00

Currently we drop freeze protection of aio writes right after the IO is
submitted. Thus an aio write can be in flight while the filesystem is
frozen, which can result in unexpected situations such as aio completion
wanting to convert extent type on a frozen filesystem. A testcase from
Dmitry triggering this looks like:

for ((i=0;i<60;i++));do fsfreeze -f /mnt ;sleep 1;fsfreeze -u /mnt;done &
fio --bs=4k --ioengine=libaio --iodepth=128 --size=1g --direct=1 \
    --runtime=60 --filename=/mnt/file --name=rand-write --rw=randwrite

Fix the problem by dropping freeze protection only once IO is completed
in aio_complete().

Reported-by: Dmitry Monakhov 
Signed-off-by: Jan Kara 
[hch: forward ported on top of various VFS and aio changes]
Signed-off-by: Christoph Hellwig 
Signed-off-by: Al Viro

fs: remove aio_run_iocb

2016-10-30T17:09:42+00:00

Pass the ABI iocb structure to aio_setup_rw and let it handle the
non-vectored I/O case as well.  With that and a new helper for the AIO
return value handling we can now define new aio_read and aio_write
helpers that implement reads and writes in a self-contained way without
duplicating too much code.

Signed-off-by: Christoph Hellwig 
Signed-off-by: Al Viro

fs: remove the never implemented aio_fsync file operation

2016-10-30T17:09:42+00:00

Signed-off-by: Christoph Hellwig 
Signed-off-by: Al Viro

aio: hold an extra file reference over AIO read/write operations

2016-10-30T17:09:42+00:00

Otherwise we might dereference an already freed file and/or inode
when aio_complete is called before we return from the read_iter or
write_iter method.
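
A toy model of the problem, assuming a simple non-atomic counter; rfile_stub,
stub_get(), stub_put(), method_completes_early() and submit() are all
hypothetical.  The method drops the request's reference before returning, so
the submitter can only keep touching the file because it took an extra
reference beforehand.

```c
#include <assert.h>
#include <stdbool.h>

struct rfile_stub { int refs; bool freed; };

static void stub_get(struct rfile_stub *f) { f->refs++; }

static void stub_put(struct rfile_stub *f)
{
	assert(f->refs > 0);
	if (--f->refs == 0)
		f->freed = true;	/* a real implementation would free here */
}

/*
 * Stand-in for a read_iter/write_iter method whose I/O completes, and
 * drops the request's file reference, before the method returns.
 */
static void method_completes_early(struct rfile_stub *f) { stub_put(f); }

/* Returns whether the file was still alive after the method returned. */
static bool submit(struct rfile_stub *f)
{
	bool alive_after_method;

	stub_get(f);			/* extra reference held over the call */
	method_completes_early(f);
	alive_after_method = !f->freed;	/* safe to look at f thanks to it */
	stub_put(f);			/* drop the extra reference */
	return alive_after_method;
}
```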

Signed-off-by: Christoph Hellwig 
Signed-off-by: Al Viro