linux.git/include/linux, branch v4.4.69

block: get rid of blk_integrity_revalidate()

2017-05-14T11:32:59+00:00

commit 19b7ccf8651df09d274671b53039c672a52ad84d upstream.

Commit 25520d55cdb6 ("block: Inline blk_integrity in struct gendisk")
introduced blk_integrity_revalidate(), which seems to assume ownership
of the stable pages flag and unilaterally clears it if no blk_integrity
profile is registered:

    if (bi->profile)
            disk->queue->backing_dev_info->capabilities |=
                    BDI_CAP_STABLE_WRITES;
    else
            disk->queue->backing_dev_info->capabilities &=
                    ~BDI_CAP_STABLE_WRITES;

It's called from revalidate_disk() and rescan_partitions(), making it
impossible to enable stable pages for drivers that support partitions
and don't use blk_integrity: while the call in revalidate_disk() can be
trivially worked around (see zram, which doesn't support partitions and
hence gets away with zram_revalidate_disk()), rescan_partitions() can
be triggered from userspace at any time.  This breaks rbd, where the
ceph messenger is responsible for generating/verifying CRCs.

Since blk_integrity_{un,}register() "must" be used for (un)registering
the integrity profile with the block layer, move BDI_CAP_STABLE_WRITES
setting there.  This way drivers that call blk_integrity_register() and
use integrity infrastructure won't interfere with drivers that don't
but still want stable pages.

Fixes: 25520d55cdb6 ("block: Inline blk_integrity in struct gendisk")
Cc: "Martin K. Petersen" 
Cc: Christoph Hellwig 
Cc: Mike Snitzer 
Tested-by: Dan Williams 
Signed-off-by: Ilya Dryomov 
[idryomov@gmail.com: backport to < 4.11: bdi is embedded in queue]
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

f2fs: sanity check segment count

2017-05-14T11:32:59+00:00

commit b9dd46188edc2f0d1f37328637860bb65a771124 upstream.

F2FS uses 4 bytes to represent block address. As a result, supported
size of disk is 16 TB and it equals to 16 * 1024 * 1024 / 2 segments.

Signed-off-by: Jin Qian 
Signed-off-by: Jaegeuk Kim 
Signed-off-by: Greg Kroah-Hartman

usb: chipidea: Handle extcon events properly

2017-05-14T11:32:56+00:00

commit a89b94b53371bbfa582787c2fa3378000ea4263d upstream.

We're currently emulating the vbus and id interrupts in the OTGSC
read API, but we also need to make sure that if we're handling
the events with extcon that we don't enable the interrupts for
those events in the hardware. Therefore, properly emulate this
register if we're using extcon, but don't enable the interrupts.
This allows me to get my cable connect/disconnect working
properly without getting spurious interrupts on my device that
uses an extcon for these two events.

Acked-by: Peter Chen 
Cc: Greg Kroah-Hartman 
Cc: "Ivan T. Ivanov" 
Fixes: 3ecb3e09b042 ("usb: chipidea: Use extcon framework for VBUS and ID detect")
Signed-off-by: Stephen Boyd 
Signed-off-by: Peter Chen 
Signed-off-by: Greg Kroah-Hartman

mtd: avoid stack overflow in MTD CFI code

2017-05-08T05:46:01+00:00

commit fddcca5107051adf9e4481d2a79ae0616577fd2c upstream.

When map_word gets too large, we use a lot of kernel stack, and for
MTD_MAP_BANK_WIDTH_32, this means we use more than the recommended
1024 bytes in a number of functions:

drivers/mtd/chips/cfi_cmdset_0020.c: In function 'cfi_staa_write_buffers':
drivers/mtd/chips/cfi_cmdset_0020.c:651:1: warning: the frame size of 1336 bytes is larger than 1024 bytes [-Wframe-larger-than=]
drivers/mtd/chips/cfi_cmdset_0020.c: In function 'cfi_staa_erase_varsize':
drivers/mtd/chips/cfi_cmdset_0020.c:972:1: warning: the frame size of 1208 bytes is larger than 1024 bytes [-Wframe-larger-than=]
drivers/mtd/chips/cfi_cmdset_0001.c: In function 'do_write_buffer':
drivers/mtd/chips/cfi_cmdset_0001.c:1835:1: warning: the frame size of 1240 bytes is larger than 1024 bytes [-Wframe-larger-than=]

This can be avoided if all operations on the map word are done
indirectly and the stack gets reused between the calls. We can
mostly achieve this by selecting MTD_COMPLEX_MAPPINGS whenever
MTD_MAP_BANK_WIDTH_32 is set, but for the case that no other
bank width is enabled, we also need to use a non-constant
map_bankwidth() to convince the compiler to use less stack.

Signed-off-by: Arnd Bergmann 
[Brian: this patch mostly achieves its goal by forcing
    MTD_COMPLEX_MAPPINGS (and the accompanying indirection) for 256-bit
    mappings; the rest of the change is mostly a wash, though it helps
    reduce stack size slightly. If we really care about supporting
    256-bit mappings though, we should consider rewriting some of this
    code to avoid keeping and assigning so many 256-bit objects on the
    stack.]
Signed-off-by: Brian Norris 
Signed-off-by: Greg Kroah-Hartman

mnt: Add a per mount namespace limit on the number of mounts

2017-04-30T03:49:28+00:00

commit d29216842a85c7970c536108e093963f02714498 upstream.

CAI Qian  pointed out that the semantics
of shared subtrees make it possible to create an exponentially
increasing number of mounts in a mount namespace.

    mkdir /tmp/1 /tmp/2
    mount --make-rshared /
    for i in $(seq 1 20) ; do mount --bind /tmp/1 /tmp/2 ; done

Will create create 2^20 or 1048576 mounts, which is a practical problem
as some people have managed to hit this by accident.

As such CVE-2016-6213 was assigned.

Ian Kent  described the situation for autofs users
as follows:

> The number of mounts for direct mount maps is usually not very large because of
> the way they are implemented, large direct mount maps can have performance
> problems. There can be anywhere from a few (likely case a few hundred) to less
> than 10000, plus mounts that have been triggered and not yet expired.
>
> Indirect mounts have one autofs mount at the root plus the number of mounts that
> have been triggered and not yet expired.
>
> The number of autofs indirect map entries can range from a few to the common
> case of several thousand and in rare cases up to between 30000 and 50000. I've
> not heard of people with maps larger than 50000 entries.
>
> The larger the number of map entries the greater the possibility for a large
> number of active mounts so it's not hard to expect cases of a 1000 or somewhat
> more active mounts.

So I am setting the default number of mounts allowed per mount
namespace at 100,000.  This is more than enough for any use case I
know of, but small enough to quickly stop an exponential increase
in mounts.  Which should be perfect to catch misconfigurations and
malfunctioning programs.

For anyone who needs a higher limit this can be changed by writing
to the new /proc/sys/fs/mount-max sysctl.

Tested-by: CAI Qian 
Signed-off-by: "Eric W. Biederman" 
[bwh: Backported to 4.4: adjust context]
Signed-off-by: Ben Hutchings 
Signed-off-by: Greg Kroah-Hartman

cgroup, kthread: close race window where new kthreads can be migrated to non-root cgroups

2017-04-21T07:30:04+00:00

commit 77f88796cee819b9c4562b0b6b44691b3b7755b1 upstream.

Creation of a kthread goes through a couple interlocked stages between
the kthread itself and its creator.  Once the new kthread starts
running, it initializes itself and wakes up the creator.  The creator
then can further configure the kthread and then let it start doing its
job by waking it up.

In this configuration-by-creator stage, the creator is the only one
that can wake it up but the kthread is visible to userland.  When
altering the kthread's attributes from userland is allowed, this is
fine; however, for cases where CPU affinity is critical,
kthread_bind() is used to first disable affinity changes from userland
and then set the affinity.  This also prevents the kthread from being
migrated into non-root cgroups as that can affect the CPU affinity and
many other things.

Unfortunately, the cgroup side of protection is racy.  While the
PF_NO_SETAFFINITY flag prevents further migrations, userland can win
the race before the creator sets the flag with kthread_bind() and put
the kthread in a non-root cgroup, which can lead to all sorts of
problems including incorrect CPU affinity and starvation.

This bug got triggered by userland which periodically tries to migrate
all processes in the root cpuset cgroup to a non-root one.  Per-cpu
workqueue workers got caught while being created and ended up with
incorrected CPU affinity breaking concurrency management and sometimes
stalling workqueue execution.

This patch adds task->no_cgroup_migration which disallows the task to
be migrated by userland.  kthreadd starts with the flag set making
every child kthread start in the root cgroup with migration
disallowed.  The flag is cleared after the kthread finishes
initialization by which time PF_NO_SETAFFINITY is set if the kthread
should stay in the root cgroup.

It'd be better to wait for the initialization instead of failing but I
couldn't think of a way of implementing that without adding either a
new PF flag, or sleeping and retrying from waiting side.  Even if
userland depends on changing cgroup membership of a kthread, it either
has to be synchronized with kthread_create() or periodically repeat,
so it's unlikely that this would break anything.

v2: Switch to a simpler implementation using a new task_struct bit
    field suggested by Oleg.

Signed-off-by: Tejun Heo 
Suggested-by: Oleg Nesterov 
Cc: Linus Torvalds 
Cc: Andrew Morton 
Cc: Peter Zijlstra (Intel) 
Cc: Thomas Gleixner 
Reported-and-debugged-by: Chris Mason 
Signed-off-by: Tejun Heo 
Signed-off-by: Greg Kroah-Hartman

KVM: kvm_io_bus_unregister_dev() should never fail

2017-04-08T07:53:32+00:00

commit 90db10434b163e46da413d34db8d0e77404cc645 upstream.

No caller currently checks the return value of
kvm_io_bus_unregister_dev(). This is evil, as all callers silently go on
freeing their device. A stale reference will remain in the io_bus,
getting at least used again, when the iobus gets teared down on
kvm_destroy_vm() - leading to use after free errors.

There is nothing the callers could do, except retrying over and over
again.

So let's simply remove the bus altogether, print an error and make
sure no one can access this broken bus again (returning -ENOMEM on any
attempt to access it).

Fixes: e93f8a0f821e ("KVM: convert io_bus to SRCU")
Reported-by: Dmitry Vyukov 
Reviewed-by: Cornelia Huck 
Signed-off-by: David Hildenbrand 
Signed-off-by: Paolo Bonzini 
Signed-off-by: Greg Kroah-Hartman

usb-core: Add LINEAR_FRAME_INTR_BINTERVAL USB quirk

2017-03-30T07:35:16+00:00

commit 3243367b209faed5c320a4e5f9a565ee2a2ba958 upstream.

Some USB 2.0 devices erroneously report millisecond values in
bInterval. The generic config code manages to catch most of them,
but in some cases it's not completely enough.

The case at stake here is a USB 2.0 braille device, which wants to
announce 10ms and thus sets bInterval to 10, but with the USB 2.0
computation that yields to 64ms.  It happens that one can type fast
enough to reach this interval and get the device buffers overflown,
leading to problematic latencies.  The generic config code does not
catch this case because the 64ms is considered a sane enough value.

This change thus adds a USB_QUIRK_LINEAR_FRAME_INTR_BINTERVAL quirk
to mark devices which actually report milliseconds in bInterval,
and marks Vario Ultra devices as needing it.

Signed-off-by: Samuel Thibault 
Acked-by: Alan Stern 
Signed-off-by: Greg Kroah-Hartman

give up on gcc ilog2() constant optimizations

2017-03-26T10:13:18+00:00

commit 474c90156c8dcc2fa815e6716cc9394d7930cb9c upstream.

gcc-7 has an "optimization" pass that completely screws up, and
generates the code expansion for the (impossible) case of calling
ilog2() with a zero constant, even when the code gcc compiles does not
actually have a zero constant.

And we try to generate a compile-time error for anybody doing ilog2() on
a constant where that doesn't make sense (be it zero or negative).  So
now gcc7 will fail the build due to our sanity checking, because it
created that constant-zero case that didn't actually exist in the source
code.

There's a whole long discussion on the kernel mailing about how to work
around this gcc bug.  The gcc people themselevs have discussed their
"feature" in

   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=72785

but it's all water under the bridge, because while it looked at one
point like it would be solved by the time gcc7 was released, that was
not to be.

So now we have to deal with this compiler braindamage.

And the only simple approach seems to be to just delete the code that
tries to warn about bad uses of ilog2().

So now "ilog2()" will just return 0 not just for the value 1, but for
any non-positive value too.

It's not like I can recall anybody having ever actually tried to use
this function on any invalid value, but maybe the sanity check just
meant that such code never made it out in public.

Reported-by: Laura Abbott 
Cc: John Stultz ,
Cc: Thomas Gleixner 
Cc: Ard Biesheuvel 
Signed-off-by: Linus Torvalds 
Cc: Jiri Slaby 
Signed-off-by: Greg Kroah-Hartman

usb: core: hub: hub_port_init lock controller instead of bus

2017-03-26T10:13:16+00:00

commit feb26ac31a2a5cb88d86680d9a94916a6343e9e6 upstream.

The XHCI controller presents two USB buses to the system - one for USB2
and one for USB3. The hub init code (hub_port_init) is reentrant but
only locks one bus per thread, leading to a race condition failure when
two threads attempt to simultaneously initialise a USB2 and USB3 device:

[    8.034843] xhci_hcd 0000:00:14.0: Timeout while waiting for setup device command
[   13.183701] usb 3-3: device descriptor read/all, error -110

On a test system this failure occurred on 6% of all boots.

The call traces at the point of failure are:

Call Trace:
 [] schedule+0x37/0x90
 [] usb_kill_urb+0x8d/0xd0
 [] ? wake_up_atomic_t+0x30/0x30
 [] usb_start_wait_urb+0xbe/0x150
 [] usb_control_msg+0xbc/0xf0
 [] hub_port_init+0x51e/0xb70
 [] hub_event+0x817/0x1570
 [] process_one_work+0x1ff/0x620
 [] ? process_one_work+0x15f/0x620
 [] worker_thread+0x64/0x4b0
 [] ? rescuer_thread+0x390/0x390
 [] kthread+0x105/0x120
 [] ? kthread_create_on_node+0x200/0x200
 [] ret_from_fork+0x3f/0x70
 [] ? kthread_create_on_node+0x200/0x200

Call Trace:
 [] xhci_setup_device+0x53d/0xa40
 [] xhci_address_device+0xe/0x10
 [] hub_port_init+0x1bf/0xb70
 [] ? trace_hardirqs_on+0xd/0x10
 [] hub_event+0x817/0x1570
 [] process_one_work+0x1ff/0x620
 [] ? process_one_work+0x15f/0x620
 [] worker_thread+0x64/0x4b0
 [] ? rescuer_thread+0x390/0x390
 [] kthread+0x105/0x120
 [] ? kthread_create_on_node+0x200/0x200
 [] ret_from_fork+0x3f/0x70
 [] ? kthread_create_on_node+0x200/0x200

Which results from the two call chains:

hub_port_init
 usb_get_device_descriptor
  usb_get_descriptor
   usb_control_msg
    usb_internal_control_msg
     usb_start_wait_urb
      usb_submit_urb / wait_for_completion_timeout / usb_kill_urb

hub_port_init
 hub_set_address
  xhci_address_device
   xhci_setup_device

Mathias Nyman explains the current behaviour violates the XHCI spec:

 hub_port_reset() will end up moving the corresponding xhci device slot
 to default state.

 As hub_port_reset() is called several times in hub_port_init() it
 sounds reasonable that we could end up with two threads having their
 xhci device slots in default state at the same time, which according to
 xhci 4.5.3 specs still is a big no no:

 "Note: Software shall not transition more than one Device Slot to the
  Default State at a time"

 So both threads fail at their next task after this.
 One fails to read the descriptor, and the other fails addressing the
 device.

Fix this in hub_port_init by locking the USB controller (instead of an
individual bus) to prevent simultaneous initialisation of both buses.

Fixes: 638139eb95d2 ("usb: hub: allow to process more usb hub events in parallel")
Link: https://lkml.org/lkml/2016/2/8/312
Link: https://lkml.org/lkml/2016/2/4/748
Signed-off-by: Chris Bainbridge 
Cc: stable 
Acked-by: Mathias Nyman 
Signed-off-by: Sumit Semwal 
 [sumits: minor merge conflict resolution for linux-4.4.y]
Signed-off-by: Greg Kroah-Hartman