linux.git/kernel/sysctl.c, branch v4.4.177

kernel/sysctl.c: add missing range check in do_proc_dointvec_minmax_conv

2019-03-23T07:44:37+00:00

commit 8cf7630b29701d364f8df4a50e4f1f5e752b2778 upstream.

This bug has apparently existed since the introduction of this function
in the pre-git era (4500e91754d3 in Thomas Gleixner's history.git,
"[NET]: Add proc_dointvec_userhz_jiffies, use it for proper handling of
neighbour sysctls.").

As a minimal fix we can simply duplicate the corresponding check in
do_proc_dointvec_conv().

Link: http://lkml.kernel.org/r/20190207123426.9202-3-zev@bewilderbeest.net
Signed-off-by: Zev Weiss 
Cc: Brendan Higgins 
Cc: Iurii Zaikin 
Cc: Kees Cook 
Cc: Luis Chamberlain 
Cc: 	[2.6.2+]
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

namei: allow restricted O_CREAT of FIFOs and regular files

2018-12-01T08:46:41+00:00

commit 30aba6656f61ed44cba445a3c0d38b296fa9e8f5 upstream.

Disallows open of FIFOs or regular files not owned by the user in world
writable sticky directories, unless the owner is the same as that of the
directory or the file is opened without the O_CREAT flag.  The purpose
is to make data spoofing attacks harder.  This protection can be turned
on and off separately for FIFOs and regular files via sysctl, just like
the symlinks/hardlinks protection.  This patch is based on Openwall's
"HARDEN_FIFO" feature by Solar Designer.

This is a brief list of old vulnerabilities that could have been prevented
by this feature, some of them even allow for privilege escalation:

CVE-2000-1134
CVE-2007-3852
CVE-2008-0525
CVE-2009-0416
CVE-2011-4834
CVE-2015-1838
CVE-2015-7442
CVE-2016-7489

This list is not meant to be complete.  It's difficult to track down all
vulnerabilities of this kind because they were often reported without any
mention of this particular attack vector.  In fact, before
hardlinks/symlinks restrictions, fifos/regular files weren't the favorite
vehicle to exploit them.

[s.mesoraca16@gmail.com: fix bug reported by Dan Carpenter]
  Link: https://lkml.kernel.org/r/20180426081456.GA7060@mwanda
  Link: http://lkml.kernel.org/r/1524829819-11275-1-git-send-email-s.mesoraca16@gmail.com
[keescook@chromium.org: drop pr_warn_ratelimited() in favor of audit changes in the future]
[keescook@chromium.org: adjust commit subjet]
Link: http://lkml.kernel.org/r/20180416175918.GA13494@beast
Signed-off-by: Salvatore Mesoraca 
Signed-off-by: Kees Cook 
Suggested-by: Solar Designer 
Suggested-by: Kees Cook 
Cc: Al Viro 
Cc: Dan Carpenter 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Cc: Loic 
Signed-off-by: Greg Kroah-Hartman

sched/sysctl: Check user input value of sysctl_sched_time_avg

2018-09-05T07:18:33+00:00

commit 5ccba44ba118a5000cccc50076b0344632459779 upstream.

System will hang if user set sysctl_sched_time_avg to 0:

  [root@XXX ~]# sysctl kernel.sched_time_avg_ms=0

  Stack traceback for pid 0
  0xffff883f6406c600 0 0 1 3 R 0xffff883f6406cf50 *swapper/3
  ffff883f7ccc3ae8 0000000000000018 ffffffff810c4dd0 0000000000000000
  0000000000017800 ffff883f7ccc3d78 0000000000000003 ffff883f7ccc3bf8
  ffffffff810c4fc9 ffff883f7ccc3c08 00000000810c5043 ffff883f7ccc3c08
  Call Trace:
   [] ? update_group_capacity+0x110/0x200
  [] ? update_sd_lb_stats+0x109/0x600
  [] ? find_busiest_group+0x47/0x530
  [] ? load_balance+0x194/0x900
  [] ? update_rq_clock.part.83+0x1a/0xe0
  [] ? rebalance_domains+0x152/0x290
  [] ? run_rebalance_domains+0xdc/0x1d0
  [] ? __do_softirq+0xfb/0x320
  [] ? irq_exit+0x125/0x130
  [] ? scheduler_ipi+0x97/0x160
  [] ? smp_reschedule_interrupt+0x29/0x30
  [] ? reschedule_interrupt+0x6e/0x80
    [] ? cpuidle_enter_state+0xcc/0x230
  [] ? cpuidle_enter_state+0x9c/0x230
  [] ? cpuidle_enter+0x17/0x20
  [] ? cpu_startup_entry+0x38c/0x420
  [] ? start_secondary+0x173/0x1e0

Because divide-by-zero error happens in function:

update_group_capacity()
  update_cpu_capacity()
    scale_rt_capacity()
     {
          ...
          total = sched_avg_period() + delta;
          used = div_u64(avg, total);
          ...
     }

To fix this issue, check user input value of sysctl_sched_time_avg, keep
it unchanged when hitting invalid input, and set the minimum limit of
sysctl_sched_time_avg to 1 ms.

Reported-by: James Puthukattukaran 
Signed-off-by: Ethan Zhao 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: efault@gmx.de
Cc: ethan.kernel@gmail.com
Cc: keescook@chromium.org
Cc: mcgrof@kernel.org
Cc: 
Link: http://lkml.kernel.org/r/1504504774-18253-1-git-send-email-ethan.zhao@oracle.com
Signed-off-by: Ingo Molnar 
Cc: Steve Muckle 
Signed-off-by: Greg Kroah-Hartman

timer/sysclt: Restrict timer migration sysctl values to 0 and 1

2017-10-05T07:41:47+00:00

commit b94bf594cf8ed67cdd0439e70fa939783471597a upstream.

timer_migration sysctl acts as a boolean switch, so the allowed values
should be restricted to 0 and 1.

Add the necessary extra fields to the sysctl table entry to enforce that.

[ tglx: Rewrote changelog ]

Signed-off-by: Myungho Jung 
Link: http://lkml.kernel.org/r/1492640690-3550-1-git-send-email-mhjungk@gmail.com
Signed-off-by: Thomas Gleixner 
Cc: Kazuhiro Hayashi 
Signed-off-by: Greg Kroah-Hartman

sysctl: report EINVAL if value is larger than UINT_MAX for proc_douintvec

2017-07-15T09:57:46+00:00

commit 425fffd886bae3d127a08fa6a17f2e31e24ed7ff upstream.

Currently, inputting the following command will succeed but actually the
value will be truncated:

  # echo 0x12ffffffff > /proc/sys/net/ipv4/tcp_notsent_lowat

This is not friendly to the user, so instead, we should report error
when the value is larger than UINT_MAX.

Fixes: e7d316a02f68 ("sysctl: handle error writing UINT_MAX to u32 fields")
Signed-off-by: Liping Zhang 
Cc: Subash Abhinov Kasiviswanathan 
Cc: Andrew Morton 
Cc: Eric W. Biederman 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

sysctl: don't print negative flag for proc_douintvec

2017-07-15T09:57:46+00:00

commit 5380e5644afbba9e3d229c36771134976f05c91e upstream.

I saw some very confusing sysctl output on my system:
  # cat /proc/sys/net/core/xfrm_aevent_rseqth
  -2
  # cat /proc/sys/net/core/xfrm_aevent_etime
  -10
  # cat /proc/sys/net/ipv4/tcp_notsent_lowat
  -4294967295

Because we forget to set the *negp flag in proc_douintvec, so it will
become a garbage value.

Since the value related to proc_douintvec is always an unsigned integer,
so we can set *negp to false explictily to fix this issue.

Fixes: e7d316a02f68 ("sysctl: handle error writing UINT_MAX to u32 fields")
Signed-off-by: Liping Zhang 
Cc: Subash Abhinov Kasiviswanathan 
Cc: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

sysctl: enable strict writes

2017-07-05T12:37:16+00:00

commit 41662f5cc55335807d39404371cfcbb1909304c4 upstream.

SYSCTL_WRITES_WARN was added in commit f4aacea2f5d1 ("sysctl: allow for
strict write position handling"), and released in v3.16 in August of
2014.  Since then I can find only 1 instance of non-zero offset
writing[1], and it was fixed immediately in CRIU[2].  As such, it
appears safe to flip this to the strict state now.

[1] https://www.google.com/search?q="when%20file%20position%20was%20not%200"
[2] http://lists.openvz.org/pipermail/criu/2015-April/019819.html

Signed-off-by: Kees Cook 
Cc: "Eric W. Biederman" 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Cc: Sumit Semwal 
Signed-off-by: Greg Kroah-Hartman

mnt: Add a per mount namespace limit on the number of mounts

2017-04-30T03:49:28+00:00

commit d29216842a85c7970c536108e093963f02714498 upstream.

CAI Qian  pointed out that the semantics
of shared subtrees make it possible to create an exponentially
increasing number of mounts in a mount namespace.

    mkdir /tmp/1 /tmp/2
    mount --make-rshared /
    for i in $(seq 1 20) ; do mount --bind /tmp/1 /tmp/2 ; done

Will create create 2^20 or 1048576 mounts, which is a practical problem
as some people have managed to hit this by accident.

As such CVE-2016-6213 was assigned.

Ian Kent  described the situation for autofs users
as follows:

> The number of mounts for direct mount maps is usually not very large because of
> the way they are implemented, large direct mount maps can have performance
> problems. There can be anywhere from a few (likely case a few hundred) to less
> than 10000, plus mounts that have been triggered and not yet expired.
>
> Indirect mounts have one autofs mount at the root plus the number of mounts that
> have been triggered and not yet expired.
>
> The number of autofs indirect map entries can range from a few to the common
> case of several thousand and in rare cases up to between 30000 and 50000. I've
> not heard of people with maps larger than 50000 entries.
>
> The larger the number of map entries the greater the possibility for a large
> number of active mounts so it's not hard to expect cases of a 1000 or somewhat
> more active mounts.

So I am setting the default number of mounts allowed per mount
namespace at 100,000.  This is more than enough for any use case I
know of, but small enough to quickly stop an exponential increase
in mounts.  Which should be perfect to catch misconfigurations and
malfunctioning programs.

For anyone who needs a higher limit this can be changed by writing
to the new /proc/sys/fs/mount-max sysctl.

Tested-by: CAI Qian 
Signed-off-by: "Eric W. Biederman" 
[bwh: Backported to 4.4: adjust context]
Signed-off-by: Ben Hutchings 
Signed-off-by: Greg Kroah-Hartman

sysctl: fix proc_doulongvec_ms_jiffies_minmax()

2017-02-01T07:30:52+00:00

commit ff9f8a7cf935468a94d9927c68b00daae701667e upstream.

We perform the conversion between kernel jiffies and ms only when
exporting kernel value to user space.

We need to do the opposite operation when value is written by user.

Only matters when HZ != 1000

Signed-off-by: Eric Dumazet 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

sysctl: handle error writing UINT_MAX to u32 fields

2016-10-07T13:23:46+00:00

commit e7d316a02f683864a12389f8808570e37fb90aa3 upstream.

We have scripts which write to certain fields on 3.18 kernels but this
seems to be failing on 4.4 kernels.  An entry which we write to here is
xfrm_aevent_rseqth which is u32.

  echo 4294967295  > /proc/sys/net/core/xfrm_aevent_rseqth

Commit 230633d109e3 ("kernel/sysctl.c: detect overflows when converting
to int") prevented writing to sysctl entries when integer overflow
occurs.  However, this does not apply to unsigned integers.

Heinrich suggested that we introduce a new option to handle 64 bit
limits and set min as 0 and max as UINT_MAX.  This might not work as it
leads to issues similar to __do_proc_doulongvec_minmax.  Alternatively,
we would need to change the datatype of the entry to 64 bit.

  static int __do_proc_doulongvec_minmax(void *data, struct ctl_table
  {
      i = (unsigned long *) data;   //This cast is causing to read beyond the size of data (u32)
      vleft = table->maxlen / sizeof(unsigned long); //vleft is 0 because maxlen is sizeof(u32) which is lesser than sizeof(unsigned long) on x86_64.

Introduce a new proc handler proc_douintvec.  Individual proc entries
will need to be updated to use the new handler.

[akpm@linux-foundation.org: coding-style fixes]
Fixes: 230633d109e3 ("kernel/sysctl.c:detect overflows when converting to int")
Link: http://lkml.kernel.org/r/1471479806-5252-1-git-send-email-subashab@codeaurora.org
Signed-off-by: Subash Abhinov Kasiviswanathan 
Cc: Heinrich Schuchardt 
Cc: Kees Cook 
Cc: "David S. Miller" 
Cc: Ingo Molnar 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman