Age | Commit message (Collapse) | Author | Files | Lines |
|
commit 229fd15908fe1f99b1de4cde3326e62d1e892611 upstream.
When copying a namespace we won't have added the new copy into the
namespace rbtree until after the copy succeeded. Calling free_mnt_ns()
will try to remove the copy from the rbtree which is invalid. Simply
free the namespace skeleton directly.
Link: https://lore.kernel.org/r/20241016-adapter-seilwinde-83c508a7bde1@brauner
Fixes: 1901c92497bd ("fs: keep an index of current mount namespaces")
Tested-by: Brad Spengler <spender@grsecurity.net>
Cc: stable@vger.kernel.org # v6.11+
Reported-by: Brad Spengler <spender@grsecurity.net>
Suggested-by: Brad Spengler <spender@grsecurity.net>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 4bcda1eaf184e308f07f9c61d3a535f9ce477ce8 ]
If no page could be allocated, an error pointer was used as format
string in pr_warn.
Rearrange the code to return early in case of OOM. Also add a check
for the return value of d_path.
Fixes: f8b92ba67c5d ("mount: Add mount warning for impending timestamp expiry")
Signed-off-by: Olaf Hering <olaf@aepfle.de>
Link: https://lore.kernel.org/r/20240730085856.32385-1-olaf@aepfle.de
[brauner: rewrite commit and commit message]
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
The counter is unconditionally incremented for each mount allocation.
If we set it to 1ULL << 32 we're losing 4294967296 as the first valid
non-32 bit mount id.
Link: https://lore.kernel.org/r/20240719-work-mount-namespace-v1-1-834113cab0d2@kernel.org
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs mount query updates from Christian Brauner:
"This contains work to extend the abilities of listmount() and
statmount() and various fixes and cleanups.
Features:
- Allow iterating through mounts via listmount() from newest to
oldest. This makes it possible for mount(8) to keep iterating the
mount table in reverse order so it gets newest mounts first.
- Relax permissions on listmount() and statmount().
It's not necessary to have capabilities in the initial namespace:
it is sufficient to have capabilities in the owning namespace of
the mount namespace we're located in to list unreachable mounts in
that namespace.
- Extend both listmount() and statmount() to list and stat mounts in
foreign mount namespaces.
Currently the only way to iterate over mount entries in mount
namespaces that aren't in the caller's mount namespace is by
crawling through /proc in order to find /proc/<pid>/mountinfo for
the relevant mount namespace.
This is both very clumsy and hugely inefficient. So extend struct
mnt_id_req with a new member that allows to specify the mount
namespace id of the mount namespace we want to look at.
Luckily internally we already have most of the infrastructure for
this so we just need to expose it to userspace. Give userspace a
way to retrieve the id of a mount namespace via statmount() and
through a new nsfs ioctl() on mount namespace file descriptor.
This comes with appropriate selftests.
- Expose mount options through statmount().
Currently if userspace wants to get mount options for a mount and
with statmount(), they still have to open /proc/<pid>/mountinfo to
parse mount options. Simply the information through statmount()
directly.
Afterwards it's possible to only rely on statmount() and
listmount() to retrieve all and more information than
/proc/<pid>/mountinfo provides.
This comes with appropriate selftests.
Fixes:
- Avoid copying to userspace under the namespace semaphore in
listmount.
Cleanups:
- Simplify the error handling in listmount by relying on our newly
added cleanup infrastructure.
- Refuse invalid mount ids early for both listmount and statmount"
* tag 'vfs-6.11.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: reject invalid last mount id early
fs: refuse mnt id requests with invalid ids early
fs: find rootfs mount of the mount namespace
fs: only copy to userspace on success in listmount()
sefltests: extend the statmount test for mount options
fs: use guard for namespace_sem in statmount()
fs: export mount options via statmount()
fs: rename show_mnt_opts -> show_vfsmnt_opts
selftests: add a test for the foreign mnt ns extensions
fs: add an ioctl to get the mnt ns id from nsfs
fs: Allow statmount() in foreign mount namespace
fs: Allow listmount() in foreign mount namespace
fs: export the mount ns id via statmount
fs: keep an index of current mount namespaces
fs: relax permissions for statmount()
listmount: allow listing in reverse order
fs: relax permissions for listmount()
fs: simplify error handling
fs: don't copy to userspace under namespace semaphore
path: add cleanup helper
|
|
Unique mount ids start past the last valid old mount id value to not
confuse the two. If a last mount id has been specified, reject any
invalid values early.
Link: https://lore.kernel.org/r/20240704-work-mount-fixes-v1-2-d007c990de5f@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Unique mount ids start past the last valid old mount id value to not
confuse the two so reject invalid values early in copy_mnt_id_req().
Link: https://lore.kernel.org/r/20240704-work-mount-fixes-v1-1-d007c990de5f@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The method we used was predicated on the assumption that the mount
immediately following the root mount of the mount namespace would be the
rootfs mount of the namespace. That's not always the case though. For
example:
ID PARENT ID
408 412 0:60 /containers/overlay-containers/bc391117192b32071b22ef2083ebe7735d5c390f87a5779e02faf79ba0746ceb/userdata/hosts /etc/hosts rw,nosuid,nodev,relatime - tmpfs tmpfs rw,size=954664k,nr_inodes=238666,mode=700,uid=1000,gid=1000,inode64
409 414 0:61 / /dev/shm rw,nosuid,nodev,noexec,relatime - tmpfs shm rw,size=64000k,uid=1000,gid=1000,inode64
410 412 0:60 /containers/overlay-containers/bc391117192b32071b22ef2083ebe7735d5c390f87a5779e02faf79ba0746ceb/userdata/.containerenv /run/.containerenv rw,nosuid,nodev,relatime - tmpfs tmpfs rw,size=954664k,nr_inodes=238666,mode=700,uid=1000,gid=1000,inode64
411 412 0:60 /containers/overlay-containers/bc391117192b32071b22ef2083ebe7735d5c390f87a5779e02faf79ba0746ceb/userdata/hostname /etc/hostname rw,nosuid,nodev,relatime - tmpfs tmpfs rw,size=954664k,nr_inodes=238666,mode=700,uid=1000,gid=1000,inode64
412 363 0:65 / / rw,relatime - overlay overlay rw,lowerdir=/home/user1/.local/share/containers/storage/overlay/l/JS65SUCGTPCP2EEBHLRP4UCFI5:/home/user1/.local/share/containers/storage/overlay/l/DLW22KVDWUNI4242D6SDJ5GKCL [...]
413 412 0:68 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
414 412 0:69 / /dev rw,nosuid - tmpfs tmpfs rw,size=65536k,mode=755,uid=1000,gid=1000,inode64
415 412 0:70 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs rw
416 414 0:71 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts rw,gid=100004,mode=620,ptmxmode=666
417 414 0:67 / /dev/mqueue rw,nosuid,nodev,noexec,relatime - mqueue mqueue rw
418 415 0:27 / /sys/fs/cgroup ro,nosuid,nodev,noexec,relatime - cgroup2 cgroup2 rw,nsdelegate,memory_recursiveprot
419 414 0:6 /null /dev/null rw,nosuid,noexec - devtmpfs devtmpfs rw,size=4096k,nr_inodes=1179282,mode=755,inode64
420 414 0:6 /zero /dev/zero rw,nosuid,noexec - devtmpfs devtmpfs rw,size=4096k,nr_inodes=1179282,mode=755,inode64
422 414 0:6 /full /dev/full rw,nosuid,noexec - devtmpfs devtmpfs rw,size=4096k,nr_inodes=1179282,mode=755,inode64
423 414 0:6 /tty /dev/tty rw,nosuid,noexec - devtmpfs devtmpfs rw,size=4096k,nr_inodes=1179282,mode=755,inode64
430 414 0:6 /random /dev/random rw,nosuid,noexec - devtmpfs devtmpfs rw,size=4096k,nr_inodes=1179282,mode=755,inode64
431 414 0:6 /urandom /dev/urandom rw,nosuid,noexec - devtmpfs devtmpfs rw,size=4096k,nr_inodes=1179282,mode=755,inode64
433 413 0:72 / /proc/acpi ro,relatime - tmpfs tmpfs rw,size=0k,uid=1000,gid=1000,inode64
440 413 0:6 /null /proc/kcore ro,nosuid - devtmpfs devtmpfs rw,size=4096k,nr_inodes=1179282,mode=755,inode64
441 413 0:6 /null /proc/keys ro,nosuid - devtmpfs devtmpfs rw,size=4096k,nr_inodes=1179282,mode=755,inode64
442 413 0:6 /null /proc/timer_list ro,nosuid - devtmpfs devtmpfs rw,size=4096k,nr_inodes=1179282,mode=755,inode64
443 413 0:73 / /proc/scsi ro,relatime - tmpfs tmpfs rw,size=0k,uid=1000,gid=1000,inode64
444 415 0:74 / /sys/firmware ro,relatime - tmpfs tmpfs rw,size=0k,uid=1000,gid=1000,inode64
445 415 0:75 / /sys/dev/block ro,relatime - tmpfs tmpfs rw,size=0k,uid=1000,gid=1000,inode64
446 413 0:68 /bus /proc/bus ro,nosuid,nodev,noexec,relatime - proc proc rw
447 413 0:68 /fs /proc/fs ro,nosuid,nodev,noexec,relatime - proc proc rw
448 413 0:68 /irq /proc/irq ro,nosuid,nodev,noexec,relatime - proc proc rw
449 413 0:68 /sys /proc/sys ro,nosuid,nodev,noexec,relatime - proc proc rw
450 413 0:68 /sysrq-trigger /proc/sysrq-trigger ro,nosuid,nodev,noexec,relatime - proc proc rw
364 414 0:71 /0 /dev/console rw,relatime - devpts devpts rw,gid=100004,mode=620,ptmxmode=666
In this mount table the root mount of the mount namespace is the mount
with id 363 (It isn't visible because it's literally just what the
rootfs mount is mounted upon and usually it's just a copy of the real
rootfs).
The rootfs mount that's mounted on the root mount of the mount namespace
is the mount with id 412. But the mount namespace contains mounts that
were created before the rootfs mount and thus have earlier mount ids. So
the first call to listmnt_next() would return the mount with the mount
id 408 and not the rootfs mount.
So we need to find the actual rootfs mount mounted on the root mount of
the mount namespace. This logic is also present in mntns_install() where
vfs_path_lookup() is used. We can't use this though as we're holding the
namespace semaphore. We could look at the children of the root mount of
the mount namespace directly but that also seems a bit out of place
while we have the rbtree. So let's just iterate through the rbtree
starting from the root mount of the mount namespace and find the mount
whose parent is the root mount of the mount namespace. That mount will
usually appear very early in the rbtree and afaik there can only be one.
IOW, it would be very strange if we ended up with a root mount of a
mount namespace that has shadow mounts.
Fixes: 0a3deb11858a ("fs: Allow listmount() in foreign mount namespace") # mainline only
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Avoid copying when we failed to, or didn't have any mounts to list.
Fixes: cb54ef4f050e ("fs: don't copy to userspace under namespace semaphore") # mainline only
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
statmount() can export arbitrary strings, so utilize the __spare1 slot
for a mnt_opts string pointer, and then support asking for and setting
the mount options during statmount(). This calls into the helper for
showing mount options, which already uses a seq_file, so fits in nicely
with our existing mechanism for exporting strings via statmount().
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/3aa6bf8bd5d0a21df9ebd63813af8ab532c18276.1719257716.git.josef@toxicpanda.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
[brauner: only call sb->s_op->show_options()]
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
This patch makes use of the new mnt_ns_id field in struct mnt_id_req to
allow users to stat mount entries not in their mount namespace. The
rules are the same as listmount(), the user must have CAP_SYS_ADMIN in
their user namespace and the target mount namespace must be a child of
the current namespace.
Co-developed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/52a2e17e50ba7aa420bc8bae1d9e88ff593395c1.1719243756.git.josef@toxicpanda.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Expand struct mnt_id_req to add an optional mnt_ns_id field. When this
field is populated, listmount() will be performed on the specified mount
namespace, provided the currently application has CAP_SYS_ADMIN in its
user namespace and the mount namespace is a child of the current
namespace.
Co-developed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/49930bdce29a8367a213eb14c1e68e7e49284f86.1719243756.git.josef@toxicpanda.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
In order to allow users to iterate through children mount namespaces via
listmount we need a way for them to know what the ns id for the mount.
Add a new field to statmount called mnt_ns_id which will carry the ns id
for the given mount entry.
Co-developed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/6dabf437331fb7415d886f7c64b21cb2a50b1c66.1719243756.git.josef@toxicpanda.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
In order to allow for listmount() to be used on different namespaces we
need a way to lookup a mount ns by its id. Keep a rbtree of the current
!anonymous mount name spaces indexed by ID that we can use to look up
the namespace.
Co-developed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/e5fdd78a90f5b00a75bd893962a70f52a2c015cd.1719243756.git.josef@toxicpanda.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
It is sufficient to have capabilities in the owning user namespace of
the mount namespace to stat a mount regardless of whether it's reachable
or not.
Link: https://lore.kernel.org/r/bf5961d71ec479ba85806766b0d8d96043e67bba.1719243756.git.josef@toxicpanda.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
It is sufficient to have capabilities in the owning user namespace of
the mount namespace to list all mounts regardless of whether they are
reachable or not.
Link: https://lore.kernel.org/r/8adc0d3f4f7495faacc6a7c63095961f7f1637c7.1719243756.git.josef@toxicpanda.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
util-linux is about to implement listmount() and statmount() support.
Karel requested the ability to scan the mount table in backwards order
because that's what libmount currently does in order to get the latest
mount first. We currently don't support this in listmount(). Add a new
LISTMOUNT_REVERSE flag to allow listing mounts in reverse order. For
example, listing all child mounts of /sys without LISTMOUNT_REVERSE
gives:
/sys/kernel/security @ mnt_id: 4294968369
/sys/fs/cgroup @ mnt_id: 4294968370
/sys/firmware/efi/efivars @ mnt_id: 4294968371
/sys/fs/bpf @ mnt_id: 4294968372
/sys/kernel/tracing @ mnt_id: 4294968373
/sys/kernel/debug @ mnt_id: 4294968374
/sys/fs/fuse/connections @ mnt_id: 4294968375
/sys/kernel/config @ mnt_id: 4294968376
whereas with LISTMOUNT_REVERSE it gives:
/sys/kernel/config @ mnt_id: 4294968376
/sys/fs/fuse/connections @ mnt_id: 4294968375
/sys/kernel/debug @ mnt_id: 4294968374
/sys/kernel/tracing @ mnt_id: 4294968373
/sys/fs/bpf @ mnt_id: 4294968372
/sys/firmware/efi/efivars @ mnt_id: 4294968371
/sys/fs/cgroup @ mnt_id: 4294968370
/sys/kernel/security @ mnt_id: 4294968369
Link: https://lore.kernel.org/r/20240607-vfs-listmount-reverse-v1-4-7877a2bfa5e5@kernel.org
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Rely on cleanup helper and simplify error handling
Link: https://lore.kernel.org/r/20240607-vfs-listmount-reverse-v1-3-7877a2bfa5e5@kernel.org
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Don't copy mount ids to userspace while holding the namespace semaphore.
We really shouldn't do that and I've gone through lenghts avoiding that
in statmount() already.
Limit the number of mounts that can be retrieved in one go to 1 million
mount ids. That's effectively 10 times the default limt of 100000 mounts
that we put on each mount namespace by default. Since listmount() is an
iterator limiting the number of mounts retrievable in one go isn't a
problem as userspace can just pick up where they left off.
Karel menti_ned that libmount will probably be reading the mount table
in "in small steps, 512 nodes per request. Nobody likes a tool that
takes too long in the kernel, and huge servers are unusual use cases.
Libmount will very probably provide API to define size of the step (IDs
per request)."
Reported-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/r/20240610-frettchen-liberal-a9a5c53865f8@brauner
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The newly used helper also checks for empty ("") paths.
NULL paths with any flag value other than AT_EMPTY_PATH go the usual
route and end up with -EFAULT to retain compatibility (Rust is abusing
calls of the sort to detect availability of statx).
This avoids path lookup code, lockref management, memory allocation and
in case of NULL path userspace memory access (which can be quite
expensive with SMAP on x86_64).
Benchmarked with statx(..., AT_EMPTY_PATH, ...) running on Sapphire
Rapids, with the "" path for the first two cases and NULL for the last
one.
Results in ops/s:
stock: 4231237
pre-check: 5944063 (+40%)
NULL path: 6601619 (+11%/+56%)
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/r/20240625151807.620812-1-mjguzik@gmail.com
Tested-by: Xi Ruoyao <xry111@xry111.site>
[brauner: use path_mounted() and other tweaks]
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
by employing `copy mount tree from src to dst` concept.
This involves renaming the opaque variables (e.g., p, q, r, s)
to be more descriptive, aiming to make the code easier to understand.
Changes:
mnt -> src_root (root of the tree to copy)
r -> src_root_child (direct child of the root being cloning)
p -> src_parent (parent of src_mnt)
s -> src_mnt (current mount being copying)
parent -> dst_parent (parent of dst_child)
q -> dst_mnt (freshly cloned mount)
Signed-off-by: Jemmy <jemmywong512@gmail.com>
Link: https://lore.kernel.org/r/20240606173912.99442-1-jemmywong512@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
A current limitation of open_by_handle_at() is that it's currently not possible
to use it from within containers at all because we require CAP_DAC_READ_SEARCH
in the initial namespace. That's unfortunate because there are scenarios where
using open_by_handle_at() from within containers.
Two examples:
(1) cgroupfs allows to encode cgroups to file handles and reopen them with
open_by_handle_at().
(2) Fanotify allows placing filesystem watches they currently aren't usable in
containers because the returned file handles cannot be used.
Here's a proposal for relaxing the permission check for open_by_handle_at().
(1) Opening file handles when the caller has privileges over the filesystem
(1.1) The caller has an unobstructed view of the filesystem.
(1.2) The caller has permissions to follow a path to the file handle.
This doesn't address the problem of opening a file handle when only a portion
of a filesystem is exposed as is common in containers by e.g., bind-mounting a
subtree. The proposal to solve this use-case is:
(2) Opening file handles when the caller has privileges over a subtree
(2.1) The caller is able to reach the file from the provided mount fd.
(2.2) The caller has permissions to construct an unobstructed path to the
file handle.
(2.3) The caller has permissions to follow a path to the file handle.
The relaxed permission checks are currently restricted to directory file
handles which are what both cgroupfs and fanotify need. Handling disconnected
non-directory file handles would lead to a potentially non-deterministic api.
If a disconnected non-directory file handle is provided we may fail to decode
a valid path that we could use for permission checking. That in itself isn't a
problem as we would just return EACCES in that case. However, confusion may
arise if a non-disconnected dentry ends up in the cache later and those opening
the file handle would suddenly succeed.
* It's potentially possible to use timing information (side-channel) to infer
whether a given inode exists. I don't think that's particularly
problematic. Thanks to Jann for bringing this to my attention.
* An unrelated note (IOW, these are thoughts that apply to
open_by_handle_at() generically and are unrelated to the changes here):
Jann pointed out that we should verify whether deleted files could
potentially be reopened through open_by_handle_at(). I don't think that's
possible though.
Another potential thing to check is whether open_by_handle_at() could be
abused to open internal stuff like memfds or gpu stuff. I don't think so
but I haven't had the time to completely verify this.
This dates back to discussions Amir and I had quite some time ago and thanks to
him for providing a lot of details around the export code and related patches!
Link: https://lore.kernel.org/r/20240524-vfs-open_by_handle_at-v1-1-3d4b7d22736b@kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
When we added mount_setattr() I added additional checks compared to the
legacy do_reconfigure_mnt() and do_change_type() helpers used by regular
mount(2). If that mount had a parent then verify that the caller and the
mount namespace the mount is attached to match and if not make sure that
it's an anonymous mount.
The real rootfs falls into neither category. It is neither an anoymous
mount because it is obviously attached to the initial mount namespace
but it also obviously doesn't have a parent mount. So that means legacy
mount(2) allows changing mount properties on the real rootfs but
mount_setattr(2) blocks this. I never thought much about this but of
course someone on this planet of earth changes properties on the real
rootfs as can be seen in [1].
Since util-linux finally switched to the new mount api in 2.39 not so
long ago it also relies on mount_setattr() and that surfaced this issue
when Fedora 39 finally switched to it. Fix this.
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2256843
Link: https://lore.kernel.org/r/20240206-vfs-mount-rootfs-v1-1-19b335eee133@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Reported-by: Karel Zak <kzak@redhat.com>
Cc: stable@vger.kernel.org # v5.12+
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Linus pointed out that there's error handling and naming issues in the
that we should rewrite:
* Perform the access checks for the buffer before actually doing any
work instead of doing it during the iteration.
* Rename the arguments to listmount() and do_listmount() to clarify what
the arguments are used for.
* Get rid of the pointless ctr variable and overflow checking.
* Get rid of the pointless speculation check.
Link: https://lore.kernel.org/r/CAHk-=wjh6Cypo8WC-McXgSzCaou3UXccxB+7PVeSuGR8AjCphg@mail.gmail.com
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux
Pull sysctl updates from Luis Chamberlain:
"To help make the move of sysctls out of kernel/sysctl.c not incur a
size penalty sysctl has been changed to allow us to not require the
sentinel, the final empty element on the sysctl array. Joel Granados
has been doing all this work.
In the v6.6 kernel we got the major infrastructure changes required to
support this. For v6.7 we had all arch/ and drivers/ modified to
remove the sentinel. For v6.8-rc1 we get a few more updates for fs/
directory only.
The kernel/ directory is left but we'll save that for v6.9-rc1 as
those patches are still being reviewed. After that we then can expect
also the removal of the no longer needed check for procname == NULL.
Let us recap the purpose of this work:
- this helps reduce the overall build time size of the kernel and run
time memory consumed by the kernel by about ~64 bytes per array
- the extra 64-byte penalty is no longer inncurred now when we move
sysctls out from kernel/sysctl.c to their own files
Thomas Weißschuh also sent a few cleanups, for v6.9-rc1 we expect to
see further work by Thomas Weißschuh with the constificatin of the
struct ctl_table.
Due to Joel Granados's work, and to help bring in new blood, I have
suggested for him to become a maintainer and he's accepted. So for
v6.9-rc1 I look forward to seeing him sent you a pull request for
further sysctl changes. This also removes Iurii Zaikin as a maintainer
as he has moved on to other projects and has had no time to help at
all"
* tag 'sysctl-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
sysctl: remove struct ctl_path
sysctl: delete unused define SYSCTL_PERM_EMPTY_DIR
coda: Remove the now superfluous sentinel elements from ctl_table array
sysctl: Remove the now superfluous sentinel elements from ctl_table array
fs: Remove the now superfluous sentinel elements from ctl_table array
cachefiles: Remove the now superfluous sentinel element from ctl_table array
sysclt: Clarify the results of selftest run
sysctl: Add a selftest for handling empty dirs
sysctl: Fix out of bounds access for empty sysctl registers
MAINTAINERS: Add Joel Granados as co-maintainer for proc sysctl
MAINTAINERS: remove Iurii Zaikin from proc sysctl
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"There are no exciting changes for users, it's been mostly API
conversions and some fixes or refactoring.
The mount API conversion is a base for future improvements that would
come with VFS. Metadata processing has been converted to folios, not
yet enabling the large folios but it's one patch away once everything
gets tested enough.
Core changes:
- convert extent buffers to folios:
- direct API conversion where possible
- performance can drop by a few percent on metadata heavy
workloads, the folio sizes are not constant and the calculations
add up in the item helpers
- both regular and subpage modes
- data cannot be converted yet, we need to port that to iomap and
there are some other generic changes required
- convert mount to the new API, should not be user visible:
- options deprecated long time ago have been removed: inode_cache,
recovery
- the new logic that splits mount to two phases slightly changes
timing of device scanning for multi-device filesystems
- LSM options will now work (like for selinux)
- convert delayed nodes radix tree to xarray, preserving the
preload-like logic that still allows to allocate with GFP_NOFS
- more validation of sysfs value of scrub_speed_max
- refactor chunk map structure, reduce size and improve performance
- extent map refactoring, smaller data structures, improved
performance
- reduce size of struct extent_io_tree, embedded in several
structures
- temporary pages used for compression are cached and attached to a
shrinker, this may slightly improve performance
- in zoned mode, remove redirty extent buffer tracking, zeros are
written in case an out-of-order is detected and proper data are
written to the actual write pointer
- cleanups, refactoring, error message improvements, updated tests
- verify and update branch name or tag
- remove unwanted text"
* tag 'for-6.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (89 commits)
btrfs: pass btrfs_io_geometry into btrfs_max_io_len
btrfs: pass struct btrfs_io_geometry to set_io_stripe
btrfs: open code set_io_stripe for RAID56
btrfs: change block mapping to switch/case in btrfs_map_block
btrfs: factor out block mapping for single profiles
btrfs: factor out block mapping for RAID5/6
btrfs: reduce scope of data_stripes in btrfs_map_block
btrfs: factor out block mapping for RAID10
btrfs: factor out block mapping for DUP profiles
btrfs: factor out RAID1 block mapping
btrfs: factor out block-mapping for RAID0
btrfs: re-introduce struct btrfs_io_geometry
btrfs: factor out helper for single device IO check
btrfs: migrate btrfs_repair_io_failure() to folio interfaces
btrfs: migrate eb_bitmap_offset() to folio interfaces
btrfs: migrate various end io functions to folios
btrfs: migrate subpage code to folio interfaces
btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
btrfs: don't double put our subpage reference in alloc_extent_buffer
btrfs: cleanup metadata page pointer usage
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs mount updates from Christian Brauner:
"This contains the work to retrieve detailed information about mounts
via two new system calls. This is hopefully the beginning of the end
of the saga that started with fsinfo() years ago.
The LWN articles in [1] and [2] can serve as a summary so we can avoid
rehashing everything here.
At LSFMM in May 2022 we got into a room and agreed on what we want to
do about fsinfo(). Basically, split it into pieces. This is the first
part of that agreement. Specifically, it is concerned with retrieving
information about mounts. So this only concerns the mount information
retrieval, not the mount table change notification, or the extended
filesystem specific mount option work. That is separate work.
Currently mounts have a 32bit id. Mount ids are already in heavy use
by libmount and other low-level userspace but they can't be relied
upon because they're recycled very quickly. We agreed that mounts
should carry a unique 64bit id by which they can be referenced
directly. This is now implemented as part of this work.
The new 64bit mount id is exposed in statx() through the new
STATX_MNT_ID_UNIQUE flag. If the flag isn't raised the old mount id is
returned. If it is raised and the kernel supports the new 64bit mount
id the flag is raised in the result mask and the new 64bit mount id is
returned. New and old mount ids do not overlap so they cannot be
conflated.
Two new system calls are introduced that operate on the 64bit mount
id: statmount() and listmount(). A summary of the api and usage can be
found on LWN as well (cf. [3]) but of course, I'll provide a summary
here as well.
Both system calls rely on struct mnt_id_req. Which is the request
struct used to pass the 64bit mount id identifying the mount to
operate on. It is extensible to allow for the addition of new
parameters and for future use in other apis that make use of mount
ids.
statmount() mimicks the semantics of statx() and exposes a set flags
that userspace may raise in mnt_id_req to request specific information
to be retrieved. A statmount() call returns a struct statmount filled
in with information about the requested mount. Supported requests are
indicated by raising the request flag passed in struct mnt_id_req in
the @mask argument in struct statmount.
Currently we do support:
- STATMOUNT_SB_BASIC:
Basic filesystem info
- STATMOUNT_MNT_BASIC
Mount information (mount id, parent mount id, mount attributes etc)
- STATMOUNT_PROPAGATE_FROM
Propagation from what mount in current namespace
- STATMOUNT_MNT_ROOT
Path of the root of the mount (e.g., mount --bind /bla /mnt returns /bla)
- STATMOUNT_MNT_POINT
Path of the mount point (e.g., mount --bind /bla /mnt returns /mnt)
- STATMOUNT_FS_TYPE
Name of the filesystem type as the magic number isn't enough due to submounts
The string options STATMOUNT_MNT_{ROOT,POINT} and STATMOUNT_FS_TYPE
are appended to the end of the struct. Userspace can use the offsets
in @fs_type, @mnt_root, and @mnt_point to reference those strings
easily.
The struct statmount reserves quite a bit of space currently for
future extensibility. This isn't really a problem and if this bothers
us we can just send a follow-up pull request during this cycle.
listmount() is given a 64bit mount id via mnt_id_req just as
statmount(). It takes a buffer and a size to return an array of the
64bit ids of the child mounts of the requested mount. Userspace can
thus choose to either retrieve child mounts for a mount in batches or
iterate through the child mounts. For most use-cases it will be
sufficient to just leave space for a few child mounts. But for big
mount tables having an iterator is really helpful. Iterating through a
mount table works by setting @param in mnt_id_req to the mount id of
the last child mount retrieved in the previous listmount() call"
Link: https://lwn.net/Articles/934469 [1]
Link: https://lwn.net/Articles/829212 [2]
Link: https://lwn.net/Articles/950569 [3]
* tag 'vfs-6.8.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
add selftest for statmount/listmount
fs: keep struct mnt_id_req extensible
wire up syscalls for statmount/listmount
add listmount(2) syscall
statmount: simplify string option retrieval
statmount: simplify numeric option retrieval
add statmount(2) syscall
namespace: extract show_path() helper
mounts: keep list of mounts in an rbtree
add unique mount ID
|
|
This commit comes at the tail end of a greater effort to remove the
empty elements at the end of the ctl_table arrays (sentinels) which
will reduce the overall build time size of the kernel and run time
memory bloat by ~64 bytes per sentinel (further information Link :
https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/)
Remove sentinel elements ctl_table struct. Special attention was placed in
making sure that an empty directory for fs/verity was created when
CONFIG_FS_VERITY_BUILTIN_SIGNATURES is not defined. In this case we use the
register sysctl call that expects a size.
Signed-off-by: Joel Granados <j.granados@samsung.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
|
|
We already communicate to filesystems when a remount request comes from
the old mount API as some filesystems choose to implement different
behavior in the new mount API than the old mount API to e.g., take the
chance to fix significant API bugs. Allow the same for regular mount
requests.
Fixes: b330966f79fb ("fuse: reject options on reconfigure via fsconfig(2)")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Make it extensible so that we have the liberty to reuse it in future
mount-id based apis. Treat zero size as the first published struct.
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Add way to query the children of a particular mount. This is a more
flexible way to iterate the mount tree than having to parse
/proc/self/mountinfo.
Lookup the mount by the new 64bit mount ID. If a mount needs to be
queried based on path, then statx(2) can be used to first query the
mount ID belonging to the path.
Return an array of new (64bit) mount ID's. Without privileges only
mounts are listed which are reachable from the task's root.
Folded into this patch are several later improvements. Keeping them
separate would make the history pointlessly confusing:
* Recursive listing of mounts is the default now (cf. [1]).
* Remove explicit LISTMOUNT_UNREACHABLE flag (cf. [1]) and fail if mount
is unreachable from current root. This also makes permission checking
consistent with statmount() (cf. [3]).
* Start listing mounts in unique mount ID order (cf. [2]) to allow
continuing listmount() from a midpoint.
* Allow to continue listmount(). The @request_mask parameter is renamed
and to @param to be usable by both statmount() and listmount().
If @param is set to a mount id then listmount() will continue listing
mounts from that id on. This allows listing mounts in multiple
listmount invocations without having to resize the buffer. If @param
is zero then the listing starts from the beginning (cf. [4]).
* Don't return EOVERFLOW, instead return the buffer size which allows to
detect a full buffer as well (cf. [4]).
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Link: https://lore.kernel.org/r/20231025140205.3586473-6-mszeredi@redhat.com
Reviewed-by: Ian Kent <raven@themaw.net>
Link: https://lore.kernel.org/r/20231128160337.29094-2-mszeredi@redhat.com [1] (folded)
Link: https://lore.kernel.org/r/20231128160337.29094-3-mszeredi@redhat.com [2] (folded)
Link: https://lore.kernel.org/r/20231128160337.29094-4-mszeredi@redhat.com [3] (folded)
Link: https://lore.kernel.org/r/20231128160337.29094-5-mszeredi@redhat.com [4] (folded)
[Christian Brauner <brauner@kernel.org>: various smaller fixes]
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The previous code was a bit too clever for what we currently support.
A few minor changes:
* Avoid indirect function calls and use a simple switch statement. We
really only have three cases to handle so it's not like it's massively
complex. We can switch to something more elaborate should we introduce
more complex options.
* Defer all copy_to_user() calls until after we've given up namespace
semaphore.
On kernels with userfaultfd it's possible to abuse copy_from/to_user()
calls to indefinitely block on page faults. That's usually a
privileged operation but may be made available to unprivileged users.
Independent of userfaultfd it's better to not do all the
copy_to_user() work while holding namespace semaphore. Instead collect
the information and then copy it out after we've given up all locks.
* This also folds a change from Arnd to reduce the stack size in
prepare_kstatmount() to avoid warning such as:
fs/namespace.c:4995:1: error: stack frame size (1536) exceeds limit (1024) in '__se_sys_statmount' [-Werror,-Wframe-larger-than]
4995 | SYSCALL_DEFINE4(statmount, const struct mnt_id_req __user *, req,
Reviewed-by: Ian Kent <raven@themaw.net>
Link: https://lore.kernel.org/r/20231213090015.518044-1-arnd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Don't use all of this indirection which makes it really hard to follow
the code which is very basic. Error handling is also not really neede
here at all.
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Add a way to query attributes of a single mount instead of having to parse
the complete /proc/$PID/mountinfo, which might be huge.
Lookup the mount the new 64bit mount ID. If a mount needs to be queried
based on path, then statx(2) can be used to first query the mount ID
belonging to the path.
Design is based on a suggestion by Linus:
"So I'd suggest something that is very much like "statfsat()", which gets
a buffer and a length, and returns an extended "struct statfs" *AND*
just a string description at the end."
The interface closely mimics that of statx.
Handle ASCII attributes by appending after the end of the structure (as per
above suggestion). Pointers to strings are stored in u64 members to make
the structure the same regardless of pointer size. Strings are nul
terminated.
Link: https://lore.kernel.org/all/CAHk-=wh5YifP7hzKSbwJj94+DZ2czjrZsczy6GBimiogZws=rg@mail.gmail.com/
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Link: https://lore.kernel.org/r/20231025140205.3586473-5-mszeredi@redhat.com
Reviewed-by: Ian Kent <raven@themaw.net>
[Christian Brauner <brauner@kernel.org>: various minor changes]
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The helper is a bit pointless. Just open-code the check.
Link: https://lore.kernel.org/r/20231122-vfs-mnt_idmap-v1-1-dae4abdde5bd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Fix the W=1 build warning:
../fs/namespace.c:3050: warning: Function parameter or member 'mp' not described in 'can_move_mount_beneath'
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
To be used by the statmount(2) syscall as well.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Link: https://lore.kernel.org/r/20231025140205.3586473-4-mszeredi@redhat.com
Reviewed-by: Ian Kent <raven@themaw.net>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
When adding a mount to a namespace insert it into an rbtree rooted in the
mnt_namespace instead of a linear list.
The mnt.mnt_list is still used to set up the mount tree and for
propagation, but not after the mount has been added to a namespace. Hence
mnt_list can live in union with rb_node. Use MNT_ONRB mount flag to
validate that the mount is on the correct list.
This allows removing the cursor used for reading /proc/$PID/mountinfo. The
mnt_id_unique of the next mount can be used as an index into the seq file.
Tested by inserting 100k bind mounts, unsharing the mount namespace, and
unmounting. No performance regressions have been observed.
For the last mount in the 100k list the statmount() call was more than 100x
faster due to the mount ID lookup not having to do a linear search. This
patch makes the overhead of mount ID lookup non-observable in this range.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Link: https://lore.kernel.org/r/20231025140205.3586473-3-mszeredi@redhat.com
Reviewed-by: Ian Kent <raven@themaw.net>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
If a mount is released then its mnt_id can immediately be reused. This is
bad news for user interfaces that want to uniquely identify a mount.
Implementing a unique mount ID is trivial (use a 64bit counter).
Unfortunately userspace assumes 32bit size and would overflow after the
counter reaches 2^32.
Introduce a new 64bit ID alongside the old one. Initialize the counter to
2^32, this guarantees that the old and new IDs are never mixed up.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Link: https://lore.kernel.org/r/20231025140205.3586473-2-mszeredi@redhat.com
Reviewed-by: Ian Kent <raven@themaw.net>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|