linux.git/fs/proc/root.c, branch v6.18.21

Merge tag 'namespace-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

2025-09-29T18:20:29+00:00

Pull namespace updates from Christian Brauner:
 "This contains a larger set of changes around the generic namespace
  infrastructure of the kernel.

  Each specific namespace type (net, cgroup, mnt, ...) embedds a struct
  ns_common which carries the reference count of the namespace and so
  on.

  We open-coded and cargo-culted so many quirks for each namespace type
  that it just wasn't scalable anymore. So given there's a bunch of new
  changes coming in that area I've started cleaning all of this up.

  The core change is to make it possible to correctly initialize every
  namespace uniformly and derive the correct initialization settings
  from the type of the namespace such as namespace operations, namespace
  type and so on. This leaves the new ns_common_init() function with a
  single parameter which is the specific namespace type which derives
  the correct parameters statically. This also means the compiler will
  yell as soon as someone does something remotely fishy.

  The ns_common_init() addition also allows us to remove ns_alloc_inum()
  and drops any special-casing of the initial network namespace in the
  network namespace initialization code that Linus complained about.

  Another part is reworking the reference counting. The reference
  counting was open-coded and copy-pasted for each namespace type even
  though they all followed the same rules. This also removes all open
  accesses to the reference count and makes it private and only uses a
  very small set of dedicated helpers to manipulate them just like we do
  for e.g., files.

  In addition this generalizes the mount namespace iteration
  infrastructure introduced a few cycles ago. As reminder, the vfs makes
  it possible to iterate sequentially and bidirectionally through all
  mount namespaces on the system or all mount namespaces that the caller
  holds privilege over. This allow userspace to iterate over all mounts
  in all mount namespaces using the listmount() and statmount() system
  call.

  Each mount namespace has a unique identifier for the lifetime of the
  systems that is exposed to userspace. The network namespace also has a
  unique identifier working exactly the same way. This extends the
  concept to all other namespace types.

  The new nstree type makes it possible to lookup namespaces purely by
  their identifier and to walk the namespace list sequentially and
  bidirectionally for all namespace types, allowing userspace to iterate
  through all namespaces. Looking up namespaces in the namespace tree
  works completely locklessly.

  This also means we can move the mount namespace onto the generic
  infrastructure and remove a bunch of code and members from struct
  mnt_namespace itself.

  There's a bunch of stuff coming on top of this in the future but for
  now this uses the generic namespace tree to extend a concept
  introduced first for pidfs a few cycles ago. For a while now we have
  supported pidfs file handles for pidfds. This has proven to be very
  useful.

  This extends the concept to cover namespaces as well. It is possible
  to encode and decode namespace file handles using the common
  name_to_handle_at() and open_by_handle_at() apis.

  As with pidfs file handles, namespace file handles are exhaustive,
  meaning it is not required to actually hold a reference to nsfs in
  able to decode aka open_by_handle_at() a namespace file handle.
  Instead the FD_NSFS_ROOT constant can be passed which will let the
  kernel grab a reference to the root of nsfs internally and thus decode
  the file handle.

  Namespaces file descriptors can already be derived from pidfds which
  means they aren't subject to overmount protection bugs. IOW, it's
  irrelevant if the caller would not have access to an appropriate
  /proc//ns/ directory as they could always just derive the
  namespace based on a pidfd already.

  It has the same advantage as pidfds. It's possible to reliably and for
  the lifetime of the system refer to a namespace without pinning any
  resources and to compare them trivially.

  Permission checking is kept simple. If the caller is located in the
  namespace the file handle refers to they are able to open it otherwise
  they must hold privilege over the owning namespace of the relevant
  namespace.

  The namespace file handle layout is exposed as uapi and has a stable
  and extensible format. For now it simply contains the namespace
  identifier, the namespace type, and the inode number. The stable
  format means that userspace may construct its own namespace file
  handles without going through name_to_handle_at() as they are already
  allowed for pidfs and cgroup file handles"

* tag 'namespace-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (65 commits)
  ns: drop assert
  ns: move ns type into struct ns_common
  nstree: make struct ns_tree private
  ns: add ns_debug()
  ns: simplify ns_common_init() further
  cgroup: add missing ns_common include
  ns: use inode initializer for initial namespaces
  selftests/namespaces: verify initial namespace inode numbers
  ns: rename to __ns_ref
  nsfs: port to ns_ref_*() helpers
  net: port to ns_ref_*() helpers
  uts: port to ns_ref_*() helpers
  ipv4: use check_net()
  net: use check_net()
  net-sysfs: use check_net()
  user: port to ns_ref_*() helpers
  time: port to ns_ref_*() helpers
  pid: port to ns_ref_*() helpers
  ipc: port to ns_ref_*() helpers
  cgroup: port to ns_ref_*() helpers
  ...

procfs: add "pidns" mount option

2025-09-02T09:37:24+00:00

Since the introduction of pid namespaces, their interaction with procfs
has been entirely implicit in ways that require a lot of dancing around
by programs that need to construct sandboxes with different PID
namespaces.

Being able to explicitly specify the pid namespace to use when
constructing a procfs super block will allow programs to no longer need
to fork off a process which does then does unshare(2) / setns(2) and
forks again in order to construct a procfs in a pidns.

So, provide a "pidns" mount option which allows such users to just
explicitly state which pid namespace they want that procfs instance to
use. This interface can be used with fsconfig(2) either with a file
descriptor or a path:

  fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
  fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);

or with classic mount(2) / mount(8):

  // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
  mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");

As this new API is effectively shorthand for setns(2) followed by
mount(2), the permission model for this mirrors pidns_install() to avoid
opening up new attack surfaces by loosening the existing permission
model.

In order to avoid having to RCU-protect all users of proc_pid_ns() (to
avoid UAFs), attempting to reconfigure an existing procfs instance's pid
namespace will error out with -EBUSY. Creating new procfs instances is
quite cheap, so this should not be an impediment to most users, and lets
us avoid a lot of churn in fs/proc/* for a feature that it seems
unlikely userspace would use.

Signed-off-by: Aleksa Sarai 
Link: https://lore.kernel.org/20250805-procfs-pidns-api-v4-2-705f984940e7@cyphar.com
Signed-off-by: Christian Brauner

uapi: export PROCFS_ROOT_INO

2025-07-10T07:39:18+00:00

The root inode of /proc having a fixed inode number has been part of the
core kernel ABI since its inception, and recently some userspace
programs (mainly container runtimes) have started to explicitly depend
on this behaviour.

The main reason this is useful to userspace is that by checking that a
suspect /proc handle has fstype PROC_SUPER_MAGIC and is PROCFS_ROOT_INO,
they can then use openat2(RESOLVE_{NO_{XDEV,MAGICLINK},BENEATH}) to
ensure that there isn't a bind-mount that replaces some procfs file with
a different one. This kind of attack has lead to security issues in
container runtimes in the past (such as CVE-2019-19921) and libraries
like libpathrs[1] use this feature of procfs to provide safe procfs
handling functions.

There was also some trailing whitespace in the "struct proc_dir_entry"
initialiser, so fix that up as well.

[1]: https://github.com/openSUSE/libpathrs

Signed-off-by: Aleksa Sarai 
Link: https://lore.kernel.org/20250708-uapi-procfs-root-ino-v1-1-6ae61e97c79b@cyphar.com
Signed-off-by: Christian Brauner

procfs: make freeing proc_fs_info rcu-delayed

2024-02-25T07:10:32+00:00

makes proc_pid_ns() safe from rcu pathwalk (put_pid_ns()
is still synchronous, but that's not a problem - it does
rcu-delay everything that needs to be)

Reviewed-by: Christian Brauner 
Signed-off-by: Al Viro

fs: super: dynamically allocate the s_shrink

2023-10-04T17:32:26+00:00

In preparation for implementing lockless slab shrink, use new APIs to
dynamically allocate the s_shrink, so that it can be freed asynchronously
via RCU. Then it doesn't need to wait for RCU read-side critical section
when releasing the struct super_block.

Link: https://lkml.kernel.org/r/20230911094444.68966-39-zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng 
Reviewed-by: Muchun Song 
Acked-by: David Sterba 
Cc: Chris Mason 
Cc: Josef Bacik 
Cc: Alexander Viro 
Cc: Christian Brauner 
Cc: Abhinav Kumar 
Cc: Alasdair Kergon 
Cc: Alyssa Rosenzweig 
Cc: Andreas Dilger 
Cc: Andreas Gruenbacher 
Cc: Anna Schumaker 
Cc: Arnd Bergmann 
Cc: Bob Peterson 
Cc: Borislav Petkov 
Cc: Carlos Llamas 
Cc: Chandan Babu R 
Cc: Chao Yu 
Cc: Christian Koenig 
Cc: Chuck Lever 
Cc: Coly Li 
Cc: Dai Ngo 
Cc: Daniel Vetter 
Cc: Daniel Vetter 
Cc: "Darrick J. Wong" 
Cc: Dave Chinner 
Cc: Dave Hansen 
Cc: David Airlie 
Cc: David Hildenbrand 
Cc: Dmitry Baryshkov 
Cc: Gao Xiang 
Cc: Greg Kroah-Hartman 
Cc: Huang Rui 
Cc: Ingo Molnar 
Cc: Jaegeuk Kim 
Cc: Jani Nikula 
Cc: Jan Kara 
Cc: Jason Wang 
Cc: Jeff Layton 
Cc: Jeffle Xu 
Cc: Joel Fernandes (Google) 
Cc: Joonas Lahtinen 
Cc: Juergen Gross 
Cc: Kent Overstreet 
Cc: Kirill Tkhai 
Cc: Marijn Suijten 
Cc: "Michael S. Tsirkin" 
Cc: Mike Snitzer 
Cc: Minchan Kim 
Cc: Muchun Song 
Cc: Nadav Amit 
Cc: Neil Brown 
Cc: Oleksandr Tyshchenko 
Cc: Olga Kornievskaia 
Cc: Paul E. McKenney 
Cc: Richard Weinberger 
Cc: Rob Clark 
Cc: Rob Herring 
Cc: Rodrigo Vivi 
Cc: Roman Gushchin 
Cc: Sean Paul 
Cc: Sergey Senozhatsky 
Cc: Song Liu 
Cc: Stefano Stabellini 
Cc: Steven Price 
Cc: "Theodore Ts'o" 
Cc: Thomas Gleixner 
Cc: Tomeu Vizoso 
Cc: Tom Talpey 
Cc: Trond Myklebust 
Cc: Tvrtko Ursulin 
Cc: Vlastimil Babka 
Cc: Xuan Zhuo 
Cc: Yue Hu 
Signed-off-by: Andrew Morton

fs: pass the request_mask to generic_fillattr

2023-08-09T06:56:36+00:00

generic_fillattr just fills in the entire stat struct indiscriminately
today, copying data from the inode. There is at least one attribute
(STATX_CHANGE_COOKIE) that can have side effects when it is reported,
and we're looking at adding more with the addition of multigrain
timestamps.

Add a request_mask argument to generic_fillattr and have most callers
just pass in the value that is passed to getattr. Have other callers
(e.g. ksmbd) just pass in STATX_BASIC_STATS. Also move the setting of
STATX_CHANGE_COOKIE into generic_fillattr.

Acked-by: Joseph Qi 
Reviewed-by: Xiubo Li 
Reviewed-by: "Paulo Alcantara (SUSE)" 
Reviewed-by: Jan Kara 
Signed-off-by: Jeff Layton 
Message-Id: <20230807-mgctime-v7-2-d1dec143a704@kernel.org>
Signed-off-by: Christian Brauner

fs: port ->getattr() to pass mnt_idmap

2023-01-19T08:24:25+00:00

Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.

Acked-by: Dave Chinner 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Christian Brauner (Microsoft)

proc: add some (hopefully) insightful comments

2022-07-30T01:12:35+00:00

* /proc/${pid}/net status
* removing PDE vs last close stuff (again!)
* random small stuff

Link: https://lkml.kernel.org/r/YtwrM6sDC0OQ53YB@localhost.localdomain
Signed-off-by: Alexey Dobriyan 
Signed-off-by: Andrew Morton

proc: delete unused includes

2022-07-18T00:31:39+00:00

Those aren't necessary after seq files won.

Link: https://lkml.kernel.org/r/YqnA3mS7KBt8Z4If@localhost.localdomain
Signed-off-by: Alexey Dobriyan 
Signed-off-by: Andrew Morton

fs: make helpers idmap mount aware

2021-01-24T13:27:20+00:00

Extend some inode methods with an additional user namespace argument. A
filesystem that is aware of idmapped mounts will receive the user
namespace the mount has been marked with. This can be used for
additional permission checking and also to enable filesystems to
translate between uids and gids if they need to. We have implemented all
relevant helpers in earlier patches.

As requested we simply extend the exisiting inode method instead of
introducing new ones. This is a little more code churn but it's mostly
mechanical and doesnt't leave us with additional inode methods.

Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com
Cc: Christoph Hellwig 
Cc: David Howells 
Cc: Al Viro 
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig 
Signed-off-by: Christian Brauner