<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux.git/include/linux/binfmts.h, branch v6.18.21</title>
<subtitle>Clone of https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git</subtitle>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/'/>
<entry>
<title>fs: remove uselib() system call</title>
<updated>2025-04-21T08:27:59+00:00</updated>
<author>
<name>Christian Brauner</name>
<email>brauner@kernel.org</email>
</author>
<published>2025-04-15T08:27:50+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=79beea2db0431536d79fc5d321225fb42f955466'/>
<id>79beea2db0431536d79fc5d321225fb42f955466</id>
<content type='text'>
This system call has been deprecated for quite a while now.
Let's try and remove it from the kernel completely.

Link: https://lore.kernel.org/20250415-kanufahren-besten-02ac00e6becd@brauner
Acked-by: Kees Cook &lt;kees@kernel.org&gt;
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This system call has been deprecated for quite a while now.
Let's try and remove it from the kernel completely.

Link: https://lore.kernel.org/20250415-kanufahren-besten-02ac00e6becd@brauner
Acked-by: Kees Cook &lt;kees@kernel.org&gt;
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>binfmt: Remove loader from linux_binprm struct</title>
<updated>2025-02-24T19:30:16+00:00</updated>
<author>
<name>Yonatan Goldschmidt</name>
<email>yon.goldschmidt@gmail.com</email>
</author>
<published>2025-02-23T22:32:34+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=cc9554e662a3b5a13f514cfcced54ac263c07094'/>
<id>cc9554e662a3b5a13f514cfcced54ac263c07094</id>
<content type='text'>
Commit 987f20a9dcce ("a.out: Remove the a.out implementation") removed
the last in-tree user of the loader field, and as far as I can tell, it
was the only one historically.

Signed-off-by: Yonatan Goldschmidt &lt;yon.goldschmidt@gmail.com&gt;
Link: https://lore.kernel.org/r/20250223223234.13764-1-yon.goldschmidt@gmail.com
Signed-off-by: Kees Cook &lt;kees@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Commit 987f20a9dcce ("a.out: Remove the a.out implementation") removed
the last in-tree user of the loader field, and as far as I can tell, it
was the only one historically.

Signed-off-by: Yonatan Goldschmidt &lt;yon.goldschmidt@gmail.com&gt;
Link: https://lore.kernel.org/r/20250223223234.13764-1-yon.goldschmidt@gmail.com
Signed-off-by: Kees Cook &lt;kees@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge tag 'AT_EXECVE_CHECK-v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux</title>
<updated>2025-01-23T04:34:42+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2025-01-23T04:34:42+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=21266b8df5224c4f677acf9f353eecc9094731f0'/>
<id>21266b8df5224c4f677acf9f353eecc9094731f0</id>
<content type='text'>
Pull AT_EXECVE_CHECK from Kees Cook:

 - Implement AT_EXECVE_CHECK flag to execveat(2) (Mickaël Salaün)

 - Implement EXEC_RESTRICT_FILE and EXEC_DENY_INTERACTIVE securebits
   (Mickaël Salaün)

 - Add selftests and samples for AT_EXECVE_CHECK (Mickaël Salaün)

* tag 'AT_EXECVE_CHECK-v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  ima: instantiate the bprm_creds_for_exec() hook
  samples/check-exec: Add an enlighten "inc" interpreter and 28 tests
  selftests: ktap_helpers: Fix uninitialized variable
  samples/check-exec: Add set-exec
  selftests/landlock: Add tests for execveat + AT_EXECVE_CHECK
  selftests/exec: Add 32 tests for AT_EXECVE_CHECK and exec securebits
  security: Add EXEC_RESTRICT_FILE and EXEC_DENY_INTERACTIVE securebits
  exec: Add a new AT_EXECVE_CHECK flag to execveat(2)
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull AT_EXECVE_CHECK from Kees Cook:

 - Implement AT_EXECVE_CHECK flag to execveat(2) (Mickaël Salaün)

 - Implement EXEC_RESTRICT_FILE and EXEC_DENY_INTERACTIVE securebits
   (Mickaël Salaün)

 - Add selftests and samples for AT_EXECVE_CHECK (Mickaël Salaün)

* tag 'AT_EXECVE_CHECK-v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  ima: instantiate the bprm_creds_for_exec() hook
  samples/check-exec: Add an enlighten "inc" interpreter and 28 tests
  selftests: ktap_helpers: Fix uninitialized variable
  samples/check-exec: Add set-exec
  selftests/landlock: Add tests for execveat + AT_EXECVE_CHECK
  selftests/exec: Add 32 tests for AT_EXECVE_CHECK and exec securebits
  security: Add EXEC_RESTRICT_FILE and EXEC_DENY_INTERACTIVE securebits
  exec: Add a new AT_EXECVE_CHECK flag to execveat(2)
</pre>
</div>
</content>
</entry>
<entry>
<title>exec: Add a new AT_EXECVE_CHECK flag to execveat(2)</title>
<updated>2024-12-19T01:00:29+00:00</updated>
<author>
<name>Mickaël Salaün</name>
<email>mic@digikod.net</email>
</author>
<published>2024-12-12T17:42:16+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=a5874fde3c0884a33ed4145101052318c5e17c74'/>
<id>a5874fde3c0884a33ed4145101052318c5e17c74</id>
<content type='text'>
Add a new AT_EXECVE_CHECK flag to execveat(2) to check if a file would
be allowed for execution.  The main use case is for script interpreters
and dynamic linkers to check execution permission according to the
kernel's security policy. Another use case is to add context to access
logs e.g., which script (instead of interpreter) accessed a file.  As
any executable code, scripts could also use this check [1].

This is different from faccessat(2) + X_OK which only checks a subset of
access rights (i.e. inode permission and mount options for regular
files), but not the full context (e.g. all LSM access checks).  The main
use case for access(2) is for SUID processes to (partially) check access
on behalf of their caller.  The main use case for execveat(2) +
AT_EXECVE_CHECK is to check if a script execution would be allowed,
according to all the different restrictions in place.  Because the use
of AT_EXECVE_CHECK follows the exact kernel semantic as for a real
execution, user space gets the same error codes.

An interesting point of using execveat(2) instead of openat2(2) is that
it decouples the check from the enforcement.  Indeed, the security check
can be logged (e.g. with audit) without blocking an execution
environment not yet ready to enforce a strict security policy.

LSMs can control or log execution requests with
security_bprm_creds_for_exec().  However, to enforce a consistent and
complete access control (e.g. on binary's dependencies) LSMs should
restrict file executability, or measure executed files, with
security_file_open() by checking file-&gt;f_flags &amp; __FMODE_EXEC.

Because AT_EXECVE_CHECK is dedicated to user space interpreters, it
doesn't make sense for the kernel to parse the checked files, look for
interpreters known to the kernel (e.g. ELF, shebang), and return ENOEXEC
if the format is unknown.  Because of that, security_bprm_check() is
never called when AT_EXECVE_CHECK is used.

It should be noted that script interpreters cannot directly use
execveat(2) (without this new AT_EXECVE_CHECK flag) because this could
lead to unexpected behaviors e.g., `python script.sh` could lead to Bash
being executed to interpret the script.  Unlike the kernel, script
interpreters may just interpret the shebang as a simple comment, which
should not change for backward compatibility reasons.

Because scripts or libraries files might not currently have the
executable permission set, or because we might want specific users to be
allowed to run arbitrary scripts, the following patch provides a dynamic
configuration mechanism with the SECBIT_EXEC_RESTRICT_FILE and
SECBIT_EXEC_DENY_INTERACTIVE securebits.

This is a redesign of the CLIP OS 4's O_MAYEXEC:
https://github.com/clipos-archive/src_platform_clip-patches/blob/f5cb330d6b684752e403b4e41b39f7004d88e561/1901_open_mayexec.patch
This patch has been used for more than a decade with customized script
interpreters.  Some examples can be found here:
https://github.com/clipos-archive/clipos4_portage-overlay/search?q=O_MAYEXEC

Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Christian Brauner &lt;brauner@kernel.org&gt;
Cc: Kees Cook &lt;keescook@chromium.org&gt;
Acked-by: Paul Moore &lt;paul@paul-moore.com&gt;
Reviewed-by: Serge Hallyn &lt;serge@hallyn.com&gt;
Reviewed-by: Jeff Xu &lt;jeffxu@chromium.org&gt;
Tested-by: Jeff Xu &lt;jeffxu@chromium.org&gt;
Link: https://docs.python.org/3/library/io.html#io.open_code [1]
Signed-off-by: Mickaël Salaün &lt;mic@digikod.net&gt;
Link: https://lore.kernel.org/r/20241212174223.389435-2-mic@digikod.net
Signed-off-by: Kees Cook &lt;kees@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Add a new AT_EXECVE_CHECK flag to execveat(2) to check if a file would
be allowed for execution.  The main use case is for script interpreters
and dynamic linkers to check execution permission according to the
kernel's security policy. Another use case is to add context to access
logs e.g., which script (instead of interpreter) accessed a file.  As
any executable code, scripts could also use this check [1].

This is different from faccessat(2) + X_OK which only checks a subset of
access rights (i.e. inode permission and mount options for regular
files), but not the full context (e.g. all LSM access checks).  The main
use case for access(2) is for SUID processes to (partially) check access
on behalf of their caller.  The main use case for execveat(2) +
AT_EXECVE_CHECK is to check if a script execution would be allowed,
according to all the different restrictions in place.  Because the use
of AT_EXECVE_CHECK follows the exact kernel semantic as for a real
execution, user space gets the same error codes.

An interesting point of using execveat(2) instead of openat2(2) is that
it decouples the check from the enforcement.  Indeed, the security check
can be logged (e.g. with audit) without blocking an execution
environment not yet ready to enforce a strict security policy.

LSMs can control or log execution requests with
security_bprm_creds_for_exec().  However, to enforce a consistent and
complete access control (e.g. on binary's dependencies) LSMs should
restrict file executability, or measure executed files, with
security_file_open() by checking file-&gt;f_flags &amp; __FMODE_EXEC.

Because AT_EXECVE_CHECK is dedicated to user space interpreters, it
doesn't make sense for the kernel to parse the checked files, look for
interpreters known to the kernel (e.g. ELF, shebang), and return ENOEXEC
if the format is unknown.  Because of that, security_bprm_check() is
never called when AT_EXECVE_CHECK is used.

It should be noted that script interpreters cannot directly use
execveat(2) (without this new AT_EXECVE_CHECK flag) because this could
lead to unexpected behaviors e.g., `python script.sh` could lead to Bash
being executed to interpret the script.  Unlike the kernel, script
interpreters may just interpret the shebang as a simple comment, which
should not change for backward compatibility reasons.

Because scripts or libraries files might not currently have the
executable permission set, or because we might want specific users to be
allowed to run arbitrary scripts, the following patch provides a dynamic
configuration mechanism with the SECBIT_EXEC_RESTRICT_FILE and
SECBIT_EXEC_DENY_INTERACTIVE securebits.

This is a redesign of the CLIP OS 4's O_MAYEXEC:
https://github.com/clipos-archive/src_platform_clip-patches/blob/f5cb330d6b684752e403b4e41b39f7004d88e561/1901_open_mayexec.patch
This patch has been used for more than a decade with customized script
interpreters.  Some examples can be found here:
https://github.com/clipos-archive/clipos4_portage-overlay/search?q=O_MAYEXEC

Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Christian Brauner &lt;brauner@kernel.org&gt;
Cc: Kees Cook &lt;keescook@chromium.org&gt;
Acked-by: Paul Moore &lt;paul@paul-moore.com&gt;
Reviewed-by: Serge Hallyn &lt;serge@hallyn.com&gt;
Reviewed-by: Jeff Xu &lt;jeffxu@chromium.org&gt;
Tested-by: Jeff Xu &lt;jeffxu@chromium.org&gt;
Link: https://docs.python.org/3/library/io.html#io.open_code [1]
Signed-off-by: Mickaël Salaün &lt;mic@digikod.net&gt;
Link: https://lore.kernel.org/r/20241212174223.389435-2-mic@digikod.net
Signed-off-by: Kees Cook &lt;kees@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>exec: fix up /proc/pid/comm in the execveat(AT_EMPTY_PATH) case</title>
<updated>2024-12-17T00:54:00+00:00</updated>
<author>
<name>Kees Cook</name>
<email>kees@kernel.org</email>
</author>
<published>2024-11-21T15:07:05+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=543841d1806029889c2f69f040e88b247aba8e22'/>
<id>543841d1806029889c2f69f040e88b247aba8e22</id>
<content type='text'>
Zbigniew mentioned at Linux Plumber's that systemd is interested in
switching to execveat() for service execution, but can't, because the
contents of /proc/pid/comm are the file descriptor which was used,
instead of the path to the binary[1]. This makes the output of tools like
top and ps useless, especially in a world where most fds are opened
CLOEXEC so the number is truly meaningless.

When the filename passed in is empty (e.g. with AT_EMPTY_PATH), use the
dentry's filename for "comm" instead of using the useless numeral from
the synthetic fdpath construction. This way the actual exec machinery
is unchanged, but cosmetically the comm looks reasonable to admins
investigating things.

Instead of adding TASK_COMM_LEN more bytes to bprm, use one of the unused
flag bits to indicate that we need to set "comm" from the dentry.

Suggested-by: Zbigniew Jędrzejewski-Szmek &lt;zbyszek@in.waw.pl&gt;
Suggested-by: Tycho Andersen &lt;tandersen@netflix.com&gt;
Suggested-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Suggested-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Link: https://github.com/uapi-group/kernel-features#set-comm-field-before-exec [1]
Reviewed-by: Aleksa Sarai &lt;cyphar@cyphar.com&gt;
Tested-by: Zbigniew Jędrzejewski-Szmek &lt;zbyszek@in.waw.pl&gt;
Signed-off-by: Kees Cook &lt;kees@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Zbigniew mentioned at Linux Plumber's that systemd is interested in
switching to execveat() for service execution, but can't, because the
contents of /proc/pid/comm are the file descriptor which was used,
instead of the path to the binary[1]. This makes the output of tools like
top and ps useless, especially in a world where most fds are opened
CLOEXEC so the number is truly meaningless.

When the filename passed in is empty (e.g. with AT_EMPTY_PATH), use the
dentry's filename for "comm" instead of using the useless numeral from
the synthetic fdpath construction. This way the actual exec machinery
is unchanged, but cosmetically the comm looks reasonable to admins
investigating things.

Instead of adding TASK_COMM_LEN more bytes to bprm, use one of the unused
flag bits to indicate that we need to set "comm" from the dentry.

Suggested-by: Zbigniew Jędrzejewski-Szmek &lt;zbyszek@in.waw.pl&gt;
Suggested-by: Tycho Andersen &lt;tandersen@netflix.com&gt;
Suggested-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Suggested-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Link: https://github.com/uapi-group/kernel-features#set-comm-field-before-exec [1]
Reviewed-by: Aleksa Sarai &lt;cyphar@cyphar.com&gt;
Tested-by: Zbigniew Jędrzejewski-Szmek &lt;zbyszek@in.waw.pl&gt;
Signed-off-by: Kees Cook &lt;kees@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>execve: Keep bprm-&gt;argmin behind CONFIG_MMU</title>
<updated>2024-07-14T04:31:57+00:00</updated>
<author>
<name>Kees Cook</name>
<email>kees@kernel.org</email>
</author>
<published>2024-06-21T20:50:43+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=084ebf7ca83e6cb743784f2eecc654193ce064fb'/>
<id>084ebf7ca83e6cb743784f2eecc654193ce064fb</id>
<content type='text'>
When argmin was added in commit 655c16a8ce9c ("exec: separate
MM_ANONPAGES and RLIMIT_STACK accounting"), it was intended only for
validating stack limits on CONFIG_MMU[1]. All checking for reaching the
limit (argmin) is wrapped in CONFIG_MMU ifdef checks, though setting
argmin was not. That argmin is only supposed to be used under CONFIG_MMU
was rediscovered recently[2], and I don't want to trip over this again.

Move argmin's declaration into the existing CONFIG_MMU area, and add
helpers functions so the MMU tests can be consolidated.

Link: https://lore.kernel.org/all/20181126122307.GA1660@redhat.com [1]
Link: https://lore.kernel.org/all/202406211253.7037F69@keescook/ [2]
Link: https://lore.kernel.org/r/20240621205046.4001362-1-kees@kernel.org
Signed-off-by: Kees Cook &lt;kees@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
When argmin was added in commit 655c16a8ce9c ("exec: separate
MM_ANONPAGES and RLIMIT_STACK accounting"), it was intended only for
validating stack limits on CONFIG_MMU[1]. All checking for reaching the
limit (argmin) is wrapped in CONFIG_MMU ifdef checks, though setting
argmin was not. That argmin is only supposed to be used under CONFIG_MMU
was rediscovered recently[2], and I don't want to trip over this again.

Move argmin's declaration into the existing CONFIG_MMU area, and add
helpers functions so the MMU tests can be consolidated.

Link: https://lore.kernel.org/all/20181126122307.GA1660@redhat.com [1]
Link: https://lore.kernel.org/all/202406211253.7037F69@keescook/ [2]
Link: https://lore.kernel.org/r/20240621205046.4001362-1-kees@kernel.org
Signed-off-by: Kees Cook &lt;kees@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>binfmt_misc: enable sandboxed mounts</title>
<updated>2023-10-11T15:46:01+00:00</updated>
<author>
<name>Christian Brauner</name>
<email>christian.brauner@ubuntu.com</email>
</author>
<published>2021-10-28T10:31:14+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=21ca59b365c091d583f36ac753eaa8baf947be6f'/>
<id>21ca59b365c091d583f36ac753eaa8baf947be6f</id>
<content type='text'>
Enable unprivileged sandboxes to create their own binfmt_misc mounts.
This is based on Laurent's work in [1] but has been significantly
reworked to fix various issues we identified in earlier versions.

While binfmt_misc can currently only be mounted in the initial user
namespace, binary types registered in this binfmt_misc instance are
available to all sandboxes (Either by having them installed in the
sandbox or by registering the binary type with the F flag causing the
interpreter to be opened right away). So binfmt_misc binary types are
already delegated to sandboxes implicitly.

However, while a sandbox has access to all registered binary types in
binfmt_misc a sandbox cannot currently register its own binary types
in binfmt_misc. This has prevented various use-cases some of which were
already outlined in [1] but we have a range of issues associated with
this (cf. [3]-[5] below which are just a small sample).

Extend binfmt_misc to be mountable in non-initial user namespaces.
Similar to other filesystem such as nfsd, mqueue, and sunrpc we use
keyed superblock management. The key determines whether we need to
create a new superblock or can reuse an already existing one. We use the
user namespace of the mount as key. This means a new binfmt_misc
superblock is created once per user namespace creation. Subsequent
mounts of binfmt_misc in the same user namespace will mount the same
binfmt_misc instance. We explicitly do not create a new binfmt_misc
superblock on every binfmt_misc mount as the semantics for
load_misc_binary() line up with the keying model. This also allows us to
retrieve the relevant binfmt_misc instance based on the caller's user
namespace which can be done in a simple (bounded to 32 levels) loop.

Similar to the current binfmt_misc semantics allowing access to the
binary types in the initial binfmt_misc instance we do allow sandboxes
access to their parent's binfmt_misc mounts if they do not have created
a separate binfmt_misc instance.

Overall, this will unblock the use-cases mentioned below and in general
will also allow to support and harden execution of another
architecture's binaries in tight sandboxes. For instance, using the
unshare binary it possible to start a chroot of another architecture and
configure the binfmt_misc interpreter without being root to run the
binaries in this chroot and without requiring the host to modify its
binary type handlers.

Henning had already posted a few experiments in the cover letter at [1].
But here's an additional example where an unprivileged container
registers qemu-user-static binary handlers for various binary types in
its separate binfmt_misc mount and is then seamlessly able to start
containers with a different architecture without affecting the host:

root    [lxc monitor] /var/snap/lxd/common/lxd/containers f1
1000000  \_ /sbin/init
1000000      \_ /lib/systemd/systemd-journald
1000000      \_ /lib/systemd/systemd-udevd
1000100      \_ /lib/systemd/systemd-networkd
1000101      \_ /lib/systemd/systemd-resolved
1000000      \_ /usr/sbin/cron -f
1000103      \_ /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
1000000      \_ /usr/bin/python3 /usr/bin/networkd-dispatcher --run-startup-triggers
1000104      \_ /usr/sbin/rsyslogd -n -iNONE
1000000      \_ /lib/systemd/systemd-logind
1000000      \_ /sbin/agetty -o -p -- \u --noclear --keep-baud console 115200,38400,9600 vt220
1000107      \_ dnsmasq --conf-file=/dev/null -u lxc-dnsmasq --strict-order --bind-interfaces --pid-file=/run/lxc/dnsmasq.pid --liste
1000000      \_ [lxc monitor] /var/lib/lxc f1-s390x
1100000          \_ /usr/bin/qemu-s390x-static /sbin/init
1100000              \_ /usr/bin/qemu-s390x-static /lib/systemd/systemd-journald
1100000              \_ /usr/bin/qemu-s390x-static /usr/sbin/cron -f
1100103              \_ /usr/bin/qemu-s390x-static /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-ac
1100000              \_ /usr/bin/qemu-s390x-static /usr/bin/python3 /usr/bin/networkd-dispatcher --run-startup-triggers
1100104              \_ /usr/bin/qemu-s390x-static /usr/sbin/rsyslogd -n -iNONE
1100000              \_ /usr/bin/qemu-s390x-static /lib/systemd/systemd-logind
1100000              \_ /usr/bin/qemu-s390x-static /sbin/agetty -o -p -- \u --noclear --keep-baud console 115200,38400,9600 vt220
1100000              \_ /usr/bin/qemu-s390x-static /sbin/agetty -o -p -- \u --noclear --keep-baud pts/0 115200,38400,9600 vt220
1100000              \_ /usr/bin/qemu-s390x-static /sbin/agetty -o -p -- \u --noclear --keep-baud pts/1 115200,38400,9600 vt220
1100000              \_ /usr/bin/qemu-s390x-static /sbin/agetty -o -p -- \u --noclear --keep-baud pts/2 115200,38400,9600 vt220
1100000              \_ /usr/bin/qemu-s390x-static /sbin/agetty -o -p -- \u --noclear --keep-baud pts/3 115200,38400,9600 vt220
1100000              \_ /usr/bin/qemu-s390x-static /lib/systemd/systemd-udevd

[1]: https://lore.kernel.org/all/20191216091220.465626-1-laurent@vivier.eu
[2]: https://discuss.linuxcontainers.org/t/binfmt-misc-permission-denied
[3]: https://discuss.linuxcontainers.org/t/lxd-binfmt-support-for-qemu-static-interpreters
[4]: https://discuss.linuxcontainers.org/t/3-1-0-binfmt-support-service-in-unprivileged-guest-requires-write-access-on-hosts-proc-sys-fs-binfmt-misc
[5]: https://discuss.linuxcontainers.org/t/qemu-user-static-not-working-4-11

Link: https://lore.kernel.org/r/20191216091220.465626-2-laurent@vivier.eu (origin)
Link: https://lore.kernel.org/r/20211028103114.2849140-2-brauner@kernel.org (v1)
Cc: Sargun Dhillon &lt;sargun@sargun.me&gt;
Cc: Serge Hallyn &lt;serge@hallyn.com&gt;
Cc: Jann Horn &lt;jannh@google.com&gt;
Cc: Henning Schild &lt;henning.schild@siemens.com&gt;
Cc: Andrei Vagin &lt;avagin@gmail.com&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Laurent Vivier &lt;laurent@vivier.eu&gt;
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Laurent Vivier &lt;laurent@vivier.eu&gt;
Signed-off-by: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
Signed-off-by: Kees Cook &lt;keescook@chromium.org&gt;
---
/* v2 */
- Serge Hallyn &lt;serge@hallyn.com&gt;:
  - Use GFP_KERNEL_ACCOUNT for userspace triggered allocations when a
    new binary type handler is registered.
- Christian Brauner &lt;christian.brauner@ubuntu.com&gt;:
  - Switch authorship to me. I refused to do that earlier even though
    Laurent said I should do so because I think it's genuinely bad form.
    But by now I have changed so many things that it'd be unfair to
    blame Laurent for any potential bugs in here.
  - Add more comments that explain what's going on.
  - Rename functions while changing them to better reflect what they are
    doing to make the code easier to understand.
  - In the first version when a specific binary type handler was removed
    either through a write to the entry's file or all binary type
    handlers were removed by a write to the binfmt_misc mount's status
    file all cleanup work happened during inode eviction.
    That includes removal of the relevant entries from entry list. While
    that works fine I disliked that model after thinking about it for a
    bit. Because it means that there was a window were someone has
    already removed a or all binary handlers but they could still be
    safely reached from load_misc_binary() when it has managed to take
    the read_lock() on the entries list while inode eviction was already
    happening. Again, that perfectly benign but it's cleaner to remove
    the binary handler from the list immediately meaning that ones the
    write to then entry's file or the binfmt_misc status file returns
    the binary type cannot be executed anymore. That gives stronger
    guarantees to the user.
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Enable unprivileged sandboxes to create their own binfmt_misc mounts.
This is based on Laurent's work in [1] but has been significantly
reworked to fix various issues we identified in earlier versions.

While binfmt_misc can currently only be mounted in the initial user
namespace, binary types registered in this binfmt_misc instance are
available to all sandboxes (Either by having them installed in the
sandbox or by registering the binary type with the F flag causing the
interpreter to be opened right away). So binfmt_misc binary types are
already delegated to sandboxes implicitly.

However, while a sandbox has access to all registered binary types in
binfmt_misc a sandbox cannot currently register its own binary types
in binfmt_misc. This has prevented various use-cases some of which were
already outlined in [1] but we have a range of issues associated with
this (cf. [3]-[5] below which are just a small sample).

Extend binfmt_misc to be mountable in non-initial user namespaces.
Similar to other filesystem such as nfsd, mqueue, and sunrpc we use
keyed superblock management. The key determines whether we need to
create a new superblock or can reuse an already existing one. We use the
user namespace of the mount as key. This means a new binfmt_misc
superblock is created once per user namespace creation. Subsequent
mounts of binfmt_misc in the same user namespace will mount the same
binfmt_misc instance. We explicitly do not create a new binfmt_misc
superblock on every binfmt_misc mount as the semantics for
load_misc_binary() line up with the keying model. This also allows us to
retrieve the relevant binfmt_misc instance based on the caller's user
namespace which can be done in a simple (bounded to 32 levels) loop.

Similar to the current binfmt_misc semantics allowing access to the
binary types in the initial binfmt_misc instance we do allow sandboxes
access to their parent's binfmt_misc mounts if they do not have created
a separate binfmt_misc instance.

Overall, this will unblock the use-cases mentioned below and in general
will also allow to support and harden execution of another
architecture's binaries in tight sandboxes. For instance, using the
unshare binary it possible to start a chroot of another architecture and
configure the binfmt_misc interpreter without being root to run the
binaries in this chroot and without requiring the host to modify its
binary type handlers.

Henning had already posted a few experiments in the cover letter at [1].
But here's an additional example where an unprivileged container
registers qemu-user-static binary handlers for various binary types in
its separate binfmt_misc mount and is then seamlessly able to start
containers with a different architecture without affecting the host:

root    [lxc monitor] /var/snap/lxd/common/lxd/containers f1
1000000  \_ /sbin/init
1000000      \_ /lib/systemd/systemd-journald
1000000      \_ /lib/systemd/systemd-udevd
1000100      \_ /lib/systemd/systemd-networkd
1000101      \_ /lib/systemd/systemd-resolved
1000000      \_ /usr/sbin/cron -f
1000103      \_ /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
1000000      \_ /usr/bin/python3 /usr/bin/networkd-dispatcher --run-startup-triggers
1000104      \_ /usr/sbin/rsyslogd -n -iNONE
1000000      \_ /lib/systemd/systemd-logind
1000000      \_ /sbin/agetty -o -p -- \u --noclear --keep-baud console 115200,38400,9600 vt220
1000107      \_ dnsmasq --conf-file=/dev/null -u lxc-dnsmasq --strict-order --bind-interfaces --pid-file=/run/lxc/dnsmasq.pid --liste
1000000      \_ [lxc monitor] /var/lib/lxc f1-s390x
1100000          \_ /usr/bin/qemu-s390x-static /sbin/init
1100000              \_ /usr/bin/qemu-s390x-static /lib/systemd/systemd-journald
1100000              \_ /usr/bin/qemu-s390x-static /usr/sbin/cron -f
1100103              \_ /usr/bin/qemu-s390x-static /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-ac
1100000              \_ /usr/bin/qemu-s390x-static /usr/bin/python3 /usr/bin/networkd-dispatcher --run-startup-triggers
1100104              \_ /usr/bin/qemu-s390x-static /usr/sbin/rsyslogd -n -iNONE
1100000              \_ /usr/bin/qemu-s390x-static /lib/systemd/systemd-logind
1100000              \_ /usr/bin/qemu-s390x-static /sbin/agetty -o -p -- \u --noclear --keep-baud console 115200,38400,9600 vt220
1100000              \_ /usr/bin/qemu-s390x-static /sbin/agetty -o -p -- \u --noclear --keep-baud pts/0 115200,38400,9600 vt220
1100000              \_ /usr/bin/qemu-s390x-static /sbin/agetty -o -p -- \u --noclear --keep-baud pts/1 115200,38400,9600 vt220
1100000              \_ /usr/bin/qemu-s390x-static /sbin/agetty -o -p -- \u --noclear --keep-baud pts/2 115200,38400,9600 vt220
1100000              \_ /usr/bin/qemu-s390x-static /sbin/agetty -o -p -- \u --noclear --keep-baud pts/3 115200,38400,9600 vt220
1100000              \_ /usr/bin/qemu-s390x-static /lib/systemd/systemd-udevd

[1]: https://lore.kernel.org/all/20191216091220.465626-1-laurent@vivier.eu
[2]: https://discuss.linuxcontainers.org/t/binfmt-misc-permission-denied
[3]: https://discuss.linuxcontainers.org/t/lxd-binfmt-support-for-qemu-static-interpreters
[4]: https://discuss.linuxcontainers.org/t/3-1-0-binfmt-support-service-in-unprivileged-guest-requires-write-access-on-hosts-proc-sys-fs-binfmt-misc
[5]: https://discuss.linuxcontainers.org/t/qemu-user-static-not-working-4-11

Link: https://lore.kernel.org/r/20191216091220.465626-2-laurent@vivier.eu (origin)
Link: https://lore.kernel.org/r/20211028103114.2849140-2-brauner@kernel.org (v1)
Cc: Sargun Dhillon &lt;sargun@sargun.me&gt;
Cc: Serge Hallyn &lt;serge@hallyn.com&gt;
Cc: Jann Horn &lt;jannh@google.com&gt;
Cc: Henning Schild &lt;henning.schild@siemens.com&gt;
Cc: Andrei Vagin &lt;avagin@gmail.com&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Laurent Vivier &lt;laurent@vivier.eu&gt;
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Laurent Vivier &lt;laurent@vivier.eu&gt;
Signed-off-by: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
Signed-off-by: Christian Brauner &lt;brauner@kernel.org&gt;
Signed-off-by: Kees Cook &lt;keescook@chromium.org&gt;
---
/* v2 */
- Serge Hallyn &lt;serge@hallyn.com&gt;:
  - Use GFP_KERNEL_ACCOUNT for userspace triggered allocations when a
    new binary type handler is registered.
- Christian Brauner &lt;christian.brauner@ubuntu.com&gt;:
  - Switch authorship to me. I refused to do that earlier even though
    Laurent said I should do so because I think it's genuinely bad form.
    But by now I have changed so many things that it'd be unfair to
    blame Laurent for any potential bugs in here.
  - Add more comments that explain what's going on.
  - Rename functions while changing them to better reflect what they are
    doing to make the code easier to understand.
  - In the first version when a specific binary type handler was removed
    either through a write to the entry's file or all binary type
    handlers were removed by a write to the binfmt_misc mount's status
    file all cleanup work happened during inode eviction.
    That includes removal of the relevant entries from entry list. While
    that works fine I disliked that model after thinking about it for a
    bit. Because it means that there was a window were someone has
    already removed a or all binary handlers but they could still be
    safely reached from load_misc_binary() when it has managed to take
    the read_lock() on the entries list while inode eviction was already
    happening. Again, that perfectly benign but it's cleaner to remove
    the binary handler from the list immediately meaning that ones the
    write to then entry's file or the binfmt_misc status file returns
    the binary type cannot be executed anymore. That gives stronger
    guarantees to the user.
</pre>
</div>
</content>
</entry>
<entry>
<title>binfmt: remove taso from linux_binprm struct</title>
<updated>2022-09-29T23:38:05+00:00</updated>
<author>
<name>Lukas Bulwahn</name>
<email>lukas.bulwahn@gmail.com</email>
</author>
<published>2022-09-29T20:39:03+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=9f4beead610c83065cc0410bfe97ff51d8e9578d'/>
<id>9f4beead610c83065cc0410bfe97ff51d8e9578d</id>
<content type='text'>
With commit 987f20a9dcce ("a.out: Remove the a.out implementation"), the
use of the special taso flag for alpha architectures in the linux_binprm
struct is gone.

Remove the definition of taso in the linux_binprm struct.

No functional change.

Signed-off-by: Lukas Bulwahn &lt;lukas.bulwahn@gmail.com&gt;
Reviewed-by: "Eric W. Biederman" &lt;ebiederm@xmission.com&gt;
Signed-off-by: Kees Cook &lt;keescook@chromium.org&gt;
Link: https://lore.kernel.org/r/20220929203903.9475-1-lukas.bulwahn@gmail.com
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
With commit 987f20a9dcce ("a.out: Remove the a.out implementation"), the
use of the special taso flag for alpha architectures in the linux_binprm
struct is gone.

Remove the definition of taso in the linux_binprm struct.

No functional change.

Signed-off-by: Lukas Bulwahn &lt;lukas.bulwahn@gmail.com&gt;
Reviewed-by: "Eric W. Biederman" &lt;ebiederm@xmission.com&gt;
Signed-off-by: Kees Cook &lt;keescook@chromium.org&gt;
Link: https://lore.kernel.org/r/20220929203903.9475-1-lukas.bulwahn@gmail.com
</pre>
</div>
</content>
</entry>
<entry>
<title>coredump: Move definition of struct coredump_params into coredump.h</title>
<updated>2022-03-08T18:50:58+00:00</updated>
<author>
<name>Eric W. Biederman</name>
<email>ebiederm@xmission.com</email>
</author>
<published>2022-01-31T17:50:46+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=a99a3e2efaf1f4454eb5c9176f47e66de075b134'/>
<id>a99a3e2efaf1f4454eb5c9176f47e66de075b134</id>
<content type='text'>
Move the definition of struct coredump_params into coredump.h where
it belongs.

Remove the slightly errorneous comment explaining why struct
coredump_params was declared in binfmts.h.

Signed-off-by: "Eric W. Biederman" &lt;ebiederm@xmission.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Move the definition of struct coredump_params into coredump.h where
it belongs.

Remove the slightly errorneous comment explaining why struct
coredump_params was declared in binfmts.h.

Signed-off-by: "Eric W. Biederman" &lt;ebiederm@xmission.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>binfmt: move more stuff undef CONFIG_COREDUMP</title>
<updated>2022-03-02T00:16:27+00:00</updated>
<author>
<name>Alexey Dobriyan</name>
<email>adobriyan@gmail.com</email>
</author>
<published>2022-02-13T19:25:20+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=d65bc29be0ae4ca2368df25dc6f6247aefb57f07'/>
<id>d65bc29be0ae4ca2368df25dc6f6247aefb57f07</id>
<content type='text'>
struct linux_binfmt::core_dump and struct min_coredump::min_coredump
are used under CONFIG_COREDUMP only. Shrink those embedded configs
a bit.

Signed-off-by: Alexey Dobriyan &lt;adobriyan@gmail.com&gt;
Signed-off-by: Kees Cook &lt;keescook@chromium.org&gt;
Link: https://lore.kernel.org/r/YglbIFyN+OtwVyjW@localhost.localdomain
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
struct linux_binfmt::core_dump and struct min_coredump::min_coredump
are used under CONFIG_COREDUMP only. Shrink those embedded configs
a bit.

Signed-off-by: Alexey Dobriyan &lt;adobriyan@gmail.com&gt;
Signed-off-by: Kees Cook &lt;keescook@chromium.org&gt;
Link: https://lore.kernel.org/r/YglbIFyN+OtwVyjW@localhost.localdomain
</pre>
</div>
</content>
</entry>
</feed>
