summaryrefslogtreecommitdiff
path: root/fs/binfmt_elf.c
AgeCommit message (Collapse)AuthorFilesLines
2025-05-22binfmt_elf: Move brk for static PIE even if ASLR disabledKees Cook1-24/+47
[ Upstream commit 11854fe263eb1b9a8efa33b0c087add7719ea9b4 ] In commit bbdc6076d2e5 ("binfmt_elf: move brk out of mmap when doing direct loader exec"), the brk was moved out of the mmap region when loading static PIE binaries (ET_DYN without INTERP). The common case for these binaries was testing new ELF loaders, so the brk needed to be away from mmap to avoid colliding with stack, future mmaps (of the loader-loaded binary), etc. But this was only done when ASLR was enabled, in an attempt to minimize changes to memory layouts. After adding support to respect alignment requirements for static PIE binaries in commit 3545deff0ec7 ("binfmt_elf: Honor PT_LOAD alignment for static PIE"), it became possible to have a large gap after the final PT_LOAD segment and the top of the mmap region. This means that future mmap allocations might go after the last PT_LOAD segment (where brk might be if ASLR was disabled) instead of before them (where they traditionally ended up). On arm64, running with ASLR disabled, Ubuntu 22.04's "ldconfig" binary, a static PIE, has alignment requirements that leaves a gap large enough after the last PT_LOAD segment to fit the vdso and vvar, but still leave enough space for the brk (which immediately follows the last PT_LOAD segment) to be allocated by the binary. fffff7f20000-fffff7fde000 r-xp 00000000 fe:02 8110426 /sbin/ldconfig.real fffff7fee000-fffff7ff5000 rw-p 000be000 fe:02 8110426 /sbin/ldconfig.real fffff7ff5000-fffff7ffa000 rw-p 00000000 00:00 0 ***[brk will go here at fffff7ffa000]*** fffff7ffc000-fffff7ffe000 r--p 00000000 00:00 0 [vvar] fffff7ffe000-fffff8000000 r-xp 00000000 00:00 0 [vdso] fffffffdf000-1000000000000 rw-p 00000000 00:00 0 [stack] After commit 0b3bc3354eb9 ("arm64: vdso: Switch to generic storage implementation"), the arm64 vvar grew slightly, and suddenly the brk collided with the allocation. fffff7f20000-fffff7fde000 r-xp 00000000 fe:02 8110426 /sbin/ldconfig.real fffff7fee000-fffff7ff5000 rw-p 000be000 fe:02 8110426 /sbin/ldconfig.real fffff7ff5000-fffff7ffa000 rw-p 00000000 00:00 0 ***[oops, no room any more, vvar is at fffff7ffa000!]*** fffff7ffa000-fffff7ffe000 r--p 00000000 00:00 0 [vvar] fffff7ffe000-fffff8000000 r-xp 00000000 00:00 0 [vdso] fffffffdf000-1000000000000 rw-p 00000000 00:00 0 [stack] The solution is to unconditionally move the brk out of the mmap region for static PIE binaries. Whether ASLR is enabled or not does not change if there may be future mmap allocation collisions with a growing brk region. Update memory layout comments (with kernel-doc headings), consolidate the setting of mm->brk to later (it isn't needed early), move static PIE brk out of mmap unconditionally, and make sure brk(2) knows to base brk position off of mm->start_brk not mm->end_data no matter what the cause of moving it is (via current->brk_randomized). For the CONFIG_COMPAT_BRK case, though, leave the logic unchanged, as we can never safely move the brk. These systems, however, are not using specially aligned static PIE binaries. Reported-by: Ryan Roberts <ryan.roberts@arm.com> Closes: https://lore.kernel.org/lkml/f93db308-4a0e-4806-9faf-98f890f5a5e6@arm.com/ Fixes: bbdc6076d2e5 ("binfmt_elf: move brk out of mmap when doing direct loader exec") Link: https://lore.kernel.org/r/20250425224502.work.520-kees@kernel.org Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Tested-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-22binfmt_elf: Honor PT_LOAD alignment for static PIEKees Cook1-5/+37
[ Upstream commit 3545deff0ec7a37de7ed9632e262598582b140e9 ] The p_align values in PT_LOAD were ignored for static PIE executables (i.e. ET_DYN without PT_INTERP). This is because there is no way to request a non-fixed mmap region with a specific alignment. ET_DYN with PT_INTERP uses a separate base address (ELF_ET_DYN_BASE) and binfmt_elf performs the ASLR itself, which means it can also apply alignment. For the mmap region, the address selection happens deep within the vm_mmap() implementation (when the requested address is 0). The earlier attempt to implement this: commit 9630f0d60fec ("fs/binfmt_elf: use PT_LOAD p_align values for static PIE") commit 925346c129da ("fs/binfmt_elf: fix PT_LOAD p_align values for loaders") did not take into account the different base address origins, and were eventually reverted: aeb7923733d1 ("revert "fs/binfmt_elf: use PT_LOAD p_align values for static PIE"") In order to get the correct alignment from an mmap base, binfmt_elf must perform a 0-address load first, then tear down the mapping and perform alignment on the resulting address. Since this is slightly more overhead, only do this when it is needed (i.e. the alignment is not the default ELF alignment). This does, however, have the benefit of being able to use MAP_FIXED_NOREPLACE, to avoid potential collisions. With this fixed, enable the static PIE self tests again. Reported-by: H.J. Lu <hjl.tools@gmail.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=215275 Link: https://lore.kernel.org/r/20240508173149.677910-3-keescook@chromium.org Signed-off-by: Kees Cook <kees@kernel.org> Stable-dep-of: 11854fe263eb ("binfmt_elf: Move brk for static PIE even if ASLR disabled") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-22binfmt_elf: Calculate total_size earlierKees Cook1-25/+27
[ Upstream commit 2d4cf7b190bbfadd4986bf5c34da17c1a88adf8e ] In preparation to support PT_LOAD with large p_align values on non-PT_INTERP ET_DYN executables (i.e. "static pie"), we'll need to use the total_size details earlier. Move this separately now to make the next patch more readable. As total_size and load_bias are currently calculated separately, this has no behavioral impact. Link: https://lore.kernel.org/r/20240508173149.677910-2-keescook@chromium.org Signed-off-by: Kees Cook <kees@kernel.org> Stable-dep-of: 11854fe263eb ("binfmt_elf: Move brk for static PIE even if ASLR disabled") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-22binfmt_elf: Leave a gap between .bss and brkKees Cook1-0/+3
[ Upstream commit 2a5eb9995528441447d33838727f6ec1caf08139 ] Currently the brk starts its randomization immediately after .bss, which means there is a chance that when the random offset is 0, linear overflows from .bss can reach into the brk area. Leave at least a single page gap between .bss and brk (when it has not already been explicitly relocated into the mmap range). Reported-by: <y0un9n132@gmail.com> Closes: https://lore.kernel.org/linux-hardening/CA+2EKTVLvc8hDZc+2Yhwmus=dzOUG5E4gV7ayCbu0MPJTZzWkw@mail.gmail.com/ Link: https://lore.kernel.org/r/20240217062545.1631668-2-keescook@chromium.org Signed-off-by: Kees Cook <keescook@chromium.org> Stable-dep-of: 11854fe263eb ("binfmt_elf: Move brk for static PIE even if ASLR disabled") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-22binfmt_elf: elf_bss no longer used by load_elf_binary()Kees Cook1-5/+1
[ Upstream commit 8ed2ef21ff564cf4a25c098ace510ee6513c9836 ] With the BSS handled generically via the new filesz/memsz mismatch handling logic in elf_load(), elf_bss no longer needs to be tracked. Drop the variable. Cc: Eric Biederman <ebiederm@xmission.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: linux-fsdevel@vger.kernel.org Cc: linux-mm@kvack.org Suggested-by: Eric Biederman <ebiederm@xmission.com> Tested-by: Pedro Falcato <pedro.falcato@gmail.com> Signed-off-by: Sebastian Ott <sebott@redhat.com> Link: https://lore.kernel.org/r/20230929032435.2391507-2-keescook@chromium.org Signed-off-by: Kees Cook <keescook@chromium.org> Stable-dep-of: 11854fe263eb ("binfmt_elf: Move brk for static PIE even if ASLR disabled") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-22binfmt_elf: Support segments with 0 filesz and misaligned startsEric W. Biederman1-63/+48
[ Upstream commit 585a018627b4d7ed37387211f667916840b5c5ea ] Implement a helper elf_load() that wraps elf_map() and performs all of the necessary work to ensure that when "memsz > filesz" the bytes described by "memsz > filesz" are zeroed. An outstanding issue is if the first segment has filesz 0, and has a randomized location. But that is the same as today. In this change I replaced an open coded padzero() that did not clear all of the way to the end of the page, with padzero() that does. I also stopped checking the return of padzero() as there is at least one known case where testing for failure is the wrong thing to do. It looks like binfmt_elf_fdpic may have the proper set of tests for when error handling can be safely completed. I found a couple of commits in the old history https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git, that look very interesting in understanding this code. commit 39b56d902bf3 ("[PATCH] binfmt_elf: clearing bss may fail") commit c6e2227e4a3e ("[SPARC64]: Missing user access return value checks in fs/binfmt_elf.c and fs/compat.c") commit 5bf3be033f50 ("v2.4.10.1 -> v2.4.10.2") Looking at commit 39b56d902bf3 ("[PATCH] binfmt_elf: clearing bss may fail"): > commit 39b56d902bf35241e7cba6cc30b828ed937175ad > Author: Pavel Machek <pavel@ucw.cz> > Date: Wed Feb 9 22:40:30 2005 -0800 > > [PATCH] binfmt_elf: clearing bss may fail > > So we discover that Borland's Kylix application builder emits weird elf > files which describe a non-writeable bss segment. > > So remove the clear_user() check at the place where we zero out the bss. I > don't _think_ there are any security implications here (plus we've never > checked that clear_user() return value, so whoops if it is a problem). > > Signed-off-by: Pavel Machek <pavel@suse.cz> > Signed-off-by: Andrew Morton <akpm@osdl.org> > Signed-off-by: Linus Torvalds <torvalds@osdl.org> It seems pretty clear that binfmt_elf_fdpic with skipping clear_user() for non-writable segments and otherwise calling clear_user(), aka padzero(), and checking it's return code is the right thing to do. I just skipped the error checking as that avoids breaking things. And notably, it looks like Borland's Kylix died in 2005 so it might be safe to just consider read-only segments with memsz > filesz an error. Reported-by: Sebastian Ott <sebott@redhat.com> Reported-by: Thomas Weißschuh <linux@weissschuh.net> Closes: https://lkml.kernel.org/r/20230914-bss-alloc-v1-1-78de67d2c6dd@weissschuh.net Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Link: https://lore.kernel.org/r/87sf71f123.fsf@email.froward.int.ebiederm.org Tested-by: Pedro Falcato <pedro.falcato@gmail.com> Signed-off-by: Sebastian Ott <sebott@redhat.com> Link: https://lore.kernel.org/r/20230929032435.2391507-1-keescook@chromium.org Signed-off-by: Kees Cook <keescook@chromium.org> Stable-dep-of: 11854fe263eb ("binfmt_elf: Move brk for static PIE even if ASLR disabled") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-09-12ELF: fix kernel.randomize_va_space double readAlexey Dobriyan1-2/+3
[ Upstream commit 2a97388a807b6ab5538aa8f8537b2463c6988bd2 ] ELF loader uses "randomize_va_space" twice. It is sysctl and can change at any moment, so 2 loads could see 2 different values in theory with unpredictable consequences. Issue exactly one load for consistent value across one exec. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Link: https://lore.kernel.org/r/3329905c-7eb8-400a-8f0a-d87cff979b5b@p183 Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-06-28Merge branch 'expand-stack'Linus Torvalds1-3/+3
This modifies our user mode stack expansion code to always take the mmap_lock for writing before modifying the VM layout. It's actually something we always technically should have done, but because we didn't strictly need it, we were being lazy ("opportunistic" sounds so much better, doesn't it?) about things, and had this hack in place where we would extend the stack vma in-place without doing the proper locking. And it worked fine. We just needed to change vm_start (or, in the case of grow-up stacks, vm_end) and together with some special ad-hoc locking using the anon_vma lock and the mm->page_table_lock, it all was fairly straightforward. That is, it was all fine until Ruihan Li pointed out that now that the vma layout uses the maple tree code, we *really* don't just change vm_start and vm_end any more, and the locking really is broken. Oops. It's not actually all _that_ horrible to fix this once and for all, and do proper locking, but it's a bit painful. We have basically three different cases of stack expansion, and they all work just a bit differently: - the common and obvious case is the page fault handling. It's actually fairly simple and straightforward, except for the fact that we have something like 24 different versions of it, and you end up in a maze of twisty little passages, all alike. - the simplest case is the execve() code that creates a new stack. There are no real locking concerns because it's all in a private new VM that hasn't been exposed to anybody, but lockdep still can end up unhappy if you get it wrong. - and finally, we have GUP and page pinning, which shouldn't really be expanding the stack in the first place, but in addition to execve() we also use it for ptrace(). And debuggers do want to possibly access memory under the stack pointer and thus need to be able to expand the stack as a special case. None of these cases are exactly complicated, but the page fault case in particular is just repeated slightly differently many many times. And ia64 in particular has a fairly complicated situation where you can have both a regular grow-down stack _and_ a special grow-up stack for the register backing store. So to make this slightly more manageable, the bulk of this series is to first create a helper function for the most common page fault case, and convert all the straightforward architectures to it. Thus the new 'lock_mm_and_find_vma()' helper function, which ends up being used by x86, arm, powerpc, mips, riscv, alpha, arc, csky, hexagon, loongarch, nios2, sh, sparc32, and xtensa. So we not only convert more than half the architectures, we now have more shared code and avoid some of those twisty little passages. And largely due to this common helper function, the full diffstat of this series ends up deleting more lines than it adds. That still leaves eight architectures (ia64, m68k, microblaze, openrisc, parisc, s390, sparc64 and um) that end up doing 'expand_stack()' manually because they are doing something slightly different from the normal pattern. Along with the couple of special cases in execve() and GUP. So there's a couple of patches that first create 'locked' helper versions of the stack expansion functions, so that there's a obvious path forward in the conversion. The execve() case is then actually pretty simple, and is a nice cleanup from our old "grow-up stackls are special, because at execve time even they grow down". The #ifdef CONFIG_STACK_GROWSUP in that code just goes away, because it's just more straightforward to write out the stack expansion there manually, instead od having get_user_pages_remote() do it for us in some situations but not others and have to worry about locking rules for GUP. And the final step is then to just convert the remaining odd cases to a new world order where 'expand_stack()' is called with the mmap_lock held for reading, but where it might drop it and upgrade it to a write, only to return with it held for reading (in the success case) or with it completely dropped (in the failure case). In the process, we remove all the stack expansion from GUP (where dropping the lock wouldn't be ok without special rules anyway), and add it in manually to __access_remote_vm() for ptrace(). Thanks to Adrian Glaubitz and Frank Scheiner who tested the ia64 cases. Everything else here felt pretty straightforward, but the ia64 rules for stack expansion are really quite odd and very different from everything else. Also thanks to Vegard Nossum who caught me getting one of those odd conditions entirely the wrong way around. Anyway, I think I want to actually move all the stack expansion code to a whole new file of its own, rather than have it split up between mm/mmap.c and mm/memory.c, but since this will have to be backported to the initial maple tree vma introduction anyway, I tried to keep the patches _fairly_ minimal. Also, while I don't think it's valid to expand the stack from GUP, the final patch in here is a "warn if some crazy GUP user wants to try to expand the stack" patch. That one will be reverted before the final release, but it's left to catch any odd cases during the merge window and release candidates. Reported-by: Ruihan Li <lrh2000@pku.edu.cn> * branch 'expand-stack': gup: add warning if some caller would seem to want stack expansion mm: always expand the stack with the mmap write lock held execve: expand new process stack manually ahead of time mm: make find_extend_vma() fail if write lock not held powerpc/mm: convert coprocessor fault to lock_mm_and_find_vma() mm/fault: convert remaining simple cases to lock_mm_and_find_vma() arm/mm: Convert to using lock_mm_and_find_vma() riscv/mm: Convert to using lock_mm_and_find_vma() mips/mm: Convert to using lock_mm_and_find_vma() powerpc/mm: Convert to using lock_mm_and_find_vma() arm64/mm: Convert to using lock_mm_and_find_vma() mm: make the page fault mmap locking killable mm: introduce new 'lock_mm_and_find_vma()' page fault helper
2023-06-27mm: always expand the stack with the mmap write lock heldLinus Torvalds1-1/+1
This finishes the job of always holding the mmap write lock when extending the user stack vma, and removes the 'write_locked' argument from the vm helper functions again. For some cases, we just avoid expanding the stack at all: drivers and page pinning really shouldn't be extending any stacks. Let's see if any strange users really wanted that. It's worth noting that architectures that weren't converted to the new lock_mm_and_find_vma() helper function are left using the legacy "expand_stack()" function, but it has been changed to drop the mmap_lock and take it for writing while expanding the vma. This makes it fairly straightforward to convert the remaining architectures. As a result of dropping and re-taking the lock, the calling conventions for this function have also changed, since the old vma may no longer be valid. So it will now return the new vma if successful, and NULL - and the lock dropped - if the area could not be extended. Tested-by: Vegard Nossum <vegard.nossum@oracle.com> Tested-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> # ia64 Tested-by: Frank Scheiner <frank.scheiner@web.de> # ia64 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-06-24mm: make find_extend_vma() fail if write lock not heldLiam R. Howlett1-3/+3
Make calls to extend_vma() and find_extend_vma() fail if the write lock is required. To avoid making this a flag-day event, this still allows the old read-locking case for the trivial situations, and passes in a flag to say "is it write-locked". That way write-lockers can say "yes, I'm being careful", and legacy users will continue to work in all the common cases until they have been fully converted to the new world order. Co-Developed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-06-23binfmt_elf: fix comment typo s/reset/regset/Baruch Siach1-1/+1
Signed-off-by: Baruch Siach <baruch@tkos.co.il> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/0b2967c4a4141875c493e835d5a6f8f2d19ae2d6.1687499804.git.baruch@tkos.co.il
2023-05-16coredump, vmcore: Set p_align to 4 for PT_NOTEFangrui Song1-1/+1
Tools like readelf/llvm-readelf use p_align to parse a PT_NOTE program header as an array of 4-byte entries or 8-byte entries. Currently, there are workarounds[1] in place for Linux to treat p_align==0 as 4. However, it would be more appropriate to set the correct alignment so that tools do not have to rely on guesswork. FreeBSD coredumps set p_align to 4 as well. [1]: https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=82ed9683ec099d8205dc499ac84febc975235af6 [2]: https://reviews.llvm.org/D150022 Signed-off-by: Fangrui Song <maskray@google.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20230512022528.3430327-1-maskray@google.com
2023-04-27Merge tag 'mm-nonmm-stable-2023-04-27-16-01' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: "Mainly singleton patches all over the place. Series of note are: - updates to scripts/gdb from Glenn Washburn - kexec cleanups from Bjorn Helgaas" * tag 'mm-nonmm-stable-2023-04-27-16-01' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (50 commits) mailmap: add entries for Paul Mackerras libgcc: add forward declarations for generic library routines mailmap: add entry for Oleksandr ocfs2: reduce ioctl stack usage fs/proc: add Kthread flag to /proc/$pid/status ia64: fix an addr to taddr in huge_pte_offset() checkpatch: introduce proper bindings license check epoll: rename global epmutex scripts/gdb: add GDB convenience functions $lx_dentry_name() and $lx_i_dentry() scripts/gdb: create linux/vfs.py for VFS related GDB helpers uapi/linux/const.h: prefer ISO-friendly __typeof__ delayacct: track delays from IRQ/SOFTIRQ scripts/gdb: timerlist: convert int chunks to str scripts/gdb: print interrupts scripts/gdb: raise error with reduced debugging information scripts/gdb: add a Radix Tree Parser lib/rbtree: use '+' instead of '|' for setting color. proc/stat: remove arch_idle_time() checkpatch: check for misuse of the link tags checkpatch: allow Closes tags with links ...
2023-04-13binfmt_elf: remove MODULE_LICENSE in non-modulesNick Alcock1-1/+0
Since commit 8b41fc4454e ("kbuild: create modules.builtin without Makefile.modbuiltin or tristate.conf"), MODULE_LICENSE declarations are used to identify modules. As a consequence, uses of the macro in non-modules will cause modprobe to misidentify their containing object file as a module when it is not (false positives), and modprobe might succeed rather than failing with a suitable error message. So remove it in the files in this commit, none of which can be built as modules. Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Suggested-by: Luis Chamberlain <mcgrof@kernel.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: linux-modules@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: Hitomi Hasegawa <hasegawa-hitomi@fujitsu.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: linux-fsdevel@vger.kernel.org Cc: linux-mm@kvack.org Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-04-08ELF: fix all "Elf" typosAlexey Dobriyan1-1/+1
ELF is acronym and therefore should be spelled in all caps. I left one exception at Documentation/arm/nwfpe/nwfpe.rst which looks like being written in the first person. Link: https://lkml.kernel.org/r/Y/3wGWQviIOkyLJW@p183 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-31Merge tag 'v6.2-rc6' into sched/core, to pick up fixesIngo Molnar1-2/+2
Pick up fixes before merging another batch of cpuidle updates. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-01-05elfcore: Add a cprm parameter to elf_core_extra_{phdrs,data_size}Catalin Marinas1-2/+2
A subsequent fix for arm64 will use this parameter to parse the vma information from the snapshot created by dump_vma_snapshot() rather than traversing the vma list without the mmap_lock. Fixes: 6dd8b1a0b6cb ("arm64: mte: Dump the MTE tags in the core file") Cc: <stable@vger.kernel.org> # 5.18.x Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reported-by: Seth Jenkins <sethjenkins@google.com> Suggested-by: Seth Jenkins <sethjenkins@google.com> Cc: Will Deacon <will@kernel.org> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20221222181251.1345752-3-catalin.marinas@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-12-27rseq: Introduce feature size and alignment ELF auxiliary vector entriesMathieu Desnoyers1-0/+5
Export the rseq feature size supported by the kernel as well as the required allocation alignment for the rseq per-thread area to user-space through ELF auxiliary vector entries. This is part of the extensible rseq ABI. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20221122203932.231377-3-mathieu.desnoyers@efficios.com
2022-12-12Merge tag 'pull-elfcore' of ↵Linus Torvalds1-220/+51
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull elf coredumping updates from Al Viro: "Unification of regset and non-regset sides of ELF coredump handling. Collecting per-thread register values is the only thing that needs to be ifdefed there..." * tag 'pull-elfcore' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: [elf] get rid of get_note_info_size() [elf] unify regset and non-regset cases [elf][non-regset] use elf_core_copy_task_regs() for dumper as well [elf][non-regset] uninline elf_core_copy_task_fpregs() (and lose pt_regs argument) elf_core_copy_task_regs(): task_pt_regs is defined everywhere [elf][regset] simplify thread list handling in fill_note_info() [elf][regset] clean fill_note_info() a bit kill extern of vsyscall32_sysctl kill coredump_params->regs kill signal_pt_regs()
2022-11-24[elf] get rid of get_note_info_size()Al Viro1-6/+1
it's trivial now... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-11-24[elf] unify regset and non-regset casesAl Viro1-192/+38
The only real difference is in filling per-thread notes - getting the values of registers. And this is the only part that is worth an ifdef - we don't need to duplicate the logics regarding gathering threads, filling other notes, etc. It would've been hard to do back when regset-based variant had been introduced, mostly due to sharing bits and pieces of helpers with aout coredumps. As the result, too much had been duplicated and the copies had drifted away since then. Now it can be done cleanly... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-11-24[elf][non-regset] use elf_core_copy_task_regs() for dumper as wellAl Viro1-1/+1
elf_core_copy_regs() is equivalent to elf_core_copy_task_regs() of current on all architectures. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-11-24[elf][non-regset] uninline elf_core_copy_task_fpregs() (and lose pt_regs ↵Al Viro1-3/+2
argument) Don't bother with pointless macros - we are not sharing it with aout coredumps anymore. Just convert the underlying functions to the same arguments (nobody uses regs, actually) and call them elf_core_copy_task_fpregs(). And unexport the entire bunch, while we are at it. [added missing includes in arch/{csky,m68k,um}/kernel/process.c to avoid extra warnings about the lack of externs getting added to huge piles for those files. Pointless, but...] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-11-17binfmt_elf: replace IS_ERR() with IS_ERR_VALUE()Bo Liu1-3/+3
Avoid typecasts that are needed for IS_ERR() and use IS_ERR_VALUE() instead. Signed-off-by: Bo Liu <liubo03@inspur.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20221115031757.2426-1-liubo03@inspur.com
2022-10-25binfmt_elf: simplify error handling in load_elf_phdrs()Rolf Eike Beer1-8/+2
The err variable was the same like retval, but capped to <= 0. This is the same as retval as elf_read() never returns positive values. Signed-off-by: Rolf Eike Beer <eb@emlix.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/4137126.7Qn9TF0dmF@mobilepool36.emlix.com
2022-10-25binfmt_elf: fix documented return value for load_elf_phdrs()Rolf Eike Beer1-1/+1
This function has never returned anything but a plain NULL. Fixes: 6a8d38945cf4 ("binfmt_elf: Hoist ELF program header loading to a function") Signed-off-by: Rolf Eike Beer <eb@emlix.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/2359389.EDbqzprbEW@mobilepool36.emlix.com
2022-10-25binfmt: Fix whitespace issuesKees Cook1-7/+7
Fix the annoying whitespace issues that have been following these files around for years. Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org> Link: https://lore.kernel.org/r/20221018071350.never.230-kees@kernel.org
2022-10-25fs/binfmt_elf: Fix memory leak in load_elf_binary()Li Zetao1-1/+2
There is a memory leak reported by kmemleak: unreferenced object 0xffff88817104ef80 (size 224): comm "xfs_admin", pid 47165, jiffies 4298708825 (age 1333.476s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 60 a8 b3 00 81 88 ff ff a8 10 5a 00 81 88 ff ff `.........Z..... backtrace: [<ffffffff819171e1>] __alloc_file+0x21/0x250 [<ffffffff81918061>] alloc_empty_file+0x41/0xf0 [<ffffffff81948cda>] path_openat+0xea/0x3d30 [<ffffffff8194ec89>] do_filp_open+0x1b9/0x290 [<ffffffff8192660e>] do_open_execat+0xce/0x5b0 [<ffffffff81926b17>] open_exec+0x27/0x50 [<ffffffff81a69250>] load_elf_binary+0x510/0x3ed0 [<ffffffff81927759>] bprm_execve+0x599/0x1240 [<ffffffff8192a997>] do_execveat_common.isra.0+0x4c7/0x680 [<ffffffff8192b078>] __x64_sys_execve+0x88/0xb0 [<ffffffff83bbf0a5>] do_syscall_64+0x35/0x80 If "interp_elf_ex" fails to allocate memory in load_elf_binary(), the program will take the "out_free_ph" error handing path, resulting in "interpreter" file resource is not released. Fix it by adding an error handing path "out_free_file", which will release the file resource when "interp_elf_ex" failed to allocate memory. Fixes: 0693ffebcfe5 ("fs/binfmt_elf.c: allocate less for static executable") Signed-off-by: Li Zetao <lizetao1@huawei.com> Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20221024154421.982230-1-lizetao1@huawei.com
2022-10-23[elf][regset] simplify thread list handling in fill_note_info()Al Viro1-12/+10
fill_note_info() iterates through the list of threads collected in mm->core_state->dumper, allocating a struct elf_thread_core_info instance for each and linking those into a list. We need the entry corresponding to current to be first in the resulting list, so the logics for list insertion is if it's for current or list is empty insert in the head else insert after the first element However, in mm->core_state->dumper the entry for current is guaranteed to be the first one. Which means that both parts of condition will be true on the first iteration and neither will be true on all subsequent ones. Taking the first iteration out of the loop simplifies things nicely... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-10-23[elf][regset] clean fill_note_info() a bitAl Viro1-9/+2
*info is already initialized... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-10-23kill coredump_params->regsAl Viro1-2/+2
it's always task_pt_regs(current) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-04-15revert "fs/binfmt_elf: use PT_LOAD p_align values for static PIE"Andrew Morton1-2/+2
Despite Mike's attempted fix (925346c129da117122), regressions reports continue: https://lore.kernel.org/lkml/cb5b81bd-9882-e5dc-cd22-54bdbaaefbbc@leemhuis.info/ https://bugzilla.kernel.org/show_bug.cgi?id=215720 https://lkml.kernel.org/r/b685f3d0-da34-531d-1aa9-479accd3e21b@leemhuis.info So revert this patch. Fixes: 9630f0d60fec ("fs/binfmt_elf: use PT_LOAD p_align values for static PIE") Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Chris Kennelly <ckennelly@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Fangrui Song <maskray@google.com> Cc: H.J. Lu <hjl.tools@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ian Rogers <irogers@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Sandeep Patil <sspatil@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thorsten Leemhuis <regressions@leemhuis.info> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-15revert "fs/binfmt_elf: fix PT_LOAD p_align values for loaders"Andrew Morton1-1/+1
Commit 925346c129da11 ("fs/binfmt_elf: fix PT_LOAD p_align values for loaders") was an attempt to fix regressions due to 9630f0d60fec5f ("fs/binfmt_elf: use PT_LOAD p_align values for static PIE"). But regressionss continue to be reported: https://lore.kernel.org/lkml/cb5b81bd-9882-e5dc-cd22-54bdbaaefbbc@leemhuis.info/ https://bugzilla.kernel.org/show_bug.cgi?id=215720 https://lkml.kernel.org/r/b685f3d0-da34-531d-1aa9-479accd3e21b@leemhuis.info This patch reverts the fix, so the original can also be reverted. Fixes: 925346c129da11 ("fs/binfmt_elf: fix PT_LOAD p_align values for loaders") Cc: H.J. Lu <hjl.tools@gmail.com> Cc: Chris Kennelly <ckennelly@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Song Liu <songliubraving@fb.com> Cc: David Rientjes <rientjes@google.com> Cc: Ian Rogers <irogers@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sandeep Patil <sspatil@google.com> Cc: Fangrui Song <maskray@google.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Thorsten Leemhuis <regressions@leemhuis.info> Cc: Mike Rapoport <rppt@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-21Merge tag 'execve-v5.18-rc1' of ↵Linus Torvalds1-70/+83
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull execve updates from Kees Cook: "Execve and binfmt updates. Eric and I have stepped up to be the active maintainers of this area, so here's our first collection. The bulk of the work was in coredump handling fixes; additional details are noted below: - Handle unusual AT_PHDR offsets (Akira Kawata) - Fix initial mapping size when PT_LOADs are not ordered (Alexey Dobriyan) - Move more code under CONFIG_COREDUMP (Alexey Dobriyan) - Fix missing mmap_lock in file_files_note (Eric W. Biederman) - Remove a.out support for alpha and m68k (Eric W. Biederman) - Include first pages of non-exec ELF libraries in coredump (Jann Horn) - Don't write past end of notes for regset gap in coredump (Rick Edgecombe) - Comment clean-ups (Tom Rix) - Force single empty string when argv is empty (Kees Cook) - Add NULL argv selftest (Kees Cook) - Properly redefine PT_GNU_* in terms of PT_LOOS (Kees Cook) - MAINTAINERS: Update execve entry with tree (Kees Cook) - Introduce initial KUnit testing for binfmt_elf (Kees Cook)" * tag 'execve-v5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: binfmt_elf: Don't write past end of notes for regset gap a.out: Stop building a.out/osf1 support on alpha and m68k coredump: Don't compile flat_core_dump when coredumps are disabled coredump: Use the vma snapshot in fill_files_note coredump/elf: Pass coredump_params into fill_note_info coredump: Remove the WARN_ON in dump_vma_snapshot coredump: Snapshot the vmas in do_coredump coredump: Move definition of struct coredump_params into coredump.h binfmt_elf: Introduce KUnit test ELF: Properly redefine PT_GNU_* in terms of PT_LOOS MAINTAINERS: Update execve entry with more details exec: cleanup comments fs/binfmt_elf: Refactor load_elf_binary function fs/binfmt_elf: Fix AT_PHDR for unusual ELF files binfmt: move more stuff undef CONFIG_COREDUMP selftests/exec: Test for empty string on NULL argv exec: Force single empty string when argv is empty coredump: Also dump first pages of non-executable ELF libraries ELF: fix overflow in total mapping size calculation
2022-03-18binfmt_elf: Don't write past end of notes for regset gapRick Edgecombe1-10/+14
In fill_thread_core_info() the ptrace accessible registers are collected to be written out as notes in a core file. The note array is allocated from a size calculated by iterating the user regset view, and counting the regsets that have a non-zero core_note_type. However, this only allows for there to be non-zero core_note_type at the end of the regset view. If there are any gaps in the middle, fill_thread_core_info() will overflow the note allocation, as it iterates over the size of the view and the allocation would be smaller than that. There doesn't appear to be any arch that has gaps such that they exceed the notes allocation, but the code is brittle and tries to support something it doesn't. It could be fixed by increasing the allocation size, but instead just have the note collecting code utilize the array better. This way the allocation can stay smaller. Even in the case of no arch's that have gaps in their regset views, this introduces a change in the resulting indicies of t->notes. It does not introduce any changes to the core file itself, because any blank notes are skipped in write_note_info(). In case, the allocation logic between fill_note_info() and fill_thread_core_info() ever diverges from the usage logic, warn and skip writing any notes that would overflow the array. This fix is derrived from an earlier one[0] by Yu-cheng Yu. [0] https://lore.kernel.org/lkml/20180717162502.32274-1-yu-cheng.yu@intel.com/ Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220317192013.13655-4-rick.p.edgecombe@intel.com
2022-03-08coredump: Use the vma snapshot in fill_files_noteEric W. Biederman1-12/+12
Matthew Wilcox reported that there is a missing mmap_lock in file_files_note that could possibly lead to a user after free. Solve this by using the existing vma snapshot for consistency and to avoid the need to take the mmap_lock anywhere in the coredump code except for dump_vma_snapshot. Update the dump_vma_snapshot to capture vm_pgoff and vm_file that are neeeded by fill_files_note. Add free_vma_snapshot to free the captured values of vm_file. Reported-by: Matthew Wilcox <willy@infradead.org> Link: https://lkml.kernel.org/r/20220131153740.2396974-1-willy@infradead.org Cc: stable@vger.kernel.org Fixes: a07279c9a8cd ("binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot") Fixes: 2aa362c49c31 ("coredump: extend core dump note section to contain file names of mapped files") Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-03-08coredump/elf: Pass coredump_params into fill_note_infoEric W. Biederman1-11/+11
Instead of individually passing cprm->siginfo and cprm->regs into fill_note_info pass all of struct coredump_params. This is preparation to allow fill_files_note to use the existing vma snapshot. Reviewed-by: Jann Horn <jannh@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-03-08coredump: Snapshot the vmas in do_coredumpEric W. Biederman1-13/+7
Move the call of dump_vma_snapshot and kvfree(vma_meta) out of the individual coredump routines into do_coredump itself. This makes the code less error prone and easier to maintain. Make the vma snapshot available to the coredump routines in struct coredump_params. This makes it easier to change and update what is captures in the vma snapshot and will be needed for fixing fill_file_notes. Reviewed-by: Jann Horn <jannh@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-03-03binfmt_elf: Introduce KUnit testKees Cook1-0/+4
Adds simple KUnit test for some binfmt_elf internals: specifically a regression test for the problem fixed by commit 8904d9cd90ee ("ELF: fix overflow in total mapping size calculation"). $ ./tools/testing/kunit/kunit.py run --arch x86_64 \ --kconfig_add CONFIG_IA32_EMULATION=y '*binfmt_elf' ... [19:41:08] ================== binfmt_elf (1 subtest) ================== [19:41:08] [PASSED] total_mapping_size_test [19:41:08] =================== [PASSED] binfmt_elf ==================== [19:41:08] ============== compat_binfmt_elf (1 subtest) =============== [19:41:08] [PASSED] total_mapping_size_test [19:41:08] ================ [PASSED] compat_binfmt_elf ================ [19:41:08] ============================================================ [19:41:08] Testing complete. Passed: 2, Failed: 0, Crashed: 0, Skipped: 0, Errors: 0 Cc: Eric Biederman <ebiederm@xmission.com> Cc: David Gow <davidgow@google.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Magnus Groß" <magnus.gross@rwth-aachen.de> Cc: kunit-dev@googlegroups.com Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> --- v1: https://lore.kernel.org/lkml/20220224054332.1852813-1-keescook@chromium.org v2: - improve commit log - fix comment URL (Daniel) - drop redundant KUnit Kconfig help info (Daniel) - note in Kconfig help that COMPAT builds add a compat test (David)
2022-03-01fs/binfmt_elf: Refactor load_elf_binary functionAkira Kawata1-8/+6
I delete load_addr because it is not used anymore. And I rename load_addr_set to first_pt_load because it is used only to capture the first iteration of the loop. Signed-off-by: Akira Kawata <akirakawata1@gmail.com> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220127124014.338760-3-akirakawata1@gmail.com
2022-03-01fs/binfmt_elf: Fix AT_PHDR for unusual ELF filesAkira Kawata1-6/+18
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=197921 As pointed out in the discussion of buglink, we cannot calculate AT_PHDR as the sum of load_addr and exec->e_phoff. : The AT_PHDR of ELF auxiliary vectors should point to the memory address : of program header. But binfmt_elf.c calculates this address as follows: : : NEW_AUX_ENT(AT_PHDR, load_addr + exec->e_phoff); : : which is wrong since e_phoff is the file offset of program header and : load_addr is the memory base address from PT_LOAD entry. : : The ld.so uses AT_PHDR as the memory address of program header. In normal : case, since the e_phoff is usually 64 and in the first PT_LOAD region, it : is the correct program header address. : : But if the address of program header isn't equal to the first PT_LOAD : address + e_phoff (e.g. Put the program header in other non-consecutive : PT_LOAD region), ld.so will try to