summaryrefslogtreecommitdiff
path: root/fs/proc
AgeCommit message (Collapse)AuthorFilesLines
2023-05-23sysctl: Remove register_sysctl_tableJoel Granados1-159/+0
This is part of the general push to deprecate register_sysctl_paths and register_sysctl_table. After removing all the calling functions, we remove both the register_sysctl_table function and the documentation check that appeared in check-sysctl-docs awk script. We save 595 bytes with this change: ./scripts/bloat-o-meter vmlinux.1.refactor-base-paths vmlinux.2.remove-sysctl-table add/remove: 2/8 grow/shrink: 1/0 up/down: 1154/-1749 (-595) Function old new delta count_subheaders - 983 +983 unregister_sysctl_table 29 184 +155 __pfx_count_subheaders - 16 +16 __pfx_unregister_sysctl_table.part 16 - -16 __pfx_register_leaf_sysctl_tables.constprop 16 - -16 __pfx_count_subheaders.part 16 - -16 __pfx___register_sysctl_base 16 - -16 unregister_sysctl_table.part 136 - -136 __register_sysctl_base 478 - -478 register_leaf_sysctl_tables.constprop 524 - -524 count_subheaders.part 547 - -547 Total: Before=21257652, After=21257057, chg -0.00% [mcgrof: remove register_leaf_sysctl_tables and append_path too and add bloat-o-meter stats] Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Acked-by: Christian Brauner <brauner@kernel.org>
2023-05-23sysctl: stop exporting register_sysctl_tableJoel Granados1-3/+2
We make register_sysctl_table static because the only function calling it is in fs/proc/proc_sysctl.c (__register_sysctl_base). We remove it from the sysctl.h header and modify the documentation in both the header and proc_sysctl.c files to mention "register_sysctl" instead of "register_sysctl_table". This plus the commits that remove register_sysctl_table from parport save 217 bytes: ./scripts/bloat-o-meter .bsysctl/vmlinux.old .bsysctl/vmlinux.new add/remove: 0/1 grow/shrink: 5/1 up/down: 458/-675 (-217) Function old new delta __register_sysctl_base 8 286 +278 parport_proc_register 268 379 +111 parport_device_proc_register 195 247 +52 kzalloc.constprop 598 608 +10 parport_default_proc_register 62 69 +7 register_sysctl_table 291 - -291 parport_sysctl_template 1288 904 -384 Total: Before=8603076, After=8602859, chg -0.00% Signed-off-by: Joel Granados <j.granados@samsung.com> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-05-16coredump, vmcore: Set p_align to 4 for PT_NOTEFangrui Song1-2/+2
Tools like readelf/llvm-readelf use p_align to parse a PT_NOTE program header as an array of 4-byte entries or 8-byte entries. Currently, there are workarounds[1] in place for Linux to treat p_align==0 as 4. However, it would be more appropriate to set the correct alignment so that tools do not have to rely on guesswork. FreeBSD coredumps set p_align to 4 as well. [1]: https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=82ed9683ec099d8205dc499ac84febc975235af6 [2]: https://reviews.llvm.org/D150022 Signed-off-by: Fangrui Song <maskray@google.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20230512022528.3430327-1-maskray@google.com
2023-05-02sysctl: remove register_sysctl_paths()Luis Chamberlain1-51/+4
The deprecation for register_sysctl_paths() is over. We can rejoice as we nuke register_sysctl_paths(). The routine register_sysctl_table() was the only user left of register_sysctl_paths(), so we can now just open code and move the implementation over to what used to be to __register_sysctl_paths(). The old dynamic struct ctl_table_set *set is now the point to sysctl_table_root.default_set. The old dynamic const struct ctl_path *path was being used in the routine register_sysctl_paths() with a static: static const struct ctl_path null_path[] = { {} }; Since this is a null path we can now just simplfy the old routine and remove its use as its always empty. This saves us a total of 230 bytes. $ ./scripts/bloat-o-meter vmlinux.old vmlinux add/remove: 2/7 grow/shrink: 1/1 up/down: 1015/-1245 (-230) Function old new delta register_leaf_sysctl_tables.constprop - 524 +524 register_sysctl_table 22 497 +475 __pfx_register_leaf_sysctl_tables.constprop - 16 +16 null_path 8 - -8 __pfx_register_sysctl_paths 16 - -16 __pfx_register_leaf_sysctl_tables 16 - -16 __pfx___register_sysctl_paths 16 - -16 __register_sysctl_base 29 12 -17 register_sysctl_paths 18 - -18 register_leaf_sysctl_tables 534 - -534 __register_sysctl_paths 620 - -620 Total: Before=21259666, After=21259436, chg -0.00% Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-04-28Merge tag 'x86_mm_for_6.4' of ↵Linus Torvalds2-2/+14
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 LAM (Linear Address Masking) support from Dave Hansen: "Add support for the new Linear Address Masking CPU feature. This is similar to ARM's Top Byte Ignore and allows userspace to store metadata in some bits of pointers without masking it out before use" * tag 'x86_mm_for_6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm/iommu/sva: Do not allow to set FORCE_TAGGED_SVA bit from outside x86/mm/iommu/sva: Fix error code for LAM enabling failure due to SVA selftests/x86/lam: Add test cases for LAM vs thread creation selftests/x86/lam: Add ARCH_FORCE_TAGGED_SVA test cases for linear-address masking selftests/x86/lam: Add inherit test cases for linear-address masking selftests/x86/lam: Add io_uring test cases for linear-address masking selftests/x86/lam: Add mmap and SYSCALL test cases for linear-address masking selftests/x86/lam: Add malloc and tag-bits test cases for linear-address masking x86/mm/iommu/sva: Make LAM and SVA mutually exclusive iommu/sva: Replace pasid_valid() helper with mm_valid_pasid() mm: Expose untagging mask in /proc/$PID/status x86/mm: Provide arch_prctl() interface for LAM x86/mm: Reduce untagged_addr() overhead for systems without LAM x86/uaccess: Provide untagged_addr() and remove tags before address check mm: Introduce untagged_addr_remote() x86/mm: Handle LAM on context switch x86: CPUID and CR3/CR4 flags for Linear Address Masking x86: Allow atomic MM_CONTEXT flags setting x86/mm: Rework address range check in get_user() and put_user()
2023-04-27Merge tag 'mm-nonmm-stable-2023-04-27-16-01' of ↵Linus Torvalds6-40/+13
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: "Mainly singleton patches all over the place. Series of note are: - updates to scripts/gdb from Glenn Washburn - kexec cleanups from Bjorn Helgaas" * tag 'mm-nonmm-stable-2023-04-27-16-01' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (50 commits) mailmap: add entries for Paul Mackerras libgcc: add forward declarations for generic library routines mailmap: add entry for Oleksandr ocfs2: reduce ioctl stack usage fs/proc: add Kthread flag to /proc/$pid/status ia64: fix an addr to taddr in huge_pte_offset() checkpatch: introduce proper bindings license check epoll: rename global epmutex scripts/gdb: add GDB convenience functions $lx_dentry_name() and $lx_i_dentry() scripts/gdb: create linux/vfs.py for VFS related GDB helpers uapi/linux/const.h: prefer ISO-friendly __typeof__ delayacct: track delays from IRQ/SOFTIRQ scripts/gdb: timerlist: convert int chunks to str scripts/gdb: print interrupts scripts/gdb: raise error with reduced debugging information scripts/gdb: add a Radix Tree Parser lib/rbtree: use '+' instead of '|' for setting color. proc/stat: remove arch_idle_time() checkpatch: check for misuse of the link tags checkpatch: allow Closes tags with links ...
2023-04-27Merge tag 'mm-stable-2023-04-27-15-30' of ↵Linus Torvalds4-47/+57
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of switching from a user process to a kernel thread. - More folio conversions from Kefeng Wang, Zhang Peng and Pankaj Raghav. - zsmalloc performance improvements from Sergey Senozhatsky. - Yue Zhao has found and fixed some data race issues around the alteration of memcg userspace tunables. - VFS rationalizations from Christoph Hellwig: - removal of most of the callers of write_one_page() - make __filemap_get_folio()'s return value more useful - Luis Chamberlain has changed tmpfs so it no longer requires swap backing. Use `mount -o noswap'. - Qi Zheng has made the slab shrinkers operate locklessly, providing some scalability benefits. - Keith Busch has improved dmapool's performance, making part of its operations O(1) rather than O(n). - Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd, permitting userspace to wr-protect anon memory unpopulated ptes. - Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive rather than exclusive, and has fixed a bunch of errors which were caused by its unintuitive meaning. - Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature, which causes minor faults to install a write-protected pte. - Vlastimil Babka has done some maintenance work on vma_merge(): cleanups to the kernel code and improvements to our userspace test harness. - Cleanups to do_fault_around() by Lorenzo Stoakes. - Mike Rapoport has moved a lot of initialization code out of various mm/ files and into mm/mm_init.c. - Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for DRM, but DRM doesn't use it any more. - Lorenzo has also coverted read_kcore() and vread() to use iterators and has thereby removed the use of bounce buffers in some cases. - Lorenzo has also contributed further cleanups of vma_merge(). - Chaitanya Prakash provides some fixes to the mmap selftesting code. - Matthew Wilcox changes xfs and afs so they no longer take sleeping locks in ->map_page(), a step towards RCUification of pagefaults. - Suren Baghdasaryan has improved mmap_lock scalability by switching to per-VMA locking. - Frederic Weisbecker has reworked the percpu cache draining so that it no longer causes latency glitches on cpu isolated workloads. - Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig logic. - Liu Shixin has changed zswap's initialization so we no longer waste a chunk of memory if zswap is not being used. - Yosry Ahmed has improved the performance of memcg statistics flushing. - David Stevens has fixed several issues involving khugepaged, userfaultfd and shmem. - Christoph Hellwig has provided some cleanup work to zram's IO-related code paths. - David Hildenbrand has fixed up some issues in the selftest code's testing of our pte state changing. - Pankaj Raghav has made page_endio() unneeded and has removed it. - Peter Xu contributed some rationalizations of the userfaultfd selftests. - Yosry Ahmed has fixed an issue around memcg's page recalim accounting. - Chaitanya Prakash has fixed some arm-related issues in the selftests/mm code. - Longlong Xia has improved the way in which KSM handles hwpoisoned pages. - Peter Xu fixes a few issues with uffd-wp at fork() time. - Stefan Roesch has changed KSM so that it may now be used on a per-process and per-cgroup basis. * tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits) mm,unmap: avoid flushing TLB in batch if PTE is inaccessible shmem: restrict noswap option to initial user namespace mm/khugepaged: fix conflicting mods to collapse_file() sparse: remove unnecessary 0 values from rc mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area() hugetlb: pte_alloc_huge() to replace huge pte_alloc_map() maple_tree: fix allocation in mas_sparse_area() mm: do not increment pgfault stats when page fault handler retries zsmalloc: allow only one active pool compaction context selftests/mm: add new selftests for KSM mm: add new KSM process and sysfs knobs mm: add new api to enable ksm per process mm: shrinkers: fix debugfs file permissions mm: don't check VMA write permissions if the PTE/PMD indicates write permissions migrate_pages_batch: fix statistics for longterm pin retry userfaultfd: use helper function range_in_vma() lib/show_mem.c: use for_each_populated_zone() simplify code mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list() fs/buffer: convert create_page_buffers to folio_create_buffers fs/buffer: add folio_create_empty_buffers helper ...
2023-04-27Merge tag 'sysctl-6.4-rc1' of ↵Linus Torvalds1-30/+58
git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux Pull sysctl updates from Luis Chamberlain: "This only does a few sysctl moves from the kernel/sysctl.c file, the rest of the work has been put towards deprecating two API calls which incur recursion and prevent us from simplifying the registration process / saving memory per move. Most of the changes have been soaking on linux-next since v6.3-rc3. I've slowed down the kernel/sysctl.c moves due to Matthew Wilcox's feedback that we should see if we could *save* memory with these moves instead of incurring more memory. We currently incur more memory since when we move a syctl from kernel/sysclt.c out to its own file we end up having to add a new empty sysctl used to register it. To achieve saving memory we want to allow syctls to be passed without requiring the end element being empty, and just have our registration process rely on ARRAY_SIZE(). Without this, supporting both styles of sysctls would make the sysctl registration pretty brittle, hard to read and maintain as can be seen from Meng Tang's efforts to do just this [0]. Fortunately, in order to use ARRAY_SIZE() for all sysctl registrations also implies doing the work to deprecate two API calls which use recursion in order to support sysctl declarations with subdirectories. And so during this development cycle quite a bit of effort went into this deprecation effort. I've annotated the following two APIs are deprecated and in few kernel releases we should be good to remove them: - register_sysctl_table() - register_sysctl_paths() During this merge window we should be able to deprecate and unexport register_sysctl_paths(), we can probably do that towards the end of this merge window. Deprecating register_sysctl_table() will take a bit more time but this pull request goes with a few example of how to do this. As it turns out each of the conversions to move away from either of these two API calls *also* saves memory. And so long term, all these changes *will* prove to have saved a bit of memory on boot. The way I see it then is if remove a user of one deprecated call, it gives us enough savings to move one kernel/sysctl.c out from the generic arrays as we end up with about the same amount of bytes. Since deprecating register_sysctl_table() and register_sysctl_paths() does not require maintainer coordination except the final unexport you'll see quite a bit of these changes from other pull requests, I've just kept the stragglers after rc3" Link: https://lkml.kernel.org/r/ZAD+cpbrqlc5vmry@bombadil.infradead.org [0] * tag 'sysctl-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: (29 commits) fs: fix sysctls.c built mm: compaction: remove incorrect #ifdef checks mm: compaction: move compaction sysctl to its own file mm: memory-failure: Move memory failure sysctls to its own file arm: simplify two-level sysctl registration for ctl_isa_vars ia64: simplify one-level sysctl registration for kdump_ctl_table utsname: simplify one-level sysctl registration for uts_kern_table ntfs: simplfy one-level sysctl registration for ntfs_sysctls coda: simplify one-level sysctl registration for coda_table fs/cachefiles: simplify one-level sysctl registration for cachefiles_sysctls xfs: simplify two-level sysctl registration for xfs_table nfs: simplify two-level sysctl registration for nfs_cb_sysctls nfs: simplify two-level sysctl registration for nfs4_cb_sysctls lockd: simplify two-level sysctl registration for nlm_sysctls proc_sysctl: enhance documentation xen: simplify sysctl registration for balloon md: simplify sysctl registration hv: simplify sysctl registration scsi: simplify sysctl registration with register_sysctl() csky: simplify alignment sysctl registration ...
2023-04-21fs/proc: add Kthread flag to /proc/$pid/statusChunguang Wu1-0/+2
The command `ps -ef ` and `top -c` mark kernel thread by '[' and ']', but sometimes the result is not correct. The task->flags in /proc/$pid/stat is good, but we need remember the value of PF_KTHREAD is 0x00200000 and convert dec to hex. If we have no binary program and shell script which read /proc/$pid/stat, we can know it directly by `cat /proc/$pid/status`. Link: https://lkml.kernel.org/r/20230416052404.2920-1-fullspring2018@gmail.com Signed-off-by: Chunguang Wu <fullspring2018@gmail.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-21mm: add new KSM process and sysfs knobsStefan Roesch1-0/+3
This adds the general_profit KSM sysfs knob and the process profit metric knobs to ksm_stat. 1) expose general_profit metric The documentation mentions a general profit metric, however this metric is not calculated. In addition the formula depends on the size of internal structures, which makes it more difficult for an administrator to make the calculation. Adding the metric for a better user experience. 2) document general_profit sysfs knob 3) calculate ksm process profit metric The ksm documentation mentions the process profit metric and how to calculate it. This adds the calculation of the metric. 4) mm: expose ksm process profit metric in ksm_stat This exposes the ksm process profit metric in /proc/<pid>/ksm_stat. The documentation mentions the formula for the ksm process profit metric, however it does not calculate it. In addition the formula depends on the size of internal structures. So it makes sense to expose it. 5) document new procfs ksm knobs Link: https://lkml.kernel.org/r/20230418051342.1919757-3-shr@devkernel.io Signed-off-by: Stefan Roesch <shr@devkernel.io> Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-18proc/stat: remove arch_idle_time()Heiko Carstens1-26/+0
The last (only) architecture specific arch_idle_time() implementation was removed with commit be76ea614460 ("s390/idle: remove arch_cpu_idle_time() and corresponding code"). Therefore remove the now dead code in fs/proc/stat.c as well. Link: https://lkml.kernel.org/r/20230405143452.2677172-1-hca@linux.ibm.com Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-18smaps: fix defined but not used smaps_shmem_walk_opsSteven Price1-2/+1
When !CONFIG_SHMEM smaps_shmem_walk_ops is defined but not used, triggering a compiler warning. To avoid the warning remove the #ifdef around the usage. This has no effect because shmem_mapping() is a stub returning false when !CONFIG_SHMEM so the code will be compiled out, however we now need to also provide a stub for shmem_swap_usage(). Link: https://lkml.kernel.org/r/20230405103819.151246-1-steven.price@arm.com Fixes: 7b86ac3371b7 ("pagewalk: separate function pointers from iterator data") Signed-off-by: Steven Price <steven.price@arm.com> Reported-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/oe-kbuild-all/202304031749.UiyJpxzF-lkp@intel.com/ Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-13proc_sysctl: enhance documentationLuis Chamberlain1-5/+20
Expand documentation to clarify: o that paths don't need to exist for the new API callers o clarify that we *require* callers to keep the memory of the table around during the lifetime of the sysctls o annotate routines we are trying to deprecate and later remove Cc: stable@vger.kernel.org # v5.17 Cc: Christian Brauner <brauner@kernel.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-04-13sysctl: clarify register_sysctl_init() base directory orderLuis Chamberlain1-4/+1
Relatively new docs which I added which hinted the base directories needed to be created before is wrong, remove that incorrect comment. This has been hinted before by Eric twice already [0] [1], I had just not verified that until now. Now that I've verified that updates the docs to relax the context described. [0] https://lkml.kernel.org/r/875ys0azt8.fsf@email.froward.int.ebiederm.org [1] https://lkml.kernel.org/r/87ftbiud6s.fsf@x220.int.ebiederm.org Cc: stable@vger.kernel.org # v5.17 Cc: Christian Brauner <brauner@kernel.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Suggested-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-04-13proc_sysctl: move helper which creates required subdirectoriesLuis Chamberlain1-24/+32
Move the code which creates the subdirectories for a ctl table into a helper routine so to make it easier to review. Document the goal. This creates no functional changes. Reviewed-by: John Johansen <john.johansen@canonical.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-04-13proc_sysctl: update docs for __register_sysctl_table()Luis Chamberlain1-3/+11
Update the docs for __register_sysctl_table() to make it clear no child entries can be passed. When the child is true these are non-leaf entries on the ctl table and sysctl treats these as directories. The point to __register_sysctl_table() is to deal only with directories not part of the ctl table where thay may riside, to be simple and avoid recursion. While at it, hint towards using long on extra1 and extra2 later. Cc: stable@vger.kernel.org # v5.17 Cc: Christian Brauner <brauner@kernel.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-04-08ELF: fix all "Elf" typosAlexey Dobriyan1-11/+11
ELF is acronym and therefore should be spelled in all caps. I left one exception at Documentation/arm/nwfpe/nwfpe.rst which looks like being written in the first person. Link: https://lkml.kernel.org/r/Y/3wGWQviIOkyLJW@p183 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-08proc: remove mark_inode_dirty() in .setattr()Chao Yu3-3/+0
procfs' .setattr() has updated i_uid, i_gid and i_mode into proc dirent, we don't need to call mark_inode_dirty() for delayed update, remove it. Link: https://lkml.kernel.org/r/20230131150840.34726-1-chao@kernel.org Signed-off-by: Chao Yu <chao@kernel.org> Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05mm: vmalloc: convert vread() to vread_iter()Lorenzo Stoakes1-21/+23
Having previously laid the foundation for converting vread() to an iterator function, pull the trigger and do so. This patch attempts to provide minimal refactoring and to reflect the existing logic as best we can, for example we continue to zero portions of memory not read, as before. Overall, there should be no functional difference other than a performance improvement in /proc/kcore access to vmalloc regions. Now we have eliminated the need for a bounce buffer in read_kcore_iter(), we dispense with it, and try to write to user memory optimistically but with faults disabled via copy_page_to_iter_nofault(). We already have preemption disabled by holding a spin lock. We continue faulting in until the operation is complete. Additionally, we must account for the fact that at any point a copy may fail (most likely due to a fault not being able to occur), we exit indicating fewer bytes retrieved than expected. [sfr@canb.auug.org.au: fix sparc64 warning] Link: https://lkml.kernel.org/r/20230320144721.663280c3@canb.auug.org.au [lstoakes@gmail.com: redo Stephen's sparc build fix] Link: https://lkml.kernel.org/r/8506cbc667c39205e65a323f750ff9c11a463798.1679566220.git.lstoakes@gmail.com [akpm@linux-foundation.org: unbreak uio.h includes] Link: https://lkml.kernel.org/r/941f88bc5ab928e6656e1e2593b91bf0f8c81e1b.1679511146.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: David Hildenbrand <david@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Liu Shixin <liushixin2@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05fs/proc/kcore: convert read_kcore() to read_kcore_iter()Lorenzo Stoakes1-18/+18
For the time being we still use a bounce buffer for vread(), however in the next patch we will convert this to interact directly with the iterator and eliminate the bounce buffer altogether. Link: https://lkml.kernel.org/r/ebe12c8d70eebd71f487d80095605f3ad0d1489c.1679511146.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Liu Shixin <liushixin2@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05fs/proc/kcore: avoid bounce buffer for ktext dataLorenzo Stoakes1-12/+5
Patch series "convert read_kcore(), vread() to use iterators", v8. While reviewing Baoquan's recent changes to permit vread() access to vm_map_ram regions of vmalloc allocations, Willy pointed out [1] that it would be nice to refactor vread() as a whole, since its only user is read_kcore() and the existing form of vread() necessitates the use of a bounce buffer. This patch series does exactly that, as well as adjusting how we read the kernel text section to avoid the use of a bounce buffer in this case as well. This has been tested against the test case which motivated Baoquan's changes in the first place [2] which continues to function correctly, as do the vmalloc self tests. This patch (of 4): Commit df04abfd181a ("fs/proc/kcore.c: Add bounce buffer for ktext data") introduced the use of a bounce buffer to retrieve kernel text data for /proc/kcore in order to avoid failures arising from hardened user copies enabled by CONFIG_HARDENED_USERCOPY in check_kernel_text_object(). We can avoid doing this if instead of copy_to_user() we use _copy_to_user() which bypasses the hardening check. This is more efficient than using a bounce buffer and simplifies the code. We do so as part an overall effort to eliminate bounce buffer usage in the function with an eye to converting it an iterator read. Link: https://lkml.kernel.org/r/cover.1679566220.git.lstoakes@gmail.com Link: https://lore.kernel.org/all/Y8WfDSRkc%2FOHP3oD@casper.infradead.org/ [1] Link: https://lore.kernel.org/all/87ilk6gos2.fsf@oracle.com/T/#u [2] Link: https://lkml.kernel.org/r/fd39b0bfa7edc76d360def7d034baaee71d90158.1679511146.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Liu Shixin <liushixin2@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-05mm/memtest: add results of early memtest to /proc/meminfoTomas Mudrunka1-0/+13
Currently the memtest results were only presented in dmesg. When running a large fleet of devices without ECC RAM it's currently not easy to do bulk monitoring for memory corruption. You have to parse dmesg, but that's a ring buffer so the error might disappear after some time. In general I do not consider dmesg to be a great API to query RAM status. In several companies I've seen such errors remain undetected and cause issues for way too long. So I think it makes sense to provide a monitoring API, so that we can safely detect and act upon them. This adds /proc/meminfo entry which can be easily used by scripts. Link: https://lkml.kernel.org/r/20230321103430.7130-1-tomas.mudrunka@gmail.com Signed-off-by: Tomas Mudrunka <tomas.mudrunka@gmail.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-29mm, pagemap: remove SLOB and SLQB from comments and documentationVlastimil Babka1-5/+4
SLOB has been removed and SLQB never merged, so remove their mentions from comments and documentation of pagemap. In stable_page_flags() also correct an outdated comment mentioning that PageBuddy() means a page->_refcount of -1, and remove compound_head() from the PageSlab() call, as that's already implicitly there thanks to PF_NO_TAIL. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
2023-03-16mm: Expose untagging mask in /proc/$PID/statusKirill A. Shutemov1-0/+7
Add a line in /proc/$PID/status to report untag_mask. It can be used to find out LAM status of the process from the outside. It is useful for debuggers. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Alexander Potapenko <glider@google.com> Link: https://lore.kernel.org/all/20230312112612.31869-10-kirill.shutemov%40linux.intel.com
2023-03-16mm: Introduce untagged_addr_remote()Kirill A. Shutemov1-2/+7
untagged_addr() removes tags/metadata from the address and brings it to the canonical form. The helper is implemented on arm64 and sparc. Both of them do untagging based on global rules. However, Linear Address Masking (LAM) on x86 introduces per-process settings for untagging. As a result, untagged_addr() is now only suitable for untagging addresses for the current proccess. The new helper untagged_addr_remote() has to be used when the address targets remote process. It requires the mmap lock for target mm to be taken. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Alexander Potapenko <glider@google.com> Link: https://lore.kernel.org/all/20230312112612.31869-6-kirill.shutemov%40linux.intel.com
2023-03-01capability: just use a 'u64' instead of a 'u32[2]' arrayLinus Torvalds1-6/+1
Back in 2008 we extended the capability bits from 32 to 64, and we did it by extending the single 32-bit capability word from one word to an array of two words. It was then obfuscated by hiding the "2" behind two macro expansions, with the reasoning being that maybe it gets extended further some day. That reasoning may have been valid at the time, but the last thing we want to do is to extend the capability set any more. And the array of values not only causes source code oddities (with loops to deal with it), but also results in worse code generation. It's a lose-lose situation. So just change the 'u32[2]' into a 'u64' and be done with it. We still have to deal with the fact that the user space interface is designed around an array of these 32-bit values, but that was the case before too, since the array layouts were different (ie user space doesn't use an array of 32-bit values for individual capability masks, but an array of 32-bit slices of multiple masks). So that marshalling of data is actually simplified too, even if it does remain somewhat obscure and odd. This was all triggered by my reaction to the new "cap_isidentical()" introduced recently. By just using a saner data structure, it went from unsigned __capi; CAP_FOR_EACH_U32(__capi) { if (a.cap[__capi] != b.cap[__capi]) return false; } return true; to just being return a.val == b.val; instead. Which is rather more obvious both to humans and to compilers. Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Casey Schaufler <casey@schaufler-ca.com> Cc: Serge Hallyn <serge@hallyn.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Paul Moore <paul@paul-moore.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-02-23Merge tag 'mm-nonmm-stable-2023-02-20-15-29' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: "There is no particular theme here - mainly quick hits all over the tree. Most notable is a set of zlib changes from Mikhail Zaslonko which enhances and fixes zlib's use of S390 hardware support: 'lib/zlib: Set of s390 DFLTCC related patches for kernel zlib'" * tag 'mm-nonmm-stable-2023-02-20-15-29' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (55 commits) Update CREDITS file entry for Jesper Juhl sparc: allow PM configs for sparc32 COMPILE_TEST hung_task: print message when hung_task_warnings gets down to zero. arch/Kconfig: fix indentation scripts/tags.sh: fix the Kconfig tags generation when using latest ctags nilfs2: prevent WARNING in nilfs_dat_commit_end() lib/zlib: remove redundation assignement of avail_in dfltcc_gdht() lib/Kconfig.debug: do not enable DEBUG_PREEMPT by default lib/zlib: DFLTCC always switch to software inflate for Z_PACKET_FLUSH option lib/zlib: DFLTCC support inflate with small window lib/zlib: Split deflate and inflate states for DFLTCC lib/zlib: DFLTCC not writing header bits when avail_out == 0 lib/zlib: fix DFLTCC ignoring flush modes when avail_in == 0 lib/zlib: fix DFLTCC not flushing EOBS when creating raw streams lib/zlib: implement switching between DFLTCC and software lib/zlib: adjust offset calculation for dfltcc_state nilfs2: replace WARN_ONs for invalid DAT metadata block requests scripts/spelling.txt: add "exsits" pattern and fix typo instances fs: gracefully handle ->get_block not mapping bh in __mpage_writepage cramfs: Kconfig: fix spelling & punctuation ...
2023-02-23Merge tag 'mm-stable-2023-02-20-13-37' of ↵Linus Torvalds3-19/+17
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Daniel Verkamp has contributed a memfd series ("mm/memfd: add F_SEAL_EXEC") which permits the setting of the memfd execute bit at memfd creation time, with the option of sealing the state of the X bit. - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare") which addresses a rare race condition related to PMD unsharing. - Several folioification patch serieses from Matthew Wilcox, Vishal Moola, Sidhartha Kumar and Lorenzo Stoakes - Johannes Weiner has a series ("mm: push down lock_page_memcg()") which does perform some memcg maintenance and cleanup work. - SeongJae Park has added DAMOS filtering to DAMON, with the series "mm/damon/core: implement damos filter". These filters provide users with finer-grained control over DAMOS's actions. SeongJae has also done some DAMON cleanup work. - Kairui Song adds a series ("Clean up and fixes for swap"). - Vernon Yang contributed the series "Clean up and refinement for maple tree". - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It adds to MGLRU an LRU of memcgs, to improve the scalability of global reclaim. - David Hildenbrand has added some userfaultfd cleanup work in the series "mm: uffd-wp + change_protection() cleanups". - Christoph Hellwig has removed the generic_writepages() library function in the series "remove generic_writepages". - Baolin Wang has performed some maintenance on the compaction code in his series "Some small improvements for compaction". - Sidhartha Kumar is doing some maintenance work on struct page in his series "Get rid of tail page fields". - David Hildenbrand contributed some cleanup, bugfixing and generalization of pte management and of pte debugging in his series "mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap PTEs". - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation flag in the series "Discard __GFP_ATOMIC". - Sergey Senozhatsky has improved zsmalloc's memory utilization with his series "zsmalloc: make zspage chain size configurable". - Joey Gouly has added prctl() support for prohibiting the creation of writeable+executable mappings. The previous BPF-based approach had shortcomings. See "mm: In-kernel support for memory-deny-write-execute (MDWE)". - Waiman Long did some kmemleak cleanup and bugfixing in the series "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF". - T.J. Alumbaugh has contributed some MGLRU cleanup work in his series "mm: multi-gen LRU: improve". - Jiaqi Yan has provided some enhancements to our memory error statistics reporting, mainly by presenting the statistics on a per-node basis. See the series "Introduce per NUMA node memory error statistics". - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog regression in compaction via his series "Fix excessive CPU usage during compaction". - Christoph Hellwig does some vmalloc maintenance work in the series "cleanup vfree and vunmap". - Christoph Hellwig has removed block_device_operations.rw_page() in ths series "remove ->rw_page". - We get some maple_tree improvements and cleanups in Liam Howlett's series "VMA tree type safety and remove __vma_adjust()". - Suren Baghdasaryan has done some work on the maintainability of our vm_flags handling in the series "introduce vm_flags modifier functions". - Some pagemap cleanup and generalization work in Mike Rapoport's series "mm, arch: add generic implementation of pfn_valid() for FLATMEM" and "fixups for generic implementation of pfn_valid()" - Baoquan He has done some work to make /proc/vmallocinfo and /proc/kcore better represent the real state of things in his series "mm/vmalloc.c: allow vread() to read out vm_map_ram areas". - Jason Gunthorpe rationalized the GUP system's interface to the rest of the kernel in the series "Simplify the external interface for GUP". - SeongJae Park wishes to migrate people from DAMON's debugfs interface over to its sysfs interface. To support this, we'll temporarily be printing warnings when people use the debugfs interface. See the series "mm/damon: deprecate DAMON debugfs interface". - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes and clean-ups" series. - Huang Ying has provided a dramatic reduction in migration's TLB flush IPI rates with the series "migrate_pages(): batch TLB flushing". - Arnd Bergmann has some objtool fixups in "objtool warning fixes". * tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits) include/linux/migrate.h: remove unneeded externs mm/memory_hotplug: cleanup return value handing in do_migrate_range() mm/uffd: fix comment in handling pte markers mm: change to return bool for isolate_movable_page() mm: hugetlb: change to return bool for isolate_hugetlb() mm: change to return bool for isolate_lru_page() mm: change to return bool for folio_isolate_lru() objtool: add UACCESS exceptions for __tsan_volatile_read/write kmsan: disable ftrace in kmsan core code kasan: mark addr_has_metadata __always_inline mm: memcontrol: rename memcg_kmem_enabled() sh: initialize max_mapnr m68k/nommu: add missing definition of ARCH_PFN_OFFSET mm: percpu: fix incorrect size in pcpu_obj_full_size() maple_tree: reduce stack usage with gcc-9 and earlier mm: page_alloc: call panic() when memoryless node allocation fails mm: multi-gen LRU: avoid futile retries migrate_pages: move THP/hugetlb migration support check to simplify code migrate_pages: batch flushing TLB migrate_pages: share more code between _unmap and _move ...
2023-02-21sysctl: fix proc_dobool() usabilityOndrej Mosnacek1-0/+6
Currently proc_dobool expects a (bool *) in table->data, but sizeof(int) in table->maxsize, because it uses do_proc_dointvec() directly. This is unsafe for at least two reasons: 1. A sysctl table definition may use { .data = &variable, .maxsize = sizeof(variable) }, not realizing that this makes the sysctl unusable (see the Fixes: tag) and that they need to use the completely counterintuitive sizeof(int) instead. 2. proc_dobool() will currently try to parse an array of values if given .maxsize >= 2*sizeof(int), but will try to write values of type bool by offsets of sizeof(int), so it will not work correctly with neither an (int *) nor a (bool *). There is no .maxsize validation to prevent this. Fix this by: 1. Constraining proc_dobool() to allow only one value and .maxsize == sizeof(bool). 2. Wrapping the original struct ctl_table in a temporary one with .data pointing to a local int variable and .maxsize set to sizeof(int) and passing this one to proc_dointvec(), converting the value to/from bool as needed (using proc_dou8vec_minmax() as an example). 3. Extending sysctl_check_table() to enforce proc_dobool() expectations. 4. Fixing the proc_dobool() docstring (it was just copy-pasted from proc_douintvec, apparently...). 5. Converting all existing proc_dobool() users to set .maxsize to sizeof(bool) instead of sizeof(int). Fixes: 83efeeeb3d04 ("tty: Allow TIOCSTI to be disabled") Fixes: a2071573d634 ("sysctl: introduce new proc handler proc_dobool") Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-02-20Merge tag 'fs.idmapped.v6.3' of ↵Linus Torvalds8-33/+33
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping Pull vfs idmapping updates from Christian Brauner: - Last cycle we introduced the dedicated struct mnt_idmap type for mount idmapping and the required infrastucture in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). As promised in last cycle's pull request message this converts everything to rely on struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevant on the mount level. Especially for non-vfs developers without detailed knowledge in this area this was a potential source for bugs. This finishes the conversion. Instead of passing the plain namespace around this updates all places that currently take a pointer to a mnt_userns with a pointer to struct mnt_idmap. Now that the conversion is done all helpers down to the really low-level helpers only accept a struct mnt_idmap argument instead of two namespace arguments. Conflating mount and other idmappings will now cause the compiler to complain loudly thus eliminating the possibility of any bugs. This makes it impossible for filesystem developers to mix up mount and filesystem idmappings as they are two distinct types and require distinct helpers that cannot be used interchangeably. Everything associated with struct mnt_idmap is moved into a single separate file. With that change no code can poke around in struct mnt_idmap. It can only be interacted with through dedicated helpers. That means all filesystems are and all of the vfs is completely oblivious to the actual implementation of idmappings. We are now also able to extend struct mnt_idmap as we see fit. For example, we can decouple it completely from namespaces for users that don't require or don't want to use them at all. We can also extend the concept of idmappings so we can cover filesystem specific requirements. In combination with the vfs{g,u}id_t work we finished in v6.2 this makes this feature substantially more robust and thus difficult to implement wrong by a given filesystem and also protects the vfs. - Enable idmapped mounts for tmpfs and fulfill a longstanding request. A long-standing request from users had been to make it possible to create idmapped mounts for tmpfs. For example, to share the host's tmpfs mount between multiple sandboxes. This is a prerequisite for some advanced Kubernetes cases. Systemd also has a range of use-cases to increase service isolation. And there are more users of this. However, with all of the other work going on this was way down on the priority list but luckily someone other than ourselves picked this up. As usual the patch is tiny as all the infrastructure work had been done multiple kernel releases ago. In addition to all the tests that we already have I requested that Rodrigo add a dedicated tmpfs testsuite for idmapped mounts to xfstests. It is to be included into xfstests during the v6.3 development cycle. This should add a slew of additional tests. * tag 'fs.idmapped.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: (26 commits) shmem: support idmapped mounts for tmpfs fs: move mnt_idmap fs: port vfs{g,u}id helpers to mnt_idmap fs: port fs{g,u}id helpers to mnt_idmap fs: port i_{g,u}id_into_vfs{g,u}id() to mnt_idmap fs: port i_{g,u}id_{needs_}update() to mnt_idmap quota: port to mnt_idmap fs: port privilege checking helpers to mnt_idmap fs: port inode_owner_or_capable() to mnt_idmap fs: port inode_init_owner() to mnt_idmap fs: port acl to mnt_idmap fs: port xattr to mnt_idmap fs: port ->permission() to pass mnt_idmap fs: port ->fileattr_set() to pass mnt_idmap fs: port ->set_acl() to pass mnt_idmap fs: port ->get_acl() to pass mnt_idmap fs: port ->tmpfile() to pass mnt_idmap fs: port ->rename() to pass mnt_idmap fs: port ->mknod() to pass mnt_idmap fs: port ->mkdir() to pass mnt_idmap ...
2023-02-20Merge tag 'locks-v6.3' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux Pull file locking updates from Jeff Layton: "The main change here is that I've broken out most of the file locking definitions into a new header file. I also went ahead and completed the removal of locks_inode function" * tag 'locks-v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux: fs: remove locks_inode filelock: move file l