summaryrefslogtreecommitdiff
path: root/Documentation/admin-guide/sysctl
AgeCommit message (Collapse)AuthorFilesLines
2024-05-02net: make SK_MEMORY_PCPU_RESERV tunableAdam Li1-0/+5
[ Upstream commit 12a686c2e761f1f1f6e6e2117a9ab9c6de2ac8a7 ] This patch adds /proc/sys/net/core/mem_pcpu_rsv sysctl file, to make SK_MEMORY_PCPU_RESERV tunable. Commit 3cd3399dd7a8 ("net: implement per-cpu reserves for memory_allocated") introduced per-cpu forward alloc cache: "Implement a per-cpu cache of +1/-1 MB, to reduce number of changes to sk->sk_prot->memory_allocated, which would otherwise be cause of false sharing." sk_prot->memory_allocated points to global atomic variable: atomic_long_t tcp_memory_allocated ____cacheline_aligned_in_smp; If increasing the per-cpu cache size from 1MB to e.g. 16MB, changes to sk->sk_prot->memory_allocated can be further reduced. Performance may be improved on system with many cores. Signed-off-by: Adam Li <adamli@os.amperecomputing.com> Reviewed-by: Christoph Lameter (Ampere) <cl@linux.com> Signed-off-by: David S. Miller <davem@davemloft.net> Stable-dep-of: 3584718cf2ec ("net: fix sk_memory_allocated_{add|sub} vs softirqs") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-08Merge tag 'io_uring-6.6-2023-09-08' of git://git.kernel.dk/linuxLinus Torvalds1-0/+29
Pull io_uring fixes from Jens Axboe: "A few fixes that should go into the 6.6-rc merge window: - Fix for a regression this merge window caused by the SQPOLL affinity patch, where we can race with SQPOLL thread shutdown and cause an oops when trying to set affinity (Gabriel) - Fix for a regression this merge window where fdinfo reading with for a ring setup with IORING_SETUP_NO_SQARRAY will attempt to deference the non-existing SQ ring array (me) - Add the patch that allows more finegrained control over who can use io_uring (Matteo) - Locking fix for a regression added this merge window for IOPOLL overflow (Pavel) - IOPOLL fix for stable, breaking our loop if helper threads are exiting (Pavel) Also had a fix for unreaped iopoll requests from io-wq from Ming, but we found an issue with that and hence it got reverted. Will get this sorted for a future rc" * tag 'io_uring-6.6-2023-09-08' of git://git.kernel.dk/linux: Revert "io_uring: fix IO hang in io_wq_put_and_exit from do_exit()" io_uring: fix unprotected iopoll overflow io_uring: break out of iowq iopoll on teardown io_uring: add a sysctl to disable io_uring system-wide io_uring/fdinfo: only print ->sq_array[] if it's there io_uring: fix IO hang in io_wq_put_and_exit from do_exit() io_uring: Don't set affinity on a dying sqpoll thread
2023-09-05io_uring: add a sysctl to disable io_uring system-wideMatteo Rizzo1-0/+29
Introduce a new sysctl (io_uring_disabled) which can be either 0, 1, or 2. When 0 (the default), all processes are allowed to create io_uring instances, which is the current behavior. When 1, io_uring creation is disabled (io_uring_setup() will fail with -EPERM) for unprivileged processes not in the kernel.io_uring_group group. When 2, calls to io_uring_setup() fail with -EPERM regardless of privilege. Signed-off-by: Matteo Rizzo <matteorizzo@google.com> [JEM: modified to add io_uring_group] Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Link: https://lore.kernel.org/r/x49y1i42j1z.fsf@segfault.boston.devel.redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-16Documentation: admin-guide: Add riscv sysctl_perf_user_accessAlexandre Ghiti1-4/+23
riscv now uses this sysctl so document its usage for this architecture. Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
2023-06-28Merge tag 'net-next-6.5' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking changes from Jakub Kicinski: "WiFi 7 and sendpage changes are the biggest pieces of work for this release. The latter will definitely require fixes but I think that we got it to a reasonable point. Core: - Rework the sendpage & splice implementations Instead of feeding data into sockets page by page extend sendmsg handlers to support taking a reference on the data, controlled by a new flag called MSG_SPLICE_PAGES Rework the handling of unexpected-end-of-file to invoke an additional callback instead of trying to predict what the right combination of MORE/NOTLAST flags is Remove the MSG_SENDPAGE_NOTLAST flag completely - Implement SCM_PIDFD, a new type of CMSG type analogous to SCM_CREDENTIALS, but it contains pidfd instead of plain pid - Enable socket busy polling with CONFIG_RT - Improve reliability and efficiency of reporting for ref_tracker - Auto-generate a user space C library for various Netlink families Protocols: - Allow TCP to shrink the advertised window when necessary, prevent sk_rcvbuf auto-tuning from growing the window all the way up to tcp_rmem[2] - Use per-VMA locking for "page-flipping" TCP receive zerocopy - Prepare TCP for device-to-device data transfers, by making sure that payloads are always attached to skbs as page frags - Make the backoff time for the first N TCP SYN retransmissions linear. Exponential backoff is unnecessarily conservative - Create a new MPTCP getsockopt to retrieve all info (MPTCP_FULL_INFO) - Avoid waking up applications using TLS sockets until we have a full record - Allow using kernel memory for protocol ioctl callbacks, paving the way to issuing ioctls over io_uring - Add nolocalbypass option to VxLAN, forcing packets to be fully encapsulated even if they are destined for a local IP address - Make TCPv4 use consistent hash in TIME_WAIT and SYN_RECV. Ensure in-kernel ECMP implementation (e.g. Open vSwitch) select the same link for all packets. Support L4 symmetric hashing in Open vSwitch - PPPoE: make number of hash bits configurable - Allow DNS to be overwritten by DHCPACK in the in-kernel DHCP client (ipconfig) - Add layer 2 miss indication and filtering, allowing higher layers (e.g. ACL filters) to make forwarding decisions based on whether packet matched forwarding state in lower devices (bridge) - Support matching on Connectivity Fault Management (CFM) packets - Hide the "link becomes ready" IPv6 messages by demoting their printk level to debug - HSR: don't enable promiscuous mode if device offloads the proto - Support active scanning in IEEE 802.15.4 - Continue work on Multi-Link Operation for WiFi 7 BPF: - Add precision propagation for subprogs and callbacks. This allows maintaining verification efficiency when subprograms are used, or in fact passing the verifier at all for complex programs, especially those using open-coded iterators - Improve BPF's {g,s}setsockopt() length handling. Previously BPF assumed the length is always equal to the amount of written data. But some protos allow passing a NULL buffer to discover what the output buffer *should* be, without writing anything - Accept dynptr memory as memory arguments passed to helpers - Add routing table ID to bpf_fib_lookup BPF helper - Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands - Drop bpf_capable() check in BPF_MAP_FREEZE command (used to mark maps as read-only) - Show target_{obj,btf}_id in tracing link fdinfo - Addition of several new kfuncs (most of the names are self-explanatory): - Add a set of new dynptr kfuncs: bpf_dynptr_adjust(), bpf_dynptr_is_null(), bpf_dynptr_is_rdonly(), bpf_dynptr_size() and bpf_dynptr_clone(). - bpf_task_under_cgroup() - bpf_sock_destroy() - force closing sockets - bpf_cpumask_first_and(), rework bpf_cpumask_any*() kfuncs Netfilter: - Relax set/map validation checks in nf_tables. Allow checking presence of an entry in a map without using the value - Increase ip_vs_conn_tab_bits range for 64BIT builds - Allow updating size of a set - Improve NAT tuple selection when connection is closing Driver API: - Integrate netdev with LED subsystem, to allow configuring HW "offloaded" blinking of LEDs based on link state and activity (i.e. packets coming in and out) - Support configuring rate selection pins of SFP modules - Factor Clause 73 auto-negotiation code out of the drivers, provide common helper routines - Add more fool-proof helpers for managing lifetime of MDIO devices associated with the PCS layer - Allow drivers to report advanced statistics related to Time Aware scheduler offload (taprio) - Allow opting out of VF statistics in link dump, to allow more VFs to fit into the message - Split devlink instance and devlink port operations New hardware / drivers: - Ethernet: - Synopsys EMAC4 IP support (stmmac) - Marvell 88E6361 8 port (5x1GE + 3x2.5GE) switches - Marvell 88E6250 7 port switches - Microchip LAN8650/1 Rev.B0 PHYs - MediaTek MT7981/MT7988 built-in 1GE PHY driver - WiFi: - Realtek RTL8192FU, 2.4 GHz, b/g/n mode, 2T2R, 300 Mbps - Realtek RTL8723DS (SDIO variant) - Realtek RTL8851BE - CAN: - Fintek F81604 Drivers: - Ethernet NICs: - Intel (100G, ice): - support dynamic interrupt allocation - use meta data match instead of VF MAC addr on slow-path - nVidia/Mellanox: - extend link aggregation to handle 4, rather than just 2 ports - spawn sub-functions without any features by default - OcteonTX2: - support HTB (Tx scheduling/QoS) offload - make RSS hash generation configurable - support selecting Rx queue using TC filters - Wangxun (ngbe/txgbe): - add basic Tx/Rx packet offloads - add phylink support (SFP/PCS control) - Freescale/NXP (enetc): - report TAPRIO packet statistics - Solarflare/AMD: - support matching on IP ToS and UDP source port of outer header - VxLAN and GENEVE tunnel encapsulation over IPv4 or IPv6 - add devlink dev info support for EF10 - Virtual NICs: - Microsoft vNIC: - size the Rx indirection table based on requested configuration - support VLAN tagging - Amazon vNIC: - try to reuse Rx buffers if not fully consumed, useful for ARM servers running with 16kB pages - Google vNIC: - support TCP segmentation of >64kB frames - Ethernet embedded switches: - Marvell (mv88e6xxx): - enable USXGMII (88E6191X) - Microchip: - lan966x: add support for Egress Stage 0 ACL engine - lan966x: support mapping packet priority to internal switch priority (based on PCP or DSCP) - Ethernet PHYs: - Broadcom PHYs: - support for Wake-on-LAN for BCM54210E/B50212E - report LPI counter - Microsemi PHYs: support RGMII delay configuration (VSC85xx) - Micrel PHYs: receive timestamp in the frame (LAN8841) - Realtek PHYs: support optional external PHY clock - Altera TSE PCS: merge the driver into Lynx PCS which it is a variant of - CAN: Kvaser PCIEcan: - support packet timestamping - WiFi: - Intel (iwlwifi): - major update for new firmware and Multi-Link Operation (MLO) - configuration rework to drop test devices and split the different families - support for segmented PNVM images and power tables - new vendor entries for PPAG (platform antenna gain) feature - Qualcomm 802.11ax (ath11k): - Multiple Basic Service Set Identifier (MBSSID) and Enhanced MBSSID Advertisement (EMA) support in AP mode - support factory test mode - RealTek (rtw89): - add RSSI based antenna diversity - support U-NII-4 channels on 5 GHz band - RealTek (rtl8xxxu): - AP mode support for 8188f - support USB RX aggregation for the newer chips" * tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1602 commits) net: scm: introduce and use scm_recv_unix helper af_unix: Skip SCM_PIDFD if scm->pid is NULL. net: lan743x: Simplify comparison netlink: Add __sock_i_ino() for __netlink_diag_dump(). net: dsa: avoid suspicious RCU usage for synced VLAN-aware MAC addresses Revert "af_unix: Call scm_recv() only after scm_set_cred()." phylink: ReST-ify the phylink_pcs_neg_mode() kdoc libceph: Partially revert changes to support MSG_SPLICE_PAGES net: phy: mscc: fix packet loss due to RGMII delays net: mana: use vmalloc_array and vcalloc net: enetc: use vmalloc_array and vcalloc ionic: use vmalloc_array and vcalloc pds_core: use vmalloc_array and vcalloc gve: use vmalloc_array and vcalloc octeon_ep: use vmalloc_array and vcalloc net: usb: qmi_wwan: add u-blox 0x1312 composition perf trace: fix MSG_SPLICE_PAGES build error ipvlan: Fix return value of ipvlan_queue_xmit() netfilter: nf_tables: fix underflow in chain reference counter netfilter: nf_tables: unbind non-anonymous set if rule construction fails ...
2023-06-21docs: arm64: Move arm64 documentation under Documentation/arch/Jonathan Corbet1-1/+1
Architecture-specific documentation is being moved into Documentation/arch/ as a way of cleaning up the top-level documentation directory and making the docs hierarchy more closely match the source hierarchy. Move Documentation/arm64 into arch/ (along with the Chinese equvalent translations) and fix up documentation references. Cc: Will Deacon <will@kernel.org> Cc: Alex Shi <alexs@kernel.org> Cc: Hu Haowen <src.res@email.cn> Cc: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Yantengsi <siyanteng@loongson.cn> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2023-05-25Documentation: net: net.core.txrehash is not specific to listening socketsAntoine Tenart1-2/+2
The net.core.txrehash documentation mentions this knob is for listening sockets only, while sk_rethink_txhash can be called on SYN and RTO retransmits on all TCP sockets. Remove the listening socket part. Signed-off-by: Antoine Tenart <atenart@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-03-30docs: move x86 documentation into Documentation/arch/Jonathan Corbet1-2/+2
Move the x86 documentation under Documentation/arch/ as a way of cleaning up the top-level directory and making the structure of our docs more closely match the structure of the source directories it describes. All in-kernel references to the old paths have been updated. Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: linux-arch@vger.kernel.org Cc: x86@kernel.org Cc: Borislav Petkov <bp@alien8.de> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/lkml/20230315211523.108836-1-corbet@lwn.net/ Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2023-02-23Merge tag 'mm-nonmm-stable-2023-02-20-15-29' of ↵Linus Torvalds1-3/+22
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: "There is no particular theme here - mainly quick hits all over the tree. Most notable is a set of zlib changes from Mikhail Zaslonko which enhances and fixes zlib's use of S390 hardware support: 'lib/zlib: Set of s390 DFLTCC related patches for kernel zlib'" * tag 'mm-nonmm-stable-2023-02-20-15-29' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (55 commits) Update CREDITS file entry for Jesper Juhl sparc: allow PM configs for sparc32 COMPILE_TEST hung_task: print message when hung_task_warnings gets down to zero. arch/Kconfig: fix indentation scripts/tags.sh: fix the Kconfig tags generation when using latest ctags nilfs2: prevent WARNING in nilfs_dat_commit_end() lib/zlib: remove redundation assignement of avail_in dfltcc_gdht() lib/Kconfig.debug: do not enable DEBUG_PREEMPT by default lib/zlib: DFLTCC always switch to software inflate for Z_PACKET_FLUSH option lib/zlib: DFLTCC support inflate with small window lib/zlib: Split deflate and inflate states for DFLTCC lib/zlib: DFLTCC not writing header bits when avail_out == 0 lib/zlib: fix DFLTCC ignoring flush modes when avail_in == 0 lib/zlib: fix DFLTCC not flushing EOBS when creating raw streams lib/zlib: implement switching between DFLTCC and software lib/zlib: adjust offset calculation for dfltcc_state nilfs2: replace WARN_ONs for invalid DAT metadata block requests scripts/spelling.txt: add "exsits" pattern and fix typo instances fs: gracefully handle ->get_block not mapping bh in __mpage_writepage cramfs: Kconfig: fix spelling & punctuation ...
2023-02-22Merge tag 'docs-6.3' of git://git.lwn.net/linuxLinus Torvalds1-2/+2
Pull documentation updates from Jonathan Corbet: "It has been a moderately calm cycle for documentation; the significant changes include: - Some significant additions to the memory-management documentation - Some improvements to navigation in the HTML-rendered docs - More Spanish and Chinese translations ... and the usual set of typo fixes and such" * tag 'docs-6.3' of git://git.lwn.net/linux: (68 commits) Documentation/watchdog/hpwdt: Fix Format Documentation/watchdog/hpwdt: Fix Reference Documentation: core-api: padata: correct spelling docs/mm: Physical Memory: correct spelling in reference to CONFIG_PAGE_EXTENSION docs: Use HTML comments for the kernel-toc SPDX line docs: Add more information to the HTML sidebar Documentation: KVM: Update AMD memory encryption link printk: Document that CONFIG_BOOT_PRINTK_DELAY required for boot_delay= Documentation: userspace-api: correct spelling Documentation: sparc: correct spelling Documentation: driver-api: correct spelling Documentation: admin-guide: correct spelling docs: add workload-tracing document to admin-guide docs/admin-guide/mm: remove useless markup docs/mm: remove useless markup docs/mm: Physical Memory: remove useless markup docs/sp_SP: Add process magic-number translation docs: ftrace: always use canonical ftrace path Doc/damon: fix the data path error dma-buf: Add "dma-buf" to title of documentation ...
2023-02-09net: introduce default_rps_mask netns attributePaolo Abeni1-0/+6
If RPS is enabled, this allows configuring a default rps mask, which is effective since receive queue creation time. A default RPS mask allows the system admin to ensure proper isolation, avoiding races at network namespace or device creation time. The default RPS mask is initially empty, and can be modified via a newly added sysctl entry. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-02kexec: introduce sysctl parameters kexec_load_limit_*Ricardo Ribalda1-0/+18
kexec allows replacing the current kernel with a different one. This is usually a source of concerns for sysadmins that want to harden a system. Linux already provides a way to disable loading new kexec kernel via kexec_load_disabled, but that control is very coard, it is all or nothing and does not make distinction between a panic kexec and a normal kexec. This patch introduces new sysctl parameters, with finer tuning to specify how many times a kexec kernel can be loaded. The sysadmin can set different limits for kexec panic and kexec reboot kernels. The value can be modified at runtime via sysctl, but only with a stricter value. With these new parameters on place, a system with loadpin and verity enabled, using the following kernel parameters: sysctl.kexec_load_limit_reboot=0 sysct.kexec_load_limit_panic=1 can have a good warranty that if initrd tries to load a panic kernel, a malitious user will have small chances to replace that kernel with a different one, even if they can trigger timeouts on the disk where the panic kernel lives. Link: https://lkml.kernel.org/r/20221114-disable-kexec-reset-v6-3-6a8531a09b9a@chromium.org Signed-off-by: Ricardo Ribalda <ribalda@chromium.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Acked-by: Baoquan He <bhe@redhat.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Philipp Rudo <prudo@redhat.com> Cc: Ross Zwisler <zwisler@kernel.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02Documentation: sysctl: correct kexec_load_disabledRicardo Ribalda1-3/+4
Patch series "kexec: Add new parameter to limit the access to kexec", v6. Add two parameter to specify how many times a kexec kernel can be loaded. These parameter allow hardening the system. While we are at it, fix a documentation issue and refactor some code. This patch (of 3): kexec_load_disabled affects both ``kexec_load`` and ``kexec_file_load`` syscalls. Make it explicit. Link: https://lkml.kernel.org/r/20221114-disable-kexec-reset-v6-0-6a8531a09b9a@chromium.org Link: https://lkml.kernel.org/r/20221114-disable-kexec-reset-v6-1-6a8531a09b9a@chromium.org Signed-off-by: Ricardo Ribalda <ribalda@chromium.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Acked-by: Baoquan He <bhe@redhat.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Guilherme G. Piccoli <gpiccoli@igalia.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Philipp Rudo <prudo@redhat.com> Cc: Ross Zwisler <zwisler@kernel.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02Documentation: admin-guide: correct spellingRandy Dunlap1-2/+2
Correct spelling problems for Documentation/admin-guide/ as reported by codespell. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com> Cc: Tejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: cgroups@vger.kernel.org Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Cc: dm-devel@redhat.com Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: linux-media@vger.kernel.org Cc: linux-mm@kvack.org Link: https://lore.kernel.org/r/20230129231053.20863-2-rdunlap@infradead.org Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2022-12-19Merge tag 'loongarch-6.2' of ↵Linus Torvalds1-4/+4
git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson Pull LoongArch updates from Huacai Chen: - Switch to relative exception tables - Add unaligned access support - Add alternative runtime patching mechanism - Add FDT booting support from efi system table - Add suspend/hibernation (ACPI S3/S4) support - Add basic STACKPROTECTOR support - Add ftrace (function tracer) support - Update the default config file * tag 'loongarch-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson: (24 commits) LoongArch: Update Loongson-3 default config file LoongArch: modules/ftrace: Initialize PLT at load time LoongArch/ftrace: Add HAVE_FUNCTION_GRAPH_RET_ADDR_PTR support LoongArch/ftrace: Add HAVE_DYNAMIC_FTRACE_WITH_ARGS support LoongArch/ftrace: Add HAVE_DYNAMIC_FTRACE_WITH_REGS support LoongArch/ftrace: Add dynamic function graph tracer support LoongArch/ftrace: Add dynamic function tracer support LoongArch/ftrace: Add recordmcount support LoongArch/ftrace: Add basic support LoongArch: module: Use got/plt section indices for relocations LoongArch: Add basic STACKPROTECTOR support LoongArch: Add hibernation (ACPI S4) support LoongArch: Add suspend (ACPI S3) support LoongArch: Add processing ISA Node in DeviceTree LoongArch: Add FDT booting support from efi system table LoongArch: Use alternative to optimize libraries LoongArch: Add alternative runtime patching mechanism LoongArch: Add unaligned access support LoongArch: BPF: Add BPF exception tables LoongArch: Remove the .fixup section usage ...
2022-12-14Merge tag 'hardening-v6.2-rc1' of ↵Linus Torvalds1-0/+19
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull kernel hardening updates from Kees Cook: - Convert flexible array members, fix -Wstringop-overflow warnings, and fix KCFI function type mismatches that went ignored by maintainers (Gustavo A. R. Silva, Nathan Chancellor, Kees Cook) - Remove the remaining side-effect users of ksize() by converting dma-buf, btrfs, and coredump to using kmalloc_size_roundup(), add more __alloc_size attributes, and introduce full testing of all allocator functions. Finally remove the ksize() side-effect so that each allocation-aware checker can finally behave without exceptions - Introduce oops_limit (default 10,000) and warn_limit (default off) to provide greater granularity of control for panic_on_oops and panic_on_warn (Jann Horn, Kees Cook) - Introduce overflows_type() and castable_to_type() helpers for cleaner overflow checking - Improve code generation for strscpy() and update str*() kern-doc - Convert strscpy and sigphash tests to KUnit, and expand memcpy tests - Always use a non-NULL argument for prepare_kernel_cred() - Disable structleak plugin in FORTIFY KUnit test (Anders Roxell) - Adjust orphan linker section checking to respect CONFIG_WERROR (Xin Li) - Make sure siginfo is cleared for forced SIGKILL (haifeng.xu) - Fix um vs FORTIFY warnings for always-NULL arguments * tag 'hardening-v6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (31 commits) ksmbd: replace one-element arrays with flexible-array members hpet: Replace one-element array with flexible-array member um: virt-pci: Avoid GCC non-NULL warning signal: Initialize the info in ksignal lib: fortify_kunit: build without structleak plugin panic: Expose "warn_count" to sysfs panic: Introduce warn_limit panic: Consolidate open-coded panic_on_warn checks exit: Allow oops_limit to be disabled exit: Expose "oops_count" to sysfs exit: Put an upper limit on how often we can oops panic: Separate sysctl logic from CONFIG_SMP mm/pgtable: Fix multiple -Wstringop-overflow warnings mm: Make ksize() a reporting-only function kunit/fortify: Validate __alloc_size attribute results drm/sti: Fix return type of sti_{dvo,hda,hdmi}_connector_mode_valid() drm/fsl-dcu: Fix return type of fsl_dcu_drm_connector_mode_valid() driver core: Add __alloc_size hint to devm allocators overflow: Introduce overflows_type() and castable_to_type() coredump: Proactively round up to kmalloc bucket size ...
2022-12-14LoongArch: Add unaligned access supportHuacai Chen1-4/+4
Loongson-2 series (Loongson-2K500, Loongson-2K1000) don't support unaligned access in hardware, while Loongson-3 series (Loongson-3A5000, Loongson-3C5000) are configurable whether support unaligned access in hardware. This patch add unaligned access emulation for those LoongArch processors without hardware support. Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2022-12-12Merge tag 'mm-nonmm-stable-2022-12-12' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - A ptrace API cleanup series from Sergey Shtylyov - Fixes and cleanups for kexec from ye xingchen - nilfs2 updates from Ryusuke Konishi - squashfs feature work from Xiaoming Ni: permit configuration of the filesystem's compression concurrency from the mount command line - A series from Akinobu Mita which addresses bound checking errors when writing to debugfs files - A series from Yang Yingliang to address rapidio memory leaks - A series from Zheng Yejian to address possible overflow errors in encode_comp_t() - And a whole shower of singleton patches all over the place * tag 'mm-nonmm-stable-2022-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (79 commits) ipc: fix memory leak in init_mqueue_fs() hfsplus: fix bug causing custom uid and gid being unable to be assigned with mount rapidio: devices: fix missing put_device in mport_cdev_open kcov: fix spelling typos in comments hfs: Fix OOB Write in hfs_asc2mac hfs: fix OOB Read in __hfs_brec_find relay: fix type mismatch when allocating memory in relay_create_buf() ocfs2: always read both high and low parts of dinode link count io-mapping: move some code within the include guarded section kernel: kcsan: kcsan_test: build without structleak plugin mailmap: update email for Iskren Chernev eventfd: change int to __u64 in eventfd_signal() ifndef CONFIG_EVENTFD rapidio: fix possible UAF when kfifo_alloc() fails relay: use strscpy() is more robust and safer cpumask: limit visibility of FORCE_NR_CPUS acct: fix potential integer overflow in encode_comp_t() acct: fix accuracy loss for input value of encode_comp_t() linux/init.h: include <linux/build_bug.h> and <linux/stringify.h> rapidio: rio: fix possible name leak in rio_register_mport() rapidio: fix possible name leaks when rio_add_device() fails ...
2022-12-12Merge tag 'docs-6.2' of git://git.lwn.net/linuxLinus Torvalds2-145/+97
Pull documentation updates from Jonathan Corbet: "This was a not-too-busy cycle for documentation; highlights include: - The beginnings of a set of translations into Spanish, headed up by Carlos Bilbao - More Chinese translations - A change to the Sphinx "alabaster" theme by default for HTML generation. Unlike the previous default (Read the Docs), alabaster is shipped with Sphinx by default, reducing the number of other dependencies that need to be installed. It also (IMO) produces a cleaner and more readable result. - The ability to render the documentation into the texinfo format (something Sphinx could always do, we just never wired it up until now) Plus the usual collection of typo fixes, build-warning fixes, and minor updates" * tag 'docs-6.2' of git://git.lwn.net/linux: (67 commits) Documentation/features: Use loongarch instead of loong Documentation/features-refresh.sh: Only sed the beginning "arch" of ARCH_DIR docs/zh_CN: Fix '.. only::' directive's expression docs/sp_SP: Add memory-barriers.txt Spanish translation docs/zh_CN/LoongArch: Update links of LoongArch ISA Vol1 and ELF psABI docs/LoongArch: Update links of LoongArch ISA Vol1 and ELF psABI Documentation/features: Update feature lists for 6.1 Documentation: Fixed a typo in bootconfig.rst docs/sp_SP: Add process coding-style translation docs/sp_SP: Add kernel-docs.rst Spanish translation docs: Create translations/sp_SP/process/, move submitting-patches.rst docs: Add book to process/kernel-docs.rst docs: Retire old resources from kernel-docs.rst docs: Update maintainer of kernel-docs.rst Documentation: riscv: Document the sv57 VM layout Documentation: USB: correct possessive "its" usage math64: fix kernel-doc return value warnings math64: add kernel-doc for DIV64_U64_ROUND_UP math64: favor kernel-doc from header files doc: add texinfodocs and infodocs targets ...
2022-12-02panic: Introduce warn_limitKees Cook1-0/+10
Like oops_limit, add warn_limit for limiting the number of warnings when panic_on_warn is not set. Cc: Jonathan Corbet <corbet@lwn.net> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: "Jason A. Donenfeld" <Jason@zx2c4.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Petr Mladek <pmladek@suse.com> Cc: tangmeng <tangmeng@uniontech.com> Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com> Cc: Tiezhu Yang <yangtiezhu@loongson.cn> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: linux-doc@vger.kernel.org Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20221117234328.594699-5-keescook@chromium.org
2022-12-02exit: Allow oops_limit to be disabledKees Cook1-2/+3
In preparation for keeping oops_limit logic in sync with warn_limit, have oops_limit == 0 disable checking the Oops counter. Cc: Jann Horn <jannh@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: "Jason A. Donenfeld" <Jason@zx2c4.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Huang Ying <ying.huang@intel.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: linux-doc@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>
2022-12-01exit: Put an upper limit on how often we can oopsJann Horn1-0/+8
Many Linux systems are configured to not panic on oops; but allowing an attacker to oops the system **really** often can make even bugs that look completely unexploitable exploitable (like NULL dereferences and such) if each crash elevates a refcount by one or a lock is taken in read mode, and this causes a counter to eventually overflow. The most interesting counters for this are 32 bits wide (like open-coded refcounts that don't use refcount_t). (The ldsem reader count on 32-bit platforms is just 16 bits, but probably nobody cares about 32-bit platforms that much nowadays.) So let's panic the system if the kernel is constantly oopsing. The speed of oopsing 2^32 times probably depends on several factors, like how long the stack trace is and which unwinder you're using; an empirically important one is whether your console is showing a graphical environment or a text console that oopses will be printed to. In a quick single-threaded benchmark, it looks like oopsing in a vfork() child with a very short stack trace only takes ~510 microseconds per run when a graphical console is active; but switching to a text console that oopses are printed to slows it down around 87x, to ~45 milliseconds per run. (Adding more threads makes this faster, but the actual oops printing happens under &die_lock on x86, so you can maybe speed this up by a factor of around 2 and then any further improvement gets eaten up by lock contention.) It looks like it would take around 8-12 days to overflow a 32-bit counter with repeated oopsing on a multi-core X86 system running a graphical environment; both me (in an X86 VM) and Seth (with a distro kernel on normal hardware in a standard configuration) got numbers in that ballpark. 12 days aren't *that* short on a desktop system, and you'd likely need much longer on a typical server system (assuming that people don't run graphical desktop environments on their servers), and this is a *very* noisy and violent approach to exploiting the kernel; and it also seems to take orders of magnitude longer on some machines, probably because stuff like EFI pstore will slow it down a ton if that's active. Signed-off-by: Jann Horn <jannh@google.com> Link: https://lore.kernel.org/r/20221107201317.324457-1-jannh@google.com Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20221117234328.594699-2-keescook@chromium.org
2022-11-18core_pattern: add CPU specifierOleksandr Natalenko1-0/+1
Statistically, in a large deployment regular segfaults may indicate a CPU issue. Currently, it is not possible to find out what CPU the segfault happened on. There are at least two attempts to improve segfault logging with this regard, but they do not help in case the logs rotate. Hence, lets make sure it is possible to permanently record a CPU the task ran on using a new core_pattern specifier. Link: https://lkml.kernel.org/r/20220903064330.20772-1-oleksandr@redhat.com Signed-off-by: Oleksandr Natalenko <oleksandr@redhat.com> Suggested-by: Renaud Métrich <rmetrich@redhat.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: Grzegorz Halat <ghalat@redhat.com> Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Jason A. Donenfeld <Jason@zx2c4.com> Cc: Joel Savitz <jsavitz@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kees Cook <keescook@chromium.org> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Rob Herring <robh@kernel.org> Cc: Stephen Kitt <steve@sk2.org> Cc: Will Deacon <will@kernel.org> Cc: Xiaoming Ni <nixiaoming@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-10x86/split_lock: Add sysctl to control the misery modeGuilherme G. Piccoli1-0/+23
Commit b041b525dab9 ("x86/split_lock: Make life miserable for split lockers") changed the way the split lock detector works when in "warn" mode; basically, it not only shows the warn message, but also intentionally introduces a slowdown through sleeping plus serialization mechanism on such task. Based on discussions in [0], seems the warning alone wasn't enough motivation for userspace developers to fix their applications. This slowdown is enough to totally break some proprietary (aka. unfixable) userspace[1]. Happens that originally the proposal in [0] was to add a new mode which would warns + slowdown the "split locking" task, keeping the old warn mode untouched. In the end, that idea was discarded and the regular/default "warn" mode now slows down the applications. This is quite aggressive with regards proprietary/legacy programs that basically are unable to properly run in kernel with this change. While it is understandable that a malicious application could DoS by split locking, it seems unacceptable to regress old/proprietary userspace programs through a default configuration that previously worked. An example of such breakage was reported in [1]. Add a sysctl to allow controlling the "misery mode" behavior, as per Thomas suggestion on [2]. This way, users running legacy and/or proprietary software are allowed to still execute them with a decent performance while still observing the warning messages on kernel log. [0] https://lore.kernel.org/lkml/20220217012721.9694-1-tony.luck@intel.com/ [1] https://github.com/doitsujin/dxvk/issues/2938 [2] https://lore.kernel.org/lkml/87pmf4bter.ffs@tglx/ [ dhansen: minor changelog tweaks, including clarifying the actual problem ] Fixes: b041b525dab9 ("x86/split_lock: Make life miserable for split lockers") Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Tested-by: Andre Almeida <andrealmeid@igalia.com> Link: https://lore.kernel.org/all/20221024200254.635256-1-gpiccoli%40igalia.com
2022-10-19docs: sysctl/fs: re-order, prettifyStephen Kitt2-98/+92
This brings the text markup in line with sysctl/abi and sysctl/kernel: * the entries are ordered alphabetically * the table of contents is automatically generated * markup is used as appropriate for constants etc. The content isn't fully up-to-date but the obsolete entries are gone, so remove the kernel version mention. Signed-off-by: Stephen Kitt <steve@sk2.org> Link: https://lore.kernel.org/r/20220930102937.135841-6-steve@sk2.org Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2022-10-19docs: sysctl/fs: remove references to super-max/-nrStephen Kitt1-10/+0
These were removed in 2.4.7.8. Remove references to super-max and super-nr in the sysctl documentation. Signed-off-by: Stephen Kitt <steve@sk2.org> Link: https://lore.kernel.org/r/20220930102937.135841-5-steve@sk2.org Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2022-10-19docs: sysctl/fs: merge the aio sectionsStephen Kitt1-13/+5
There are two sections documenting aio-nr and aio-max-nr, merge them. I kept the second explanation of aio-nr, which seems clearer to me, along with the effects of the values from the first section. Signed-off-by: Stephen Kitt <steve@sk2.org> Link: https://lore.kernel.org/r/20220930102937.135841-4-steve@sk2.org Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2022-10-19docs: sysctl/fs: remove references to dquot-max/-nrStephen Kitt1-16/+0
dquot-max was removed in 2.4.10.5; dquot-nr was replaced with dqstats in 2.5.18 which is now /proc/sys/fs/quota. Remove references to dquot-max and dquot-nr in the sysctl documentation. Signed-off-by: Stephen Kitt <steve@sk2.org> Link: https://lore.kernel.org/r/20220930102937.135841-3-steve@sk2.org Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2022-10-19docs: sysctl/fs: remove references to inode-maxStephen Kitt1-12/+4
inode-max was removed in 2.3.20pre1, remove references to it in the sysctl documentation. Signed-off-by: Stephen Kitt <steve@sk2.org> Link: https://lore.kernel.org/r/20220930102937.135841-2-steve@sk2.org Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2022-10-12Merge tag 'mm-nonmm-stable-2022-10-11' of ↵Linus Torvalds1-0/+5
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - hfs and hfsplus kmap API modernization (Fabio Francesco) - make crash-kexec work properly when invoked from an NMI-time panic (Valentin Schneider) - ntfs bugfixes (Hawkins Jiawei) - improve IPC msg scalability by replacing atomic_t's with percpu counters (Jiebin Sun) - nilfs2 cleanups (Minghao Chi) - lots of other single patches all over the tree! * tag 'mm-nonmm-stable-2022-10-11' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (71 commits) include/linux/entry-common.h: remove has_signal comment of arch_do_signal_or_restart() prototype proc: test how it holds up with mapping'less process mailmap: update Frank Rowand email address ia64: mca: use strscpy() is more robust and safer init/Kconfig: fix unmet direct dependencies ia64: update config files nilfs2: replace WARN_ONs by nilfs_error for checkpoint acquisition failure fork: remove duplicate included header files init/main.c: remove unnecessary (void*) conversions proc: mark more files as permanent nilfs2: remove the unneeded result variable nilfs2: delete unnecessary checks before brelse() checkpatch: warn for non-standard fixes tag style usr/gen_init_cpio.c: remove unnecessary -1 values from int file ipc/msg: mitigate the lock contention with percpu counter percpu: add percpu_counter_add_local and percpu_counter_sub_local fs/ocfs2: fix repeated words in comments relay: use kvcalloc to alloc page array in relay_alloc_page_array proc: make config PROC_CHILDREN depend on PROC_FS fs: uninline inode_maybe_inc_iversion() ...
2022-10-10Merge tag 'mm-stable-2022-10-08' of ↵Linus Torvalds2-0/+14
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Yu Zhao's Multi-Gen LRU patches are here. They've been under test in linux-next for a couple of months without, to my knowledge, any negative reports (or any positive ones, come to that). - Also the Maple Tree from Liam Howlett. An overlapping range-based tree for vmas. It it apparently slightly more efficient in its own right, but is mainly targeted at enabling work to reduce mmap_lock contention. Liam has identified a number of other tree users in the kernel which could be beneficially onverted to mapletrees. Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat at [1]. This has yet to be addressed due to Liam's unfortunately timed vacation. He is now back and we'll get this fixed up. - Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses clang-generated instrumentation to detect used-unintialized bugs down to the single bit level. KMSAN keeps finding bugs. New ones, as well as the legacy ones. - Yang Shi adds a userspace mechanism (madvise) to induce a collapse of memory into THPs. - Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to support file/shmem-backed pages. - userfaultfd updates from Axel Rasmussen - zsmalloc cleanups from Alexey Romanov - cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and memory-failure - Huang Ying adds enhancements to NUMA balancing memory tiering mode's page promotion, with a new way of detecting hot pages. - memcg updates from Shakeel Butt: charging optimizations and reduced memory consumption. - memcg cleanups from Kairui Song. - memcg fixes and cleanups from Johannes Weiner. - Vishal Moola provides more folio conversions - Zhang Yi removed ll_rw_block() :( - migration enhancements from Peter Xu - migration error-path bugfixes from Huang Ying - Aneesh Kumar added ability for a device driver to alter the memory tiering promotion paths. For optimizations by PMEM drivers, DRM drivers, etc. - vma merging improvements from Jakub Matěn. - NUMA hinting cleanups from David Hildenbrand. - xu xin added aditional userspace visibility into KSM merging activity. - THP & KSM code consolidation from Qi Zheng. - more folio work from Matthew Wilcox. - KASAN updates from Andrey Konovalov. - DAMON cleanups from Kaixu Xia. - DAMON work from SeongJae Park: fixes, cleanups. - hugetlb sysfs cleanups from Muchun Song. - Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core. Link: https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com [1] * tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (555 commits) hugetlb: allocate vma lock for all sharable vmas hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer hugetlb: fix vma lock handling during split vma and range unmapping mglru: mm/vmscan.c: fix imprecise comments mm/mglru: don't sync disk for each aging cycle mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol mm: memcontrol: use do_memsw_account() in a few more places mm: memcontrol: deprecate swapaccounting=0 mode mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled mm/secretmem: remove reduntant return value mm/hugetlb: add available_huge_pages() func mm: remove unused inline functions from include/linux/mm_inline.h selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd selftests/vm: add thp collapse shmem testing selftests/vm: add thp collapse file and tmpfs testing selftests/vm: modularize thp collapse memory operations selftests/vm: dedup THP helpers mm/khugepaged: add tracepoint to hpage_collapse_scan_file() mm/madvise: add file and shmem support to MADV_COLLAPSE ...
2022-09-16bpf: Use bpf_capable() instead of CAP_SYS_ADMIN for blinding decisionYauheni Kaliuta1-0/+3