summaryrefslogtreecommitdiff
path: root/include/uapi/linux
AgeCommit message (Collapse)AuthorFilesLines
2024-09-11Merge branch 'tip/sched/core' into sched_ext/for-6.12Tejun Heo1-3/+3
Pull in tip/sched/core to resolve two merge conflicts: - 96fd6c65efc6 ("sched: Factor out update_other_load_avgs() from __update_blocked_others()") 5d871a63997f ("sched/fair: Move effective_cpu_util() and effective_cpu_util() in fair.c") A simple context conflict. The former added __update_blocked_others() in the same #ifdef CONFIG_SMP block that effective_cpu_util() and sched_cpu_util() are in and the latter moved those functions to fair.c. This makes __update_blocked_others() more out of place. Will follow up with a patch to relocate. - 96fd6c65efc6 ("sched: Factor out update_other_load_avgs() from __update_blocked_others()") 84d265281d6c ("sched/pelt: Use rq_clock_task() for hw_pressure") The former factored out the body of __update_blocked_others() into update_other_load_avgs(). The latter changed how update_hw_load_avg() is called in the body. Resolved by applying the change to update_other_load_avgs() instead. Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-11block: implement async io_uring discard cmdPavel Begunkov1-0/+14
io_uring allows implementing custom file specific asynchronous operations via the fops->uring_cmd callback, a.k.a. IORING_OP_URING_CMD requests or just io_uring commands. Use it to add support for async discards. Normally, it first tries to queue up bios in a non-blocking context, and if that fails, we'd retry from a blocking context by returning -EAGAIN to the core io_uring. We always get the result from bios asynchronously by setting a custom bi_end_io callback, at which point we drag the request into the task context to either reissue or complete it and post a completion to the user. Unlike ioctl(BLKDISCARD) with stronger guarantees against races, we only do a best effort attempt to invalidate page cache, and it can race with any writes and reads and leave page cache stale. It's the same kind of races we allow to direct writes. Also, apart from cases where discarding is not allowed at all, e.g. discards are not supported or the file/device is read only, the user should assume that the sector range on disk is not valid anymore, even when an error was returned to the user. Suggested-by: Conrad Meyer <conradmeyer@meta.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/2b5210443e4fa0257934f73dfafcc18a77cd0e09.1726072086.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-11Merge branch 'for-6.12/io_uring' into for-6.12/io_uring-discardJens Axboe1-1/+28
* for-6.12/io_uring: (31 commits) io_uring/io-wq: inherit cpuset of cgroup in io worker io_uring/io-wq: do not allow pinning outside of cpuset io_uring/rw: drop -EOPNOTSUPP check in __io_complete_rw_common() io_uring/rw: treat -EOPNOTSUPP for IOCB_NOWAIT like -EAGAIN io_uring/sqpoll: do not allow pinning outside of cpuset io_uring/eventfd: move refs to refcount_t io_uring: remove unused rsrc_put_fn io_uring: add new line after variable declaration io_uring: add GCOV_PROFILE_URING Kconfig option io_uring/kbuf: add support for incremental buffer consumption io_uring/kbuf: pass in 'len' argument for buffer commit Revert "io_uring: Require zeroed sqe->len on provided-buffers send" io_uring/kbuf: move io_ring_head_to_buf() to kbuf.h io_uring/kbuf: add io_kbuf_commit() helper io_uring/kbuf: shrink nr_iovs/mode in struct buf_sel_arg io_uring: wire up min batch wake timeout io_uring: add support for batch wait timeout io_uring: implement our own schedule timeout handling io_uring: move schedule wait logic into helper io_uring: encapsulate extraneous wait flags into a separate struct ...
2024-09-11Merge branch 'for-6.12/block' into for-6.12/io_uring-discardJens Axboe1-2/+6
* for-6.12/block: (115 commits) block: unpin user pages belonging to a folio at once mm: release number of pages of a folio block: introduce folio awareness and add a bigger size from folio block: Added folio-ized version of bio_add_hw_page() block, bfq: factor out a helper to split bfqq in bfq_init_rq() block, bfq: remove local variable 'bfqq_already_existing' in bfq_init_rq() block, bfq: remove local variable 'split' in bfq_init_rq() block, bfq: remove bfq_log_bfqg() block, bfq: merge bfq_release_process_ref() into bfq_put_cooperator() block, bfq: fix procress reference leakage for bfqq in merge chain block, bfq: fix uaf for accessing waker_bfqq after splitting blk-throttle: support prioritized processing of metadata blk-throttle: remove last_low_overflow_time drbd: Add NULL check for net_conf to prevent dereference in state validation blk-mq: add missing unplug trace event mtip32xx: Remove redundant null pointer checks in mtip_hw_debugfs_init() md: Add new_level sysfs interface zram: Shrink zram_table_entry::flags. zram: Remove ZRAM_LOCK zram: Replace bit spinlocks with a spinlock_t. ...
2024-09-11sched/deadline: Clarify nanoseconds in uapiChristian Loehle1-3/+3
Specify the time values of the deadline parameters of deadline, runtime, and period as being in nanoseconds explicitly as they always have been. Signed-off-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Rafael J. Wysocki <rafael@kernel.org> Link: https://lore.kernel.org/r/20240813144348.1180344-3-christian.loehle@arm.com
2024-09-10PCI: Rename CRS Completion Status to RRSBjorn Helgaas1-2/+4
PCIe r6.0 changed the abbreviation for "Configuration Request Retry Status" Completion Status from "CRS" to "RRS" and uses the terminology of "Configuration RRS Software Visibility" instead of "CRS Software Visibility". Align the Linux usage with the r6.0 spec language. No functional change intended. It's confusing to make this change, but I think "RRS" *is* a better abbreviation because it was easy to interpret "CRS" as "Completion Retry Status", which really didn't make any sense. Link: https://lore.kernel.org/r/20240827234848.4429-4-helgaas@kernel.org Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2024-09-10net-timestamp: introduce SOF_TIMESTAMPING_OPT_RX_FILTER flagJason Xing1-1/+2
introduce a new flag SOF_TIMESTAMPING_OPT_RX_FILTER in the receive path. User can set it with SOF_TIMESTAMPING_SOFTWARE to filter out rx software timestamp report, especially after a process turns on netstamp_needed_key which can time stamp every incoming skb. Previously, we found out if an application starts first which turns on netstamp_needed_key, then another one only passing SOF_TIMESTAMPING_SOFTWARE could also get rx timestamp. Now we handle this case by introducing this new flag without breaking users. Quoting Willem to explain why we need the flag: "why a process would want to request software timestamp reporting, but not receive software timestamp generation. The only use I see is when the application does request SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_TX_SOFTWARE." Similarly, this new flag could also be used for hardware case where we can set it with SOF_TIMESTAMPING_RAW_HARDWARE, then we won't receive hardware receive timestamp. Another thing about errqueue in this patch I have a few words to say: In this case, we need to handle the egress path carefully, or else reporting the tx timestamp will fail. Egress path and ingress path will finally call sock_recv_timestamp(). We have to distinguish them. Errqueue is a good indicator to reflect the flow direction. Suggested-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20240909015612.3856-2-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-10vdpa: support set mac address from vdpa toolCindy Lu1-0/+1
Add new UAPI to support the mac address from vdpa tool Function vdpa_nl_cmd_dev_attr_set_doit() will get the new MAC address from the vdpa tool and then set it to the device. The usage is: vdpa dev set name vdpa_name mac **:**:**:**:**:** Here is example: root@L1# vdpa -jp dev config show vdpa0 { "config": { "vdpa0": { "mac": "82:4d:e9:5d:d7:e6", "link ": "up", "link_announce ": false, "mtu": 1500 } } } root@L1# vdpa dev set name vdpa0 mac 00:11:22:33:44:55 root@L1# vdpa -jp dev config show vdpa0 { "config": { "vdpa0": { "mac": "00:11:22:33:44:55", "link ": "up", "link_announce ": false, "mtu": 1500 } } } Signed-off-by: Cindy Lu <lulu@redhat.com> Message-Id: <20240731031653.1047692-2-lulu@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com>
2024-09-10virtio_balloon: introduce memory scan/reclaim infozhenwei pi1-2/+10
Expose memory scan/reclaim information to the host side via virtio balloon device. Now we have a metric to analyze the memory performance: y: counter increases n: counter does not changes h: the rate of counter change is high l: the rate of counter change is low OOM: VIRTIO_BALLOON_S_OOM_KILL STALL: VIRTIO_BALLOON_S_ALLOC_STALL ASCAN: VIRTIO_BALLOON_S_SCAN_ASYNC DSCAN: VIRTIO_BALLOON_S_SCAN_DIRECT ARCLM: VIRTIO_BALLOON_S_RECLAIM_ASYNC DRCLM: VIRTIO_BALLOON_S_RECLAIM_DIRECT - OOM[y], STALL[*], ASCAN[*], DSCAN[*], ARCLM[*], DRCLM[*]: the guest runs under really critial memory pressure - OOM[n], STALL[h], ASCAN[*], DSCAN[l], ARCLM[*], DRCLM[l]: the memory allocation stalls due to cgroup, not the global memory pressure. - OOM[n], STALL[h], ASCAN[*], DSCAN[h], ARCLM[*], DRCLM[h]: the memory allocation stalls due to global memory pressure. The performance gets hurt a lot. A high ratio between DRCLM/DSCAN shows quite effective memory reclaiming. - OOM[n], STALL[h], ASCAN[*], DSCAN[h], ARCLM[*], DRCLM[l]: the memory allocation stalls due to global memory pressure. the ratio between DRCLM/DSCAN gets low, the guest OS is thrashing heavily, the serious case leads poor performance and difficult trouble shooting. Ex, sshd may block on memory allocation when accepting new connections, a user can't login a VM by ssh command. - OOM[n], STALL[n], ASCAN[h], DSCAN[n], ARCLM[l], DRCLM[n]: the low ratio between ARCLM/ASCAN shows that the guest tries to reclaim more memory, but it can't. Once more memory is required in future, it will struggle to reclaim memory. Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: zhenwei pi <pizhenwei@bytedance.com> Message-Id: <20240423034109.1552866-5-pizhenwei@bytedance.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-09-10virtio_balloon: introduce memory allocation stall counterzhenwei pi1-2/+4
Memory allocation stall counter represents the performance/latency of memory allocation, expose this counter to the host side by virtio balloon device via out-of-bound way. Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: zhenwei pi <pizhenwei@bytedance.com> Message-Id: <20240423034109.1552866-4-pizhenwei@bytedance.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-09-10virtio_balloon: introduce oom-kill invocationszhenwei pi1-2/+4
When the guest OS runs under critical memory pressure, the guest starts to kill processes. A guest monitor agent may scan 'oom_kill' from /proc/vmstat, and reports the OOM KILL event. However, the agent may be killed and we will loss this critical event(and the later events). For now we can also grep for magic words in guest kernel log from host side. Rather than this unstable way, virtio balloon reports OOM-KILL invocations instead. Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: zhenwei pi <pizhenwei@bytedance.com> Message-Id: <20240423034109.1552866-3-pizhenwei@bytedance.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-09-08ptp/ioctl: support MONOTONIC{,_RAW} timestamps for PTP_SYS_OFFSET_EXTENDEDMahesh Bandewar1-6/+18
The ability to read the PHC (Physical Hardware Clock) alongside multiple system clocks is currently dependent on the specific hardware architecture. This limitation restricts the use of PTP_SYS_OFFSET_PRECISE to certain hardware configurations. The generic soultion which would work across all architectures is to read the PHC along with the latency to perform PHC-read as offered by PTP_SYS_OFFSET_EXTENDED which provides pre and post timestamps. However, these timestamps are currently limited to the CLOCK_REALTIME timebase. Since CLOCK_REALTIME is affected by NTP (or similar time synchronization services), it can experience significant jumps forward or backward. This hinders the precise latency measurements that PTP_SYS_OFFSET_EXTENDED is designed to provide. This problem could be addressed by supporting MONOTONIC_RAW timestamps within PTP_SYS_OFFSET_EXTENDED. Unlike CLOCK_REALTIME or CLOCK_MONOTONIC, the MONOTONIC_RAW timebase is unaffected by NTP adjustments. This enhancement can be implemented by utilizing one of the three reserved words within the PTP_SYS_OFFSET_EXTENDED struct to pass the clock-id for timestamps. The current behavior aligns with clock-id for CLOCK_REALTIME timebase (value of 0), ensuring backward compatibility of the UAPI. Signed-off-by: Mahesh Bandewar <maheshb@google.com> Signed-off-by: Vadim Fedorenko <vadfed@meta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-09-06Merge tag 'nf-next-24-09-06' of ↵Jakub Kicinski1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next: Patch #1 adds ctnetlink support for kernel side filtering for deletions, from Changliang Wu. Patch #2 updates nft_counter support to Use u64_stats_t, from Sebastian Andrzej Siewior. Patch #3 uses kmemdup_array() in all xtables frontends, from Yan Zhen. Patch #4 is a oneliner to use ERR_CAST() in nf_conntrack instead opencoded casting, from Shen Lichuan. Patch #5 removes unused argument in nftables .validate interface, from Florian Westphal. Patch #6 is a oneliner to correct a typo in nftables kdoc, from Simon Horman. Patch #7 fixes missing kdoc in nftables, also from Simon. Patch #8 updates nftables to handle timeout less than CONFIG_HZ. Patch #9 rejects element expiration if timeout is zero, otherwise it is silently ignored. Patch #10 disallows element expiration larger than timeout. Patch #11 removes unnecessary READ_ONCE annotation while mutex is held. Patch #12 adds missing READ_ONCE/WRITE_ONCE annotation in dynset. Patch #13 annotates data-races around element expiration. Patch #14 allocates timeout and expiration in one single set element extension, they are tighly couple, no reason to keep them separated anymore. Patch #15 updates nftables to interpret zero timeout element as never times out. Note that it is already possible to declare sets with elements that never time out but this generalizes to all kind of set with timeouts. Patch #16 supports for element timeout and expiration updates. * tag 'nf-next-24-09-06' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: nf_tables: set element timeout update support netfilter: nf_tables: zero timeout means element never times out netfilter: nf_tables: consolidate timeout extension for elements netfilter: nf_tables: annotate data-races around element expiration netfilter: nft_dynset: annotate data-races around set timeout netfilter: nf_tables: remove annotation to access set timeout while holding lock netfilter: nf_tables: reject expiration higher than timeout netfilter: nf_tables: reject element expiration with no timeout netfilter: nf_tables: elements with timeout below CONFIG_HZ never expire netfilter: nf_tables: Add missing Kernel doc netfilter: nf_tables: Correct spelling in nf_tables.h netfilter: nf_tables: drop unused 3rd argument from validate callback ops netfilter: conntrack: Convert to use ERR_CAST() netfilter: Use kmemdup_array instead of kmemdup for multiple allocation netfilter: nft_counter: Use u64_stats_t for statistic. netfilter: ctnetlink: support CTA_FILTER for flush ==================== Link: https://patch.msgid.link/20240905232920.5481-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-06drm/amdkfd: Document and define SVM events message macroPhilip Yang1-13/+87
Document how to use SMI system management interface to enable and receive SVM events. Document SVM event triggers. Define SVM events message string format macro that could be used by user mode for sscanf to parse the event. Add it to uAPI header file to make it obvious that is changing uAPI in future. No functional changes. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: James Zhu <James.Zhu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-06nbd: implement the WRITE_ZEROES commandWouter Verhelst1-1/+4
The NBD protocol defines a message for zeroing out a region of an export Add support to the kernel driver for that message. Signed-off-by: Wouter Verhelst <w@uter.be> Cc: Eric Blake <eblake@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20240812133032.115134-3-w@uter.be Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-05media: input: serio.h: add SERIO_EXTRON_DA_HD_PLUSHans Verkuil1-0/+1
Add a new serio ID for the Extron DA HD 4K Plus series of 4K HDMI Distribution Amplifiers. These devices support CEC over the serial port, so a new serio ID is needed to be able to associate the CEC driver. Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl> Acked-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
2024-09-05media: videodev2.h: add V4L2_CAP_EDIDErling Ljunggren1-0/+1
Add capability flag to indicate that the device is an EDID-only device. Signed-off-by: Erling Ljunggren <hljunggr@cisco.com> Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl> Reviewed-by: Sebastian Fricke <sebastian.fricke@collabora.com> Reviewed-by: Ricardo Ribalda <ribalda@chromium.org> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
2024-09-05fhandle: expose u64 mount id to name_to_handle_at(2)Aleksa Sarai1-0/+1
Now that we provide a unique 64-bit mount ID interface in statx(2), we can now provide a race-free way for name_to_handle_at(2) to provide a file handle and corresponding mount without needing to worry about racing with /proc/mountinfo parsing or having to open a file just to do statx(2). While this is not necessary if you are using AT_EMPTY_PATH and don't care about an extra statx(2) call, users that pass full paths into name_to_handle_at(2) need to know which mount the file handle comes from (to make sure they don't try to open_by_handle_at a file handle from a different filesystem) and switching to AT_EMPTY_PATH would require allocating a file for every name_to_handle_at(2) call, turning err = name_to_handle_at(-EBADF, "/foo/bar/baz", &handle, &mntid, AT_HANDLE_MNT_ID_UNIQUE); into int fd = openat(-EBADF, "/foo/bar/baz", O_PATH | O_CLOEXEC); err1 = name_to_handle_at(fd, "", &handle, &unused_mntid, AT_EMPTY_PATH); err2 = statx(fd, "", AT_EMPTY_PATH, STATX_MNT_ID_UNIQUE, &statxbuf); mntid = statxbuf.stx_mnt_id; close(fd); Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Link: https://lore.kernel.org/r/20240828-exportfs-u64-mount-id-v3-2-10c2c4c16708@cyphar.com Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-05uapi: explain how per-syscall AT_* flags should be allocatedAleksa Sarai1-24/+56
Unfortunately, the way we have gone about adding new AT_* flags has been a little messy. In the beginning, all of the AT_* flags had generic meanings and so it made sense to share the flag bits indiscriminately. However, we inevitably ran into syscalls that needed their own syscall-specific flags. Due to the lack of a planned out policy, we ended up with the following situations: * Existing syscalls adding new features tended to use new AT_* bits, with some effort taken to try to re-use bits for flags that were so obviously syscall specific that they only make sense for a single syscall (such as the AT_EACCESS/AT_REMOVEDIR/AT_HANDLE_FID triplet). Given the constraints of bitflags, this works well in practice, but ideally (to avoid future confusion) we would plan ahead and define a set of "per-syscall bits" ahead of time so that when allocating new bits we don't end up with a complete mish-mash of which bits are supposed to be per-syscall and which aren't. * New syscalls dealt with this in several ways: - Some syscalls (like renameat2(2), move_mount(2), fsopen(2), and fspick(2)) created their separate own flag spaces that have no overlap with the AT_* flags. Most of these ended up allocating their bits sequentually. In the case of move_mount(2) and fspick(2), several flags have identical meanings to AT_* flags but were allocated in their own flag space. This makes sense for syscalls that will never share AT_* flags, but for some syscalls this leads to duplication with AT_* flags in a way that could cause confusion (if renameat2(2) grew a RENAME_EMPTY_PATH it seems likely that users could mistake it for AT_EMPTY_PATH since it is an *at(2) syscall). - Some syscalls unfortunately ended up both creating their own flag space while also using bits from other flag spaces. The most obvious example is open_tree(2), where the standard usage ends up using flags from *THREE* separate flag spaces: open_tree(AT_FDCWD, "/foo", OPEN_TREE_CLONE|O_CLOEXEC|AT_RECURSIVE); (Note that O_CLOEXEC is also platform-specific, so several future OPEN_TREE_* bits are also made unusable in one fell swoop.) It's not entirely clear to me what the "right" choice is for new syscalls. Just saying that all future VFS syscalls should use AT_* flags doesn't seem practical. openat2(2) has RESOLVE_* flags (many of which don't make much sense to burn generic AT_* flags for) and move_mount(2) has separate AT_*-like flags for both the source and target so separate flags are needed anyway (though it seems possible that renameat2(2) could grow *_EMPTY_PATH flags at some point, and it's a bit of a shame they can't be reused). But at least for syscalls that _do_ choose to use AT_* flags, we should explicitly state the policy that 0x2ff is currently intended for per-syscall flags and that new flags should err on the side of overlapping with existing flag bits (so we can extend the scope of generic flags in the future if necessary). And add AT_* aliases for the RENAME_* flags to further cement that renameat2(2) is an *at(2) flag, just with its own per-syscall flags. Suggested-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Link: https://lore.kernel.org/r/20240828-exportfs-u64-mount-id-v3-1-10c2c4c16708@cyphar.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-04ipv4: Fix user space build failure due to header changeIdo Schimmel1-2/+0
RT_TOS() from include/uapi/linux/in_route.h is defined using IPTOS_TOS_MASK from include/uapi/linux/ip.h. This is problematic for files such as include/net/ip_fib.h that want to use RT_TOS() as without including both header files kernel compilation fails: In file included from ./include/net/ip_fib.h:25, from ./include/net/route.h:27, from ./include/net/lwtunnel.h:9, from net/core/dst.c:24: ./include/net/ip_fib.h: In function ‘fib_dscp_masked_match’: ./include/uapi/linux/in_route.h:31:32: error: ‘IPTOS_TOS_MASK’ undeclared (first use in this function) 31 | #define RT_TOS(tos) ((tos)&IPTOS_TOS_MASK) | ^~~~~~~~~~~~~~ ./include/net/ip_fib.h:440:45: note: in expansion of macro ‘RT_TOS’ 440 | return dscp == inet_dsfield_to_dscp(RT_TOS(fl4->flowi4_tos)); Therefore, cited commit changed linux/in_route.h to include linux/ip.h. However, as reported by David, this breaks iproute2 compilation due overlapping definitions between linux/ip.h and /usr/include/netinet/ip.h: In file included from ../include/uapi/linux/in_route.h:5, from iproute.c:19: ../include/uapi/linux/ip.h:25:9: warning: "IPTOS_TOS" redefined 25 | #define IPTOS_TOS(tos) ((tos)&IPTOS_TOS_MASK) | ^~~~~~~~~ In file included from iproute.c:17: /usr/include/netinet/ip.h:222:9: note: this is the location of the previous definition 222 | #define IPTOS_TOS(tos) ((tos) & IPTOS_TOS_MASK) Fix by changing include/net/ip_fib.h to include linux/ip.h. Note that usage of RT_TOS() should not spread further in the kernel due to recent work in this area. Fixes: 1fa3314c14c6 ("ipv4: Centralize TOS matching") Reported-by: David Ahern <dsahern@kernel.org> Closes: https://lore.kernel.org/netdev/2f5146ff-507d-4cab-a195-b28c0c9e654e@kernel.org/ Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Guillaume Nault <gnault@redhat.com> Link: https://patch.msgid.link/20240903133554.2807343-1-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-04PCI/NPEM: Add Native PCIe Enclosure Management supportMariusz Tkaczyk1-0/+35
Native PCIe Enclosure Management (NPEM, PCIe r6.1 sec 6.28) allows managing LEDs in storage enclosures. NPEM is indication oriented and it does not give direct access to LEDs. Although each indication *could* represent an individual LED, multiple indications could also be represented as a single, multi-color LED or a single LED blinking in a specific interval. The specification leaves that open. Each enabled indication (capability register bit on) is represented as a ledclass_dev which can be controlled through sysfs. For every ledclass device only 2 brightness states are allowed: LED_ON (1) or LED_OFF (0). This corresponds to the NPEM control register (Indication bit on/off). Ledclass devices appear in sysfs as child devices (subdirectory) of PCI device which has an NPEM Extended Capability and indication is enabled in NPEM capability register. For example, these are LEDs created for pcieport "10000:02:05.0" on my setup: leds/ ├── 10000:02:05.0:enclosure:fail ├── 10000:02:05.0:enclosure:locate ├── 10000:02:05.0:enclosure:ok └── 10000:02:05.0:enclosure:rebuild They can be also found in "/sys/class/leds" directory. The parent PCIe device domain/bus/device/function address is used to guarantee uniqueness across leds subsystem. To enable/disable a "fail" indication, the "brightness" file can be edited: echo 1 > ./leds/10000:02:05.0:enclosure:fail/brightness echo 0 > ./leds/10000:02:05.0:enclosure:fail/brightness PCIe r6.1, sec 7.9.19.2 defines the possible indications. Multiple indications for same parent PCIe device can conflict and hardware may update them when processing new request. To avoid issues, driver refresh all indications by reading back control register. This driver expects to be the exclusive NPEM extended capability manager. It waits up to 1 second after imposing new request, it doesn't verify if controller is busy before write, and it assumes the mutex lock gives protection from concurrent updates. If _DSM LED management is available, we assume the platform may be using NPEM for its own purposes (see PCI Firmware Spec r3.3 sec 4.7), so the driver does not use NPEM. A future patch will add _DSM support; an info message notes whether NPEM or _DSM is being used. NPEM is a PCIe extended capability so it should be registered in pcie_init_capabilities() but it is not possible due to LED dependency. The parent pci_device must be added earlier for led_classdev_register() to be successful. NPEM does not require configuration on kernel side, so it is safe to register LED devices later. Link: https://lore.kernel.org/r/20240904104848.23480-3-mariusz.tkaczyk@linux.intel.com Suggested-by: Lukas Wunner <lukas@wunner.de> Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Tested-by: Stuart Hayes <stuart.w.hayes@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
2024-09-04Merge branch 'bpf/master' into for-6.12Tejun Heo4-6/+18
Pull bpf/master to receive baebe9aaba1e ("bpf: allow passing struct bpf_iter_<type> as kfunc arguments") and related changes in preparation for the DSQ iterator patchset. Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-04fuse: allow idmapped mountsAlexander Mikhalitsyn1-1/+19
Now we have everything in place and we can allow idmapped mounts by setting the FS_ALLOW_IDMAP flag. Notice that real availability of idmapped mounts will depend on the fuse daemon. Fuse daemon have to set FUSE_ALLOW_IDMAP flag in the FUSE_INIT reply. To discuss: - we enable idmapped mounts support only if "default_permissions" mode is enabled, because otherwise we would need to deal with UID/GID mappings in the userspace side OR provide the userspace with idmapped req->in.h.uid/req->in.h.gid values which is not something that we probably want to. Idmapped mounts philosophy is not about faking caller uid/gid. Some extra links and examples: - libfuse support https://github.com/mihalicyn/libfuse/commits/idmap_support - fuse-overlayfs support: https://github.com/mihalicyn/fuse-overlayfs/commits/idmap_support - cephfs-fuse conversion example https://github.com/mihalicyn/ceph/commits/fuse_idmap - glusterfs conversion example https://github.com/mihalicyn/glusterfs/commits/fuse_idmap Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-09-04fuse: add basic infrastructure to support idmappingsAlexander Mikhalitsyn1-0/+2
Add some preparational changes in fuse_get_req/fuse_force_creds to handle idmappings. Miklos suggested [1], [2] to change the meaning of in.h.uid/in.h.gid fields when daemon declares support for idmapped mounts. In a new semantic, we fill uid/gid values in fuse header with a id-mapped caller uid/gid (for requests which create new inodes), for all the rest cases we just send -1 to userspace. No functional changes intended. Link: https://lore.kernel.org/all/CAJfpegsVY97_5mHSc06mSw79FehFWtoXT=hhTUK_E-Yhr7OAuQ@mail.gmail.com/ [1] Link: https://lore.kernel.org/all/CAJfpegtHQsEUuFq1k4ZbTD3E1h-GsrN3PWyv7X8cg6sfU_W2Yw@mail.gmail.com/ [2] Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-09-04arm64/ptrace: add support for FEAT_POEJoey Gouly1-0/+1
Add a regset for POE containing POR_EL0. Signed-off-by: Joey Gouly <joey.gouly@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Reviewed-by: Mark Brown <broonie@kernel.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Link: https://lore.kernel.org/r/20240822151113.1479789-21-joey.gouly@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2024-09-03netfilter: nf_tables: zero timeout means element never times outPablo Neira Ayuso1-1/+1
This patch uses zero as timeout marker for those elements that never expire when the element is created. If userspace provides no timeout for an element, then the default set timeout applies. However, if no default set timeout is specified and timeout flag is set on, then timeout extension is allocated and timeout is set to zero to allow for future updates. Use of zero a never timeout marker has been suggested by Phil Sutter. Note that, in older kernels, it is already possible to define elements that never expire by declaring a set with the set timeout flag set on and no global set timeout, in this case, new element with no explicit timeout never expire do not allocate the timeout extension, hence, they never expire. This approach makes it complicated to accomodate element timeout update, because element extensions do not support reallocations. Therefore, allocate the timeout extension and use the new marker for this case, but do not expose it to userspace to retain backward compatibility in the set listing. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-09-03Merge tag 'iio-for-6.12a' of ↵Greg Kroah-Hartman1-2/+3
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/jic23/iio into char-misc-testing Jonathan writes: IIO: 1st set of new device support, features and cleanup for 6.12 Includes a merge of spi-mos-config branch from spi.git that brings support needed for the AD4000 driver. Lots of new device support this time including 9 new drivers and substantial changes to add new support to several more. New device support ------------------ Given we have a lot of new support, I've subcategorized them: Substantial changes, or new driver ********************************** adi,ad4000 - New driver for this high speed ADC. adi,ad4695 - New driver supporting AD4690, AD4696, AD4697 and AD4698 ADCs. - Follow up series added triggered buffer support. adi,ad7380 - Add support for single ended parts, AD7386, ADC7387, AD7388 and -4 variants. (driver previously only support differential parts). These variants have an additional front end MUX so only half the channels can be sampled efficiently. adi,ad9467 - Refactor and extend driver to support ad9643, ad9449 and ad9652 high speed ADCs. adi,adxl380 - New driver for this low power accelerometer. adi,ltc2664 - New driver supporting LTC2664 and LTC2672 DACs. microchip,pac1921 - New driver for this power/current monitor chip. rohm,bh1745 - New driver for this RGBC colour sensor. rohm,bu27034anuc - The original bu27034 was canceled before mass production, so the driver is modified to support the BU27034ANUC which had some significant differences. DT compatible changed to avoid chance of old driver ever binding to real hardware. sciosense,ens210 - New driver for ens210, ens210a, ens211, ens212, ens213a, and ens215 temperature and humidity sensors (all register compatible up to some conversion time differences) sensiron,sdp500 - New driver for this differential pressure sensor. tyhx,hx9023s - New driver to support this capacitive proximity sensor. Minor changes to support new devices ************************************ adi,adf4377 - Add support for the single output adf4378. kionix,kxcjk-1013 - Add support for KX022-1020 accelerometer (binding and ID table only) liteon,ltrf216a - Add support for ltr-308. A few minor differences in features set rockchip,saradc - Add ID for rk3576-saradc sensortek,stk3310 - Add ID for stk3013 proximity sensor which (despite documentation) has an ambient light sensor and is compatible with existing parts. Documentation updates --------------------- Generalize ABI docs for shunt resistor attribute Improve calibscale and calibbias related documentation. A couple of follow up patches to resolve duplicate documentation that resulted. New core features ----------------- backend - Add option for debugfs - useful for test pattern control - Use this for both adi-axi-adc and adi-axi-dac trigger suspend - Add functions to allow triggers to be suspended. This avoids problems when a device enters suspend to idle with a sysfs trigger. Use it for now in the bmi323 only. New driver features ------------------- adi,ad7192 - Add option to be a clock provider (+ additional clock config options) adi,ad7380 - Add documentation for this fairly new driver. adi,ad9461 - Provide control of test modes and backend validation blocks used to identify problems (via debugfs) adi,ad9739 - Add backend debugfs and docs for what is provided via adi-axi-dac avago,apds9960 - Add proximity and gesture calibration offset control bosch,bmp280 - Triggered buffer support including adding raw+scale output for sysfs. liteon,ltr390 - Add configuration of integration time and scale. stm,dfsdm - Convert this SD modulator driver to backend framework and add support for channel scaling + modern channel bindings. Treewide cleanup ---------------- iio_dev->masklength: Making it private. - Provide access function to read the core compute channel mask length and a macro to iterate over elements in the active_scan_mask. - Enables marking masklength __private preventing drivers from writing it without triggering a build warning whilst minimizing overhead in what are typically hot paths. - Convert all drivers and finally mark it private. Merge conflicts resolved in drivers applied after this point. Constify regmap_bus - These are never modified, so mark them const. Core cleanup ------------ backend - A few late breaking bits of feedback (unused variable, error messages) dma-buffer - Namespace exports. core - Drop unused assignment. Driver cleanup -------------- adi,ad4695 - Fixing binding to reflect that common-mode-channel is a scalar. adi,ad7280a - Use __free(kfree) to simplify freeing of receive buffer. adi,ad7606 - Various dt-binding cleanup and improvements. - Fix oversampling related gpio handling. - Make polarity of standby gpio match documentation. - use guard() to simplify lock handling. adi,ad7768 - Use device_for_each_child_node_scoped() instead of fwnode equivalent. adi,ad7124 - Reduce SPI transfers by avoiding separate writes to different fields in the same register. - Start the ADC in idle mode. adi,adis - Drop ifdefs in favor of IS_ENABLED. adi,admv8818 - Fix wrong ABI docs. asahi-kasei,ak8975 - Drop a prefix free compatible accidentally added recently. aspeed,adc - Use of_property_present() instead of of_find_property() to see if the property is there or not. atmel,at91, - Use __free(kfree) to simplify freeing of channel related array. bosch,bma400 - Use __free(kfree) to simplify freeing a locally allocated string. bosch,bmc150 - Add missing mount-matrix binding docs. bosch,bme680 - Fix read/write to ensure multiple necessary sequential reads without device configuration change. - Drop unnecessary type casts and use more appropriate data types. - Drop some left over ACPI code as ACPI support was removed due to invalid IDs (and no known users). - Sort headers consistently. - Avoid unnecessary duplicate read and redundant read of gas config. - Use bulk reads to get calibration data. - Reorder allocation of IIO device to be prior to device init. - Add remaining read/write buffers to the union used already for all others. - Tidy up error checks for consistency of style, including dev_err_probe() - Bring the device startup procedure inline with the vendor code. - Reorder code so mode forcing is more obvious occurring where needed. - Tidy up data locality in reading functions so no magic data is stored in state structures just to get it across function calls. - Make a local lookup table static to avoid placing it on the stack. bosch,bmp280 - Fix BME280 regmap to not include registers it doesn't have. - Wait a little longer after config to allow for maximum possible necessary wait. - Reorganize headers. - Make conversion_time_max array static to avoid placing it on the stack. maxim,max1363 - Use __free(kfree) to simplify freeing transmission buffer. microchip,mcp3964 - Use devm_regulator_get_enable_read_voltage() microchip,mcp3911 - Use devm_regulator_get_enable_read_voltage() microchip,mcp4728 - Use devm_regulator_get_enable_read_voltage() microchip,mcp4922 - Use devm_regulator_get_enable_read_voltage() and devm_* to allow dropping of explicit remove() callback. onnn,noa1305 - Various tidy up. - Provide available scale values. - Make integration time configurable. - Fix up integration time look up (/2 error) ti,dac7311 - Check if spi_setup() succeeded. ti,tsc2046 - Use __free(kfree) to simplify freeing rx and tx buffers. - Use devm_regulator_get_enable_read_voltage() Various minor fixes not called out explicitly. * tag 'iio-for-6.12a' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/jic23/iio: (250 commits) drivers:iio:Fix the NULL vs IS_ERR() bug for debugfs_create_dir() iio: sgp40: retain documentation in driver iio: ABI: remove duplicate in_resistance_calibbias dt-bindings: iio: st,stm32-adc: add top-level constraints iio: ABI: add missing calibbias attributes iio: ABI: add missing calibscale attributes iio: ABI: sort calibscale attributes iio: ABI: document calibscale_available attributes iio: light: ltr390: Calculate 'counts_per_uvi' dynamically iio: light: ltr390: Add ALS channel and support for gain and resolution doc: iio: ad4695: document buffered read iio: adc: ad4695: implement triggered buffer iio: proximity: hx9023s: Fix error code in hx9023s_property_get() iio: light: noa1305: Fix up integration time look up iio: humidity: Add support for ENS210 dt-bindings: iio: humidity: add ENS210 sensor family iio: imu: adis16460: drop ifdef around CONFIG_DEBUG_FS iio: imu: adis16400: drop ifdef around CONFIG_DEBUG_FS iio: imu: adis16480: drop ifdef around CONFIG_DEBUG_FS iio: imu: adis16475: drop ifdef around CONFIG_DEBUG_FS ...
2024-09-01mm: remove PG_errorMatthew Wilcox (Oracle)1-1/+1
The PG_error bit is now unused; delete it and free up a bit in page->flags. Link: https://lkml.kernel.org/r/20240807193528.1865100-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-08-30autofs: add per dentry expire timeoutIan Kent1-1/+1
Add ability to set per-dentry mount expire timeout to autofs. There are two fairly well known automounter map formats, the autofs format and the amd format (more or less System V and Berkley). Some time ago Linux autofs added an amd map format parser that implemented a fair amount of the amd functionality. This was done within the autofs infrastructure and some functionality wasn't implemented because it either didn't make sense or required extra kernel changes. The idea was to restrict changes to be within the existing autofs functionality as much as possible and leave changes with a wider scope to be considered later. One of these changes is implementing the amd options: 1) "unmount", expire this mount according to a timeout (same as the current autofs default). 2) "nounmount", don't expire this mount (same as setting the autofs timeout to 0 except only for this specific mount) . 3) "utimeout=<seconds>", expire this mount using the specified timeout (again same as setting the autofs timeout but only for this mount). To implement these options per-dentry expire timeouts need to be implemented for autofs indirect mounts. This is because all map keys (mounts) for autofs indirect mounts use an expire timeout stored in the autofs mount super block info. structure and all indirect mounts use the same expire timeout. Now I have a request to add the "nounmount" option so I need to add the per-dentry expire handling to the kernel implementation to do this. The implementation uses the trailing path component to identify the mount (and is also used as the autofs map key) which is passed in the autofs_dev_ioctl structure path field. The expire timeout is passed in autofs_dev_ioctl timeout field (well, of the timeout union). If the passed in timeout is equal to -1 the per-dentry timeout and flag are cleared providing for the "unmount" option. If the timeout is greater than or equal to 0 the timeout is set to the value and the flag is also set. If the dentry timeout is 0 the dentry will not expire by timeout which enables the implementation of the "nounmount" option for the specific mount. When the dentry timeout is greater than zero it allows for the implementation of the "utimeout=<seconds>" option. Signed-off-by: Ian Kent <raven@themaw.net> Link: https://lore.kernel.org/r/20240814090231.963520-1-raven@themaw.net Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-29io_uring/kbuf: add support for incremental buffer consumptionJens Axboe1-0/+18
By default, any recv/read operation that uses provided buffers will consume at least 1 buffer fully (and maybe more, in case of bundles). This adds support for incremental consumption, meaning that an application may add large buffers, and each read/recv will just consume the part of the buffer that it needs. For example, let's say an application registers 1MB buffers in a provided buffer ring, for streaming receives. If it gets a short recv, then the full 1MB buffer will be consumed and passed back to the application. With incremental consumption, only the part that was actually used is consumed, and the buffer remains the current one. This means that both the application and the kernel needs to keep track of what the current receive point is. Each recv will still pass back a buffer ID and the size consumed, the only difference is that before the next receive would always be the next buffer in the ring. Now the same buffer ID may return multiple receives, each at an offset into that buffer from where the previous receive left off. Example: Application registers a provided buffer ring, and adds two 32K buffers to the ring. Buffer1 address: 0x1000000 (buffer ID 0) Buffer2 address: 0x2000000 (buffer ID 1) A recv completion is received with the following values: cqe->res 0x1000 (4k bytes received) cqe->flags 0x11 (CQE_F_BUFFER|CQE_F_BUF_MORE set, buffer ID 0) and the application now knows that 4096b of data is available at 0x1000000, the start of that buffer, and that more data from this buffer will be coming. Now the next receive comes in: cqe->res 0x2010 (8k bytes received) cqe->flags 0x11 (CQE_F_BUFFER|CQE_F_BUF_MORE set, buffer ID 0) which tells the application that 8k is available where the last completion left off, at 0x1001000. Next completion is: cqe->res 0x5000 (20k bytes received) cqe->flags 0x1 (CQE_F_BUFFER set, buffer ID 0) and the application now knows that 20k of data is available at 0x1003000, which is where the previous receive ended. CQE_F_BUF_MORE isn't set, as no more data is available in this buffer ID. The next completion is then: cqe->res 0x1000 (4k bytes received) cqe->flags 0x10001 (CQE_F_BUFFER|CQE_F_BUF_MORE set, buffer ID 1) which tells the application that buffer ID 1 is now the current one, hence there's 4k of valid data at 0x2000000. 0x2001000 will be the next receive point for this buffer ID. When a buffer will be reused by future CQE completions, IORING_CQE_BUF_MORE will be set in cqe->flags. This tells the application that the kernel isn't done with the buffer yet, and that it should expect more completions for this buffer ID. Will only be set by provided buffer rings setup with IOU_PBUF_RING INC, as that's the only type of buffer that will see multiple consecutive completions for the same buffer ID. For any other provided buffer type, any completion that passes back a buffer to the application is final. Once a buffer has been fully consumed, the buffer ring head is incremented and the next receive will indicate the next buffer ID in the CQE cflags. On the send side, the application can manage how much data is sent from an existing buffer by setting sqe->len to the desired send length. An application can request incremental consumption by setting IOU_PBUF_RING_INC in the provided buffer ring registration. Outside of that, any provided buffer ring setup and buffer additions is done like before, no changes there. The only change is in how an application may see multiple completions for the same buffer ID, hence needing to know where the next receive will happen. Note that like existing provided buffer rings, this should not be used with IOSQE_ASYNC, as both really require the ring to remain locked over the duration of the buffer selection and the operation completion. It will consume a buffer otherwise regardless of the size of the IO done. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-29HID: hidraw: add HIDIOCREVOKE ioctlPeter Hutterer1-0/+1
There is a need for userspace applications to open HID devices directly. Use-cases include configuration of gaming mice or direct access to joystick devices. The latter is currently handled by the uaccess tag in systemd, other devices include more custom/local configurations or just sudo. A better approac