summaryrefslogtreecommitdiff
path: root/fs/tracefs/inode.c
AgeCommit message (Collapse)AuthorFilesLines
2024-11-01tracing: Fix tracefs mount optionsKalesh Singh1-3/+9
Commit 78ff64081949 ("vfs: Convert tracefs to use the new mount API") converted tracefs to use the new mount APIs caused mount options (e.g. gid=<gid>) to not take effect. The tracefs superblock can be updated from multiple paths: - on fs_initcall() to init_trace_printk_function_export() - from a work queue to initialize eventfs tracer_init_tracefs_work_func() - fsconfig() syscall to mount or remount of tracefs The tracefs superblock root inode gets created early on in init_trace_printk_function_export(). With the new mount API, tracefs effectively uses get_tree_single() instead of the old API mount_single(). Previously, mount_single() ensured that the options are always applied to the superblock root inode: (1) If the root inode didn't exist, call fill_super() to create it and apply the options. (2) If the root inode exists, call reconfigure_single() which effectively calls tracefs_apply_options() to parse and apply options to the subperblock's fs_info and inode and remount eventfs (if necessary) On the other hand, get_tree_single() effectively calls vfs_get_super() which: (3) If the root inode doesn't exists, calls fill_super() to create it and apply the options. (4) If the root inode already exists, updates the fs_context root with the superblock's root inode. (4) above is always the case for tracefs mounts, since the super block's root inode will already be created by init_trace_printk_function_export(). This means that the mount options get ignored: - Since it isn't applied to the superblock's root inode, it doesn't get inherited by the children. - Since eventfs is initialized from a separate work queue and before call to mount with the options, and it doesn't get remounted for mount. Ensure that the mount options are applied to the super block and eventfs is remounted to respect the mount options. To understand this better, if fstab has the following: tracefs /sys/kernel/tracing tracefs nosuid,nodev,noexec,gid=tracing 0 0 On boot up, permissions look like: # ls -l /sys/kernel/tracing/trace -rw-r----- 1 root root 0 Nov 1 08:37 /sys/kernel/tracing/trace When it should look like: # ls -l /sys/kernel/tracing/trace -rw-r----- 1 root tracing 0 Nov 1 08:37 /sys/kernel/tracing/trace Link: https://lore.kernel.org/r/536e99d3-345c-448b-adee-a21389d7ab4b@redhat.com/ Cc: Eric Sandeen <sandeen@redhat.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Ali Zahraee <ahzahraee@gmail.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Howells <dhowells@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: stable@vger.kernel.org Fixes: 78ff64081949 ("vfs: Convert tracefs to use the new mount API") Link: https://lore.kernel.org/20241030171928.4168869-2-kaleshsingh@google.com Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-08-07tracefs: Use generic inode RCU for synchronizing freeingSteven Rostedt1-6/+4
With structure layout randomization enabled for 'struct inode' we need to avoid overlapping any of the RCU-used / initialized-only-once members, e.g. i_lru or i_sb_list to not corrupt related list traversals when making use of the rcu_head. For an unlucky structure layout of 'struct inode' we may end up with the following splat when running the ftrace selftests: [<...>] list_del corruption, ffff888103ee2cb0->next (tracefs_inode_cache+0x0/0x4e0 [slab object]) is NULL (prev is tracefs_inode_cache+0x78/0x4e0 [slab object]) [<...>] ------------[ cut here ]------------ [<...>] kernel BUG at lib/list_debug.c:54! [<...>] invalid opcode: 0000 [#1] PREEMPT SMP KASAN [<...>] CPU: 3 PID: 2550 Comm: mount Tainted: G N 6.8.12-grsec+ #122 ed2f536ca62f28b087b90e3cc906a8d25b3ddc65 [<...>] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 [<...>] RIP: 0010:[<ffffffff84656018>] __list_del_entry_valid_or_report+0x138/0x3e0 [<...>] Code: 48 b8 99 fb 65 f2 ff ff ff ff e9 03 5c d9 fc cc 48 b8 99 fb 65 f2 ff ff ff ff e9 33 5a d9 fc cc 48 b8 99 fb 65 f2 ff ff ff ff <0f> 0b 4c 89 e9 48 89 ea 48 89 ee 48 c7 c7 60 8f dd 89 31 c0 e8 2f [<...>] RSP: 0018:fffffe80416afaf0 EFLAGS: 00010283 [<...>] RAX: 0000000000000098 RBX: ffff888103ee2cb0 RCX: 0000000000000000 [<...>] RDX: ffffffff84655fe8 RSI: ffffffff89dd8b60 RDI: 0000000000000001 [<...>] RBP: ffff888103ee2cb0 R08: 0000000000000001 R09: fffffbd0082d5f25 [<...>] R10: fffffe80416af92f R11: 0000000000000001 R12: fdf99c16731d9b6d [<...>] R13: 0000000000000000 R14: ffff88819ad4b8b8 R15: 0000000000000000 [<...>] RBX: tracefs_inode_cache+0x0/0x4e0 [slab object] [<...>] RDX: __list_del_entry_valid_or_report+0x108/0x3e0 [<...>] RSI: __func__.47+0x4340/0x4400 [<...>] RBP: tracefs_inode_cache+0x0/0x4e0 [slab object] [<...>] RSP: process kstack fffffe80416afaf0+0x7af0/0x8000 [mount 2550 2550] [<...>] R09: kasan shadow of process kstack fffffe80416af928+0x7928/0x8000 [mount 2550 2550] [<...>] R10: process kstack fffffe80416af92f+0x792f/0x8000 [mount 2550 2550] [<...>] R14: tracefs_inode_cache+0x78/0x4e0 [slab object] [<...>] FS: 00006dcb380c1840(0000) GS:ffff8881e0600000(0000) knlGS:0000000000000000 [<...>] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [<...>] CR2: 000076ab72b30e84 CR3: 000000000b088004 CR4: 0000000000360ef0 shadow CR4: 0000000000360ef0 [<...>] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [<...>] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [<...>] ASID: 0003 [<...>] Stack: [<...>] ffffffff818a2315 00000000f5c856ee ffffffff896f1840 ffff888103ee2cb0 [<...>] ffff88812b6b9750 0000000079d714b6 fffffbfff1e9280b ffffffff8f49405f [<...>] 0000000000000001 0000000000000000 ffff888104457280 ffffffff8248b392 [<...>] Call Trace: [<...>] <TASK> [<...>] [<ffffffff818a2315>] ? lock_release+0x175/0x380 fffffe80416afaf0 [<...>] [<ffffffff8248b392>] list_lru_del+0x152/0x740 fffffe80416afb48 [<...>] [<ffffffff8248ba93>] list_lru_del_obj+0x113/0x280 fffffe80416afb88 [<...>] [<ffffffff8940fd19>] ? _atomic_dec_and_lock+0x119/0x200 fffffe80416afb90 [<...>] [<ffffffff8295b244>] iput_final+0x1c4/0x9a0 fffffe80416afbb8 [<...>] [<ffffffff8293a52b>] dentry_unlink_inode+0x44b/0xaa0 fffffe80416afbf8 [<...>] [<ffffffff8293fefc>] __dentry_kill+0x23c/0xf00 fffffe80416afc40 [<...>] [<ffffffff8953a85f>] ? __this_cpu_preempt_check+0x1f/0xa0 fffffe80416afc48 [<...>] [<ffffffff82949ce5>] ? shrink_dentry_list+0x1c5/0x760 fffffe80416afc70 [<...>] [<ffffffff82949b71>] ? shrink_dentry_list+0x51/0x760 fffffe80416afc78 [<...>] [<ffffffff82949da8>] shrink_dentry_list+0x288/0x760 fffffe80416afc80 [<...>] [<ffffffff8294ae75>] shrink_dcache_sb+0x155/0x420 fffffe80416afcc8 [<...>] [<ffffffff8953a7c3>] ? debug_smp_processor_id+0x23/0xa0 fffffe80416afce0 [<...>] [<ffffffff8294ad20>] ? do_one_tree+0x140/0x140 fffffe80416afcf8 [<...>] [<ffffffff82997349>] ? do_remount+0x329/0xa00 fffffe80416afd18 [<...>] [<ffffffff83ebf7a1>] ? security_sb_remount+0x81/0x1c0 fffffe80416afd38 [<...>] [<ffffffff82892096>] reconfigure_super+0x856/0x14e0 fffffe80416afd70 [<...>] [<ffffffff815d1327>] ? ns_capable_common+0xe7/0x2a0 fffffe80416afd90 [<...>] [<ffffffff82997436>] do_remount+0x416/0xa00 fffffe80416afdd0 [<...>] [<ffffffff829b2ba4>] path_mount+0x5c4/0x900 fffffe80416afe28 [<...>] [<ffffffff829b25e0>] ? finish_automount+0x13a0/0x13a0 fffffe80416afe60 [<...>] [<ffffffff82903812>] ? user_path_at_empty+0xb2/0x140 fffffe80416afe88 [<...>] [<ffffffff829b2ff5>] do_mount+0x115/0x1c0 fffffe80416afeb8 [<...>] [<ffffffff829b2ee0>] ? path_mount+0x900/0x900 fffffe80416afed8 [<...>] [<ffffffff8272461c>] ? __kasan_check_write+0x1c/0xa0 fffffe80416afee0 [<...>] [<ffffffff829b31cf>] __do_sys_mount+0x12f/0x280 fffffe80416aff30 [<...>] [<ffffffff829b36cd>] __x64_sys_mount+0xcd/0x2e0 fffffe80416aff70 [<...>] [<ffffffff819f8818>] ? syscall_trace_enter+0x218/0x380 fffffe80416aff88 [<...>] [<ffffffff8111655e>] x64_sys_call+0x5d5e/0x6720 fffffe80416affa8 [<...>] [<ffffffff8952756d>] do_syscall_64+0xcd/0x3c0 fffffe80416affb8 [<...>] [<ffffffff8100119b>] entry_SYSCALL_64_safe_stack+0x4c/0x87 fffffe80416affe8 [<...>] </TASK> [<...>] <PTREGS> [<...>] RIP: 0033:[<00006dcb382ff66a>] vm_area_struct[mount 2550 2550 file 6dcb38225000-6dcb3837e000 22 55(read|exec|mayread|mayexec)]+0x0/0xb8 [userland map] [<...>] Code: 48 8b 0d 29 18 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d f6 17 0d 00 f7 d8 64 89 01 48 [<...>] RSP: 002b:0000763d68192558 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5 [<...>] RAX: ffffffffffffffda RBX: 00006dcb38433264 RCX: 00006dcb382ff66a [<...>] RDX: 000017c3e0d11210 RSI: 000017c3e0d1a5a0 RDI: 000017c3e0d1ae70 [<...>] RBP: 000017c3e0d10fb0 R08: 000017c3e0d11260 R09: 00006dcb383d1be0 [<...>] R10: 000000000020002e R11: 0000000000000246 R12: 0000000000000000 [<...>] R13: 000017c3e0d1ae70 R14: 000017c3e0d11210 R15: 000017c3e0d10fb0 [<...>] RBX: vm_area_struct[mount 2550 2550 file 6dcb38433000-6dcb38434000 5b 100033(read|write|mayread|maywrite|account)]+0x0/0xb8 [userland map] [<...>] RCX: vm_area_struct[mount 2550 2550 file 6dcb38225000-6dcb3837e000 22 55(read|exec|mayread|mayexec)]+0x0/0xb8 [userland map] [<...>] RDX: vm_area_struct[mount 2550 2550 anon 17c3e0d0f000-17c3e0d31000 17c3e0d0f 100033(read|write|mayread|maywrite|account)]+0x0/0xb8 [userland map] [<...>] RSI: vm_area_struct[mount 2550 2550 anon 17c3e0d0f000-17c3e0d31000 17c3e0d0f 100033(read|write|mayread|maywrite|account)]+0x0/0xb8 [userland map] [<...>] RDI: vm_area_struct[mount 2550 2550 anon 17c3e0d0f000-17c3e0d31000 17c3e0d0f 100033(read|write|mayread|maywrite|account)]+0x0/0xb8 [userland map] [<...>] RBP: vm_area_struct[mount 2550 2550 anon 17c3e0d0f000-17c3e0d31000 17c3e0d0f 100033(read|write|mayread|maywrite|account)]+0x0/0xb8 [userland map] [<...>] RSP: vm_area_struct[mount 2550 2550 anon 763d68173000-763d68195000 7ffffffdd 100133(read|write|mayread|maywrite|growsdown|account)]+0x0/0xb8 [userland map] [<...>] R08: vm_area_struct[mount 2550 2550 anon 17c3e0d0f000-17c3e0d31000 17c3e0d0f 100033(read|write|mayread|maywrite|account)]+0x0/0xb8 [userland map] [<...>] R09: vm_area_struct[mount 2550 2550 file 6dcb383d1000-6dcb383d3000 1cd 100033(read|write|mayread|maywrite|account)]+0x0/0xb8 [userland map] [<...>] R13: vm_area_struct[mount 2550 2550 anon 17c3e0d0f000-17c3e0d31000 17c3e0d0f 100033(read|write|mayread|maywrite|account)]+0x0/0xb8 [userland map] [<...>] R14: vm_area_struct[mount 2550 2550 anon 17c3e0d0f000-17c3e0d31000 17c3e0d0f 100033(read|write|mayread|maywrite|account)]+0x0/0xb8 [userland map] [<...>] R15: vm_area_struct[mount 2550 2550 anon 17c3e0d0f000-17c3e0d31000 17c3e0d0f 100033(read|write|mayread|maywrite|account)]+0x0/0xb8 [userland map] [<...>] </PTREGS> [<...>] Modules linked in: [<...>] ---[ end trace 0000000000000000 ]--- The list debug message as well as RBX's symbolic value point out that the object in question was allocated from 'tracefs_inode_cache' and that the list's '->next' member is at offset 0. Dumping the layout of the relevant parts of 'struct tracefs_inode' gives the following: struct tracefs_inode { union { struct inode { struct list_head { struct list_head * next; /* 0 8 */ struct list_head * prev; /* 8 8 */ } i_lru; [...] } vfs_inode; struct callback_head { void (*func)(struct callback_head *); /* 0 8 */ struct callback_head * next; /* 8 8 */ } rcu; }; [...] }; Above shows that 'vfs_inode.i_lru' overlaps with 'rcu' which will destroy the 'i_lru' list as soon as the 'rcu' member gets used, e.g. in call_rcu() or later when calling the RCU callback. This will disturb concurrent list traversals as well as object reuse which assumes these list heads will keep their integrity. For reproduction, the following diff manually overlays 'i_lru' with 'rcu' as, otherwise, one would require some good portion of luck for gambling an unlucky RANDSTRUCT seed: --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -629,6 +629,7 @@ struct inode { umode_t i_mode; unsigned short i_opflags; kuid_t i_uid; + struct list_head i_lru; /* inode LRU list */ kgid_t i_gid; unsigned int i_flags; @@ -690,7 +691,6 @@ struct inode { u16 i_wb_frn_avg_time; u16 i_wb_frn_history; #endif - struct list_head i_lru; /* inode LRU list */ struct list_head i_sb_list; struct list_head i_wb_list; /* backing dev writeback list */ union { The tracefs inode does not need to supply its own RCU delayed destruction of its inode. The inode code itself offers both a "destroy_inode()" callback that gets called when the last reference of the inode is released, and the "free_inode()" which is called after a RCU synchronization period from the "destroy_inode()". The tracefs code can unlink the inode from its list in the destroy_inode() callback, and the simply free it from the free_inode() callback. This should provide the same protection. Link: https://lore.kernel.org/all/20240807115143.45927-3-minipli@grsecurity.net/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Ajay Kaher <ajay.kaher@broadcom.com> Cc: Ilkka =?utf-8?b?TmF1bGFww6TDpA==?= <digirigawa@gmail.com> Link: https://lore.kernel.org/20240807185402.61410544@gandalf.local.home Fixes: baa23a8d4360 ("tracefs: Reset permissions on remount if permissions are options") Reported-by: Mathias Krause <minipli@grsecurity.net> Reported-by: Brad Spengler <spender@grsecurity.net> Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-08-07tracefs: Fix inode allocationMathias Krause1-1/+1
The leading comment above alloc_inode_sb() is pretty explicit about it: /* * This must be used for allocating filesystems specific inodes to set * up the inode reclaim context correctly. */ Switch tracefs over to alloc_inode_sb() to make sure inodes are properly linked. Cc: Ajay Kaher <ajay.kaher@broadcom.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/20240807115143.45927-2-minipli@grsecurity.net Fixes: ba37ff75e04b ("eventfs: Implement tracefs_inode_cache") Signed-off-by: Mathias Krause <minipli@grsecurity.net> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-07-02tracefs: Convert to new uid/gid option parsing helpersEric Sandeen1-12/+4
Convert to new uid/gid option parsing helpers Signed-off-by: Eric Sandeen <sandeen@redhat.com> Link: https://lore.kernel.org/r/6c9b0b16-e61b-4dfc-852d-e2eb5bb11b82@redhat.com Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-05-23tracefs: Clear EVENT_INODE flag in tracefs_drop_inode()Steven Rostedt (Google)1-16/+17
When the inode is being dropped from the dentry, the TRACEFS_EVENT_INODE flag needs to be cleared to prevent a remount from calling eventfs_remount() on the tracefs_inode private data. There's a race between the inode is dropped (and the dentry freed) to where the inode is actually freed. If a remount happens between the two, the eventfs_inode could be accessed after it is freed (only the dentry keeps a ref count on it). Currently the TRACEFS_EVENT_INODE flag is cleared from the dentry iput() function. But this is incorrect, as it is possible that the inode has another reference to it. The flag should only be cleared when the inode is really being dropped and has no more references. That happens in the drop_inode callback of the inode, as that gets called when the last reference of the inode is released. Remove the tracefs_d_iput() function and move its logic to the more appropriate tracefs_drop_inode() callback function. Link: https://lore.kernel.org/linux-trace-kernel/20240523051539.908205106@goodmis.org Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Fixes: baa23a8d4360d ("tracefs: Reset permissions on remount if permissions are options") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-23tracefs: Update inode permissions on remountSteven Rostedt (Google)1-3/+12
When a remount happens, if a gid or uid is specified update the inodes to have the same gid and uid. This will allow the simplification of the permissions logic for the dynamically created files and directories. Link: https://lore.kernel.org/linux-trace-kernel/20240523051539.592429986@goodmis.org Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Fixes: baa23a8d4360d ("tracefs: Reset permissions on remount if permissions are options") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-13Merge tag 'vfs-6.10.mount' of ↵Linus Torvalds1-108/+94
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs mount API conversions from Christian Brauner: "This converts qnx6, minix, debugfs, tracefs, freevxfs, and openpromfs to the new mount api, further reducing the number of filesystems relying on the legacy mount api" * tag 'vfs-6.10.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: minix: convert minix to use the new mount api vfs: Convert tracefs to use the new mount API vfs: Convert debugfs to use the new mount API openpromfs: finish conversion to the new mount API freevxfs: Convert freevxfs to the new mount API. qnx6: convert qnx6 to use the new mount api
2024-05-04tracefs: Still use mount point as default permissions for instancesSteven Rostedt (Google)1-2/+25
If the instances directory's permissions were never change, then have it and its children use the mount point permissions as the default. Currently, the permissions of instance directories are determined by the instance directory's permissions itself. But if the tracefs file system is remounted and changes the permissions, the instance directory and its children should use the new permission. But because both the instance directory and its children use the instance directory's inode for permissions, it misses the update. To demonstrate this: # cd /sys/kernel/tracing/ # mkdir instances/foo # ls -ld instances/foo drwxr-x--- 5 root root 0 May 1 19:07 instances/foo # ls -ld instances drwxr-x--- 3 root root 0 May 1 18:57 instances # ls -ld current_tracer -rw-r----- 1 root root 0 May 1 18:57 current_tracer # mount -o remount,gid=1002 . # ls -ld instances drwxr-x--- 3 root root 0 May 1 18:57 instances # ls -ld instances/foo/ drwxr-x--- 5 root root 0 May 1 19:07 instances/foo/ # ls -ld current_tracer -rw-r----- 1 root lkp 0 May 1 18:57 current_tracer Notice that changing the group id to that of "lkp" did not affect the instances directory nor its children. It should have been: # ls -ld current_tracer -rw-r----- 1 root root 0 May 1 19:19 current_tracer # ls -ld instances/foo/ drwxr-x--- 5 root root 0 May 1 19:25 instances/foo/ # ls -ld instances drwxr-x--- 3 root root 0 May 1 19:19 instances # mount -o remount,gid=1002 . # ls -ld current_tracer -rw-r----- 1 root lkp 0 May 1 19:19 current_tracer # ls -ld instances drwxr-x--- 3 root lkp 0 May 1 19:19 instances # ls -ld instances/foo/ drwxr-x--- 5 root lkp 0 May 1 19:25 instances/foo/ Where all files were updated by the remount gid update. Link: https://lore.kernel.org/linux-trace-kernel/20240502200905.686838327@goodmis.org Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default ownership") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-04tracefs: Reset permissions on remount if permissions are optionsSteven Rostedt (Google)1-1/+64
There's an inconsistency with the way permissions are handled in tracefs. Because the permissions are generated when accessed, they default to the root inode's permission if they were never set by the user. If the user sets the permissions, then a flag is set and the permissions are saved via the inode (for tracefs files) or an internal attribute field (for eventfs). But if a remount happens that specify the permissions, all the files that were not changed by the user gets updated, but the ones that were are not. If the user were to remount the file system with a given permission, then all files and directories within that file system should be updated. This can cause security issues if a file's permission was updated but the admin forgot about it. They could incorrectly think that remounting with permissions set would update all files, but miss some. For example: # cd /sys/kernel/tracing # chgrp 1002 current_tracer # ls -l [..] -rw-r----- 1 root root 0 May 1 21:25 buffer_size_kb -rw-r----- 1 root root 0 May 1 21:25 buffer_subbuf_size_kb -r--r----- 1 root root 0 May 1 21:25 buffer_total_size_kb -rw-r----- 1 root lkp 0 May 1 21:25 current_tracer -rw-r----- 1 root root 0 May 1 21:25 dynamic_events -r--r----- 1 root root 0 May 1 21:25 dyn_ftrace_total_info -r--r----- 1 root root 0 May 1 21:25 enabled_functions Where current_tracer now has group "lkp". # mount -o remount,gid=1001 . # ls -l -rw-r----- 1 root tracing 0 May 1 21:25 buffer_size_kb -rw-r----- 1 root tracing 0 May 1 21:25 buffer_subbuf_size_kb -r--r----- 1 root tracing 0 May 1 21:25 buffer_total_size_kb -rw-r----- 1 root lkp 0 May 1 21:25 current_tracer -rw-r----- 1 root tracing 0 May 1 21:25 dynamic_events -r--r----- 1 root tracing 0 May 1 21:25 dyn_ftrace_total_info -r--r----- 1 root tracing 0 May 1 21:25 enabled_functions Everything changed but the "current_tracer". Add a new link list that keeps track of all the tracefs_inodes which has the permission flags that tell if the file/dir should use the root inode's permission or not. Then on remount, clear all the flags so that the default behavior of using the root inode's permission is done for all files and directories. Link: https://lore.kernel.org/linux-trace-kernel/20240502200905.529542160@goodmis.org Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default ownership") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-26vfs: Convert tracefs to use the new mount APIDavid Howells1-105/+91
Convert the tracefs filesystem to the new internal mount API as the old one will be obsoleted and removed. This allows greater flexibility in communication of mount parameters between userspace, the VFS and the filesystem. See Documentation/filesystems/mount_api.txt for more information. Signed-off-by: David Howells <dhowells@redhat.com> Co-developed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> [sandeen: forward port to modern kernel, fix remounting] Link: https://lore.kernel.org/r/536e99d3-345c-448b-adee-a21389d7ab4b@redhat.com cc: Steven Rostedt <rostedt@goodmis.org> cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-12mm, slab: remove last vestiges of SLAB_MEM_SPREADLinus Torvalds1-1/+0
Yes, yes, I know the slab people were planning on going slow and letting every subsystem fight this thing on their own. But let's just rip off the band-aid and get it over and done with. I don't want to see a number of unnecessary pull requests just to get rid of a flag that no longer has any meaning. This was mainly done with a couple of 'sed' scripts and then some manual cleanup of the end result. Link: https://lore.kernel.org/all/CAHk-=wji0u+OOtmAOD-5JV3SXcRJF___k_+8XNKmak0yd5vW1Q@mail.gmail.com/ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-02-01eventfs: Clean up dentry ops and add revalidate functionLinus Torvalds1-9/+18
In order for the dentries to stay up-to-date with the eventfs changes, just add a 'd_revalidate' function that checks the 'is_freed' bit. Also, clean up the dentry release to actually use d_release() rather than the slightly odd d_iput() function. We don't care about the inode, all we want to do is to get rid of the refcount to the eventfs data added by dentry->d_fsdata. It would probably be cleaner to make eventfs its own filesystem, or at least set its own dentry ops when looking up eventfs files. But as it is, only eventfs dentries use d_fsdata, so we don't really need to split these things up by use. Another thing that might be worth doing is to make all eventfs lookups mark their dentries as not worth caching. We could do that with d_delete(), but the DCACHE_DONTCACHE flag would likely be even better. As it is, the dentries are all freeable, but they only tend to get freed at memory pressure rather than more proactively. But that's a separate issue. Link: https://lore.kernel.org/linux-trace-kernel/202401291043.e62e89dc-oliver.sang@intel.com/ Link: https://lore.kernel.org/linux-trace-kernel/20240131185513.124644253@goodmis.org Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Ajay Kaher <ajay.kaher@broadcom.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Fixes: c1504e510238 ("eventfs: Implement eventfs dir creation functions") Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-02-01tracefs: dentry lookup crapectomyLinus Torvalds1-69/+0
The dentry lookup for eventfs files was very broken, and had lots of signs of the old situation where the filesystem names were all created statically in the dentry tree, rather than being looked up dynamically based on the eventfs data structures. You could see it in the naming - how it claimed to "create" dentries rather than just look up the dentries that were given it. You could see it in various nonsensical and very incorrect operations, like using "simple_lookup()" on the dentries that were passed in, which only results in those dentries becoming negative dentries. Which meant that any other lookup would possibly return ENOENT if it saw that negative dentry before the data was then later filled in. You could see it in the immense amount of nonsensical code that didn't actually just do lookups. Link: https://lore.kernel.org/linux-trace-kernel/202401291043.e62e89dc-oliver.sang@intel.com/ Link: https://lore.kernel.org/linux-trace-kernel/20240131233227.73db55e1@gandalf.local.home Cc: stable@vger.kernel.org Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ajay Kaher <ajay.kaher@broadcom.com> Cc: Mark Rutland <mark.rutland@arm.com> Fixes: c1504e510238 ("eventfs: Implement eventfs dir creation functions") Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-01-31tracefs: Zero out the tracefs_inode when allocating itSteven Rostedt (Google)1-2/+4
eventfs uses the tracefs_inode and assumes that it's already initialized to zero. That is, it doesn't set fields to zero (like ti->private) after getting its tracefs_inode. This causes bugs due to stale values. Just initialize the entire structure to zero on allocation so there isn't any more surprises. This is a partial fix to access to ti->private. The assignment still needs to be made before the dentry is instantiated. Link: https://lore.kernel.org/linux-trace-kernel/20240131185512.315825944@goodmis.org Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Ajay Kaher <ajay.kaher@broadcom.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Fixes: 5790b1fb3d672 ("eventfs: Remove eventfs_file and just use eventfs_inode") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202401291043.e62e89dc-oliver.sang@intel.com Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-01-03tracefs/eventfs: Use root and instance inodes as default ownershipSteven Rostedt (Google)1-88/+110
Instead of walking the dentries on mount/remount to update the gid values of all the dentries if a gid option is specified on mount, just update the root inode. Add .getattr, .setattr, and .permissions on the tracefs inode operations to update the permissions of the files and directories. For all files and directories in the top level instance: /sys/kernel/tracing/* It will use the root inode as the default permissions. The inode that represents: /sys/kernel/tracing (or wherever it is mounted). When an instance is created: mkdir /sys/kernel/tracing/instance/foo The directory "foo" and all its files and directories underneath will use the default of what foo is when it was created. A remount of tracefs will not affect it. If a user were to modify the permissions of any file or directory in tracefs, it will also no longer be modified by a change in ownership of a remount. The events directory, if it is in the top level instance, will use the tracefs root inode as the default ownership for itself and all the files and directories below it. For the events directory in an instance ("foo"), it will keep the ownership of what it was when it was created, and that will be used as the default ownership for the files and directories beneath it. Link: https://lore.kernel.org/linux-trace-kernel/CAHk-=wjVdGkjDXBbvLn2wbZnqP4UsH46E3gqJ9m7UG6DpX2+WA@mail.gmail.com/ Link: https://lore.kernel.org/linux-trace-kernel/20240103215016.1e0c9811@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-01-02tracefs: Check for dentry->d_inode exists in set_gid()Steven Rostedt (Google)1-0/+4
If a getdents() is called on the tracefs directory but does not get all the files, it can leave a "cursor" dentry in the d_subdirs list of tracefs dentry. This cursor dentry does not have a d_inode for it. Before referencing tracefs_inode from the dentry, the d_inode must first be checked if it has content. If not, then it's not a tracefs_inode and can be ignored. The following caused a crash: #define getdents64(fd, dirp, count) syscall(SYS_getdents64, fd, dirp, count) #define BUF_SIZE 256 #define TDIR "/tmp/file0" int main(void) { char buf[BUF_SIZE]; int fd; int n; mkdir(TDIR, 0777); mount(NULL, TDIR, "tracefs", 0, NULL); fd = openat(AT_FDCWD, TDIR, O_RDONLY); n = getdents64(fd, buf, BUF_SIZE); ret = mount(NULL, TDIR, NULL, MS_NOSUID|MS_REMOUNT|MS_RELATIME|MS_LAZYTIME, "gid=1000"); return 0; } That's because the 256 BUF_SIZE was not big enough to read all the dentries of the tracefs file system and it left a "cursor" dentry in the subdirs of the tracefs root inode. Then on remounting with "gid=1000", it would cause an iteration of all dentries which hit: ti = get_tracefs(dentry->d_inode); if (ti && (ti->flags & TRACEFS_EVENT_INODE)) eventfs_update_gid(dentry, gid); Which crashed because of the dereference of the cursor dentry which had a NULL d_inode. In the subdir loop of the dentry lookup of set_gid(), if a child has a NULL d_inode, simply skip it. Link: https://lore.kernel.org/all/20240102135637.3a21fb10@gandalf.local.home/ Link: https://lore.kernel.org/linux-trace-kernel/20240102151249.05da244d@gandalf.local.home Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Fixes: 7e8358edf503e ("eventfs: Fix file and directory uid and gid ownership") Reported-by: "Ubisectech Sirius" <bugreport@ubisectech.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-22eventfs: Fix file and directory uid and gid ownershipSteven Rostedt (Google)1-0/+6
It was reported that when mounting the tracefs file system with a gid other than root, the ownership did not carry down to the eventfs directory due to the dynamic nature of it. A fix was done to solve this, but it had two issues. (a) if the attr passed into update_inode_attr() was NULL, it didn't do anything. This is true for files that have not had a chown or chgrp done to itself or any of its sibling files, as the attr is allocated for all children when any one needs it. # umount /sys/kernel/tracing # mount -o rw,seclabel,relatime,gid=1000 -t tracefs nodev /mnt # ls -ld /mnt/events/sched drwxr-xr-x 28 root rostedt 0 Dec 21 13:12 /mnt/events/sched/ # ls -ld /mnt/events/sched/sched_switch drwxr-xr-x 2 root rostedt 0 Dec 21 13:12 /mnt/events/sched/sched_switch/ But when checking the files: # ls -l /mnt/events/sched/sched_switch total 0 -rw-r----- 1 root root 0 Dec 21 13:12 enable -rw-r----- 1 root root 0 Dec 21 13:12 filter -r--r----- 1 root root 0 Dec 21 13:12 format -r--r----- 1 root root 0 Dec 21 13:12 hist -r--r----- 1 root root 0 Dec 21 13:12 id -rw-r----- 1 root root 0 Dec 21 13:12 trigger (b) When the attr does not denote the UID or GID, it defaulted to using the parent uid or gid. This is incorrect as changing the parent uid or gid will automatically change all its children. # chgrp tracing /mnt/events/timer # ls -ld /mnt/events/timer drwxr-xr-x 2 root tracing 0 Dec 21 14:34 /mnt/events/timer # ls -l /mnt/events/timer total 0 -rw-r----- 1 root root 0 Dec 21 14:35 enable -rw-r----- 1 root root 0 Dec 21 14:35 filter drwxr-xr-x 2 root tracing 0 Dec 21 14:35 hrtimer_cancel drwxr-xr-x 2 root tracing 0 Dec 21 14:35 hrtimer_expire_entry drwxr-xr-x 2 root tracing 0 Dec 21 14:35 hrtimer_expire_exit drwxr-xr-x 2 root tracing 0 Dec 21 14:35 hrtimer_init drwxr-xr-x 2 root tracing 0 Dec 21 14:35 hrtimer_start drwxr-xr-x 2 root tracing 0 Dec 21 14:35 itimer_expire drwxr-xr-x 2 root tracing 0 Dec 21 14:35 itimer_state drwxr-xr-x 2 root tracing 0 Dec 21 14:35 tick_stop drwxr-xr-x 2 root tracing 0 Dec 21 14:35 timer_cancel drwxr-xr-x 2 root tracing 0 Dec 21 14:35 timer_expire_entry drwxr-xr-x 2 root tracing 0 Dec 21 14:35 timer_expire_exit drwxr-xr-x 2 root tracing 0 Dec 21 14:35 timer_init drwxr-xr-x 2 root tracing 0 Dec 21 14:35 timer_start At first it was thought that this could be easily fixed by just making the default ownership of the superblock when it was mounted. But this does not handle the case of: # chgrp tracing instances # mkdir instances/foo If the superblock was used, then the group ownership would be that of what it was when it was mounted, when it should instead be "tracing". Instead, set a flag for the top level eventfs directory ("events") to flag which eventfs_inode belongs to it. Since the "events" directory's dentry and inode are never freed, it does not need to use its attr field to restore its mode and ownership. Use the this eventfs_inode's attr as the default ownership for all the files and directories underneath it. When the events eventfs_inode is created, it sets its ownership to its parent uid and gid. As the events directory is created at boot up before it gets mounted, this will always be uid=0 and gid=0. If it's created via an instance, then it will take the ownership of the instance directory. When the file system is mounted, it will update all the gids if one is specified. This will have a callback to update the events evenfs_inode's default entries. When a file or directory is created under the events directory, it will walk the ei->dentry parents until it finds the evenfs_inode that belongs to the events directory to retrieve the default uid and gid values. Link: https://lore.kernel.org/all/CAHk-=wiwQtUHvzwyZucDq8=Gtw+AnwScyLhpFswrQ84PjhoGsg@mail.gmail.com/ Link: https://lore.kernel.org/linux-trace-kernel/20231221190757.7eddbca9@gandalf.local.home Cc: stable@vger.kernel.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Dongliang Cui <cuidongliang390@gmail.com> Cc: Hongyu Jin <hongyu.jin@unisoc.com> Fixes: 0dfc852b6fe3 ("eventfs: Have event files and directories default to parent uid and gid") Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Tested-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-11-22eventfs: Do not allow NULL parent to eventfs_start_creating()Steven Rostedt (Google)1-9/+4
The eventfs directory is dynamically created via the meta data supplied by the existing trace events. All files and directories in eventfs has a parent. Do not allow NULL to be passed into eventfs_start_creating() as the parent because that should never happen. Warn if it does. Link: https://lkml.kernel.org/r/20231121231112.693841807@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-11-03Merge tag 'trace-v6.7' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing updates from Steven Rostedt: - Remove eventfs_file descriptor This is the biggest change, and the second part of making eventfs create its files dynamically. In 6.6 the first part was added, and that maintained a one to one mapping between eventfs meta descriptors and the directories and file inodes and dentries that were dynamically created. The directories were represented by a eventfs_inode and the files were represented by a eventfs_file. In v6.7 the eventfs_file is removed. As all events have the same directory make up (sched_switch has an "enable", "id", "format", etc files), the handing of what files are underneath each leaf eventfs directory is moved back to the tracing subsystem via a callback. When an event is added to the eventfs, it registers an array of evenfs_entry's. These hold the names of the files and the callbacks to call when the file is referenced. The callback gets the name so that the same callback may be used by multiple files. The callback then supplies the filesystem_operations structure needed to create this file. This has brought the memory footprint of creating multiple eventfs instances down by 2 megs each! - User events now has persistent events that are not associated to a single processes. These are privileged events that hang around even if no process is attached to them - Clean up of seq_buf There's talk about using seq_buf more to replace strscpy() and friends. But this also requires some minor modifications of seq_buf to be able to do this - Expand instance ring buffers individually Currently if boot up creates an instance, and a trace event is enabled on that instance, the ring buffer for that instance and the top level ring buffer are expanded (1.4 MB per CPU). This wastes memory as this happens when nothing is using the top level instance - Other minor clean ups and fixes * tag 'trace-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (34 commits) seq_buf: Export seq_buf_puts() seq_buf: Export seq_buf_putc() eventfs: Use simple_recursive_removal() to clean up dentries eventfs: Remove special processing of dput() of events directory eventfs: Delete eventfs_inode when the last dentry is freed eventfs: Hold eventfs_mutex when calling callback functions eventfs: Save ownership and mode eventfs: Test for ei->is_freed when accessing ei->dentry eventfs: Have a free_ei() that just frees the eventfs_inode eventfs: Remove "is_freed" union with rcu head eventfs: Fix kerneldoc of eventfs_remove_rec() tracing: Have the user copy of synthetic event address use correct context eventfs: Remove extra dget() in eventfs_create_events_dir() tracing: Have trace_event_file have ref counters seq_buf: Introduce DECLARE_SEQ_BUF and seq_buf_str() eventfs: Fix typo in eventfs_inode union comment eventfs: Fix WARN_ON() in create_file_dentry() powerpc: Remove initialisation of readpos tracing/histograms: Simplify last_cmd_set() seq_buf: fix a misleading comment ...
2023-10-18tracefs: convert to new timestamp accessorsJeff Layton1-1/+1
Convert to using the new inode timestamp accessor functions. Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20231004185347.80880-70-jlayton@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-04eventfs: Remove eventfs_file and just use eventfs_inodeSteven Rostedt (Google)1-1/+1
Instead of having a descriptor for every file represented in the eventfs directory, only have the directory itself represented. Change the API to send in a list of entries that represent all the files in the directory (but not other directories). The entry list contains a name and a callback function that will be used to create the files when they are accessed. struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry *parent, const struct eventfs_entry *entries, int size, void *data); is used for the top level eventfs directory, and returns an eventfs_inode that will be used by: struct eventfs_inode *eventfs_create_dir(const char *name, struct eventfs_inode *parent, const struct eventfs_entry *entries, int size, void *data); where both of the above take an array of struct eventfs_entry entries for every file that is in the directory. The entries are defined by: typedef int (*eventfs_callback)(const char *name, umode_t *mode, void **data, const struct file_operations **fops); struct eventfs_entry { const char *name; eventfs_callback callback; }; Where the name is the name of the file and the callback gets called when the file is being created. The callback passes in the name (in case the same callback is used for multiple files), a pointer to the mode, data and fops. The data will be pointing to the data that was passed in eventfs_create_dir() or eventfs_create_events_dir() but may be overridden to point to something else, as it will be used to point to the inode->i_private that is created. The information passed back from the callback is used to create the dentry/inode. If the callback fills the data and the file should be created, it must return a positive number. On zero or negative, the file is ignored. This logic may also be used as a prototype to convert entire pseudo file systems into just-in-time allocation. The "show_events_dentry" file has been updated to show the directories, and any files they have. With just the eventfs_file allocations: Before after deltas for meminfo (in kB): MemFree: -14360 MemAvailable: -14260 Buffers: 40 Cached: 24 Active: 44 Inactive: 48 Inactive(anon): 28 Active(file): 44 Inactive(file): 20 Dirty: -4 AnonPages: 28 Mapped: 4 KReclaimable: 132 Slab: 1604 SReclaimable: 132 SUnreclaim: 1472 Committed_AS: 12 Before after deltas for slabinfo: <slab>: <objects> [ * <size> = <total>] ext4_inode_cache 27 [* 1184 = 31968 ] extent_status 102 [* 40 = 4080 ] tracefs_inode_cache 144 [* 656 = 94464 ] buffer_head 39 [* 104 = 4056 ] shmem_inode_cache 49 [* 800 = 39200 ] filp -53 [* 256 = -13568 ] dentry 251 [* 192 = 48192 ] lsm_file_cache 277 [* 32 = 8864 ] vm_area_struct -14 [* 184 = -2576 ] trace_event_file 1748 [* 88 = 153824 ] kmalloc-1k 35 [* 1024 = 35840 ] kmalloc-256 49 [* 256 = 12544 ] kmalloc-192 -28 [* 192 = -5376 ] kmalloc-128 -30 [* 128 = -3840 ] kmalloc-96 10581 [* 96 = 1015776 ] kmalloc-64 3056 [* 64 = 195584 ] kmalloc-32 1291 [* 32 = 41312 ] kmalloc-16 2310 [* 16 = 36960 ] kmalloc-8 9216 [* 8 = 73728 ] Free memory dropped by 14,360 kB Available memory dropped by 14,260 kB Total slab additions in size: 1,771,032 bytes With this change: Before after deltas for meminfo (in kB): MemFree: -12084 MemAvailable: -11976 Buffers: 32 Cached: 32 Active: 72 Inactive: 168 Inactive(anon): 176 Active(file): 72 Inactive(file): -8 Dirty: 24 AnonPages: 196 Mapped: 8 KReclaimable: 148 Slab: 836 SReclaimable: 148 SUnreclaim: 688 Committed_AS: 324 Before after deltas for slabinfo: <slab>: <objects> [ * <size> = <total>] tracefs_inode_cache 144 [* 656 = 94464 ] shmem_inode_cache -23 [* 800 = -18400 ] filp -92 [* 256 = -23552 ] dentry 179 [* 192 = 34368 ] lsm_file_cache -3 [* 32 = -96 ] vm_area_struct -13 [* 184 = -2392 ] trace_event_file 1748 [* 88 = 153824 ] kmalloc-1k -49 [* 1024 = -50176 ] kmalloc-256 -27 [* 256 = -6912 ] kmalloc-128 1864 [* 128 = 238592 ] kmalloc-64 4685 [* 64 = 299840 ] kmalloc-32 -72 [* 32 = -2304 ] kmalloc-16 256 [* 16 = 4096 ] total = 721352 Free memory dropped by 12,084 kB Available memory dropped by 11,976 kB Total slab additions in size: 721,352 bytes That's over 2 MB in savings per instance for free and available memory, and over 1 MB in savings per instance of slab memory. Link: https://lore.kernel.org/linux-trace-kernel/20231003184059.4924468e@gandalf.local.home Link: https://lore.kernel.org/linux-trace-kernel/20231004165007.43d79161@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ajay Kaher <akaher@vmware.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-07tracefs/eventfs: Use dput to free the toplevel events directorySteven Rostedt (Google)1-1/+1
Currently when rmdir on an instance is done, eventfs_remove_events_dir() is called and it does a dput on the dentry and then frees the eventfs_inode that represents the events directory. But there's no protection against a reader reading the top level events directory at the same time and we can get a use after free error. Instead, use the dput() associated to the dentry to also free the eventfs_inode associated to the events directory, as that will get called when the last reference to the directory is released. This issue triggered the following KASAN report: ================================================================== BUG: KASAN: slab-use-after-free in eventfs_root_lookup+0x88/0x1b0 Read of size 8 at addr ffff888120130ca0 by task ftracetest/1201 CPU: 4 PID: 1201 Comm: ftracetest Not tainted 6.5.0-test-10737-g469e0a8194e7 #13 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x57/0x90 print_report+0xcf/0x670 ? __pfx_ring_buffer_record_off+0x10/0x10 ? _raw_spin_lock_irqsave+0x2b/0x70 ? __virt_addr_valid+0xd9/0x160 kasan_report+0xd4/0x110 ? eventfs_root_lookup+0x88/0x1b0 ? eventfs_root_lookup+0x88/0x1b0 eventfs_root_lookup+0x88/0x1b0 ? eventfs_root_lookup+0x33/0x1b0 __lookup_slow+0x194/0x2a0 ? __pfx___lookup_slow+0x10/0x10 ? down_read+0x11c/0x330 walk_component+0x166/0x220 link_path_walk.part.0.constprop.0+0x3a3/0x5a0 ? seqcount_lockdep_reader_access+0x82/0x90 ? __pfx_link_path_walk.part.0.constprop.0+0x10/0x10 path_openat+0x143/0x11f0 ? __lock_acquire+0xa1a/0x3220 ? __pfx_path_openat+0x10/0x10 ? __pfx___lock_acquire+0x10/0x10 do_filp_open+0x166/0x290 ? __pfx_do_filp_open+0x10/0x10 ? lock_is_held_type+0xce/0x120 ? preempt_count_sub+0xb7/0x100 ? _raw_spin_unlock+0x29/0x50 ? alloc_fd+0x1a0/0x320 do_sys_openat2+0x126/0x160 ? rcu_is_watching+0x34/0x60 ? __pfx_do_sys_openat2+0x10/0x10 ? __might_resched+0x2cf/0x3b0 ? __fget_light+0xdf/0x100 __x64_sys_openat+0xcd/0x140 ? __pfx___x64_sys_openat+0x10/0x10 ? syscall_enter_from_user_mode+0x22/0x90 ? lockdep_hardirqs_on+0x7d/0x100 do_syscall_64+0x3b/0xc0 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 RIP: 0033:0x7f1dceef5e51 Code: 75 57 89 f0 25 00 00 41 00 3d 00 00 41 00 74 49 80 3d 9a 27 0e 00 00 74 6d 89 da 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 93 00 00 00 48 8b 54 24 28 64 48 2b 14 25 RSP: 002b:00007fff2cddf380 EFLAGS: 00000202 ORIG_RAX: 0000000000000101 RAX: ffffffffffffffda RBX: 0000000000000241 RCX: 00007f1dceef5e51 RDX: 0000000000000241 RSI: 000055d7520677d0 RDI: 00000000ffffff9c RBP: 000055d7520677d0 R08: 000000000000001e R09: 0000000000000001 R10: 00000000000001b6 R11: 0000000000000202 R12: 0000000000000000 R13: 0000000000000003 R14: 000055d752035678 R15: 000055d752067788 </TASK> Allocated by task 1200: kasan_save_stack+0x2f/0x50 kasan_set_track+0x21/0x30 __kasan_kmalloc+0x8b/0x90 eventfs_create_events_dir+0x54/0x220 create_event_toplevel_files+0x42/0x130 event_trace_add_tracer+0x33/0x180 trace_array_create_dir+0x52/0xf0 trace_array_create+0x361/0x410 instance_mkdir+0x6b/0xb0 tracefs_syscall_mkdir+0x57/0x80 vfs_mkdir+0x275/0x380 do_mkdirat+0x1da/0x210 __x64_sys_mkdir+0x74/0xa0 do_syscall_64+0x3b/0xc0 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 Freed by task 1251: kasan_save_stack+0x2f/0x50 kasan_set_track+0x21/0x30 kasan_save_free_info+0x27/0x40 __kasan_slab_free+0x106/0x180 __kmem_cache_free+0x149/0x2e0 event_trace_del_tracer+0xcb/0x120 __remove_instance+0x16a/0x340 instance_rmdir+0x77/0xa0 tracefs_syscall_rmdir+0x77/0xc0 vfs_rmdir+0xed/0x2d0 do_rmdir+0x235/0x280 __x64_sys_rmdir+0x5f/0x90 do_syscall_64+0x3b/0xc0 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 The buggy address belongs to the object at ffff888120130ca0 which belongs to the cache kmalloc-16 of size 16 The buggy address is located 0 bytes inside of freed 16-byte region [ffff888120130ca0, ffff888120130cb0) The buggy address belongs to the physical page: page:000000004dbddbb0 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x120130 flags: 0x17ffffc0000800(slab|node=0|zone=2|lastcpupid=0x1fffff) page_type: 0xffffffff() raw: 0017ffffc0000800 ffff8881000423c0 dead000000000122 0000000000000000 raw: 0000000000000000 0000000000800080 00000001ffffffff 0000000000000000 page d