Diffstat (limited to 'Documentation/filesystems')
-rw-r--r-- | Documentation/filesystems/erofs.rst | 4
-rw-r--r-- | Documentation/filesystems/f2fs.rst | 2
-rw-r--r-- | Documentation/filesystems/idmappings.rst | 178
-rw-r--r-- | Documentation/filesystems/index.rst | 1
-rw-r--r-- | Documentation/filesystems/locking.rst | 4
-rw-r--r-- | Documentation/filesystems/mount_api.rst | 1
-rw-r--r-- | Documentation/filesystems/proc.rst | 54
-rw-r--r-- | Documentation/filesystems/sysfs.rst | 4
-rw-r--r-- | Documentation/filesystems/tmpfs.rst | 66
-rw-r--r-- | Documentation/filesystems/vfs.rst | 105
-rw-r--r-- | Documentation/filesystems/xfs-online-fsck-design.rst | 5315
-rw-r--r-- | Documentation/filesystems/xfs-self-describing-metadata.rst | 1
12 files changed, 5623 insertions, 112 deletions
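Reading aid for the idmappings.rst hunks below: they write an idmapping as ``u<first userspace id>:k<first kernel id>:r<range>`` (with ``v`` for VFS ids in a mount idmapping) and step through ``from_kuid()`` and ``make_kuid()`` by hand. The arithmetic behind those walk-throughs can be sketched in Python. This is only a toy model under stated assumptions: each idmapping has a single extent, ``None`` stands in for the kernel's invalid-id value, and the names ``IdMapping``, ``map_up``, ``map_down`` and the ``*_model`` functions are invented for this sketch and are not kernel APIs.

```python
from typing import NamedTuple, Optional


class IdMapping(NamedTuple):
    """Single-extent idmapping, e.g. u0:k20000:r10000 (hypothetical model)."""
    upper: int  # first userspace id (the "u" part)
    lower: int  # first kernel or VFS id (the "k" or "v" part)
    count: int  # number of ids covered (the "r" part)


def map_down(m: IdMapping, uid: int) -> Optional[int]:
    """Models make_kuid(): userspace id -> kernel/VFS id, None if unmapped."""
    if m.upper <= uid < m.upper + m.count:
        return m.lower + (uid - m.upper)
    return None


def map_up(m: IdMapping, kid: int) -> Optional[int]:
    """Models from_kuid(): kernel/VFS id -> userspace id, None if unmapped."""
    if m.lower <= kid < m.lower + m.count:
        return m.upper + (kid - m.lower)
    return None


def i_uid_into_vfsuid_model(fs: IdMapping, mnt: IdMapping, kid: int) -> Optional[int]:
    """Filesystem kernel id -> VFS id: up through fs, down through mount."""
    uid = map_up(fs, kid)
    return None if uid is None else map_down(mnt, uid)


def mapped_fsuid_model(mnt: IdMapping, fs: IdMapping, vid: int) -> Optional[int]:
    """Caller's VFS id -> filesystem kernel id: up through mount, down through fs."""
    uid = map_up(mnt, vid)
    return None if uid is None else map_down(fs, uid)


if __name__ == "__main__":
    # Values taken from the walk-through in the idmappings.rst hunk.
    filesystem = IdMapping(0, 20000, 10000)  # u0:k20000:r10000
    mount = IdMapping(0, 10000, 10000)       # u0:v10000:r10000
    print(i_uid_into_vfsuid_model(filesystem, mount, 21000))
    print(mapped_fsuid_model(mount, filesystem, 11000))
```

With the filesystem and mount idmappings from the walk-through, the two model functions undo each other, which is the invertibility property the documentation text relies on. The sketch deliberately ignores the type distinction (``kuid_t`` versus ``vfsuid_t``) that the patch introduces; in the kernel that distinction is enforced by the compiler, not by arithmetic.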
diff --git a/Documentation/filesystems/erofs.rst b/Documentation/filesystems/erofs.rst index a43aacf1494e..4654ee57c1d5 100644 --- a/Documentation/filesystems/erofs.rst +++ b/Documentation/filesystems/erofs.rst @@ -40,8 +40,8 @@ Here are the main features of EROFS: - Support multiple devices to refer to external blobs, which can be used for container images; - - 4KiB block size and 32-bit block addresses for each device, therefore - 16TiB address space at most for now; + - 32-bit block addresses for each device, therefore 16TiB address space at + most with 4KiB block size for now; - Two inode layouts for different requirements: diff --git a/Documentation/filesystems/f2fs.rst b/Documentation/filesystems/f2fs.rst index 2055e72871fe..c57745375edb 100644 --- a/Documentation/filesystems/f2fs.rst +++ b/Documentation/filesystems/f2fs.rst @@ -264,7 +264,7 @@ checkpoint=%s[:%u[%]] Set to "disable" to turn off checkpointing. Set to "enabl disabled, any unmounting or unexpected shutdowns will cause the filesystem contents to appear as they did when the filesystem was mounted with that option. - While mounting with checkpoint=disabled, the filesystem must + While mounting with checkpoint=disable, the filesystem must run garbage collection to ensure that all available space can be used. If this takes too much time, the mount may return EAGAIN. You may optionally add a value to indicate how much diff --git a/Documentation/filesystems/idmappings.rst b/Documentation/filesystems/idmappings.rst index b9b31066aef2..ad6d21640576 100644 --- a/Documentation/filesystems/idmappings.rst +++ b/Documentation/filesystems/idmappings.rst @@ -241,7 +241,7 @@ according to the filesystem's idmapping as this would give the wrong owner if the caller is using an idmapping. So the kernel will map the id back up in the idmapping of the caller. 
Let's -assume the caller has the slighly unconventional idmapping +assume the caller has the somewhat unconventional idmapping ``u3000:k20000:r10000`` then ``k21000`` would map back up to ``u4000``. Consequently the user would see that this file is owned by ``u4000``. @@ -320,6 +320,10 @@ and equally wrong:: from_kuid(u20000:k0:r10000, u1000) = k21000 ~~~~~ +Since userspace ids have type ``uid_t`` and ``gid_t`` and kernel ids have type +``kuid_t`` and ``kgid_t`` the compiler will throw an error when they are +conflated. So the two examples above would cause a compilation failure. + Idmappings when creating filesystem objects ------------------------------------------- @@ -623,42 +627,105 @@ privileged users in the initial user namespace. However, it is perfectly possible to combine idmapped mounts with filesystems mountable inside user namespaces. We will touch on this further below. +Filesystem types vs idmapped mount types +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +With the introduction of idmapped mounts we need to distinguish between +filesystem ownership and mount ownership of a VFS object such as an inode. The +owner of a inode might be different when looked at from a filesystem +perspective than when looked at from an idmapped mount. Such fundamental +conceptual distinctions should almost always be clearly expressed in the code. +So, to distinguish idmapped mount ownership from filesystem ownership separate +types have been introduced. + +If a uid or gid has been generated using the filesystem or caller's idmapping +then we will use the ``kuid_t`` and ``kgid_t`` types. However, if a uid or gid +has been generated using a mount idmapping then we will be using the dedicated +``vfsuid_t`` and ``vfsgid_t`` types. + +All VFS helpers that generate or take uids and gids as arguments use the +``vfsuid_t`` and ``vfsgid_t`` types and we will be able to rely on the compiler +to catch errors that originate from conflating filesystem and VFS uids and gids. 
+ +The ``vfsuid_t`` and ``vfsgid_t`` types are often mapped from and to ``kuid_t`` +and ``kgid_t`` types similar how ``kuid_t`` and ``kgid_t`` types are mapped +from and to ``uid_t`` and ``gid_t`` types:: + + uid_t <--> kuid_t <--> vfsuid_t + gid_t <--> kgid_t <--> vfsgid_t + +Whenever we report ownership based on a ``vfsuid_t`` or ``vfsgid_t`` type, +e.g., during ``stat()``, or store ownership information in a shared VFS object +based on a ``vfsuid_t`` or ``vfsgid_t`` type, e.g., during ``chown()`` we can +use the ``vfsuid_into_kuid()`` and ``vfsgid_into_kgid()`` helpers. + +To illustrate why this helper currently exists, consider what happens when we +change ownership of an inode from an idmapped mount. After we generated +a ``vfsuid_t`` or ``vfsgid_t`` based on the mount idmapping we later commit to +this ``vfsuid_t`` or ``vfsgid_t`` to become the new filesytem wide ownership. +Thus, we are turning the ``vfsuid_t`` or ``vfsgid_t`` into a global ``kuid_t`` +or ``kgid_t``. And this can be done by using ``vfsuid_into_kuid()`` and +``vfsgid_into_kgid()``. + +Note, whenever a shared VFS object, e.g., a cached ``struct inode`` or a cached +``struct posix_acl``, stores ownership information a filesystem or "global" +``kuid_t`` and ``kgid_t`` must be used. Ownership expressed via ``vfsuid_t`` +and ``vfsgid_t`` is specific to an idmapped mount. + +We already noted that ``vfsuid_t`` and ``vfsgid_t`` types are generated based +on mount idmappings whereas ``kuid_t`` and ``kgid_t`` types are generated based +on filesystem idmappings. To prevent abusing filesystem idmappings to generate +``vfsuid_t`` or ``vfsgid_t`` types or mount idmappings to generate ``kuid_t`` +or ``kgid_t`` types filesystem idmappings and mount idmappings are different +types as well. + +All helpers that map to or from ``vfsuid_t`` and ``vfsgid_t`` types require +a mount idmapping to be passed which is of type ``struct mnt_idmap``. Passing +a filesystem or caller idmapping will cause a compilation error. 
+ +Similar to how we prefix all userspace ids in this document with ``u`` and all +kernel ids with ``k`` we will prefix all VFS ids with ``v``. So a mount +idmapping will be written as: ``u0:v10000:r10000``. + Remapping helpers ~~~~~~~~~~~~~~~~~ Idmapping functions were added that translate between idmappings. They make use -of the remapping algorithm we've introduced earlier. We're going to look at -two: +of the remapping algorithm we've introduced earlier. We're going to look at: -- ``i_uid_into_mnt()`` and ``i_gid_into_mnt()`` +- ``i_uid_into_vfsuid()`` and ``i_gid_into_vfsgid()`` - The ``i_*id_into_mnt()`` functions translate filesystem's kernel ids into - kernel ids in the mount's idmapping:: + The ``i_*id_into_vfs*id()`` functions translate filesystem's kernel ids into + VFS ids in the mount's idmapping:: /* Map the filesystem's kernel id up into a userspace id in the filesystem's idmapping. */ from_kuid(filesystem, kid) = uid - /* Map the filesystem's userspace id down ito a kernel id in the mount's idmapping. */ + /* Map the filesystem's userspace id down ito a VFS id in the mount's idmapping. */ make_kuid(mount, uid) = kuid - ``mapped_fsuid()`` and ``mapped_fsgid()`` The ``mapped_fs*id()`` functions translate the caller's kernel ids into kernel ids in the filesystem's idmapping. This translation is achieved by - remapping the caller's kernel ids using the mount's idmapping:: + remapping the caller's VFS ids using the mount's idmapping:: - /* Map the caller's kernel id up into a userspace id in the mount's idmapping. */ + /* Map the caller's VFS id up into a userspace id in the mount's idmapping. */ from_kuid(mount, kid) = uid /* Map the mount's userspace id down into a kernel id in the filesystem's idmapping. */ make_kuid(filesystem, uid) = kuid +- ``vfsuid_into_kuid()`` and ``vfsgid_into_kgid()`` + + Whenever + Note that these two functions invert each other. 
Consider the following idmappings:: caller idmapping: u0:k10000:r10000 filesystem idmapping: u0:k20000:r10000 - mount idmapping: u0:k10000:r10000 + mount idmapping: u0:v10000:r10000 Assume a file owned by ``u1000`` is read from disk. The filesystem maps this id to ``k21000`` according to its idmapping. This is what is stored in the @@ -669,20 +736,21 @@ would usually simply use the crossmapping algorithm and map the filesystem's kernel id up to a userspace id in the caller's idmapping. But when the caller is accessing the file on an idmapped mount the kernel will -first call ``i_uid_into_mnt()`` thereby translating the filesystem's kernel id -into a kernel id in the mount's idmapping:: +first call ``i_uid_into_vfsuid()`` thereby translating the filesystem's kernel +id into a VFS id in the mount's idmapping:: - i_uid_into_mnt(k21000): + i_uid_into_vfsuid(k21000): /* Map the filesystem's kernel id up into a userspace id. */ from_kuid(u0:k20000:r10000, k21000) = u1000 - /* Map the filesystem's userspace id down ito a kernel id in the mount's idmapping. */ - make_kuid(u0:k10000:r10000, u1000) = k11000 + /* Map the filesystem's userspace id down into a VFS id in the mount's idmapping. */ + make_kuid(u0:v10000:r10000, u1000) = v11000 Finally, when the kernel reports the owner to the caller it will turn the -kernel id in the mount's idmapping into a userspace id in the caller's +VFS id in the mount's idmapping into a userspace id in the caller's idmapping:: + k11000 = vfsuid_into_kuid(v11000) from_kuid(u0:k10000:r10000, k11000) = u1000 We can test whether this algorithm really works by verifying what happens when @@ -696,18 +764,19 @@ fails. 
But when the caller is accessing the file on an idmapped mount the kernel will first call ``mapped_fs*id()`` thereby translating the caller's kernel id into -a kernel id according to the mount's idmapping:: +a VFS id according to the mount's idmapping:: mapped_fsuid(k11000): /* Map the caller's kernel id up into a userspace id in the mount's idmapping. */ from_kuid(u0:k10000:r10000, k11000) = u1000 /* Map the mount's userspace id down into a kernel id in the filesystem's idmapping. */ - make_kuid(u0:k20000:r10000, u1000) = k21000 + make_kuid(u0:v20000:r10000, u1000) = v21000 -When finally writing to disk the kernel will then map ``k21000`` up into a +When finally writing to disk the kernel will then map ``v21000`` up into a userspace id in the filesystem's idmapping:: + k21000 = vfsuid_into_kuid(v21000) from_kuid(u0:k20000:r10000, k21000) = u1000 As we can see, we end up with an invertible and therefore information @@ -725,7 +794,7 @@ Example 2 reconsidered caller id: u1000 caller idmapping: u0:k10000:r10000 filesystem idmapping: u0:k20000:r10000 - mount idmapping: u0:k10000:r10000 + mount idmapping: u0:v10000:r10000 When the caller is using a non-initial idmapping the common case is to attach the same idmapping to the mount. We now perform three steps: @@ -734,12 +803,12 @@ the same idmapping to the mount. We now perform three steps: make_kuid(u0:k10000:r10000, u1000) = k11000 -2. Translate the caller's kernel id into a kernel id in the filesystem's +2. Translate the caller's VFS id into a kernel id in the filesystem's idmapping:: - mapped_fsuid(k11000): - /* Map the kernel id up into a userspace id in the mount's idmapping. */ - from_kuid(u0:k10000:r10000, k11000) = u1000 + mapped_fsuid(v11000): + /* Map the VFS id up into a userspace id in the mount's idmapping. */ + from_kuid(u0:v10000:r10000, v11000) = u1000 /* Map the userspace id down into a kernel id in the filesystem's idmapping. 
*/ make_kuid(u0:k20000:r10000, u1000) = k21000 @@ -759,7 +828,7 @@ Example 3 reconsidered caller id: u1000 caller idmapping: u0:k10000:r10000 filesystem idmapping: u0:k0:r4294967295 - mount idmapping: u0:k10000:r10000 + mount idmapping: u0:v10000:r10000 The same translation algorithm works with the third example. @@ -767,12 +836,12 @@ The same translation algorithm works with the third example. make_kuid(u0:k10000:r10000, u1000) = k11000 -2. Translate the caller's kernel id into a kernel id in the filesystem's +2. Translate the caller's VFS id into a kernel id in the filesystem's idmapping:: - mapped_fsuid(k11000): - /* Map the kernel id up into a userspace id in the mount's idmapping. */ - from_kuid(u0:k10000:r10000, k11000) = u1000 + mapped_fsuid(v11000): + /* Map the VFS id up into a userspace id in the mount's idmapping. */ + from_kuid(u0:v10000:r10000, v11000) = u1000 /* Map the userspace id down into a kernel id in the filesystem's idmapping. */ make_kuid(u0:k0:r4294967295, u1000) = k1000 @@ -792,7 +861,7 @@ Example 4 reconsidered file id: u1000 caller idmapping: u0:k10000:r10000 filesystem idmapping: u0:k0:r4294967295 - mount idmapping: u0:k10000:r10000 + mount idmapping: u0:v10000:r10000 In order to report ownership to userspace the kernel now does three steps using the translation algorithm we introduced earlier: @@ -802,17 +871,18 @@ the translation algorithm we introduced earlier: make_kuid(u0:k0:r4294967295, u1000) = k1000 -2. Translate the kernel id into a kernel id in the mount's idmapping:: +2. Translate the kernel id into a VFS id in the mount's idmapping:: - i_uid_into_mnt(k1000): + i_uid_into_vfsuid(k1000): /* Map the kernel id up into a userspace id in the filesystem's idmapping. */ from_kuid(u0:k0:r4294967295, k1000) = u1000 - /* Map the userspace id down into a kernel id in the mounts's idmapping. */ - make_kuid(u0:k10000:r10000, u1000) = k11000 + /* Map the userspace id down into a VFS id in the mounts's idmapping. 
*/ + make_kuid(u0:v10000:r10000, u1000) = v11000 -3. Map the kernel id up into a userspace id in the caller's idmapping:: +3. Map the VFS id up into a userspace id in the caller's idmapping:: + k11000 = vfsuid_into_kuid(v11000) from_kuid(u0:k10000:r10000, k11000) = u1000 Earlier, the caller's kernel id couldn't be crossmapped in the filesystems's @@ -828,7 +898,7 @@ Example 5 reconsidered file id: u1000 caller idmapping: u0:k10000:r10000 filesystem idmapping: u0:k20000:r10000 - mount idmapping: u0:k10000:r10000 + mount idmapping: u0:v10000:r10000 Again, in order to report ownership to userspace the kernel now does three steps using the translation algorithm we introduced earlier: @@ -838,17 +908,18 @@ steps using the translation algorithm we introduced earlier: make_kuid(u0:k20000:r10000, u1000) = k21000 -2. Translate the kernel id into a kernel id in the mount's idmapping:: +2. Translate the kernel id into a VFS id in the mount's idmapping:: - i_uid_into_mnt(k21000): + i_uid_into_vfsuid(k21000): /* Map the kernel id up into a userspace id in the filesystem's idmapping. */ from_kuid(u0:k20000:r10000, k21000) = u1000 - /* Map the userspace id down into a kernel id in the mounts's idmapping. */ - make_kuid(u0:k10000:r10000, u1000) = k11000 + /* Map the userspace id down into a VFS id in the mounts's idmapping. */ + make_kuid(u0:v10000:r10000, u1000) = v11000 -3. Map the kernel id up into a userspace id in the caller's idmapping:: +3. Map the VFS id up into a userspace id in the caller's idmapping:: + k11000 = vfsuid_into_kuid(v11000) from_kuid(u0:k10000:r10000, k11000) = u1000 Earlier, the file's kernel id couldn't be crossmapped in the filesystems's @@ -899,23 +970,23 @@ from above::: caller id: u1125 caller idmapping: u0:k0:r4294967295 filesystem idmapping: u0:k0:r4294967295 - mount idmapping: u1000:k1125:r1 + mount idmapping: u1000:v1125:r1 1. Map the caller's userspace ids into kernel ids in the caller's idmapping:: make_kuid(u0:k0:r4294967295, u1125) = k1125 -2. 
Translate the caller's kernel id into a kernel id in the filesystem's +2. Translate the caller's VFS id into a kernel id in the filesystem's idmapping:: - mapped_fsuid(k1125): - /* Map the kernel id up into a userspace id in the mount's idmapping. */ - from_kuid(u1000:k1125:r1, k1125) = u1000 + mapped_fsuid(v1125): + /* Map the VFS id up into a userspace id in the mount's idmapping. */ + from_kuid(u1000:v1125:r1, v1125) = u1000 /* Map the userspace id down into a kernel id in the filesystem's idmapping. */ make_kuid(u0:k0:r4294967295, u1000) = k1000 -2. Verify that the caller's kernel ids can be mapped to userspace ids in the +2. Verify that the caller's filesystem ids can be mapped to userspace ids in the filesystem's idmapping:: from_kuid(u0:k0:r4294967295, k1000) = u1000 @@ -930,24 +1001,25 @@ on their work computer: file id: u1000 caller idmapping: u0:k0:r4294967295 filesystem idmapping: u0:k0:r4294967295 - mount idmapping: u1000:k1125:r1 + mount idmapping: u1000:v1125:r1 1. Map the userspace id on disk down into a kernel id in the filesystem's idmapping:: make_kuid(u0:k0:r4294967295, u1000) = k1000 -2. Translate the kernel id into a kernel id in the mount's idmapping:: +2. Translate the kernel id into a VFS id in the mount's idmapping:: - i_uid_into_mnt(k1000): + i_uid_into_vfsuid(k1000): /* Map the kernel id up into a userspace id in the filesystem's idmapping. */ from_kuid(u0:k0:r4294967295, k1000) = u1000 - /* Map the userspace id down into a kernel id in the mounts's idmapping. */ - make_kuid(u1000:k1125:r1, u1000) = k1125 + /* Map the userspace id down into a VFS id in the mounts's idmapping. */ + make_kuid(u1000:v1125:r1, u1000) = v1125 -3. Map the kernel id up into a userspace id in the caller's idmapping:: +3. 
Map the VFS id up into a userspace id in the caller's idmapping:: + k1125 = vfsuid_into_kuid(v1125) from_kuid(u0:k0:r4294967295, k1125) = u1125 So ultimately the caller will be reported that the file belongs to ``u1125`` diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index bee63d42e5ec..fbb2b5ada95b 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -123,4 +123,5 @@ Documentation for filesystem implementations. vfat xfs-delayed-logging-design xfs-self-describing-metadata + xfs-online-fsck-design zonefs diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst index 7de7a7272a5e..aa1a233b0fa8 100644 --- a/Documentation/filesystems/locking.rst +++ b/Documentation/filesystems/locking.rst @@ -645,7 +645,7 @@ ops mmap_lock PageLocked(page) open: yes close: yes fault: yes can return with page locked -map_pages: yes +map_pages: read page_mkwrite: yes can return with page locked pfn_mkwrite: yes access: yes @@ -661,7 +661,7 @@ locked. The VM will unlock the page. ->map_pages() is called when VM asks to map easy accessible pages. Filesystem should find and map pages associated with offsets from "start_pgoff" -till "end_pgoff". ->map_pages() is called with page table locked and must +till "end_pgoff". ->map_pages() is called with the RCU lock held and must not block. If it's not possible to reach a page without blocking, filesystem should skip it. Filesystem should use do_set_pte() to setup page table entry. Pointer to entry associated with the page is passed in diff --git a/Documentation/filesystems/mount_api.rst b/Documentation/filesystems/mount_api.rst index 63204d2094fd..9aaf6ef75eb5 100644 --- a/Documentation/filesystems/mount_api.rst +++ b/Documentation/filesystems/mount_api.rst @@ -79,7 +79,6 @@ context. 
This is represented by the fs_context structure:: unsigned int sb_flags; unsigned int sb_flags_mask; unsigned int s_iflags; - unsigned int lsm_flags; enum fs_context_purpose purpose:8; ... }; diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst index 9d5fd9424e8b..7897a7dafcbc 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -85,7 +85,7 @@ contact Bodo Bauer at bb@ricochet.net. We'll be happy to add them to this document. The latest version of this document is available online at -http://tldp.org/LDP/Linux-Filesystem-Hierarchy/html/proc.html +https://www.kernel.org/doc/html/latest/filesystems/proc.html If the above direction does not works for you, you could try the kernel mailing list at linux-kernel@vger.kernel.org and/or try to reach me at @@ -179,6 +179,7 @@ read the file /proc/PID/status:: Gid: 100 100 100 100 FDSize: 256 Groups: 100 14 16 + Kthread: 0 VmPeak: 5004 kB VmSize: 5004 kB VmLck: 0 kB @@ -232,7 +233,7 @@ asynchronous manner and the value may not be very precise. To see a precise snapshot of a moment, you can see /proc/<pid>/smaps file and scan page table. It's slow but very precise. -.. table:: Table 1-2: Contents of the status files (as of 4.19) +.. table:: Table 1-2: Contents of the status fields (as of 4.19) ========================== =================================================== Field Content @@ -256,6 +257,7 @@ It's slow but very precise. NSpid descendant namespace process ID hierarchy NSpgid descendant namespace process group ID hierarchy NSsid descendant namespace session ID hierarchy + Kthread kernel thread flag, 1 is yes, 0 is no VmPeak peak virtual memory size VmSize total program size VmLck locked memory size @@ -305,7 +307,7 @@ It's slow but very precise. ========================== =================================================== -.. table:: Table 1-3: Contents of the statm files (as of 2.6.8-rc3) +.. 
table:: Table 1-3: Contents of the statm fields (as of 2.6.8-rc3) ======== =============================== ============================== Field Content @@ -323,7 +325,7 @@ It's slow but very precise. ======== =============================== ============================== -.. table:: Table 1-4: Contents of the stat files (as of 2.6.30-rc7) +.. table:: Table 1-4: Contents of the stat fields (as of 2.6.30-rc7) ============= =============================================================== Field Content @@ -996,6 +998,7 @@ Example output. You may not have all of these fields. VmallocUsed: 40444 kB VmallocChunk: 0 kB Percpu: 29312 kB + EarlyMemtestBad: 0 kB HardwareCorrupted: 0 kB AnonHugePages: 4149248 kB ShmemHugePages: 0 kB @@ -1146,6 +1149,13 @@ VmallocChunk Percpu Memory allocated to the percpu allocator used to back percpu allocations. This stat excludes the cost of metadata. +EarlyMemtestBad + The amount of RAM/memory in kB, that was identified as corrupted + by early memtest. If memtest was not run, this field will not + be displayed at all. Size is never rounded down to 0 kB. + That means if 0 kB is reported, you can safely assume + there was at least one pass of memtest and none of the passes + found a single faulty byte of RAM. HardwareCorrupted The amount of RAM/memory in KB, the kernel identifies as corrupted. @@ -1321,9 +1331,9 @@ many times the slaves link has failed. 1.4 SCSI info ------------- -If you have a SCSI host adapter in your system, you'll find a subdirectory -named after the driver for this adapter in /proc/scsi. You'll also see a list -of all recognized SCSI devices in /proc/scsi:: +If you have a SCSI or ATA host adapter in your system, you'll find a +subdirectory named after the driver for this adapter in /proc/scsi. 
+You'll also see a list of all recognized SCSI devices in /proc/scsi:: >cat /proc/scsi/scsi Attached devices: @@ -1449,16 +1459,18 @@ Various pieces of information about kernel activity are available in the since the system first booted. For a quick look, simply cat the file:: > cat /proc/stat - cpu 2255 34 2290 22625563 6290 127 456 0 0 0 - cpu0 1132 34 1441 11311718 3675 127 438 0 0 0 - cpu1 1123 0 849 11313845 2614 0 18 0 0 0 - intr 114930548 113199788 3 0 5 263 0 4 [... lots more numbers ...] - ctxt 1990473 - btime 1062191376 - processes 2915 - procs_running 1 + cpu 237902850 368826709 106375398 1873517540 1135548 0 14507935 0 0 0 + cpu0 60045249 91891769 26331539 468411416 495718 0 5739640 0 0 0 + cpu1 59746288 91759249 26609887 468860630 312281 0 4384817 0 0 0 + cpu2 59489247 92985423 26904446 467808813 171668 0 2268998 0 0 0 + cpu3 58622065 92190267 26529524 468436680 155879 0 2114478 0 0 0 + intr 8688370575 8 3373 0 0 0 0 0 0 1 40791 0 0 353317 0 0 0 0 224789828 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 190974333 41958554 123983334 43 0 224593 0 0 0 <more 0's deleted> + ctxt 22848221062 + btime 1605316999 + processes 746787147 + procs_running 2 procs_blocked 0 - softirq 183433 0 21755 12 39 1137 231 21459 2263 + softirq 12121874454 100099120 3938138295 127375644 2795979 187870761 0 173808342 3072582055 52608 224184354 The very first "cpu" line aggregates the numbers in all of the other "cpuN" lines. These numbers identify the amount of time the CPU has spent performing @@ -1520,8 +1532,8 @@ softirq. Information about mounted ext4 file systems can be found in /proc/fs/ext4. Each mounted filesystem will have a directory in /proc/fs/ext4 based on its device name (i.e., /proc/fs/ext4/hdc or -/proc/fs/ext4/dm-0). The files in each per-device directory are shown -in Table 1-12, below. 
+/proc/fs/ext4/sda9 or /proc/fs/ext4/dm-0). The files in each per-device +directory are shown in Table 1-12, below. .. table:: Table 1-12: Files in /proc/fs/ext4/<devname> @@ -1601,12 +1613,12 @@ can inadvertently disrupt your system, it is advisable to read both documentation and source before actually making adjustments. In any case, be very careful when writing to any of these files. The entries in /proc may change slightly between the 2.1.* and the 2.2 kernel, so if there is any doubt -review the kernel documentation in the directory /usr/src/linux/Documentation. +review the kernel documentation in the directory linux/Documentation. This chapter is heavily based on the documentation included in the pre 2.2 kernels, and became part of it in version 2.2.1 of the Linux kernel. -Please see: Documentation/admin-guide/sysctl/ directory for descriptions of these -entries. +Please see: Documentation/admin-guide/sysctl/ directory for descriptions of +these entries. Summary ------- diff --git a/Documentation/filesystems/sysfs.rst b/Documentation/filesystems/sysfs.rst index f8187d466b97..c32993bc83c7 100644 --- a/Documentation/filesystems/sysfs.rst +++ b/Documentation/filesystems/sysfs.rst @@ -373,8 +373,8 @@ Structure:: struct bus_attribute { struct attribute attr; - ssize_t (*show)(struct bus_type *, char * buf); - ssize_t (*store)(struct bus_type *, const char * buf, size_t count); + ssize_t (*show)(const struct bus_type *, char * buf); + ssize_t (*store)(const struct bus_type *, const char * buf, size_t count); }; Declaring:: diff --git a/Documentation/filesystems/tmpfs.rst b/Documentation/filesystems/tmpfs.rst index 0408c245785e..f18f46be5c0c 100644 --- a/Documentation/filesystems/tmpfs.rst +++ b/Documentation/filesystems/tmpfs.rst @@ -13,17 +13,29 @@ everything stored therein is lost. tmpfs puts everything into the kernel internal caches and grows and shrinks to accommodate the files it contains and is able to swap -unneeded pages out to swap space. 
It has maximum size limits which can -be adjusted on the fly via 'mount -o remount ...' - -If you compare it to ramfs (which was the template to create tmpfs) -you gain swapping and limit checking. Another similar thing is the RAM -disk (/dev/ram*), which simulates a fixed size hard disk in physical -RAM, where you have to create an ordinary filesystem on top. Ramdisks -cannot swap and you do not have the possibility to resize them. - -Since tmpfs lives completely in the page cache and on swap, all tmpfs -pages will be shown as "Shmem" in /proc/meminfo and "Shared" in +unneeded pages out to swap space, if swap was enabled for the tmpfs +mount. tmpfs also supports THP. + +tmpfs extends ramfs with a few userspace configurable options listed and +explained further below, some of which can be reconfigured dynamically on the +fly using a remount ('mount -o remount ...') of the filesystem. A tmpfs +filesystem can be resized but it cannot be resized to a size below its current +usage. tmpfs also supports POSIX ACLs, and extended attributes for the +trusted.* and security.* namespaces. ramfs does not use swap and you cannot +modify any parameter for a ramfs filesystem. The size limit of a ramfs +filesystem is how much memory you have available, and so care must be taken if +used so to not run out of memory. + +An alternative to tmpfs and ramfs is to use brd to create RAM disks +(/dev/ram*), which allows you to simulate a block device disk in physical RAM. +To write data you would just then need to create an regular filesystem on top +this ramdisk. As with ramfs, brd ramdisks cannot swap. brd ramdisks are also +configured in size at initialization and you cannot dynamically resize them. +Contrary to brd ramdisks, tmpfs has its own filesystem, it does not rely on the +block layer at all. + +Since tmpfs lives completely in the page cache and optionally on swap, +all tmpfs pages will be shown as "Shmem" in /proc/meminfo and "Shared" in free(1). 
Notice that these counters also include shared memory (shmem, see ipcs(1)). The most reliable way to get the count is using df(1) and du(1). @@ -72,6 +84,8 @@ nr_inodes The maximum number of inodes for this instance. The default is half of the number of your physical RAM pages, or (on a machine with highmem) the number of lowmem RAM pages, whichever is the lower. +noswap Disables swap. Remounts must respect the original settings. + By default swap is enabled. ========= ============================================================ These parameters accept a suffix k, m or g for kilo, mega and giga and @@ -85,6 +99,36 @@ mount with such options, since it allows any user with write access to use up all the memory on the machine; but enhances the scalability of that instance in a system with many CPUs making intensive use of it. +tmpfs also supports Transparent Huge Pages which requires a kernel +configured with CONFIG_TRANSPARENT_HUGEPAGE and with huge supported for +your system (has_transparent_hugepage(), which is architecture specific). +The mount options for this are: + +====== ============================================================ +huge=0 never: disables huge pages for the mount +huge=1 always: enables huge pages for the mount +huge=2 within_size: only allocate huge pages if the page will be + fully within i_size, also respect fadvise()/madvise() hints. +huge=3 advise: only allocate huge pages if requested with + fadvise()/madvise() +====== ============================================================ + +There is a sysfs file which you can also use to control system wide THP +configuration for all tmpfs mounts, the file is: + +/sys/kernel/mm/transparent_hugepage/shmem_enabled + +This sysfs file is placed on top of THP sysfs directory and so is registered +by THP code. It is however only used to control all tmpfs mounts with one +single knob. Since it controls all tmpfs mounts it should only be used either +for emergency or testing purposes. 
The values you can set for shmem_enabled are: + +== ============================================================ +-1 deny: disables huge on shm_mnt and all mounts, for + emergency use +-2 force: enables huge on shm_mnt and all mounts, w/o needing + option, for testing +== ============================================================ tmpfs has a mount option to set the NUMA memory allocation policy for all files in that instance (if CONFIG_NUMA is enabled) - which can be diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst index f3b344f0c0a4..769be5230210 100644 --- a/Documentation/filesystems/vfs.rst +++ b/Documentation/filesystems/vfs.rst @@ -107,7 +107,7 @@ file /proc/filesystems. struct file_system_type ----------------------- -This describes the filesystem. As of kernel 2.6.39, the following +This describes the filesystem. The following members are defined: .. code-block:: c @@ -115,14 +115,24 @@ members are defined: struct file_system_type { const char *name; int fs_flags; + int (*init_fs_context)(struct fs_context *); + const struct fs_parameter_spec *parameters; struct dentry *(*mount) (struct file_system_type *, int, - const char *, void *); + const char *, void *); void (*kill_sb) (struct super_block *); struct module *owner; struct file_system_type * next; - struct list_head fs_supers; + struct hlist_head fs_supers; + struct lock_class_key s_lock_key; struct lock_class_key s_umount_key; + struct lock_class_key s_vfs_rename_key; + struct lock_class_key s_writers_key[SB_FREEZE_LEVELS]; + + struct lock_class_key i_lock_key; + struct lock_class_key i_mutex_key; + struct lock_class_key invalidate_lock_key; + struct lock_class_key i_mutex_dir_key; }; ``name`` @@ -132,6 +142,15 @@ members are defined: |