Merge tag 'vfs-6.11.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull iomap updates from Christian Brauner:
 "This contains some minor work for the iomap subsystem:

   - Add documentation on the design of iomap and how to port to it

   - Optimize iomap_read_folio()

   - Bring back the change to iomap_write_end() to not increase i_size.

     This is accompanied by a change to xfs to reserve blocks for
     truncating large realtime inodes to avoid exposing stale data when
     iomap_write_end() stops increasing i_size"

* tag 'vfs-6.11.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  iomap: don't increase i_size in iomap_write_end()
  xfs: reserve blocks for truncating large realtime inode
  Documentation: the design of iomap and how to port
  iomap: Optimize iomap_read_folio
author: Linus Torvalds <torvalds@linux-foundation.org> 2024-07-15 13:28:14 -0700
committer: Linus Torvalds <torvalds@linux-foundation.org> 2024-07-15 13:28:14 -0700
commit: 4f5e249ec0ea8872e1644df23cffffbe28007188 (patch)
tree: ecb7066ea436d502889c86ba27c598fc0947d4cf /Documentation/filesystems
parent: 98f3a9a4fd449641010c77abca16aebb0b8d4419 (diff)
parent: 602f09f4029c7b5e1a2f44a7651ac8922a904a1b (diff)
5 files changed, 1288 insertions, 0 deletions
diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
index 8f5c1ee02e2f..e8e496d23e1d 100644
--- a/Documentation/filesystems/index.rst
+++ b/Documentation/filesystems/index.rst
@@ -34,6 +34,7 @@ algorithms work.
    seq_file
    sharedsubtree
    idmappings
+   iomap/index
 
    automount-support
 
diff --git a/Documentation/filesystems/iomap/design.rst b/Documentation/filesystems/iomap/design.rst
new file mode 100644
index 000000000000..f8ee3427bc1a
--- /dev/null
+++ b/Documentation/filesystems/iomap/design.rst
@@ -0,0 +1,441 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. _iomap_design:
+
+..
+        Dumb style notes to maintain the author's sanity:
+        Please try to start sentences on separate lines so that
+        sentence changes don't bleed colors in diff.
+        Heading decorations are documented in sphinx.rst.
+
+==============
+Library Design
+==============
+
+.. contents:: Table of Contents
+   :local:
+
+Introduction
+============
+
+iomap is a filesystem library for handling common file operations.
+The library has two layers:
+
+ 1. A lower layer that provides an iterator over ranges of file offsets.
+    This layer tries to obtain mappings of each file range to storage
+    from the filesystem, but the storage information is not necessarily
+    required.
+
+ 2. An upper layer that acts upon the space mappings provided by the
+    lower layer iterator.
+
+The iteration can involve mappings of a file's logical offset ranges to
+physical extents, but the storage layer information is not necessarily
+required, e.g. for walking cached file information.
+The library exports various APIs for implementing file operations such
+as:
+
+ * Pagecache reads and writes
+ * Folio write faults to the pagecache
+ * Writeback of dirty folios
+ * Direct I/O reads and writes
+ * fsdax I/O reads, writes, loads, and stores
+ * FIEMAP
+ * lseek ``SEEK_DATA`` and ``SEEK_HOLE``
+ * swapfile activation
+
+The origin of this library is the file I/O path that XFS once used; it
+has now been extended to cover several other operations.
+
+Who Should Read This?
+=====================
+
+The target audience for this document is filesystem, storage, and
+pagecache programmers and code reviewers.
+
+If you are working on PCI, machine architectures, or device drivers, you
+are most likely in the wrong place.
+
+How Is This Better?
+===================
+
+Unlike the classic Linux I/O model which breaks file I/O into small
+units (generally memory pages or blocks) and looks up space mappings on
+the basis of that unit, the iomap model asks the filesystem for the
+largest space mappings that it can create for a given file operation and
+initiates operations on that basis.
+This strategy improves the filesystem's visibility into the size of the
+operation being performed, which enables it to combat fragmentation with
+larger space allocations when possible.
+Larger space mappings improve runtime performance by amortizing the cost
+of mapping function calls into the filesystem across a larger amount of
+data.
+
+At a high level, an iomap operation `looks like this
+<https://lore.kernel.org/all/ZGbVaewzcCysclPt@dread.disaster.area/>`_:
+
+1. For each byte in the operation range...
+
+   1. Obtain a space mapping via ``->iomap_begin``
+
+   2. For each sub-unit of work...
+
+      1. Revalidate the mapping and go back to (1) above, if necessary.
+         So far only the pagecache operations need to do this.
+
+      2. Do the work
+
+   3. Increment operation cursor
+
+   4. Release the mapping via ``->iomap_end``, if necessary
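+
+The loop above can be modelled in user space roughly as follows.
+This is a sketch under stated assumptions, not kernel code:
+``toy_map``, ``toy_begin_4k``, and ``toy_iterate`` are hypothetical
+names, the types are stand-ins for the real structures, and the
+revalidation step is left out.
+
+.. code-block:: c
+
+ #include <errno.h>
+ #include <stdint.h>
+
+ /* Stand-in for struct iomap: only the file range is modelled. */
+ struct toy_map {
+     uint64_t offset;
+     uint64_t length;
+ };
+
+ /* Hypothetical stand-ins for ->iomap_begin and ->iomap_end. */
+ typedef int (*toy_begin_fn)(uint64_t pos, uint64_t len,
+                             struct toy_map *map);
+ typedef int (*toy_end_fn)(uint64_t pos, uint64_t len, uint64_t written,
+                           struct toy_map *map);
+
+ unsigned int toy_begin_calls;
+
+ /* Example begin function: map the 4096-byte block containing pos. */
+ int toy_begin_4k(uint64_t pos, uint64_t len, struct toy_map *map)
+ {
+     (void)len;
+     toy_begin_calls++;
+     map->offset = pos - (pos % 4096);
+     map->length = 4096;
+     return 0;
+ }
+
+ /* Walk [pos, pos + len); return bytes processed or a negative errno. */
+ int64_t toy_iterate(uint64_t pos, uint64_t len, toy_begin_fn begin,
+                     toy_end_fn end)
+ {
+     uint64_t done = 0;
+
+     while (done < len) {
+         struct toy_map map = { 0, 0 };
+         uint64_t cur = pos + done;
+         uint64_t chunk;
+         int ret = begin(cur, len - done, &map);
+
+         if (ret)
+             return ret;
+         /* The mapping must cover at least the first requested byte. */
+         if (map.offset > cur || map.offset + map.length <= cur)
+             return -EIO;
+         chunk = map.offset + map.length - cur;
+         if (chunk > len - done)
+             chunk = len - done;
+
+         /* The per-mapping work on [cur, cur + chunk) would go here. */
+         done += chunk;
+
+         /* The end callback is optional, as ->iomap_end is. */
+         ret = end ? end(cur, chunk, chunk, &map) : 0;
+         if (ret)
+             return ret;
+     }
+     return (int64_t)done;
+ }
+
+With 4096-byte mappings, a 10000-byte operation starting at offset 100
+takes three trips through this loop.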
+
+Each iomap operation will be covered in more detail below.
+This library was covered previously by an `LWN article
+<https://lwn.net/Articles/935934/>`_ and a `KernelNewbies page
+<https://kernelnewbies.org/KernelProjects/iomap>`_.
+
+The goal of this document is to provide a brief discussion of the
+design and capabilities of iomap, followed by a more detailed catalog
+of the interfaces presented by iomap.
+If you change iomap, please update this design document.
+
+File Range Iterator
+===================
+
+Definitions
+-----------
+
+ * **buffer head**: Shattered remnants of the old buffer cache.
+
+ * ``fsblock``: The block size of a file, also known as ``i_blocksize``.
+
+ * ``i_rwsem``: The VFS ``struct inode`` rwsemaphore.
+   Processes hold this in shared mode to read file state and contents.
+   Some filesystems may allow shared mode for writes.
+   Processes often hold this in exclusive mode to change file state and
+   contents.
+
+ * ``invalidate_lock``: The pagecache ``struct address_space``
+   rwsemaphore that protects against folio insertion and removal for
+   filesystems that support punching out folios below EOF.
+   Processes wishing to insert folios must hold this lock in shared
+   mode to prevent removal, though concurrent insertion is allowed.
+   Processes wishing to remove folios must hold this lock in exclusive
+   mode to prevent insertions.
+   Concurrent removals are not allowed.
+
+ * ``dax_read_lock``: The RCU read lock that dax takes to prevent a
+   device pre-shutdown hook from returning before other threads have
+   released resources.
+
+ * **filesystem mapping lock**: This synchronization primitive is
+   internal to the filesystem and must protect the file mapping data
+   from updates while a mapping is being sampled.
+   The filesystem author must determine how this coordination should
+   happen; it does not need to be an actual lock.
+
+ * **iomap internal operation lock**: This is a general term for
+   synchronization primitives that iomap functions take while holding a
+   mapping.
+   A specific example would be taking the folio lock while reading or
+   writing the pagecache.
+
+ * **pure overwrite**: A write operation that does not require any
+   metadata or zeroing operations to perform during either submission
+   or completion.
+   This implies that the filesystem must have already allocated space
+   on disk as ``IOMAP_MAPPED`` and the filesystem must not place any
+   constraints on IO alignment or size.
+   The only constraints on I/O alignment are device level (minimum I/O
+   size and alignment, typically sector size).
+
+``struct iomap``
+----------------
+
+The filesystem communicates to the iomap iterator the mapping of
+byte ranges of a file to byte ranges of a storage device with the
+structure below:
+
+.. code-block:: c
+
+ struct iomap {
+     u64                 addr;
+     loff_t              offset;
+     u64                 length;
+     u16                 type;
+     u16                 flags;
+     struct block_device *bdev;
+     struct dax_device   *dax_dev;
+     void                *inline_data;
+     void                *private;
+     const struct iomap_folio_ops *folio_ops;
+     u64                 validity_cookie;
+ };
+
+The fields are as follows:
+
+ * ``offset`` and ``length`` describe the range of file offsets, in
+   bytes, covered by this mapping.
+   These fields must always be set by the filesystem.
+
+ * ``type`` describes the type of the space mapping:
+
+   * **IOMAP_HOLE**: No storage has been allocated.
+     This type must never be returned in response to an ``IOMAP_WRITE``
+     operation because writes must allocate and map space, and return
+     the mapping.
+     The ``addr`` field must be set to ``IOMAP_NULL_ADDR``.
+     iomap does not support writing (whether via pagecache or direct
+     I/O) to a hole.
+
+   * **IOMAP_DELALLOC**: A promise to allocate space at a later time
+     ("delayed allocation").
+     If the filesystem returns IOMAP_F_NEW here and the write fails, the
+     ``->iomap_end`` function must delete the reservation.
+     The ``addr`` field must be set to ``IOMAP_NULL_ADDR``.
+
+   * **IOMAP_MAPPED**: The file range maps to specific space on the
+     storage device.
+     The device is returned in ``bdev`` or ``dax_dev``.
+     The device address, in bytes, is returned via ``addr``.
+
+   * **IOMAP_UNWRITTEN**: The file range maps to specific space on the
+     storage device, but the space has not yet been initialized.
+     The device is returned in ``bdev`` or ``dax_dev``.
+     The device address, in bytes, is returned via ``addr``.
+     Reads from this type of mapping will return zeroes to the caller.
+     For a write or writeback operation, the ioend should update the
+     mapping to MAPPED.
+     Refer to the sections about ioends for more details.
+
+   * **IOMAP_INLINE**: The file range maps to the memory buffer
+     specified by ``inline_data``.
+     For write operations, the ``->iomap_end`` function presumably
+     handles persisting the data.
+     The ``addr`` field must be set to ``IOMAP_NULL_ADDR``.
+
+ * ``flags`` describe the status of the space mapping.
+   These flags should be set by the filesystem in ``->iomap_begin``:
+
+   * **IOMAP_F_NEW**: The space under the mapping is newly allocated.
+     Areas that will not be written to must be zeroed.
+     If a write fails and the mapping is a space reservation, the
+     reservation must be deleted.
+
+   * **IOMAP_F_DIRTY**: The inode will have uncommitted metadata needed
+     to access any data written.
+     fdatasync is required to commit these changes to persistent
+     storage.
+     This needs to take into account metadata changes that *may* be made
+     at I/O completion, such as file size updates from direct I/O.
+
+   * **IOMAP_F_SHARED**: The space under the mapping is shared.
+     Copy on write is necessary to avoid corrupting other file data.
+
+   * **IOMAP_F_BUFFER_HEAD**: This mapping requires the use of buffer
+     heads for pagecache operations.
+     Do not add more uses of this.
+
+   * **IOMAP_F_MERGED**: Multiple contiguous block mappings were
+     coalesced into this single mapping.
+     This is only useful for FIEMAP.
+
+   * **IOMAP_F_XATTR**: The mapping is for extended attribute data, not
+     regular file data.
+     This is only useful for FIEMAP.
+
+   * **IOMAP_F_PRIVATE**: Starting with this value, the upper bits can
+     be set by the filesystem for its own purposes.
+
+   These flags can be set by iomap itself during file operations.
+   The filesystem should supply an ``->iomap_end`` function if it needs
+   to observe these flags:
+
+   * **IOMAP_F_SIZE_CHANGED**: The file size has changed as a result of
+     using this mapping.
+
+   * **IOMAP_F_STALE**: The mapping was found to be stale.
+     iomap will call ``->iomap_end`` on this mapping and then
+     ``->iomap_begin`` to obtain a new mapping.
+
+   Currently, these flags are only set by pagecache operations.
+
+ * ``addr`` describes the device address, in bytes.
+
+ * ``bdev`` describes the block device for this mapping.
+   This only needs to be set for mapped or unwritten operations.
+
+ * ``dax_dev`` describes the DAX device for this mapping.
+   This only needs to be set for mapped or unwritten operations, and
+   only for a fsdax operation.
+
+ * ``inline_data`` points to a memory buffer for I/O involving
+   ``IOMAP_INLINE`` mappings.
+   This value is ignored for all other mapping types.
+
+ * ``private`` is a pointer to `filesystem-private information
+   <https://lore.kernel.org/all/20180619164137.13720-7-hch@lst.de/>`_.
+   This value will be passed unchanged to ``->iomap_end``.
+
+ * ``folio_ops`` will be covered in the section on pagecache operations.
+
+ * ``validity_cookie`` is a magic freshness value set by the filesystem
+   that should be used to detect stale mappings.
+   For pagecache operations this is critical for correct operation
+   because page faults can occur, which implies that filesystem locks
+   should not be held between ``->iomap_begin`` and ``->iomap_end``.
+   Filesystems with completely static mappings need not set this value.
+   Only pagecache operations revalidate mappings; see the section about
+   ``iomap_valid`` for details.
+
+``struct iomap_ops``
+--------------------
+
+Every iomap function requires the filesystem to pass an operations
+structure to obtain a mapping and (optionally) to release the mapping:
+
+.. code-block:: c
+
+ struct iomap_ops {
+     int (*iomap_begin)(struct inode *inode, loff_t pos, loff_t length,
+                        unsigned flags, struct iomap *iomap,
+                        struct iomap *srcmap);
+
+     int (*iomap_end)(struct inode *inode, loff_t pos, loff_t length,
+                      ssize_t written, unsigned flags,
+                      struct iomap *iomap);
+ };
+
+``->iomap_begin``
+~~~~~~~~~~~~~~~~~
+
+iomap operations call ``->iomap_begin`` to obtain one file mapping for
+the range of bytes specified by ``pos`` and ``length`` for the file
+``inode``.
+This mapping should be returned through the ``iomap`` pointer.
+The mapping must cover at least the first byte of the supplied file
+range, but it does not need to cover the entire requested range.
+
+Each iomap operation describes the requested operation through the
+``flags`` argument.
+The exact value of ``flags`` will be documented in the
+operation-specific sections below.
+These flags can, at least in principle, apply generally to iomap
+operations:
+
+ * ``IOMAP_DIRECT`` is set when the caller wishes to issue file I/O to
+   block storage.
+
+ * ``IOMAP_DAX`` is set when the caller wishes to issue file I/O to
+   memory-like storage.
+
+ * ``IOMAP_NOWAIT`` is set when the caller wishes to perform a best
+   effort attempt to avoid any operation that would result in blocking
+   the submitting task.
+   This is similar in intent to ``O_NONBLOCK`` for network APIs - it is
+   intended for asynchronous applications to keep doing other work
+   instead of waiting for the specific unavailable filesystem resource
+   to become available.
+   Filesystems implementing ``IOMAP_NOWAIT`` semantics need to use
+   trylock algorithms.
+   They need to be able to satisfy the entire I/O request range with a
+   single iomap mapping.
+   They need to avoid reading or writing metadata synchronously.
+   They need to avoid blocking memory allocations.
+   They need to avoid waiting on transaction reservations to allow
+   modifications to take place.
+   They probably should not be allocating new space.
+   And so on.
+   If there is any doubt in the filesystem developer's mind as to
+   whether any specific ``IOMAP_NOWAIT`` operation may end up blocking,
+   then they should return ``-EAGAIN`` as early as possible rather than
+   start the operation and force the submitting task to block.
+   ``IOMAP_NOWAIT`` is often set on behalf of ``IOCB_NOWAIT`` or
+   ``RWF_NOWAIT``.
+
+If it is necessary to read existing file contents from a `different
+<https://lore.kernel.org/all/20191008071527.29304-9-hch@lst.de/>`_
+device or address range on a device, the filesystem should return that
+information via ``srcmap``.
+Only pagecache and fsdax operations support reading from one mapping and
+writing to another.
+
+``->iomap_end``
+~~~~~~~~~~~~~~~
+
+After the operation completes, the ``->iomap_end`` function, if present,
+is called to signal that iomap is finished with a mapping.
+Typically, implementations will use this function to tear down any
+context that was set up in ``->iomap_begin``.
+For example, a write might wish to commit the reservations for the bytes
+that were operated upon and unreserve any space that was not operated
+upon.
+``written`` might be zero if no bytes were touched.
+``flags`` will contain the same value passed to ``->iomap_begin``.
+iomap ops for reads are not likely to need to supply this function.
+
+Both functions should return a negative errno code on error, or zero on
+success.
+
+Preparing for File Operations
+=============================
+
+iomap only handles mapping and I/O.
+Filesystems must still call out to the VFS to check input parameters
+and file state before initiating an I/O operation.
+It does not handle obtaining filesystem freeze protection, updating of
+timestamps, stripping privileges, or access control.
+
+Locking Hierarchy
+=================
+
+iomap requires that filesystems supply their own locking model.
+There are three categories of synchronization primitives, as far as
+iomap is concerned:
+
+ * The **upper** level primitive is provided by the filesystem to
+   coordinate access to different iomap operations.
+   The exact primitive is specific to the filesystem and operation,
+   but is often a VFS inode, pagecache invalidation, or folio lock.
+   For example, a filesystem might take ``i_rwsem`` before calling
+   ``iomap_file_buffered_write`` and ``iomap_file_unshare`` to prevent
+   these two file operations from clobbering each other.
+   Pagecache writeback may lock a folio to prevent other threads from
+   accessing the folio until writeback is underway.
+
+   * The **lower** level primitive is taken by the filesystem in the
+     ``->iomap_begin`` and ``->iomap_end`` functions to coordinate
+     access to the file space mapping information.
+     The fields of the iomap object should be filled out while holding
+     this primitive.
+     The upper level synchronization primitive, if any, remains held
+     while acquiring the lower level synchronization primitive.
+     For example, XFS takes ``ILOCK_EXCL`` and ext4 takes ``i_data_sem``
+     while sampling mappings.
+     Filesystems with immutable mapping information may not require
+     synchronization here.
+
+   * The **operation** primitive is taken by an iomap operation to
+     coordinate access to its own internal data structures.
+     The upper level synchronization primitive, if any, remains held
+     while acquiring this primitive.
+     The lower level primitive is not held while acquiring this
+     primitive.
+     For example, pagecache write operations will obtain a file mapping,
+     then grab and lock a folio to copy new contents.
+     It may also lock an internal folio state object to update metadata.
+
+The exact locking requirements are specific to the filesystem; for
+certain operations, some of these locks can be elided.
+All further mentions of locking are *recommendations*, not mandates.
+Each filesystem author must figure out the locking for themselves.
+
+Bugs and Limitations
+====================
+
+ * No support for fscrypt.
+ * No support for compression.
+ * No support for fsverity yet.
+ * Strong assumptions that IO should work the way it does on XFS.
+ * Does iomap *actually* work for non-regular file data?
+
+Patches welcome!
diff --git a/Documentation/filesystems/iomap/index.rst b/Documentation/filesystems/iomap/index.rst
new file mode 100644
index 000000000000..3c6a52440250
--- /dev/null
+++ b/Documentation/filesystems/iomap/index.rst
@@ -0,0 +1,13 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================
+VFS iomap Documentation
+=======================
+
+.. toctree::
+   :maxdepth: 2
+   :numbered:
+
+   design
+   operations
+   porting
diff --git a/Documentation/filesystems/iomap/operations.rst b/Documentation/filesystems/iomap/operations.rst
new file mode 100644
index 000000000000..8e6c721d2330
--- /dev/null
+++ b/Documentation/filesystems/iomap/operations.rst
@@ -0,0 +1,713 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. _iomap_operations:
+
+..
+        Dumb style notes to maintain the author's sanity:
+        Please try to start sentences on separate lines so that
+        sentence changes don't bleed colors in diff.
+        Heading decorations are documented in sphinx.rst.
+
+=========================
+Supported File Operations
+=========================
+
+.. contents:: Table of Contents
+   :local:
+
+Below is a discussion of the high level file operations that iomap
+implements.
+
+Buffered I/O
+============
+
+Buffered I/O is the default file I/O path in Linux.
+File contents are cached in memory ("pagecache") to satisfy reads and
+writes.
+Dirty cache will be written back to disk at some point that can be
+forced via ``fsync`` and variants.
+
+iomap implements nearly all the folio and pagecache management that
+filesystems have to implement themselves under the legacy I/O model.
+This means that the filesystem need not know the details of allocating,
+mapping, managing uptodate and dirty state, or writeback of pagecache
+folios.
+Under the legacy I/O model, this was managed very inefficiently with
+linked lists of buffer heads instead of the per-folio bitmaps that iomap
+uses.
+Unless the filesystem explicitly opts in to buffer heads, they will not
+be used, which makes buffered I/O much more efficient, and the pagecache
+maintainer much happier.
+
+``struct address_space_operations``
+-----------------------------------
+
+The following iomap functions can be referenced directly from the
+address space operations structure:
+
+ * ``iomap_dirty_folio``
+ * ``iomap_release_folio``
+ * ``iomap_invalidate_folio``
+ * ``iomap_is_partially_uptodate``
+
+The following address space operations can be wrapped easily:
+
+ * ``read_folio``
+ * ``readahead``
+ * ``writepages``
+ * ``bmap``
+ * ``swap_activate``
+
+``struct iomap_folio_ops``
+--------------------------
+
+The ``->iomap_begin`` function for pagecache operations may set the
+``struct iomap::folio_ops`` field to an ops structure to override
+default behaviors of iomap:
+
+.. code-block:: c
+
+ struct iomap_folio_ops {
+     struct folio *(*get_folio)(struct iomap_iter *iter, loff_t pos,
+                                unsigned len);
+     void (*put_folio)(struct inode *inode, loff_t pos, unsigned copied,
+                       struct folio *folio);
+     bool (*iomap_valid)(struct inode *inode, const struct iomap *iomap);
+ };
+
+iomap calls these functions:
+
+  - ``get_folio``: Called to allocate and return an active reference to
+    a locked folio prior to starting a write.
+    If this function is not provided, iomap will call
+    ``iomap_get_folio``.
+    This could be used to `set up per-folio filesystem state
+    <https://lore.kernel.org/all/20190429220934.10415-5-agruenba@redhat.com/>`_
+    for a write.
+
+  - ``put_folio``: Called to unlock and put a folio after a pagecache
+    operation completes.
+    If this function is not provided, iomap will ``folio_unlock`` and
+    ``folio_put`` on its own.
+    This could be used to `commit per-folio filesystem state
+    <https://lore.kernel.org/all/20180619164137.13720-6-hch@lst.de/>`_
+    that was set up by ``->get_folio``.
+
+  - ``iomap_valid``: The filesystem may not hold locks between
+    ``->iomap_begin`` and ``->iomap_end`` because pagecache operations
+    can take folio locks, fault on userspace pages, initiate writeback
+    for memory reclamation, or engage in other time-consuming actions.
+    If a file's space mapping data are mutable, it is possible that the
+    mapping for a particular pagecache folio can `change in the time it
+    takes
+    <https://lore.kernel.org/all/20221123055812.747923-8-david@fromorbit.com/>`_
+    to allocate, install, and lock that folio.
+
+    For the pagecache, races can happen if writeback doesn't take
+    ``i_rwsem`` or ``invalidate_lock`` and updates mapping information.
+    Races can also happen if the filesystem allows concurrent writes.
+    For such files, the mapping *must* be revalidated after the folio
+    lock has been taken so that iomap can manage the folio correctly.
+
+    fsdax does not need this revalidation because there's no writeback
+    and no support for unwritten extents.
+
+    Filesystems subject to this kind of race must provide a
+    ``->iomap_valid`` function to decide if the mapping is still valid.
+    If the mapping is not valid, the mapping will be sampled again.
+
+    To support making the validity decision, the filesystem's
+    ``->iomap_begin`` function may set ``struct iomap::validity_cookie``
+    at the same time that it populates the other iomap fields.
+    A simple validation cookie implementation is a sequence counter.
+    If the filesystem bumps the sequence counter every time it modifies
+    the inode's extent map, it can be placed in the ``struct
+    iomap::validity_cookie`` during ``->iomap_begin``.
+    If the value in the cookie is found to be different to the value
+    the filesystem holds when the mapping is passed back to
+    ``->iomap_valid``, then the iomap should be considered stale and the
+    validation failed.
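+
+A minimal user-space model of the sequence counter approach described
+above might look like the following.
+All ``toy_*`` names are hypothetical stand-ins; only the
+``validity_cookie`` field name comes from ``struct iomap``.
+
+.. code-block:: c
+
+ #include <stdbool.h>
+ #include <stdint.h>
+
+ /* Stand-in for the one struct iomap field that matters here. */
+ struct toy_iomap {
+     uint64_t validity_cookie;
+ };
+
+ /* Hypothetical per-inode state: bumped on every extent map change. */
+ struct toy_inode {
+     uint64_t extent_seq;
+ };
+
+ /* What ->iomap_begin would do while sampling the mapping. */
+ void toy_sample_mapping(const struct toy_inode *ip,
+                         struct toy_iomap *iomap)
+ {
+     iomap->validity_cookie = ip->extent_seq;
+ }
+
+ /* Every extent map modification invalidates outstanding mappings. */
+ void toy_change_extent_map(struct toy_inode *ip)
+ {
+     ip->extent_seq++;
+ }
+
+ /* What ->iomap_valid would do once the folio lock is held. */
+ bool toy_iomap_valid(const struct toy_inode *ip,
+                      const struct toy_iomap *iomap)
+ {
+     return iomap->validity_cookie == ip->extent_seq;
+ }
+
+A real filesystem would also need to decide how the counter is read and
+updated safely against concurrent extent map changes; that is left out
+of this model.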
+
+These ``struct kiocb`` flags are significant for buffered I/O with iomap:
+
+ * ``IOCB_NOWAIT``: Turns on ``IOMAP_NOWAIT``.
+
+Internal per-Folio State
+------------------------
+
+If the fsblock size matches the size of a pagecache folio, it is assumed
+that all disk I/O operations will operate on the entire folio.
+The uptodate (memory contents are at least as new as what's on disk) and
+dirty (memory contents are newer than what's on disk) status of the
+folio are all that's needed for this case.
+
+If the fsblock size is less than the size of a pagecache folio, iomap
+tracks the per-fsblock uptodate and dirty state itself.
+This enables iomap to handle both "bs < ps" `filesystems
+<https://lore.kernel.org/all/20230725122932.144426-1-ritesh.list@gmail.com/>`_
+and large folios in the pagecache.
+
+iomap internally tracks two state bits per fsblock:
+
+ * ``uptodate``: iomap will try to keep folios fully up to date.
+   If there are read(ahead) errors, those fsblocks will not be marked
+   uptodate.
+   The folio itself will be marked uptodate when all fsblocks within the
+   folio are uptodate.
+
+ * ``dirty``: iomap will set the per-block dirty state when programs
+   write to the file.
+   The folio itself will be marked dirty when any fsblock within the
+   folio is dirty.
+
+iomap also tracks the number of read and write disk I/Os that are in
+flight.
+This structure is much lighter weight than ``struct buffer_head``
+because there is only one per folio, and the per-fsblock overhead is two
+bits vs. 104 bytes.
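+
+The relationship between the per-fsblock bits and the folio-level state
+can be illustrated with a small user-space model.
+This is only a sketch: the ``toy_*`` names are hypothetical, the model
+is capped at 32 fsblocks per folio, and it does not claim to match the
+layout of the structure that iomap really uses.
+
+.. code-block:: c
+
+ #include <stdbool.h>
+ #include <stdint.h>
+
+ /* Hypothetical model: one bit per fsblock in each word. */
+ struct toy_folio_state {
+     unsigned int nr_blocks;
+     uint32_t uptodate;
+     uint32_t dirty;
+ };
+
+ /* Mask with one bit set for every fsblock in the folio. */
+ static uint32_t toy_all_blocks(const struct toy_folio_state *s)
+ {
+     return s->nr_blocks >= 32 ? UINT32_MAX : (1u << s->nr_blocks) - 1;
+ }
+
+ void toy_set_uptodate(struct toy_folio_state *s, unsigned int block)
+ {
+     s->uptodate |= 1u << block;
+ }
+
+ void toy_set_dirty(struct toy_folio_state *s, unsigned int block)
+ {
+     s->dirty |= 1u << block;
+ }
+
+ /* The folio is uptodate only when every fsblock is uptodate. */
+ bool toy_folio_uptodate(const struct toy_folio_state *s)
+ {
+     return (s->uptodate & toy_all_blocks(s)) == toy_all_blocks(s);
+ }
+
+ /* The folio is dirty as soon as any fsblock is dirty. */
+ bool toy_folio_dirty(const struct toy_folio_state *s)
+ {
+     return s->dirty != 0;
+ }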
+
+Filesystems wishing to turn on large folios in the pagecache should call
+``mapping_set_large_folios`` when initializing the incore inode.
+
+Buffered Readahead and Reads
+----------------------------
+
+The ``iomap_readahead`` function initiates readahead to the pagecache.
+The ``iomap_read_folio`` function reads one folio's worth of data into
+the pagecache.
+The ``flags`` argument to ``->iomap_begin`` will be set to zero.
+The pagecache takes whatever locks it needs before calling the
+filesystem.
+
+Buffered Writes
+---------------
+
+The ``iomap_file_buffered_write`` function writes an ``iocb`` to the
+pagecache.
+``IOMAP_WRITE`` or ``IOMAP_WRITE | IOMAP_NOWAIT`` will be passed as
+the ``flags`` argument to ``->iomap_begin``.
+Callers commonly take ``i_rwsem`` in either shared or exclusive mode
+before calling this function.
+
+mmap Write Faults
+~~~~~~~~~~~~~~~~~
+
+The ``iomap_page_mkwrite`` function handles a write fault to a folio in
+the pagecache.
+``IOMAP_WRITE | IOMAP_FAULT`` will be passed as the ``flags`` argument
+to ``->iomap_begin``.
+Callers commonly take the mmap ``invalidate_lock`` in shared or
+exclusive mode before calling this function.
+
+Buffered Write Failures
+~~~~~~~~~~~~~~~~~~~~~~~
+
+After a short write to the pagecache, the areas not written will not
+become marked dirty.
+The filesystem must arrange to `cancel
+<https://lore.kernel.org/all/20221123055812.747923-6-david@fromorbit.com/>`_
+such `reservations
+<https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/>`_
+because writeback will not consume the reservation.
+The ``iomap_file_buffered_write_punch_delalloc`` function can be called
+from a ``->iomap_end`` function to find all the clean areas of the
+folios caching a fresh (``IOMAP_F_NEW``) delalloc mapping.
+It takes the ``invalidate_lock``.
+
+The filesystem must supply a function ``punch`` to be called for
+each file range in this state.
+This function must *only* remove delayed allocation reservations, in
+case another thread racing with the current thread writes successfully
+to the same region and triggers writeback to flush the dirty data out to
+disk.
+
+Zeroing for File Operations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Filesystems can call ``iomap_zero_range`` to perform zeroing of the
+pagecache for non-truncation file operations that are not aligned to
+the fsblock size.
+``IOMAP_ZERO`` will be passed as the ``flags`` argument to
+``->iomap_begin``.
+Callers typically hold ``i_rwsem`` and ``invalidate_lock`` in exclusive
+mode before calling this function.
+
+Unsharing Reflinked File Data
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Filesystems can call ``iomap_file_unshare`` to force a file sharing
+storage with another file to preemptively copy the shared data to newly
+allocated storage.
+``IOMAP_WRITE | IOMAP_UNSHARE`` will be passed as the ``flags`` argument
+to ``->iomap_begin``.
+Callers typically hold ``i_rwsem`` and ``invalidate_lock`` in exclusive
+mode before calling this function.
+
+Truncation
+----------
+
+Filesystems can call ``iomap_truncate_page`` to zero the bytes in the
+pagecache from EOF to the end of the fsblock during a file truncation
+operation.
+``truncate_setsize`` or ``truncate_pagecache`` will take care of
+everything after the EOF block.
+``IOMAP_ZERO`` will be passed as the ``flags`` argument to
+``->iomap_begin``.
+Callers typically hold ``i_rwsem`` and ``invalidate_lock`` in exclusive
+mode before calling this function.
+
+Pagecache Writeback
+-------------------
+
+Filesystems can call ``iomap_writepages`` to respond to a request to
+write dirty pagecache folios to disk.
+The ``mapping`` and ``wbc`` parameters should be passed unchanged.
+The ``wpc`` pointer should be allocated by the filesystem and must
+be initialized to zero.
+
+The pagecache will lock each folio before trying to schedule it for
+writeback.
+It does not lock ``i_rwsem`` or ``invalidate_lock``.
+
+The dirty bit will be cleared for all folios run through the
+``->map_blocks`` machinery described below even if the writeback fails.
+This is to prevent dirty folio clots when storage devices fail; an
+``-EIO`` is recorded for userspace to collect via ``fsync``.
+
+The ``ops`` structure must be specified and is as follows:
+
+``struct iomap_writeback_ops``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ struct iomap_writeback_ops {
+     int (*map_blocks)(struct iomap_writepage_ctx *wpc, struct inode *inode,
+                       loff_t offset, unsigned len);
+     int (*prepare_ioend)(struct iomap_ioend *ioend, int status);
+     void (*discard_folio)(struct folio *folio, loff_t pos);
+ };
+
+The fields are as follows:
+
+  - ``map_blocks``: Sets ``wpc->iomap`` to the space mapping of the file
+    range (in bytes) given by ``offset`` and ``len``.
+    iomap calls this function for each dirty fsblock in each dirty folio,
+    though it will `reuse mappings
+    <https://lore.kernel.org/all/20231207072710.176093-15-hch@lst.de/>`_
+    for runs of contiguous dirty fsblocks within a folio.
+    Do not return ``IOMAP_INLINE`` mappings here; the ``->iomap_end``
+    function must deal with persisting written data.
+    Do not return ``IOMAP_DELALLOC`` mappings here; iomap currently
+    requires mapping to allocated space.
+    Filesystems can skip a potentially expensive mapping lookup if the
+    mappings have not changed.
+    This revalidation must be open-coded by the filesystem; it is
+    unclear if ``iomap::validity_cookie`` can be reused for this
+    purpose.
+    This function must be supplied by the filesystem.
+
+  - ``prepare_ioend``: Enables filesystems to transform the writeback
+    ioend or perform any other preparatory work before the writeback I/O
+    is submitted.
+    This might include pre-write space accounting updates, or installing
+    a custom ``->bi_end_io`` function for internal purposes, such as
+    deferring the ioend completion to a workqueue to run metadata update
+    transactions from process context.
+    This function is optional.
+
+  - ``discard_folio``: iomap calls this function after ``->map_blocks``
+    fails to schedule I/O for any part of a dirty folio.
+    The function should throw away any reservations that may have been
+    made for the write.
+    The folio will be marked clean and an ``-EIO`` recorded in the
+    pagecache.
+    Filesystems can use this callback to `remove
+    <https://lore.kernel.org/all/20201029163313.1766967-1-bfoster@redhat.com/>`_
+    delalloc reservations to avoid having delalloc reservations for
+    clean pagecache.
+    This function is optional.
+
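The mapping revalidation mentioned under ``map_blocks`` is left to the filesystem. One possible approach, sketched here in userspace C, is to remember the mapping from the previous call together with a sequence number, and to repeat the lookup only when the offset falls outside the cached range or the sequence number has moved. The structure, the function name, and the existence of a per-inode sequence counter are all assumptions made for illustration; none of them are part of the iomap API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for filesystem-private state; not a kernel type. */
struct cached_mapping {
	uint64_t offset;	/* file offset of the cached mapping, in bytes */
	uint64_t length;	/* length of the cached mapping, in bytes */
	uint32_t seq;		/* extent map sequence sampled at lookup time */
	bool     valid;		/* false until the first lookup completes */
};

/*
 * Return true if the cached mapping can be reused for the byte at @offset.
 * @cur_seq is the filesystem's current extent map sequence counter, which
 * is assumed to be bumped whenever the extent map changes.
 */
static bool mapping_still_valid(const struct cached_mapping *cm,
				uint64_t offset, uint32_t cur_seq)
{
	assert(cm);

	if (!cm->valid)
		return false;
	/* The offset must fall inside the cached byte range. */
	if (offset < cm->offset || offset - cm->offset >= cm->length)
		return false;
	/* Any extent map change since the lookup invalidates the cache. */
	return cm->seq == cur_seq;
}
```

A ``->map_blocks`` implementation built this way would return early when the check succeeds, and otherwise redo the lookup and refresh the cached range and sequence number.
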
+Pagecache Writeback Completion
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+To handle the bookkeeping that must happen after disk I/O for writeback
+completes, iomap creates chains of ``struct iomap_ioend`` objects that
+wrap the ``bio`` that is used to write pagecache data to disk.
+By default, iomap finishes writeback ioends by clearing the writeback
+bit on the folios attached to the ``ioend``.
+If the write failed, it will also set the error bits on the folios and
+the address space.
+This can happen in interrupt or process context, depending on the
+storage device.
+
+Filesystems that need to update internal bookkeeping (e.g. unwritten
+extent conversions) should provide a ``->prepare_ioend`` function to
+set ``struct iomap_ioend::bio::bi_end_io`` to its own function.
+This function should call ``iomap_finish_ioends`` after finishing its
+own work (e.g. unwritten extent conversion).
+
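The handler substitution described above can be modeled in userspace with plain function pointers. This is only a sketch of the control flow: every ``model_*`` and ``fs_*`` name is hypothetical, the structures are not the kernel's, and ``model_finish_ioend`` merely stands in for ``iomap_finish_ioends``.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; these are not the kernel structures. */
struct model_bio {
	void (*bi_end_io)(struct model_bio *bio);
	int   status;		/* 0 on success, negative errno on failure */
};

struct model_ioend {
	struct model_bio bio;	/* embedded, so the ioend can be recovered */
	bool unwritten;		/* range needs unwritten extent conversion */
	bool converted;		/* set once the filesystem has converted it */
	bool finished;		/* set once generic completion has run */
};

/* Recover the enclosing ioend from its embedded bio. */
static struct model_ioend *ioend_of(struct model_bio *bio)
{
	return (struct model_ioend *)((char *)bio -
				      offsetof(struct model_ioend, bio));
}

/* Stand-in for iomap_finish_ioends(): the generic completion work. */
static void model_finish_ioend(struct model_ioend *ioend)
{
	ioend->finished = true;
}

/* Default completion handler: generic work only. */
static void default_end_io(struct model_bio *bio)
{
	model_finish_ioend(ioend_of(bio));
}

/* Filesystem handler: private bookkeeping first, then the generic work. */
static void fs_end_io(struct model_bio *bio)
{
	struct model_ioend *ioend = ioend_of(bio);

	if (ioend->unwritten && bio->status == 0)
		ioend->converted = true;
	model_finish_ioend(ioend);
}

/* Modeled on ->prepare_ioend: install the handler before submission. */
static int fs_prepare_ioend(struct model_ioend *ioend, int status)
{
	assert(ioend != NULL);

	if (status == 0 && ioend->unwritten)
		ioend->bio.bi_end_io = fs_end_io;
	return status;
}
```

In this model an ioend covering an unwritten range completes through ``fs_end_io``, while any other ioend keeps the default handler. A real filesystem might instead queue the ioend to a workqueue from its handler so that the conversion runs in process context.
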
+Some filesystems may wish to `amortize the cost of running metadata
+transactions
+<https://lore.kernel.org/all/20220120034733.221737-1-david@fromorbit.com/>`_
+for post-writeback updates by batching them.
+They may also require transactions to run from process context, which
+implies punting batches to a workqueue.
+iomap ioends contain a ``list_head`` to enable batching.
+
+Given a batch of ioends, iomap has a few helpers to assist with
+amortization:
+
+ * ``iomap_sort_ioends``: Sort all the ioends in the list by file
+   offset.
+
+ * ``iomap_ioend_try_merge``: Given an ioend that is not in any list and
+   a separate list of sorted ioends, merge as many of the ioends from
+   the head of the list as possible into the given ioend.
+   ioends can only be merged if the file range and storage addresses are
+   contiguous; the unwritten and shared status are the same; and the
+   write I/O outcome is the same.
+   The merged ioends become their own list.
+
+ * ``iomap_finish_ioends``: Finish an ioend that possibly has other
+   ioends linked to it.
+
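The merge conditions listed above can be illustrated with a small userspace model. This is a sketch under stated assumptions rather than the kernel implementation: ``struct model_ioend`` and the function names are hypothetical, and the model coalesces every mergeable run in an array, whereas the kernel helper merges ioends from the head of a list into one given ioend.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical userspace model of an ioend; not the kernel structure. */
struct model_ioend {
	uint64_t offset;	/* file offset, in bytes */
	uint64_t size;		/* length, in bytes */
	uint64_t addr;		/* storage address, in bytes */
	bool     unwritten;
	bool     shared;
	int      error;		/* write I/O outcome, 0 on success */
};

/* qsort() comparator: order ioends by file offset. */
static int cmp_offset(const void *a, const void *b)
{
	const struct model_ioend *x = a, *y = b;

	return (x->offset > y->offset) - (x->offset < y->offset);
}

/* The merge conditions: contiguity, same status flags, same outcome. */
static bool can_merge(const struct model_ioend *io,
		      const struct model_ioend *next)
{
	return io->offset + io->size == next->offset &&
	       io->addr + io->size == next->addr &&
	       io->unwritten == next->unwritten &&
	       io->shared == next->shared &&
	       io->error == next->error;
}

/*
 * Sort @ioends by file offset, then fold each mergeable neighbour into
 * its predecessor.  Returns the number of ioends left after merging.
 */
static size_t sort_and_merge(struct model_ioend *ioends, size_t n)
{
	size_t out = 0;

	if (n == 0)
		return 0;
	assert(ioends != NULL);

	qsort(ioends, n, sizeof(*ioends), cmp_offset);
	for (size_t i = 1; i < n; i++) {
		if (can_merge(&ioends[out], &ioends[i]))
			ioends[out].size += ioends[i].size;
		else
			ioends[++out] = ioends[i];
	}
	return out + 1;
}
```

Sorting first is what makes a single forward pass sufficient, since mergeable ioends then sit next to each other. Fewer, larger ioends mean fewer post-writeback metadata updates for the filesystem to run.
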
+Direct I/O
+==========
+
+In Linux, direct I/O is defined as file I/O that is issued directly to
+storage, bypassing the pagecache.
+The ``iomap_dio_rw`` function implements O_DIRECT (direct I/O) reads and
+writes for files.
+
+.. code-block:: c
+
+ ssize_t iomap_di