===============
Pathname lookup
===============
This write-up is based on three articles published at lwn.net:
- <https://lwn.net/Articles/649115/> Pathname lookup in Linux
- <https://lwn.net/Articles/649729/> RCU-walk: faster pathname lookup in Linux
- <https://lwn.net/Articles/650786/> A walk among the symlinks
Written by Neil Brown with help from Al Viro and Jon Corbet.
It has subsequently been updated to reflect changes in the kernel
including:
- per-directory parallel name lookup.
- ``openat2()`` resolution restriction flags.
Introduction to pathname lookup
===============================
The most obvious aspect of pathname lookup, which very little
exploration is needed to discover, is that it is complex. There are
many rules, special cases, and implementation alternatives that all
combine to confuse the unwary reader. Computer science has long been
acquainted with such complexity and has tools to help manage it. One
tool that we will make extensive use of is "divide and conquer". For
the early parts of the analysis we will divide off symlinks - leaving
them until the final part. Well before we get to symlinks we have
another major division based on the VFS's approach to locking which
will allow us to review "REF-walk" and "RCU-walk" separately. But we
are getting ahead of ourselves. There are some important low level
distinctions we need to clarify first.
There are two sorts of ...
--------------------------
.. _openat: http://man7.org/linux/man-pages/man2/openat.2.html
Pathnames (sometimes "file names"), used to identify objects in the
filesystem, will be familiar to most readers. They contain two sorts
of elements: "slashes" that are sequences of one or more "``/``"
characters, and "components" that are sequences of one or more
non-"``/``" characters. These form two kinds of paths. Those that
start with slashes are "absolute" and start from the filesystem root.
The others are "relative" and start from the current directory, or
from some other location specified by a file descriptor given to
"``*at()``" system calls such as `openat() <openat_>`_.
.. _execveat: http://man7.org/linux/man-pages/man2/execveat.2.html
It is tempting to describe the second kind as starting with a
component, but that isn't always accurate: a pathname can lack both
slashes and components, it can be empty, in other words. This is
generally forbidden in POSIX, but some of those "``*at()``" system calls
in Linux permit it when the ``AT_EMPTY_PATH`` flag is given. For
example, if you have an open file descriptor on an executable file you
can execute it by calling `execveat() <execveat_>`_ passing
the file descriptor, an empty path, and the ``AT_EMPTY_PATH`` flag.
These paths can be divided into two sections: the final component and
everything else. The "everything else" is the easy bit. In all cases
it must identify a directory that already exists, otherwise an error
such as ``ENOENT`` or ``ENOTDIR`` will be reported.
The final component is not so simple. Not only do different system
calls interpret it quite differently (e.g. some create it, some do
not), but it might not even exist: neither the empty pathname nor the
pathname that is just slashes have a final component. If it does
exist, it could be "``.``" or "``..``" which are handled quite differently
from other components.
.. _POSIX: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_12
If a pathname ends with a slash, such as "``/tmp/foo/``" it might be
tempting to consider that to have an empty final component. In many
ways that would lead to correct results, but not always. In
particular, ``mkdir()`` and ``rmdir()`` each create or remove a directory named
by the final component, and they are required to work with pathnames
ending in "``/``". According to POSIX_:
A pathname that contains at least one non-<slash> character and
that ends with one or more trailing <slash> characters shall not
be resolved successfully unless the last pathname component before
the trailing <slash> characters names an existing directory or a
directory entry that is to be created for a directory immediately
after the pathname is resolved.
The Linux pathname walking code (mostly in ``fs/namei.c``) deals with
all of these issues: breaking the path into components, handling the
"everything else" quite separately from the final component, and
checking that the trailing slash is not used where it isn't
permitted. It also addresses the important issue of concurrent
access.
While one process is looking up a pathname, another might be making
changes that affect that lookup. One fairly extreme case is that if
"a/b" were renamed to "a/c/b" while another process were looking up
"a/b/..", that process might successfully resolve on "a/c".
Most races are much more subtle, and a big part of the task of
pathname lookup is to prevent them from having damaging effects. Many
of the possible races are seen most clearly in the context of the
"dcache" and an understanding of that is central to understanding
pathname lookup.
More than just a cache
----------------------
The "dcache" caches information about names in each filesystem to
make them quickly available for lookup. Each entry (known as a
"dentry") contains three significant fields: a component name, a
pointer to a parent dentry, and a pointer to the "inode" which
contains further information about the object in that parent with
the given name. The inode pointer can be ``NULL`` indicating that the
name doesn't exist in the parent. While there can be linkage in the
dentry of a directory to the dentries of the children, that linkage is
not used for pathname lookup, and so will not be considered here.
The dcache has a number of uses apart from accelerating lookup. One
that will be particularly relevant is that it is closely integrated
with the mount table that records which filesystem is mounted where.
What the mount table actually stores is which dentry is mounted on top
of which other dentry.
When considering the dcache, we have another of our "two types"
distinctions: there are two types of filesystems.
Some filesystems ensure that the information in the dcache is always
completely accurate (though not necessarily complete). This can allow
the VFS to determine if a particular file does or doesn't exist
without checking with the filesystem, and means that the VFS can
protect the filesystem against certain races and other problems.
These are typically "local" filesystems such as ext3, XFS, and Btrfs.
Other filesystems don't provide that guarantee because they cannot.
These are typically filesystems that are shared across a network,
whether remote filesystems like NFS and 9P, or cluster filesystems
like ocfs2 or cephfs. These filesystems allow the VFS to revalidate
cac