.. SPDX-License-Identifier: GPL-2.0

Idmappings
==========

Most filesystem developers will have encountered idmappings. They are used when
reading from or writing ownership to disk, reporting ownership to userspace, or
for permission checking. This document is aimed at filesystem developers that
want to know how idmappings work.

Formal notes
------------

An idmapping is essentially a translation of a range of ids into another or the
same range of ids. The notational convention for idmappings that is widely used
in userspace is::

 u:k:r

``u`` indicates the first element in the upper idmapset ``U`` and ``k``
indicates the first element in the lower idmapset ``K``. The ``r`` parameter
indicates the range of the idmapping, i.e. how many ids are mapped. From now
on, we will always prefix ids with ``u`` or ``k`` to make it clear whether
we're talking about an id in the upper or lower idmapset.

To see what this looks like in practice, let's take the following idmapping::

 u22:k10000:r3

and write down the mappings it will generate::

 u22 -> k10000
 u23 -> k10001
 u24 -> k10002

From a mathematical viewpoint ``U`` and ``K`` are well-ordered sets and an
idmapping is an order isomorphism from ``U`` into ``K``. So ``U`` and ``K`` are
order isomorphic. In fact, ``U`` and ``K`` are always well-ordered subsets of
the set of all possible ids useable on a given system.

Looking at this mathematically briefly will help us highlight some properties
that make it easier to understand how we can translate between idmappings. For
example, we know that the inverse idmapping is an order isomorphism as well::

 k10000 -> u22
 k10001 -> u23
 k10002 -> u24

Given that we are dealing with order isomorphisms plus the fact that we're
dealing with subsets we can embed idmappings into each other, i.e. we can
sensibly translate between different idmappings. For example, assume we've been
given the three idmappings::

 1. u0:k10000:r10000
 2. u0:k20000:r10000
 3. u0:k30000:r10000

and id ``k11000`` which has been generated by the first idmapping by mapping
``u1000`` from the upper idmapset down to ``k11000`` in the lower idmapset.

Because we're dealing with order isomorphic subsets it is meaningful to ask
what id ``k11000`` corresponds to in the second or third idmapping. The
straightforward algorithm to use is to apply the inverse of the first idmapping,
mapping ``k11000`` up to ``u1000``. Afterwards, we can map ``u1000`` down using
either the second or the third idmapping. The second idmapping would map
``u1000`` down to ``k21000``. The third idmapping would map ``u1000`` down to
``k31000``.

If we were given the same task for the following three idmappings::

 1. u0:k10000:r10000
 2. u0:k20000:r200
 3. u0:k30000:r300

we would fail to translate as the sets aren't order isomorphic over the full
range of the first idmapping anymore. (However, they are order isomorphic over
the full range of the second idmapping.) Neither the second nor the third
idmapping contains ``u1000`` in the upper idmapset ``U``. This is equivalent to
not having an id mapped. We can simply say that ``u1000`` is unmapped in the
second and third idmapping. The kernel will report unmapped ids as the
overflowuid ``(uid_t)-1`` or overflowgid ``(gid_t)-1`` to userspace.
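The translation between idmappings described above can be sketched in C. This
is an illustration only: ``struct idmap``, ``map_up()``, ``map_down()``,
``translate()`` and ``UNMAPPED`` are hypothetical names and not kernel
interfaces, and the sketch assumes an idmapping consists of a single ``u:k:r``
range.

```c
#include <stdint.h>

/* Hypothetical single-range idmapping, written u:k:r in the text. */
struct idmap {
	uint32_t u;	/* first id in the upper idmapset U */
	uint32_t k;	/* first id in the lower idmapset K */
	uint32_t r;	/* number of ids mapped */
};

/* Sentinel for "no mapping"; stands in for (uid_t)-1. */
#define UNMAPPED UINT32_MAX

/* Apply the inverse idmapping: lower idmapset -> upper idmapset. */
uint32_t map_up(struct idmap m, uint32_t id)
{
	if (id < m.k || id - m.k >= m.r)
		return UNMAPPED;
	return id - m.k + m.u;
}

/* Apply the idmapping: upper idmapset -> lower idmapset. */
uint32_t map_down(struct idmap m, uint32_t id)
{
	if (id < m.u || id - m.u >= m.r)
		return UNMAPPED;
	return id - m.u + m.k;
}

/* Translate a lower id of @from into the corresponding lower id of @to. */
uint32_t translate(struct idmap from, struct idmap to, uint32_t id)
{
	uint32_t upper = map_up(from, id);

	if (upper == UNMAPPED)
		return UNMAPPED;
	return map_down(to, upper);
}
```

With the first set of idmappings, ``translate()`` turns ``k11000`` into
``k21000`` or ``k31000``. With the second set it returns ``UNMAPPED`` because
``u1000`` lies outside the ranges ``r200`` and ``r300``.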

The algorithm to calculate what a given id maps to is pretty simple. First, we
need to verify that the range can contain our target id. We will skip this step
for simplicity. After that if we want to know what ``id`` maps to we can do
simple calculations:

- If we want to map from left to right::

   u:k:r
   id - u + k = n

- If we want to map from right to left::

   u:k:r
   id - k + u = n

Instead of "left to right" we can also say "down" and instead of "right to
left" we can also say "up". Obviously mapping down and up invert each other.

To see whether the simple formulas above work, consider the following two
idmappings::

 1. u0:k20000:r10000
 2. u500:k30000:r10000

Assume we are given ``k21000`` in the lower idmapset of the first idmapping. We
want to know what id this was mapped from in the upper idmapset of the first
idmapping. So we're mapping up in the first idmapping::

 id     - k      + u  = n
 k21000 - k20000 + u0 = u1000

Now assume we are given the id ``u1100`` in the upper idmapset of the second
idmapping and we want to know what this id maps down to in the lower idmapset
of the second idmapping. This means we're mapping down in the second
idmapping::

 id    - u    + k      = n
 u1100 - u500 + k30000 = k30600
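The two formulas, together with the range check that was skipped above, might
look as follows in C. The names ``struct idmap``, ``map_down()``, ``map_up()``
and ``UNMAPPED`` are assumptions made for this sketch and are not kernel
interfaces.

```c
#include <stdint.h>

/* Hypothetical single-range idmapping u:k:r. */
struct idmap {
	uint32_t u;	/* first id in the upper idmapset */
	uint32_t k;	/* first id in the lower idmapset */
	uint32_t r;	/* number of ids mapped */
};

/* Sentinel returned when the id is outside the mapped range. */
#define UNMAPPED UINT32_MAX

/* Map from left to right ("down"): id - u + k = n */
uint32_t map_down(struct idmap m, uint32_t id)
{
	/* Range check: id must lie in [u, u + r). */
	if (id < m.u || id - m.u >= m.r)
		return UNMAPPED;
	return id - m.u + m.k;
}

/* Map from right to left ("up"): id - k + u = n */
uint32_t map_up(struct idmap m, uint32_t id)
{
	/* Range check: id must lie in [k, k + r). */
	if (id < m.k || id - m.k >= m.r)
		return UNMAPPED;
	return id - m.k + m.u;
}
```

Called with the two idmappings above, ``map_up()`` yields ``u1000`` for
``k21000`` in the first idmapping and ``map_down()`` yields ``k30600`` for
``u1100`` in the second one.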

General notes
-------------

In the context of the kernel an idmapping can be interpreted as mapping a range
of userspace ids into a range of kernel ids::

 userspace-id:kernel-id:range

A userspace id is always an element in the upper idmapset of an idmapping of
type ``uid_t`` or ``gid_t`` and a kernel id is always an element in the lower
idmapset of an idmapping of type ``kuid_t`` or ``kgid_t``. From now on
"userspace id" will be used to refer to the well known ``uid_t`` and ``gid_t``
types and "kernel id" will be used to refer to ``kuid_t`` and ``kgid_t``.

The kernel is mostly concerned with kernel ids. They are used when performing
permission checks and are stored in an inode's ``i_uid`` and ``i_gid`` field.
A userspace id on the other hand is an id that is reported to userspace by the
kernel, or is passed by userspace to the kernel, or a raw device id that is
written or read from disk.

Note that we are only concerned with idmappings as the kernel stores them not
how userspace would specify them.

For the rest of this document we will prefix all userspace ids with ``u`` and
all kernel ids with ``k``. Ranges of idmappings will be prefixed with ``r``. So
an idmapping will be written as ``u0:k10000:r10000``.

For example, within this idmapping, the id ``u1000`` is an id in the upper
idmapset or "userspace idmapset" starting with ``u0``. And it is mapped to
``k11000`` which is a kernel id in the lower idmapset or "kernel idmapset"
starting with ``k10000``.

A kernel id is always created by an idmapping. Such idmappings are associated
with user namespaces. Since we mainly care about how idmappings work we're not
going to be concerned with how idmappings are created nor how they are used
outside of the filesystem context. This is best left to an explanation of user
namespaces.

The initial user namespace is special. It always has an idmapping of the
following form::

 u0:k0:r4294967295

which is an identity idmapping over the full range of ids available on this
system.

Other user namespaces usually have non-identity idmappings such as::

 u0:k10000:r10000

When a process creates or wants to change ownership of a file, or when the
ownership of a file is read from disk by a filesystem, the userspace id is
immediately translated into a kernel id according to the idmapping associated
with the relevant user namespace.

For instance, consider a file that is stored on disk by a filesystem as being
owned by ``u1000``:

- If a filesystem were to be mounted in the initial user namespace (as most
  filesystems are) then the initial idmapping will be used. As we saw this is
  simply the identity idmapping. This would mean id ``u1000`` read from disk
  would be mapped to id ``k1000``. So an inode's ``i_uid`` and ``i_gid`` field
  would contain ``k1000``.

- If a filesystem were to be mounted with an idmapping of ``u0:k10000:r10000``
  then ``u1000`` read from disk would be mapped to ``k11000``. So an inode's
  ``i_uid`` and ``i_gid`` would contain ``k11000``.

Translation algorithms
----------------------

We've already seen briefly that it is possible to translate between different
idmappings. We'll now take a closer look at how that works.

Crossmapping
~~~~~~~~~~~~

This translation algorithm is used by the kernel in quite a few places. For
example, it is used when reporting back the ownership of a file to userspace
via the ``stat()`` system call family.

If we've been given ``k11000`` from one idmapping we can map that id up in
another idmapping. In order for this to work both idmappings need to contain
the same kernel id in their kernel idmapsets. For example, consider the
following idmappings::

 1. u0:k10000:r10000
 2. u20000:k10000:r10000

and we are mapping ``u1000`` down to ``k11000`` in the first idmapping. We can
then translate ``k11000`` into a userspace id in the second idmapping using the
kernel idmapset of the second idmapping::

 /* Map the kernel id up into a userspace id in the second idmapping. */
 from_kuid(u20000:k10000:r10000, k11000) = u21000

Note, how we can get back to the kernel id in the first idmapping by inverting
the algorithm::

 /* Map the userspace id down into a kernel id in the second idmapping. */
 make_kuid(u20000:k10000:r10000, u21000) = k11000

 /* Map the kernel id up into a userspace id in the first idmapping. */
 from_kuid(u0:k10000:r10000, k11000) = u1000

This algorithm allows us to answer the question what userspace id a given
kernel id corresponds to in a given idmapping. In order to be able to answer
this question both idmappings need to contain the same kernel id in their
respective kernel idmapsets.

For example, when the kernel reads a raw userspace id from disk it maps it down
into a kernel id according to the idmapping associated with the filesystem.
Let's assume the filesystem was mounted with an idmapping of
``u0:k20000:r10000`` and it reads a file owned by ``u1000`` from disk. This
means ``u1000`` will be mapped to ``k21000`` which is what will be stored in
the inode's ``i_uid`` and ``i_gid`` field.

When someone in userspace calls ``stat()`` or a related function to get
ownership information about the file the kernel can't simply map the id back up
according to the filesystem's idmapping as this would give the wrong owner if
the caller is using an idmapping.

So the kernel will map the id back up in the idmapping of the caller. Let's
assume the caller has the slightly unconventional idmapping
``u3000:k20000:r10000`` then ``k21000`` would map back up to ``u4000``.
Consequently the user would see that this file is owned by ``u4000``.
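The ``stat()`` path described above can be sketched as follows. This is a
simplified model and not the kernel implementation: ``sketch_make_kuid()`` and
``sketch_from_kuid()`` are hypothetical stand-ins that take a single ``u:k:r``
range instead of a user namespace, and ``reported_uid()`` is a name invented
for this example.

```c
#include <stdint.h>

/* Hypothetical single-range idmapping u:k:r. */
struct idmap {
	uint32_t u;	/* first userspace id */
	uint32_t k;	/* first kernel id */
	uint32_t r;	/* number of ids mapped */
};

#define UNMAPPED UINT32_MAX	/* stands in for (uid_t)-1 */

/* Map a userspace id down into a kernel id. */
uint32_t sketch_make_kuid(struct idmap m, uint32_t uid)
{
	if (uid < m.u || uid - m.u >= m.r)
		return UNMAPPED;
	return uid - m.u + m.k;
}

/* Map a kernel id up into a userspace id. */
uint32_t sketch_from_kuid(struct idmap m, uint32_t kuid)
{
	if (kuid < m.k || kuid - m.k >= m.r)
		return UNMAPPED;
	return kuid - m.k + m.u;
}

/* Crossmapping: the owner a stat() caller would see for an on-disk id. */
uint32_t reported_uid(struct idmap fs, struct idmap caller, uint32_t disk_uid)
{
	/* Down in the filesystem's idmapping; this is what i_uid holds. */
	uint32_t kuid = sketch_make_kuid(fs, disk_uid);

	if (kuid == UNMAPPED)
		return UNMAPPED;
	/* Up in the caller's idmapping, not the filesystem's. */
	return sketch_from_kuid(caller, kuid);
}
```

For the filesystem idmapping ``u0:k20000:r10000`` and the caller idmapping
``u3000:k20000:r10000`` this reports ``u4000`` for the on-disk id ``u1000``. A
caller whose kernel idmapset does not contain ``k21000`` gets ``UNMAPPED``.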

Remapping
~~~~~~~~~

It is possible to translate a kernel id from one idmapping to another one via
the userspace idmapset of the two idmappings. This is equivalent to remapping
a kernel id.

Let's look at an example. We are given the following two idmappings::

 1. u0:k10000:r10000
 2. u0:k20000:r10000

and we are given ``k11000`` in the first idmapping. In order to translate this
kernel id in the first idmapping into a kernel id in the second idmapping we
need to perform two steps:

1. Map the kernel id up into a userspace id in the first idmapping::

    /* Map the kernel id up into a userspace id in the first idmapping. */
    from_kuid(u0:k10000:r10000, k11000) = u1000

2. Map the userspace id down into a kernel id in the second idmapping::

    /* Map the userspace id down into a kernel id in the second idmapping. */
    make_kuid(u0:k20000:r10000, u1000) = k21000

As you can see we used the userspace idmapset in both idmappings to translate
the kernel id in one idmapping to a kernel id in another idmapping.

This allows us to answer the question what kernel id we would need to use to
get the same userspace id in another idmapping. In order to be able to answer
this question both idmappings need to contain the same userspace id in their
respective userspace idmapsets.

Note, how we can easily get back to the kernel id in the first idmapping by
inverting the algorithm:

1. Map the kernel id up into a userspace id in the second idmapping::

    /* Map the kernel id up into a userspace id in the second idmapping. */
    from_kuid(u0:k20000:r10000, k21000) = u1000

2. Map the userspace id down into a kernel id in the first idmapping::

    /* Map the userspace id down into a kernel id in the first idmapping. */
    make_kuid(u0:k10000:r10000, u1000) = k11000

Another way to look at this translation is to treat it as inverting one
idmapping and applying another idmapping if both idmappings have the relevant
userspace id mapped. This will come in handy when working with idmapped mounts.
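As a rough sketch, remapping can be written as a single helper that fails when
either step has no mapping. ``struct idmap`` and ``remap_kuid()`` are
hypothetical names for this illustration, and each idmapping is assumed to be
a single ``u:k:r`` range.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical single-range idmapping u:k:r. */
struct idmap {
	uint32_t u;	/* first userspace id */
	uint32_t k;	/* first kernel id */
	uint32_t r;	/* number of ids mapped */
};

/*
 * Remap kernel id @kid from idmapping @from to idmapping @to via the
 * userspace idmapsets. Returns false if the id is unmapped in either one.
 */
bool remap_kuid(struct idmap from, struct idmap to, uint32_t kid,
		uint32_t *out)
{
	uint32_t uid;

	/* Step 1: map the kernel id up in the first idmapping. */
	if (kid < from.k || kid - from.k >= from.r)
		return false;
	uid = kid - from.k + from.u;

	/* Step 2: map the userspace id down in the second idmapping. */
	if (uid < to.u || uid - to.u >= to.r)
		return false;
	*out = uid - to.u + to.k;
	return true;
}
```

Swapping the ``from`` and ``to`` arguments inverts the translation, which
matches the two inverse steps listed above.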

Invalid translations
~~~~~~~~~~~~~~~~~~~~

It is never valid to use an id in the kernel idmapset of one idmapping as the
id in the userspace idmapset of another or the same idmapping. While the kernel
idmapset always indicates an idmapset in the kernel id space the userspace
idmapset indicates a userspace id. So the following translations are forbidden::

 /* Map the userspace id down into a kernel id in the first idmapping. */
 make_kuid(u0:k10000:r10000, u1000) = k11000

 /* INVALID: Map the kernel id down into a kernel id in the second idmapping. */
 make_kuid(u10000:k20000:r10000, k11000) = k21000
                                 ~~~~~~

and equally wrong::

 /* Map the kernel id up into a userspace id in the first idmapping. */
 from_kuid(u0:k10000:r10000, k11000) = u1000

 /* INVALID: Map the userspace id up into a userspace id in the second idmapping. */
 from_kuid(u20000:k0:r10000, u1000) = k21000
                             ~~~~~

Idmappings when creating filesystem objects
-------------------------------------------

The concepts of mapping an id down or mapping an id up are expressed in the two
kernel functions filesystem developers are rather familiar with and which we've
already used in this document::

 /* Map the userspace id down into a kernel id. */
 make_kuid(idmapping, uid)

 /* Map the kernel id up into a userspace id. */
 from_kuid(idmapping, kuid)

We will take an abbreviated look into how idmappings figure into creating
filesystem objects. For simplicity we will only look at what happens when the
VFS has already completed path lookup right before it calls into the filesystem
itself. So we're concerned with what happens when e.g. ``vfs_mkdir()`` is
called. We will also assume that the directory we're creating filesystem
objects in is readable and writable for everyone.

When creating a filesystem object the kernel will look at the caller's
filesystem ids. These are just regular ``uid_t`` and ``gid_t`` userspace ids
but they are exclusively used when determining file ownership which is why they
are called "filesystem ids". They are usually identical to the uid and gid of
the caller but can differ. We will just assume they are always identical to not
get lost in too many details.

When the caller enters the kernel two things happen:

1. Map the caller's userspace ids down into kernel ids in the caller's
   idmapping.
   (To be precise, the kernel will simply look at the kernel ids stashed in the
   credentials of the current task but for our education we'll pretend this
   translation happens just in time.)
2. Verify that the caller's kernel ids can be mapped up to userspace ids in the
   filesystem's idmapping.

The second step is important as regular filesystems will ultimately need to map
the kernel id back up into a userspace id when writing to disk.
So with the second step the kernel guarantees that a valid userspace id can be
written to disk. If it can't the kernel will refuse the creation request to not
even remotely risk filesystem corruption.

The astute reader will have realized that this is simply a variation of the
crossmapping algorithm we described in a previous section. First, the
kernel maps the caller's userspace id down into a kernel id according to the
caller's idmapping and then maps that kernel id up according to the
filesystem's idmapping.

Let's see some examples with caller/filesystem idmapping but without mount
idmappings. This will exhibit some problems we can hit. After that we will
revisit/reconsider these examples, this time using mount idmappings, to see how
they can solve the problems we observed before.

Example 1
~~~~~~~~~

::

 caller id:            u1000
 caller idmapping:     u0:k0:r4294967295
 filesystem idmapping: u0:k0:r4294967295

Both the caller and the filesystem use the identity idmapping:

1. Map the caller's userspace ids into kernel ids in the caller's idmapping::

    make_kuid(u0:k0:r4294967295, u1000) = k1000

2. Verify that the caller's kernel ids can be mapped to userspace ids in the
   filesystem's idmapping.

   For this second step the kernel will call the function
   ``fsuidgid_has_mapping()`` which ultimately boils down to calling
   ``from_kuid()``::

    from_kuid(u0:k0:r4294967295, k1000) = u1000

In this example both idmappings are the same so there's nothing exciting going
on. Ultimately the userspace id that lands on disk will be ``u1000``.

Example 2
~~~~~~~~~

::

 caller id:            u1000
 caller idmapping:     u0:k10000:r10000
 filesystem idmapping: u0:k20000:r10000

1. Map the caller's userspace ids down into kernel ids in the caller's
   idmapping::

    make_kuid(u0:k10000:r10000, u1000) = k11000

2. Verify that the caller's kernel ids can be mapped up to userspace ids in the
   filesystem's idmapping::

    from_kuid(u0:k20000:r10000, k11000) = u-1

It's immediately clear that while the caller's userspace id could be
successfully mapped down into kernel ids in the caller's idmapping the kernel
ids could not be mapped up according to the filesystem's idmapping. So the
kernel will deny this creation request.

Note that while this example is less common, because most filesystems can't be
mounted with non-initial idmappings, this is a general problem as we can see in
the next examples.
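The two-step check used in the examples above can be modelled in a few lines
of C. This is a sketch under simplifying assumptions: ``struct idmap`` is a
hypothetical single-range idmapping and ``may_create()`` is an invented name.
In the kernel the second step is performed by ``fsuidgid_has_mapping()``.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical single-range idmapping u:k:r. */
struct idmap {
	uint32_t u;	/* first userspace id */
	uint32_t k;	/* first kernel id */
	uint32_t r;	/* number of ids mapped */
};

/*
 * Decide whether a caller with filesystem id @fsuid may create an object.
 * On success *disk_uid holds the userspace id that would land on disk.
 */
bool may_create(struct idmap caller, struct idmap fs, uint32_t fsuid,
		uint32_t *disk_uid)
{
	uint32_t kuid;

	/* Step 1: map the caller's userspace id down in the caller's idmapping. */
	if (fsuid < caller.u || fsuid - caller.u >= caller.r)
		return false;
	kuid = fsuid - caller.u + caller.k;

	/* Step 2: verify the kernel id maps up in the filesystem's idmapping. */
	if (kuid < fs.k || kuid - fs.k >= fs.r)
		return false;
	*disk_uid = kuid - fs.k + fs.u;
	return true;
}
```

With the idmappings from Example 1 the check succeeds and ``u1000`` is written
to disk. With those from Example 2 it fails because ``k11000`` is not part of
the filesystem's kernel idmapset.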

Example 3
~~~~~~~~~

::

 caller id:            u1000
 caller idmapping:     u0:k10000:r10000
 filesystem idmapping: u0:k0:r4294967295

1. Map the caller's userspace ids down into kernel ids in the caller's
   idmapping::

    make_kuid(u0:k10000:r10000, u1000) = k11000

2. Verify that the caller's kernel ids can be mapped up to userspace ids in the
   filesystem's idmapping::

    from_kuid(u0:k0:r4294967295, k11000) = u11000

We can see that the translation always succeeds. The userspace id that the
filesystem will ultimately put to disk will always be identical to the value of
the kernel id that was created in the caller's idmapping. This has mainly two
consequences.

First, that we can't allow a caller to ultimately write to disk with another
userspace id. We could only do this if we were to mount the whole filesystem
with the caller's or another idmapping. But that solution is limited to a few
filesystems and not very flexible, even though this use-case is pretty
important in containerized workloads.

Second, the caller will usually not be able to create any files or access
directories that have stricter permissions because none of the filesystem's
kernel ids map up into valid userspace ids in the caller's idmapping