[RFC PATCH v2] hostfs: handle idmapped mounts
Christian Brauner
brauner at kernel.org
Thu Mar 16 03:37:19 PDT 2023
On Thu, Mar 16, 2023 at 06:20:19AM +0000, Glenn Washburn wrote:
> On 3/4/23 12:01, Christian Brauner wrote:
> > On Sat, Mar 04, 2023 at 12:28:46AM -0600, Glenn Washburn wrote:
> > > On Thu, 2 Mar 2023 09:39:28 +0100
> > > Christian Brauner <brauner at kernel.org> wrote:
> > >
> > > > On Tue, Feb 28, 2023 at 07:50:02PM -0600, Glenn Washburn wrote:
> > > > > Let hostfs handle idmapped mounts. This allows to have the same
> > > > > hostfs mount appear in multiple locations with different id
> > > > > mappings.
> > > > >
> > > > > root@(none):/media# id
> > > > > uid=0(root) gid=0(root) groups=0(root)
> > > > > root@(none):/media# mkdir mnt idmapped
> > > > > root@(none):/media# mount -thostfs -o/home/user hostfs mnt
> > > > >
> > > > > root@(none):/media# touch mnt/aaa
> > > > > root@(none):/media# mount-idmapped --map-mount u:`id -u user`:0:1
> > > > > --map-mount g:`id -g user`:0:1 /media/mnt /media/idmapped
> > > > > root@(none):/media# ls -l mnt/aaa idmapped/aaa -rw-r--r-- 1 root
> > > > > root 0 Jan 28 01:23 idmapped/aaa -rw-r--r-- 1 user user 0 Jan 28
> > > > > 01:23 mnt/aaa
> > > > >
> > > > > root@(none):/media# touch idmapped/bbb
> > > > > root@(none):/media# ls -l mnt/bbb idmapped/bbb
> > > > > -rw-r--r-- 1 root root 0 Jan 28 01:26 idmapped/bbb
> > > > > -rw-r--r-- 1 user user 0 Jan 28 01:26 mnt/bbb
> > > > >
> > > > > Signed-off-by: Glenn Washburn <development at efficientek.com>
> > > > > ---
> > > > > Changes from v1:
> > > > > * Rebase on to tip. The above commands work and have the results
> > > > > expected. The __vfsuid_val(make_vfsuid(...)) seems ugly to get the
> > > > > uid_t, but it seemed like the best one I've come across. Is there a
> > > > > better way?
> > > >
> > > > Sure, I can help you with that. ;)
> > >
> > > Thank you!
> > >
> > > > >
> > > > > Glenn
> > > > > ---
> > > > > fs/hostfs/hostfs_kern.c | 13 +++++++------
> > > > > 1 file changed, 7 insertions(+), 6 deletions(-)
> > > > >
> > > > > diff --git a/fs/hostfs/hostfs_kern.c b/fs/hostfs/hostfs_kern.c
> > > > > index c18bb50c31b6..9459da99a0db 100644
> > > > > --- a/fs/hostfs/hostfs_kern.c
> > > > > +++ b/fs/hostfs/hostfs_kern.c
> > > > > @@ -786,7 +786,7 @@ static int hostfs_permission(struct mnt_idmap
> > > > > *idmap, err = access_file(name, r, w, x);
> > > > > __putname(name);
> > > > > if (!err)
> > > > > - err = generic_permission(&nop_mnt_idmap, ino,
> > > > > desired);
> > > > > + err = generic_permission(idmap, ino, desired);
> > > > > return err;
> > > > > }
> > > > > @@ -794,13 +794,14 @@ static int hostfs_setattr(struct mnt_idmap
> > > > > *idmap, struct dentry *dentry, struct iattr *attr)
> > > > > {
> > > > > struct inode *inode = d_inode(dentry);
> > > > > + struct user_namespace *fs_userns = i_user_ns(inode);
> > > >
> > > > Fyi, since hostfs can't be mounted in a user namespace
> > > > fs_userns == &init_user_ns
> > > > so it doesn't really matter what you use.
> > >
> > > What would you suggest as preferable?
> >
> > I would leave init_user_ns hardcoded as it clearly indicates that hostfs
> > can only be mounted in the initial user namespace. Plus, the patch is
> > smaller.
> >
> > >
> > > > > struct hostfs_iattr attrs;
> > > > > char *name;
> > > > > int err;
> > > > > int fd = HOSTFS_I(inode)->fd;
> > > > > - err = setattr_prepare(&nop_mnt_idmap, dentry, attr);
> > > > > + err = setattr_prepare(idmap, dentry, attr);
> > > > > if (err)
> > > > > return err;
> > > > > @@ -814,11 +815,11 @@ static int hostfs_setattr(struct mnt_idmap
> > > > > *idmap, }
> > > > > if (attr->ia_valid & ATTR_UID) {
> > > > > attrs.ia_valid |= HOSTFS_ATTR_UID;
> > > > > - attrs.ia_uid = from_kuid(&init_user_ns,
> > > > > attr->ia_uid);
> > > > > + attrs.ia_uid = __vfsuid_val(make_vfsuid(idmap,
> > > > > fs_userns, attr->ia_uid)); }
> > > > > if (attr->ia_valid & ATTR_GID) {
> > > > > attrs.ia_valid |= HOSTFS_ATTR_GID;
> > > > > - attrs.ia_gid = from_kgid(&init_user_ns,
> > > > > attr->ia_gid);
> > > > > + attrs.ia_gid = __vfsgid_val(make_vfsgid(idmap,
> > > > > fs_userns, attr->ia_gid));
> > > >
> > > > Heh, if you look include/linux/fs.h:
> > > >
> > > > /*
> > > > * The two anonymous unions wrap structures with the same
> > > > member. *
> > > > * Filesystems raising FS_ALLOW_IDMAP need to use
> > > > ia_vfs{g,u}id which
> > > > * are a dedicated type requiring the filesystem to use the
> > > > dedicated
> > > > * helpers. Other filesystem can continue to use ia_{g,u}id
> > > > until they
> > > > * have been ported.
> > > > *
> > > > * They always contain the same value. In other words
> > > > FS_ALLOW_IDMAP
> > > > * pass down the same value on idmapped mounts as they would
> > > > on regular
> > > > * mounts.
> > > > */
> > > > union {
> > > > kuid_t ia_uid;
> > > > vfsuid_t ia_vfsuid;
> > > > };
> > > > union {
> > > > kgid_t ia_gid;
> > > > vfsgid_t ia_vfsgid;
> > > > };
> > > >
> > > > this just is:
> > > >
> > > > attrs.ia_uid = from_vfsuid(idmap, fs_userns, attr->ia_vfsuid));
> > > > attrs.ia_gid = from_vfsgid(idmap, fs_userns, attr->ia_vfsgid));
> > >
> > > Its easy to miss from this patch because of lack of context, but attrs
> > > is a struct hostfs_iattr, not struct iattr. And attrs.ia_uid is of type
> > > uid_t, not kuid_t. So the above fails to compile. This is why I needed
> >
> > Oh, I see. And then that raw value is used by calling
> > fchmod()/chmod()/chown()/fchown() and so on. That's rather special.
> > Ok, then I know what to do.
> >
> > > to wrap make_vfsuid() in __vfsuid_val() (to get the uid_t).
> >
> > Right. My point had been - independent of the struct hostfs_iattr issue
> > you thankfully pointed out - that make_vfsuid() is wrong here.
> >
> > make_vfsuid() is used to map a filesystem wide k{g,u}id_t according to
> > the mount's idmapping that operation originated from. But that's done
> > by the vfs way before we're calling into the filesystem. For example,
> > it's done in chown_common().
> >
> > So the value placed in struct iattr (the VFS struct) is already a
> > vfs{g,u}id stored in iattr->ia_vfs{g,u}id. So you need to use
> > from_vfs{g,u}id() here.
> >
> > >
> > > I had decided against using from_vfsuid() because then I thought I'd
> > > need to use from_kuid() to get the uid_t. And from_kuid() takes the
> > > namespace (again), which seemed uglier.
> > >
> > > Based on this, what do you suggest?
> >
> > Ok, so just some details on the background before I paste what I think
> > we should do.
> > As soon as you support idmapped mounts you at least technically are
> Thanks for the detailed explanation. I apologize for not getting back to
> this sooner.
No need to apologize!
>
> > always dealing with two mappings:
> >
> > (1) First, there's the filesystem wide idmapping which is taken from the
> > namespace the filessytem was mounted in. This idmapping is applied
> > when you read the raw uid/gid value from disk and turn into a kuid_t
> > type. That value is persistent and stored in inode->i_{g,u}id. All
> > things that are cached and that can be accessed from multiple mounts
> > concurrently can only ever cache k{g,u}id_t aka filesystem values.
> > (2) Whenever we're dealing with an operation that's coming from an
> > idmapped mount we need to take the idmapping of the mount into
> > account. That idmapping is completely separate type struct
> > mnt_idmap that's opaque for filesystems and most of the vfs.
> >
> > That idmapping is used to generate the vfs{g,u}id_t. IOW, translates
> > from the filesystem representation to a mount/vfs representation.
> >
> > So, in order to store the correct value on disk we need to invert those
> > two idmappings to arrive at the raw value that we want to store:
> > (U1) from_vfsuid() // map to the filesystem wide value aka something
> > that we can store in inode->i_{g,u}id and that's cacheable. This is
> > done in setattr_copy().
> > (U2) from_kuid() // map the filesystem wide value to the raw value we
> > want to store on disk
>
> It seems to me that there are actually 3 mappings, with the third being (U2)
> above (ie vfsuid_t -> kuid_t). And that from_vfsuid() does mappings (1) and
> (2) above. Is this incorrect?
My language was confusing. You're dealing two _idmappings_ but
3 mappings _may_ be performed iff the filesystem you're dealing with can
itself be mounted with an idmapping/in a namespace. Otherwise it's just
two mappings as the filesystem idmapping is a nop,
i.e., maps 0:0, 1:1, ..., n:n.
For details you can read Documentation/filesystems/idmappings.rst but
note that it's out of date and requires the patch I picked up a few days
ago:
https://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping.git/commit/?h=for-next&id=5d3ca5968480758a29a0b2777da9049a7c5134e3
>
> Whats confusing to me is that from_vfsuid() takes both an idmap and a user
> namespace, so presumably it will handle both mapping types (1) and (2). And
> then there's from_kuid() which takes an idmap, so I thought it might also do
> a type (2) mapping. But looking at the code it doesn't seem to ever use its
> idmap parameter. Can you explain the rational behind having from_kuid() take
> an idmap? Is it legacy that will be cleaned up as this code settles down /
> stabilizes? Or perhaps its
Sure, I can explain. The from_kuid() function maps from a kuid_t into a
uid_t and the make_kuid() function maps from a uid_t into a kuid_t. They
are used to interact with a caller's idmapping (current_user_ns()) or a
filesystem idmapping (sb->s_user_ns). (As far as I'm concerned, that's
already a severe ambiguity as a filesystem idmapping is something very
different from a caller idmapping. Technically, this should've
been expressed in the form of a dedicated type struct fs_userns or sm to
keep them distinct. But that's a different topic.)
So, idmapped mounts bring in the concept of per mount ownership. So you
need to distinguish between filesystem wide ownership (kuid_t/kgid_t)
and mount local/VFS ownership (vfsuid_t/vfsgid_t). Expressing this in
two different types is a safety measure so you can't generate a vfsuid_t
via make_kuid() and you can't generate a uid_t from a vfsuid_t via
from_kuid(). Because they fundamentally shouldn't be concerned with
VFS/mount ownership to avoid confusion.
So the next step is to ensure that the idmapping used to generate such
vfsuid_t/vfsgid_t is type-distinct from caller- and filesystem
namespaces. This way you can never pass a mount idmapping to
from_kuid(), make_kuid(), or accidently use it for permission checking
in ns_capable() etc.
So the dedicated struct mnt_idmap type makes such bugs impossible as it
can only be passed to dedicated helpers and cannot be dereferenced
outside of a tiny piece of core VFS code.
In addition to that, we should aim to let filesystems create idmapped
mounts without having to allocate namespaces. Ofc, for the container
use-case it's very convenient to just be able to reuse the container's
namespace, attach it to a mount and thus delegate the mount to the
container. This is especially nice because it _can_ be used make
container ownership completely transient. IOW, you could spawn the same
container with a random idmapping everytime because the idmap isn't
persisted on-disk.
Back to your case. As I've explained. hostfs cannot be mounted inside of
namespaces. Thus, the idmapping attached to the filesystem retievable
via i_user_ns() is a nop, i.e., it maps 0:0, 1:1, ..., n:n. So there's
nothing to do for from_kuid() or make_kuid() before mapping into a
mount.
More details you should be able to find in
Documentation/filesystems/idmappings.rst as mentioned before.
>
> >
> > For nearly all filesystems these steps almost never need to be performed
> > explicitly. Instead, dedicated vfs helpers will do this:
> >
> > (U1) i_{g,u}id_update() // map to filesystem wide value
> > (U2) i_{g,u}id_read() // map to raw on-disk value
> >
> > For filesystems that don't support being mounted in namespaces the (U2)
> > step is always a nop. So technically there's no difference between:
> >
> > (U2) from_kuid() and __kuid_val(kuid)
> >
> > but it's cleaner to use the helpers even in that case.
> >
> > So given how hostfs works these two steps need to be performed
> > explicitly. So I suggest (untested):
> >
> > diff --git a/fs/hostfs/hostfs_kern.c b/fs/hostfs/hostfs_kern.c
> > index c18bb50c31b6..72b7e1bcc32e 100644
> > --- a/fs/hostfs/hostfs_kern.c
> > +++ b/fs/hostfs/hostfs_kern.c
> > @@ -813,12 +813,22 @@ static int hostfs_setattr(struct mnt_idmap *idmap,
> > attrs.ia_mode = attr->ia_mode;
> > }
> > if (attr->ia_valid & ATTR_UID) {
> > + kuid_t kuid;
> > +
> > attrs.ia_valid |= HOSTFS_ATTR_UID;
> > - attrs.ia_uid = from_kuid(&init_user_ns, attr->ia_uid);
> > + /* Map the vfs id into the filesystem. */
> > + kuid = from_vfsuid(idmap, &init_user_ns, attr->ia_vfsuid);
> > + /* Map the filesystem id to its raw on disk value. */
> > + attrs.ia_uid = from_kuid(&init_user_ns, kuid);
>
> Its interesting that this is what I originally discarded, as an unfamiliar
> reader, it looks like you're doing two namespace mappings. But that's not
> happening because from_kuid() disregards its namespace parameter.
It's a nop because hostfs isn't mountable inside of namespace/with an
idmapping.
>
> I've tested this and it does seems to work. Thanks!
+1
Not sure if you have xfstests integration but if you do:
sudo ./check -g idmapped
you'll get a whole testsuite dedicated to it.
More information about the linux-um
mailing list