[PATCH v6 06/10] io_uring/rw: add support to send metadata along with read/write
Kanchan Joshi
joshi.k at samsung.com
Sun Nov 10 10:36:57 PST 2024
On 11/7/2024 10:53 PM, Pavel Begunkov wrote:
> Let's say we have 3 different attributes META_TYPE{1,2,3}.
>
> How are they placed in an SQE?
>
> meta1 = (void *)get_big_sqe(sqe);
> meta2 = meta1 + sizeof(?); // sizeof(struct meta1_struct)
> meta3 = meta2 + sizeof(struct meta2_struct);
There is no need to do this kind of addition, or to think in terms of
sequential ordering, for the extra information placed into the
primary/secondary SQE.
Please see v8:
https://lore.kernel.org/io-uring/20241106121842.5004-7-anuj20.g@samsung.com/
It exposes a distinct flag (in sqe->ext_cap) for each attribute/cap, and
userspace should place the corresponding information where the kernel has
mandated.
If a particular attribute (for example, write-hint) requires <20b of extra
information, we should just place that in the first SQE. PI requires more, so
we are placing that into the second SQE.
When both the PI and write-hint flags are specified by the user, they can be
processed fine without having to care about the above
additions/ordering.
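To illustrate the point about fixed placement, here is a minimal sketch of
flag-driven parsing with no offset arithmetic. Only the two flag names come
from the example later in this mail; the demo_* structures, their fields, and
the -1 error return are assumptions made for illustration and are not the v8
uapi layout.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Flag names as in the example in this mail. */
#define EXT_CAP_PI         (1U << 0)
#define EXT_CAP_WRITE_HINT (1U << 1)

/* Hypothetical PI descriptor; NOT the real uapi structure. */
struct demo_pi {
	uint16_t flags;
	uint16_t app_tag;
	uint32_t len;
	uint64_t addr;
	uint64_t seed;
};

/* Hypothetical big SQE: each attribute has one mandated location. */
struct demo_big_sqe {
	struct {
		uint16_t ext_cap;    /* one bit per attribute */
		uint8_t  write_hint; /* small attribute, first SQE */
	} first;
	struct {
		struct demo_pi pi;   /* large attribute, second SQE */
	} second;
};

/* Hypothetical per-request state filled in by the parser. */
struct demo_req {
	int has_pi;
	int has_hint;
	uint8_t write_hint;
	struct demo_pi pi;
};

/*
 * Each flag is checked independently and its data is read from the
 * location tied to that flag, so the order in which userspace filled
 * the attributes does not matter and nothing is iterated.
 */
int demo_parse(const struct demo_big_sqe *sqe, struct demo_req *req)
{
	assert(sqe && req);
	memset(req, 0, sizeof(*req));

	/* Reject flags this sketch does not know about. */
	if (sqe->first.ext_cap & ~(EXT_CAP_PI | EXT_CAP_WRITE_HINT))
		return -1;

	if (sqe->first.ext_cap & EXT_CAP_PI) {
		req->pi = sqe->second.pi;
		req->has_pi = 1;
	}
	if (sqe->first.ext_cap & EXT_CAP_WRITE_HINT) {
		req->write_hint = sqe->first.write_hint;
		req->has_hint = 1;
	}
	return 0;
}
```

A caller would set any combination of the two flags, fill only the matching
fields, and call demo_parse() once.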
> Structures are likely not fixed size (?). At least the PI looks large
> enough to force everyone to be just aliased to it.
>
> And can the user pass first meta2 in the sqe and then meta1?
Yes. Just set the ext_cap flags without bothering about first/second.
The user can pass either or both, along with the corresponding info. There
is no need to assume a specific placement in the SQE.
> meta2 = (void *)get_big_sqe(sqe);
> meta1 = meta2 + sizeof(?); // sizeof(struct meta2_struct)
>
> If yes, how parsing should look like? Does the kernel need to read each
> chunk's type and look up its size to iterate to the next one?
We don't need to iterate if we are not assuming any ordering.
> If no, what happens if we want to pass meta2 and meta3, do they start
> from the big_sqe?
Whoever adds the support for meta2/meta3 in the kernel decides where to
place them within the first/second SQE, or whether to fetch them via a
pointer from userspace.
> How do we pass how many of such attributes is there for the request?
ext_cap allows passing 16 cap/attribute flags. Whether all of them can be
passed inline in the SQE is unclear, as I have no real visibility into the
space requirements of future users.
> It should support arbitrary number of attributes in the long run, which
> we can't pass in an SQE, bumping the SQE size is not scalable in
> general, so it'd need to support user pointers or sth similar at some
> point. Placing them in an SQE can serve as an optimisation, and a first
> step, though it might be easier to start with user pointer instead.
>
> Also, when we eventually come to user pointers, we want it to be
> performant as well and e.g. get by just one copy_from_user, and the
> api/struct layouts would need to be able to support it. And once it's
> copied we'll want it to be handled uniformly with the SQE variant, that
> requires a common format. For different formats there will be a question
> of performance, maintainability, duplicating kernel and userspace code.
>
> All that doesn't need to be implemented, but we need a clear direction
> for the API. Maybe we can get a simplified user space pseudo code
> showing how the end API is supposed to look like?
Yes. For a large/arbitrary number, we may have to fetch the entire
attribute list using a user pointer/len combo. And parse it (that's
where all your previous questions fit).
And that can still be added on top of v8.
For example, adding a flag (in ext_cap) that disables inline-sqe
processing and switches to external attribute buffer:
/* Second SQE has PI information */
#define EXT_CAP_PI (1U << 0)
/* First SQE has hint information */
#define EXT_CAP_WRITE_HINT (1U << 1)
/* Do not assume CAP presence in SQE, and fetch capability buffer page
instead */
#define EXT_CAP_INDIRECT (1U << 2)
The corresponding pointer (and/or len) can be put into the last 16b of the SQE.
Use the same flags/structures for the given attributes within this buffer.
That will keep things uniform and will reuse the same handling that we
add for inline attributes.
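
As a rough sketch of how the indirect variant could reuse the inline
handling: the three flag values are the ones above, while the ind_*
structures, the attr_ptr/attr_len fields, and the exact-length check are
assumptions for illustration only. memcpy() stands in for the single
copy_from_user() a kernel implementation would do.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define EXT_CAP_PI         (1U << 0)
#define EXT_CAP_WRITE_HINT (1U << 1)
#define EXT_CAP_INDIRECT   (1U << 2)

/* Hypothetical PI descriptor; NOT the real uapi structure. */
struct ind_pi {
	uint64_t addr;
	uint64_t seed;
	uint32_t len;
};

/* One common attribute format, used both inline and in the buffer. */
struct ind_attrs {
	uint16_t ext_cap;
	uint8_t  write_hint;
	struct ind_pi pi;
};

/* Hypothetical SQE view: inline attributes plus a pointer/len pair. */
struct ind_sqe {
	struct ind_attrs attrs;
	uint64_t attr_ptr; /* used only with EXT_CAP_INDIRECT */
	uint32_t attr_len;
};

struct ind_req {
	int has_pi;
	int has_hint;
	uint8_t write_hint;
	struct ind_pi pi;
};

/* Shared handler: identical for inline and indirect attributes. */
int ind_apply(const struct ind_attrs *a, struct ind_req *req)
{
	memset(req, 0, sizeof(*req));
	/* Unknown bits, or INDIRECT nested inside a buffer, are rejected. */
	if (a->ext_cap & ~(EXT_CAP_PI | EXT_CAP_WRITE_HINT))
		return -1;
	if (a->ext_cap & EXT_CAP_PI) {
		req->pi = a->pi;
		req->has_pi = 1;
	}
	if (a->ext_cap & EXT_CAP_WRITE_HINT) {
		req->write_hint = a->write_hint;
		req->has_hint = 1;
	}
	return 0;
}

int ind_prep(const struct ind_sqe *sqe, struct ind_req *req)
{
	struct ind_attrs copy;

	assert(sqe && req);
	if (!(sqe->attrs.ext_cap & EXT_CAP_INDIRECT))
		return ind_apply(&sqe->attrs, req); /* inline path */

	if (sqe->attr_len != sizeof(copy))
		return -1;
	/* One copy of the whole buffer, then the same handler as inline. */
	memcpy(&copy, (const void *)(uintptr_t)sqe->attr_ptr, sizeof(copy));
	return ind_apply(&copy, req);
}
```

With this shape, userspace fills a struct ind_attrs the same way in both
cases and only chooses whether to embed it or point at it.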