[PATCH v6 06/10] io_uring/rw: add support to send metadata along with read/write

Kanchan Joshi joshi.k at samsung.com
Sun Nov 10 10:36:57 PST 2024


On 11/7/2024 10:53 PM, Pavel Begunkov wrote:

> Let's say we have 3 different attributes META_TYPE{1,2,3}.
> 
> How are they placed in an SQE?
> 
> meta1 = (void *)get_big_sqe(sqe);
> meta2 = meta1 + sizeof(?); // sizeof(struct meta1_struct)
> meta3 = meta2 + sizeof(struct meta2_struct);

Not necessary to do this kind of additions and think in terms of 
sequential ordering for the extra information placed into 
primary/secondary SQE.

Please see v8:
https://lore.kernel.org/io-uring/20241106121842.5004-7-anuj20.g@samsung.com/

It exposes a distinct flag (sqe->ext_cap) for each attribute/cap, and 
userspace should place the corresponding information where kernel has 
mandated.

If a particular attribute (example write-hint) requires <20b of extra 
information, we should just place that in first SQE. PI requires more so 
we are placing that into second SQE.

When both PI and write-hint flags are specified by user they can get 
processed fine without actually having to care about above 
additions/ordering.

> Structures are likely not fixed size (?). At least the PI looks large
> enough to force everyone to be just aliased to it.
> 
> And can the user pass first meta2 in the sqe and then meta1?

Yes. Just set the ext_cap flags without bothering about first/second.
User can pass either or both, along with the corresponding info. Just 
don't have to assume specific placement into SQE.


> meta2 = (void *)get_big_sqe(sqe);
> meta1 = meta2 + sizeof(?); // sizeof(struct meta2_struct)
> 
> If yes, how parsing should look like? Does the kernel need to read each
> chunk's type and look up its size to iterate to the next one?

We don't need to iterate if we are not assuming any ordering.

> If no, what happens if we want to pass meta2 and meta3, do they start
> from the big_sqe?

The one who adds the support for meta2/meta3 in kernel decides where to 
place them within first/second SQE or get them fetched via a pointer 
from userspace.

> How do we pass how many of such attributes is there for the request?

ext_cap allows to pass 16 cap/attribute flags. Maybe all can or can not 
be passed inline in SQE, but I have no real visibility about the space 
requirement of future users.


> It should support arbitrary number of attributes in the long run, which
> we can't pass in an SQE, bumping the SQE size is not scalable in
> general, so it'd need to support user pointers or sth similar at some
> point. Placing them in an SQE can serve as an optimisation, and a first> step, though it might be easier to start with user pointer instead.
> 
> Also, when we eventually come to user pointers, we want it to be
> performant as well and e.g. get by just one copy_from_user, and the
> api/struct layouts would need to be able to support it. And once it's
> copied we'll want it to be handled uniformly with the SQE variant, that
> requires a common format. For different formats there will be a question
> of perfomance, maintainability, duplicating kernel and userspace code.
> 
> All that doesn't need to be implemented, but we need a clear direction
> for the API. Maybe we can get a simplified user space pseudo code
> showing how the end API is supposed to look like?

Yes. For a large/arbitrary number, we may have to fetch the entire 
attribute list using a user pointer/len combo. And parse it (that's 
where all your previous questions fit).

And that can still be added on top of v8.
For example, adding a flag (in ext_cap) that disables inline-sqe 
processing and switches to external attribute buffer:

/* Second SQE has PI information */
#define EXT_CAP_PI		(1U << 0)
/* First SQE has hint information */
#define EXT_CAP_WRITE_HINT	(1U << 1)	
/* Do not assume CAP presence in SQE, and fetch capability buffer page 
instead */
#define EXT_CAP_INDIRECT 	(1U << 2)

Corresponding pointer (and/or len) can be put into last 16b of SQE.
Use the same flags/structures for the given attributes within this buffer.
That will keep things uniform and will reuse the same handling that we 
add for inline attributes.



More information about the Linux-nvme mailing list