[PATCH v6 06/10] io_uring/rw: add support to send metadata along with read/write

Fri Nov 1 10:54:15 PDT 2024

On 10/31/2024 8:09 PM, Pavel Begunkov wrote:
> On 10/30/24 21:09, Keith Busch wrote:
>> On Wed, Oct 30, 2024 at 11:31:08PM +0530, Kanchan Joshi wrote:
>>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/ 
>>> io_uring.h
>>> index 024745283783..48dcca125db3 100644
>>> --- a/include/uapi/linux/io_uring.h
>>> +++ b/include/uapi/linux/io_uring.h
>>> @@ -105,6 +105,22 @@ struct io_uring_sqe {
>>>            */
>>>           __u8    cmd[0];
>>>       };
>>> +    /*
>>> +     * If the ring is initialized with IORING_SETUP_SQE128, then
>>> +     * this field is starting offset for 64 bytes of data. For meta io
>>> +     * this contains 'struct io_uring_meta_pi'
>>> +     */
>>> +    __u8    big_sqe[0];
>>> +};
> 
> I don't think zero sized arrays are good as a uapi regardless of
> cmd[0] above, let's just do
> 
> sqe = get_sqe();
> big_sqe = (void *)(sqe + 1)
> 
> with an appropriate helper.

In one of the internal version I did just that (i.e., sqe + 1), and 
that's fine for kernel.
But afterwards added big_sqe so that userspace can directly access 
access second-half of SQE_128. We have the similar big_cqe[] within 
io_uring_cqe too.

Is this still an eyesore?

>>> +
>>> +/* this is placed in SQE128 */
>>> +struct io_uring_meta_pi {
>>> +    __u16        pi_flags;
>>> +    __u16        app_tag;
>>> +    __u32        len;
>>> +    __u64        addr;
>>> +    __u64        seed;
>>> +    __u64        rsvd[2];
>>>   };
>>
>> On the previous version, I was more questioning if it aligns with what
> 
> I missed that discussion, let me know if I need to look it up

Yes, please take a look at previous iteration (v5):
https://lore.kernel.org/io-uring/e7aae741-c139-48d1-bb22-dbcd69aa2f73@samsung.com/

Also the corresponding code, since my other answers will use that.

>> Pavel was trying to do here. I didn't quite get it, so I was more
>> confused than saying it should be this way now.
> 
> The point is, SQEs don't have nearly enough space to accommodate all
> such optional features, especially when it's taking so much space and
> not applicable to all reads but rather some specific  use cases and
> files. Consider that there might be more similar extensions and we might
> even want to use them together.
> 
> 1. SQE128 makes it big for all requests, intermixing with requests that
> don't need additional space wastes space. SQE128 is fine to use but at
> the same time we should be mindful about it and try to avoid enabling it
> if feasible.

Right. And initial versions of this series did not use SQE128. But as we 
moved towards passing more comprehensive PI information, first SQE was 
not enough. And we thought to make use of SQE128 rather than taking 
copy_from_user cost.

 > 2. This API hard codes io_uring_meta_pi into the extended part of the
> SQE. If we want to add another feature it'd need to go after the meta
> struct. SQE256?

Not necessarily. It depends on how much extra space it needs for another 
feature. To keep free space in first SQE, I chose to place PI in the 
second one. Anyone requiring 20b (in v6) or 18b (in v5) space, does not 
even have to ask for SQE128.
For more, they can use leftover space in second SQE (about half of 
second sqe will still be free). In v5, they have entire second SQE if 
they don't want to use PI.
If contiguity is a concern, we can move all PI bytes (about 32b) to the 
end of second SQE.

 > And what if the user doesn't need PI but only the second
> feature?

Not this version, but v5 exposed meta_type as bit flags.
And with that, user will not pass the PI flag and that enables to use 
all the PI bytes for something else. We will have union of PI with some 
other info that is known not to co-exist.

> In short, the uAPI need to have a clear vision of how it can be used
> with / extended to multiple optional features and not just PI.
> 
> One option I mentioned before is passing a user pointer to an array of
> structures, each would will have the type specifying what kind of
> feature / meta information it is, e.g. META_TYPE_PI. It's not a
> complete solution but a base idea to extend upon. I separately
> mentioned before, if copy_from_user is expensive we can optimise it
> with pre-registering memory. I think Jens even tried something similar
> with structures we pass as waiting parameters.
> 
> I didn't read through all iterations of the series, so if there is
> some other approach described that ticks the boxes and flexible
> enough, I'd be absolutely fine with it.

Please just read v5. I think it ticks as many boxes as possible without 
having to resort to copy_from_user.