[PATCH v7 06/10] io_uring/rw: add support to send metadata along with read/write

Kanchan Joshi joshi.k at samsung.com
Tue Nov 5 08:38:46 PST 2024


On 11/5/2024 9:30 PM, Christoph Hellwig wrote:
> On Tue, Nov 05, 2024 at 09:21:27PM +0530, Kanchan Joshi wrote:
>> Can add the documentation (if this version is palatable for Jens/Pavel),
>> but this was discussed in previous iteration:
>>
>> 1. Each meta type may have different space requirement in SQE.
>>
>> Only for PI, we need so much space that we can't fit that in first SQE.
>> The SQE128 requirement is only for PI type.
>> Another different meta type may just fit into the first SQE. For that we
>> don't have to mandate SQE128.
> 
> Ok, I'm really confused now.  The way I understood Anuj was that this
> is NOT about block level metadata, but about other uses of the big SQE.
> 
> Which version is right?  Or did I just completely misunderstand Anuj?

We both mean the same thing. Currently read/write does not [need to] use
the big SQE, as all the information fits in the first SQE.
Down the line there may be users fighting for space in the SQE. The flag
(meta_type) may help a bit when that happens.

>> 2. If two meta types are known not to co-exist, they can be kept in the
>> same place within SQE. Since each meta-type is a flag, we can check what
>> combinations are valid within io_uring and throw the error in case of
>> incompatibility.
> 
> And this sounds like what you refer to is not actually block metadata
> as in this patchset or nvme, (or weirdly enough integrity in the block
> layer code).

Right, this is not about block metadata/PI. It is about some other extra
information (different in size, semantics, etc.) that the user wants to
pass in the SQE along with read/write.
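
A rough sketch of the compatibility check described in point 2, to make
the idea concrete. Only META_TYPE_PI is named in this series; its bit
value here, the META_TYPE_FOO and META_TYPE_BAR flags, and the function
name meta_type_validate() are all made up for illustration and are not
part of the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* META_TYPE_PI is named in the series; the bit value is assumed. */
#define META_TYPE_PI   (1u << 0)
/* Hypothetical future types, invented for this sketch only. */
#define META_TYPE_FOO  (1u << 1)  /* assumed to overlay the same SQE bytes as PI */
#define META_TYPE_BAR  (1u << 2)  /* assumed to fit in the first 64 bytes */

#define META_TYPE_ALL  (META_TYPE_PI | META_TYPE_FOO | META_TYPE_BAR)

/*
 * Return 0 if the combination of meta-type flags is acceptable,
 * -EINVAL otherwise. Hypothetical helper, not from the patch.
 */
static int meta_type_validate(uint32_t meta_type)
{
	/* Reject flags that are not known at all. */
	if (meta_type & ~META_TYPE_ALL)
		return -EINVAL;

	/* PI and FOO share the same place in the SQE, so they cannot co-exist. */
	if ((meta_type & META_TYPE_PI) && (meta_type & META_TYPE_FOO))
		return -EINVAL;

	return 0;
}
```

With these assumed flags, PI together with BAR is accepted, while PI
together with FOO is rejected because both would claim the same bytes.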

>> 3. Previous version was relying on SQE128 flag. If user set the ring
>> that way, it is assumed that PI information was sent.
>> This is more explicitly conveyed now - if user passed META_TYPE_PI flag,
>> it has sent the PI. This comment in the code:
>>
>> +       /* if sqe->meta_type is META_TYPE_PI, last 32 bytes are for PI */
>> +       union {
>>
>> If this flag is not passed, parsing of second SQE is skipped, which is
>> the current behavior as now also one can send regular (non pi)
>> read/write on SQE128 ring.
> 
> And while I don't understand how this threads in with the previous
> statements, this makes sense.  If you only want to send a pointer (+len)
> to metadata you can use the normal 64-byte SQE.  If you want to send
> a PI tuple you need SQE128.  Is that what the various above statements
> try to express? 

Not exactly. You are talking about pi-type 0 (which only requires meta
buffer/len) versus !0 pi-type. We thought about it, but decided to keep
asking for SQE128 regardless of that (pi 0 or non-zero). In both cases
the user will set meta-buffer/len, and other type-specific flags are
taken care of by the low-level code. This keeps things simple, and at
the io_uring level we don't have to distinguish that case.

What I rather meant in this statement was: one can set up a ring with
SQE128 today and send IORING_OP_READ/IORING_OP_WRITE. That goes fine
without any processing/error, as the extra SQE128 space is skipped
completely. So relying only on the SQE128 flag to detect the presence of
PI is a bit fragile.
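
To illustrate the difference, here is a minimal sketch of the decision
made at prep time. It is not the code from the patch: the function name
rw_prep_meta(), the enum, the bit value of META_TYPE_PI and the choice
of -EINVAL are assumptions for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define META_TYPE_PI (1u << 0)  /* named in the series; bit value assumed */

/* Hypothetical outcomes of preparing a read/write request. */
enum rw_prep_result {
	RW_PREP_PLAIN   = 0,  /* regular read/write, second SQE half ignored */
	RW_PREP_WITH_PI = 1,  /* PI is parsed from the last 32 bytes */
};

/*
 * Decide whether the second half of the SQE carries PI.
 * The decision is keyed on the explicit flag, not on how the ring was set up.
 */
static int rw_prep_meta(bool ring_is_sqe128, uint32_t meta_type)
{
	/* No PI flag: plain read/write, valid on both 64-byte and SQE128 rings. */
	if (!(meta_type & META_TYPE_PI))
		return RW_PREP_PLAIN;

	/* PI was requested, so the big SQE is mandatory (error code assumed). */
	if (!ring_is_sqe128)
		return -EINVAL;

	return RW_PREP_WITH_PI;
}
```

In this sketch a plain read/write on an SQE128 ring still takes the
RW_PREP_PLAIN path, which matches what happens today.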


