[PATCH v7 06/10] io_uring/rw: add support to send metadata along with read/write
Keith Busch
kbusch at kernel.org
Tue Nov 5 08:23:19 PST 2024
On Tue, Nov 05, 2024 at 05:00:51PM +0100, Christoph Hellwig wrote:
> On Tue, Nov 05, 2024 at 09:21:27PM +0530, Kanchan Joshi wrote:
> > Can add the documentation (if this version is palatable for Jens/Pavel),
> > but this was discussed in previous iteration:
> >
> > 1. Each meta type may have different space requirement in SQE.
> >
> > Only for PI, we need so much space that we can't fit that in first SQE.
> > The SQE128 requirement is only for PI type.
> > Another different meta type may just fit into the first SQE. For that we
> > don't have to mandate SQE128.
>
> Ok, I'm really confused now. The way I understood Anuj was that this
> is NOT about block level metadata, but about other uses of the big SQE.
>
> Which version is right? Or did I just completely misunderstand Anuj?
Let's not call this "meta_type". Can we use something that has a less
overloaded meaning, like "sqe_extended_capabilities", or "ecap", or
something like that.
> > 2. If two meta types are known not to co-exist, they can be kept in the
> > same place within SQE. Since each meta-type is a flag, we can check what
> > combinations are valid within io_uring and throw the error in case of
> > incompatibility.
>
> And this sounds like what you refer to is not actually block metadata
> as in this patchset or nvme, (or weirdly enough integrity in the block
> layer code).
>
> > 3. Previous version was relying on SQE128 flag. If user set the ring
> > that way, it is assumed that PI information was sent.
> > This is more explicitly conveyed now - if user passed META_TYPE_PI flag,
> > it has sent the PI. This comment in the code:
> >
> > + /* if sqe->meta_type is META_TYPE_PI, last 32 bytes are for PI */
> > + union {
> >
> > If this flag is not passed, parsing of second SQE is skipped, which is
> > the current behavior as now also one can send regular (non pi)
> > read/write on SQE128 ring.
>
> And while I don't understand how this threads in with the previous
> statements, this makes sense. If you only want to send a pointer (+len)
> to metadata you can use the normal 64-byte SQE. If you want to send
> a PI tuple you need SEQ128. Is that what the various above statements
> try to express? If so the right API to me would be to have two flags:
>
> - a flag that a pointer to metadata is passed. This can work with
> a 64-bit SQE.
> - another flag that a PI tuple is passed. This requires a 128-byte
> and also the previous flag.
I don't think anything done so far aligns with what Pavel had in mind.
Let me try to lay out what I think he's going for. Just bare with me,
this is just a hypothetical example.
This patch adds a PI extension.
Later, let's say write streams needs another extenion.
Then key per-IO wants another extention.
Then someone else adds wizbang-awesome-feature extention.
Let's say you have device that can do all 4, or any combination of them.
Pavel wants a solution that is future proof to such a scenario. So not
just a single new "meta_type" with its structure, but a list of types in
no particular order, and their structures.
That list can exist either in the extended SQE, or in some other user
address that the kernel will need copy.
More information about the Linux-nvme
mailing list