[PATCH 1/8] io_uring: split up io_uring_sqe into hdr + main
Jens Axboe
axboe at kernel.dk
Thu Mar 18 18:40:25 GMT 2021
On 3/17/21 11:34 PM, Christoph Hellwig wrote:
>> @@ -14,11 +14,22 @@
>> /*
>> * IO submission data structure (Submission Queue Entry)
>> */
>> +struct io_uring_sqe_hdr {
>> + __u8 opcode; /* type of operation for this sqe */
>> + __u8 flags; /* IOSQE_ flags */
>> + __u16 ioprio; /* ioprio for the request */
>> + __s32 fd; /* file descriptor to do IO on */
>> +};
>> +
>> struct io_uring_sqe {
>> +#ifdef __KERNEL__
>> + struct io_uring_sqe_hdr hdr;
>> +#else
>> __u8 opcode; /* type of operation for this sqe */
>> __u8 flags; /* IOSQE_ flags */
>> __u16 ioprio; /* ioprio for the request */
>> __s32 fd; /* file descriptor to do IO on */
>> +#endif
>> union {
>> __u64 off; /* offset into file */
>> __u64 addr2;
>
> Please don't do that ifdef __KERNEL__ mess. We never guaranteed
> userspace API compatibility, just ABI compatibility.
Right, but I'm the one that has to deal with the fallout. For the
in-kernel one I can skip the __KERNEL__ part, and the layout is the
same anyway.
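
To illustrate the "layout is the same" point, here is a minimal sketch, not the actual kernel header. The struct names (sqe_hdr, sqe_kernel_view, sqe_user_view) are hypothetical stand-ins, fixed-width stdint types replace __u8/__u16/__s32, and only the first union member (off) is reproduced. It checks at compile time that wrapping the leading fields in a header struct does not move any of them:

```c
#include <assert.h>   /* static_assert (C11), assert */
#include <stddef.h>   /* offsetof */
#include <stdint.h>

/* Hypothetical stand-in for the header struct in the quoted patch. */
struct sqe_hdr {
    uint8_t  opcode;  /* type of operation for this sqe */
    uint8_t  flags;   /* IOSQE_ flags */
    uint16_t ioprio;  /* ioprio for the request */
    int32_t  fd;      /* file descriptor to do IO on */
};

/* "Kernel" view: leading fields wrapped in the header struct. */
struct sqe_kernel_view {
    struct sqe_hdr hdr;
    uint64_t off;     /* first member of the following union, for brevity */
};

/* "Userspace" view: the same leading fields kept flat. */
struct sqe_user_view {
    uint8_t  opcode;
    uint8_t  flags;
    uint16_t ioprio;
    int32_t  fd;
    uint64_t off;
};

/* Compile-time checks: identical sizes and field offsets. */
static_assert(sizeof(struct sqe_hdr) == 8, "header should be 8 bytes");
static_assert(sizeof(struct sqe_kernel_view) == sizeof(struct sqe_user_view),
              "both views should have the same size");
static_assert(offsetof(struct sqe_kernel_view, hdr.ioprio) ==
              offsetof(struct sqe_user_view, ioprio), "ioprio moved");
static_assert(offsetof(struct sqe_kernel_view, hdr.fd) ==
              offsetof(struct sqe_user_view, fd), "fd moved");
static_assert(offsetof(struct sqe_kernel_view, off) ==
              offsetof(struct sqe_user_view, off), "off moved");

/* Runtime variant of the same comparison; returns 1 if the views agree. */
static int sqe_views_match(void)
{
    return offsetof(struct sqe_kernel_view, hdr.opcode) ==
               offsetof(struct sqe_user_view, opcode) &&
           offsetof(struct sqe_kernel_view, hdr.flags) ==
               offsetof(struct sqe_user_view, flags) &&
           offsetof(struct sqe_kernel_view, off) ==
               offsetof(struct sqe_user_view, off);
}
```

If any of the static_assert lines failed to compile, the two views would no longer be ABI-identical and the #ifdef would be changing more than the source-level API.
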
> But we really do have a bigger problem here, and that is ioprio is
> a field that is specific to the read and write commands and thus
> should not be in the generic header. On the other hand the
> personality is.
>
> So I'm not sure trying to retrofit this even makes all that much sense.
>
> Maybe we should just define io_uring_sqe_hdr the way it makes
> sense:
>
> struct io_uring_sqe_hdr {
> __u8 opcode;
> __u8 flags;
> __u16 personality;
> __s32 fd;
> __u64 user_data;
> };
>
> and use that for all new commands going forward while marking the
> old ones as legacy.
>
> io_uring_cmd_sqe would then be:
>
> struct io_uring_cmd_sqe {
> struct io_uring_sqe_hdr hdr;
> __u32 ioc;
> __u32 len;
> __u8 data[40];
> };
>
> for example. Note the 32-bit opcode just like ioctl to avoid
> getting into too much trouble due to collisions.
I was debating that with myself too, it's essentially making
the existing io_uring_sqe into io_uring_sqe_v1 and then making a new
v2 one. That would impact _all_ commands, and we'd need some trickery
to have newly compiled stuff use v2 and have existing applications
continue to work with the v1 format. That's very different from having
a single (or new) opcode use a v2 format, effectively.
Looking into the feasibility of this. But if that is done, there are
other things that need to be factored in, as I'm not at all interested
in having a v3 down the line as well. And I'd need to be able to do this
seamlessly, both from an application point of view, and a performance
point of view (no stupid conversions inline).
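
For reference, the header and command layout suggested in the quoted text can be sanity-checked for size. This is only a sketch under stated assumptions: the names sqe_hdr_v2 and cmd_sqe_v2 are hypothetical, stdint types stand in for the __u8/__u32 kernel types, and the 64-byte target is the size of the existing io_uring_sqe:

```c
#include <assert.h>   /* static_assert (C11), assert */
#include <stddef.h>   /* offsetof, size_t */
#include <stdint.h>

/* Hypothetical v2 header, following the field list in the quoted proposal. */
struct sqe_hdr_v2 {
    uint8_t  opcode;
    uint8_t  flags;
    uint16_t personality;  /* replaces ioprio in the generic header */
    int32_t  fd;
    uint64_t user_data;
};

/* Hypothetical command SQE built on that header. */
struct cmd_sqe_v2 {
    struct sqe_hdr_v2 hdr;
    uint32_t ioc;          /* 32-bit command opcode, ioctl-style */
    uint32_t len;
    uint8_t  data[40];     /* command-specific payload */
};

/* The header takes 16 bytes, and the full entry still fits in 64. */
static_assert(sizeof(struct sqe_hdr_v2) == 16, "v2 header should be 16 bytes");
static_assert(sizeof(struct cmd_sqe_v2) == 64, "cmd sqe should stay 64 bytes");
static_assert(offsetof(struct cmd_sqe_v2, data) == 24, "payload starts at 24");

/* Bytes left for the command payload after header, ioc, and len. */
static size_t cmd_sqe_payload_bytes(void)
{
    return sizeof(struct cmd_sqe_v2) - offsetof(struct cmd_sqe_v2, data);
}
```

The arithmetic is 16 (header) + 4 (ioc) + 4 (len) + 40 (data) = 64, so an entry of this shape would occupy the same slot size as the current format.
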
Things that come up when something like this is on the table:
- Should flags be extended? We're almost out... It hasn't been an
issue so far, but seems a bit silly to go v2 and not at least leave
a bit of room there. But obviously comes at a cost of losing eg 8
bits somewhere else.
- Is u8 enough for the opcode? Again, we're nowhere near the limits
here, but eventually multiplexing might be necessary.
That's just off the top of my head, probably other things to consider
too.
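
As a rough illustration of the flags question, the sketch below counts how many bits remain in an 8-bit flags field. It assumes the six IOSQE_* flag bits that I believe the uapi header defined at the time of this thread. The enum names are local stand-ins mirroring those bits, not the kernel's own definitions:

```c
#include <assert.h>   /* assert */
#include <stdint.h>

/* Local stand-ins for the sqe flag bit positions (assumed: six in use). */
enum sqe_flag_bit {
    SQE_FIXED_FILE_BIT,     /* 0 */
    SQE_IO_DRAIN_BIT,       /* 1 */
    SQE_IO_LINK_BIT,        /* 2 */
    SQE_IO_HARDLINK_BIT,    /* 3 */
    SQE_ASYNC_BIT,          /* 4 */
    SQE_BUFFER_SELECT_BIT,  /* 5 */
    SQE_FLAG_BITS_USED      /* count of bits in use: 6 */
};

/* Number of unused bits left in a flags field of the given byte width. */
static int sqe_flag_bits_free(unsigned int flags_bytes)
{
    return (int)(flags_bytes * 8u) - (int)SQE_FLAG_BITS_USED;
}
```

Under that assumption a __u8 flags field has 2 bits left, while widening it to 16 bits would leave 10, at the cost of 8 bits taken from another field.
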
--
Jens Axboe