[PATCH 1/8] io_uring: split up io_uring_sqe into hdr + main

Jens Axboe axboe at kernel.dk
Thu Mar 18 18:40:25 GMT 2021


On 3/17/21 11:34 PM, Christoph Hellwig wrote:
>> @@ -14,11 +14,22 @@
>>  /*
>>   * IO submission data structure (Submission Queue Entry)
>>   */
>> +struct io_uring_sqe_hdr {
>> +	__u8	opcode;		/* type of operation for this sqe */
>> +	__u8	flags;		/* IOSQE_ flags */
>> +	__u16	ioprio;		/* ioprio for the request */
>> +	__s32	fd;		/* file descriptor to do IO on */
>> +};
>> +
>>  struct io_uring_sqe {
>> +#ifdef __KERNEL__
>> +	struct io_uring_sqe_hdr	hdr;
>> +#else
>>  	__u8	opcode;		/* type of operation for this sqe */
>>  	__u8	flags;		/* IOSQE_ flags */
>>  	__u16	ioprio;		/* ioprio for the request */
>>  	__s32	fd;		/* file descriptor to do IO on */
>> +#endif
>>  	union {
>>  		__u64	off;	/* offset into file */
>>  		__u64	addr2;
> 
> Please don't do that ifdef __KERNEL__ mess.  We never guaranteed
> userspace API compatibility, just ABI compatibility.

Right, but I'm the one that has to deal with the fallout. For the
in-kernel one I can skip the __KERNEL__ part, and the layout is the
same anyway.
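
To illustrate the "layout is the same" point, here is a minimal sketch
using userspace stand-ins for the kernel types. The sketch_* names are
hypothetical and the struct is truncated after 'off'; it only checks that
wrapping the first four fields in a header struct moves nothing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the proposed header (__u8 -> uint8_t, etc.). */
struct sketch_sqe_hdr {
	uint8_t  opcode;	/* type of operation for this sqe */
	uint8_t  flags;		/* IOSQE_ flags */
	uint16_t ioprio;	/* ioprio for the request */
	int32_t  fd;		/* file descriptor to do IO on */
};

/* Flat variant: fields declared inline, truncated after 'off'. */
struct sketch_sqe_flat {
	uint8_t  opcode;
	uint8_t  flags;
	uint16_t ioprio;
	int32_t  fd;
	uint64_t off;
};

/* Split variant: the same fields wrapped in the header struct. */
struct sketch_sqe_split {
	struct sketch_sqe_hdr hdr;
	uint64_t off;
};

/* The header is 8 bytes with no padding, so nothing after it moves. */
static_assert(sizeof(struct sketch_sqe_hdr) == 8, "hdr must be 8 bytes");
static_assert(offsetof(struct sketch_sqe_split, hdr.fd) ==
	      offsetof(struct sketch_sqe_flat, fd), "fd offset differs");
static_assert(offsetof(struct sketch_sqe_split, off) ==
	      offsetof(struct sketch_sqe_flat, off), "off offset differs");
static_assert(sizeof(struct sketch_sqe_split) ==
	      sizeof(struct sketch_sqe_flat), "total size differs");
```

The asserts fire at compile time, so a layout mismatch would show up as a
build failure rather than as ABI breakage at runtime.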

> But we really do have a bigger problem here, and that is ioprio is
> a field that is specific to the read and write commands and thus
> should not be in the generic header.  On the other hand the
> personality is.
> 
> So I'm not sure trying to retrofit this even makes all that much sense.
> 
> Maybe we should just define io_uring_sqe_hdr the way it makes
> sense:
> 
> struct io_uring_sqe_hdr {
> 	__u8	opcode;	
> 	__u8	flags;
> 	__u16	personality;
> 	__s32	fd;
> 	__u64	user_data;
> };
> 
> and use that for all new commands going forward while marking the
> old ones as legacy.
> 
> io_uring_cmd_sqe would then be:
> 
> struct io_uring_cmd_sqe {
> 	struct io_uring_sqe_hdr	hdr;
> 	__u32			ioc;
> 	__u32			len;
> 	__u8			data[40];
> };
> 
> for example.  Note the 32-bit opcode just like ioctl to avoid
> getting into too much trouble due to collisions.
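
As a quick sanity check on the sizes in the layout proposed above, here is
a sketch with userspace stand-in types. The sketch_* names are hypothetical,
and 'ioc' is assumed to be a 32-bit field, as the "32-bit opcode" remark
implies. The 16-byte header plus 4 + 4 + 40 bytes comes to 64 bytes, the
size of the existing io_uring_sqe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the proposed generic header. */
struct sketch_sqe_hdr_v2 {
	uint8_t  opcode;
	uint8_t  flags;
	uint16_t personality;
	int32_t  fd;
	uint64_t user_data;
};

/* Hypothetical stand-in for the proposed command SQE. */
struct sketch_cmd_sqe {
	struct sketch_sqe_hdr_v2 hdr;
	uint32_t ioc;		/* assumed 32-bit, ioctl-style command code */
	uint32_t len;
	uint8_t  data[40];
};

/* 1 + 1 + 2 + 4 + 8 = 16 bytes, naturally aligned, no padding. */
static_assert(sizeof(struct sketch_sqe_hdr_v2) == 16, "hdr must be 16 bytes");
/* The payload starts right after hdr + ioc + len. */
static_assert(offsetof(struct sketch_cmd_sqe, data) == 24, "data offset");
/* 16 + 4 + 4 + 40 = 64 bytes, matching the current SQE size. */
static_assert(sizeof(struct sketch_cmd_sqe) == 64, "cmd sqe must be 64 bytes");
```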

I was debating that with myself too. It's essentially turning
the existing io_uring_sqe into io_uring_sqe_v1 and then making a new
v2 one. That would impact _all_ commands, and we'd need some trickery
to have newly compiled stuff use v2 and have existing applications
continue to work with the v1 format. That's very different from having
only a single opcode (or only new opcodes) use a v2 format.

Looking into the feasibility of this. But if that is done, there are
other things that need to be factored in, as I'm not at all interested
in having a v3 down the line as well. And I'd need to be able to do this
seamlessly, both from an application point of view, and a performance
point of view (no stupid conversions inline).

Things that come up when something like this is on the table:

- Should flags be extended? We're almost out... It hasn't been an
  issue so far, but it seems a bit silly to go v2 and not at least leave
  a bit of room there. But that obviously comes at the cost of losing
  e.g. 8 bits somewhere else.

- Is u8 enough for the opcode? Again, we're nowhere near the limits
  here, but eventually multiplexing might be necessary.

That's just off the top of my head, probably other things to consider
too.

-- 
Jens Axboe



