[PATCH 1/8] io_uring: split up io_uring_sqe into hdr + main
Stefan Metzmacher
metze at samba.org
Fri Mar 19 11:20:05 GMT 2021
Am 18.03.21 um 19:40 schrieb Jens Axboe:
> On 3/17/21 11:34 PM, Christoph Hellwig wrote:
>>> @@ -14,11 +14,22 @@
>>> /*
>>> * IO submission data structure (Submission Queue Entry)
>>> */
>>> +struct io_uring_sqe_hdr {
>>> + __u8 opcode; /* type of operation for this sqe */
>>> + __u8 flags; /* IOSQE_ flags */
>>> + __u16 ioprio; /* ioprio for the request */
>>> + __s32 fd; /* file descriptor to do IO on */
>>> +};
>>> +
>>> struct io_uring_sqe {
>>> +#ifdef __KERNEL__
>>> + struct io_uring_sqe_hdr hdr;
>>> +#else
>>> __u8 opcode; /* type of operation for this sqe */
>>> __u8 flags; /* IOSQE_ flags */
>>> __u16 ioprio; /* ioprio for the request */
>>> __s32 fd; /* file descriptor to do IO on */
>>> +#endif
>>> union {
>>> __u64 off; /* offset into file */
>>> __u64 addr2;
>>
>> Please don't do that ifdef __KERNEL__ mess. We never guaranteed
>> userspace API compatbility, just ABI compatibility.
>
> Right, but I'm the one that has to deal with the fallout. For the
> in-kernel one I can skip the __KERNEL__ part, and the layout is the
> same anyway.
>
>> But we really do have a biger problem here, and that is ioprio is
>> a field that is specific to the read and write commands and thus
>> should not be in the generic header. On the other hand the
>> personality is.
>>
>> So I'm not sure trying to retrofit this even makes all that much sense.
>>
>> Maybe we should just define io_uring_sqe_hdr the way it makes
>> sense:
>>
>> struct io_uring_sqe_hdr {
>> __u8 opcode;
>> __u8 flags;
>> __u16 personality;
>> __s32 fd;
>> __u64 user_data;
>> };
>>
>> and use that for all new commands going forward while marking the
>> old ones as legacy.
>>
>> io_uring_cmd_sqe would then be:
>>
>> struct io_uring_cmd_sqe {
>> struct io_uring_sqe_hdr hdr;
>> __u33 ioc;
>> __u32 len;
>> __u8 data[40];
>> };
>>
>> for example. Note the 32-bit opcode just like ioctl to avoid
>> getting into too much trouble due to collisions.
>
> I was debating that with myself too, it's essentially making
> the existing io_uring_sqe into io_uring_sqe_v1 and then making a new
> v2 one. That would impact _all_ commands, and we'd need some trickery
> to have newly compiled stuff use v2 and have existing applications
> continue to work with the v1 format. That's very different from having
> a single (or new) opcodes use a v2 format, effectively.
I think we should use v0 and v1.
I think io_init_req and io_prep_req could be merged into an io_init_prep_req()
which could then do:
switch (ctx->sqe_version)
case 0:
return io_init_prep_req_v0();
case 1:
return io_init_prep_req_v1();
default:
return -EINVAL;
The kernel would return IORING_FEAT_SQE_V1
and set ctx->sqe_version = 1 if IORING_SETUP_SQE_V1 was passed from
the caller.
liburing whould then need to pass struct io_uring *ring to
io_uring_prep_*(), io_uring_sqe_set_flags() and io_uring_sqe_set_data().
in order to use struct io_uring->sq.sqe_version to alter the behavior.
(I think we should also have a io_uring_sqe_set_personality() helper).
static inline void io_uring_prep_nop(struct io_uring *ring, struct io_uring_sqe *sqe)
{
struct io_uring_sqe_common *nop = &sqe->common;
if (ring->sq.sqe_version == 0)
io_uring_prep_rw_v0(IORING_OP_NOP, sqe, -1, NULL, 0, 0);
else
*nop = (struct io_uring_sqe_common) {
.hdr = {
.opcode = IORING_OP_NOP,
},
};
}
For new features the prep functions would return a pointer to
the specific structure (see also below).
static inline struct io_uring_sqe_file_cmd *
io_uring_prep_file_cmd(struct io_uring *ring, struct io_uring_sqe *sqe, int fd, uint32_t cmd_opcode)
{
struct io_uring_sqe_file_cmd *file_cmd = &sqe->file_cmd;
*file_cmd = (struct io_uring_sqe_file_cmd) {
.hdr = {
.opcode = IORING_OP_FILE_CMD,
},
.fd = fd,
.cmd_opcode = cmd_opcode,
}
return file_cmd;
}
The application could then also check for a n
In order to test v1 it should have a way to skip IORING_FEAT_SQE_V2
and all existing tests could have a helper function to toggle that
based on an environment variable, so that make runtests could run
each test in both modes.
> Looking into the feasibility of this. But if that is done, there are
> other things that need to be factored in, as I'm not at all interested
> in having a v3 down the line as well. And I'd need to be able to do this
> seamlessly, both from an application point of view, and a performance
> point of view (no stupid conversions inline).
> Things that come up when something like this is on the table
>
> - Should flags be extended? We're almost out... It hasn't been an
> issue so far, but seems a bit silly to go v2 and not at least leave
> a bit of room there. But obviously comes at a cost of losing eg 8
> bits somewhere else.
>
> - Is u8 enough for the opcode? Again, we're nowhere near the limits
> here, but eventually multiplexing might be necessary.
>
> That's just off the top of my head, probably other things to consider
> too.
What about using something like this:
struct io_uring_sqe_hdr {
__u64 user_data;
__u16 personality;
__u16 opcode;
__u32 flags;
};
I moved __s32 fd out of it as not all commands need it and some need more than
one. So I guess it's easier to have them in the per opcode structure.
and the io_file_get() should better be done in the per opcode prep_vX function.
struct io_uring_sqe_common {
struct io_uring_sqe_hdr hdr;
__u8 __reserved[48];
};
struct io_uring_sqe_rw_common {
struct io_uring_sqe_hdr hdr;
__s32 fd; /* file descriptor to do IO on */
__u32 len; /* buffer size or number of iovecs */
__u64 off; /* offset into file */
__u64 addr; /* pointer to buffer or iovecs */
__kernel_rwf_t rw_flags;
__u16 ioprio; /* ioprio for the request */
__u16 buf_index; /* index into fixed buffers, if used */
__u8 __reserved[16];
};
struct io_uring_sqe_file_cmd {
struct io_uring_sqe_hdr hdr;
__s32 fd; /* file descriptor to do IO on */
__u32 cmd_opcode; /* file specific command */
__u8 cmd_data[40]; /* command spefic data */
};
struct io_uring_sqe {
union {
struct io_uring_sqe_common common;
struct io_uring_sqe_common nop;
struct io_uring_sqe_rw_common readv;
struct io_uring_sqe_rw_common writev;
struct io_uring_sqe_rw_common read_fixed;
struct io_uring_sqe_rw_common write_fixed;
struct io_uring_sqe_file_cmd file_cmd;
};
};
Each _opcode_prep() function would then check hdr.flags for unsupported flags
and __reserved for zeros. Instead of having a global io_op_defs[] array
the _opcode_prep() function would have a static const definition for the opcode
and lease req->op_def (which would be const struct io_op_def *op_def);
Does that sound useful in anyway?
metze
More information about the Linux-nvme
mailing list