[PATCH v4 1/5] fs,io_uring: add infrastructure for uring-cmd

Jens Axboe axboe at kernel.dk
Thu May 5 09:17:39 PDT 2022


On 5/5/22 12:06 AM, Kanchan Joshi wrote:
> +static int io_uring_cmd_prep(struct io_kiocb *req,
> +			     const struct io_uring_sqe *sqe)
> +{
> +	struct io_uring_cmd *ioucmd = &req->uring_cmd;
> +	struct io_ring_ctx *ctx = req->ctx;
> +
> +	if (ctx->flags & IORING_SETUP_IOPOLL)
> +		return -EOPNOTSUPP;
> +	/* do not support uring-cmd without big SQE/CQE */
> +	if (!(ctx->flags & IORING_SETUP_SQE128))
> +		return -EOPNOTSUPP;
> +	if (!(ctx->flags & IORING_SETUP_CQE32))
> +		return -EOPNOTSUPP;
> +	if (sqe->ioprio || sqe->rw_flags)
> +		return -EINVAL;
> +	ioucmd->cmd = sqe->cmd;
> +	ioucmd->cmd_op = READ_ONCE(sqe->cmd_op);
> +	return 0;
> +}

While looking at the other suggested changes, I noticed a more
fundamental issue with the passthrough support. For any other command,
SQE contents are stable once prep has been done. The above does that for
the basic fields, but this case is special because the lower-level
command itself resides in the SQE.

For cases where the command needs deferral, it's problematic. There are
two main cases where this can happen:

- The issue attempt yields -EAGAIN (we ran out of requests, etc.). If you
  look at other commands, when they have data that doesn't fit in the
  io_kiocb itself, they need to allocate room for that data and keep it
  persistent across issue attempts.

- Deferral is specified by the application, using e.g. IOSQE_IO_LINK or
  IOSQE_ASYNC (see the sketch below).
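
To make the second case concrete, here's a minimal userspace sketch
using liburing. A NOP stands in for the passthrough command, and
submit_deferred() is just an illustrative name; with IOSQE_ASYNC set,
the kernel punts issue to io-wq, so the request runs after
io_uring_submit() has already returned and the SQE slot is free for
reuse:

	#include <liburing.h>

	/* Sketch: force deferral of a request past the initial submit.
	 * A NOP stands in for the passthrough command here. */
	static int submit_deferred(struct io_uring *ring)
	{
		struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

		if (!sqe)
			return -EAGAIN;
		io_uring_prep_nop(sqe);
		sqe->flags |= IOSQE_ASYNC;	/* or IOSQE_IO_LINK on a chain */
		return io_uring_submit(ring);
	}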

We're totally missing support for both of these cases. Consider the case
where the ring is set up with an SQ size of 1. You prep a passthrough
command (request A) and issue it with io_uring_submit(). Due to one of
the two above-mentioned conditions, the internal request is deferred.
Either it was sent to ->uring_cmd() but we got -EAGAIN, or it was
deferred even before that happened. The application doesn't know this
happened; it gets another SQE to submit a new request (request B), fills
it in, and calls io_uring_submit(). Since we only have one SQE available
in that ring, when request A gets re-issued, it's now happily reading
SQE contents from command B. Oops.

This is why prep handlers are the only ones that get an sqe passed to
them. They are supposed to ensure that we no longer read from the SQE
past that point. Applications can always rely on the fact that once
io_uring_submit() has been done, which consumes the SQE in the SQ ring,
no further reads are done from that SQE.
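
For reference, every prep handler follows the same pattern; io_foo here
is hypothetical, just to illustrate the contract:

	/* Sketch: a prep handler copies everything the command needs
	 * out of the SQE and into the io_kiocb. */
	static int io_foo_prep(struct io_kiocb *req,
			       const struct io_uring_sqe *sqe)
	{
		struct io_foo *foo = &req->foo;

		foo->addr = READ_ONCE(sqe->addr);
		foo->len = READ_ONCE(sqe->len);
		/* from here on, the SQE slot may be recycled */
		return 0;
	}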

IOW, we need io_req_prep_async() handling for URING_CMD, which needs to
allocate the full 80 bytes of command payload (the 128-byte big SQE
minus the standard 48-byte header) and copy them over. Subsequent issue
attempts will then use that memory rather than the SQE parts. Just need
to find a sane way to do that so we don't make the usual prep + direct
issue path any slower.
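
Something along these lines, as a rough sketch; the async_data plumbing
is illustrative, not a final implementation:

	/* Sketch: copy the command payload out of the big SQE into
	 * allocated, persistent memory before any deferral. */
	static int io_uring_cmd_prep_async(struct io_kiocb *req)
	{
		struct io_uring_cmd *ioucmd = &req->uring_cmd;
		/* 128-byte big SQE minus the standard 48-byte header */
		size_t cmd_size = 80;

		if (!req->async_data && io_alloc_async_data(req))
			return -ENOMEM;
		memcpy(req->async_data, ioucmd->cmd, cmd_size);
		/* later issue attempts read the stable copy */
		ioucmd->cmd = req->async_data;
		return 0;
	}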

-- 
Jens Axboe



