[PATCH] nvmet-tcp: Fix a possible sporadic response drops in weakly ordered arch

Thu Feb 20 09:17:19 PST 2025

On Thu, Feb 20, 2025 at 3:56 AM Meir Elisha <meir.elisha at volumez.com> wrote:
>
> The order in which queue->cmd and rcv_state are updated is crucial.
> If these assignments are reordered by the compiler, the worker might not
> get queued in nvmet_tcp_queue_response(), hanging the IO. To enforce the
> correct ordering, set rcv_state using smp_store_release().
>
> Fixes: bdaf13279192 ("nvmet-tcp: fix a segmentation fault during io parsing error")
> Signed-off-by: Meir Elisha <meir.elisha at volumez.com>
> ---
>  drivers/nvme/target/tcp.c | 15 +++++++++++----
>  1 file changed, 11 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
> index 7c51c2a8c109..4021468c8857 100644
> --- a/drivers/nvme/target/tcp.c
> +++ b/drivers/nvme/target/tcp.c
> @@ -571,10 +571,16 @@ static void nvmet_tcp_queue_response(struct nvmet_req *req)
>         struct nvmet_tcp_cmd *cmd =
>                 container_of(req, struct nvmet_tcp_cmd, req);
>         struct nvmet_tcp_queue  *queue = cmd->queue;
> +       enum nvmet_tcp_recv_state queue_state = READ_ONCE(queue->state);

Why did this change from queue->rcv_state to queue->state? Doesn't
look like enum nvmet_tcp_recv_state is the correct type for
queue->state either.

> +       /*
> +        * Use an acquire load to ensure that any updates to queue->state are visible
> +        * before loading queue->cmd.
> +        */
> +       struct nvmet_tcp_cmd *queue_cmd = smp_load_acquire(&queue->cmd);

Acquire ordering only prevents memory operations that come *after* the
acquire load from being reordered *before* it. It does not prevent
earlier operations (such as the load of queue->state) from being
reordered after the acquire load. Additionally, an acquire load must
pair with a release store *to the same variable* to have any effect.
But the release store is to queue->rcv_state, not queue->cmd.

Correct uses of release-acquire ordering generally look something like this:

Thread 1:
  Non-atomic store to A
  Release-ordering store to B

Thread 2:
  Acquire-ordering load from B
  Non-atomic load from A

This ensures that if thread 2 observes the new value thread 1 stored
in B, it will also observe the new value in A.
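
For illustration only, here is a minimal userspace sketch of that
pattern using C11 atomics and pthreads rather than the kernel
primitives. The names a, b, writer, reader and run_message_passing are
hypothetical and do not come from the patch; in kernel code the
corresponding operations would be smp_store_release() and
smp_load_acquire() on the same variable.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static int a;          /* plain (non-atomic) payload: "A" in the pattern */
static atomic_int b;   /* synchronization flag: "B" in the pattern */

/* Thread 1: non-atomic store to A, then release store to B. */
static void *writer(void *arg)
{
	(void)arg;
	a = 42;
	atomic_store_explicit(&b, 1, memory_order_release);
	return NULL;
}

/* Thread 2: acquire load from B, then non-atomic load from A. */
static void *reader(void *arg)
{
	int *seen = arg;

	/* Spin until the writer's release store to B is observed. */
	while (atomic_load_explicit(&b, memory_order_acquire) != 1)
		;
	/* Ordered after the acquire load, so this sees the store to A. */
	*seen = a;
	return NULL;
}

/* Runs one writer and one reader; returns the value of A the reader saw. */
static int run_message_passing(void)
{
	pthread_t w, r;
	int seen = -1;

	a = 0;
	atomic_store(&b, 0);
	pthread_create(&r, NULL, reader, &seen);
	pthread_create(&w, NULL, writer, NULL);
	pthread_join(w, NULL);
	pthread_join(r, NULL);
	return seen;
}
```

Because the reader only loads A after its acquire load has observed the
value written by the release store, run_message_passing() returns 42.
If the reader instead loaded A before the acquire load, or did its
acquire load on a different variable than the one the writer
release-stores to, no such guarantee would hold.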

>         struct nvme_sgl_desc *sgl;
>         u32 len;
>
> -       if (unlikely(cmd == queue->cmd)) {
> +       if (unlikely(cmd == queue_cmd)) {
>                 sgl = &cmd->req.cmd->common.dptr.sgl;
>                 len = le32_to_cpu(sgl->length);
>
> @@ -583,7 +589,7 @@ static void nvmet_tcp_queue_response(struct nvmet_req *req)
>                  * Avoid using helpers, this might happen before
>                  * nvmet_req_init is completed.
>                  */
> -               if (queue->rcv_state == NVMET_TCP_RECV_PDU &&
> +               if (queue_state == NVMET_TCP_RECV_PDU &&
>                     len && len <= cmd->req.port->inline_data_size &&
>                     nvme_is_write(cmd->req.cmd))
>                         return;
> @@ -847,8 +853,9 @@ static void nvmet_prepare_receive_pdu(struct nvmet_tcp_queue *queue)
>  {
>         queue->offset = 0;
>         queue->left = sizeof(struct nvme_tcp_hdr);
> -       queue->cmd = NULL;
> -       queue->rcv_state = NVMET_TCP_RECV_PDU;
> +       WRITE_ONCE(queue->cmd, NULL);
> +       /* Ensure rcv_state is visible only after queue->cmd is set */
> +       smp_store_release(&queue->rcv_state, NVMET_TCP_RECV_PDU);

Is this also needed in the other places updating queue->rcv_state and
queue->cmd, e.g. nvmet_tcp_handle_h2c_data_pdu()?

Best,
Caleb

>  }
>
>  static void nvmet_tcp_free_crypto(struct nvmet_tcp_queue *queue)
> --
> 2.34.1
>
> This ordering is critical on weakly ordered architectures (such as ARM)
> so that any observer which sees the new rcv_state is guaranteed to also
> see the updated cmd. Without this guarantee (i.e., if the two stores were
> reordered), a parallel context might see the new state while queue->cmd
> still holds a stale value. This could cause the inline-data check to
> return early and ultimately hang the IO.
> Additionally, I reviewed the assembly code for ARM and confirmed that
> the instructions were reordered (unlike x86), reinforcing the need for
> this change.
>
> This scenario was encountered during fio testing, which involved
> running 2 min of 4K random writes using an ARM-based machine as the
> target. We observed hanging I/O typically after 10-20 iterations.
>
> fio config used:
> [global]
> ioengine=libaio
> max_latency=45s
> end_fsync=1
> create_serialize=0
> size=3200m
> directory=/mnt/volumez/vol0
> ramp_time=30
> lat_percentiles=1
> direct=1
> filename_format=fiodata.$jobnum
> verify_dump=1
> numjobs=16
> fallocate=native
> stonewall=1
> group_reporting=1
> file_service_type=random
> iodepth=16
> runtime=5m
> time_based=1
> [random_0_100_4k]
> bs=4k
> rw=randwrite
>