[PATCH] nvmet-tcp: Fix possible sporadic response drops in weakly ordered arch

Meir Elisha meir.elisha at volumez.com
Thu Feb 20 10:28:58 PST 2025


Hi Caleb,

Thanks for the review. I'll resend the patch after testing.

On 20/02/2025 19:17, Caleb Sander Mateos wrote:
> On Thu, Feb 20, 2025 at 3:56 AM Meir Elisha <meir.elisha at volumez.com> wrote:
>>
>> The order in which queue->cmd and rcv_state are updated is crucial.
>> If these assignments are reordered by the compiler, the worker might not
>> get queued in nvmet_tcp_queue_response(), hanging the IO. To enforce the
>> correct ordering, set rcv_state using smp_store_release().
>>
>> Fixes: bdaf13279192 ("nvmet-tcp: fix a segmentation fault during io parsing error")
>> Signed-off-by: Meir Elisha <meir.elisha at volumez.com>
>> ---
>>  drivers/nvme/target/tcp.c | 15 +++++++++++----
>>  1 file changed, 11 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
>> index 7c51c2a8c109..4021468c8857 100644
>> --- a/drivers/nvme/target/tcp.c
>> +++ b/drivers/nvme/target/tcp.c
>> @@ -571,10 +571,16 @@ static void nvmet_tcp_queue_response(struct nvmet_req *req)
>>         struct nvmet_tcp_cmd *cmd =
>>                 container_of(req, struct nvmet_tcp_cmd, req);
>>         struct nvmet_tcp_queue  *queue = cmd->queue;
>> +       enum nvmet_tcp_recv_state queue_state = READ_ONCE(queue->state);
> 
> Why did this change from queue->rcv_state to queue->state? Doesn't
> look like enum nvmet_tcp_recv_state is the correct type for
> queue->state either.
That was a mistake; it should be queue->rcv_state.
> 
>> +       /*
>> +        * Use an acquire load to ensure that any updates to queue->state are visible
>> +        * before loading queue->cmd.
>> +        */
>> +       struct nvmet_tcp_cmd *queue_cmd = smp_load_acquire(&queue->cmd);
> 
> Acquire ordering prevents memory operations that come *after* from
> being reordered *before*. It does not prevent earlier operations (such
> as the load of queue->state) from being reordered after the acquire
> load. Additionally, an acquire must pair with a release store *on the
> same value* to have any effect. But the release store is to
> queue->rcv_state, not queue->cmd.
> 
> Correct uses of release-acquire ordering generally look something like this:
> Thread 1:
> Non-atomic store to A
> Release-ordering store to B
> 
> Thread 2:
> Acquire-ordering load from B
> Non-atomic load from A
> 
> This ensures that if thread 2 observes the new value thread 1 stored
> in B, it will also observe the new value in A.

Thanks for noticing that.
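
To make sure I apply the pairing correctly in v2, this is roughly the
shape I intend to test (a sketch only, not the final patch; everything
around it stays as in the current code):

Writer side, nvmet_prepare_receive_pdu():

	queue->offset = 0;
	queue->left = sizeof(struct nvme_tcp_hdr);
	WRITE_ONCE(queue->cmd, NULL);
	/* Pairs with smp_load_acquire() in nvmet_tcp_queue_response() */
	smp_store_release(&queue->rcv_state, NVMET_TCP_RECV_PDU);

Reader side, nvmet_tcp_queue_response():

	/*
	 * Acquire-load rcv_state before loading queue->cmd: if we observe
	 * the NVMET_TCP_RECV_PDU stored above, we must also observe
	 * queue->cmd == NULL, so the inline-data early return cannot be
	 * taken against a stale queue->cmd.
	 */
	enum nvmet_tcp_recv_state queue_state = smp_load_acquire(&queue->rcv_state);
	struct nvmet_tcp_cmd *queue_cmd = READ_ONCE(queue->cmd);

	if (unlikely(cmd == queue_cmd)) {
		...
		if (queue_state == NVMET_TCP_RECV_PDU &&
		    len && len <= cmd->req.port->inline_data_size &&
		    nvme_is_write(cmd->req.cmd))
			return;
	}
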
> 
>>         struct nvme_sgl_desc *sgl;
>>         u32 len;
>>
>> -       if (unlikely(cmd == queue->cmd)) {
>> +       if (unlikely(cmd == queue_cmd)) {
>>                 sgl = &cmd->req.cmd->common.dptr.sgl;
>>                 len = le32_to_cpu(sgl->length);
>>
>> @@ -583,7 +589,7 @@ static void nvmet_tcp_queue_response(struct nvmet_req *req)
>>                  * Avoid using helpers, this might happen before
>>                  * nvmet_req_init is completed.
>>                  */
>> -               if (queue->rcv_state == NVMET_TCP_RECV_PDU &&
>> +               if (queue_state == NVMET_TCP_RECV_PDU &&
>>                     len && len <= cmd->req.port->inline_data_size &&
>>                     nvme_is_write(cmd->req.cmd))
>>                         return;
>> @@ -847,8 +853,9 @@ static void nvmet_prepare_receive_pdu(struct nvmet_tcp_queue *queue)
>>  {
>>         queue->offset = 0;
>>         queue->left = sizeof(struct nvme_tcp_hdr);
>> -       queue->cmd = NULL;
>> -       queue->rcv_state = NVMET_TCP_RECV_PDU;
>> +       WRITE_ONCE(queue->cmd, NULL);
>> +       /* Ensure rcv_state is visible only after queue->cmd is set */
>> +       smp_store_release(&queue->rcv_state, NVMET_TCP_RECV_PDU);
> 
> Is this also needed in the other places updating queue->rcv_state and
> queue->cmd, e.g. nvmet_tcp_handle_h2c_data_pdu()?
nvmet_tcp_handle_h2c_data_pdu() doesn't set rcv_state to NVMET_TCP_RECV_PDU,
so the other context won't take the early return in nvmet_tcp_queue_response();
I don't think we need it there.
> 
> Best,
> Caleb
> 
>>  }
>>
>>  static void nvmet_tcp_free_crypto(struct nvmet_tcp_queue *queue)
>> --
>> 2.34.1
>>
>> This ordering is critical on weakly ordered architectures (such as ARM)
>> so that any observer which sees the new rcv_state is guaranteed to also
>> see the updated cmd. Without this guarantee (i.e. if the two stores were
>> reordered), a parallel context might see the new state while queue->cmd
>> still holds a stale value. This could cause the inline-data check to
>> return early and ultimately hang the IO.
>> Additionally, I reviewed the assembly code for ARM and confirmed that
>> the instructions were reordered (unlike x86), reinforcing the need for
>> this change.
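
To spell out the interleaving we hit (simplified; this assumes the two
stores in nvmet_prepare_receive_pdu() were emitted in reverse order, as
observed in the ARM disassembly):

	CPU0 (nvmet_tcp_io_work)             CPU1 (backend completion)
	------------------------             -------------------------
	nvmet_prepare_receive_pdu():         nvmet_tcp_queue_response():
	  rcv_state = NVMET_TCP_RECV_PDU
	                                       reads rcv_state == NVMET_TCP_RECV_PDU
	                                       reads queue->cmd == cmd   (stale)
	                                       return;  /* response never queued */
	  queue->cmd = NULL
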
>>
>> This scenario was encountered during fio testing, which involved
>> running 2 min of 4K random writes using an ARM-based machine as the
>> target. We observed hanging I/O typically after 10-20 iterations.
>>
>> fio config used:
>> [global]
>> ioengine=libaio
>> max_latency=45s
>> end_fsync=1
>> create_serialize=0
>> size=3200m
>> directory=/mnt/volumez/vol0
>> ramp_time=30
>> lat_percentiles=1
>> direct=1
>> filename_format=fiodata.$jobnum
>> verify_dump=1
>> numjobs=16
>> fallocate=native
>> stonewall=1
>> group_reporting=1
>> file_service_type=random
>> iodepth=16
>> runtime=5m
>> time_based=1
>> [random_0_100_4k]
>> bs=4k
>> rw=randwrite
>>


