Request timeout seen with NVMEoF TCP
Potnuri Bharat Teja
bharat at chelsio.com
Wed Dec 16 00:51:23 EST 2020
On Monday, December 14, 2020 at 17:53:44 -0800, Sagi Grimberg wrote:
>
> > Hey Potnuri,
> >
> > Have you observed this further?
> >
> > I'd think that if io_work reschedules itself when it races
> > with the direct send path this should not happen, but there may be
> > a different race going on here. Adding Samuel, who saw
> > a similar phenomenon.
>
> I think we still have a race here with the following:
> 1. queue_rq sends h2cdata PDU (no data)
> 2. host receives r2t - prepares data PDU to send and schedules io_work
> 3. queue_rq sends another h2cdata PDU - ends up sending (2) because it was
> queued before it
> 4. io_work starts, loops but is never able to acquire the send_mutex -
> eventually it just ends (doesn't requeue; see the sketch after this list)
> 5. (3) completes, now nothing will send (2)
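>
> For reference, the pre-patch io_work loop in drivers/nvme/host/tcp.c is
> roughly the following (paraphrased and trimmed, so treat it as a sketch
> rather than the exact source). Note that if mutex_trylock() fails on
> every iteration and there is nothing to receive, pending stays false
> and the work exits without requeueing itself:
>
> 	static void nvme_tcp_io_work(struct work_struct *w)
> 	{
> 		struct nvme_tcp_queue *queue =
> 			container_of(w, struct nvme_tcp_queue, io_work);
> 		unsigned long deadline = jiffies + msecs_to_jiffies(1);
>
> 		do {
> 			bool pending = false;
> 			int result;
>
> 			if (mutex_trylock(&queue->send_mutex)) {
> 				result = nvme_tcp_try_send(queue);
> 				mutex_unlock(&queue->send_mutex);
> 				if (result > 0)
> 					pending = true;
> 				else if (unlikely(result < 0))
> 					break;
> 			}
> 			/* a trylock failure leaves pending untouched */
>
> 			result = nvme_tcp_try_recv(queue);
> 			if (result > 0)
> 				pending = true;
> 			else if (unlikely(result < 0))
> 				return;
>
> 			if (!pending)
> 				return;	/* step 4 above: no requeue */
> 		} while (!time_after(jiffies, deadline));
>
> 		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
> 	}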
>
> We could schedule io_work from the direct send path, but that is less
> efficient than simply draining the send queue in the direct send path;
> if not everything was sent, the write_space callback will trigger
> io_work again.
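>
> For context, the write_space callback referenced here looks roughly
> like this in drivers/nvme/host/tcp.c (paraphrased): once the socket
> has room to send again, it reschedules io_work on the queue's CPU:
>
> 	static void nvme_tcp_write_space(struct sock *sk)
> 	{
> 		struct nvme_tcp_queue *queue;
>
> 		read_lock_bh(&sk->sk_callback_lock);
> 		queue = sk->sk_user_data;
> 		if (likely(queue && sk_stream_is_writeable(sk))) {
> 			clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
> 			queue_work_on(queue->io_cpu, nvme_tcp_wq,
> 					&queue->io_work);
> 		}
> 		read_unlock_bh(&sk->sk_callback_lock);
> 	}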
>
> Potnuri, does this patch solve what you are seeing?
Hi Sagi,
The patch below works fine. I have had it running all night without any issues.
Thanks.
> --
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 1ba659927442..1b4e25624ba4 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -262,6 +262,16 @@ static inline void nvme_tcp_advance_req(struct nvme_tcp_request *req,
>  	}
>  }
>  
> +static inline void nvme_tcp_send_all(struct nvme_tcp_queue *queue)
> +{
> +	int ret;
> +
> +	/* drain the send queue as much as we can... */
> +	do {
> +		ret = nvme_tcp_try_send(queue);
> +	} while (ret > 0);
> +}
> +
>  static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
>  		bool sync, bool last)
>  {
> @@ -279,7 +289,7 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
>  	if (queue->io_cpu == smp_processor_id() &&
>  	    sync && empty && mutex_trylock(&queue->send_mutex)) {
>  		queue->more_requests = !last;
> -		nvme_tcp_try_send(queue);
> +		nvme_tcp_send_all(queue);
>  		queue->more_requests = false;
>  		mutex_unlock(&queue->send_mutex);
>  	} else if (last) {
> @@ -1122,6 +1132,14 @@ static void nvme_tcp_io_work(struct work_struct *w)
>  				pending = true;
>  			else if (unlikely(result < 0))
>  				break;
> +		} else {
> +			/*
> +			 * submission path is sending, we need to
> +			 * continue or resched because the submission
> +			 * path direct send is not concerned with
> +			 * rescheduling...
> +			 */
> +			pending = true;
>  		}
>  
>  		result = nvme_tcp_try_recv(queue);
> --
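>
> For completeness, the do/while in nvme_tcp_send_all() terminates
> because of nvme_tcp_try_send()'s return convention: heavily trimmed,
> it is roughly the following, returning 1 while it makes progress on a
> request and 0 once the send list is empty:
>
> 	static int nvme_tcp_try_send(struct nvme_tcp_queue *queue)
> 	{
> 		int ret = 1;
>
> 		if (!queue->request) {
> 			queue->request = nvme_tcp_fetch_request(queue);
> 			if (!queue->request)
> 				return 0;	/* nothing left to send */
> 		}
>
> 		/*
> 		 * ... send the PDU/data for queue->request; ret becomes
> 		 * 0 or negative when the socket blocks or errors ...
> 		 */
> 		return ret;
> 	}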