nvme tcp receive errors

Keith Busch kbusch at kernel.org
Mon May 3 19:51:21 BST 2021


On Wed, Apr 28, 2021 at 09:52:37PM -0700, Sagi Grimberg wrote:
> 
> > The driver tracepoints captured millions of I/Os where everything
> > happened as expected, so I really think something got confused and
> > mucked with the wrong request. I've added more tracepoints to
> > increase visibility because I frankly couldn't see how that could
> > happen from code inspection alone. We will also incorporate your
> > patch below for the next recreate.
> 
> Keith, does the issue still happen after eliminating the network send
> from .queue_rq()?

This patch resolved the observed r2t issues over the weekend test run,
which ran much longer than the test ever could previously. I'm happy
we're narrowing this down, but I'm not seeing how this addresses the
problem. It looks like the mutex already single-threads the critical
parts, but maybe I'm missing something. Any ideas?
 
> --
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index eb1feaacd11a..b3fafa536345 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -288,7 +288,7 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
>          * directly, otherwise queue io_work. Also, only do that if we
>          * are on the same cpu, so we don't introduce contention.
>          */
> -       if (queue->io_cpu == __smp_processor_id() &&
> +       if (0 && queue->io_cpu == __smp_processor_id() &&
>             sync && empty && mutex_trylock(&queue->send_mutex)) {
>                 queue->more_requests = !last;
>                 nvme_tcp_send_all(queue);
> --
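
If I'm reading it right, with the inline-send branch compiled out the
queueing side reduces to something like the sketch below (paraphrased
from the 5.12-era nvme_tcp_queue_request(); names follow
drivers/nvme/host/tcp.c of that vintage, but this is not a verbatim
copy), so every send is funneled through io_work on queue->io_cpu:

static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
		bool sync, bool last)
{
	struct nvme_tcp_queue *queue = req->queue;

	/* Publish the request on the lockless list; with the direct
	 * send disabled, nothing consumes it inline anymore.
	 */
	llist_add(&req->lentry, &queue->req_list);

	/*
	 * The trylock/send-inline fast path is never taken now, so the
	 * only sender left is io_work, which always runs on
	 * queue->io_cpu and holds send_mutex for the whole send loop.
	 */
	if (last)
		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
}

Either way the socket is only ever written under send_mutex, which is
why I don't see what race this closes.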


