Command timeouts with NVMe TCP kernel driver

Sagi Grimberg sagi at grimberg.me
Wed Sep 1 06:21:49 PDT 2021


>>> Hi all,
>>>
>>> I'm experiencing command timeouts with recent versions of the nvme-tcp driver.
>>> I have a setup with a VM running 5.8 which is fine, but the same test load
>>> (100% read, 16K blocks) on a VM running 5.11 or later, including 5.14-rc7,
>>> shows the issue. The initiator complains as follows:
>>>
>>> Aug 30 14:58:05 centos7 kernel: nvme nvme0: queue 7: timeout request 0x10 type 4
>>> Aug 30 14:58:05 centos7 kernel: nvme nvme0: starting error recovery
>>>
>>> My target is a Kalray target, but I have no trace of any outstanding commands
>>> when this situation occurs. Quite the opposite: The Kalray board observes that
>>> the initiator stops sending new requests on the queue some time before this
>>> command times out.
>>>
>>> I don't have any similar issues using SPDK as an initiator. I made the following
>>> modification to nvme-tcp code and my problem has disappeared:
>>>
>>> --- a/drivers/nvme/host/tcp.c
>>> +++ b/drivers/nvme/host/tcp.c
>>> @@ -287,6 +287,7 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
>>>  	 * directly, otherwise queue io_work. Also, only do that if we
>>>  	 * are on the same cpu, so we don't introduce contention.
>>>  	 */
>>> +#if 0
>>>  	if (queue->io_cpu == raw_smp_processor_id() &&
>>>  	    sync && empty && mutex_trylock(&queue->send_mutex)) {
>>>  		queue->more_requests = !last;
>>> @@ -296,6 +297,9 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
>>>  	} else if (last) {
>>>  		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>>>  	}
>>> +#else
>>> +	queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>>> +#endif
>>>  }
>>>
>>> To be honest, this bit of code has always bothered me: I don't understand how we
>>> can guarantee that the thread doesn't change CPU between the call to
>>> raw_smp_processor_id() and the trylock. I assume that even if this does occur,
>>> the fact that we hold queue->send_mutex is supposed to make sure that there
>>> are no atomicity issues, but I'm concerned about the logic that tries to
>>> determine if there is any more pending work and reschedule io_work if and only
>>> if there is something to do. I can't pinpoint an issue from just reading the
>>> code though..
>>>
>>> Does anyone have any thoughts on this? Is my patch proof that there is an issue
>>> in this code, or am I likely to be changing the program flow so significantly
>>> that we can't conclude? Are there any open issues wrt command timeouts at the
>>> moment?
>>
>> I've seen similar, but I thought we'd fixed all those issues. Your poc
>> patch indicates to me that there is still at least another condition
>> that's not being considered. I don't think the possibility of CPU
>> rescheduling is the culprit, though.
>>
>> One thing that looks problematic is if 'bd.last' is true but
>> nvme_tcp_queue_rq() fails early after enqueuing previous commands in the
>> sequence. That could cause the io_work to never kick. There should have
>> been something else in dmesg indicating that error, though, so I'll look
>> for some other possibility.
>>
> 
> Thanks for your help on this Keith. I think I have a better idea about what is happening now. First of all, I have a large number of devices exposed over NVMe-TCP: I'm connecting to 4 remote NVMe devices each of which is multi-namespace, so I have a total of 37 block devices at once.

That shouldn't really matter; what matters is the number of controllers
that you have. You have 4 controllers, but how many queues are there
per controller? And how many cores does the VM have?

> Then I run a userspace benchmark that's running 8 threads per block device and generating IO to each of them. I'm not sure how the multi-queue block devices are layered over the X NVMe queues per device, but however it works, my system is on its knees. I have been timestamping the time spent by NVMe-TCP threads inside kernel_sendpage() within nvme_tcp_try_send_cmd_pdu() when called from nvme_tcp_queue_rq(), and I'm getting some ridiculous outliers: up to 60 seconds. I am confident this is what is responsible for my timeouts. If I increase nvme_io_timeout to 300 seconds, my test case passes.

That is strange indeed. What are you taking the timing on? Just
kernel_sendpage?
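
For reference, the kind of instrumentation I have in mind is something
like the sketch below (hypothetical and untested, wrapping only the
kernel_sendpage() call in nvme_tcp_try_send_cmd_pdu()); if you are
timing a larger section, that changes the picture:

	/* hypothetical timing sketch, not something I'd merge */
	ktime_t start = ktime_get();

	ret = kernel_sendpage(queue->sock, virt_to_page(pdu),
			offset_in_page(pdu) + req->offset, len, flags);

	if (ktime_ms_delta(ktime_get(), start) > 1000)
		pr_warn("nvme-tcp: queue %d sendpage took %lld ms\n",
			nvme_tcp_queue_id(queue),
			ktime_ms_delta(ktime_get(), start));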

> What's interesting is my quick patch that removed the direct call to kernel_sendpage via queue_rq() makes the system behave much better: no more outliers and no timeouts. I don't know much about the kernel network stack or the block device layer but I imagine that the difference is due to the number of calls into the network stack that can happen in parallel. Is queue_rq executed from the context of the userspace caller? In this case I will have close to 300 threads active on this code path. Whereas when we go through the workqueue, I guess we have only N max calls into the network stack at once, where N = the number of CPUs on the system?
> 
> What do you think? Does my analysis make any sense?

I don't think this is related to the number of contexts you have running 
against the namespaces.

However, the difference between submission from queue_rq and from the
workqueue is that queue_rq submissions may happen concurrently, and hence
may be blocked by the socket mutex while io_work does RX. Now I guess
that if RX is intensive (and it sounds like it is in your test), the
TX path could potentially be starved...

But if the RX is so intensive, then I'd expect you to see timeouts
regardless, as the context is not performing sends anyway. Maybe
this needs a bit more thought...
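
To make the theory concrete, here is a condensed (not literal) view of
the two paths as I read the current code; error handling and the
more_requests bookkeeping are omitted:

	/* queue_rq() direct-send path (condensed): */
	if (queue->io_cpu == raw_smp_processor_id() &&
	    sync && empty && mutex_trylock(&queue->send_mutex)) {
		/*
		 * nvme_tcp_send_all() -> kernel_sendpage() takes the
		 * socket lock, so this context sleeps for as long as
		 * io_work holds it on the RX side below.
		 */
		nvme_tcp_send_all(queue);
		mutex_unlock(&queue->send_mutex);
	}

	/* io_work RX side, nvme_tcp_try_recv() (condensed): */
	lock_sock(sk);
	/*
	 * read_sock() keeps calling nvme_tcp_recv_skb() for as long as
	 * the stack has data queued, so the socket lock can be held for
	 * a long time under heavy RX.
	 */
	consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
	release_sock(sk);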

If this theory is correct, then a possible reason for the big delay
inside kernel_sendpage is that io_work just ingests as much as the
stack gives it (which can be a lot). If we were to break the recvs
into fixed-size chunks, that would allow sends to happen between
recvs and avoid this potential starvation.

Would it be possible to try out this untested patch?
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 032e8d582e6d..ec05b26cee38 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -837,6 +837,7 @@ static int nvme_tcp_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
 		}
 	}
 
+	desc->count -= consumed;
 	return consumed;
 }
 
@@ -1107,7 +1108,7 @@ static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue)
 	int consumed;
 
 	rd_desc.arg.data = queue;
-	rd_desc.count = 1;
+	rd_desc.count = 0x4000; /* 16k */
 	lock_sock(sk);
 	queue->nr_cqe = 0;
 	consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
--
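
For context on why this should help: tcp_read_sock() stops feeding the
recv_actor once desc->count drops to zero, so with the patch each
nvme_tcp_try_recv() invocation roughly bounds its ingest to ~16k bytes
before releasing the socket lock and letting senders in. Paraphrasing
the relevant loop in net/ipv4/tcp.c (double-check against your tree):

	/* tcp_read_sock(), paraphrased: */
	while ((skb = tcp_recv_skb(sk, seq, &offset)) != NULL) {
		...
		used = recv_actor(desc, skb, offset, len); /* nvme_tcp_recv_skb() */
		...
		if (!desc->count)	/* recv budget consumed -> stop */
			break;
		...
	}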


