Command timeouts with NVMe TCP kernel driver
Keith Busch
kbusch at kernel.org
Mon Aug 30 19:37:12 PDT 2021
On Mon, Aug 30, 2021 at 03:30:29PM +0200, Samuel Jones wrote:
> Hi all,
>
> I'm experiencing command timeouts with recent versions of the
> nvme-tcp driver. I have a setup with a VM running 5.8, which is fine,
> but the same test load (100% read, 16K blocks) on a VM running 5.11
> or later, including 5.14-rc7, shows the issue. The initiator
> complains as follows:
>
> Aug 30 14:58:05 centos7 kernel: nvme nvme0: queue 7: timeout request 0x10 type 4
> Aug 30 14:58:05 centos7 kernel: nvme nvme0: starting error recovery
>
> My target is a Kalray board, but I have no trace of any outstanding
> commands when this situation occurs. Quite the opposite: the Kalray
> board observes that the initiator stops sending new requests on the
> queue some time before this command times out.
>
> I don't have any similar issues using SPDK as an initiator. I made
> the following modification to the nvme-tcp code, and my problem
> disappeared:
>
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -287,6 +287,7 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
> * directly, otherwise queue io_work. Also, only do that if we
> * are on the same cpu, so we don't introduce contention.
> */
> +#if 0
> if (queue->io_cpu == raw_smp_processor_id() &&
> sync && empty && mutex_trylock(&queue->send_mutex)) {
> queue->more_requests = !last;
> @@ -296,6 +297,9 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
> } else if (last) {
> queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
> }
> +#else
> + queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
> +#endif
> }
>
> To be honest, this bit of code has always bothered me: I don't
> understand how we can guarantee that the thread doesn't change CPU
> between the call to raw_smp_processor_id() and the trylock. I assume
> that even if this does occur, the fact that we hold queue->send_mutex
> is supposed to ensure there are no atomicity issues, but I'm
> concerned about the logic that tries to determine whether there is
> any more pending work and reschedules io_work if and only if there is
> something to do. I can't pinpoint an issue from just reading the
> code, though.
>
> Does anyone have any thoughts on this? Is my patch proof that there
> is an issue in this code, or am I changing the program flow so
> significantly that we can't conclude anything? Are there any open
> issues w.r.t. command timeouts at the moment?
I've seen something similar, but I thought we'd fixed all of those
issues. Your proof-of-concept patch indicates to me that there is
still at least one more condition that isn't being considered. I don't
think the possibility of CPU rescheduling is the culprit, though.
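
For reference, here is roughly what that path looks like (paraphrased
from drivers/nvme/host/tcp.c in the affected kernels; the comments are
mine and mark the window in question):

static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
		bool sync, bool last)
{
	struct nvme_tcp_queue *queue = req->queue;
	bool empty;

	empty = llist_add(&req->lentry, &queue->req_list) &&
		list_empty(&queue->send_list) && !queue->request;

	/*
	 * If the thread migrates between raw_smp_processor_id() and
	 * mutex_trylock(), the worst case is that we send inline from
	 * a CPU other than io_cpu: the locality optimization is
	 * defeated, but send_mutex still serializes access to the
	 * socket, so there is no correctness problem in that window.
	 */
	if (queue->io_cpu == raw_smp_processor_id() &&
	    sync && empty && mutex_trylock(&queue->send_mutex)) {
		queue->more_requests = !last;
		nvme_tcp_send_all(queue);
		queue->more_requests = false;
		mutex_unlock(&queue->send_mutex);
	} else if (last) {
		/* only the last request of a batch kicks io_work */
		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
	}
}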
One thing that looks problematic: if bd->last is true but
nvme_tcp_queue_rq() fails early, after previous commands in the
sequence have already been enqueued, the io_work for that batch may
never be kicked. There should have been something else in dmesg
indicating that error, though, so I'll keep looking for other
possibilities.
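
Concretely, the sequence I have in mind is something like this (a
sketch of the v5.14-era nvme_tcp_queue_rq() flow, abbreviated; the
readiness checks before command setup are elided):

static blk_status_t nvme_tcp_queue_rq(struct blk_mq_hw_ctx *hctx,
		const struct blk_mq_queue_data *bd)
{
	struct nvme_ns *ns = hctx->queue->queuedata;
	struct request *rq = bd->rq;
	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
	blk_status_t ret;

	/* ... controller/queue readiness checks elided ... */

	ret = nvme_tcp_setup_cmd_pdu(ns, rq);
	if (unlikely(ret))
		return ret;	/* early exit: bd->last is dropped here */

	blk_mq_start_request(rq);

	nvme_tcp_queue_request(req, true, bd->last);
	return BLK_STS_OK;
}

If blk-mq dispatches a batch of three requests (A and B with bd->last
== false, then C with bd->last == true) and C's setup fails, nobody
calls queue_work_on() for the batch. A and B sit on req_list until
something else happens to kick io_work, or until they time out, which
would match what you're seeing.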