[PATCH] nvme-tcp: add additional info for nvme_tcp_timeout log

Sun Dec 4 09:28:05 PST 2022

Hi Sagi, thank you for your inputs

It seems to me that logging nvme command id is sufficient and there is no need to log also tag and genctr separately
This is due to the fact that command id is constructed as such:

| xxxx | xxxxxxxxxxxx |
  gen    request tag

What do you think?

Regarding the io timeout value, I got your point..
logging the io timeout threshold (which is the delta from allocation to submission) is sufficient

regarding nvme-tcp trace points -  I don’t believe that would help.
This timeout log is on the host side, and we can't ensure that a customer that is running nvme host will enable/use trace points
In addition, the whole point of this specific log is for debug, I guess - right?

How does this log look?
kernel: nvme nvme0: queue 4: cntlid 1 type 4 cid 4163 opcode 0x2 io timeout threshold 30000ms
(Including the cntlid, cid (which includes genctr+tag), opc, and timeout threshold value in ms)

comparing to the existing log:
kernel: nvme nvme0: queue 4: timeout request 0x43 type 4

let me know if that make sense and I will submit an updated patch.

Thanks
Amit

Internal Use - Confidential

-----Original Message-----
From: Sagi Grimberg <sagi at grimberg.me> 
Sent: Monday, 21 November 2022 16:34
To: Engel, Amit <Amit.Engel at Dell.com>; linux-nvme at lists.infradead.org
Subject: Re: [PATCH] nvme-tcp: add additional info for nvme_tcp_timeout log

[EXTERNAL EMAIL] 

> Hi Sagi.
> 
> The previous msg:
> nvme nvme0: queue 0: timeout request 0x0 type 4
> 
> We found it more useful to have also:
> cmd opc that was timed out

That is fine

> command id that was timed out

That is the lower 12 bits of the tag.
You can introduce a generic helper to extract genctr+cid from a tag.

> The timeout value for this specific io - in case of a timeout - it 
> make sense to know the timeout value itself more info about the blk rq - such as:
> 	a. time of a request that was allocated for the IO that was timed out
> 	b. time of the timed out IO that was submitted to the device

I don't understand what added value that is providing you.
An IO timeout is usually (if not universally) >> the delta from tag allocation to submission.

This is debug information, and in my mind the proper place for this type of debug is blktrace.

> The log is already exists, why not provide additional info that can help with io timeouts debug?

We can say that about a lot of things.

There is negative value from spam information in the log which is debug in nature and is separate from the normal error/warn log information.

If this information existed in the form of trace points would that solve your problem? Keith had a patch for nvme-tcp trace points, which can provide this type of information.