[PATCH blktests] nvme/039: avoid failure by error message rate limit

Shin'ichiro Kawasaki shinichiro at fastmail.com
Wed Apr 12 20:56:47 PDT 2023


On Apr 13, 2023 / 09:20, Shin'ichiro Kawasaki wrote:
> On Apr 12, 2023 / 10:49, alan.adamson at oracle.com wrote:
> > 
> > On 4/12/23 1:59 AM, Shin'ichiro Kawasaki wrote:
> > > From: Shin'ichiro Kawasaki <shinichiro.kawasaki at wdc.com>
> > > 
> > > The test case nvme/039 tests that expected error messages are printed
> > > for errors injected to the nvme driver. However, the test case fails by
> > > chance when previous test cases generate many error messages. In this
> > > case, the kernel function pr_err_ratelimited() may suppress the error
> > > messages that the test case expects. Also, it may print messages that
> > > the test case does not expect, such as "blk_print_req_error: xxxx
> > > callbacks suppressed".
> > > 
> > > To avoid the failure, make two improvements for the test case. Firstly,
> > > wait DEFAULT_RATE_LIMIT seconds at the beginning of the test to ensure
> > > the expected error messages are not suppressed. Secondly, exclude the
> > > unexpected message for the error message check. Introduce a helper
> > > function last_dmesg() for the second improvement.
> > 
> > Why are we seeing the callback messages?  By the time the test starts
> > generating errors (after a 5 sec delay) we should be able to log 10 messages
> > without any being suppressed.
> 
> That is because other test cases before nvme/039 can generate errors. For
> instance, block/014 generates many errors. When I ran block/014 and nvme/039 in
> sequence, I always observe nvme/039 failure even with the 5 seconds wait in
> nvme/039. I suggest to excldue the "callbacks message" to avoid the nvme/038
> failure regardless of the errors generated before nvme/039.

Reading back my explanation above, I found it may not be clear enough. Let me
ammend it. My understanding is as follows.

The test case block/014 generates many error messages by blk_print_req_error().
The messages are rate limited and some of them are suppressed. At that time,
__ratelimit() calls printk_deferred() to print the "callbacks suppressed"
message, which is deferred and not printed during block/014. When nvme/039
triggers an error message (after the 5 sec delay), the "callbacks suppressed"
message for blk_print_req_error() and block/014 is printed. It makes the
nvme/039 fail, since nvme/039 expects the error message it triggered, but it
finds the "callbacks suppressed" message instead. In theory, not only block/014
but also other workload with rate limited error messages can cause this nvme/039
failure.

The 5 sec delay in nvme/039 ensures that the error messages for nvme/039 are not
suppressed, but it can not avoid the "callbacks suppressed" messages that
deferred before nvme/039.



More information about the Linux-nvme mailing list