[PATCH blktests] nvme/039: avoid failure by error message rate limit
alan.adamson at oracle.com
alan.adamson at oracle.com
Thu Apr 13 10:19:01 PDT 2023
On 4/12/23 8:56 PM, Shin'ichiro Kawasaki wrote:
> On Apr 13, 2023 / 09:20, Shin'ichiro Kawasaki wrote:
>> On Apr 12, 2023 / 10:49, alan.adamson at oracle.com wrote:
>>> On 4/12/23 1:59 AM, Shin'ichiro Kawasaki wrote:
>>>> From: Shin'ichiro Kawasaki <shinichiro.kawasaki at wdc.com>
>>>>
>>>> The test case nvme/039 tests that expected error messages are printed
>>>> for errors injected to the nvme driver. However, the test case fails by
>>>> chance when previous test cases generate many error messages. In this
>>>> case, the kernel function pr_err_ratelimited() may suppress the error
>>>> messages that the test case expects. Also, it may print messages that
>>>> the test case does not expect, such as "blk_print_req_error: xxxx
>>>> callbacks suppressed".
>>>>
>>>> To avoid the failure, make two improvements for the test case. Firstly,
>>>> wait DEFAULT_RATE_LIMIT seconds at the beginning of the test to ensure
>>>> the expected error messages are not suppressed. Secondly, exclude the
>>>> unexpected message for the error message check. Introduce a helper
>>>> function last_dmesg() for the second improvement.
>>> Why are we seeing the callback messages? By the time the test starts
>>> generating errors (after a 5 sec delay) we should be able to log 10 messages
>>> without any being suppressed.
>> That is because other test cases before nvme/039 can generate errors. For
>> instance, block/014 generates many errors. When I ran block/014 and nvme/039 in
>> sequence, I always observe nvme/039 failure even with the 5 seconds wait in
>> nvme/039. I suggest to excldue the "callbacks message" to avoid the nvme/038
>> failure regardless of the errors generated before nvme/039.
> Reading back my explanation above, I found it may not be clear enough. Let me
> ammend it. My understanding is as follows.
>
> The test case block/014 generates many error messages by blk_print_req_error().
> The messages are rate limited and some of them are suppressed. At that time,
> __ratelimit() calls printk_deferred() to print the "callbacks suppressed"
> message, which is deferred and not printed during block/014. When nvme/039
> triggers an error message (after the 5 sec delay), the "callbacks suppressed"
> message for blk_print_req_error() and block/014 is printed. It makes the
> nvme/039 fail, since nvme/039 expects the error message it triggered, but it
> finds the "callbacks suppressed" message instead. In theory, not only block/014
> but also other workload with rate limited error messages can cause this nvme/039
> failure.
>
> The 5 sec delay in nvme/039 ensures that the error messages for nvme/039 are not
> suppressed, but it can not avoid the "callbacks suppressed" messages that
> deferred before nvme/039.
That's right. I went through the rate limit code and that's how it
works, you will get a "callback suppressed" message along with the first
message nvme/039 logs.
I ran the following for a while:
while true
do
./check block/014
./check nvme/039
done
Reviewed-by: Alan Adamson <alan.adamson at oracle.com>
Tested-by: Alan Adamson <alan.adamson at oracle.com>
Alan
More information about the Linux-nvme
mailing list