[bug report] blktests nvme/062 hang

Sat Apr 19 23:10:40 PDT 2025

On 18/04/2025 20:01, Caleb Sander Mateos wrote:
> On Wed, Apr 16, 2025 at 3:28 PM Sagi Grimberg <sagi at grimberg.me> wrote:
>>
>>
>> On 16/04/2025 9:31, Shinichiro Kawasaki wrote:
>>> Hello all,
>>>
>>> Recently the new test case nvme/062 was added, which tests
>>> "TLS-encrypted connections". I ran it with the kernel v6.15-rc2, and observed
>>> "BUG: kernel NULL pointer dereference" followed by system hang [1].
>>>
>>> When I ran the test case with the v6.14 kernel, I did not observe the
>>> failure. However, I noticed that when I repeat the test case several times on
>>> the v6.14 kernel, the same failure happens. It looks like the problem has
>>> existed for a while, and kernel changes between v6.14 and v6.15-rc2 increased
>>> the failure rate.
>>>
>>> Any help with a fix would be appreciated. I can run tests with debug patches
>>> or candidate fix patches.
>>>
>>> [1]
>>>
>>> [  285.443497][ T1440] run blktests nvme/062 at 2025-04-16 14:41:30
>>> [  285.586272][ T1531] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
>>> [  285.596633][ T1532] nvmet: Allow non-TLS connections while TLS1.3 is enabled
>>> [  285.601564][ T1535] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
>>> [  285.646164][ T1542] nvme nvme1: failed to connect socket: -512
>>> [  285.651941][   T48] nvmet_tcp: failed to allocate queue, error -107
>>> [  285.654954][   T65] nvmet: Created nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349.
>>> [  285.658205][ T1542] nvme nvme1: Please enable CONFIG_NVME_MULTIPATH for full support of multi-port devices.
>>> [  285.659759][ T1542] nvme nvme1: creating 4 I/O queues.
>>> [  285.662596][ T1542] nvme nvme1: mapped 4/0/0 default/read/poll queues.
>>> [  285.665446][ T1542] nvme nvme1: new ctrl: NQN "blktests-subsystem-1", addr 127.0.0.1:4420, hostnqn: nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349
>>> [  285.763327][ T1560] nvme nvme1: Removing ctrl: NQN "blktests-subsystem-1"
>>> [  285.923585][ T1567] nvme nvme1: failed to connect socket: -512
>>> [  285.932856][  T120] nvmet_tcp: failed to allocate queue, error -107
>>> [  286.035851][   T98] nvmet: Created nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349, TLS.
>>> [  286.038496][ T1567] nvme nvme1: Please enable CONFIG_NVME_MULTIPATH for full support of multi-port devices.
>>> [  286.040096][ T1567] nvme nvme1: creating 4 I/O queues.
>>> [  286.070856][ T1567] nvme nvme1: mapped 4/0/0 default/read/poll queues.
>>> [  286.074999][ T1567] nvme nvme1: new ctrl: NQN "blktests-subsystem-1", addr 127.0.0.1:4420, hostnqn: nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349
>>> [  286.200199][ T1604] nvme nvme1: Removing ctrl: NQN "blktests-subsystem-1"
>>> [  286.388688][ T1617] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
>>> [  286.406950][ T1621] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
>>> [  286.460711][ T1628] nvme_tcp: queue 0: failed to receive icresp, error -4
>> Adding Caleb as well. I see here that icresp also failed with -4.
>> Should we add special handling for -EINTR?
> What do you have in mind? As far as I can tell, the only relevant
> place in net code that could result in -EINTR is sock_intr_errno(),
> which is only returned when the current task has a signal pending. (If
> there is evidence to the contrary, I would love to see it! Maybe
> tracing signal_generate/signal_deliver would illuminate what's going
> on?) Returning immediately from the syscall when there's a pending
> signal and letting userspace handle the signal and choose to retry the
> syscall seems like standard practice.

We need to understand why this test crashes, and why other tests (outside of
blktests) are regressing.

I would like to understand if the source is MSG_WAITALL, and why.
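
For anyone following the quoted log, here is a small sketch that decodes the
three error numbers that show up in it (-512, -107, -4). The table is
hard-coded with Linux values, and the helper name `decode` is only
illustrative, not part of any tool. Note that 512 (ERESTARTSYS) is a
kernel-internal restart code from include/linux/errno.h and has no userspace
errno name.

```python
# Linux errno values seen in the quoted log. They are hard-coded on purpose
# so the sketch does not depend on the platform it runs on.
LINUX_ERRNO = {
    4: ("EINTR", "interrupted system call"),
    107: ("ENOTCONN", "transport endpoint is not connected"),
    # Kernel-internal code (include/linux/errno.h); it is not a userspace errno.
    512: ("ERESTARTSYS", "kernel-internal: restart the system call"),
}


def decode(code: int) -> str:
    """Map a (negative) kernel return value from the log to a readable name."""
    name, text = LINUX_ERRNO.get(abs(code), ("UNKNOWN", "not in this table"))
    return f"{code}: {name} ({text})"


if __name__ == "__main__":
    # The three values that appear in the nvme/nvmet_tcp messages above.
    for code in (-512, -107, -4):
        print(decode(code))
```

Read this way, the "failed to connect socket: -512" lines are ERESTARTSYS, the
"failed to allocate queue, error -107" lines are ENOTCONN, and the icresp
failure is the -EINTR being discussed.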