[bug report] blktests nvme/062 hang

Fri Apr 18 10:01:45 PDT 2025

On Wed, Apr 16, 2025 at 3:28 PM Sagi Grimberg <sagi at grimberg.me> wrote:
>
>
>
> On 16/04/2025 9:31, Shinichiro Kawasaki wrote:
> > Hello all,
> >
> > Recently the new test case nvme/062 was added, which tests
> > "TLS-encrypted connections". I ran it with the kernel v6.15-rc2, and observed
> > "BUG: kernel NULL pointer dereference" followed by system hang [1].
> >
> >    When I ran the test case with the v6.14 kernel, I did not observe the
> >    failure. However, when I repeat the test case several times on the v6.14
> >    kernel, the same failure happens. It looks like the problem has existed
> >    for a while, and kernel changes between v6.14 and v6.15-rc2 increased
> >    the failure rate.
> >
> > Help with a fix would be appreciated. I can run tests with debug patches or
> > candidate fix patches.
> >
> > [1]
> >
> > [  285.443497][ T1440] run blktests nvme/062 at 2025-04-16 14:41:30
> > [  285.586272][ T1531] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
> > [  285.596633][ T1532] nvmet: Allow non-TLS connections while TLS1.3 is enabled
> > [  285.601564][ T1535] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
> > [  285.646164][ T1542] nvme nvme1: failed to connect socket: -512
> > [  285.651941][   T48] nvmet_tcp: failed to allocate queue, error -107
> > [  285.654954][   T65] nvmet: Created nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349.
> > [  285.658205][ T1542] nvme nvme1: Please enable CONFIG_NVME_MULTIPATH for full support of multi-port devices.
> > [  285.659759][ T1542] nvme nvme1: creating 4 I/O queues.
> > [  285.662596][ T1542] nvme nvme1: mapped 4/0/0 default/read/poll queues.
> > [  285.665446][ T1542] nvme nvme1: new ctrl: NQN "blktests-subsystem-1", addr 127.0.0.1:4420, hostnqn: nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349
> > [  285.763327][ T1560] nvme nvme1: Removing ctrl: NQN "blktests-subsystem-1"
> > [  285.923585][ T1567] nvme nvme1: failed to connect socket: -512
> > [  285.932856][  T120] nvmet_tcp: failed to allocate queue, error -107
> > [  286.035851][   T98] nvmet: Created nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349, TLS.
> > [  286.038496][ T1567] nvme nvme1: Please enable CONFIG_NVME_MULTIPATH for full support of multi-port devices.
> > [  286.040096][ T1567] nvme nvme1: creating 4 I/O queues.
> > [  286.070856][ T1567] nvme nvme1: mapped 4/0/0 default/read/poll queues.
> > [  286.074999][ T1567] nvme nvme1: new ctrl: NQN "blktests-subsystem-1", addr 127.0.0.1:4420, hostnqn: nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349
> > [  286.200199][ T1604] nvme nvme1: Removing ctrl: NQN "blktests-subsystem-1"
> > [  286.388688][ T1617] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
> > [  286.406950][ T1621] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
> > [  286.460711][ T1628] nvme_tcp: queue 0: failed to receive icresp, error -4
>
> Adding Caleb as well. I see here that icresp also failed with -4.
> Should we add special handling for -EINTR?

What do you have in mind? As far as I can tell, the only relevant
place in the net code that could produce -EINTR is sock_intr_errno(),
whose result is only returned when the current task has a signal
pending. (If there is evidence to the contrary, I would love to see
it! Maybe tracing signal_generate/signal_deliver would illuminate
what's going on?) Returning immediately from the syscall when there's
a pending signal, and letting userspace handle the signal and choose
whether to retry the syscall, seems like standard practice.
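
To illustrate the userspace side of that convention, here is a minimal
sketch. It assumes a plain read(2) on a file descriptor, not the actual
nvme-fabrics connect path, and read_retry_eintr() is a hypothetical
helper name used only for this example:

```c
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical helper: call read(2) and retry only when the call was
 * interrupted by a signal (errno == EINTR). Any other error, EOF, or a
 * successful read is returned to the caller unchanged.
 */
static ssize_t read_retry_eintr(int fd, void *buf, size_t len)
{
	ssize_t n;

	do {
		n = read(fd, buf, len);
	} while (n < 0 && errno == EINTR);

	return n;
}
```

Whether to retry at all is a userspace policy decision: a caller that
treats the signal as a request to stop would return the error instead
of looping.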

Best,
Caleb