Mellanox CX6 and nvmet connectivity failure, happens on RHEL9.2 kernels and latest 6.6 upstream
Laurence Oberman
loberman at redhat.com
Wed Nov 8 13:10:58 PST 2023
On Wed, 2023-11-08 at 15:55 -0500, Laurence Oberman wrote:
> On Wed, 2023-11-08 at 15:07 -0500, Laurence Oberman wrote:
> > On Wed, 2023-11-08 at 12:57 -0700, Mark Lehrer wrote:
> > > > [ 286.547112] nvme nvme4: Connect Invalid Data Parameter, cntlid: 1
> > > > [ 286.555181] nvme nvme4: failed to connect queue: 1 ret=16770
> > >
> > > It looks like the admin queue pair (0) worked at least. The code path
> > > for the two is a bit different.
> > >
> > > This error sounds familiar. I wonder if there's an error code 16xxx
> > > cheat sheet out there.
> > >
> > > We recently had to downgrade a ConnectX firmware version to fix a
> > > similar issue, but on a CX7. I can't remember the firmware versions
> > > involved but I could probably dig it up.
> > >
> > > Have you tried TCP mode? Whether TCP works or not will be useful
> > > information for debugging.
> > >
> >
> > Hi Mark
> >
> > I ended up changing the default kato from 5s to 30s and it's working
> > now.
> > We don't jump ship too early anymore and it connects fine.
> > See my prior response, where I answered my own message.
> >
> > diff -Nurp linux-5.14.0-284.25.1.el9_2.orig/drivers/nvme/host/nvme.h linux-5.14.0-284.25.1.el9_2/drivers/nvme/host/nvme.h
> > --- linux-5.14.0-284.25.1.el9_2.orig/drivers/nvme/host/nvme.h 2023-07-20 08:42:08.000000000 -0400
> > +++ linux-5.14.0-284.25.1.el9_2/drivers/nvme/host/nvme.h 2023-11-08 14:16:37.924155469 -0500
> > @@ -25,7 +25,7 @@ extern unsigned int nvme_io_timeout;
> > extern unsigned int admin_timeout;
> > #define NVME_ADMIN_TIMEOUT (admin_timeout * HZ)
> >
> > -#define NVME_DEFAULT_KATO 5
> > +#define NVME_DEFAULT_KATO 30
> >
> > #ifdef CONFIG_ARCH_NO_SG_CHAIN
> > #define NVME_INLINE_SG_CNT 0
> >
> >
> > I will wait for Sagi and Keith and then send a patch
> > I had the wrong email for Keith
> >
> > Thanks a lot
> > Laurence
> >
>
> Hello
>
> No fix needed, I was unaware of the -k option in the nvme connect.
> My colleague showed it to me.
> This works now to give the CX6 longer to handle the connection
>
> #!/bin/bash
> modprobe nvme-fc
> nvme connect -t rdma -n nqn.2023-10.org.dell -a 172.18.60.2 -s 4420 -k 30
>
>
> Thanks
> So a Heads up for these newer cards I guess, need more time
>
> Regards
> Laurence
>
Finalizing this discussion and adding the appropriate CCs.
No patch is needed; I was unaware of the -k (keep-alive timeout, in
seconds) option in nvme connect.
My colleague John Pittman showed it to me, and in fact Mark also
pointed it out in a follow-up email.
Setting it to 30 gives the CX6 longer to handle the connection.
C.K, thanks to you as well for responding.
Initiator
#!/bin/bash
modprobe nvme-fc
nvme connect -t rdma -n nqn.2023-10.org.dell -a 172.18.60.2 -s 4420 -k 30
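
For anyone scripting this, here is a minimal sketch of the same connect line with the keep-alive timeout as a parameter. The function name build_connect_cmd and its argument order are made up for illustration. It only prints the command, so the values can be reviewed before anything is run as root.

```bash
#!/bin/bash
# Sketch only: build_connect_cmd is a hypothetical helper, not part of nvme-cli.
# Usage: build_connect_cmd <target-address> <subsystem-nqn> [kato-seconds]
build_connect_cmd() {
    local addr=$1
    local nqn=$2
    local kato=${3:-30}   # default to the 30 seconds that worked in this thread

    # Reject an empty or non-numeric timeout before it reaches nvme connect.
    case $kato in
        ''|*[!0-9]*)
            echo "kato must be a whole number of seconds" >&2
            return 1
            ;;
    esac

    # -k is the short form of --keep-alive-tmo in nvme-cli.
    printf 'nvme connect -t rdma -n %s -a %s -s 4420 -k %s\n' \
        "$nqn" "$addr" "$kato"
}

# Print the command used in this thread.
build_connect_cmd 172.18.60.2 nqn.2023-10.org.dell 30
```

Once the printed line looks right, it can be run by hand or the printf swapped for the real call.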
Thanks.
So a heads-up for these newer cards, I guess: they need more time.
Learn something new every day.
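
On the 16xxx cheat sheet question raised earlier in the thread: the ret= value appears to be the NVMe completion status as the host driver stores it, so it can be split into bit fields. A rough sketch follows, assuming the layout in include/linux/nvme.h (status code in bits 0-7, status code type in bits 8-10, More in bit 13, Do Not Retry in bit 14). The function name decode_nvme_status is hypothetical.

```bash
#!/bin/bash
# Sketch only: decode_nvme_status is a hypothetical helper.
# Splits a host-side "ret=" value into the NVMe status bit fields.
decode_nvme_status() {
    local ret=$1
    local dnr=$(( (ret >> 14) & 1 ))    # Do Not Retry
    local more=$(( (ret >> 13) & 1 ))   # More information available
    local sct=$(( (ret >> 8) & 7 ))     # Status Code Type
    local sc=$(( ret & 255 ))           # Status Code

    printf 'ret=%d (0x%04x) dnr=%d more=%d sct=0x%x sc=0x%02x\n' \
        "$ret" "$ret" "$dnr" "$more" "$sct" "$sc"
}

decode_nvme_status 16770
# prints: ret=16770 (0x4182) dnr=1 more=0 sct=0x1 sc=0x82
```

With the Do Not Retry bit stripped, 0x182 is the value the kernel header names NVME_SC_CONNECT_INVALID_PARAM, which matches the "Connect Invalid Data Parameter" line in the log.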