Mellanox CX6 and nvmet connectivity failure, happens on RHEL9.2 kernels and latest 6.6 upstream

Laurence Oberman loberman at redhat.com
Wed Nov 8 13:10:58 PST 2023


On Wed, 2023-11-08 at 15:55 -0500, Laurence Oberman wrote:
> On Wed, 2023-11-08 at 15:07 -0500, Laurence Oberman wrote:
> > On Wed, 2023-11-08 at 12:57 -0700, Mark Lehrer wrote:
> > > > [  286.547112] nvme nvme4: Connect Invalid Data Parameter, cntlid: 1
> > > > [  286.555181] nvme nvme4: failed to connect queue: 1 ret=16770
> > > 
> > > It looks like at least the admin queue pair (0) worked.  The code path
> > > for the two is a bit different.
> > > 
> > > This error sounds familiar.  I wonder if there's an error code 16xxx
> > > cheat sheet out there.
> > > 
> > > We recently had to downgrade a ConnectX firmware version to fix a
> > > similar issue, but on a CX7.  I can't remember the firmware versions
> > > involved, but I could probably dig them up.
> > > 
> > > Have you tried TCP mode?  Whether TCP works or not will be useful
> > > information for debugging.
> > > 
> > 
> > Hi Mark
> > 
> > I ended up changing the default KATO from 5 s to 30 s, and it's working
> > now. We no longer jump ship too early, and it connects fine.
> > See my prior response, where I answered my own message.
> > 
> > diff -Nurp linux-5.14.0-284.25.1.el9_2.orig/drivers/nvme/host/nvme.h linux-5.14.0-284.25.1.el9_2/drivers/nvme/host/nvme.h
> > --- linux-5.14.0-284.25.1.el9_2.orig/drivers/nvme/host/nvme.h   2023-07-20 08:42:08.000000000 -0400
> > +++ linux-5.14.0-284.25.1.el9_2/drivers/nvme/host/nvme.h        2023-11-08 14:16:37.924155469 -0500
> > @@ -25,7 +25,7 @@ extern unsigned int nvme_io_timeout;
> >  extern unsigned int admin_timeout;
> >  #define NVME_ADMIN_TIMEOUT     (admin_timeout * HZ)
> >  
> > -#define NVME_DEFAULT_KATO      5
> > +#define NVME_DEFAULT_KATO      30
> >  
> >  #ifdef CONFIG_ARCH_NO_SG_CHAIN
> >  #define  NVME_INLINE_SG_CNT  0
> > 
> > 
> > I will wait for Sagi and Keith and then send a patch.
> > (I had the wrong email address for Keith.)
> > 
> > Thanks a lot
> > Laurence
> > 
> 
> Hello
> 
> No fix needed; I was unaware of the -k option to nvme connect.
> My colleague showed it to me.
> This now works, giving the CX6 longer to handle the connection.
> 
> #!/bin/bash
> modprobe nvme-rdma
> nvme connect -t rdma -n nqn.2023-10.org.dell -a 172.18.60.2 -s 4420 -k 30
> 
> 
> Thanks
> So, a heads-up for these newer cards, I guess: they need more time.
> 
> Regards
> Laurence

Finalizing this discussion and adding appropriate cc's


No patch needed; I was unaware of the -k (keep-alive timeout) option to
nvme connect. My colleague John Pittman showed it to me, and in fact Mark
also pointed it out in a follow-up email.
This now works, giving the CX6 longer to handle the connection.
C.K, thanks to you as well for responding.

Initiator
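#!/bin/bash
# -t rdma is served by the nvme-rdma transport module (nvme-fc is Fibre Channel)
modprobe nvme-rdma
# -k 30 requests a 30 s keep-alive timeout instead of the 5 s kernel default
nvme connect -t rdma -n nqn.2023-10.org.dell -a 172.18.60.2 -s 4420 -k 30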

Thanks
So, a heads-up for these newer cards, I guess: they need more time.

Learn something new every day
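On Mark's question about a "16xxx cheat sheet": the ret=16770 in the quoted log is an NVMe status word, not an errno. Below is a minimal bash sketch that splits it into fields, assuming the status layout used by the kernel's include/linux/nvme.h (DNR = 0x4000, More = 0x2000, CRD = bits 12:11, SCT = bits 10:8, SC = bits 7:0). decode_nvme_status is a hypothetical helper for illustration, not part of nvme-cli.

```shell
#!/bin/bash
# Hypothetical helper: split a Linux NVMe host status value (as printed in
# "failed to connect queue: N ret=..."), assuming the include/linux/nvme.h
# layout: bit 14 = DNR, bit 13 = More, bits 12:11 = CRD,
# bits 10:8 = SCT (status code type), bits 7:0 = SC (status code).
decode_nvme_status() {
    local status=$1                       # decimal or 0x-prefixed hex
    local dnr=$(( (status >> 14) & 1 ))   # Do Not Retry
    local more=$(( (status >> 13) & 1 ))  # More information available
    local crd=$(( (status >> 11) & 0x3 )) # Command Retry Delay selector
    local sct=$(( (status >> 8) & 0x7 ))  # 0x1 = command specific status
    local sc=$(( status & 0xff ))         # status code within that type
    printf 'status=0x%04x sct=0x%x sc=0x%02x crd=%d more=%d dnr=%d\n' \
        "$status" "$sct" "$sc" "$crd" "$more" "$dnr"
}

# Default to the value from the log above if no argument is given.
decode_nvme_status "${1:-16770}"
```

Run with no argument, it decodes 16770 as 0x4182: SCT 0x1 (command specific) with SC 0x82, which for the Fabrics Connect command is Connect Invalid Parameters (the kernel's NVME_SC_CONNECT_INVALID_PARAM, 0x182), plus the DNR bit. That matches the "Connect Invalid Data Parameter" line printed just before it.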




More information about the Linux-nvme mailing list