target crash / host hang with nvme-all.3 branch of nvme-fabrics

Tue Jun 28 09:49:56 PDT 2016

On Tue, 2016-06-28 at 11:31 -0500, Steve Wise wrote:
> > On Tue, Jun 28, 2016 at 09:15:22AM -0500, Steve Wise wrote:
> > > I'm not so sure.  I don't see where nvmet leaves unsignaled wrs on the SQ.
> > > It either posts chains via RDMA-RW and the last in the chain is always
> > > signaled (I think), or it posts signaled IO responses.
> > 
> > Indeed.  So we need to figure out where we don't release a rsp.
> > 
> 
> Hey Ming, 
> 
> For what it's worth, the change you proposed in this thread isn't working for me.
> I see maybe one or two recoveries successful, then the target gets stuck.  I see
> several workq threads stuck destroying various qps, one thread stuck draining a
> qp.  If this change is not the proper fix, then I'm not going to debug this
> further.

I didn't see this during my overnight test, so it is possibly a different bug.
Could you post the call stacks of the stuck threads?
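
One possible way to collect those stacks on the target is sketched below. It
is only a sketch: blocked_tasks is a hypothetical helper name, and it assumes
the stuck workqueue threads are in uninterruptible sleep (state D), which may
not hold for every stuck thread.

```shell
# Filter "ps -eo pid,stat,comm" output down to tasks whose state starts
# with D (uninterruptible sleep). Prints "<pid> <first word of comm>".
# blocked_tasks is a hypothetical helper, not an existing tool.
blocked_tasks() {
    awk 'NR > 1 && $2 ~ /^D/ { print $1, $3 }'
}

# Usage, as root on the target:
#   ps -eo pid,stat,comm | blocked_tasks
#   echo w > /proc/sysrq-trigger    # kernel logs stacks of blocked tasks
#   dmesg | tail -n 200             # copy the traces from the kernel log
```

The sysrq step needs sysrq to be enabled on the target; /proc/<pid>/stack for
an individual pid from the list is an alternative.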

I assume you are still running the test below on the host:

Run the fio test.

Then:

while [ 1 ] ; do
        ifconfig $ETH down ; sleep $(( 10 + ($RANDOM & 0x7) ))
        ifconfig $ETH up ; sleep $(( 10 + ($RANDOM & 0x7) ))
done
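
The same loop, split into small helpers so the delay range and the commands
are easier to check, might look like the sketch below. The function names and
the DRY_RUN switch are hypothetical additions for illustration; ETH is the
interface variable from the loop above.

```shell
# Print a delay in seconds between 10 and 17 inclusive, the same range as
# 10 + ($RANDOM & 0x7) in the loop above. RANDOM is a bash feature.
flap_delay() {
    echo $(( 10 + (RANDOM & 0x7) ))
}

# Set the interface named by $ETH to the given state (down or up).
# With DRY_RUN=1 (hypothetical switch) only print the command.
set_link() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "ifconfig $ETH $1"
    else
        ifconfig "$ETH" "$1"
    fi
}

# One down/up cycle with a random pause after each transition.
flap_once() {
    set_link down; sleep "$(flap_delay)"
    set_link up;   sleep "$(flap_delay)"
}

# Usage, as root on the host while fio is running:
#   while true ; do flap_once ; done
```

Each pause is 10 to 17 seconds, so a full down/up cycle takes between 20 and
34 seconds.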