Hang when running LLVM+clang test suite

Sun Jan 21 06:13:27 PST 2018

> On Jan 21, 2018, at 08:49, David Zarzycki <dave at znu.io> wrote:
> 
> 
> 
>> On Jan 20, 2018, at 21:50, Keith Busch <keith.busch at intel.com> wrote:
>> 
>> On Sat, Jan 20, 2018 at 05:47:06AM -0500, David Zarzycki wrote:
>>> Hello NVMe developers,
>>> 
>>> The LLVM+clang test suite regularly (but not reliably) hangs the kernel (version 4.14.13-300.fc27.x86_64). I don’t see this hang when running the test suite in /tmp (tmpfs) or on a SATA SSD.
>>> 
>>> Here are photos of the console debug info, with the NVMe driver in the backtrace:
>>> 
>>> http://znu.io/dual8168hang.tar
>>> 
>>> Here is another instance of the hang, again with NVMe in the backtrace:
>>> 
>>> http://znu.io/IMG_0362.jpg
>> 
>> It looks like the scheduler is stuck or a task struct is corrupt. Off the
>> top of my head, I can't think of what nvme would have to do with that,
>> though. It just invokes the callback associated with a command and
>> doesn't directly manipulate any scheduler structs.
> 
> Hi Keith,
> 
> Thanks for looking at the backtraces. What other subsystems should I be looking at then?
> 
> Given that the LLVM+clang test suite is reliable when built/run in tmpfs, that implies that most of the kernel is reliable. I’ve also run the test suite reliably on an ext4 filesystem on a SATA SSD.
> 
> I’ve tried both xfs and ext4 on NVMe and they both crash, which implies that individual filesystems aren't the problem. Please note that the NVMe setup is simple: one partition and no LVM, RAID, bcache, etc.
> 
> What’s left at this point? What other combinations or debug parameters should I test?

To my surprise, I think I’ve narrowed down the bug to the block multi-queue layer. Can you please confirm that the following test is reasonable?

1) Create a file in /tmp (tmpfs, with plenty of RAM) that is 2x the size needed by the test suite
2) 'losetup' the file
3) 'cat /sys/block/loop0/queue/scheduler' ("[mq-deadline] none" in this case)
4) Create an ext4 filesystem on /dev/loop0 and mount it
5) Run the stress test in the loopback filesystem
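
For concreteness, steps 1 through 4 could be sketched roughly as follows. This is only an illustration under stated assumptions, not the exact commands I ran: the backing file path, the 64G size, and the mount point are hypothetical placeholders, and /dev/loop0 is assumed to be free. The script only builds and prints the commands (each would need root to run); the helper names are made up for this sketch. It also shows how to read the sysfs line from step 3, where the active scheduler is the one in square brackets.

```python
import re
import shlex


def active_scheduler(sysfs_line):
    """Return the bracketed (active) scheduler from a queue/scheduler line.

    Example input: "[mq-deadline] none". Returns None if nothing is bracketed.
    """
    match = re.search(r"\[([^\]]+)\]", sysfs_line)
    return match.group(1) if match else None


def loopback_test_plan(backing_file, size, loop_dev, mount_point):
    """Build (but do not run) the commands for steps 1-4. All need root."""
    loop_name = loop_dev.rsplit("/", 1)[-1]  # "/dev/loop0" -> "loop0"
    return [
        # 1) Create a (sparse) backing file on tmpfs.
        ["truncate", "-s", size, backing_file],
        # 2) Attach the file to the loop device.
        ["losetup", loop_dev, backing_file],
        # 3) Show which I/O scheduler the loop device's queue uses.
        ["cat", f"/sys/block/{loop_name}/queue/scheduler"],
        # 4) Create an ext4 filesystem on the loop device and mount it.
        ["mkfs.ext4", loop_dev],
        ["mkdir", "-p", mount_point],
        ["mount", loop_dev, mount_point],
    ]


if __name__ == "__main__":
    # Hypothetical values; adjust to the machine under test.
    plan = loopback_test_plan(
        "/tmp/looptest.img", "64G", "/dev/loop0", "/mnt/looptest"
    )
    for cmd in plan:
        print(shlex.join(cmd))
    print(active_scheduler("[mq-deadline] none"))
```

Step 5 is then just running the test suite from a build directory under the mount point.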

Thanks,
Dave