Oops when completing request on the wrong queue
Gabriel Krisman Bertazi
krisman at linux.vnet.ibm.com
Tue Aug 23 13:54:03 PDT 2016
Gabriel Krisman Bertazi <krisman at linux.vnet.ibm.com> writes:
>> Can you share what you ran to online/offline CPUs? I can't reproduce
>> this here.
>
> I was using the ppc64_cpu tool, which shouldn't do anything more than
> write to sysfs, but I just reproduced it with the script below.
>
> Note that this is ppc64le. I don't have an x86 machine in hand to attempt
> to reproduce right now, but I'll look for one and see how it goes.
Hi,
Any luck reproducing it? We were initially reproducing with a
proprietary stress test, but I tried a generated fio job file combined
with the SMT script I shared earlier and could reproduce the crash
consistently in less than 10 minutes of execution. This was still on
ppc64le, though; I haven't been able to get my hands on NVMe on x86 yet.
Here are the job file I used and the smt.sh script, in case you want to
give it a try:
jobfile: http://krisman.be/k/nvmejob.fio
smt.sh: http://krisman.be/k/smt.sh
Still, the trigger consistently seems to be a heavy IO load combined
with CPU addition/removal.
Let me share my progress from the last couple days in the hope that it
rings a bell for you.
Firstly, I verified that when we hit the BUG_ON in nvme_queue_rq, the
request_queue's freeze_depth is 0, which points away from a fault in the
freeze/unfreeze mechanism. If a request was escaping and going through
the block layer during a freeze, we'd see freeze_depth >= 1. Before
that, I had also tried to keep the q_usage_counter in atomic mode, in
case of a bug in the percpu refcount. No luck, the BUG_ON was still
hit.
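For reference, the kind of check I mean looks something like the hunk
below. This is not the exact debug hunk I used, just a rough sketch
assuming a 4.7-era struct request_queue, where the freeze counter is the
atomic mq_freeze_depth field, and called from the top of nvme_queue_rq():

/*
 * Rough sketch only: report the queue freeze state at the crash site.
 * Assumes the 4.7-era atomic mq_freeze_depth field in struct
 * request_queue.
 */
static void nvme_debug_freeze_state(struct request *req)
{
        struct request_queue *q = req->q;

        pr_warn("nvme: rq %p: mq_freeze_depth=%d dying=%d\n",
                req, atomic_read(&q->mq_freeze_depth),
                blk_queue_dying(q));
}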
Also, I don't see anything special about the request that reaches the
BUG_ON. It's a REQ_TYPE_FS request and, at least the last time I
reproduced it, it was a READ that came from the stress test task through
submit_bio. So there's nothing remarkable about it either, as far as I
can see.
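In case you want to classify the request on your side, something along
these lines right before the BUG_ON would do (rough, untested sketch
against 4.7-era request fields, with req being the struct request about
to be queued):

        /*
         * Rough sketch: classify the offending request right before the
         * BUG_ON in nvme_queue_rq() (4.7-era request fields).
         */
        pr_warn("nvme: bad rq: cmd_type=%d dir=%s tag=%d\n",
                req->cmd_type,
                rq_data_dir(req) == READ ? "READ" : "WRITE",
                req->tag);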
I'm still thinking about a case in which the mapping gets screwed up,
where a ctx would appear in two hctxs' bitmaps after a remap, or where a
ctx got remapped to another hctx. I'm still learning my way through the
cpumap code, so I'm not sure it's a real possibility, but I'm not
convinced it isn't. Some preliminary tests don't suggest this is the
case at play, but I want to spend a little more time on this theory
(maybe for lack of better ideas :)
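To poke at that theory, the consistency check I have in mind is roughly
the following, meant to live in block/blk-mq.c and run after
blk_mq_map_swqueue(). It's a very rough, untested sketch;
blk_mq_check_ctx_map is just a name I made up, and it assumes the
4.7-era hctx->ctxs/nr_ctx layout:

/*
 * Very rough sketch: every software ctx should show up in exactly one
 * hctx->ctxs[] array.  Anything else after a remap would be the smoking
 * gun for the double-mapping theory.
 */
static void blk_mq_check_ctx_map(struct request_queue *q)
{
        struct blk_mq_hw_ctx *hctx;
        unsigned int i, j, cpu;

        for_each_possible_cpu(cpu) {
                struct blk_mq_ctx *ctx = per_cpu_ptr(q->queue_ctx, cpu);
                int seen = 0;

                queue_for_each_hw_ctx(q, hctx, i)
                        for (j = 0; j < hctx->nr_ctx; j++)
                                if (hctx->ctxs[j] == ctx)
                                        seen++;

                WARN(seen > 1, "blk-mq: ctx for cpu %u mapped to %d hctxs\n",
                     cpu, seen);
        }
}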
On a side note, probably unrelated to this crash, it also got me
thinking about the current usefulness of blk_mq_hctx_notify. Since the
CPU is dead, no more requests would be coming through its ctx. I think
we could force a queue run in blk_mq_queue_reinit_notify, before
remapping, which would cause the hctx to fetch the remaining requests
from that dead ctx (since it's not unmapped yet). This way, we could
keep a single hotplug notification hook and simplify the hotplug path.
I haven't written code for it yet, but I'll see if I can come up with
something and send it to the list.
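The rough shape I have in mind is something like the helper below,
called from blk_mq_queue_reinit_notify() on CPU_DEAD before the queues
are frozen and remapped. Completely untested, and blk_mq_drain_dead_ctxs
is just a placeholder name:

/*
 * Untested sketch: run every hardware queue before the remap so that
 * requests still sitting on the dead CPU's ctx are flushed to the hctx
 * dispatch list while the old ctx->hctx mapping is still in place.
 */
static void blk_mq_drain_dead_ctxs(void)
{
        struct request_queue *q;

        /* the notifier already holds all_q_mutex at this point */
        list_for_each_entry(q, &all_q_list, all_q_node)
                blk_mq_run_hw_queues(q, false);
}

Whether a synchronous run is enough there, or whether we'd need to wait
for the hctx work to complete before the remap, is something I still
have to figure out.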
--
Gabriel Krisman Bertazi