nvme: split bios issued in reverse order

Jonathan Nicklin jnicklin at blockbridge.com
Tue May 24 06:25:20 PDT 2022


On Tue, May 24, 2022 at 8:58 AM Sagi Grimberg <sagi at grimberg.me> wrote:
>
>
> > There seems to be an inconsistency in the order of writes that are
> > issued after splitting a bio. Ordering depends on how the application
> > write is submitted and the number of I/O queues configured.
> >
> > In our testing with nvme/tcp,
>
> Is this specific to nvme-tcp?

No, this is not specific to nvme-tcp. I confirmed the same behavior
directly on a PCI device.

>
> > a 128K write issued with fio/pvsync
>
> is this specific to the io engine?

Yes. With ioengine=libaio, the IOs are issued in reverse order. With
ioengine=pvsync, the IOs are issued sequentially.

>
> > is split
> > into four 32K I/Os (the target maximum data transfer size is set to
> > 32K, and max_sectors_kb is therefore 32K). As expected, the four write
> > I/Os are issued to the target in sequential order. However, if the
> > 128K write is issued using fio/libaio, the four 32K writes are issued
> > in reverse order:
> >
> > fio-8098 [001] ..... 254009.711080: nvme_setup_cmd: nvme1:
> > disk=nvme1c1n1, qid=2, cmdid=16468, nsid=1, flags=0x0, meta=0x0,
> > cmd=(nvme_cmd_write slba=192, len=63, ctrl=0x0, dsmgmt=0, reftag=0)
> >
> > fio-8098 [001] ..... 254009.711083: nvme_setup_cmd: nvme1:
> > disk=nvme1c1n1, qid=2, cmdid=16467, nsid=1, flags=0x0, meta=0x0,
> > cmd=(nvme_cmd_write slba=128, len=63, ctrl=0x0, dsmgmt=0, reftag=0)
> >
> > fio-8098 [001] ..... 254009.711084: nvme_setup_cmd: nvme1:
> > disk=nvme1c1n1, qid=2, cmdid=16466, nsid=1, flags=0x0, meta=0x0,
> > cmd=(nvme_cmd_write slba=64, len=63, ctrl=0x0, dsmgmt=0, reftag=0)
> >
> > fio-8098 [001] ..... 254009.711085: nvme_setup_cmd: nvme1:
> > disk=nvme1c1n1, qid=2, cmdid=16465, nsid=1, flags=0x0, meta=0x0,
> > cmd=(nvme_cmd_write slba=0, len=63, ctrl=0x0, dsmgmt=0, reftag=0)
> >
> > Further investigation found that if the number of I/O queues is
> > limited to 1 at connect time,
>
> Is this specific to a single I/O queue?

With ioengine=libaio && queues > 1, the IOs are issued in reverse
order. With ioengine=libaio && queues == 1, the IOs are in sequential
order.
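To compare configurations (libaio vs. pvsync, one vs. several queues) against a trace like the one quoted above, a small helper can classify the issue order. This is a minimal Python sketch, assuming each command's `cmd=(... slba=..., len=...)` field sits on a single line as ftrace prints it; the function names are hypothetical. The NVMe length field in the trace is zero-based (len=63 means 64 LBAs), so the sketch adds one.

```python
import re

# Matches the read/write portion of an nvme_setup_cmd tracepoint record, e.g.
# "cmd=(nvme_cmd_write slba=192, len=63, ctrl=0x0, dsmgmt=0, reftag=0)".
CMD_RE = re.compile(r"cmd=\((nvme_cmd_\w+) slba=(\d+), len=(\d+)")


def parse_setup_cmds(trace_text):
    """Return (opcode, slba, nlb) tuples in the order they appear in the trace.

    The NVMe length field is zero-based, so the block count is len + 1.
    """
    return [
        (op, int(slba), int(length) + 1)
        for op, slba, length in CMD_RE.findall(trace_text)
    ]


def issue_order(cmds):
    """Classify start LBAs as 'ascending', 'descending', or 'mixed'.

    Fewer than two commands is reported as 'ascending'.
    """
    slbas = [slba for _, slba, _ in cmds]
    if slbas == sorted(slbas):
        return "ascending"
    if slbas == sorted(slbas, reverse=True):
        return "descending"
    return "mixed"


if __name__ == "__main__":
    # The cmd=(...) fields from the libaio trace quoted above, in trace order.
    sample = "\n".join(
        f"cmd=(nvme_cmd_write slba={lba}, len=63, ctrl=0x0, dsmgmt=0, reftag=0)"
        for lba in (192, 128, 64, 0)
    )
    cmds = parse_setup_cmds(sample)
    print(cmds)
    print(issue_order(cmds))
```

Feeding it the libaio trace with multiple queues should report "descending", and the pvsync or single-queue traces "ascending".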

>
> > the issue order is sequential for both
> > pwritev and libaio.
>
> I'm assuming that this is 100% repeatable?

Yes, 100% repeatable.

>
> >
> > I've spent some time tracing through the bio/blk_mq code and
> > can't seem to find what might be causing the difference in
> > behavior. Can anyone confirm that this is expected or desired
> > behavior?
>
> What is the controller mdts? does the 32k go in-capsule? or does
> the controller send r2t?

mdts=32K, io capsule size=32K, no R2T
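With mdts=32K, the 128K write has to become four commands. The arithmetic can be sketched in Python as follows, assuming the controller reports MDTS=3 with a 4 KiB minimum memory page size (CAP.MPSMIN), and 512-byte LBAs, which matches the 64-block (len=63) commands in the trace. MDTS is reported as a power of two in units of CAP.MPSMIN, and 0 means no limit. The splitter is a simplification: it ignores the segment-count, virtual-boundary, and alignment limits the block layer also applies, and both function names are hypothetical.

```python
def mdts_bytes(mdts, mpsmin_bytes=4096):
    """Maximum data transfer size in bytes, or None if MDTS is 0 (no limit).

    MDTS is a power of two in units of the minimum memory page size.
    """
    if mdts == 0:
        return None
    return mpsmin_bytes << mdts


def split_write(offset_bytes, length_bytes, max_bytes, lba_bytes=512):
    """Split one write into chunks of at most max_bytes.

    Returns (slba, len) pairs in ascending LBA order, with len in the
    zero-based form printed by the nvme_setup_cmd tracepoint. Assumes the
    offset and length are multiples of lba_bytes.
    """
    chunks = []
    pos = offset_bytes
    end = offset_bytes + length_bytes
    while pos < end:
        size = min(max_bytes, end - pos)
        chunks.append((pos // lba_bytes, size // lba_bytes - 1))
        pos += size
    return chunks
```

For example, split_write(0, 128 * 1024, mdts_bytes(3)) yields the same four (slba, len) pairs as the trace, (0, 63) through (192, 63), but in ascending order.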

>
>
> Also, if we assume that this is indeed the case, is this a fundamental
> issue?

It may be fundamental, since it occurs with both PCI and TCP devices.
The part I can't reconcile is why ioengine=libaio behaves differently
when multiple queues are present. It feels like it has something to do
with the interaction between bio splitting and plugging.
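The thread does not pin down where the reversal happens. Purely to illustrate one mechanism that would produce this pattern (a toy model, not kernel code, and not a claim about how the blk-mq plug list actually behaves): if plugged requests are pushed onto the head of a singly linked list and a flush path walks that list head-first, the split pieces come out newest-first, while a path that reverses the list before dispatch preserves submission order. The function name is hypothetical.

```python
def plug_and_flush(requests, reverse_before_dispatch):
    """Toy plug list: a singly linked list built by head insertion.

    requests: items in submission order (here, start LBAs of split pieces).
    Returns the items in dispatch order.
    """
    head = None
    for rq in requests:
        # Head insertion is O(1), but leaves the newest request at the head.
        head = (rq, head)

    # Walk the list from the head, as a naive flush would.
    dispatched = []
    while head is not None:
        rq, head = head
        dispatched.append(rq)

    if reverse_before_dispatch:
        # Equivalent to reversing the list before walking it:
        # dispatch order then matches submission order.
        dispatched.reverse()
    return dispatched
```

With the four split pieces [0, 64, 128, 192], the naive walk returns [192, 128, 64, 0], the same slba sequence as the libaio trace.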

Here are a couple more details:
- you can reproduce it on a PCI device by setting max_sectors_kb to 32
- the reversed issue order does not occur if the submitted IO is a read.
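For the PCI reproduction, the limit is lowered by writing 32 to /sys/block/<dev>/queue/max_sectors_kb. A minimal Python sketch of that step, assuming root privileges and a hypothetical device name such as nvme0n1. The kernel refuses values larger than max_hw_sectors_kb (the write then raises OSError), so the helper reads the value back to confirm it took effect. The helper name and the sysfs_block parameter are hypothetical.

```python
from pathlib import Path


def set_max_sectors_kb(dev, kb, sysfs_block="/sys/block"):
    """Set queue/max_sectors_kb for a block device and return the value read back.

    dev is a block device name such as "nvme0n1" (hypothetical example).
    sysfs_block can point at a scratch directory for testing.
    """
    attr = Path(sysfs_block) / dev / "queue" / "max_sectors_kb"
    # Values above max_hw_sectors_kb are rejected by the kernel (OSError here).
    attr.write_text(f"{kb}\n")
    # Read back to confirm the setting took effect.
    return int(attr.read_text().strip())
```

For example, set_max_sectors_kb("nvme0n1", 32) before rerunning the libaio job should make a 128K write split into four 32K commands.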

I'm happy to run additional testing to shed more light on the behavior.



More information about the Linux-nvme mailing list