Data corruption when using multiple devices with NVMe-oF TCP

Hao Wang pkuwangh at gmail.com
Fri Dec 25 02:49:08 EST 2020


In my current setup, on the initiator side, nvme3n1 & nvme4n1 are the 2
nvme-tcp devices; the schedulers reported for device 3 are:
 - cat /sys/block/nvme3n1/queue/scheduler: "none"
 - cat /sys/block/nvme3c3n1/queue/scheduler: "[none] mq-deadline kyber"
Not sure what nvme3c3n1 is here?
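
In case it is multipath-related: one way I know of to confirm that native
NVMe multipath is enabled on the initiator (assuming this kernel exposes
the nvme_core module parameter) is:

  cat /sys/module/nvme_core/parameters/multipath   # prints Y when native multipath is on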

And disabling merges on nvme-tcp devices solves the data corruption issue!
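
For completeness, this is roughly what I ran to disable merges; the device
names are from my setup, and the sysfs knob is the one Sagi suggested below:

  echo 2 > /sys/block/nvme3n1/queue/nomerges
  echo 2 > /sys/block/nvme4n1/queue/nomerges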

Hao



On Thu, Dec 24, 2020 at 9:56 AM Sagi Grimberg <sagi at grimberg.me> wrote:
>
>
> > Sagi, thanks a lot for helping look into this.
> >
> >> Question, if you build the raid0 in the target and expose that over nvmet-tcp (with a single namespace), does the issue happen?
> > No, it works fine in that case.
> > Actually with this setup, initially the latency was pretty bad, and it
> > seems enabling CONFIG_NVME_MULTIPATH improved it significantly.
> > I'm not exactly sure though as I've changed too many things and didn't
> > specifically test for this setup.
> > Could you help confirm that?
> >
> > And after applying your patch,
> >   - With the problematic setup, i.e. creating a 2-device raid0, I did
> > see numerous prints popping up in dmesg; a few lines are
> > pasted below:
> >   - With the good setup, i.e. only using 1 device, this line also pops
> > up, but a lot less frequently.
>
> Hao, question, what is the io scheduler in use for the nvme-tcp devices?
>
> Can you try to reproduce this issue when disabling merges on the
> nvme-tcp devices?
>
> echo 2 > /sys/block/nvmeXnY/queue/nomerges
>
> I want to see if this is an issue with merged bios.
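
For anyone hitting this thread later, my reading of the block-layer queue
sysfs documentation is that the nomerges values behave roughly like this
(nvme3n1 here is just the example device from my setup):

  echo 0 > /sys/block/nvme3n1/queue/nomerges   # default: all merge attempts enabled
  echo 1 > /sys/block/nvme3n1/queue/nomerges   # disable complex merges, keep simple one-shot merges
  echo 2 > /sys/block/nvme3n1/queue/nomerges   # disable all merging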


