Data corruption when using multiple devices with NVMEoF TCP

Hao Wang pkuwangh at gmail.com
Thu Dec 24 05:28:57 EST 2020


Sagi, thanks a lot for helping look into this.

> Question, if you build the raid0 in the target and expose that over nvmet-tcp (with a single namespace), does the issue happen?
No, it works fine in that case.
Actually, with this setup the latency was initially pretty bad, and it
seems that enabling CONFIG_NVME_MULTIPATH improved it significantly.
I'm not entirely sure, though, as I changed too many things at once and
didn't test specifically for this setup.
Could you help confirm whether that is expected?

And after applying your patch:
 - With the problematic setup, i.e. creating a 2-device raid0, I did
see numerous prints popping up in dmesg; a few lines are pasted below.
 - With the good setup, i.e. using only 1 device, the print also pops
up, but much less frequently.

[  390.240595] nvme_tcp: rq 10 (WRITE) contains multiple bios bvec:
nsegs 25 size 102400 offset 0
[  390.243146] nvme_tcp: rq 35 (WRITE) contains multiple bios bvec:
nsegs 7 size 28672 offset 4096
[  390.246893] nvme_tcp: rq 35 (WRITE) contains multiple bios bvec:
nsegs 25 size 102400 offset 4096
[  390.250631] nvme_tcp: rq 35 (WRITE) contains multiple bios bvec:
nsegs 4 size 16384 offset 16384
[  390.254374] nvme_tcp: rq 11 (WRITE) contains multiple bios bvec:
nsegs 7 size 28672 offset 0
[  390.256869] nvme_tcp: rq 11 (WRITE) contains multiple bios bvec:
nsegs 25 size 102400 offset 12288
[  390.266877] nvme_tcp: rq 57 (READ) contains multiple bios bvec:
nsegs 4 size 16384 offset 118784
[  390.269444] nvme_tcp: rq 58 (READ) contains multiple bios bvec:
nsegs 4 size 16384 offset 118784
[  390.273281] nvme_tcp: rq 59 (READ) contains multiple bios bvec:
nsegs 4 size 16384 offset 0
[  390.275776] nvme_tcp: rq 60 (READ) contains multiple bios bvec:
nsegs 4 size 16384 offset 118784
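
In case it helps: my (possibly wrong) reading of those prints is that
rq->bio is just the head of a chain of merged bios ending at
rq->biotail, while the nsegs/size/offset values come from that head bio
alone, so they may describe only part of the request. A rough, untested
sketch of the comparison I have in mind (not code from the driver):

--
#include <linux/blkdev.h>
#include <linux/bio.h>

/* Walk the whole bio chain of a request and compare its totals against
 * the head bio that the nsegs/size values above are taken from. */
static void compare_head_bio_vs_request(struct request *rq)
{
	struct bio *bio;
	unsigned int chain_segs = 0, chain_bytes = 0;

	for (bio = rq->bio; bio; bio = bio->bi_next) {
		chain_segs += bio_segments(bio);
		chain_bytes += bio->bi_iter.bi_size;
	}

	pr_info("rq %d: head bio %u bytes vs whole request %u bytes (%u segs)\n",
		rq->tag, rq->bio->bi_iter.bi_size, chain_bytes, chain_segs);
}
--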

On Wed, Dec 23, 2020 at 6:57 PM Sagi Grimberg <sagi at grimberg.me> wrote:
>
>
> > Okay, tried both v5.10 and latest 58cf05f597b0.
> >
> > And same behavior
> >   - data corruption on the initiator side when creating a raid-0 volume
> > using 2 nvme-tcp devices;
> >   - no data corruption either on local target side, or on initiator
> > side when using only 1 nvme-tcp device.
> >
> > A difference I can see on the max_sectors_kb is that, now on the
> > target side, /sys/block/nvme*n1/queue/max_sectors_kb also becomes
> > 1280.
> >
>
> Thanks Hao,
>
> I'm thinking we may have an issue with bio splitting/merging/cloning.
>
> Question, if you build the raid0 in the target and expose that over
> nvmet-tcp (with a single namespace), does the issue happen?
>
> Also, it would be interesting to add this patch and see if the following
> print pops up, and whether it correlates with when you see the issue:
>
> --
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 979ee31b8dd1..d0a68cdb374f 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -243,6 +243,9 @@ static void nvme_tcp_init_iter(struct nvme_tcp_request *req,
>                  nsegs = bio_segments(bio);
>                  size = bio->bi_iter.bi_size;
>                  offset = bio->bi_iter.bi_bvec_done;
> +               if (rq->bio != rq->biotail)
> +                       pr_info("rq %d (%s) contains multiple bios bvec: nsegs %d size %d offset %ld\n",
> +                               rq->tag, dir == WRITE ? "WRITE" : "READ", nsegs, size, offset);
>          }
>
>          iov_iter_bvec(&req->iter, dir, vec, nsegs, size);
> --
>
> I'll try to look further to understand if we have an issue there.
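
One more thought, in case it is useful: would it help to also print the
request's total length next to the first bio's size, so we can see
directly whether the iter covers less data than the request? Something
along these lines on top of your patch (untested, just an idea;
blk_rq_bytes() is the only addition):

--
+               if (rq->bio != rq->biotail)
+                       pr_info("rq %d (%s) contains multiple bios bvec: nsegs %d size %d offset %ld rq_bytes %u\n",
+                               rq->tag, dir == WRITE ? "WRITE" : "READ",
+                               nsegs, size, offset, blk_rq_bytes(rq));
--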


