Error when running fio against nvme-of rdma target (mlx5 driver)
Max Gurtovoy
mgurtovoy at nvidia.com
Tue May 17 04:16:35 PDT 2022
Hi,
Can you please send the original scenario, setup details and dumps?
I can't find it in my mailbox.
You can send it directly to me to avoid spam.
-Max.
On 5/17/2022 11:26 AM, Mark Ruijter wrote:
> Hi Robin,
>
> I ran into the exact same problem while testing with 4 ConnectX-6 cards on kernel 5.18-rc6.
>
> [ 4878.273016] nvme nvme0: Successfully reconnected (3 attempts)
> [ 4879.122015] nvme nvme0: starting error recovery
> [ 4879.122028] infiniband mlx5_4: mlx5_handle_error_cqe:332:(pid 0): WC error: 4, Message: local protection error
> [ 4879.122035] infiniband mlx5_4: dump_cqe:272:(pid 0): dump error cqe
> [ 4879.122037] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [ 4879.122039] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [ 4879.122040] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [ 4879.122040] 00000030: 00 00 00 00 a9 00 56 04 00 00 00 ed 0d da ff e2
> [ 4881.085547] nvme nvme3: Reconnecting in 10 seconds...
>
> I assume this means that the problem has still not been resolved?
> If so, I'll try to diagnose the problem.
>
> Thanks,
>
> --Mark
>
> On 11/02/2022, 12:35, "Linux-nvme on behalf of Robin Murphy" <linux-nvme-bounces at lists.infradead.org on behalf of robin.murphy at arm.com> wrote:
>
> On 2022-02-10 23:58, Martin Oliveira wrote:
> > On 2/9/22 1:41 AM, Chaitanya Kulkarni wrote:
> >> On 2/8/22 6:50 PM, Martin Oliveira wrote:
> >>> Hello,
> >>>
> >>> We have been hitting an error when running I/O over our nvme-of setup using the mlx5 driver, and we are wondering if anyone has seen anything similar or has any suggestions.
> >>>
> >>> Both initiator and target are AMD EPYC 7502 machines connected over RDMA using a Mellanox MT28908. Target has 12 NVMe SSDs which are exposed as a single NVMe fabrics device, one physical SSD per namespace.
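> >>>
> >>> For reference, the target-side layout is roughly what a minimal nvmet
> >>> configfs setup along these lines produces (the NQN, IP address and
> >>> device path below are placeholders, not our exact configuration):
> >>>
> >>>   mkdir /sys/kernel/config/nvmet/subsystems/testnqn
> >>>   echo 1 > /sys/kernel/config/nvmet/subsystems/testnqn/attr_allow_any_host
> >>>   # one namespace per physical SSD, e.g. namespace 1 -> /dev/nvme0n1
> >>>   mkdir /sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1
> >>>   echo /dev/nvme0n1 > /sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1/device_path
> >>>   echo 1 > /sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1/enable
> >>>   # RDMA port, then link the subsystem to it
> >>>   mkdir /sys/kernel/config/nvmet/ports/1
> >>>   echo rdma > /sys/kernel/config/nvmet/ports/1/addr_trtype
> >>>   echo ipv4 > /sys/kernel/config/nvmet/ports/1/addr_adrfam
> >>>   echo 192.168.1.10 > /sys/kernel/config/nvmet/ports/1/addr_traddr
> >>>   echo 4420 > /sys/kernel/config/nvmet/ports/1/addr_trsvcid
> >>>   ln -s /sys/kernel/config/nvmet/subsystems/testnqn /sys/kernel/config/nvmet/ports/1/subsystems/testnqn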
> >>>
> >>
> >> Thanks for reporting this. If you can bisect the problem on your setup,
> >> it will help others help you better.
> >>
> >> -ck
> >
> > Hi Chaitanya,
> >
> > I went back to a kernel as old as 4.15 and the problem was still there, so I don't know of a good commit to start from.
> >
> > I also learned that I can reproduce this with as few as 3 cards, and I have updated the firmware on the Mellanox cards to the latest version.
> >
> > I'd be happy to try any tests if someone has any suggestions.
>
> The IOMMU is probably your friend here - one thing that might be worth
> trying is capturing the iommu:map and iommu:unmap tracepoints to see if
> the address reported in subsequent IOMMU faults was previously mapped as
> a valid DMA address (be warned that there will likely be a *lot* of
> trace generated). With 5.13 or newer, booting with "iommu.forcedac=1"
> should also make it easier to tell real DMA IOVAs from rogue physical
> addresses or other nonsense, as real DMA addresses should then look more
> like 0xffff24d08000.
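>
> One way to capture those (assuming trace-cmd is available; raw ftrace via
> tracefs works just as well) would be something like:
>
>   # on the initiator, before starting fio
>   trace-cmd record -e iommu:map -e iommu:unmap
>   # reproduce the fault, stop tracing with Ctrl-C, then look for the
>   # IOVA reported in the IOMMU fault message
>   trace-cmd report | grep <faulting address>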
>
> That could at least help narrow down whether it's some kind of
> use-after-free race or a completely bogus address creeping in somehow.
>
> Robin.
>
>