stack smashing detected with 'nvme sanitize-log /dev/nvme0'

Daniel Wagner dwagner at suse.de
Mon Aug 21 06:37:55 PDT 2023


On Wed, Jul 26, 2023 at 03:16:43PM +0200, Christoph Hellwig wrote:
> On Wed, Jul 26, 2023 at 01:52:04PM +0200, Daniel Wagner wrote:
> > FYI, I got a bug report [1] about a 'stack smashing detected' error when
> > running 'nvme sanitize-log /dev/nvme0' on Debian. Originally, it was
> > reported against udisks. udisks recently added libnvme, which now issues a
> > sanitize-log call, so this problem might have existed for a while.
> > 
> > We figured out that an older kernel such as 4.19.289 works but newer ones
> > do not (it's a bit hard for the reporter to test all combinations on his
> > setup due to compiler changes etc.).
> > 
> > There was a bit of refactoring in v5.2 which could be the cause of the
> > stack smash, because we saw this recent fix:
> > 
> >  b8f6446b6853 ("nvme-pci: fix DMA direction of unmapping integrity data")
> > 
> > [1] https://github.com/storaged-project/udisks/issues/1152
> 
> If you think it is related to DMA, there are good ways to check for that:
> 
>   1) force an IOMMU to be used for this device
>   2) hack nvme or the blk-map code so that we never do the direct mapping
>      to user space but always do the copy based version, and then enable
>      all kernel memory debugging helpers, most importantly KASAN
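
For reference, forcing the IOMMU on (option 1) is a kernel command line
thing; on an Intel box something like

  intel_iommu=on iommu.passthrough=0 iommu.strict=1

(the exact parameters depend on the platform). With strict,
non-passthrough translation, a device that DMAs outside its mapped
buffer should trigger an IOMMU fault instead of silently corrupting
memory, at least when the write crosses the mapped page.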

Collected some info:

 - this happens only with devices from MAXIO Technology

   vid       : 0x1e4b
   ssvid     : 0x1e4b

 - Oleg bisected the kernel and landed on 3b2a1ebceba3 ("nvme: set
   dma alignment to qword"). He hacked the kernel and this change made
   it work again:

   --- a/drivers/nvme/host/core.c
   +++ b/drivers/nvme/host/core.c
   @@ -1871,7 +1871,6 @@ static void nvme_set_queue_limits(struct nvme_ctrl *ctrl,
                   blk_queue_max_segments(q, min_t(u32, max_segments, USHRT_MAX));
           }
           blk_queue_virt_boundary(q, NVME_CTRL_PAGE_SIZE - 1);
   -       blk_queue_dma_alignment(q, 3);
           blk_queue_write_cache(q, vwc, vwc);
   }
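
   Reverting just that line raises the DMA alignment mask back to the
   default of 511. The on-stack 512-byte log buffer from the report is
   then practically never 512-byte aligned, so the block layer falls
   back to copying through freshly allocated kernel bounce pages, and
   the device's overrun likely lands in the unused remainder of the
   bounce page, which would explain why older kernels appear to work.
   A small userspace sketch of that gating logic (paraphrasing what
   blk_rq_map_user_iov checks, not the exact kernel code):

     #include <stdint.h>
     #include <stdio.h>

     /*
      * Paraphrase of the block layer's alignment gate: a buffer that
      * is not aligned to the queue's dma_alignment mask is copied
      * through a kernel bounce buffer instead of being mapped
      * directly for DMA.
      */
     static int needs_copy(const void *buf, unsigned long dma_align_mask)
     {
             return ((uintptr_t)buf & dma_align_mask) != 0;
     }

     int main(void)
     {
             char log[512];  /* on-stack log buffer, as in the report */

             /* mask 511 (old default): stack buffers practically
              * never pass, so the copy path hides the overrun */
             printf("mask 511: copy=%d\n", needs_copy(log, 511));

             /* mask 3 (after 3b2a1ebceba3): any dword-aligned buffer
              * is mapped directly, exposing the overrun to userspace */
             printf("mask 3:   copy=%d\n", needs_copy(log, 3));
             return 0;
     }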

 - modified the reproducer so that it allocates a 4k buffer with a
   well-known pattern and checked the buffer after fetching the sanitize
   log [1]. The first invocation wrote 0x940 bytes and the second 0x5c0
   bytes. Note we only asked for 0x200 bytes.
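
   A simplified sketch of such a reproducer (the command setup below is
   my reconstruction, not the exact code from [1]): fill an oversized,
   pattern-initialized buffer, ask for 0x200 bytes of the sanitize log
   via the admin passthru ioctl, and check how far the device actually
   wrote:

     #include <fcntl.h>
     #include <linux/nvme_ioctl.h>
     #include <stdint.h>
     #include <stdio.h>
     #include <stdlib.h>
     #include <string.h>
     #include <sys/ioctl.h>

     #define BUF_LEN 4096
     #define LOG_LEN 0x200                  /* what we actually ask for */
     #define PATTERN 0x5a

     int main(int argc, char **argv)
     {
             /* page-aligned, so it passes the dma_alignment check and
              * is mapped directly: any overrun hits our pattern */
             unsigned char *buf = aligned_alloc(4096, BUF_LEN);
             unsigned int numd = LOG_LEN / 4 - 1;    /* 0's based dwords */
             struct nvme_admin_cmd cmd = {
                     .opcode   = 0x02,               /* Get Log Page */
                     .nsid     = 0xffffffff,
                     .addr     = (uintptr_t)buf,
                     .data_len = LOG_LEN,
                     .cdw10    = 0x81 | (numd << 16), /* LID 0x81 */
             };
             int fd = open(argc > 1 ? argv[1] : "/dev/nvme0", O_RDONLY);
             size_t touched = 0, i;

             memset(buf, PATTERN, BUF_LEN);
             if (fd < 0 || ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd) < 0) {
                     perror("sanitize log");
                     return 1;
             }
             /* highest offset that no longer holds the pattern */
             for (i = 0; i < BUF_LEN; i++)
                     if (buf[i] != PATTERN)
                             touched = i + 1;
             printf("device wrote 0x%zx bytes (asked for 0x%x)\n",
                    touched, LOG_LEN);
             return 0;
     }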

 - modified blk_rq_map_user_iov so that it always uses the copy path
   (see the sketch below). The problem went away. I forgot to ask the
   reporter to turn on KASAN, but given that we know the device writes
   too much data, it is likely also overwriting some kernel memory.

So what's the best way forward from here? Introduce a quirk and always
use a bounce buffer for these devices?
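
Just to sketch what I have in mind (the quirk name, bit and the device
ID below are made up, this is not a tested patch): raise the DMA
alignment mask back to 511 for the known-bad controllers, so that user
buffers take the bounce path again as they did before v5.2:

    /* drivers/nvme/host/nvme.h: hypothetical new quirk */
    NVME_QUIRK_DMA_BOUNCE = (1 << 20),

    /* drivers/nvme/host/core.c: nvme_set_queue_limits() */
    if (ctrl->quirks & NVME_QUIRK_DMA_BOUNCE)
            blk_queue_dma_alignment(q, 511);  /* force the copy path */
    else
            blk_queue_dma_alignment(q, 3);

    /* drivers/nvme/host/pci.c: quirk table, device ID is a placeholder */
    { PCI_DEVICE(0x1e4b, 0x1602),
            .driver_data = NVME_QUIRK_DMA_BOUNCE, },

Strictly speaking, a mask of 511 still direct-maps a buffer that happens
to be 512-byte aligned, so this restores the old behaviour rather than
guaranteeing a bounce.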

[1] https://github.com/storaged-project/libblockdev/pull/951#issuecomment-1676276491


