[PATCH v5 14/40] netfs: Add iov_iters to (sub)requests to describe various buffers

Christian Ebner c.ebner at proxmox.com
Thu Aug 8 01:07:24 PDT 2024


Hi,

Recently, some of our (Proxmox VE) users reported file corruption when 
accessing contents located on CephFS via the in-kernel ceph client 
[0,1]. According to these reports, our kernels based on (Ubuntu) v6.8 
show the corruption, while the FUSE client and the in-kernel ceph client 
with kernels based on v6.5 do not. Unfortunately, the corruption is hard 
to reproduce.

After a further report of file corruption [2] with a completely 
unrelated code path, we managed to reproduce the corruption for one file 
by sheer chance on one of our ceph test clusters. We were able to narrow 
it down to what is possibly an issue with reading the contents via the 
in-kernel ceph client. Note that we can rule out the file contents 
themselves being corrupt, as any unaffected kernel version or the FUSE 
client returns the correct contents.

The issue is also present in the current mainline kernel 6.11-rc2.

Bisection with the reproducer points to this commit:

"92b6cc5d: netfs: Add iov_iters to (sub)requests to describe various 
buffers"

Description of the issue:

* File copied from the local filesystem to cephfs via:
`cp /tmp/proxmox-backup-server_3.2-1.iso 
/mnt/pve/cephfs/proxmox-backup-server_3.2-1.iso`
* sha256sum on local filesystem:
`1d19698e8f7e769cf0a0dcc7ba0018ef5416c5ec495d5e61313f9c84a4237607 
/tmp/proxmox-backup-server_3.2-1.iso`
* sha256sum on cephfs with a kernel prior to the above commit:
`1d19698e8f7e769cf0a0dcc7ba0018ef5416c5ec495d5e61313f9c84a4237607 
/mnt/pve/cephfs/proxmox-backup-server_3.2-1.iso`
* sha256sum on cephfs with a kernel containing the above commit:
`89ad3620bf7b1e0913b534516cfbe48580efbaec944b79951e2c14e5e551f736 
/mnt/pve/cephfs/proxmox-backup-server_3.2-1.iso`
* Removing and/or recopying the file does not change the issue; the 
corrupt checksum remains the same.
* Only this one particular file has been observed to show the issue, for 
others the checksums match.
* Accessing the same file from different clients gives the same result: 
clients with the above patch applied show the incorrect checksum, 
clients without the patch show the correct checksum.
* The issue persists even across reboot of the ceph cluster and/or clients.
* The file is indeed corrupt after reading, as verified by a `cmp -b`.
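
To narrow down where the two copies diverge, the `cmp -b` check above 
could be summarised per block. The following is only a minimal sketch, 
not something used for the results in this report: the function name 
`diff_ranges` and the 4096-byte block size are illustrative assumptions, 
and it assumes both copies are readable from the same client.

```python
def diff_ranges(path_a, path_b, block_size=4096):
    """Return (offset, length) ranges where the two files differ.

    The files are compared block by block and adjacent differing blocks
    are merged into one range. If one file is longer than the other, the
    extra tail shows up as differing blocks.
    """
    ranges = []
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(block_size)
            b = fb.read(block_size)
            if not a and not b:
                break  # both files are exhausted
            length = max(len(a), len(b))
            if a != b:
                if ranges and ranges[-1][0] + ranges[-1][1] == offset:
                    # Directly follows the previous range: extend it.
                    ranges[-1] = (ranges[-1][0], ranges[-1][1] + length)
                else:
                    ranges.append((offset, length))
            offset += length
    return ranges
```

Calling `diff_ranges()` with the local copy and the cephfs copy as 
arguments returns an empty list if the contents match, and otherwise the 
byte ranges that differ, which could then be compared against page or 
object boundaries.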

Does anyone have an idea what could be the cause of this issue, or how 
to further debug this? Happy to provide more information or a dynamic 
debug output if needed.

Best regards,

Chris

[0] https://forum.proxmox.com/threads/78340/post-676129
[1] https://forum.proxmox.com/threads/149249/
[2] https://forum.proxmox.com/threads/151291/



