[PATCH 1/2] nvme-tcp: use sendpage_ok() to check page for kernel_sendpage()
Sagi Grimberg
sagi at grimberg.me
Mon Jul 27 13:25:29 EDT 2020
On 7/26/20 6:52 AM, Coly Li wrote:
> Currently nvme_tcp_try_send_data() doesn't use kernel_sendpage() to
> send slab pages. But for pages allocated by __get_free_pages() without
> __GFP_COMP, which also have refcount as 0, they are still sent by
> kernel_sendpage() to remote end, this is problematic.
>
> When bcache uses a remote NVMe SSD via nvme-over-tcp as its cache
> device, writing meta data e.g. cache_set->disk_buckets to remote SSD may
> trigger a kernel panic due to the above problem. Bcause the meta data
> pages for cache_set->disk_buckets are allocated by __get_free_pages()
> without __GFP_COMP.
>
> This problem should be fixed both in upper layer driver (bcache) and
> nvme-over-tcp code. This patch fixes the nvme-over-tcp code by checking
> whether the page refcount is 0, if yes then don't use kernel_sendpage()
> and call sock_no_sendpage() to send the page into network stack.
>
> Such check is done by macro sendpage_ok() in this patch, which is defined
> in include/linux/net.h as,
> (!PageSlab(page) && page_count(page) >= 1)
> If sendpage_ok() returns false, sock_no_sendpage() will handle the page
> other than kernel_sendpage().
>
> The code comments in this patch is copied and modified from drbd where
> the similar problem already gets solved by Philipp Reisner. This is the
> best code comment including my own version.
>
> Signed-off-by: Coly Li <colyli at suse.de>
> Cc: Chaitanya Kulkarni <chaitanya.kulkarni at wdc.com>
> Cc: Christoph Hellwig <hch at lst.de>
> Cc: Hannes Reinecke <hare at suse.de>
> Cc: Jan Kara <jack at suse.com>
> Cc: Jens Axboe <axboe at kernel.dk>
> Cc: Mikhail Skorzhinskii <mskorzhinskiy at solarflare.com>
> Cc: Philipp Reisner <philipp.reisner at linbit.com>
> Cc: Sagi Grimberg <sagi at grimberg.me>
> Cc: Vlastimil Babka <vbabka at suse.com>
> Cc: stable at vger.kernel.org
> ---
> Changelog:
> v3: introduce a more common name sendpage_ok() for the open coded check
> v2: fix typo in patch subject.
> v1: the initial version.
>
> drivers/nvme/host/tcp.c | 13 +++++++++++--
> include/linux/net.h | 2 ++
> 2 files changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 79ef2b8e2b3c..f9952f6d94b9 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -887,8 +887,17 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
> else
> flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
>
> - /* can't zcopy slab pages */
> - if (unlikely(PageSlab(page))) {
> + /*
> + * e.g. XFS meta- & log-data is in slab pages, or bcache meta
> + * data pages, or other high order pages allocated by
> + * __get_free_pages() without __GFP_COMP, which have a page_count
> + * of 0 and/or have PageSlab() set. We cannot use send_page for
> + * those, as that does get_page(); put_page(); and would cause
> + * either a VM_BUG directly, or __page_cache_release a page that
> + * would actually still be referenced by someone, leading to some
> + * obscure delayed Oops somewhere else.
> + */
I was hoping that this comment would move to the helper as well.
Agree with Christoph comment as well.
More information about the Linux-nvme
mailing list