[PATCH v2 1/1] block: nvme-core: Scatter gather list support in the NVMe Block Driver

Matthew Wilcox willy at linux.intel.com
Tue Dec 17 10:08:23 EST 2013


On Tue, Dec 17, 2013 at 05:55:14PM +0530, Rajiv Shanmugam Madeswaran wrote:
> Hi Matthew, thanks for your suggestion. I have made the following changes in the block driver.
> 
> 1. As far as SGL is concerned, I could see a slight increase in IOPS when compared to PRP. For example, consider a scatter-
>    gather list pointing to a segment which is contiguous and a multiple of 4k in size.
>    With PRP it may take more than one entry for an individual IO transfer, as each PRP entry points to at most 4k.
>    With SGL the segment can be described by a single data block descriptor. In that case the processing logic of SGL becomes
>    simpler than with PRP. But this kind of entry from the block layer is rare, so SGL can perform well in such scenarios.
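
The descriptor-count part of that claim is just arithmetic.  A sketch,
with made-up helper names, assuming 4k pages as the current driver does:

	/* Illustrative only, not from the patch: descriptors needed to
	 * describe one physically contiguous buffer of 'len' bytes
	 * starting at 'offset' within a 4k page. */
	static unsigned prp_entries_needed(unsigned offset, unsigned len)
	{
		/* one PRP entry per 4k page the buffer touches */
		return DIV_ROUND_UP(offset + len, 4096);
	}

	static unsigned sgl_descs_needed(unsigned offset, unsigned len)
	{
		/* a single data block descriptor covers the whole range */
		return 1;
	}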

You're speculating, or you've measured?  Processing an SGL is more
complex for the device than processing a PRP list.  I would expect a
device to be able to process more IOPS using PRPs than SGLs.

I can see SGLs being faster for:

 - The case where we would currently send more than one command (ie we hit
   the nvme_split_and_submit() case in nvme_map_bio(), which is the case
   when a buffer is not virtually contiguous)
 - The case where we can describe the user buffer using a single SGL but would
   require more than 2 PRPs (so physically contiguous, but crosses more than
   one page boundary; see the descriptor sketch below).  In this case, we can
   describe everything with a single command and we don't need a PRP list.
 - There is probably a case when the SGL is significantly shorter than
   the equivalent PRP list.  That's going to vary depending on the drive.
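
To make the second case concrete: a single data block descriptor covers
a physically contiguous buffer in one 16-byte entry, where PRPs need one
pointer per 4k page.  Sketched from the NVMe 1.1 spec (the field names
here are mine, not the driver's):

	/* 16-byte SGL descriptor; a Data Block descriptor has type 0 in
	 * the high nibble of the last byte. */
	struct sgl_data_block_desc {
		__le64	addr;		/* start of the data block */
		__le32	length;		/* length in bytes */
		__u8	rsvd[3];
		__u8	type;		/* descriptor type / sub type */
	};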

Really though, it needs someone with some hardware to measure before I'm
going to take a patch to enable SGLs.  And I'd dearly love to see macro
benchmarks (eg a kernel compile or a MySQL run) more than "I did this dd
operation and ..."

> 2. I have modified the driver to work with multiple controllers by adding the following members to the nvme_dev structure.
> 
> + /* The following function pointers select the type of data transfer
> +  * based on controller support: it may be either SGL or PRP.
> +  * These function pointers are assigned during initialization.
> +  */
> +
> +  struct nvme_iod *
> +  (*alloc_iod)(unsigned nseg, unsigned nbytes, gfp_t gfp);
> +
> +  int (*setup_xfer)(struct nvme_dev *dev, struct nvme_common_command *cmd,
> +                    struct nvme_iod *iod, int total_len, gfp_t gfp);
> +
> +  void (*release_iod)(struct nvme_dev *dev, struct nvme_iod *iod);
> 
>    So, depending on the driver instance, the corresponding mode of transfer will be used during IO.
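
For context, this replaces the direct nvme_setup_prps() call in the I/O
submission path with an indirect one; my sketch of the call site, not
code from the patch:

	length = dev->setup_xfer(dev, &cmnd->common, iod, length, GFP_ATOMIC);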

As I've said, I think this decision needs to be taken on a per-command
basis, not a per-device basis.
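
Something along these lines, purely as a hypothetical sketch (the
helper name and the threshold are made up; iod->nents and iod->length
come from the existing struct nvme_iod, NVME_CTRL_SUPP_SGL() from your
patch):

	static bool nvme_use_sgl_for_cmd(struct nvme_dev *dev,
					 struct nvme_iod *iod)
	{
		/* rough count of PRP entries the transfer would need */
		unsigned nprps = DIV_ROUND_UP(iod->length, PAGE_SIZE);

		if (!NVME_CTRL_SUPP_SGL(dev->sgls))
			return false;

		/* only prefer SGLs when they describe the transfer in
		 * fewer descriptors than the equivalent PRP list */
		return iod->nents < nprps;
	}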

> +	dev->sgls = le32_to_cpup(&ctrl->sgls);
> +
> +	if (use_sgl) {
> +		if (NVME_CTRL_SUPP_SGL(dev->sgls)) {
> +			dev->alloc_iod = nvme_alloc_iod_sgl;
> +			dev->setup_xfer = nvme_setup_sgls;
> +			dev->release_iod = nvme_free_iod_sgl;
> +		} else
> +			goto enable_prp;
> +	} else {
> + enable_prp:
> +		dev->alloc_iod = nvme_alloc_iod;
> +		dev->setup_xfer = nvme_setup_prps;
> +		dev->release_iod = nvme_free_iod;
> +	}

Why not simply:

+	if (use_sgl && NVME_CTRL_SUPP_SGL(dev->sgls)) {
+		dev->alloc_iod = nvme_alloc_iod_sgl;
...
+	} else {
...
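
i.e., spelling it out with the same assignments your patch already has,
and no goto into the else branch:

	if (use_sgl && NVME_CTRL_SUPP_SGL(dev->sgls)) {
		dev->alloc_iod = nvme_alloc_iod_sgl;
		dev->setup_xfer = nvme_setup_sgls;
		dev->release_iod = nvme_free_iod_sgl;
	} else {
		dev->alloc_iod = nvme_alloc_iod;
		dev->setup_xfer = nvme_setup_prps;
		dev->release_iod = nvme_free_iod;
	}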

> @@ -68,7 +68,9 @@ struct nvme_id_ctrl {
>  	__u8			vwc;
>  	__le16			awun;
>  	__le16			awupf;
> -	__u8			rsvd530[1518];
> +	__u8			rsvd530[6];
> +	__le32			sgls;
> +	__u8			rsvd540[1518];

That can't be right.  Maybe it's 1508 bytes long?

>  	struct nvme_id_power_state	psd[32];
>  	__u8			vs[1024];
>  };
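
For reference, the arithmetic, assuming the SGLS field sits at bytes
536-539 of the Identify Controller data structure as in NVMe 1.1:
rsvd530[1518] covered bytes 530 through 2047, and carving out 6 reserved
bytes plus the 4-byte sgls field leaves 2048 - 540 = 1508 bytes before
psd[].  Something like:

	__le16			awupf;		/* bytes 528-529 */
	__u8			rsvd530[6];	/* bytes 530-535 */
	__le32			sgls;		/* bytes 536-539 */
	__u8			rsvd540[1508];	/* bytes 540-2047 */
	struct nvme_id_power_state	psd[32];	/* from byte 2048 */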


