[PATCH 05/17] nvme: wire-up support for async-passthru on char-device.

Kanchan Joshi joshi.k at samsung.com
Wed Mar 16 00:27:27 PDT 2022


On Tue, Mar 15, 2022 at 09:54:10AM +0100, Christoph Hellwig wrote:
>On Mon, Mar 14, 2022 at 09:53:56PM +0530, Kanchan Joshi wrote:
>>>> +struct nvme_uring_cmd_pdu {
>>>> +	u32 meta_len;
>>>> +	union {
>>>> +		struct bio *bio;
>>>> +		struct request *req;
>>>> +	};
>>>> +	void *meta; /* kernel-resident buffer */
>>>> +	void __user *meta_buffer;
>>>> +} __packed;
>>>
>>> Why is this marked __packed?
>> Did not like doing it, but had to.
>> If not packed, this takes 32 bytes of space, while the driver pdu in
>> struct io_uring_cmd can take at most 30 bytes. Packing the nvme pdu
>> brought it down to 28 bytes, which fits and gives 2 bytes back.
>
>What if you move meta_len to the end?  Even if we need the __packed,
>that will avoid all the unaligned accesses to pointers, which on some
>architectures will crash the kernel.
ah, right. Will move that to the end.
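For clarity, the reordered pdu would look like this (a sketch only; the
fields are the same as in this patch, just with meta_len moved last so
the three pointers stay naturally aligned even under __packed):

	struct nvme_uring_cmd_pdu {
		union {
			struct bio *bio;
			struct request *req;
		};
		void *meta; /* kernel-resident buffer */
		void __user *meta_buffer;
		u32 meta_len; /* moved to the end to keep pointers aligned */
	} __packed;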

>> And on moving meta elements outside the driver, my worry is that it
>> reduces the scope of the uring-cmd infra and makes it nvme-passthru
>> specific. At this point uring-cmd is still a generic async ioctl/fsctl
>> facility which may find other users (than nvme-passthru) down the line.
>> Organization of fields within "struct io_uring_cmd" is around the rule
>> that a field is kept out (of the 28-byte pdu) only if it is accessed by
>> both io_uring and the driver.
>
>We have plenty of other interfaces of that kind.  Sockets are one case
>already, and regular read/write with protection information will be
>another one.  So having some core infrastructure for "secondary data"
>seems very useful.
So what is the picture you have in mind for struct io_uring_cmd?
Moving the meta fields out makes it look like this -

@@ -28,7 +28,10 @@ struct io_uring_cmd {
        u32             cmd_op;
        u16             cmd_len;
        u16             unused;
-       u8              pdu[28]; /* available inline for free use */
+       void __user     *meta_buffer; /* nvme pt specific */
+       u32             meta_len; /* nvme pt specific */
+       u8              pdu[16]; /* available inline for free use */
+
 };
And the corresponding 16-byte nvme pdu -
 struct nvme_uring_cmd_pdu {
-       u32 meta_len;
        union {
                struct bio *bio;
                struct request *req;
        };
        void *meta; /* kernel-resident buffer */
-       void __user *meta_buffer;
 } __packed;

I do not understand how this helps. It only reduces the generic space
from 28 bytes down to 16.
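FWIW, whichever layout we settle on, the fit can be enforced at compile
time. A minimal sketch (the helper name and placement here are
illustrative, not lifted from this patch):

	static struct nvme_uring_cmd_pdu *
	nvme_uring_cmd_pdu(struct io_uring_cmd *ioucmd)
	{
		/* fail the build if the driver pdu outgrows the inline space */
		BUILD_BUG_ON(sizeof(struct nvme_uring_cmd_pdu) >
			     sizeof(ioucmd->pdu));
		return (struct nvme_uring_cmd_pdu *)&ioucmd->pdu;
	}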

>> I see, so there is room for adding some efficiency.
>> Hope it will be ok if I carry this out as a separate effort.
>> Since this is about touching blk_mq_complete_request at its heart, and
>> improving sync-direct-io, this does not seem best-fit and slow this
>> series down.
>
>I really rather to this properly.  Especially as the current effort
>adds new exported interfaces.

Seems you are referring to io_uring_cmd_complete_in_task().

We would still need to use/export that even if we somehow manage to move
the task-work trigger from the nvme function to blk_mq_complete_request.
io_uring's task-work infra is more baked than the raw task-work infra,
and it would not be good to repeat all that code elsewhere.
I tried the raw one in the first attempt, and Jens suggested moving to
the baked one. Here is the link that gave birth to this interface -
https://lore.kernel.org/linux-nvme/6d847f4a-65a5-bc62-1d36-52e222e3d142@kernel.dk/
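To make the flow concrete, here is a rough sketch of how the driver side
hands completion off to task context (the callback bodies are
illustrative, not verbatim from the patch):

	static void nvme_uring_task_cb(struct io_uring_cmd *ioucmd)
	{
		/* runs in the submitting task's context: safe to copy the
		 * kernel-resident metadata back to user space and post the CQE */
	}

	static void nvme_uring_cmd_end_io(struct request *req, blk_status_t err)
	{
		struct io_uring_cmd *ioucmd = req->end_io_data;

		/* we are in irq context here; punt the user-visible completion
		 * work to the submitter's task via io_uring's task-work infra */
		io_uring_cmd_complete_in_task(ioucmd, nvme_uring_task_cb);
	}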
 

>> Deferring by ipi or softirq never occurred. Neither for block nor for
>> char. Softirq is obvious since I was not running against scsi (or nvme
>> with a single queue). I could not spot whether this is really an
>> overhead, at least for nvme.
>
>This tends to kick in if you have fewer queues than cpu cores.  Quite
>common with either a high core count or a not very high end nvme
>controller.
I will check that.
But switching (irq to task-work) is a more generic matter, not specific
to this series. Task-work is triggered for regular read/write completion
too (in io_uring), in the same return path involving
blk_mq_complete_request. For passthru, we are just triggering it
somewhat earlier in the completion path.
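For reference, the deferral being discussed happens inside
blk_mq_complete_request(), which is roughly this (a sketch from memory;
details may differ by kernel version):

	void blk_mq_complete_request(struct request *rq)
	{
		/* try to punt the completion to the submitting CPU via
		 * IPI/softirq; this kicks in e.g. when the device has fewer
		 * queues than cores, so completions land on a foreign CPU */
		if (!blk_mq_complete_request_remote(rq))
			rq->q->mq_ops->complete(rq); /* complete inline, in irq */
	}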

