[PATCH 05/17] nvme: wire-up support for async-passthru on char-device.
Kanchan Joshi
joshi.k at samsung.com
Wed Mar 16 00:27:27 PDT 2022
On Tue, Mar 15, 2022 at 09:54:10AM +0100, Christoph Hellwig wrote:
>On Mon, Mar 14, 2022 at 09:53:56PM +0530, Kanchan Joshi wrote:
>>>> +struct nvme_uring_cmd_pdu {
>>>> + u32 meta_len;
>>>> + union {
>>>> + struct bio *bio;
>>>> + struct request *req;
>>>> + };
>>>> + void *meta; /* kernel-resident buffer */
>>>> + void __user *meta_buffer;
>>>> +} __packed;
>>>
>>> Why is this marked __packed?
>> Did not like doing it, but had to.
>> If not packed, this takes 32 bytes of space, while the driver-pdu in struct
>> io_uring_cmd can take at most 30 bytes. Packing the nvme-pdu brought it down
>> to 28 bytes, which fits and gives 2 bytes back.
>
>What if you move meta_len to the end? Even if we need the __packed
>that will avoid all the unaligned access to pointers, which on some
>architectures will crash the kernel.
ah, right. Will move that to the end.
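For reference, a sketch of what the reordered pdu could look like with
meta_len moved to the end, so the three pointers stay naturally aligned
even if __packed is kept (field order per the suggestion above, not the
final code):

struct nvme_uring_cmd_pdu {
	union {
		struct bio *bio;
		struct request *req;
	};
	void *meta;		/* kernel-resident buffer */
	void __user *meta_buffer;
	u32 meta_len;		/* last, to avoid unaligned pointer access */
} __packed;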
>> And on moving meta elements outside the driver, my worry is that it
>> reduces scope of uring-cmd infra and makes it nvme passthru specific.
>> At this point uring-cmd is still generic async ioctl/fsctl facility
>> which may find other users (than nvme-passthru) down the line. Organization
>> of fields within "struct io_uring_cmd" is around the rule
>> that a field is kept out (of the 28-byte pdu) only if it is accessed by both
>> io_uring and driver.
>
>We have plenty of other interfaces of that kind. Sockets are one case
>already, and regular read/write with protection information will be
>another one. So having some core infrastructure for "secondary data"
>seems very useful.
So what is the picture that you have in mind for struct io_uring_cmd?
Moving meta fields out makes it look like this -
@@ -28,7 +28,10 @@ struct io_uring_cmd {
u32 cmd_op;
u16 cmd_len;
u16 unused;
- u8 pdu[28]; /* available inline for free use */
+ void __user *meta_buffer; /* nvme pt specific */
+ u32 meta_len; /* nvme pt specific */
+ u8 pdu[16]; /* available inline for free use */
+
};
And the corresponding 16-byte nvme pdu -
struct nvme_uring_cmd_pdu {
- u32 meta_len;
union {
struct bio *bio;
struct request *req;
};
void *meta; /* kernel-resident buffer */
- void __user *meta_buffer;
} __packed;
I do not understand how this helps. Only the generic space (28 bytes)
got reduced to 16 bytes.
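Whatever split we settle on, the driver pdu still has to fit into the
generic pdu space. A rough sketch of how the driver casts into that
space, with a compile-time check making the size limit explicit (the
BUILD_BUG_ON placement here is illustrative, not taken from the patch):

static inline struct nvme_uring_cmd_pdu *nvme_uring_cmd_pdu(
		struct io_uring_cmd *ioucmd)
{
	/* the driver-private pdu must never outgrow the generic pdu area */
	BUILD_BUG_ON(sizeof(struct nvme_uring_cmd_pdu) > sizeof(ioucmd->pdu));
	return (struct nvme_uring_cmd_pdu *)&ioucmd->pdu;
}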
>> I see, so there is room for adding some efficiency.
>> Hope it will be ok if I carry this out as a separate effort.
>> Since this is about touching blk_mq_complete_request at its heart, and
>> improving sync-direct-io, this does not seem the best fit and would slow
>> this series down.
>
>I'd really rather do this properly. Especially as the current effort
>adds new exported interfaces.
Seems you are referring to io_uring_cmd_complete_in_task().
We would still need to use/export that even if we somehow manage to move
the task-work trigger from the nvme function to blk_mq_complete_request.
io_uring's task-work infra is more baked than raw task-work infra, and it
would not be good to repeat all that code elsewhere.
I tried the raw one in the first attempt, and Jens suggested moving to the
baked one. Here is the link that gave birth to this interface -
https://lore.kernel.org/linux-nvme/6d847f4a-65a5-bc62-1d36-52e222e3d142@kernel.dk/
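To make the dependency concrete, this is roughly how the end_io path in
the current series hands completion work to io_uring's task-work
(approximate signatures, trimmed to the relevant calls):

static void nvme_uring_task_cb(struct io_uring_cmd *ioucmd)
{
	/* runs in task context: safe to copy meta back to user space,
	 * free kernel buffers, and post the final result
	 */
}

static void nvme_uring_cmd_end_io(struct request *req, blk_status_t err)
{
	struct io_uring_cmd *ioucmd = req->end_io_data;

	/* defer the heavy part of completion to the submitting task */
	io_uring_cmd_complete_in_task(ioucmd, nvme_uring_task_cb);
}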
>> Deferring by ipi or softirq never occurred, neither for block nor for
>> char. Softirq is obvious since I was not running against scsi (or nvme with
>> a single queue). I could not spot whether this is really an overhead, at
>> least for nvme.
>
>This tends to kick in if you have fewer queues than cpu cores. Quite
>common with either a high core count or a not very high end nvme
>controller.
I will check that.
But switching (irq to task-work) is more generic and not specific to this series.
Triggering task-work anyway happens for regular read/write
completion too (in io_uring)...in the same return path involving
blk_mq_complete_request. For passthru, we are just triggering this
somewhat earlier in the completion path.
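Conceptually, the deferral you describe kicks in when a completion can
land on a CPU other than the submitting one, e.g. (hypothetical helper,
for illustration only, not actual blk-mq code):

static bool completion_may_be_deferred(struct request_queue *q)
{
	/* fewer hardware queues than online CPUs means completions can
	 * arrive on a foreign CPU and get bounced via IPI/softirq
	 */
	return q->nr_hw_queues < num_online_cpus();
}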