[PATCH 2/2] nvme: add emulation for zone-append
Javier Gonzalez
javier at javigon.com
Thu Aug 20 04:14:43 EDT 2020
On 20.08.2020 13:07, Kanchan Joshi wrote:
>On Thu, Aug 20, 2020 at 3:22 AM Keith Busch <kbusch at kernel.org> wrote:
>>
>> On Wed, Aug 19, 2020 at 01:11:58PM -0600, David Fugate wrote:
>> > Intel does not support making *optional* NVMe spec features *required*
>> > by the NVMe driver.
>>
>> This is inaccurate. As another example, the spec optionally allows a
>> zone size to be a power of 2, but Linux requires a power of 2 if you
>> want to use it with this driver.
>>
>> > Provided there's no glaring technical issues
>>
>> There are many. Some may be improved through a serious review process,
>> but the mess it makes out of the fast path is measurably harmful to
>> devices that don't subscribe to this. That issue is not so easily
>> remedied.
>>
>> Since this patch is a copy of the SCSI implementation, the reasonable
>> thing is to take this fight to the generic block layer for a common
>> solution. We're not taking this in the nvme driver.
>
>I sincerely want to minimize any adverse impact on the fast path of
>non-zoned devices.
>My understanding of that aspect is as follows (I feel it is good to
>confirm this, irrespective of the future of this patch):
>
>1. Submission path:
>There is no extra code in the non-zoned device IO path. For append,
>there is only the ns->append_emulate check.
>Snippet -
> case REQ_OP_ZONE_APPEND:
>-	ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_zone_append);
>+	if (!nvme_is_append_emulated(ns))
>+		ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_zone_append);
>+	else {
>+		/* prepare append like write, and adjust lba afterwards */
>
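For concreteness, a minimal sketch of what the full submission-side
branch could look like, assuming the nvme_is_append_emulated() helper
from the patch; the nvme_append_to_write() name below is only a
placeholder for whatever the patch actually calls it:

	case REQ_OP_ZONE_APPEND:
		if (!nvme_is_append_emulated(ns)) {
			/* native zone append: pass the command through */
			ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_zone_append);
		} else {
			/*
			 * Emulated append: prepare it like a regular write at
			 * the cached write pointer; the resulting LBA is
			 * reported back at completion time.
			 */
			ret = nvme_append_to_write(ns, req, cmd);
		}
		break;
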
>2. Completion path:
> Not as clean as submission, for sure.
>The extra check is "ns && ns->append_emulate == 1" on completions if
>CONFIG_BLK_DEV_ZONED is enabled.
>A slightly better completion-handling version (than the submitted patch) is -
>
>-	} else if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
>-		   req_op(req) == REQ_OP_ZONE_APPEND) {
>-		req->__sector = nvme_lba_to_sect(req->q->queuedata,
>-				le64_to_cpu(nvme_req(req)->result.u64));
>+	} else if (IS_ENABLED(CONFIG_BLK_DEV_ZONED)) {
>+		struct nvme_ns *ns = req->q->queuedata;
>+		/* append-emulation requires wp update for some cmds */
>+		if (ns && nvme_is_append_emulated(ns)) {
>+			if (nvme_need_zone_wp_update(req))
>+				nvme_zone_wp_update(ns, req, status);
>+		} else if (req_op(req) == REQ_OP_ZONE_APPEND)
>+			req->__sector = nvme_lba_to_sect(ns,
>+					le64_to_cpu(nvme_req(req)->result.u64));
>
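The wp-update predicate only has to catch the commands that move or
reset a write pointer, so nvme_need_zone_wp_update() can stay small.
A sketch, assuming it mirrors sd_zbc_need_zone_wp_update() from the
SCSI emulation:

	static bool nvme_need_zone_wp_update(struct request *req)
	{
		switch (req_op(req)) {
		case REQ_OP_ZONE_APPEND:	/* emulated append advances the wp */
		case REQ_OP_ZONE_FINISH:	/* wp jumps to the zone end */
		case REQ_OP_ZONE_RESET:		/* wp returns to the zone start */
		case REQ_OP_ZONE_RESET_ALL:
			return true;
		case REQ_OP_WRITE:
		case REQ_OP_WRITE_ZEROES:
			/* only writes into sequential zones move the wp */
			return blk_rq_zone_is_seq(req);
		default:
			return false;
		}
	}
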
>Am I missing any other fast-path pain-points?
>
>A quick 1-minute 4K randwrite run (QD 64, 4 jobs, libaio) shows:
>before: IOPS=270k, BW=1056MiB/s (1107MB/s)(61.9GiB/60002msec)
>after: IOPS=270k, BW=1055MiB/s (1106MB/s)(61.8GiB/60005msec)
It is good to use the QEMU "simulation" path that we implemented to test
performance with different delays, etc., but for these numbers to make
sense we need to compare them against the simulated NAND speed, etc.
>
>This may not be the best test to show the cost, and I am happy to
>conduct more tests and improve on them.
>
>As for the volume of the code - it is the same as the SCSI emulation, and
>I can make an effort to reduce it by moving the common code into
>blk-zoned and reusing it in the SCSI/NVMe emulation.
>In the patch I tried to isolate the append emulation by keeping
>everything in "nvme_za_emul". It contains nothing NVMe-specific except
>nvme_ns, which can be turned into "driver_data".
>
>+#ifdef CONFIG_BLK_DEV_ZONED
>+struct nvme_za_emul {
>+	unsigned int		nr_zones;
>+	spinlock_t		zones_wp_offset_lock;
>+	u32			*zones_wp_offset;
>+	u32			*rev_wp_offset;
>+	struct work_struct	zone_wp_offset_work;
>+	char			*zone_wp_update_buf;
>+	struct mutex		rev_mutex;
>+	struct nvme_ns		*ns;
>+};
>+#endif
>
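If this moved into the block layer as proposed, the same state could be
expressed with no NVMe types at all. A sketch (the struct name and the
driver_data field are hypothetical, following the suggestion above):

	/* hypothetical, driver-agnostic version living in blk-zoned.c */
	struct blk_zone_append_emul {
		unsigned int		nr_zones;
		spinlock_t		zones_wp_offset_lock;
		u32			*zones_wp_offset;
		u32			*rev_wp_offset;
		struct work_struct	zone_wp_offset_work;
		char			*zone_wp_update_buf;
		struct mutex		rev_mutex;
		void			*driver_data;	/* was struct nvme_ns *ns */
	};
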
>Would that be an acceptable line of further work/discussion?
I know we spent time enabling this path, but I don't think that moving
the discussion to the block layer will have much more benefit.
Let's keep the support for these non-append devices in xNVMe and focus
on the support for the append-enabled ones in Linux. We have a lot of
good stuff in the backlog that we can start pushing.