[PATCH 6/6] nvme: Add consistency check for zone count
Damien Le Moal
Damien.LeMoal at wdc.com
Fri Jun 26 03:42:04 EDT 2020
On 2020/06/26 16:29, Javier González wrote:
> On 26.06.2020 07:09, Damien Le Moal wrote:
>> On 2020/06/26 15:55, Javier González wrote:
>>> On 26.06.2020 06:49, Damien Le Moal wrote:
>>>> On 2020/06/26 15:13, Javier González wrote:
>>>>> On 26.06.2020 00:04, Damien Le Moal wrote:
>>>>>> On 2020/06/26 6:49, Keith Busch wrote:
>>>>>>> On Thu, Jun 25, 2020 at 02:21:52PM +0200, Javier González wrote:
>>>>>>>> drivers/nvme/host/zns.c | 7 +++++++
>>>>>>>> 1 file changed, 7 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
>>>>>>>> index 7d8381fe7665..de806788a184 100644
>>>>>>>> --- a/drivers/nvme/host/zns.c
>>>>>>>> +++ b/drivers/nvme/host/zns.c
>>>>>>>> @@ -234,6 +234,13 @@ static int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
>>>>>>>> sector += ns->zsze * nz;
>>>>>>>> }
>>>>>>>>
>>>>>>>> + if (nr_zones < 0 && zone_idx != ns->nr_zones) {
>>>>>>>> + dev_err(ns->ctrl->device, "inconsistent zone count %u/%u\n",
>>>>>>>> + zone_idx, ns->nr_zones);
>>>>>>>> + ret = -EINVAL;
>>>>>>>> + goto out_free;
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> ret = zone_idx;
>>>>>>>
>>>>>>> nr_zones is unsigned, so it's never < 0.
>>>>>>>
>>>>>>> The API we're providing doesn't require zone_idx equal the namespace's
>>>>>>> nr_zones at the end, though. A subset of the total number of zones can
>>>>>>> be requested here.
>>>>>>>
>>>>>
>>>>> I did see nr_zones coming with -1; guess it is my compiler.
>>>>
>>>> See include/linux/blkdev.h. -1 is:
>>>>
>>>> #define BLK_ALL_ZONES ((unsigned int)-1)
>>>>
>>>> Which is documented in block/blk-zoned.c:
>>>>
>>>> /**
>>>> * blkdev_report_zones - Get zones information
>>>> * @bdev: Target block device
>>>> * @sector: Sector from which to report zones
>>>> * @nr_zones: Maximum number of zones to report
>>>> * @cb: Callback function called for each reported zone
>>>> * @data: Private data for the callback
>>>> *
>>>> * Description:
>>>> * Get zone information starting from the zone containing @sector for at most
>>>> * @nr_zones, and call @cb for each zone reported by the device.
>>>> * To report all zones in a device starting from @sector, the BLK_ALL_ZONES
>>>> * constant can be passed to @nr_zones.
>>>> * Returns the number of zones reported by the device, or a negative errno
>>>> * value in case of failure.
>>>> *
>>>> * Note: The caller must use memalloc_noXX_save/restore() calls to control
>>>> * memory allocations done within this function.
>>>> */
>>>> int blkdev_report_zones(struct block_device *bdev, sector_t sector,
>>>> unsigned int nr_zones, report_zones_cb cb, void *data)
>>>>
>>>>>
>>>>>>
>>>>>> Yes, absolutely. zone_idx is not an absolute zone number. It is the index of the
>>>>>> reported zone descriptor in the current report range requested by the user,
>>>>>> which is not necessarily for the entire drive (i.e., provided nr zones is less
>>>>>> than the total number of zones of the disk and/or start sector is > 0). So
>>>>>> zone_idx indicates the actual number of zones reported, it is not the total
>>>>>
>>>>> I see. As I can see, when nr_zones comes undefined I believed we could
>>>>> assume that zone_idx is absolute, but I can be wrong.
>>>>
>>>> No. zone_idx is *always* the index of the zone in the current report. Whatever
>>>> that report is, regardless of the report starting point and number of zones
>>>> requested. E.g. For a single zone report (nr_zones = 1), you will always see
>>>> zone_idx = 0. For a full report, zone_idx will correspond to the zone number.
>>>> This is used for example in blk_revalidate_disk_zones() to initialize the zone
>>>> bitmaps.
>>>>
>>>>> Does it make sense to support this check with an additional counter and
>>>>> a explicit nr_zones initialization when undefined or you
>>>>> prefer to just remove it as Matias suggested?
>>>>
>>>> The check is not needed at all.
>>>>
>>>> If the device is buggy and reports more zones than the device capacity or any
>>>> other bugs, the driver can catch that when it processes the report.
>>>> blk_revalidate_disk_zones() also has many checks.
>>>
>>> I have managed to create a QEMU ZNS device that gave me a headache with
>>> a little bit of extra capacity that triggered an additional zone report.
>>> This was the motivation for the patch.
>>
>> The device emulation sound buggy... If the capacity is wrong, then the report
>> will be too since zones are all supposed to be sequential (no holes between
>> zones) and up to the disk capacity only (last zone start + len = capacity + 1)
>>
>> If one or the other is wrong, this should be easy to detect. Normally,
>> blk_revalidate_disk_zones() should be able to catch that.
>
> We have the capability to select the reported device capacity manually
> for a number of reasons. One of the different test configurations in our
> CI did go through.
If you change the drive capacity on the fly (e.g. with a low level format ?),
you must revalidate the disk/drive to get the changed capacity. A lot of things
will break otherwise This is not just report zones that will be incorrect.
>
> But it is OK, I will remove the check on V2.
>
> Javier
>
--
Damien Le Moal
Western Digital Research
More information about the Linux-nvme
mailing list