fine-grained PI control

Martin K. Petersen martin.petersen at oracle.com
Tue Jul 9 20:47:58 PDT 2024


Christoph,

> So what are useful APIs we can/should expose?
>
> If we want full portability we can't support all the individual
> checks, because the disk will check it for SCSI even if we don't do
> the extra checks in the controller. We could still expose the individual
> flags, but reuse the combinations SCSI doesn't support on SCSI,
> although that would lead to surprises if people write their software
> and test on NVMe and then move to SCSI. Could we just expose the valid
> SCSI combinations if people are fine with that for now?

I didn't have any actual use for check-this-but-not-that. The rationale
behind having explicit checking flags was my dislike for the fact that
the policy decision about what to check was residing inside the disk
drive and depended on how it was formatted, which flags were wired up in
the EI VPD, etc. I preferred an approach where the OS tells the hardware
exactly what to do.

There are a couple of free bits in *PROTECT so we could conceivably work
with T10 to add the missing pieces. But it would have a pretty long
turnaround, of course, and wouldn't address existing devices.

Also, things are not entirely symmetric wrt. *PROTECT for reads and
writes either. I'll try to wrap my head around it tomorrow.

For the user API I think it would be most sensible to have CHECK_GUARD,
CHECK_APP, CHECK_REF to cover the common DIX/NVMe case.

And then we could have NO_CHECK_DISK and IP_CHECKSUM_CONVERSION to
handle the peculiar SCSI corner cases and document that these are
experimental flags to be used for test purposes only. Not particularly
elegant but I don't have a better idea. Especially since things are
inherently asymmetric with controller-to-target communication being
protected even if you don't attach PI to the bio.

I.e. I think the CHECK_{GUARD,APP,REF} flags should describe how a
DIX or NVMe controller should check the attached bip payload. And
nothing else.

The controller-to-target PI handling is orthogonal and refers to what
happens in the second protection envelope, i.e. the communication
between a DIX controller and a target. This may or may not be the same
PI as in the bip payload. Therefore I think these flags should be
separate.
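
As a rough sketch of that split (all type names, flag names, and bit
values below are hypothetical and only illustrate the idea; none of this
is an existing kernel or userspace API), the two sets of flags could
live in separate fields, with the experimental target-side flags
rejected on anything that isn't SCSI. That last restriction is my
assumption for the example, not a settled decision:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: how a DIX or NVMe controller checks the bip payload. */
enum pi_check_flags {
	PI_CHECK_GUARD = 1u << 0,
	PI_CHECK_APP   = 1u << 1,
	PI_CHECK_REF   = 1u << 2,
};

/* Hypothetical, experimental: controller-to-target (second envelope). */
enum pi_target_flags {
	PI_NO_CHECK_DISK          = 1u << 0,
	PI_IP_CHECKSUM_CONVERSION = 1u << 1,
};

struct pi_request {
	uint32_t check;		/* bitmask of pi_check_flags */
	uint32_t target;	/* bitmask of pi_target_flags */
};

/* Reject unknown bits, and target flags on non-SCSI devices (assumed). */
static bool pi_request_valid(const struct pi_request *rq, bool is_scsi)
{
	const uint32_t check_mask =
		PI_CHECK_GUARD | PI_CHECK_APP | PI_CHECK_REF;
	const uint32_t target_mask =
		PI_NO_CHECK_DISK | PI_IP_CHECKSUM_CONVERSION;

	if (rq->check & ~check_mask)
		return false;
	if (rq->target & ~target_mask)
		return false;
	if (rq->target && !is_scsi)
		return false;
	return true;
}
```

Keeping the fields apart means an application that only sets the
CHECK_* bits behaves the same on NVMe and SCSI, and the SCSI-only
knobs cannot be set by accident.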

I'll mull over it a bit more and revisit all the SCSI wrinkles.

> I'm not currently seeing warnings on SCSI, but that's because my only
> PI testing is scsi_debug which starts out with deallocated blocks.

SCSI says that deallocated blocks have 0xFFFF in the app tag and thus
checking should be disabled on read. And if you subsequently write a
block without providing PI, the drive generates a valid guard and ref
tag (for Type 1). So there should never be a situation where reading a
block returns a PI error unless the block is corrupted. Either the app
tag escape is present or the PI is valid.

SCSI subsequently added some blurriness to permit deviations from this
principle. But the original PI design explicitly ensured that PI was
never accidentally invalid and reads would never fail. Even if you wrote
the drive on a system that didn't know about PI, things would be OK. This
was deliberately done so reading partition tables, etc. wouldn't fail.
In Linux we currently treat Type 2 as Type 1 for pretty much the same
reason: To ensure that the ref tag is always well-defined. I.e. it
contains the lower 32 bits of the LBA.
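
To make the never-fail read rule concrete, here is a minimal sketch of
Type 1 generation and verification for a single block. The struct and
function names are made up for the example, and the zero app tag on
generation is an assumption; the tuple layout (big-endian guard, app
tag, ref tag), the CRC-16/T10-DIF guard (polynomial 0x8BB7, initial
value 0) and the 0xFFFF app tag escape follow the behavior described
above:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* 8-byte PI tuple; all fields are big-endian. */
struct pi_tuple {
	uint8_t guard[2];
	uint8_t app[2];
	uint8_t ref[4];
};

#define PI_APP_ESCAPE 0xFFFFu

/* Bitwise CRC-16/T10-DIF: polynomial 0x8BB7, initial value 0. */
static uint16_t crc_t10dif(const uint8_t *buf, size_t len)
{
	uint16_t crc = 0;

	for (size_t i = 0; i < len; i++) {
		crc ^= (uint16_t)(buf[i] << 8);
		for (int bit = 0; bit < 8; bit++)
			crc = (crc & 0x8000)
				? (uint16_t)((crc << 1) ^ 0x8BB7)
				: (uint16_t)(crc << 1);
	}
	return crc;
}

/* What a drive would generate for a write that carried no PI. */
static void pi_generate_type1(struct pi_tuple *t, const uint8_t *data,
			      size_t len, uint64_t lba)
{
	uint16_t guard = crc_t10dif(data, len);
	uint32_t ref = (uint32_t)lba;	/* lower 32 bits of the LBA */

	t->guard[0] = (uint8_t)(guard >> 8);
	t->guard[1] = (uint8_t)guard;
	t->app[0] = 0;			/* app tag value is an assumption */
	t->app[1] = 0;
	t->ref[0] = (uint8_t)(ref >> 24);
	t->ref[1] = (uint8_t)(ref >> 16);
	t->ref[2] = (uint8_t)(ref >> 8);
	t->ref[3] = (uint8_t)ref;
}

/* Read-side check: escape disables checking, otherwise PI must match. */
static bool pi_verify_type1(const struct pi_tuple *t, const uint8_t *data,
			    size_t len, uint64_t lba)
{
	uint16_t app = (uint16_t)((t->app[0] << 8) | t->app[1]);
	uint16_t guard = (uint16_t)((t->guard[0] << 8) | t->guard[1]);
	uint32_t ref = ((uint32_t)t->ref[0] << 24) |
		       ((uint32_t)t->ref[1] << 16) |
		       ((uint32_t)t->ref[2] << 8) | t->ref[3];

	if (app == PI_APP_ESCAPE)
		return true;		/* deallocated block: skip checks */
	if (guard != crc_t10dif(data, len))
		return false;
	return ref == (uint32_t)lba;
}
```

With this, a never-written block (all 0xFF tuple) and a block written
without host-supplied PI both verify, and only real corruption or a
misplaced block trips the check.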

The intent when we defined E2EDP in NVMe was to match this never-fail
SCSI behavior. So I'm puzzled as to why you see errors.

I'll try to connect my NVMe test box tomorrow. It's been offline after a
rack move. Would like to understand what's going on. Are we not setting
ILBRT/EILBRT appropriately?

-- 
Martin K. Petersen	Oracle Linux Engineering


