[PATCH v5 4/5] sd: limit to use write life hints

Kanchan Joshi joshi.k at samsung.com
Mon Sep 16 06:49:21 PDT 2024


On 9/13/2024 1:36 PM, Christoph Hellwig wrote:
> On Thu, Sep 12, 2024 at 10:01:00PM +0530, Kanchan Joshi wrote:
>> Please see the response in patch #1. My worries were:
>> (a) adding a new field and propagating it across the stack will cause
>> code duplication.
>> (b) to add a new field we need to carve space within inode, bio and
>> request.
>> We had a hole in request, but it is set to vanish after ongoing
>> integrity refactoring patch of Keith [1]. For inode also, there is no
>> liberty at this point [2].
>>
>> I think current multiplexing approach is similar to ioprio where
>> multiple io priority classes/values are expressed within an int type.
>> And few kernel components choose to interpret certain ioprio values at will.
>>
>> And all this is still in-kernel details. Which can be changed if/when
>> other factors start helping.
> 
> Maybe part of the problem is that the API is very confusing.  A smal
> part of that is of course that the existing temperature hints already
> have some issues, but this seems to be taking them make it significantly
> worse.

Can you explain what part is confusing. This is a simple API that takes 
type/value pair. Two types (and respective values) are clearly defined 
currently, and more can be added in future.

> Note: this tries to include highlevel comments from the discussion of
> the previous patches instead of splitting them over multiple threads.
> 
> F_{S,G}ET_RW_HINT works on arbitrary file descriptors with absolutely no
> check for support by the device or file system and not check for the
> file type.  That's not exactly good API design, but not really a major
> because they are clearly designed as hints with a fixed number of
> values, allowing the implementation to map them if not enough are
> supported.
> 
> But if we increase this to a variable number of hints that don't have
> any meaning (and even if that is just the rough order of the temperature
> hints assigned to them), that doesn't really work.  We'll need an API
> to check if these stream hints are supported and how many of them,
> otherwise the applications can't make any sensible use of them.

- Since writes are backward compatible, nothing bad happens if the 
passed placement-hint value is not supported. Maybe desired outcome (in 
terms of WAF reduction) may not come but that's not a kernel problem 
anyway. It's rather about how well application is segregating and how 
well device is doing its job.

- Device is perfectly happy to work with numbers (0 to 256 in current 
spec) to produce some value (i.e., WAF reduction). Any extra 
semantics/abstraction on these numbers only adds to the work without 
increasing that value. If any application needs that, it's free to 
attach any meaning/semantics to these numbers.

Extra abstraction has already been done with temperature-hint (over 
multi-stream numbers). If that's useful somehow, we should consider 
going back to using those (v3)? But if we are doing a new placement 
hint, it's better to use plain numbers without any semantics. That will 
be (a) more scalable, (b) be closer to what device can readily accept, 
(c) justify why placement should be a different hint-type, and (d) help 
Kernel because it has to do less (no intermediate mapping/transformation 
etc).

IMHO sticking to the existing hint model and doing less (in terms of 
abstraction, reporting and stuff) in kernel maybe a better path.

> If these aren't just stream hints of the file system but you actually
> want them as an abstract API for FDP you'll also need to actually
> expose even more information like the reclaim unit size, but let's
> ignore that for this part of the discssion.
> 
> Back the the API: the existing lifetime hints have basically three
> layers:
> 
>   1) syscall ABI
>   2) the hint stored in the inode
>   3) the hint passed in the bio
> 
> 1) is very much fixed for the temperature API, we just need to think if
>     we want to support it at the same time as a more general hints API.
>     Or if we can map one into another.  Or if we can't support them at
>     the same time how that is communicated.
> 
> For 2) and 3) we can use an actual union if we decide to not support
> both at the same time, keyed off a flag outside the field, but if not
> we simply need space for both.
> 

Right, if there were space, we probably would have kept both.
But particularly for these two types (temperature and placement) it's 
probably fine if one overwrites the another. This is not automatic and 
will happen only at the behest of user. And that's something we can 
clearly document in the man page of the new fcntl. Hope that sounds fine?



More information about the Linux-nvme mailing list