MMC quirks relating to performance/lifetime.

Fri Feb 11 17:33:42 EST 2011

On Wed, Feb 9, 2011 at 3:13 AM, Arnd Bergmann <arnd at arndb.de> wrote:
> On Wednesday 09 February 2011 09:37:40 Linus Walleij wrote:
>> [Quoting verbatim so the original mail hits linux-mmc, this is very
>> interesting!]
>>
>> 2011/2/8 Andrei Warkentin <andreiw at motorola.com>:
>> > Hi,
>> >
>> > I'm not sure if this is the best place to bring this up, but Russell's
>> > name is on a fair share of drivers/mmc code, and there does seem to be
>> > quite a bit of MMC-related discussions. Excuse me in advance if this
>> > isn't the right forum :-).
>> >
>> > Certain MMC vendors (maybe even quite a few of them) use a pretty
>> > rigid buffering scheme when it comes to handling writes. There is
>> > usually a buffer A for random accesses, and a buffer B for sequential
>> > accesses. For certain Toshiba parts, it looks like buffer A is 8KB
>> > wide, with buffer B being 4MB wide, and all accesses larger than 8KB
>> > effectively equating to 4MB accesses. Worse, consecutive small (8k)
>> > writes are treated as one large sequential access, once again ending
>> > up in buffer B, thus necessitating out-of-order writing to work around
>> > this.
>
> It's more complex, but I now have a pretty good understanding of
> what the flash media actually do, after doing a lot of benchmarking.
> Most of my results so far are documented on
>
> https://wiki.linaro.org/WorkingGroups/KernelConsolidation/Projects/FlashCardSurvey
>
> but I still need to write about the more recent discoveries.
>
> What you describe as buffer A is the "page size" of the underlying
> flash. It depends on the size and brand of the NAND flash chip and
> can be anywhere between 2 KB and 16 KB for modern cards, depending
> on how they combine multiple chips and planes within the chips.
>
> What you describe as buffer B is sometimes called an "erase block
> group" or an "allocation unit". This is the smallest unit that
> gets kept in a global lookup table in the medium and can be anywhere
> between 1 MB and 8 MB for cards larger than 4 GB, or as small as
> 128 KB (a single erase block) for smaller media, as far as I have
> seen. When you don't write full aligned allocation units, the
> card will have to eventually do garbage collection on the allocation
> unit, which can take a long time (many milliseconds).
>
> Most cards have a third size, typically somewhere between 32 and 128 KB,
> which is the optimum size for writes. While you can do linear
> writes to the card in page size units (writing an allocation unit
> from start to finish), doing random access within the allocation unit
> will be much faster with larger writes.
>
>> > What this means is decreased life span for the parts, and it also
>> > means a performance impact on small writes, but the first item is much
>> > more crucial, especially for smaller parts.
>> >
>> > As I've mentioned, probably more vendors are affected. How about a
>> > generic MMC_BLOCK quirk that splits the requests (and optionally
>> > reorders) them? The thresholds would then be adjustable as
>> > module/kernel parameters based on manfid. I'm asking because I have a
>> > patch now, but it's ugly and hardcoded against a specific manufacturer.
>
> It's not just MMC specific: USB flash drives, CF cards and even cheap
> PATA or SATA SSDs have the same patterns. I think this will need
> to be solved on a higher level, in the block device elevator code
> and in the file systems.
>
>> There is a quirk API so that specific quirks can be flagged for certain
>> vendors and cards, e.g. some Toshibas in this case. For an example, grep
>> the kernel source for MMC_QUIRK_BLKSZ_FOR_BYTE_MODE.
>>
>> But as Russell says this probably needs to be signalled up to the
>> block layer to be handled properly.
>>
>> Why don't you post the code you have today as an RFC patch?
>> I think many will be interested.
>
> Yes, I agree, that would be good. Also, I'd be interested to see the
> output of 'head /sys/block/mmcblk0/device/*' on that card. I'm guessing
> that the manufacturer ID of 0x0002 is Toshiba, and these are indeed
> the worst cards that I have seen so far, because they cannot do
> random access within an allocation unit, and they cannot alternate
> writes between multiple allocation units (# open AUs linear is "1" in
> my wiki table), while most cards can do at least two.
>
> Andrei, I'm certainly interested in working with you on this.
> The point you brought up about the Toshiba cards being especially
> bad is certainly valid: even if we do something better in the block
> layer, we need to have a way to detect the worst-case scenario,
> so we can work around that.
>
>        Arnd
>

Arnd,

Yes, this is a Toshiba card. I've sent the patch as a reply to Linus' email.
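
As a rough illustration of the split-and-reorder idea quoted above (this is
not the patch, and it is Python rather than kernel C), a minimal sketch might
look like the following. The function name split_write, the 8KB default
threshold, and the choice of reverse order as the "reordering" are all
assumptions made for illustration; whether a given card actually treats
descending writes as non-sequential has not been verified here.

```python
def split_write(start, length, chunk=8 * 1024, reorder=True):
    """Split one write request into pieces no larger than `chunk` bytes.

    `start` and `length` are in bytes. `chunk` stands in for the
    hypothetical per-card threshold (8KB for the card discussed here).
    Returns a list of (offset, size) tuples in the order to issue them.
    """
    if length <= 0 or chunk <= 0:
        raise ValueError("length and chunk must be positive")

    pieces = []
    offset = start
    end = start + length
    while offset < end:
        # Cut at the next chunk-aligned boundary so that no piece
        # straddles two aligned chunk-sized regions.
        boundary = (offset // chunk + 1) * chunk
        piece_end = min(boundary, end)
        pieces.append((offset, piece_end - offset))
        offset = piece_end

    if reorder:
        # Assumption: issuing the pieces back to front is one simple way
        # to keep them from looking like a single sequential stream.
        pieces.reverse()
    return pieces


# A 32KB write at offset 0 becomes four 8KB pieces, issued last to first.
for offset, size in split_write(0, 32 * 1024):
    print(offset, size)
```

In a real quirk the threshold would presumably come from a per-manfid table
or a module parameter, as suggested in the original mail, rather than from a
default argument.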

cid - 02010053454d3332479070cc51451d00
csd - d00f00320f5903ffffffffff92404000
erase_size - 524288
fwrev - 0x0
hwrev - 0x0
manfid - 0x000002
name - SEM32G
oemid - 0x0100
preferred_erase_size - 2097152
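
For anyone cross-checking the values above, the manfid, oemid and name fields
can be recovered from the raw cid string. The sketch below assumes the MMC v4
CID layout (manufacturer ID in bits 127:120, OEM ID in bits 119:104, a
six-character product name in bits 103:56); the function name decode_mmc_cid
is made up for this example, and SD cards use a different layout that this
does not handle.

```python
def decode_mmc_cid(cid_hex):
    """Extract a few fields from a 128-bit MMC CID given as a hex string.

    Assumes the MMC v4 layout: byte 0 is the manufacturer ID, bytes 1-2
    are the OEM ID, and bytes 3-8 are the ASCII product name.
    """
    raw = bytes.fromhex(cid_hex)
    if len(raw) != 16:
        raise ValueError("CID must be 128 bits (32 hex digits)")
    return {
        "manfid": raw[0],
        "oemid": int.from_bytes(raw[1:3], "big"),
        "name": raw[3:9].decode("ascii"),
    }


fields = decode_mmc_cid("02010053454d3332479070cc51451d00")
print(f"manfid={fields['manfid']:#08x} "
      f"oemid={fields['oemid']:#06x} "
      f"name={fields['name']}")
# manfid=0x000002 oemid=0x0100 name=SEM32G
```

The decoded values agree with the separate manfid, oemid and name lines in
the dump. For reference, erase_size 524288 is 512 KB and
preferred_erase_size 2097152 is 2 MB.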