[PATCH] mmc: failure of block read wait for long time

Mon Sep 20 09:09:15 EDT 2010

On 20/09/10 15:37, ext Ghorai, Sukumar wrote:
>
>
>> -----Original Message-----
>> From: Adrian Hunter [mailto:adrian.hunter at nokia.com]
>> Sent: Monday, September 20, 2010 5:20 PM
>> To: Ghorai, Sukumar
>> Cc: linux-mmc at vger.kernel.org; linux-arm-kernel at lists.infradead.org;
>> Adrian Hunter
>> Subject: Re: [PATCH] mmc: failure of block read wait for long time
>>
>> On 20/09/10 11:57, Ghorai, Sukumar wrote:
>>> Adrian,
>>>
>>>> -----Original Message-----
>>>> From: Adrian Hunter [mailto:adrian.hunter at nokia.com]
>>>> Sent: Monday, September 20, 2010 1:24 PM
>>>> To: Ghorai, Sukumar
>>>> Cc: linux-mmc at vger.kernel.org; linux-arm-kernel at lists.infradead.org;
>>>> Adrian Hunter
>>>> Subject: Re: [PATCH] mmc: failure of block read wait for long time
>>>>
>>>> On 14/09/10 08:15, ext Ghorai, Sukumar wrote:
>>>>> Adrian,
>>>>>
>>>>> [..snip..]
>>>>>>>>> [Ghorai] Adrian,
>>>>>>>>> Yes this works and reduced the retry by 1/4 (2048 to 512 times for
>>>> 1MB
>>>>>>>> data read) form the original code;
>>>>>>>>> Initially it was retrying for each page(512 bytes) after multi-
>> block
>>>>>>>> read fail; but this solution is retying for each segment(2048
>> bytes);
>>>>>>>>> 1. Now say block layrer reading 1MB and failed for the 1st segment.
>>>> So
>>>>>>>> it will still retry for 1MB/2048-bytes, i.e. 512 times.
>>>>>>>>> 2. So do you think any good reason to retry again and again?
>>>>>>>> If you have 1MB that is not readable, it sounds like the card is
>>>> broken.
>>>>>>>> Why are so many reads failing?  Has the card been removed?
>>>>>>>>
>>>>>>>> You might very rarely see ECC errors in a small number of sectors,
>>>>>>>> but more than that sounds like something else is broken.
>>>>>>>
>>>>>>> [Ghorai] yes, one example is we remove the card when reading data,
>>>>>>
>>>>>> Well, that is a different case.  Once the card has gone, the block
>>>> driver
>>>>>> can (and will once the remove method is called) error out all I/O
>>>>>> requests without sending them to MMC.  That doesn't happen until
>> there
>>>>>> is a card detect interrupt and a resulting rescan.
>>>>>
>>>>> [Ghorai] here we are discussing two problem,
>>>>> 1. If IO failed how to stop retry; because of -
>>>>> 	a. internal card error
>>>>> 	b. issue in Filesystem, driver, or host controller issue
>>>>> 	c. or cards removed.
>>>>>
>>>>> 2. And 2nd how to sync block-layer IO, if card removed,
>>>>> 	a. case 1: when card removed interrupt support by the platform
>>>>> 	b. case 2: when card removed interrupt does not support by the
>>>> platform?
>>>>>
>>>>>>
>>>>>> A possible solution is to put a flag on mmc_card to indicate
>> card_gone
>>>>>> that gets set as soon as the drivers card detect interrupt shows
>> there
>>>>>> is no card (hoping that we are not getting any bouncing on card
>> detect)
>>>>>> and then have mmc_wait_for_req() simple return -ENODEV immediately if
>>>>>> the card_gone flag is set.  Finally, if the mmc block driver sees
>>>>>> a -ENODEV error, it should also check the card_gone flag (via a new
>>>>>> core function) and if the card is gone, do not retry - and perhaps
>>>>>> even error out the rest of the I/O request queue as well.
>>>>>
>>>>> [Ghorai] your idea address the 2.a case, but not 2.b, 1.a, 1.b
>>>>
>>>> The card removal case can be extended to use the bus ops detect method
>>>> when there is no card detect irq.  I will send a RFC patch.
>>>>
>>>> With respect to 1.a:
>>>>     - If the card has an internal error, then it is broken.  The user
>>>>     should remove the card and use a better one.  I do not see how
>> reducing
>>>>     retry delays really helps the user very much.  Arguably if the card
>>>>     becomes unresponsive, the MMC core could provide a facility to
>>>>     reinitialise the card, but that is yet another issue.
>>>>
>>>> With respect to 1.b:
>>>>     - The file system cannot cause the block driver to have I/O errors.
>>>>     - If there are errors in the driver they should be fixed.
>>>>     - If there are hardware problems with the host controller, then
>>>>     it is up to the host controller driver to deal with them e.g.
>>>>     by resetting the controller.  I don't see what this has to do with
>>>>     the block driver.
>>>>
>>>> You leave out the important case of ECC errors.  I am concerned about
>>>> this because of the possibility that it happens inside a file system
>>>> journal e.g. EXT4 journal.  Perhaps the journal may be recovered if the
>>>> error only affects the last transaction, but perhaps not if it destroys
>>>> other transactions - which could happen if the approach you suggest
>>>> is taken.
>>>>
>>> [Ghorai] Thanks lot for your descriptive answer.
>>> 1. Can you answer this? 2.b. case 2: when card removed interrupt does
>> not support by the platform?
>>
>> As I wrote above: The card removal case can be extended to use the bus ops
>> detect method when there is no card detect irq.  I will send a RFC patch.
>>
>>>
>>> 2. Why block layer handling for inter-leave data? Can you give example
>> diver who is returning interleave data? And how to tell application that
>> buffer having interleave data?
>>
>> I am not sure what you mean by interleave data, but file systems  for
>> example
>> are free to map any block to any file, directory or file system object,
>> so a consecutive series of sectors may contain unrelated data.  Up to a
>> maximum
>> size, the block layer merges I/O requests when the sectors are consecutive,
>> so an I/O request can also contain unrelated data.
>
> [Ghorai]
> 1. I don't think so, FS know where data exists and where is the free space. Except oth cluster.

I was not talking about free space.  I was giving an example
of why it is not possible to assume anything about what is in
an I/O request.

>
> 2. Where its mentioned in block media that for segment-x[i],x[j] data read fail out of all all requested segments form [1..n].
> And I never gone through any driver/protocol, that retry the next i+1th segment where ith-segment is failed. And for that my suggestion is preferred.

The SD/MMC protocol does not indicate which sector has the error.
There is no possibility of trying the ith+1-segment because i is
unknown.

>
>>
>>>
>>>>>
>>>>> And the solution I was proposing to return the status of IO failure as
>>>> soon as possible to above layer; and handle the card removed interrupt
>>>> separately or any other issue in h/w or s/w or combination of both. Or
>>>> just think again when platform don't have the card remove interrupt.
>>>>>
>>>>> So my patch addresses the 1st part
>>>>
>>>> It is absolutely unacceptable to return I/O errors to the upper layers
>>>> for segments that do not have errors.
>>>>
>>>>> And for the 2nd part we can submit the patch anytime.
>>>>>
>>>>>>
>>>>>> I can suggest a patch if you want but I am on vacation next week so
>>>>>> it will have to wait a couple of weeks.
>>>>>>
>>>>>>> And moreover we should not give the interleave data to apps, as we
>>>> don't
>>>>>> have option to tell application for the valid data.
>>>>>>>
>>>>> [..snip..]
>>>>> http://comments.gmane.org/gmane.linux.kernel.mmc/2714
>>>>>
>>>>>>>
>>>>>
>>>>>
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-mmc" in
>>> the body of a message to majordomo at vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>
>