[PATCH] mtd: nand: raw: qcom_nandc: Don't clear_bam_transaction on READID

Fri Mar 11 13:22:51 PST 2022

On 24.02.2022 08:33, Sricharan Ramabadhran wrote:
> Hi Konrad,
> 
> On 2/8/2022 10:15 PM, Konrad Dybcio wrote:
>>
>> On 4.02.2022 18:17, Sricharan Ramabadhran wrote:
>>> On 2/2/2022 12:54 PM, Sricharan Ramabadhran wrote:
>>>> Hi Konrad/Miquel,
>>>>
>>>> On 2/1/2022 9:21 PM, Konrad Dybcio wrote:
>>>>> On 01/02/2022 14:52, Miquel Raynal wrote:
>>>>>> Hi Konrad,
>>>>>>
>>>>>> konrad.dybcio at somainline.org wrote on Mon, 31 Jan 2022 20:54:12 +0100:
>>>>>>
>>>>>>> On 31/01/2022 15:13, Sricharan Ramabadhran wrote:
>>>>>>>> Hi Konrad,
>>>>>>>>
>>>>>>>> On 1/31/2022 3:39 PM, Konrad Dybcio wrote:
>>>>>>>>> On 28/01/2022 18:50, Sricharan Ramabadhran wrote:
>>>>>>>>>> Hi Konrad,
>>>>>>>>>>
>>>>>>>>>> On 1/28/2022 9:55 AM, Sricharan Ramabadhran wrote:
>>>>>>>>>>> Hi Miquel,
>>>>>>>>>>>
>>>>>>>>>>> On 1/26/2022 4:12 PM, Miquel Raynal wrote:
>>>>>>>>>>>> Hi Mani,
>>>>>>>>>>>>
>>>>>>>>>>>> mani at kernel.org wrote on Wed, 26 Jan 2022 16:03:16 +0530:
>>>>>>>>>>>>> On Wed, Jan 26, 2022 at 11:16:13AM +0100, Miquel Raynal wrote:
>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> miquel.raynal at bootlin.com wrote on Fri, 14 Jan 2022 08:27:18 +0100:
>>>>>>>>>>>>>>> Hi Konrad,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> konrad.dybcio at somainline.org wrote on Thu, 13 Jan 2022 19:44:26 +0100:
>>>>>>>>>>>>>>>> While I have absolutely 0 idea why and how, running clear_bam_transaction
>>>>>>>>>>>>>>>> when READID is issued makes the DMA totally clog up and refuse to function
>>>>>>>>>>>>>>>> at all on mdm9607. In fact, it is so bad that all the data gets garbled
>>>>>>>>>>>>>>>> and after a short while in the nand probe flow, the CPU decides that
>>>>>>>>>>>>>>>> seppuku is the only option.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Removing _READID from the if condition makes it work like a charm, I can
>>>>>>>>>>>>>>>> read data and mount partitions without a problem.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Signed-off-by: Konrad Dybcio <konrad.dybcio at somainline.org>
>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>> This is totally just an observation which took me an inhumane amount of
>>>>>>>>>>>>>>>> debug prints to find.. perhaps there's a better reason behind this, but
>>>>>>>>>>>>>>>> I can't seem to find any answers.. Therefore, this is a BIG RFC!
>>>>>>>>>>>>>>> I'm adding two people from codeaurora who worked a lot on this driver.
>>>>>>>>>>>>>>> Hopefully they will have an idea :)
>>>>>>>>>>>>>> Sadre, I've spent a significant amount of time reviewing your patches,
>>>>>>>>>>>>>> now it's your turn to not take a month to answer your peers'
>>>>>>>>>>>>>> proposals.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Please help reviewing this patch.
>>>>>>>>>>>>> Sorry. I was hoping that Qcom folks would chime in as I don't have any idea
>>>>>>>>>>>>> about the mdm9607 platform. It could be that the mail server migration from
>>>>>>>>>>>>> codeaurora to quicinc put a barrier here.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Let me ping them internally.
>>>>>>>>>>>> Oh, ok, I didn't know. Thanks!
>>>>>>>>>>>      Sorry Miquel, somehow we did not get this email in our inbox.
>>>>>>>>>>>      Thanks to Mani for pinging us, we will test this today and get back.
>>>>>>>>>>         We could not reproduce this issue on our ipq boards (we do not have an mdm9607 right now),
>>>>>>>>>>         and the cause does not look obvious.
>>>>>>>>>>         Can you please share the debug logs you collected for the above, stage by stage?
>>>>>>>>> I won't have access to the board for about two weeks, sorry.
>>>>>>>>>
>>>>>>>>> When I get to it, I'll surely try to send you the logs, though there
>>>>>>>>> wasn't much more than just something jumping to who-knows-where
>>>>>>>>> after clear_bam_transaction was called, resulting in values associated with
>>>>>>>>> the NAND being all zeroed out in pr_err/_debug/etc.
>>>>>>>>>
>>>>>>>>      Ok sure. So was the READID command itself failing, or the subsequent one?
>>>>>>>>      We can check which parameter reset by clear_bam_transaction is causing the
>>>>>>>>      failure.  Meanwhile, looping in Pradeep, who has access to the board and so is in a better
>>>>>>>>      position to debug.
>>>>>>> I'm sorry I have so few details on hand, and no kernel tree (no access to that machine either, for now).
>>>>>>>
>>>>>>>
>>>>>>> I will try to describe to the best of my abilities what I recall.
>>>>>>>
>>>>>>>
>>>>>>> My methodology of making sure things don't go haywire was to print the oob size
>>>>>>> of our NAND basically every two lines of code (yes, I was very desperate at one point),
>>>>>>> as that was zeroed out when *the bug* happened,
>>>>>> This does look like a pointer error at some point and some kernel data
>>>>>> has been corrupted very badly by the driver.
>>>>>>
>>>>>>> leading to a kernel bug/panic/stall
>>>>>>> (can't recall what exactly it was, but it said something along the lines of "no support for
>>>>>>> oob size 0" and then it didn't fail gracefully, leading to some bad jumps and ultimately
>>>>>>> a dead platform..)
>>>>>>>
>>>>>>>
>>>>>>> after hours of digging, I found out that everything goes fine until clear_bam_transaction is called,
>>>>>> Do you remember if this function was called for the first time when
>>>>>> this happened?
>>>>> I think so, if I recall correctly there are no more callers in this path, as readid is the first nand command executed in flash probe flow.
>>>>>
>>>>>
>>>>>
>>>>>>> after that gets executed every nand op starts reading all zeroes (for example in JEDEC ID check)
>>>>>>> so I added the changes from this patch, and things magically started working... My suspicion is
>>>>>>> that the underlying FIFO isn't fully drained (is it a FIFO on 9607? bah, I work on too many SoCs at once)
>>>>>> I don't see it in the list of supported devices, what's the exact
>>>>>> compatible used?
>>>>> qcom,ipq4019-nand
>>>>>
>>>>>
>>>>>
>>>>>>> and this function only makes Linux think it is, without actually draining it, and the leftover
>>>>>>> commands get executed with some parts of them getting overwritten, resulting in the
>>>>>>> famous garbage in - garbage out situation, but that's only a guesstimate..
>>>>>> I would bet for a non allocated bam-ish pointer that is reset to zero
>>>>>> in the clear_bam_transaction() helper.
>>>>>>
>>>>>> Can you get your hands on the board again?
>>>>> Sure, but as I mentioned previously, only in about 2 weeks, I can't really do any dev before then.. :(
>>>>>
>>>>>
>>>>>
>>>>>> It would be nice to check if the allocation always occurs before use,
>>>>>> and if yes, for how many bytes.
>>>>>>
>>>>>> If the pointer is not dangling, then perhaps something else smashes
>>>>>> that pointer.
>>>>>
>>>>> Konrad
>>>>>
>>>>>>> Do note this somehow worked fine on 5.11 and then broke on 5.12/13. I went as far as replacing most
>>>>>>> of the kernel with the updated/downgraded parts via git checkout (I tried many combinations),
>>>>>>> to no avail.. I even tried different compilers and optimization levels, thinking it could have been
>>>>>>> a codegen issue, but no luck either.
>>>>>>>
>>>>>>>
>>>>>>> I.. do understand this email is a total mess to read, as much as it was to write, but
>>>>>>> without access to my code and the machine itself I can't give you solid details, and
>>>>>>> the fact this situation is far from ordinary doesn't help either..
>>>>>>>
>>>>>>>
>>>>>>> The latest (ancient, not quite pretty, but probably working if my memory is correct) version of my patches
>>>>>>> for the mdm9607 is available at [1], I will push the new revision after I get access to the workstation.
>>>>>>>
>>>>    + few more who have access to the board.
>>>>
>>>>     Going by the description, for kernel corruption, we can try out a KASAN build.
>>>>     Since you mentioned it worked until 5.11, did you bisect the driver up to the 5.11 head, and did it work?
>>>>
>>>     Tried running a KASAN enabled image on IPQ board, but no luck. Nothing came out.
>>>     Only if someone with the board can help here, we can proceed
>>>
>>>
>>> Regards,
>>>    Sricharan
>>>
>> I have the board with me again. Please tell me where do we start :)
> 
>  Sorry for the delayed response.
[Looks at the calendar] What can I say... lots of things happened :)

> 
>      As a first step, can you enable KASAN and check if you get any warnings?
> 
>      Then, can you check inside clear_bam_transaction which specific parameter reset is causing the issue?
> 
I have 3 logs for you:

[1] is KASAN=y, with this patch
[2] is KASAN=y, WITHOUT this patch (should die, but doesn't - does KASAN prevent it from doing something stupid?)
[3] is KASAN=n, WITHOUT this patch (dies as expected)

Looks like there's a lot happening..

Konrad
> 
> Regards,
>   Sricharan
> 
> 

[1] https://paste.debian.net/1233873/
[2] https://paste.debian.net/1233874/
[3] https://paste.debian.net/1233878/