UBIFS errors when file-system is full

Mon Aug 31 19:20:08 PDT 2015

Hi Brian,

On 2015-08-31 18:43, Brian Norris wrote:
> Hi Stefan,
> 
> On Mon, Aug 31, 2015 at 05:45:47PM -0700, Stefan Agner wrote:
>> On 2015-08-12 00:27, Richard Weinberger wrote:
>> > Am 12.08.2015 um 09:01 schrieb Stefan Agner:
>> >> [also added Brian to the discussion, since he had a look into that
>> >> driver before]
>> >
>> > Good idea, maybe Brian has an idea.
> 
> Sorry, didn't look until now.
> 
>> >> I worked on the driver since quite some time, currently v10 is in
>> >> review. With this issue in mind, I went through the driver however I
>> >> currently can't see an issue.
>> >>
>> >> The error position is always page aligned, but at different pages. We
>> >> printed the reread buffers once: It seems that one page lands on flash
>> >> twice. My guess is that the second page doesn't get transmitted
>> >> properly, while the new column/row gets transmitted and
>> >> NAND_CMD_PAGEPROG executed... Hence the same buffer would be written to
>> >> the device again.
>> >>
>> >> The NFC IP in Vybrid (vf610) has a higher level programming model which
>> >> takes care of the command sequencing. Therefore some callbacks are not
>> >> actually sending a command to the device (e.g. NAND_CMD_SEQIN) since
>> >> this will be done one command later, on in NAND_CMD_PAGEPROG. Now, of
>> >> course, the driver relies heavily on not being interrupted by other
>> >> requests in between, (also not read!) but I thought that this is taken
>> >> care of by the MTD subsystem? So for me it is a bit hard to spot the
>> >> error since I'm always unsure whether the assumptions regarding
>> >> locking/exclusiveness between the calls is really guaranteed...
>> >
>> > NAND access is serialized using nand_get_device() and nand_release_device()
>> > in nand_base.c, so serialization should be fine.
> 
> Right, on a macro level. But I suspect you may have serialization
> problems within your driver routines themselves. See below.
> 
>> I did some more testing with v11 of this driver, and it seems when I
>> continuously reset the system, I get read errors from time to time. The
>> read errors end up in various errors, I have seen ubifs_read_node: bad
>> node type or ubifs_tnc_read_node: bad key in node. However, particularly
>> often and interesting are ubifs_decompress: cannot decompress errors,
>> such as this one. It seems that same page is read twice (or has been
>> written twice, see data pattern at offset 0x00000000 and 0x00000800).
>>
> [...]
>>
>> I checked the code regarding glitches on read/write side which could
>> lead to page number not getting updated or something along this line, so
>> far I could not nail the issue.
>>
>> Brian, with the pattern above in mind, do you see possible issues in the
>> driver?
> 
> I don't have any particular knowledge about that pattern. Regarding
> patterns, I've mostly been worried about incomplete "check for 0xff"
> algorithms. Doesn't seem to be relevant here.

By  pattern I meant mainly the fact that the same data appear twice.

> 
> I don't know too much of this IP other than what I glean from general
> NAND code review. With that in mind, this snippet does look a bit
> suspect:
> 
> +static void vf610_nfc_done(struct vf610_nfc *nfc)
> +{
> +	unsigned long timeout = msecs_to_jiffies(100);
> +
> +	/*
> +	 * Barrier is needed after this write. This write need
> +	 * to be done before reading the next register the first
> +	 * time.
> +	 * vf610_nfc_set implicates such a barrier by using writel
> +	 * to write to the register.
> +	 */
> +	vf610_nfc_set(nfc, NFC_IRQ_STATUS, IDLE_EN_BIT);
> +	vf610_nfc_set(nfc, NFC_FLASH_CMD2, START_BIT);
> +
> +	if (!(vf610_nfc_read(nfc, NFC_IRQ_STATUS) & IDLE_IRQ_BIT)) {
> 
> ^^^ this line
> 
> +		if (!wait_for_completion_timeout(&nfc->cmd_done, timeout))
> +			dev_warn(nfc->dev, "Timeout while waiting for BUSY.\n");
> +	}
> +	vf610_nfc_clear_status(nfc);
> +}
> 
> Why do you actually need to check the idle bit? If you need to read it
> to clear out some FIFO, then that's fine -- just read it, but don't use
> it as a second condition. The complete()/wait_for_completion()
> synchronization should be sufficient on its own. (If not, then I think
> you have other problems.)
> 
> The NFC_IRQ_STATUS-based condition looks like it could lead to races
> because your interrupt may fire between setting and checking
> the idle bit. So the IRQ handler will increment the completion struct
> (cmd_done), but you *won't* be doing the corresponding decrement via
> wait_for_completion(). If you don't do that... then the subsequent
> wait_for_completion() will immediately succeed.

I totally see your point. And it would even explain the pattern above:
Since the subsequent wait_for_completion() succeeds immediately, the
data read using vf610_nfc_read_buf would just return the same data
(given the controller is not fast enough to transfer the data...)

> Anyway, that's a quick and possibly disorganized explanation of my
> initial thoughts. Let me know if you
> (1) don't understand me;
> (2) disagree; or
> (3) have tested my suggested fix and still see the behavior.
> 
> I can take another look if we're on to option (2) or (3) :)

Cool, thanks, will test the above and let you know...

--
Stefan