Help needed with corruption detection/ubifs_wbuf_sync_nolock

Fri Jun 29 09:40:08 EDT 2012

Artem,

Our analysis shows that LEBs are being corrupted by a software bug that we are still trying to find.  The data appears to be shifted within the buffer (as if by memcpy(buf, buf+2, node_size)), which we think is related to our power management.

My goal was to trap the bad write and debug whatever is corrupting the data before it reaches the flash.

You've convinced me that checking the wbuf probably isn't the right tactic.  I put checks into all the other ubi_leb_write() call sites, and the corruption doesn't seem to come through those.

Still trying to find the best way to trap on it before it happens.

Thanks

----- Original Message -----
> From: Artem Bityutskiy <dedekind1 at gmail.com>
> To: Reginald Perrin <reggyperrin at yahoo.com>
> Cc: MTD Mailing List <linux-mtd at lists.infradead.org>
> Sent: Wednesday, June 27, 2012 10:22 AM
> Subject: Re: Help needed with corruption detection/ubifs_wbuf_sync_nolock
> 
> Hi,
> 
> On Mon, 2012-06-25 at 06:58 -0700, Reginald Perrin wrote:
>>  I'm tracking down a corruption issue, and trying to trace back where
>>  LEB's are getting randomly corrupted in our system (a very rare event,
>>  but it can happen).  I'm focusing on ubifs/io.c, and trying to
>>  validate data before we send to ubi_leb_write().
> 
> You are not using MLC NAND, right? Did you validate your flash using MTD
> tests?
> 
>>  Can somebody please clarify something for me
>>  on ubifs_wbuf_sync_nolock()?  I'm trying to validate that the data
>>  we're writing hasn't been corrupted.  I thought I could just check
>>  that the node-type was valid, such as:
>> 
>>      if ( ((struct ubifs_ch *)wbuf->buf)->node_type > UBIFS_ORPH_NODE )
>>  {
>> 
>>          // ABORT WRITE
>>      }
>> 
>>      err = ubi_leb_write(c->ubi, wbuf->lnum, wbuf->buf, wbuf->offs,
>> 
> The above code assumes the contents of the write-buffer always start
> with a UBIFS node, which is not true. 'wbuf->buf[0]' may be the middle
> or the end of a node. If you want to add a check, you need to write a
> helper function which _scans_ the write-buffer and searches for
> UBIFS_NODE_MAGIC, which _then_ may be the start of a node. Then you go
> check the common header CRC. And the write-buffer may contain more than
> one node, so you need to iterate. And you need to take into account the
> case when this is the end of the write-buffer and the common header does
> not fit.
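
For reference, the helper described above could look roughly like the
following. This is only a userspace sketch under stated assumptions, not
kernel code: wbuf_first_bad_node(), le32() and the local crc32_le() are
hypothetical names made up for illustration, the header layout (magic at
offset 0, crc at 4, len at 16, 24 bytes total) follows struct ubifs_ch,
and the CRC is assumed to be computed as UBIFS does it, i.e. seeded with
UBIFS_CRC32_INIT over everything after the first 8 bytes of the node.

```c
#include <stddef.h>
#include <stdint.h>

#define NODE_MAGIC 0x06101831u  /* value of UBIFS_NODE_MAGIC */
#define CH_SZ      24           /* size of struct ubifs_ch */
#define CRC_INIT   0xFFFFFFFFu  /* value of UBIFS_CRC32_INIT */

/* Read a little-endian 32-bit value (on-flash fields are little-endian). */
static uint32_t le32(const uint8_t *p)
{
	return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
	       (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Bitwise stand-in for the kernel's crc32(): no final inversion. */
static uint32_t crc32_le(uint32_t crc, const uint8_t *p, size_t len)
{
	while (len--) {
		crc ^= *p++;
		for (int i = 0; i < 8; i++)
			crc = (crc >> 1) ^ ((crc & 1) ? 0xEDB88320u : 0);
	}
	return crc;
}

/*
 * Scan buf[0..used) for nodes. Returns the buffer offset of the first
 * node with a bad length or CRC, or -1 if every complete node checks out.
 */
static long wbuf_first_bad_node(const uint8_t *buf, size_t used)
{
	size_t off = 0;

	/* Stop once a full common header no longer fits. */
	while (off + CH_SZ <= used) {
		uint32_t len;

		/* buf[off] may be mid-node or padding: hunt for the magic. */
		if (le32(buf + off) != NODE_MAGIC) {
			off++;
			continue;
		}

		len = le32(buf + off + 16);
		if (len < CH_SZ)
			return (long)off;   /* impossible node length */
		if (len > used - off)
			break;              /* node tail is not in this buffer */

		if (crc32_le(CRC_INIT, buf + off + 8, len - 8) !=
		    le32(buf + off + 4))
			return (long)off;   /* CRC mismatch */

		off += len;                 /* move on to the next node */
	}
	return -1;
}
```

In the kernel this would be called on wbuf->buf with wbuf->used just
before ubi_leb_write(), dumping a backtrace on a non-negative result.
One caveat: a magic value that happens to occur inside node data is
treated as a node start, so a reported offset is a hint to investigate
rather than proof of corruption, and a node whose tail is not yet in the
buffer goes unchecked.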
>> 
>>  Can anybody help me understand how to check to see if the LEB is
>>  corrupted before we write?  I'm trying to get close enough to the
>>  corruption to get a backtrace.
> 
> Corrupted how - the CRC is corrupted? You can try to scan the previous
> LEB using 'ubifs_scan()' before switching to the new one in the
> 'ubifs_wbuf_seek_nolock()' function, I guess.
> 
> -- 
> Best Regards,
> Artem Bityutskiy
> 