UBI leb_write_unlock NULL pointer Oops (continuation)

Mon Feb 24 10:09:56 EST 2014

On 22 Feb 2014, richard at nod.at wrote:

> Am 22.02.2014 01:49, schrieb Bill Pringlemeir:

>> I think the 'ubi_eba_copy_leb()' may have an issue.
>>
>> int ubi_eba_copy_leb(struct ubi_device *ubi, int from, int to,
>> 		     struct ubi_vid_hdr *vid_hdr)
>> {
>> ...
>>
>> 	err = leb_write_trylock(ubi, vol_id, lnum);
>>
>> static int leb_write_trylock(struct ubi_device *ubi, int vol_id, int lnum)
>> {
>> ..
>> 	le = ltree_add_entry(ubi, vol_id, lnum);  /* users + 1 */
>> 	if (IS_ERR(le))
>> 		return PTR_ERR(le);
>> 	if (down_write_trylock(&le->mutex))
>> 		return 0;
>>
>> 	/* Contention, cancel */
>> 	spin_lock(&ubi->ltree_lock);
>> 	le->users -= 1;                           /* user - 1 */
>> ...
>> 	spin_unlock(&ubi->ltree_lock);           
>>
>> 	return 1;
>> }
>>
>> 	if (err)...
>>
	if (vol->eba_tbl[lnum] != from) {
		dbg_wl("LEB %d:%d is no longer mapped to PEB %d, mapped to PEB %d, cancel",
		       vol_id, lnum, from, vol->eba_tbl[lnum]);
		err = MOVE_CANCEL_RACE;
		goto out_unlock_leb;
	}
...
out_unlock_leb:
	leb_write_unlock(ubi, vol_id, lnum);     /* user - 1 */

>> Didn't the count have to bump up in this case?  This isn't in Thorsten's
>> traces, but neither are any 'down_read' or 'up_read' traces; maybe
>> everything is in cache?

> Hmm, I'm not sure whether I was able to follow your thought.  But
> leb_write_unlock() is balanced with leb_write_trylock() in
> ubi_eba_copy_leb() which makes perfectly sense to me.  What exactly is
> the problem?

There are two things that must be balanced.  The 'reference count'
ubi_ltree_entry -> users and the rw_semaphore down/up.  You are right,
the trylock needs to be balanced by the 'leb_write_unlock'.  However,
the 'leb_write_trylock()' has already decremented 'users' in preperation
to move the 'lnum'.  However, in the case of contention,
'ubi_eba_copy_leb' bails and does the 'leb_write_unlock()', which
balances the 'trylock', but unbalances the 'users' reference count (I
added some comments on the lines).

I was postulating that the down/up paths are not on the same semaphore.
There is nothing in the code to check this.  It is possible for someone
to come in and recycle the 'lnum' and create a new rwsemaphore.  Here,
the down/up are done on different 'rwsemaphores' and this will cause
issues.

I was looking at using a 'kref' instead of 'users' and then I found this
path.

Fwiw,
Bill Pringlemeir.