list_add corruption from cmd.c (wasn't Re: bug: sleeping function called ...)

pgf at laptop.org pgf at laptop.org
Mon Jun 21 11:06:42 EDT 2010


[ my apologies for forgetting to change the subject line when
  i edited an old message to start this new thread.  text below
  is unchanged.  resending may cause as much confusion as it
  avoids, but nonetheless....  -pgf ]

hi --

we've been seeing list corruption related to libertas commands on
both the OLPC XO-1 and XO-1.5 laptops.  a couple of examples of
this can be seen at http://dev.laptop.org/ticket/9962.

i've been able to reproduce, and track down, one variant of this
problem.  i'm not convinced it's the same as in the bug linked
above, but it's certainly an issue.  i'm just not sure how it should
be fixed.

(the subject of this mail is also written up at: 
http://dev.laptop.org/ticket/10177 -- if you're reading this, you
don't really need to read that.)

the scenario is this:  when lbs_thread() times out a command
("Excessive timeouts submitting command 0x%04x\n"), it calls
lbs_complete_command().  lbs_complete_command(), in turn, wakes
up cmdwait_q, and then, depending on the value of cmd->callback,
scrubs the command structure and requeues it on the free queue (by
calling __lbs_cleanup_and_insert_cmd():

    void lbs_complete_command(struct lbs_private *priv,
		    struct cmd_ctrl_node *cmd, int result)
    {
	    if (cmd == priv->cur_cmd)
		    priv->cur_cmd_retcode = result;

	    cmd->result = result;
	    cmd->cmdwaitqwoken = 1;
	    wake_up_interruptible(&cmd->cmdwait_q);

	    if (!cmd->callback || cmd->callback == lbs_cmd_async_callback)
		    __lbs_cleanup_and_insert_cmd(priv, cmd);
	    priv->cur_cmd = NULL;
    }


however, it's possible that the process that's being woken up
will run first.  in my case, that process is waiting in
__lbs_cmd().  if __lbs_cmd() runs immediately after its wakeup,
it will unconditionally call __lbs_cleanup_and_insert_cmd() to
scrub and free the command buffer.  the scrubbing includes
nulling the cmd->callback element.  when control eventually comes
back to lbs_complete_command(), "callback" is null, and the
command gets scrubbed and freed again.  this corrupts the cmdfreeq.

i have a debug trace that shows this happening: 
  http://dev.laptop.org/~pgf/junk/duplicate_list_add.log
(search for "Excessive timeouts".)

as i said above, i'm not sure how this should be fixed.  it seems
like the call to __lbs_cleanup_and_insert_cmd() should be made
in just one place, but i don't know the driver well enough to
say which place that should be.

paul

p.s. as an aside, note that __lbs_cleanup_and_insert_cmd() is called
without taking the spinlock on driver_lock.  actually, i think
lbs_thread() should be taking the lock before calling
lbs_complete_command().  this is a bug, but it's not today's bug.

p.p.s. all of this begs the question of why we seem to see a lot
of timed-out commands...

=---------------------
 paul fox, pgf at laptop.org



More information about the libertas-dev mailing list