[BUG] New Kernel Bugs

Tue Nov 13 14:02:59 EST 2007

On Tue, 13 Nov 2007 04:32:07 -0800 (PST) David Miller <davem at davemloft.net> wrote:

> From: Andrew Morton <akpm at linux-foundation.org>
> Date: Tue, 13 Nov 2007 04:12:59 -0800
> 
> > On Tue, 13 Nov 2007 03:58:24 -0800 (PST) David Miller <davem at davemloft.net> wrote:
> > 
> > > From: Andrew Morton <akpm at linux-foundation.org>
> > > Date: Tue, 13 Nov 2007 03:49:16 -0800
> > > 
> > > > Do you believe that our response to bug reports is adequate?
> > > 
> > > Do you feel that making us feel and look like shit helps?
> > 
> > That doesn't answer my question.
> > 
> > See, first we need to work out whether we have a problem.  If we do this,
> > then we can then have a think about what to do about it.
> > 
> > I tried to convince the 2006 KS attendees that we have a problem and I
> > resoundingly failed.  People seemed to think that we're doing OK.
> > 
> > But it appears that data such as this contradicts that belief.
> > 
> > This is not a minor matter.  If the kernel _is_ slowly deteriorating then
> > this won't become readily apparent until it has been happening for a number
> > of years.  By that stage there will be so much work to do to get us back to
> > an acceptable level that it will take a huge effort.  And it will take a
> > long time after that for the kerel to get its reputation back.
> > 
> > So it is important that we catch deterioration *early* if it is happening.
> 
> You tell me what I should spend my time working on, and I promise to
> do it OK? :-)

My suggestion: regressions.

If we're really active in chasing down the regressions then I think we can
be confident that the kernel isn't deteriorating.  Probably it will be
improving as we also fix some always-been-there bugs.

I think that we're fairly good about working the regressions in
Adrian/Michal/Rafael's lists but once Linus releases 2.6.x we tend to let
the unsolved ones slide, and we don't pay as much attention to the
regressions which 2.6.x testers report.

> For example, if I have a choice between a TCP crash just about anyone
> can hit and some obscure issue only reported with some device nearly
> nobody has, which one should I analyze and work on?
> 
> That's the problem.  All of us prioritize and it means the chaff
> collects at the bottom.  You cannot fix that except by getting more
> bug fixers so that the chaff pile has a chance to get smaller.
> 
> Luckily if the report being ignored isn't chaff, it will show up again
> (and again and again) and this triggers a reprioritization because not
> only is the bug no longer chaff, it also now got a lot of information
> tagged to it so it's a double worthwhile investment to work on the
> problem.
> 
> I think a lot of bugs that "aren't getting looked at" are simply
> sitting in some early stage of this process.

Yes, that's a useful technique.  If multiple people are being hurt a lot by
a bug then that's a more important one to fix than the single-person
minor-irritant bug.

otoh that doesn't work very well with driver/platform bugs.  Often these
are regressions which only a single person can reproduce within the time
window which we have in which we can fix it.  If we don't fix it in that
window it'll go out to distros and presumably some more people will hit it.

So I don't see much alternative here to the traditional
work-with-the-originator way of resolving it.

git bisection should really help us with these regressions but it doesn't
appear that people are using as much as one would like.  I'm hoping that
the very good http://www.kernel.org/doc/local/git-quick.html will help us
out here.  Thanks to the mystery person who prepared that.