[PATCH v11] drm: Add initial ci/ subdirectory

Mon Sep 18 14:35:13 PDT 2023

On 15/09/2023 12:08, Daniel Stone wrote:
> Hey,
> 
> On Thu, 14 Sept 2023 at 10:54, Maxime Ripard <mripard at kernel.org> wrote:
>> On Tue, Sep 12, 2023 at 02:16:41PM +0100, Daniel Stone wrote:
>>> Hopefully less mangled formatting this time: turns out Thunderbird +
>>> plain text is utterly unreadable, so that's one less MUA that is
>>> actually usable to send email to kernel lists without getting shouted
>>> at.
>>
>> Sorry if it felt that way, it definitely wasn't my intention to shout at
>> you. Email is indeed kind of a pain to deal with, and I wanted to keep
>> the discussion going.
> 
> My bad - I didn't mean you _at all_. I was thinking of other, much
> less pleasant, kernel maintainers, and the ongoing struggle of
> attempting to actually send well-formatted email to kernel lists in
> 2023.
> 
>>> I don't quite see the same picture from your side though. For example,
>>> my reading of what you've said is that flaky tests are utterly
>>> unacceptable, as are partial runs, and we shouldn't pretend otherwise.
>>> With your concrete example (which is really helpful, so thanks), what
>>> happens to the MT8173 hdmi-inject test? Do we skip all MT8173 testing
>>> until it's perfect, or does MT8173 testing always fail because that
>>> test does?
>>
>> It's not clear to me why that test is even running in the first place?
>> There's been some confusion on my side here about what we're going to
>> test with this. You've mentioned Mesa and GPUs before, but that's a KMS
>> test so there must be more to it.
>>
>> Either way, it's a relevant test so I guess why not. It turns out that
>> the test is indeed flaky, I guess we could add it to the flaky tests
>> list.
>>
>> BUT
>>
>> I want to have every opportunity to fix whatever that failure is.
> 
> Agreed so far!
> 
>> So:
>>
>>    - Is the test broken? If so, we should report it to IGT dev and remove
>>      it from the test suite.
>>    - If not, is that test failure have been reported to the driver author?
>>    - If no answer/fix, we can add it to the flaky tests list, but do we
>>      have some way to reproduce the test failure?
>>
>> The last part is especially critical. Looking at the list itself, I have
>> no idea what board, kernel version, configuration, or what the failure
>> rate was. Assuming I spend some time looking at the infra to find the
>> board and configuration, how many times do I have to run the tests to
>> expect to reproduce the failure (and thus consider it fixed if it
>> doesn't occur anymore).
>>
>> Like, with that board and test, if my first 100 runs of the test work
>> fine, is it reasonable for me to consider it fixed, or is it only
>> supposed to happen once every 1000 runs?

I wonder if this should be an overall policy or just let the maintainer 
to decide.

In any case these stress tests must be run from time to time to verify 
if flakes are still flakes. We could do it automatically but we need to 
evaluate how to do it properly so it doesn't overload the infra.

>>
>> So, ideally, having some (mandatory) metadata in the test lists with a
>> link to the bug report, the board (DT name?) it happened with, the
>> version and configuration it was first seen with, and an approximation
>> of the failure rate for every flaky test list.
>>
>> I understand that it's probably difficult to get that after the fact on
>> the tests that were already merged, but I'd really like to get that
>> enforced for every new test going forward.
>>
>> That should hopefully get us in a much better position to fix some of
>> those tests issues. And failing that, I can't see how that's
>> sustainable.
> 
> OK yeah, and we're still agreed here. That is definitely the standard
> we should be aiming for.  It is there for some - see
> drivers/gpu/drm/ci/xfails/rockchip-rk3288-skips.txt, but should be
> there for the rest, it's true. (The specific board/DT it was observed
> on can be easily retconned because we only run on one specific board
> type per driver, again to make things more predictable; we could go
> back and retrospectively add those in a header comment?)
> 
> For flakes, it can be hard to pin them down, because, well, they're
> flaky. Usually when we add things in Mesa (sorry to keep coming back
> to Mesa - it's not to say that it's the objective best thing that
> everything should follow, only that it's the thing we have the most
> experience with that we know works well), we do a manual bisect and
> try to pin the blame on a specific merge request which looks like the
> most likely culprit. If nothing obvious jumps out, we just note when
> it was first observed and provide some sample job logs. But yeah, it
> should be more verbose.
> 
> FWIW, the reason it wasn't done here - not to say that it shouldn't
> have been done better, but here we are - is that we just hammered a
> load of test runs, vacuumed up the results with a script, and that's
> what generated those files. Given the number of tests and devices, it
> was hard to narrow each down individually, but yeah, it is something
> which really wants further analysis and drilling into. It's a good
> to-do, and I agree it should be the standard going forward.

Yes, during development I was just worried about to get a pipeline that 
would succeed most of the time (otherwise people would start ignoring 
when it fails), so they just got run a couple of times and a script 
filled the flakes list.
For me the idea was "let's get a starting point" first, but yeah, we 
need to improve how we deal with it from now on.

> 
>> And Mesa does show what I'm talking about:
>>
>> $ find -name *-flakes.txt | xargs git diff --stat  e58a10af640ba58b6001f5c5ad750b782547da76
>> [...]
>>
>> In the history of Mesa, there's never been a single test removed from a
>> flaky test list.
> 
> As Rob says, that's definitely wrong. But there is a good point in
> there: how do you know a test isn't flaky anymore? 100 runs is a
> reasonable benchmark, but 1000 is ideal. At a 1% failure rate, with 20
> devices, that's just too many spurious false-fails to have a usable
> workflow.
> 
> We do have some tools to make stress testing easier, but those need to
> be better documented. We'll fix that. The tools we have which also
> pull out the metadata etc also need documenting - right now they
> aren't because they're under _extremely_ heavy development, but they
> can be further enhanced to e.g. pull out the igt results automatically
> and point very clearly to the cause. Also on the to-do.
> 
>>> Only maintainers can actually fix the drivers (or the tests tbf). But
>>> doing the testing does let us be really clear to everyone what the
>>> actual state is, and that way people can make informed decisions too.
>>> And the only way we're going to drive the test rate down is by the
>>> subsystem maintainers enforcing it.
>>
>> Just FYI, I'm not on the other side of the fence there, I'd really like
>> to have some kind of validation. I talked about it at XDC some years
>> ago, and discussed it several people at length over the years. So I'm
>> definitely not in the CI-is-bad camp.
>>
>>> Does that make sense on where I'm (and I think a lot of others are)
>>> coming from?
>>
>> That makes sense from your perspective, but it's not clear to me how you
>> can expect maintainers to own the tests if they were never involved in
>> the process.
>>
>> They are not in Cc of the flaky tests patches, they are not reported
>> that the bug is failing, how can they own that process if we never
>> reached out and involved them?
>>
>> We're all overworked, you can't expect them to just look at the flaky
>> test list every now and then and figure it out.
> 
> Absolutely. We got acks (or at least not-nacks) from the driver
> developers, but yeah, they should absolutely be part of the loop for
> those updates. I don't think we can necessarily block on them though.
> Say we add vc4 KMS tests, then after a backmerge we start to see a
> bunch of flakes on it, but you're sitting on a beach for a couple of
> weeks. If we wait for you to get back, see it, and merge it, then
> that's two weeks of people submitting Rockchip driver changes and
> getting told that their changes failed CI. That's exactly what we want
> to avoid, because it erodes confidence and usefulness of CI when
> people expect failures and ignore them by default.
> 
> So I would say that it's reasonable for expectations to be updated
> according to what actually happens in practice, but also to make sure
> that the maintainers are explicitly informed and kept in the loop, and
> not just surprised when they look at the lists and see a bunch of
> stuff happened without their knowledge.

I was thinking in adding entries in MAINTAINERS file pointing to each 
flake/skip/fails list file to their maintainers, so get_maintainers.pl 
can return the right thing.

> 
> Again there's much more to be done on the tooling here. Part of it is
> CLI tools and automation, part of it is dashboards and
> easily-digestible reporting, and then there's integration with things
> like KernelCI. KCI(DB) is actually quite high up on the list, but
> we're mostly waiting until a lot of the KCI rework happens so we can
> actually properly integrate with the new system.
> 
> Right now a lot of the tooling we have is pretty involved - for
> example, we do have ci-collate as a Python library which can inspect a
> number of pipelines, pull out detailed status and logs, etc, but it
> mostly needs to be used as a library with bespoke scripts, rather than
> a ready-made tool. Work on that is ongoing to make it way more clear
> and accessible though.
> 
> So I think it sounds like we're on the same page and going exactly in
> the same direction, just that this is a starting point rather than the
> desired end state. And the main point is that having a set of
> xfails/flakes parachuted in with little to no context is trying to get
> an MVP bootstrapped, rather than how we expect things to go in future.
> Does that sound about right?
> 
> Cheers,
> Daniel

Regards,
Helen