[LTP] [REGRESSION] lkft ltp for 6763a36
Joerg Vehlow
lkml@jv-coder.de
Tue Jun 21 14:51:11 CEST 2022
Hi,
On 6/21/2022 at 1:38 PM, Richard Palethorpe wrote:
> Hello Li,
>
> Li Wang <liwang@redhat.com> writes:
>
>> On Tue, Jun 21, 2022 at 4:56 PM Richard Palethorpe <rpalethorpe@suse.de> wrote:
>>
>> Hello,
>>
>> Joerg Vehlow <lkml@jv-coder.de> writes:
>>
>> > Hi Jan,
>> >
>> > On 6/21/2022 at 9:22 AM, Jan Stancek wrote:
>> >> On Tue, Jun 21, 2022 at 9:15 AM Joerg Vehlow <lkml@jv-coder.de> wrote:
>> >>>
>> >>> Hi,
>> >>>
>> >>> On 6/17/2022 at 3:17 AM, lkft@linaro.org wrote:
>> >>>> * qemu_i386, ltp-fs-tests
>> >>>> - read_all_proc
>> >>> I've seen this test fail a lot; has anyone ever tried to analyze it? I
>> >>> was unable to reproduce the problem when running the test in isolation.
>> >>
>> >> I see it hit timeouts too (read_all_sys as well). I think it needs its
>> >> runtime restored to 5 minutes as well; at the moment it has 30s.
>> > Didn't think about that, but at least for the failures I've seen, this
>> > is not the reason. The message printed by the test is "Test timeout 5
>> > minutes exceeded."
>> >
>> > Joerg
>>
>> The main issue with read_all is that it also acts as a stress
>> test. Reading some files in proc and sys is very resource-intensive
>> (e.g. due to lock contention), and the cost varies depending on what
>> state the system is in. On some systems this test will take a long
>> time. Also, there are some files which have to be filtered out of the
>> test; this varies by system as well.
>>
>> Does it make sense to have a lite version of read_all_sys, which may only
>> go through files sequentially or under slight stress?
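>>
>> Just as a rough sketch of what I mean (the test tag and exact options here
>> are only illustrative, not a concrete proposal), such a lite runtest entry
>> could look something like:
>>
>>   read_all_sys_lite read_all -d /sys -r 1 -w 1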
>
> IIRC the reason I started doing it in parallel is that sequential
> opens and reads are even slower and less reliable. Some level of parallelism
> is required, but too much of it causes issues.
>
> Thinking about it now, on a single or two core system only one worker
> process will be spawned, which could get blocked for a long time on some
> reads because of the way some sys/proc files are implemented.
>
> The worker count can be overridden with -w if someone wants to try
> increasing it to see if that actually helps on systems with <3
> CPUs. Also, the number of reads is set to 3 in the runtest file; that can
> be reduced to 1 with -r.
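>
> For example, on a 2 CPU machine one could try something roughly like the
> following (the -d /sys part is just how the runtest entry points it at
> sysfs, so treat the exact invocation as a sketch):
>
>   read_all -d /sys -r 1 -w 4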
>
>>
>> With regard to this stressful read_all, I guess we can put it into a
>> dedicated set and run it separately in stress testing.
>
> I don't think I'd want to run that. IMO just doing enough to test
> parallel accesses is what's required. More than that and we will run into
> diminishing returns. However, I'm not against creating another runtest
> file/entry for that.
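>
> As a sketch of what I mean by that (the tag and the numbers are made up
> and would need experimentation), a dedicated stress entry could look like:
>
>   read_all_sys_stress read_all -d /sys -r 10 -w 32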
>
> On bigger systems I think the test is already quite limited, even though
> it does 3 reads. It only spawns a max of 15 workers, which should prevent
> it from causing huge lock contention on machines with >16 CPUs. At least
> I've not seen problems with that.
>
> It looks like the log from lkft is for a smaller machine?
I just used this regression report as an anchor point, because I am
seeing the same intermittent error on a 4-core and an 8-core aarch64 system.
The system state at the time of the test execution is very reproducible,
and sometimes the 5 minutes are exceeded, while the test only takes ~3s when
it is successful. Maybe there is a very time-sensitive kernel bug here?
I am still not sure how to debug this, because I was never able to
reproduce it without executing all the LTP tests that run before it in our setup.
Joerg