[LTP] [REGRESSION] lkft ltp for 6763a36

Joerg Vehlow lkml@jv-coder.de
Tue Jun 21 14:51:11 CEST 2022


Hi,

On 6/21/2022 at 1:38 PM, Richard Palethorpe wrote:
> Hello Li,
> 
> Li Wang <liwang@redhat.com> writes:
> 
>> On Tue, Jun 21, 2022 at 4:56 PM Richard Palethorpe <rpalethorpe@suse.de> wrote:
>>
>>  Hello,
>>
>>  Joerg Vehlow <lkml@jv-coder.de> writes:
>>
>>  > Hi Jan,
>>  >
>>  > On 6/21/2022 at 9:22 AM, Jan Stancek wrote:
>>  >> On Tue, Jun 21, 2022 at 9:15 AM Joerg Vehlow <lkml@jv-coder.de> wrote:
>>  >>>
>>  >>> Hi,
>>  >>>
>>  >>> On 6/17/2022 at 3:17 AM, lkft@linaro.org wrote:
>>  >>>> * qemu_i386, ltp-fs-tests
>>  >>>>   - read_all_proc
>>  >>> I've seen this test fail a lot; has anyone ever tried to analyze it? I
>>  >>> was unable to reproduce the problem when running the test in isolation.
>>  >> 
>>  >> I see it hit timeouts too (read_all_sys as well). I think it needs its
>>  >> runtime restored to 5 minutes as well; at the moment it has 30s.
>>  > Didn't think about that, but at least for the failures I've seen, this
>>  > is not the reason. The message printed by the test is "Test timeout 5
>>  > minutes exceeded."
>>  >
>>  > Joerg
>>
>>  The main issue with read_all is that it also acts as a stress
>>  test. Reading some files in proc and sys is very resource intensive
>>  (e.g. due to lock contention) and varies depending on what state the
>>  system is in. On some systems this test will take a long time. Also
>>  there are some files which have to be filtered from the test. This
>>  varies by system as well.
>>
>> Does it make sense to have a lite version of read_all_sys,
>> one which only goes through the files sequentially or under slight stress?
> 
> IIRC the reason I started doing it in parallel is that sequential
> opens and reads are even slower and unreliable. Some level of parallelism
> is required, but too much causes issues.
> 
> Thinking about it now, on a single or two core system only one worker
> process will be spawned, which could get blocked for a long time on some
> reads because of the way some sys/proc files are implemented.
> 
> The worker count can be overridden with -w if someone wants to try
> increasing it to see whether that actually helps on systems with <3
> CPUs. Also the number of reads is set to 3 in the runtest file; that can
> be reduced to 1 with -r.
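
For anyone wanting to try this, overriding both would look something like
the following (assuming read_all's -d option as used by the runtest
entries; the exact defaults may differ by LTP version):

  read_all -d /proc -r 1 -w 4

i.e. a single read per file and the worker count forced to 4 instead of
the value derived from the CPU count.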
> 
>>
>> With regard to this stressful read_all, I guess we could put it into a
>> dedicated set and run it separately in stress testing.
> 
> I don't think I'd want to run that. IMO just doing enough to test
> parallel accesses is what's required. Beyond that we run into
> diminishing returns. However, I'm not against creating another runtest
> file/entry for that.
> 
> On bigger systems I think the test is already quite limited even though
> it does 3 reads. It only spawns a maximum of 15 workers, which should prevent
> it from causing huge lock contention on machines with >16 CPUs. At least
> I've not seen problems with that.
> 
> It looks like the log from lkft is for a smaller machine?
I just used this regression report as an anchor point, because I am
seeing the same intermittent error on a 4-core and an 8-core aarch64 system.
The system state at the time of the test execution is very reproducible,
and sometimes the 5 minutes are exceeded, while it only takes ~3s when
it succeeds. Maybe there is a very time-sensitive kernel bug here?
I am still not sure how to debug this, because I was never able to
reproduce it without executing all the LTP tests that run before it in our setup.
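One thing I might try is looping the test directly against the affected
directory to see whether the hang ever shows up in isolation, roughly
like this (assuming the default /opt/ltp install and the read_all options
discussed above):

  cd /opt/ltp/testcases/bin
  for i in $(seq 1 100); do ./read_all -d /proc -r 3 || break; done

although so far the failure has only appeared when the full suite runs first.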

Joerg

