[LTP] [REGRESSION] lkft ltp for 6763a36

Richard Palethorpe rpalethorpe@suse.de
Thu Jun 23 12:51:26 CEST 2022


Hello Joerg,

Joerg Vehlow <lkml@jv-coder.de> writes:

> Hi,
>
> On 6/21/2022 at 1:38 PM, Richard Palethorpe wrote:
>> Hello Li,
>> 
>> Li Wang <liwang@redhat.com> writes:
>> 
>>> On Tue, Jun 21, 2022 at 4:56 PM Richard Palethorpe <rpalethorpe@suse.de> wrote:
>>>
>>>  Hello,
>>>
>>>  Joerg Vehlow <lkml@jv-coder.de> writes:
>>>
>>>  > Hi Jan,
>>>  >
>>>  > On 6/21/2022 at 9:22 AM, Jan Stancek wrote:
>>>  >> On Tue, Jun 21, 2022 at 9:15 AM Joerg Vehlow <lkml@jv-coder.de> wrote:
>>>  >>>
>>>  >>> Hi,
>>>  >>>
>>>  >>> On 6/17/2022 at 3:17 AM, lkft@linaro.org wrote:
>>>  >>>> * qemu_i386, ltp-fs-tests
>>>  >>>>   - read_all_proc
>>>  >>> I've seen this test fail a lot, has anyone ever tried to analyze it? I
>>>  >>> was unable to reproduce the problem when running the test in isolation.
>>>  >> 
>>>  >> I see it hit timeouts too (read_all_sys as well). I think it needs its
>>>  >> runtime restored to 5 minutes as well; at the moment it has 30s.
>>>  > Didn't think about that, but at least for the failures I've seen, this
>>>  > is not the reason. The message printed by the test is "Test timeout 5
>>>  > minutes exceeded."
>>>  >
>>>  > Joerg
>>>
>>>  The main issue with read_all is that it also acts as a stress
>>>  test. Reading some files in proc and sys is very resource intensive
>>>  (e.g. due to lock contention) and varies depending on what state the
>>>  system is in. On some systems this test will take a long time. Also
>>>  there are some files which have to be filtered from the test. This
>>>  varies by system as well.
>>>
>>> Does it make sense to have a lite version of read_all_sys,
>>> which would only go through the files sequentially or under slight stress?
>> 
>> IIRC the reason I started doing it in parallel is that sequential
>> opens and reads are even slower and less reliable. Some level of
>> parallelism is required, but too much causes issues.
>> 
>> Thinking about it now, on a single- or two-core system only one worker
>> process will be spawned, which could get blocked for a long time on some
>> reads because of the way some sys/proc files are implemented.
>> 
>> The worker count can be overridden with -w if someone wants to try
>> increasing it to see whether that actually helps on systems with fewer
>> than 3 CPUs. Also, the number of reads is set to 3 in the runtest file;
>> that can be reduced to 1 with -r.
>> 
>>>
>>> With regard to this stressful read_all, I guess we can put into a dedicated
>>> set and run separately in stress testing.
>> 
>> I don't think I'd want to run that. IMO just doing enough to test
>> parallel accesses is what's required; beyond that we run into
>> diminishing returns. However, I'm not against creating another runtest
>> file/entry for that.
>> 
>> On bigger systems I think the test is already quite limited even though
>> it does 3 reads. It only spawns a maximum of 15 workers, which should
>> prevent it from causing huge lock contention on machines with >16 CPUs.
>> At least I've not seen problems with that.
>> 
>> It looks like the log from lkft is for a smaller machine?
> I just used this regression report as an anchor point, because I am
> seeing the same intermittent error on a 4-core and an 8-core aarch64
> system. The system state at the time of the test execution is very
> reproducible, and sometimes the 5 minutes are exceeded, while it only
> takes ~3s when it succeeds. Maybe there is a very time-sensitive kernel
> bug here? I am still not sure how to debug this, because I was never
> able to reproduce it without executing all the LTP tests that run
> before it in our setup.

Very interesting. Well, running tests can cause files to appear in proc
and sys, including ones which remain after testing has finished. The
most obvious example is when a module is loaded and it creates some
sys files.

It could also be that some resources are added which are probed by
existing files. That could be time sensitive if they are cleaned up
asynchronously.
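
One way to check for that would be to snapshot the list of files under
/proc and /sys before and after the preceding tests and diff the two
lists. A minimal standalone sketch (not part of LTP; the nopenfd value
of 20 is arbitrary):

#define _XOPEN_SOURCE 500
#include <ftw.h>
#include <sys/stat.h>
#include <stdio.h>

/* Print every regular file found under the tree; run once before and
 * once after the earlier tests, then diff the output. */
static int print_entry(const char *fpath, const struct stat *sb,
		       int typeflag, struct FTW *ftwbuf)
{
	(void)sb; (void)ftwbuf;

	if (typeflag == FTW_F)
		puts(fpath);

	return 0;	/* keep walking */
}

int main(void)
{
	/* FTW_PHYS: don't follow symlinks, /sys is full of them */
	nftw("/proc", print_entry, 20, FTW_PHYS);
	nftw("/sys", print_entry, 20, FTW_PHYS);
	return 0;
}

Diffing the two snapshots should show which entries the earlier tests
left behind.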

Anyway, it should be possible to profile the open and read syscalls
with ftrace or similar, or you can just set '-v' and inspect the log. We
should also have a per-read timeout; I just haven't got around to
implementing it. It probably requires monitoring, killing and restarting
stuck workers, due to how read is implemented on some files.
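
As a very rough sketch (this is not how read_all is actually
structured; the path and the 5 second timeout below are just
placeholders), the monitoring side could look something like forking a
child per read and killing it when it overruns a deadline:

#include <sys/types.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static int read_with_timeout(const char *path, unsigned int timeout_s)
{
	pid_t pid = fork();

	if (pid < 0)
		return -1;

	if (pid == 0) {		/* child: do the potentially blocking read */
		char buf[4096];
		int fd = open(path, O_RDONLY);

		if (fd >= 0) {
			while (read(fd, buf, sizeof(buf)) > 0)
				;
			close(fd);
		}
		_exit(0);
	}

	/* parent: poll the child and kill it if it misses the deadline */
	for (unsigned int i = 0; i < timeout_s * 10; i++) {
		if (waitpid(pid, NULL, WNOHANG) == pid)
			return 0;	/* finished in time */

		usleep(100 * 1000);
	}

	kill(pid, SIGKILL);
	waitpid(pid, NULL, 0);
	fprintf(stderr, "timed out reading %s\n", path);
	return -1;
}

int main(void)
{
	return read_with_timeout("/proc/self/status", 5) ? 1 : 0;
}

In read_all the parent would then respawn the worker and carry on with
the remaining files instead of giving up.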

>
> Joerg


-- 
Thank you,
Richard.

