[LTP] [PATCH 1/2] read_all: Add worker timeout

Tue Jul 19 10:58:03 CEST 2022

On Mon, Jul 18, 2022 at 6:37 PM Richard Palethorpe <rpalethorpe@suse.de>
wrote:

> Hello Li,
>
> Li Wang <liwang@redhat.com> writes:
>
> > Hi Richard,
> >
> > On Tue, Jul 12, 2022 at 8:46 PM Richard Palethorpe <rpalethorpe@suse.com> wrote:
> >
> >  Kill and restart workers that take too long to read a file. The
> >  default timeout is one second. A custom time can be set with the
> >  new -t option.
> >
> >  This is to prevent a worker from blocking forever in a read. Currently
> >  when this happens the whole test times out and any remaining files in
> >  the worker's queue are not tested.
> >
> >  As a side effect we can now also set the timeout very low to cause
> >  partial reads.
> >
> > Restarting workers that are blocked reading files is a practical
> > approach. My only concern is that a restarted worker may also be slow
> > reading some specific files, so it will time out again. Then the rest
> > will fall into restart cycles. Here I'd suggest loosening the
> > worker_timeout to 3000 ms to accommodate systems of different IO speeds.
>
> I didn't observe any issues setting the timeout very low
> (<100ms). Workers remove an item from the queue before trying to read
> it, so they shouldn't get stuck in a restart cycle on the same file (if
> that's what you meant).
>

Ah yes, exactly.

> I'm not sure if this is better or worse. In my sys folder there are
> approx 35K files. Obviously a timeout of even 1s is far too long to read
> all of them if they all time out. In fact if we set the timeout to 100ms
> then they would still require about 58 mins.
>

At the moment we can only pick a rough value for that, since nobody has
much sample data.
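
As a rough check of the worst-case figures quoted above, the arithmetic
can be sketched in C. This assumes a single worker handling the files one
after another and ignores restart overhead; worst_case_secs() is a
hypothetical helper for illustration, not part of read_all.

```c
#include <assert.h>

/*
 * Worst-case wall time in seconds if every file hits the timeout.
 * Assumption: files are handled serially and the cost of
 * kill+wait+fork is ignored.
 */
static long worst_case_secs(long nr_files, long timeout_ms)
{
	assert(nr_files >= 0 && timeout_ms >= 0);

	return nr_files * timeout_ms / 1000;
}
```

With 35000 files, a 100 ms timeout gives 3500 s (about 58 minutes) and a
1 s timeout gives 35000 s (close to 10 hours).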

>
> Of course most of them take a very short time on most systems. I guess
> that ideally the timeout needs to be adapted as the test runs so that
> all remaining files can be read. The issue with that though is that
> kill+wait+fork takes more time than most reads. Although that can be
> measured as well....
>
> This issue is quite a lot like the issues we have with fuzzy sync. I
> think it's maybe time to start dynamically adapting to system
> performance. That's probably best left to another patch series though.
>

Yes, collecting sample data can help evaluate the performance.
That's a practical way to make things more reliable.
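
One possible shape for the adaptive timeout discussed above, as a minimal
sketch only: divide the remaining time budget by the number of files left
and subtract a measured restart cost, so that the rest of the queue still
fits even if every read times out. The helper name, its parameters and the
lower bound are all assumptions for illustration; nothing like this exists
in the patch.

```c
#include <assert.h>

/*
 * Hypothetical helper: per-file timeout in milliseconds.
 *
 * budget_ms       - time left before the whole test times out
 * files_left      - files still waiting in the queue (must be > 0)
 * restart_cost_ms - measured cost of one kill+wait+fork cycle
 * min_ms          - lower bound, so the timeout never drops to zero
 *                   or below
 */
static long adaptive_timeout_ms(long budget_ms, long files_left,
				long restart_cost_ms, long min_ms)
{
	long per_file;

	assert(files_left > 0);

	per_file = budget_ms / files_left - restart_cost_ms;

	return per_file < min_ms ? min_ms : per_file;
}
```

For example, with 30 s of budget left, 100 files remaining and a measured
restart cost of 50 ms, this yields 250 ms per file. When the budget is too
small to cover the restarts, the result is clamped to min_ms.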

-- 
Regards,
Li Wang