[LTP] [PATCH 1/2] read_all: Add worker timeout

Richard Palethorpe rpalethorpe@suse.de
Mon Jul 18 12:09:21 CEST 2022


Hello Li,

Li Wang <liwang@redhat.com> writes:

> Hi Richard,
>
> On Tue, Jul 12, 2022 at 8:46 PM Richard Palethorpe <rpalethorpe@suse.com> wrote:
>
>  Kill and restart workers that take too long to read a file. The
>  default being one second. A custom time can be set with the new -t
>  option.
>
>  This is to prevent a worker from blocking forever in a read. Currently
>  when this happens the whole test times out and any remaining files in
>  the worker's queue are not tested.
>
>  As a side effect we can now also set the timeout very low to cause
>  partial reads.
>
> Restarting workers that are blocked reading a file is a practical approach.
> My only concern is that a restarted worker may also be slow reading some
> specific files, so it will time out again and the rest of the run will
> fall into restart cycles. Here I'd suggest loosening worker_timeout to
> 3000 ms to accommodate systems with different IO speeds.

I didn't observe any issues when setting the timeout very low
(<100ms). Workers remove an item from the queue before trying to read
it, so they shouldn't get stuck in a restart cycle on the same file (if
that's what you meant).
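
Roughly, the loop looks like this (a simplified sketch, not the exact
read_all.c code; queue_pop(), now_ms() and read_test() are stand-ins for
the real helpers):

#include <limits.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the real read_all.c helpers. */
struct queue;
bool queue_pop(struct queue *q, char *path);
unsigned long long now_ms(void);
void read_test(const char *path);

struct worker {
        struct queue *q;
        unsigned long long last_seen;
};

/* The path is removed from the queue before the read, so when the parent
 * kills a worker that blocked in read_test() and forks a replacement, the
 * replacement carries on with the next queue entry instead of retrying the
 * file that blocked. */
static void worker_loop(struct worker *self)
{
        char path[PATH_MAX];

        while (queue_pop(self->q, path)) {
                self->last_seen = now_ms();
                read_test(path);
        }
}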

>
> Though each worker takes at most 3 seconds, the total time should not
> exceed 45s (3 sec * max_workers), unless the files are tested
> sequentially in the worst case. So bumping up max_runtime may also be
> helpful.
>
> Anyway, I'd be happy to apply your patch first to see how things go.
> Reviewed-by: Li Wang <liwang@redhat.com>
>
> --- a/testcases/kernel/fs/read_all/read_all.c
> +++ b/testcases/kernel/fs/read_all/read_all.c
> @@ -64,7 +64,7 @@ struct queue {
>  struct worker {
>         pid_t pid;
>         struct queue *q;
> -       int last_seen;
> +       unsigned long long last_seen;
>  };
>  
>  enum dent_action {
> @@ -86,7 +86,7 @@ static long max_workers = 15;
>  static struct worker *workers;
>  static char *drop_privs;
>  static char *str_worker_timeout;
> -static int worker_timeout = 1000;
> +static int worker_timeout = 3000;
>  
>  static char *blacklist[] = {
>         NULL, /* reserved for -e parameter */
> @@ -552,5 +552,6 @@ static struct tst_test test = {
>         .cleanup = cleanup,
>         .test_all = run,
>         .forks_child = 1,
> +       .max_runtime = 60,
>  };

I'm not sure if this is better or worse. In my /sys directory there are
approximately 35K files. Obviously a timeout of even 1s is far too long to
read all of them if they all time out. In fact, even with the timeout set
to 100ms, they would still require about 58 minutes.
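
For reference, the back-of-the-envelope numbers (assuming every one of the
~35K files hits the timeout):

#include <stdio.h>

int main(void)
{
        const double nfiles = 35000, timeout_s = 0.1, nworkers = 15;

        /* Every read served serially: ~58 minutes. */
        printf("serial: %.0f min\n", nfiles * timeout_s / 60.0);
        /* Perfectly parallel across the default 15 workers: ~4 minutes. */
        printf("parallel: %.1f min\n", nfiles * timeout_s / nworkers / 60.0);

        return 0;
}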

Of course most of them take a very short time on most systems. I guess that
ideally the timeout needs to be adapted as the test runs, so that all
remaining files can still be read within the runtime. The issue with that,
though, is that a kill+wait+fork cycle takes more time than most reads.
Although that overhead can be measured as well...
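
Something like the following could work; this is purely a sketch with
made-up names (budget_ms, restart_cost_ms and so on are not existing
read_all.c or LTP API):

#include <stddef.h>

/* Derive a per-file timeout from the remaining budget and the number of
 * queued files, and never let it drop below the measured cost of a
 * kill+wait+fork cycle, otherwise restarting a worker costs more than
 * simply waiting for the read to finish. */
static int adapt_timeout_ms(unsigned long long budget_ms, size_t files_left,
                            unsigned long long restart_cost_ms)
{
        unsigned long long per_file;

        if (!files_left)
                return (int)restart_cost_ms;

        per_file = budget_ms / files_left;

        if (per_file < restart_cost_ms)
                per_file = restart_cost_ms;

        return (int)per_file;
}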

This issue is quite similar to the ones we have with fuzzy sync. I think it
is perhaps time to start dynamically adapting to system performance. That's
probably best left to another patch series though.

-- 
Thank you,
Richard.
