[LTP] [PATCH 1/2] read_all: Add worker timeout

Thu Jul 14 08:58:30 CEST 2022

On Tue, Jul 12, 2022 at 2:46 PM Richard Palethorpe via ltp
<ltp@lists.linux.it> wrote:
>
> Kill and restart workers that take too long to read a file. The
> default being one second. A custom time can be set with the new -t
> option.
>
> This is to prevent a worker from blocking forever in a read. Currently
> when this happens the whole test times out and any remaining files in
> the worker's queue are not tested.
>
> As a side effect we can now also set the timeout very low to cause
> partial reads.
>
> Signed-off-by: Richard Palethorpe <rpalethorpe@suse.com>
> Cc: Joerg Vehlow <lkml@jv-coder.de>
> Cc: Li Wang <liwang@redhat.com>
> ---
>  testcases/kernel/fs/read_all/read_all.c | 83 ++++++++++++++++++++++++-
>  1 file changed, 82 insertions(+), 1 deletion(-)

>
> +static void restart_worker(struct worker *const worker)
> +{
> +       int wstatus, ret, i, q_len;
> +       struct timespec now;
> +
> +       kill(worker->pid, SIGKILL);
> +       ret = waitpid(worker->pid, &wstatus, 0);

Is there a chance we could get stuck in uninterruptible read? I think I saw some
in past, but those may be blacklisted already, so this may only be something
to watch for if we still get test timeouts in future.

> +
> +       if (ret != worker->pid) {
> +               tst_brk(TBROK | TERRNO, "waitpid(%d, ...) = %d",
> +                       worker->pid, ret);
> +       }
> +
> +       /* Make sure the queue length and semaphore match. Threre is a
> +        * race in qeuue_pop where the semaphore can be decremented
^^ typo in queue_pop above

> +        * then the worker killed before updating q->front
> +        */
> +       q_len = 0;
> +       i = worker->q->front;
> +       while (i != worker->q->back) {
> +               if (!worker->q->data[i])
> +                       q_len++;
> +
> +               i = (i + 1) % QUEUE_SIZE;
> +       }
> +
> +       ret = sem_destroy(&worker->q->sem);
> +       if (ret == -1)
> +               tst_brk(TBROK | TERRNO, "sem_destroy");
> +       ret = sem_init(&worker->q->sem, 1, q_len);
> +       if (ret == -1)
> +               tst_brk(TBROK | TERRNO, "sem_init");
> +
> +       tst_clock_gettime(CLOCK_MONOTONIC_RAW, &now);
> +       worker->last_seen = tst_timespec_to_ms(now);
> +       worker->pid = SAFE_FORK();
> +
> +       if (!worker->pid)
> +               exit(worker_run(worker));
> +}
> +
>  static void work_push_retry(int worker, const char *buf)
>  {
>         int i, ret, worker_min, worker_max, usleep_time = 100;
> +       struct timespec now;
> +       int elapsed;
>
>         if (worker < 0) {
>                 /* pick any, try -worker first */
> @@ -299,10 +357,25 @@ static void work_push_retry(int worker, const char *buf)
>         i = worker_min;
>
>         for (;;) {
> -               ret = queue_push(workers[i].q, buf);
> +               struct worker *const w = workers + i;
> +
> +               ret = queue_push(w->q, buf);
>                 if (ret == 1)
>                         break;
>
> +               tst_clock_gettime(CLOCK_MONOTONIC_RAW, &now);
> +               elapsed =
> +                       tst_timespec_to_ms(now) - tst_atomic_load(&w->last_seen);
> +
> +               if (elapsed > worker_timeout) {
> +                       if (!quiet) {
> +                               tst_res(TINFO,
> +                                       "Worker %d (%d) stuck for %dms, restarting it",
> +                                       i, w->pid, elapsed);

Can we also print file worker is stuck on?
And as Li pointed out, I'd also extend max_runtime to 60 seconds.

Acked-by: Jan Stancek <jstancek@redhat.com>