[LTP] [PATCH 1/2] read_all: Add worker timeout

Mon Jul 18 15:01:49 CEST 2022

Hello,

Richard Palethorpe <rpalethorpe@suse.de> writes:

> Hello,
>
> Jan Stancek <jstancek@redhat.com> writes:
>
>> On Tue, Jul 12, 2022 at 2:46 PM Richard Palethorpe via ltp
>> <ltp@lists.linux.it> wrote:
>>>
>>> Kill and restart workers that take too long to read a file. The
>>> default being one second. A custom time can be set with the new -t
>>> option.
>>>
>>> This is to prevent a worker from blocking forever in a read. Currently
>>> when this happens the whole test times out and any remaining files in
>>> the worker's queue are not tested.
>>>
>>> As a side effect we can now also set the timeout very low to cause
>>> partial reads.
>>>
>>> Signed-off-by: Richard Palethorpe <rpalethorpe@suse.com>
>>> Cc: Joerg Vehlow <lkml@jv-coder.de>
>>> Cc: Li Wang <liwang@redhat.com>
>>> ---
>>>  testcases/kernel/fs/read_all/read_all.c | 83 ++++++++++++++++++++++++-
>>>  1 file changed, 82 insertions(+), 1 deletion(-)
>>
>>>
>>> +static void restart_worker(struct worker *const worker)
>>> +{
>>> +       int wstatus, ret, i, q_len;
>>> +       struct timespec now;
>>> +
>>> +       kill(worker->pid, SIGKILL);
>>> +       ret = waitpid(worker->pid, &wstatus, 0);
>>
>> Is there a chance we could get stuck in uninterruptible read? I think I saw some
>> in past, but those may be blacklisted already, so this may only be something
>> to watch for if we still get test timeouts in future.
>>
>
> I was hoping that kill is special somehow, but I suppose that I should
> check exactly what happens. If the process is stuck inside the kernel
> then we don't want to wait too long for it. We just need to know that
> the kill signal was delivered and that the process will not return to
> userland. If we have a large number of zombies then it could exhaust the
> PIDs or some other resource, but most reads are done very quickly and
> don't need interrupting.

AFAICT fatal signals are special, which means a reschedule event is sent
immediately on kill. However, I guess the IPI doesn't get delivered to
another CPU synchronously, not to mention that preemption might be
disabled, so we don't know what state everything is in by the time
kill() returns.

I guess some long-running kernel code will also continue to run after
the task has received the kill and been rescheduled, so waitpid() could
block for a longer period. In this case, though, we don't need to worry
about the process we are trying to kill returning to userland.

Regardless of whether all that is correct, we can mark the worker as
failed, call kill(), then call waitpid() with WNOHANG. If the worker has
been reaped we restart it, otherwise we return to the main loop. If we
didn't restart it, the next time we encounter that worker we can call
waitpid() again. After some number of failed attempts we can abandon
waiting on it and create a new process.
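
Roughly, the non-blocking reap could look like the sketch below. This is
only an illustration of the idea, not code from the patch: struct
worker_sketch, enum reap_result, try_reap_worker() and the
MAX_REAP_ATTEMPTS limit are all made-up names and values, and the real
struct worker in read_all.c has different fields.

```c
#define _POSIX_C_SOURCE 200809L /* for kill() and waitpid() declarations */

#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical limit; the right number would need tuning. */
#define MAX_REAP_ATTEMPTS 5

/* Hypothetical stand-in for the per-worker state in read_all.c. */
struct worker_sketch {
	pid_t pid;
	int kill_sent;      /* SIGKILL has already been sent */
	int reap_attempts;  /* waitpid() calls that found the child alive */
};

enum reap_result {
	REAP_DONE,      /* child is gone, safe to restart the worker */
	REAP_PENDING,   /* still alive, try again on the next pass */
	REAP_ABANDONED  /* gave up waiting, start a fresh process anyway */
};

/*
 * Send SIGKILL once, then poll with WNOHANG so the caller never blocks
 * on a child that is stuck inside the kernel.
 */
static enum reap_result try_reap_worker(struct worker_sketch *w)
{
	int wstatus;
	pid_t ret;

	/* kill() with pid 0 or -1 would hit far more than one worker. */
	assert(w->pid > 0);

	if (!w->kill_sent) {
		kill(w->pid, SIGKILL);
		w->kill_sent = 1;
	}

	ret = waitpid(w->pid, &wstatus, WNOHANG);

	/* Reaped now, or there is no such child left to reap. */
	if (ret == w->pid || (ret == -1 && errno == ECHILD))
		return REAP_DONE;

	if (++w->reap_attempts >= MAX_REAP_ATTEMPTS)
		return REAP_ABANDONED;

	return REAP_PENDING;
}
```

The main loop would call try_reap_worker() each time it visits a worker
marked as failed, restart the worker on REAP_DONE or REAP_ABANDONED, and
simply move on to the next worker on REAP_PENDING.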

-- 
Thank you,
Richard.