[LTP] [RFC] sched/starvation: Add disclaimer for virtualized/emulated environments

Wed Jan 22 09:52:39 CET 2025

Hello Petr,
Thanks a lot for sharing your position; please find mine inline.

On Mon, Jan 20, 2025 at 3:25 PM Petr Vorel <pvorel@suse.cz> wrote:
>
> Hi Alessandro,
>
> > This patch adds a disclaimer message to the starvation test case, warning
> > users against running the test in virtualized or emulated environments.
> > The test produces expected results only on bare-metal systems and is prone
> > to failure when executed in non-bare-metal setups.
>
> > While detecting virtualization or emulation is possible in some cases,
> > the methods are unreliable.
> > Rather than attempting to prevent the test from running in such
> > environments, this patch provides a warning to inform users of the
> > limitations.
>
> > Change:
> > - Added a TINFO message to notify users that the test should be run
> >   on bare-metal systems for meaningful results.
>
> > Signed-off-by: Alessandro Carminati <acarmina@redhat.com>
> > ---
> >  testcases/kernel/sched/cfs-scheduler/starvation.c | 3 +++
> >  1 file changed, 3 insertions(+)
>
> > diff --git a/testcases/kernel/sched/cfs-scheduler/starvation.c b/testcases/kernel/sched/cfs-scheduler/starvation.c
> > index c620c9c3e..b779a5f0a 100644
> > --- a/testcases/kernel/sched/cfs-scheduler/starvation.c
> > +++ b/testcases/kernel/sched/cfs-scheduler/starvation.c
> > @@ -115,6 +115,9 @@ static void setup(void)
> >       if (tst_has_slow_kconfig())
> >               tst_brk(TCONF, "Skip test due to slow kernel configuration");
>
> > +     tst_res(TINFO, "This test is designed to run on bare-metal systems. "
> > +             "Running it in a virtualized or emulated environment may produce unreliable results.");
>
> We should at least wrap it with tst_is_virt(), make it shorter and add WARNING:
>
>         if (tst_is_virt(VIRT_ANY))
>                 tst_res(TINFO, "WARNING: Running on a virtualized or emulated environment may produce unreliable results");
>
> But OTOH I haven't seen any problem with it on various SLES versions nor in
> openSUSE Tumbleweed (which uses latest mainline kernel). Therefore I would not
> put TCONF, but just TINFO as you suggested (other tests which use tst_is_virt()
> detection usually do TCONF).

I understand the suggestion to wrap the message in tst_is_virt() and mark
it as a warning, but I have some reservations about the effectiveness of
the virtualization check in LTP. Here's my perspective:

1. Effectiveness of the Virtualization Check: While I am aware of the
   existing virtualization detection mechanism in LTP, I believe it is
   not sufficiently robust to reliably identify virtualized environments
   across all scenarios.

2. Systemd-based Detection: The systemd implementation used in LTP is
   impressively accurate and often succeeds in detecting virtualized or
   emulated environments. However, LTP cannot reasonably assume the
   presence of systemd on every system.
   For example, embedded distributions such as OpenWrt typically do not
   ship systemd, making this approach less universally applicable.

3. LTP's Built-in Functions: The functions within LTP for detecting
   virtualization, such as is_kvm, are significantly less reliable.
   For instance, is_kvm relies on detecting "QEMU" in the cpuinfo
   string, which is often absent in practice.
   This limitation becomes particularly apparent in my typical use
   case, where I test kernels in an emulated AArch64 QEMU environment.

Suggestions:
To minimize changes while still addressing this issue, I propose adding
a disclaimer in the test output, regardless of detection.
This way, users reviewing the results are at least informed about the
potential impact of virtualization on test reliability.

If more extensive changes are feasible, I suggest improving the LTP
detection functions.
Specifically:
* For non-x86 systems, introduce a mechanism based on device-tree
  detection, as this can provide more accurate results for architectures
  like AArch64.
* For x86 systems, utilize DMI-based detection, as the standardized
  firmware interfaces in x86 make DMI a reliable method.
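
A rough sketch of what the two checks could look like follows. It makes
several assumptions: the function names are hypothetical, the marker
list is illustrative and far from complete, and "linux,dummy-virt" is
the root compatible string I expect from QEMU's "virt" machine. Other
hypervisors and machine types would need their own entries.

```c
#include <stdio.h>
#include <string.h>

/* Illustrative marker list only; real DMI vendor strings vary widely. */
static const char *const virt_markers[] = {
	"QEMU", "KVM", "VMware", "VirtualBox", "Xen", "Bochs",
};

/* Return 1 if a DMI string (e.g. sys_vendor) contains a known marker. */
static int dmi_string_looks_virtual(const char *s)
{
	size_t i;

	for (i = 0; i < sizeof(virt_markers) / sizeof(virt_markers[0]); i++) {
		if (strstr(s, virt_markers[i]))
			return 1;
	}
	return 0;
}

/*
 * A device-tree "compatible" property is a sequence of NUL-terminated
 * strings. Return 1 if one entry equals needle exactly, else 0.
 */
static int dt_compatible_has(const char *buf, size_t len, const char *needle)
{
	size_t off = 0;

	while (off < len) {
		const char *end = memchr(buf + off, '\0', len - off);
		size_t n = end ? (size_t)(end - (buf + off)) : len - off;

		if (n == strlen(needle) && !memcmp(buf + off, needle, n))
			return 1;
		off += n + 1;
	}
	return 0;
}

/* Read up to size - 1 bytes; return the byte count, or 0 on failure. */
static size_t read_file(const char *path, char *buf, size_t size)
{
	FILE *f = fopen(path, "r");
	size_t n;

	if (!f)
		return 0;
	n = fread(buf, 1, size - 1, f);
	buf[n] = '\0';
	fclose(f);
	return n;
}

/* Return 1 if DMI or the device tree hints at a virtual machine. */
static int guess_virt(void)
{
	char buf[256];
	size_t n;

	n = read_file("/sys/class/dmi/id/sys_vendor", buf, sizeof(buf));
	if (n && dmi_string_looks_virtual(buf))
		return 1;

	n = read_file("/proc/device-tree/compatible", buf, sizeof(buf));
	if (n && dt_compatible_has(buf, n, "linux,dummy-virt"))
		return 1;

	return 0;
}
```

Neither source requires systemd, which is the point of the proposal, but
both can still miss custom machine types or hypervisors that hide their
identity.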

>
> Any idea what can cause instability on virtualized environments? Which kernel
> options could be affected?

As mentioned earlier, I have encountered this issue in my typical test
environment. I primarily work with aarch64 platforms.
While my company provides access to large ARM64 servers, these are shared
resources.
Consequently, I need to prepare the servers each time I use them.

To make my workflow faster, my initial target for testing is an emulated
aarch64 environment running on x86 machines. This is followed by validation
on a small native AArch64 machine sitting in my lab, which is based on the
inexpensive RK3688.

Regarding the issue, I have observed consistent test failures on the
emulated machine, while in a KVM guest on native aarch64 hardware the
test fails in roughly 1 out of 10 runs.
Importantly, I have never observed this issue on bare-metal systems.

Although I suspect that this problem might be less pronounced on large
aarch64 servers with a higher core count, I believe this behavior is
significant enough to warrant the disclaimer.

> ATM test is disabled due slow kernel config detection
> on SLES/Tumbleweed (non-RT, tested on qemu) where it's working and this is not
> enough to detect unstable results on the kernels you test.
>
> I also send a patch to remove CONFIG_LATENCYTOP as option causing slow kernel.
>
> Kind regards,
> Petr
>
> > +
> >       tst_set_runtime(timeout);
> >  }
>
