[LTP] [RFC] sched/starvation: Add disclaimer for virtualized/emulated environments

Petr Vorel pvorel@suse.cz
Mon Jan 27 16:35:42 CET 2025


Hi Alessandro,

> Hello Petr,
> Thanks a lot for having shared your position, please read mine inline.

> On Mon, Jan 20, 2025 at 3:25 PM Petr Vorel <pvorel@suse.cz> wrote:

> > Hi Alessandro,

> > > This patch adds a disclaimer message to the starvation test case, warning
> > > users against running the test in virtualized or emulated environments.
> > > The test produces expected results only on bare-metal systems and is prone
> > > to failure when executed in non-bare-metal setups.

> > > While detecting virtualization or emulation is possible in some cases,
> > > the methods are unreliable.
> > > Rather than attempting to prevent the test from running in such
> > > environments, this patch provides a warning to inform users of the
> > > limitations.

> > > Change:
> > > - Added a TINFO message to notify users that the test should be run
> > >   on bare-metal systems for meaningful results.

> > > Signed-off-by: Alessandro Carminati <acarmina@redhat.com>
> > > ---
> > >  testcases/kernel/sched/cfs-scheduler/starvation.c | 3 +++
> > >  1 file changed, 3 insertions(+)

> > > diff --git a/testcases/kernel/sched/cfs-scheduler/starvation.c b/testcases/kernel/sched/cfs-scheduler/starvation.c
> > > index c620c9c3e..b779a5f0a 100644
> > > --- a/testcases/kernel/sched/cfs-scheduler/starvation.c
> > > +++ b/testcases/kernel/sched/cfs-scheduler/starvation.c
> > > @@ -115,6 +115,9 @@ static void setup(void)
> > >       if (tst_has_slow_kconfig())
> > >               tst_brk(TCONF, "Skip test due to slow kernel configuration");

> > > +     tst_res(TINFO, "This test is designed to run on bare-metal systems. "
> > > +             "Running it in a virtualized or emulated environment may produce unreliable results.");

> > We should at least wrap it with tst_is_virt(), make it shorter and add WARNING:

> >         if (tst_is_virt(VIRT_ANY))
> >                 tst_res(TINFO, "WARNING: Running on a virtualized or emulated environment may produce unreliable results");

> > But OTOH I haven't seen any problem with it on various SLES versions or in
> > openSUSE Tumbleweed (which uses the latest mainline kernel). Therefore I would
> > not use TCONF, but just TINFO as you suggested (other tests which use
> > tst_is_virt() detection usually do TCONF).

> I understand the suggestion to wrap the test with tst_is_virt() and include
> a warning, but I have some reservations regarding the effectiveness of the
> virtualization check in LTP. Here's my perspective:

> 1. Effectiveness of the Virtualization Check: While I am aware of the
>    existing virtualization detection mechanism in LTP, I believe it is
>    not sufficiently robust to reliably identify virtualized environments
>    across all scenarios.

> 2. Systemd-based Detection: The systemd implementation used in LTP is
>    impressively accurate and often succeeds in detecting virtualized or
>    emulated environments. However, LTP cannot reasonably assume the
>    presence of systemd on every system.

I wonder whether Linaro or other embedded folks use distros without systemd.
So far nobody has complained. FYI we have non-systemd code in the shell API (in
testcases/lib/daemonlib.sh), but I doubt anybody is using it.
But if somebody is, there are 5 tests which depend on it working, so it would
then be worth fixing.

If it really needs to be fixed, there are a few possible ways:

Read the /sys/devices/virtual/dmi/id/sys_vendor content (requires CONFIG_DMIID=y,
I guess); see the sketch after this list of options:
$ cat /sys/devices/virtual/dmi/id/sys_vendor

Or use some code from lscpu:
https://git.kernel.org/pub/scm/utils/util-linux/util-linux.git/tree/sys-utils/lscpu-virt.c
(cpuid or dmi)

or systemd code:
https://github.com/systemd/systemd/tree/main/src/basic/virt.c

or from virt-what: http://git.annexia.org/?p=virt-what.git;a=summary

Or just look at the virtio disks:
$ ls -1 /dev/disk/by-path/ |grep virtio
virtio-pci-0000:04:00.0
virtio-pci-0000:04:00.0-part1
virtio-pci-0000:04:00.0-part2

Also, our machines load the qemu_fw_cfg kernel module.
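
For illustration, a minimal standalone sketch of such a DMI-based check. This
is not LTP API, just plain C reading the sysfs files mentioned above; it
assumes CONFIG_DMIID=y and a mounted sysfs, and the vendor strings are the
usual hypervisor names, not an exhaustive list:

/* Hypothetical sketch: report virtualization based on DMI vendor strings.
 * Assumes CONFIG_DMIID=y and a mounted sysfs; the vendor list is
 * illustrative, not exhaustive. */
#include <stdio.h>
#include <string.h>

static int dmi_says_virt(void)
{
	static const char *const files[] = {
		"/sys/devices/virtual/dmi/id/sys_vendor",
		"/sys/devices/virtual/dmi/id/bios_vendor",
		"/sys/devices/virtual/dmi/id/product_name",
	};
	static const char *const vendors[] = {
		"QEMU", "KVM", "VMware", "innotek GmbH", "Xen",
		"Bochs", "Parallels", "Microsoft Corporation",
	};
	char buf[256];
	unsigned int i, j;

	for (i = 0; i < sizeof(files) / sizeof(files[0]); i++) {
		FILE *f = fopen(files[i], "r");

		if (!f)
			continue;

		if (fgets(buf, sizeof(buf), f)) {
			for (j = 0; j < sizeof(vendors) / sizeof(vendors[0]); j++) {
				if (strstr(buf, vendors[j])) {
					fclose(f);
					return 1;
				}
			}
		}

		fclose(f);
	}

	return 0;
}

int main(void)
{
	printf("DMI suggests virtualization: %s\n",
	       dmi_says_virt() ? "yes" : "no");
	return 0;
}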

>    For example, embedded systems such as OpenWrt typically do not use
>    systemd in their distributions, making this approach less
>    universally applicable.

I wonder if anybody from the OpenWrt folks actually tests the code.

> 3. LTP's Built-in Functions: The functions within LTP for detecting
>    virtualization, such as is_kvm, are significantly less reliable.
>    For instance, is_kvm relies on detecting "QEMU" in the cpuinfo
>    string, which is often not present in many practical scenarios.
>    This limitation becomes particularly apparent in my typical use
>    case, where I test kernels in an emulated AArch64 QEMU environment.

Yes, on aarch64 openSUSE we have
BIOS Vendor ID:                       QEMU

but not on riscv64.

> Suggestions:
> To minimize changes while still addressing this issue, I propose adding
> a disclaimer in the test output, regardless of detection.
> This way, users reviewing the results are at least informed about the
> potential impact of virtualization on test reliability.

> If more extensive changes are feasible, I suggest improving the LTP
> detection functions.
> Specifically:
> * For non-x86 systems, introduce a mechanism based on device-tree
>   detection, as this can provide more accurate results for architectures
>    like AArch64.

Feel free to send a patch. Device tree is used on quite a lot of archs.
I would actually go with the detect_vm_device_tree() code from systemd.
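
For illustration, a rough standalone sketch along those lines. This is not the
systemd code itself, just the same idea in plain C: a KVM/Xen guest exposes
/proc/device-tree/hypervisor/compatible, and QEMU typically adds an fw-cfg
node:

/* Hypothetical sketch: device-tree based guest detection, loosely following
 * the idea of systemd's detect_vm_device_tree(). Assumes /proc/device-tree
 * exists (i.e. a device-tree based platform such as aarch64 or riscv64). */
#include <stdio.h>
#include <string.h>
#include <dirent.h>

static int dt_says_virt(void)
{
	FILE *f;
	DIR *d;
	struct dirent *e;
	char buf[64];
	size_t n;

	f = fopen("/proc/device-tree/hypervisor/compatible", "r");
	if (f) {
		/* The property may hold several NUL-separated strings;
		 * checking the first one is enough for this sketch. */
		n = fread(buf, 1, sizeof(buf) - 1, f);
		fclose(f);
		buf[n] = '\0';

		if (strstr(buf, "linux,kvm") || strstr(buf, "xen") ||
		    strstr(buf, "vmware"))
			return 1;
	}

	/* QEMU guests usually carry an fw-cfg node even without a
	 * hypervisor node. */
	d = opendir("/proc/device-tree");
	if (!d)
		return 0;

	while ((e = readdir(d))) {
		if (!strncmp(e->d_name, "fw-cfg", 6)) {
			closedir(d);
			return 1;
		}
	}
	closedir(d);

	return 0;
}

int main(void)
{
	printf("Device tree suggests virtualization: %s\n",
	       dt_says_virt() ? "yes" : "no");
	return 0;
}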

> * For x86 systems, utilize DMI-based detection, as the standardized
>   firmware interfaces in x86 make DMI a reliable method.

> > Any idea what can cause instability on virtualized environments? Which kernel
> > options could be affected?

> As mentioned earlier, I have encountered this issue in my typical test
> environment. I primarily work with aarch64 platforms.
> While my company provides access to large ARM64 servers, these are shared
> resources.
> Consequently, I need to prepare the servers each time I use them.

> To make my workflow faster, my initial target for testing is an emulated
> aarch64 environment running on x86 machines. This is followed by validation
> on a small native AArch64 machine sitting in my lab, which is based on the
> inexpensive RK3688.

> Regarding the issue, I have observed consistent test failures on the
> emulated machine, while on the native aarch64 virtualized KVM setup, the
> failures occur approximately 1 in 10 times.
> Importantly, I have never observed this issue on bare metal systems.

> Although I suspect that this problem might be less pronounced on large
> aarch64 servers with a higher core count, I believe this behavior is
> significant enough.

Thanks for the detailed explanation. I wonder what resource is missing on aarch64
or what kernel config causes it, because it works well on our virtualized
aarch64 (better said: it worked well, because now it's skipped due to
CONFIG_FAULT_INJECTION - IMHO there should be a more precise slow kernel detection).

Kind regards,
Petr

> > ATM the test is disabled due to the slow kernel config detection on
> > SLES/Tumbleweed (non-RT, tested on qemu), where it's actually working, and that
> > detection is not enough to catch the unstable results on the kernels you test.

> > I also sent a patch to remove CONFIG_LATENCYTOP from the options that mark a kernel as slow.

> > Kind regards,
> > Petr

