[LTP] [PATCH] sched_football: fix false failures on many-CPU systems
John Stultz
jstultz@google.com
Tue Apr 21 00:49:06 CEST 2026
On Wed, Apr 15, 2026 at 8:23 PM Li Wang <wangli.ahau@gmail.com> wrote:
> John Stultz <jstultz@google.com> wrote:
> > > > > > 1. RT throttling freezes all SCHED_FIFO threads simultaneously. On
> > > > > > release, the kernel does not always reschedule the highest-priority
> > > > > > thread first on every CPU, so offense briefly runs and increments
> > > > > > the_ball before defense is rescheduled. Fix by saving and disabling
> > > > > > sched_rt_runtime_us in setup and restoring it in a new cleanup
> > > > > > callback.
> > > >
> > > > Makes sense, and as the AI reviewer points out, LTP provides an option
> > > > to save_restore it automatically.
> >
> > Throttling shouldn't break the test. The fact that SCHED_NORMAL tasks
> > ran shouldn't change the ordering when we go back to running RT tasks.
> > This is likely a kernel bug.
>
> Theoretically speaking, your point is correct. While instant global
> priority ordering upon unthrottling is ideal, it ignores the physical realities
> of SMP architecture.
>
> If we look at do_sched_rt_period_timer(), CPUs are unthrottled sequentially
> via a for_each_cpu loop. When an early CPU in the loop is unlocked, it
> immediately schedules its local RT task 'offense'.
> See:
> https://elixir.bootlin.com/linux/v7.0/source/kernel/sched/rt.c#L797
>
> Meanwhile, subsequent CPUs in the loop (which may hold the higher-priority
> 'defense' task) are literally still throttled.
>
> Expecting atomic, zero-latency global unthrottling is physically unrealistic for
> multi-core systems.
>
> That's why I tend to believe disabling throttling for this specific test is the
> wise and practical approach.

Apologies, I'm still not sure I see it.

If, before throttling happens, there are NR_CPU high-priority (same
priority) defenders, and they are distributed across CPUs, thereby
preventing the NR_CPU lower-priority offensive tasks from running,
throttling should not change this distribution of the high-priority
tasks (because all the other CPUs already have their own high-priority
defender to run, so they wouldn't pull an equivalent-priority task
over). So when unthrottling, I don't see how the lower-priority
offensive task would be chosen.

It may very well be the case that lower-priority tasks do run, but I'd
contend that suggests there is a bug.
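
To make that reasoning concrete, here is a toy model in C. It is only a
sketch and not kernel code: struct toy_rq and all toy_* names are made up
for illustration, migration (push/pull) is deliberately not modelled, and
the priorities 30 and 15 are the ones quoted earlier in this thread.

```c
#include <assert.h>

#define TOY_NCPUS 4
#define TOY_MAX_TASKS 4
#define PRIO_DEFENSE 30	/* defense priority quoted in the thread */
#define PRIO_OFFENSE 15	/* offense priority quoted in the thread */

/* Hypothetical per-CPU runqueue: runnable RT tasks, stored by priority. */
struct toy_rq {
	int prio[TOY_MAX_TASKS];
	int nr;
	int throttled;
};

void toy_add(struct toy_rq *rq, int prio)
{
	assert(rq->nr < TOY_MAX_TASKS);
	rq->prio[rq->nr++] = prio;
}

/* Highest runnable priority on this CPU, or -1 if throttled or empty. */
int toy_pick(const struct toy_rq *rq)
{
	int best = -1;

	if (rq->throttled)
		return -1;
	for (int i = 0; i < rq->nr; i++)
		if (rq->prio[i] > best)
			best = rq->prio[i];
	return best;
}

/* One defender and one offense task on every CPU. */
void toy_setup_balanced(struct toy_rq rqs[], int ncpus)
{
	for (int cpu = 0; cpu < ncpus; cpu++) {
		rqs[cpu].nr = 0;
		rqs[cpu].throttled = 0;
		toy_add(&rqs[cpu], PRIO_DEFENSE);
		toy_add(&rqs[cpu], PRIO_OFFENSE);
	}
}

/* Same tasks, but CPU 1's defender has ended up on CPU 0. */
void toy_setup_skewed(struct toy_rq rqs[], int ncpus)
{
	assert(ncpus >= 2);
	toy_setup_balanced(rqs, ncpus);
	rqs[1].nr = 0;
	toy_add(&rqs[1], PRIO_OFFENSE);
	toy_add(&rqs[0], PRIO_DEFENSE);
}

/*
 * Throttle all CPUs, unthrottle them one at a time, and report whether
 * any CPU would ever pick an offense-priority task along the way.
 */
int toy_offense_runs_on_unthrottle(struct toy_rq rqs[], int ncpus)
{
	for (int cpu = 0; cpu < ncpus; cpu++)
		rqs[cpu].throttled = 1;

	for (int cpu = 0; cpu < ncpus; cpu++) {
		rqs[cpu].throttled = 0;
		for (int c = 0; c < ncpus; c++)
			if (toy_pick(&rqs[c]) == PRIO_OFFENSE)
				return 1;
	}
	return 0;
}
```

In this model, a balanced layout never lets offense run no matter in which
order CPUs are unthrottled; offense only gets picked if the defender
distribution was already skewed, which is the part I'd expect the RT
balancing logic to prevent.
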
> > > > > > 2. Offense and defense threads were unpinned, allowing the scheduler
> > > > > > to migrate them freely. An offense thread could land on a CPU with
> > > > > > no defense thread present and run unchecked. Fix by passing a CPU
> > > > > > index as the thread arg and calling sched_setaffinity() at thread
> > > > > > start. Pairs are distributed round-robin (i % ncpus) so each
> > > > > > offense thread shares its CPU with a defense thread.
> > > >
> > > > This is a good thought, as for SCHED_FIFO it manages the corresponding
> > > > runqueue for each CPU and simply picks the higher priority task to run.
> > > > So pinning the threads to each CPU makes sense, but maybe we could
> > > > only pin the defense because:
> > > >
> > > > With N defense threads pinned one per CPU, every CPU has a defense
> > > > thread at priority 30 permanently runnable. The offense threads at priority
> > > > 15, regardless of which CPU the scheduler places them on, will always find
> > > > a higher-priority defense thread on the same CPU's runqueue. Since
> > > > SCHED_FIFO strictly favors the higher-priority runnable task, offense can
> > > > never be picked.
> > > >
> > > > Pinning offense as well would be redundant: it doesn't matter where offense
> > > > lands, because defense already covers every CPU. This also has the advantage
> > > > of letting the scheduler freely migrate offense threads without affecting
> > > > the test outcome, which avoids interfering with the kernel's load balancing
> > > > logic during the test.
> > > >
> > > > And, I'd suggest using tst_ncpus_available() instead of get_numcpus()
> > > > when distributing defense threads across CPUs, in case some CPUs are
> > > > offline. Pinning a defense thread to an offline CPU would leave that
> > > > CPU uncovered and allow offense to run unchecked. See:
> >
> > I didn't see the original patch here, but the whole point of
> > sched_football is to ensure the top <num cpu> (unaffined) priority
> > tasks are always run and no lower priority rt tasks are run instead.
> >
> > So none of the tasks should be pinned to any cpus. The scheduler is
> > supposed to ensure the RT invariant holds.
> > There are some known bugs at the moment that will cause sched_football
> > to fail (the RT_PUSH_IPI feature, for instance). That's a problem with
> > the kernel, not the test.
>
> Apart from the known bug of RT_PUSH_IPI feature, it still does not
> guarantee 100% success in real scenarios.
>
> After a deep look into the RT scheduler principles, I found that
> the RT_PUSH_IPI mechanism is designed as a "best-effort"
> optimization rather than a guaranteed operation. As the kernel's
> scheduling state is highly dynamic and asynchronous, a push attempt
> will deliberately abort if the environment changes between the time the
> IPI is sent and when it is actually processed.

So again, RT_PUSH_IPI is known to break the RT invariant, so I'm not
sure pointing to it to justify the behavior is very compelling.

> It fails by design to prevent instability, primarily due to state expiration,
> CPU affinity restrictions, sudden priority inversions, or the lack of an
> eligible target CPU.
>
> See push_rt_task() in kernel/sched/rt.c:
> https://elixir.bootlin.com/linux/v7.0/source/kernel/sched/rt.c#L1939
>
> Hence, if we explicitly pin the defense thread to each CPU, it will join in
> the corresponding runqueues, which completely match the reasonable
> situation: the kernel's RT scheduler guarantees per-CPU priority ordering,
> not global placement. The RT load balancer is asynchronous and doesn't
> guarantee that all high-priority threads are placed before any low-priority
> thread runs.
>
> On the other side, if the test expects a global guarantee, it's testing
> something the kernel doesn't claim to provide.

Prior to RT_PUSH_IPI, I believe it did provide this functionality for
RT scheduling, and I recall a fair amount of effort went into the RT
scheduler to try to enforce the RT priority invariant.

Last I tried, disabling RT_PUSH_IPI resolved the failures I saw on SMP
systems, but that was a while back, so new issues may have cropped up.
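
For anyone wanting to repeat that experiment, a minimal sketch of turning
the feature off from C follows. It assumes debugfs is mounted, the caller
is root, and the scheduler features file lives at
/sys/kernel/debug/sched/features (older kernels used
/sys/kernel/debug/sched_features). The helper names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Assumed location of the scheduler features file on recent kernels. */
#define SCHED_FEATURES_PATH "/sys/kernel/debug/sched/features"

/*
 * Write one feature token (for example "NO_RT_PUSH_IPI") to the given
 * features file. Returns 0 on success or a negative errno value.
 */
int write_sched_feature(const char *path, const char *token)
{
	size_t len;
	ssize_t n;
	int fd, err;

	assert(path && token);
	len = strlen(token);

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;

	n = write(fd, token, len);
	if (n == (ssize_t)len)
		err = 0;
	else if (n < 0)
		err = -errno;
	else
		err = -EIO;	/* short write */

	close(fd);
	return err;
}

/* Convenience wrapper for the experiment described above. */
int disable_rt_push_ipi(void)
{
	return write_sched_feature(SCHED_FEATURES_PATH, "NO_RT_PUSH_IPI");
}
```

Writing "RT_PUSH_IPI" back through the same helper should re-enable it. A
negative return (for instance when debugfs is not mounted or the kernel
was built without the feature) just means the experiment can't be run on
that system.
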

Mostly I just want to make sure we're not papering over the test to
have it stop reporting a real correctness bug.

I'll grant that this real bug doesn't seem like something anyone is
prioritizing to resolve, so maybe it's not all that important. And I
agree there is a cost to having the sched_football test be strict and
always failing, since we potentially miss other variations of bugs and
problems that could be introduced. If you want to add a --per-cpu
argument or something to enable the pinning and re-scope the test, I'd
probably not object to that.
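
Roughly what I have in mind for such an opt-in mode, as a sketch only: the
per_cpu flag stands in for a --per-cpu style option, and nth_allowed_cpu()
and maybe_pin_self() are hypothetical helpers, not existing LTP functions.
With the flag off, the thread stays unaffined and the test keeps checking
the global invariant.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>

/*
 * Return the idx-th CPU (wrapping around) among the CPUs the calling
 * thread is currently allowed to run on, or a negative errno value.
 */
int nth_allowed_cpu(int idx)
{
	cpu_set_t allowed;
	int count, want;

	if (sched_getaffinity(0, sizeof(allowed), &allowed) == -1)
		return -errno;

	count = CPU_COUNT(&allowed);
	if (count == 0 || idx < 0)
		return -EINVAL;

	want = idx % count;
	for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
		if (!CPU_ISSET(cpu, &allowed))
			continue;
		if (want-- == 0)
			return cpu;
	}
	return -EINVAL;
}

/*
 * If per_cpu is set, pin the calling thread to the given CPU.
 * Otherwise do nothing, leaving placement entirely to the scheduler.
 * Returns 0 on success or a negative errno value.
 */
int maybe_pin_self(int per_cpu, int cpu)
{
	cpu_set_t set;

	if (!per_cpu)
		return 0;

	assert(cpu >= 0 && cpu < CPU_SETSIZE);
	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	if (sched_setaffinity(0, sizeof(set), &set) == -1)
		return -errno;
	return 0;
}
```

A defense thread with index i would call
maybe_pin_self(per_cpu, nth_allowed_cpu(i)) at thread start. Picking from
the allowed set rather than from 0..ncpus-1 is one way to avoid targeting
offline or disallowed CPUs.
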

But I really would like to make sure that the *specific thing* the
test was written to check isn't lost just so the results can stop
showing failures.

> As for the rest of the changes, the current test has several non-kernel issues
> (RT throttling interference, uninitialized game_over on reruns, silent
> sched_setscheduler failures) that produce false failures. These mask
> the real kernel bugs you want to detect.

Again, I wasn't sent the patch that started this thread, so I don't
have any strong objections.
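
For reference, the RT throttling item in that list amounts to a
save/disable/restore of sched_rt_runtime_us around the test. The sketch
below shows the idea in plain C; the helper names are hypothetical, the
path is taken as a parameter so it can be tried on an ordinary file, and
in the real test LTP's own save_restore mechanism mentioned earlier in the
thread is presumably the better fit.

```c
#include <assert.h>
#include <stdio.h>

/* Sysctl that bounds RT runtime per period; -1 means no throttling. */
#define RT_RUNTIME_PATH "/proc/sys/kernel/sched_rt_runtime_us"

/* Read a single integer from a sysctl-style file. 0 on success, -1 on error. */
int read_long(const char *path, long *val)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%ld", val) == 1);
	fclose(f);
	return ok ? 0 : -1;
}

/* Write a single integer to a sysctl-style file. 0 on success, -1 on error. */
int write_long(const char *path, long val)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = (fprintf(f, "%ld\n", val) > 0);
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/* Setup step: remember the current value in *saved, then write -1. */
int rt_throttle_disable(const char *path, long *saved)
{
	assert(path && saved);
	if (read_long(path, saved))
		return -1;
	return write_long(path, -1);
}

/* Cleanup step: put the remembered value back. */
int rt_throttle_restore(const char *path, long saved)
{
	return write_long(path, saved);
}
```

Setup would call rt_throttle_disable(RT_RUNTIME_PATH, &saved) and cleanup
rt_throttle_restore(RT_RUNTIME_PATH, saved); both need root on a real
system.
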
thanks
-john