[LTP] [PATCH v3] thermal: add new test group

Fri Jan 30 00:24:54 CET 2026

> On 1/23/2026 7:28 PM, Petr Vorel wrote:
> > Hi Piotr,

> > > > > Then it
> > > > > + * decreases the threshold for sending a thermal interrupt to just
> > > > > above
> > > > > + * the current temperature and runs a workload on the CPU.
> > > > First, why test needs to run for 30 sec and then sleep for 10 sec?
> > Maybe the most important of my questions / points.

> > > Here the point is to use a decreasing timeout. The test starts with 10s
> > > cooldown to make sure that even pre-production CPU's, which might have
> > > their thermal protections disabled, cool down properly. Once sleep time
> > > reaches 0, the conclusion is that either there was not enough workload
> > > or somehow interrupts are not triggered after all.
> > Why 30 sec and then sleep for 10 sec? Is it really needed to do it this way?

> Of course not.


> > Aren't these times depending on the tested machine? Some of them will fail due
> > time not running enough,

> That's unexpected with the numbers that are used, so something is amiss if
> it fails (and so it should fail).

I tested on very old (~15 years) Thinkpad, quite powerful 3 years old Thinkpad
and some random machine kind of between these two. All detect the threshold with
less than 10% of time (heating CPU runtime 1s and sleep time 1s).

You go with 30s to smaller values. Wouldn't be faster to go the opposite
(start with small values and increase)? The diff below runs successfully on all
3 machines. What am I missing?

> > other will waste time (if they get interrupt e.g. in 10 sec).

> That very well may happen, but is it a big deal?

We try to cut down the test runtime, because LTP test collection is huge
and runtime for it is many hours [1]. For example, we have many CVE tests which
detect race condition. Instead of running each test for "safe long time enough"
which could be e.g. several minutes for many of them we have way to shorten the
time (see include/tst_fuzzy_sync.h).

[1] https://linux-test-project.readthedocs.io/en/latest/developers/ground_rules.html#why-is-sleep-in-tests-bad-then

> > The usual approach would be to have the timeout safe enough for any type
> > of hardware but proactively check the temperature and stop testing once it's
> > done.

> We want to create conditions in which the temperature should rise and if it
> doesn't, then there is a problem.

Sure.

> That said, the temperature can of course be checked more proactively, at
> least in principle, like say run cpu_workload() for 1s, check the
> temperature, repeat that several times, then cool down etc.

Yeah, that's kind of my proposal above.

Also, all of my 3 machines have only 1x x86_pkg_temp type, but I suppose there
are devices with more (I was not able to figure that out from
drivers/thermal/intel/x86_pkg_temp_thermal.c but otherwise the test would not
try to test them all). But why it is important to test them all? Isn't it enough
just to test a single one?

Kind regards,
Petr

> BR, Rafael


+++ testcases/kernel/thermal/thermal_interrupt_events.c
@@ -117,8 +117,8 @@ static void *cpu_workload(double run_time)
 static void test_zone(int i)
 {
 			char path[NAME_MAX], temp_path[NAME_MAX];
-			int sleep_time = SLEEP, temp_high, temp;
-			double run_time = RUNTIME;
+			int sleep_time = 1, temp_high, temp;
+			double run_time = 1;
 
 			snprintf(path, NAME_MAX, "/sys/class/thermal/thermal_zone%d/", i);
 			strncpy(temp_path, path, NAME_MAX);
@@ -138,7 +138,7 @@ static void test_zone(int i)
 			SAFE_FILE_SCANF(trip_path, "%d", &trip);
 			SAFE_FILE_PRINTF(trip_path, "%d", temp_high);
 
-			while (sleep_time > 0) {
+			while (sleep_time < SLEEP) {
 				tst_res(TDEBUG, "Running for %f seconds, then sleeping for %d seconds", run_time, sleep_time);
 
 				for (int j = 0; j < nproc; j++) {
@@ -155,8 +155,8 @@ static void test_zone(int i)
 
 				if (temp > temp_high)
 					break;
-				sleep(sleep_time--);
-				run_time -= 3;
+				sleep(sleep_time++);
+				run_time += 3;
 			}
 
 			if (temp <= temp_high)