[LTP] [PATCH v3] thermal: add new test group

Wysocki, Rafael J rafael.j.wysocki@intel.com
Tue Feb 3 15:38:38 CET 2026


On 1/30/2026 12:24 AM, Petr Vorel wrote:
>> On 1/23/2026 7:28 PM, Petr Vorel wrote:
>>> Hi Piotr,
>>>>>> Then it
>>>>>> + * decreases the threshold for sending a thermal interrupt to just above
>>>>>> + * the current temperature and runs a workload on the CPU.
>>>>> First, why does the test need to run for 30 sec and then sleep for 10 sec?
>>> Maybe the most important of my questions / points.
>>>> Here the point is to use a decreasing timeout. The test starts with a 10 s
>>>> cooldown to make sure that even pre-production CPUs, which might have
>>>> their thermal protections disabled, cool down properly. Once the sleep time
>>>> reaches 0, the conclusion is that either there was not enough workload
>>>> or somehow interrupts are not triggered after all.
>>> Why 30 sec and then sleep for 10 sec? Is it really needed to do it this way?
>> Of course not.
>
>>> Don't these times depend on the tested machine? Some machines will fail because
>>> the workload did not run long enough,
>> That's unexpected with the numbers that are used, so something is amiss if
>> it fails (and so it should fail).
> I tested on a very old (~15 years) Thinkpad, a quite powerful 3-year-old Thinkpad,
> and a machine somewhere between these two. All detect the threshold in
> less than 10% of the time (heating CPU runtime 1s and sleep time 1s).
>
> You go from 30s down to smaller values. Wouldn't it be faster to go the opposite
> way (start with small values and increase)? The diff below runs successfully on all
> 3 machines. What am I missing?

That can be done too; the question is when to decide that the thermal
sensor does not work if there is no response.

The 30 s of running the workload continuously is kind of a "worst case"
choice, but a smaller value can be used to start with.  If a system that
needs a longer time is found, the test can be updated, I suppose.
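
One way to make the "when to give up" decision explicit is to bound the
increasing schedule up front. The following is only a sketch, not code from
the patch: build_ramp(), ramp_total() and the two MAX_* limits are
hypothetical names, and the 30 s / 10 s caps are assumed from the values
discussed in this thread. The step sizes (+3 s run time, +1 s sleep) mirror
the quoted diff.

```c
#include <assert.h>

/* Assumed caps, taken from the 30 s / 10 s figures discussed in the thread. */
#define MAX_RUN_TIME 30
#define MAX_SLEEP 10

struct ramp_step {
	int run_time;   /* seconds of CPU load in this round */
	int sleep_time; /* seconds of cooldown after the load */
};

/*
 * Fill steps[] with an increasing schedule that starts at 1 s / 1 s.
 * Each round adds 3 s of run time (capped) and 1 s of sleep time.
 * The schedule ends once the sleep time would exceed MAX_SLEEP.
 * Returns the number of steps written.
 */
static int build_ramp(struct ramp_step *steps, int max_steps)
{
	int n = 0, run = 1, nap = 1;

	assert(steps && max_steps > 0);

	while (n < max_steps && nap <= MAX_SLEEP) {
		steps[n].run_time = run;
		steps[n].sleep_time = nap;
		n++;
		run = run + 3 > MAX_RUN_TIME ? MAX_RUN_TIME : run + 3;
		nap++;
	}

	return n;
}

/* Worst-case seconds spent when the interrupt never fires. */
static int ramp_total(const struct ramp_step *steps, int n)
{
	int total = 0;

	for (int i = 0; i < n; i++)
		total += steps[i].run_time + steps[i].sleep_time;

	return total;
}
```

A test loop would walk the steps in order, break out as soon as the
temperature crosses the trip point, and report a failure only after the last
step. ramp_total() then gives the upper bound on the runtime of a zone that
never responds.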

>>> others will waste time (if they get the interrupt e.g. in 10 sec).
>> That very well may happen, but is it a big deal?
> We try to cut down the test runtime, because the LTP test collection is huge
> and its runtime is many hours [1]. For example, we have many CVE tests which
> detect race conditions. Instead of running each test for a "safely long enough"
> time, which could be e.g. several minutes for many of them, we have a way to
> shorten the time (see include/tst_fuzzy_sync.h).

Fair enough.


> [1] https://linux-test-project.readthedocs.io/en/latest/developers/ground_rules.html#why-is-sleep-in-tests-bad-then
>
>>> The usual approach would be to have the timeout safe enough for any type
>>> of hardware but proactively check the temperature and stop testing once it's
>>> done.
>> We want to create conditions in which the temperature should rise and if it
>> doesn't, then there is a problem.
> Sure.
>
>> That said, the temperature can of course be checked more proactively, at
>> least in principle, like say run cpu_workload() for 1s, check the
>> temperature, repeat that several times, then cool down etc.
> Yeah, that's kind of my proposal above.
>
> Also, all 3 of my machines have only one zone of the x86_pkg_temp type, but I
> suppose there are devices with more (I was not able to figure that out from
> drivers/thermal/intel/x86_pkg_temp_thermal.c, but otherwise the test would not
> try to test them all). But why is it important to test them all? Isn't it enough
> to test just a single one?

The thermal sensors in question are per processor package.  With 
multiple packages in a system, if any of them does not work as expected, 
we want to know.
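
As a rough illustration of that per-package rule, the result could be
aggregated as below. This is a sketch under assumptions, not the patch code:
is_pkg_temp_zone() and count_failed_packages() are hypothetical helpers, and
the type strings stand for what would be read from each zone's "type" file
under /sys/class/thermal/thermal_zone%d/.

```c
#include <assert.h>
#include <string.h>

#define PKG_TEMP_TYPE "x86_pkg_temp"

/*
 * Return 1 if the content of a zone's "type" file names a package sensor.
 * A trailing newline, as returned by sysfs reads, is ignored.
 */
static int is_pkg_temp_zone(const char *type)
{
	size_t len;

	assert(type);

	len = strcspn(type, "\n");
	return len == strlen(PKG_TEMP_TYPE) &&
	       !strncmp(type, PKG_TEMP_TYPE, len);
}

/*
 * Count package zones for which no thermal interrupt was observed.
 * Zones of other types are skipped. Any nonzero result means that
 * at least one package sensor did not behave as expected.
 */
static int count_failed_packages(const char *const types[],
				 const int interrupt_seen[], int n)
{
	int failed = 0;

	for (int i = 0; i < n; i++) {
		if (!is_pkg_temp_zone(types[i]))
			continue;
		if (!interrupt_seen[i])
			failed++;
	}

	return failed;
}
```

With two packages where only one raises the interrupt, the count is 1 and the
test as a whole should fail, even though a single-zone check could have passed.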

BR, Rafael


>
>
> +++ testcases/kernel/thermal/thermal_interrupt_events.c
> @@ -117,8 +117,8 @@ static void *cpu_workload(double run_time)
>   static void test_zone(int i)
>   {
>   			char path[NAME_MAX], temp_path[NAME_MAX];
> -			int sleep_time = SLEEP, temp_high, temp;
> -			double run_time = RUNTIME;
> +			int sleep_time = 1, temp_high, temp;
> +			double run_time = 1;
>   
>   			snprintf(path, NAME_MAX, "/sys/class/thermal/thermal_zone%d/", i);
>   			strncpy(temp_path, path, NAME_MAX);
> @@ -138,7 +138,7 @@ static void test_zone(int i)
>   			SAFE_FILE_SCANF(trip_path, "%d", &trip);
>   			SAFE_FILE_PRINTF(trip_path, "%d", temp_high);
>   
> -			while (sleep_time > 0) {
> +			while (sleep_time < SLEEP) {
>   				tst_res(TDEBUG, "Running for %f seconds, then sleeping for %d seconds", run_time, sleep_time);
>   
>   				for (int j = 0; j < nproc; j++) {
> @@ -155,8 +155,8 @@ static void test_zone(int i)
>   
>   				if (temp > temp_high)
>   					break;
> -				sleep(sleep_time--);
> -				run_time -= 3;
> +				sleep(sleep_time++);
> +				run_time += 3;
>   			}
>   
>   			if (temp <= temp_high)

