[LTP] [PATCH v3] thermal: add new test group
Wysocki, Rafael J
rafael.j.wysocki@intel.com
Tue Feb 3 15:38:38 CET 2026
On 1/30/2026 12:24 AM, Petr Vorel wrote:
>> On 1/23/2026 7:28 PM, Petr Vorel wrote:
>>> Hi Piotr,
>>>>>> Then it
>>>>>> + * decreases the threshold for sending a thermal interrupt to just above
>>>>>> + * the current temperature and runs a workload on the CPU.
>>>>> First, why does the test need to run for 30 sec and then sleep for 10 sec?
>>> Maybe the most important of my questions / points.
>>>> Here the point is to use a decreasing timeout. The test starts with 10s
>>>> cooldown to make sure that even pre-production CPUs, which might have
>>>> their thermal protections disabled, cool down properly. Once sleep time
>>>> reaches 0, the conclusion is that either there was not enough workload
>>>> or somehow interrupts are not triggered after all.
>>> Why 30 sec and then sleep for 10 sec? Is it really needed to do it this way?
>> Of course not.
>
>>> Don't these times depend on the tested machine? Some of them will fail
>>> because the workload did not run long enough,
>> That's unexpected with the numbers that are used, so something is amiss if
>> it fails (and so it should fail).
> I tested on a very old (~15 years) ThinkPad, a quite powerful 3-year-old
> ThinkPad, and a random machine somewhere between these two. All of them detect
> the threshold in less than 10% of the time (heating CPU runtime 1s and sleep
> time 1s).
>
> You go from 30s down to smaller values. Wouldn't it be faster to go the opposite
> way (start with small values and increase)? The diff below runs successfully on all
> 3 machines. What am I missing?
That can be done too, the question is when to decide that the thermal
sensor does not work in the case of no response.
The 30 s of running the workload continuously is kind of a "worst case"
choice, but a smaller value can be used to start with. If a system that
needs a longer time is found, the test can be updated I suppose.
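A minimal sketch of that increasing schedule, with an upper bound that decides
when to give up on the sensor. Everything in it is illustrative: heat_until(),
the heat_fn callback, struct fake_cpu and the RUN_TIME_* limits are hypothetical
names and values, not code from the patch. The fake sensor only stands in for
cpu_workload() plus a temperature read, so the loop can be exercised without
hardware.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical limits; the 30 s cap mirrors the "worst case" discussed above. */
#define RUN_TIME_START 1.0
#define RUN_TIME_STEP  3.0
#define RUN_TIME_MAX   30.0

/* Runs a workload for run_time seconds, returns the temperature afterwards. */
typedef int (*heat_fn)(double run_time, void *ctx);

/*
 * Start with a short run and lengthen it on every attempt. Returns the
 * total workload time spent once the temperature exceeds temp_high, or
 * -1.0 if even the longest run never crossed the threshold.
 */
static double heat_until(int temp_high, heat_fn heat, void *ctx)
{
	double run_time, total = 0.0;

	assert(heat != NULL);

	for (run_time = RUN_TIME_START; run_time <= RUN_TIME_MAX;
	     run_time += RUN_TIME_STEP) {
		int temp = heat(run_time, ctx);

		total += run_time;
		if (temp > temp_high)
			return total;
	}

	return -1.0;
}

/* Simulated CPU: gains a fixed number of millidegrees per second of load. */
struct fake_cpu {
	int temp;
	int gain_per_sec;
};

static int fake_heat(double run_time, void *ctx)
{
	struct fake_cpu *cpu = ctx;

	cpu->temp += (int)(run_time * cpu->gain_per_sec);
	return cpu->temp;
}
```

With this shape a responsive system finishes after the first short runs, and
only a system that never crosses the threshold pays for the whole schedule.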
>>> others will waste time (if they get the interrupt e.g. in 10 sec).
>> That very well may happen, but is it a big deal?
> We try to cut down the test runtime, because LTP test collection is huge
> and its runtime is many hours [1]. For example, we have many CVE tests which
> detect race conditions. Instead of running each test for a "safe, long enough"
> time, which could be several minutes for many of them, we have a way to shorten
> the time (see include/tst_fuzzy_sync.h).
Fair enough.
> [1] https://linux-test-project.readthedocs.io/en/latest/developers/ground_rules.html#why-is-sleep-in-tests-bad-then
>
>>> The usual approach would be to have the timeout safe enough for any type
>>> of hardware but proactively check the temperature and stop testing once it's
>>> done.
>> We want to create conditions in which the temperature should rise and if it
>> doesn't, then there is a problem.
> Sure.
>
>> That said, the temperature can of course be checked more proactively, at
>> least in principle, like say run cpu_workload() for 1s, check the
>> temperature, repeat that several times, then cool down etc.
> Yeah, that's kind of my proposal above.
>
> Also, all of my 3 machines have only 1x x86_pkg_temp type, but I suppose there
> are devices with more (I was not able to figure that out from
> drivers/thermal/intel/x86_pkg_temp_thermal.c but otherwise the test would not
> try to test them all). But why is it important to test them all? Isn't it enough
> just to test a single one?
The thermal sensors in question are per processor package. With
multiple packages in a system, if any of them does not work as expected,
we want to know.
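As a rough illustration of walking every package sensor rather than only the
first one, assuming the zones show up as thermal_zoneN directories whose "type"
file reads x86_pkg_temp. The helper names is_pkg_temp_type() and
count_pkg_temp_zones() are hypothetical, and stopping at the first missing zone
index is a simplifying assumption, not something the test is known to rely on.

```c
#include <stdio.h>
#include <string.h>

#define PKG_TEMP_TYPE "x86_pkg_temp"

/* Returns 1 if a "type" file line names the package sensor, else 0. */
static int is_pkg_temp_type(const char *type)
{
	/* sysfs attributes end with a newline; compare without it. */
	size_t len = strcspn(type, "\n");

	return len == strlen(PKG_TEMP_TYPE) &&
	       !strncmp(type, PKG_TEMP_TYPE, len);
}

/*
 * Counts thermal zones under base (e.g. "/sys/class/thermal") whose
 * type matches. Assumes zone indices are contiguous and stops at the
 * first one that cannot be opened.
 */
static int count_pkg_temp_zones(const char *base)
{
	char path[256], type[64];
	int i, count = 0;

	for (i = 0; ; i++) {
		FILE *f;

		snprintf(path, sizeof(path), "%s/thermal_zone%d/type", base, i);
		f = fopen(path, "r");
		if (!f)
			break;

		if (fgets(type, sizeof(type), f) && is_pkg_temp_type(type))
			count++;

		fclose(f);
	}

	return count;
}
```

On a multi-package system the count would be expected to be greater than one,
and each matching zone would get its own pass of the test.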
BR, Rafael
>
>
> +++ testcases/kernel/thermal/thermal_interrupt_events.c
> @@ -117,8 +117,8 @@ static void *cpu_workload(double run_time)
>  static void test_zone(int i)
>  {
>  	char path[NAME_MAX], temp_path[NAME_MAX];
> -	int sleep_time = SLEEP, temp_high, temp;
> -	double run_time = RUNTIME;
> +	int sleep_time = 1, temp_high, temp;
> +	double run_time = 1;
>
>  	snprintf(path, NAME_MAX, "/sys/class/thermal/thermal_zone%d/", i);
>  	strncpy(temp_path, path, NAME_MAX);
> @@ -138,7 +138,7 @@ static void test_zone(int i)
>  	SAFE_FILE_SCANF(trip_path, "%d", &trip);
>  	SAFE_FILE_PRINTF(trip_path, "%d", temp_high);
>
> -	while (sleep_time > 0) {
> +	while (sleep_time < SLEEP) {
>  		tst_res(TDEBUG, "Running for %f seconds, then sleeping for %d seconds", run_time, sleep_time);
>
>  		for (int j = 0; j < nproc; j++) {
> @@ -155,8 +155,8 @@ static void test_zone(int i)
>
>  		if (temp > temp_high)
>  			break;
> -		sleep(sleep_time--);
> -		run_time -= 3;
> +		sleep(sleep_time++);
> +		run_time += 3;
>  	}
>
>  	if (temp <= temp_high)