[LTP] min_free_kbytes: Handle transient memory drops in check_monitor
linuxtestproject.agent@gmail.com
Tue Jun 2 06:02:55 CEST 2026
Hi Wei,
Thanks for the patch. A few comments below.
> min_free_kbytes: Handle transient memory drops in check_monitor
>
> High memory pressure can cause MemFree to temporarily drop below the
> min_free_kbytes threshold before the kernel reclaimer can catch up.
> This results in intermittent test failures, observed on openQA aarch64
> virtual machines.
>
> Implement a 2-second grace period with high-accuracy 10ms fixed polling
> in check_monitor() to allow the kernel time to reclaim memory.
>
> Introduce a 10% tolerance (90% threshold) for the MemFree check. My
> measurements showed that under extreme pressure, MemFree can dip as low
> as ~50% to ~70% of the target. While it typically recovers above 90%
> within one second, hitting the exact 100% watermark sometimes can take
> significantly longer. This tolerance prevents false positives during the
> slow recovery tail while still ensuring memory is maintained near the
> required level.
Please rewrite in impersonal form, e.g.:
"Measurements under extreme pressure show MemFree can dip as low as
~50% to ~70% of the target."
Commit messages become a permanent part of the project history and
should avoid first-person language.
> -unsigned long tune;
> -unsigned long memfree;
> +unsigned long tune, threshold;
> +unsigned long memfree, min_memfree;
> +int i;
>
> while (!end) {
> memfree = SAFE_READ_MEMINFO("MemFree:");
> tune = TST_SYS_CONF_LONG_GET(MIN_FREE_KBYTES);
> +/*
> + * Allow 10% tolerance to account for transient states.
> + */
> +threshold = tune * 9 / 10;
>
> if (memfree < tune) {
> -tst_res(TINFO, "MemFree is %lu kB, "
> - "min_free_kbytes is %lu kB", memfree, tune);
> -tst_res(TFAIL, "MemFree < min_free_kbytes");
> +min_memfree = memfree;
> +/*
> + * Give it some time to reclaim. The kernel should keep
> + * MemFree above min_free_kbytes, but transient drops
> + * are possible under high pressure.
> + * Check every 10ms for up to 2 seconds for high accuracy.
> + */
> +for (i = 10; i <= 2000; i += 10) {
> +usleep(10000);
> +memfree = SAFE_READ_MEMINFO("MemFree:");
> +if (memfree < min_memfree)
> +min_memfree = memfree;
> +
> +if (memfree >= tune)
> +break;
> +}
The inner polling loop does not check the 'end' flag, which is set to 1
by the SIGUSR1 handler when the parent finishes. In the original code,
the single sleep(2) at the bottom of the outer loop was interrupted by
the signal (sleep() returns the unslept time; usleep() fails with
EINTR), so the outer while (!end) check fired promptly. With the new
grace period loop, the signal can arrive during any of the 200 inner
usleep(10000) calls; that one call returns early but the loop keeps
polling, so the outer while (!end) check is not reached until MemFree
recovers or the full ~2 seconds elapse.
Please add an early exit:
	for (i = 10; i <= 2000; i += 10) {
		if (end)
			return;
		usleep(10000);
		memfree = SAFE_READ_MEMINFO("MemFree:");
		if (memfree < min_memfree)
			min_memfree = memfree;
		if (memfree >= tune)
			break;
	}
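A plain `break` is not a good substitute here: it would fall through to
the result reporting after the loop and could emit a spurious TFAIL for
a run that is already shutting down. `return` avoids that, provided
check_monitor() has nothing to clean up after the outer loop.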
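For illustration, the same pattern as a standalone sketch outside the
LTP framework. Everything here is hypothetical scaffolding rather than
code from the patch: poll_memfree(), the poll_result enum and
fake_memfree() (a stand-in for SAFE_READ_MEMINFO("MemFree:") that
replays fixed samples) are made-up names, and 'end' is assumed to be
set by a SIGUSR1 handler as in the test.

```c
#define _DEFAULT_SOURCE		/* for usleep() under strict -std modes */
#include <assert.h>
#include <signal.h>
#include <unistd.h>

/* Assumed to be set to 1 by the SIGUSR1 handler, as in the test. */
static volatile sig_atomic_t end;

enum poll_result { POLL_RECOVERED, POLL_TIMEOUT, POLL_STOPPED };

/* Hypothetical stand-in for SAFE_READ_MEMINFO(): replays fixed samples. */
static const unsigned long fake_samples[] = { 600, 500, 800, 1000 };
static unsigned int fake_pos;

static unsigned long fake_memfree(void)
{
	unsigned int last = sizeof(fake_samples) / sizeof(fake_samples[0]) - 1;

	/* Keep returning the final sample once the sequence is exhausted. */
	return fake_samples[fake_pos < last ? fake_pos++ : last];
}

/*
 * Poll read_kb() every step_ms until it reports at least target_kb,
 * max_ms has elapsed, or 'end' is set. *min_seen tracks the lowest
 * value observed; *elapsed_ms is written on recovery or timeout only.
 */
static enum poll_result poll_memfree(unsigned long (*read_kb)(void),
				     unsigned long target_kb,
				     int max_ms, int step_ms,
				     unsigned long *min_seen, int *elapsed_ms)
{
	unsigned long cur;
	int i;

	assert(step_ms > 0);

	for (i = step_ms; i <= max_ms; i += step_ms) {
		/* Checked before every sleep, so a stop request is seen within one step. */
		if (end)
			return POLL_STOPPED;

		usleep(step_ms * 1000);
		cur = read_kb();
		if (cur < *min_seen)
			*min_seen = cur;

		if (cur >= target_kb) {
			*elapsed_ms = i;
			return POLL_RECOVERED;
		}
	}

	*elapsed_ms = max_ms;
	return POLL_TIMEOUT;
}
```

A caller would invoke poll_memfree(fake_memfree, 1000, 2000, 10,
&min_seen, &elapsed) and skip all result reporting on POLL_STOPPED.
Returning a distinct status is one possible way to keep the "stopped"
case away from the TFAIL/TINFO branches; the in-place `return` suggested
above achieves the same with less churn.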
> +if (memfree < threshold) {
> +tst_res(TFAIL, "MemFree %lu < 90%% of min_free_kbytes %lu (MinSeen: %lu%%) after 2s",
> +memfree, tune, (min_memfree * 100 / tune));
> +} else if (memfree < tune) {
> +tst_res(TINFO, "MemFree (%lu) stayed within 10%% tolerance (min %lu%%) after ~2s",
> +memfree, (min_memfree * 100 / tune));
> +} else {
> +tst_res(TINFO, "MemFree recovered to %lu (min %lu%%) after %d ms",
> +memfree, (min_memfree * 100 / tune), i);
> +}
Minor: the messages print raw kB values and percentages with the same
%lu specifier, which can be confusing in the output. The TFAIL line
prints MemFree and tune in kB without a unit, while the third argument
is labelled "(MinSeen: N%)". The intent is clear, but adding "kB" to
the first two values would be more consistent:
"MemFree %lu kB < 90%% of min_free_kbytes %lu kB (MinSeen: %lu%%) after 2s"
Similarly for the TINFO lines.
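A small standalone sketch of the suggested wording, using snprintf()
instead of tst_res() so that it builds outside LTP. pct_of() and
format_fail() are hypothetical helper names used only for this
illustration; the format string is the one proposed above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>	/* strcmp(), handy when checking the output */

/* Percentage of 'tune' that 'value' represents, rounded down. */
static unsigned long pct_of(unsigned long value, unsigned long tune)
{
	assert(tune > 0);	/* guard against dividing by zero */
	return value * 100 / tune;
}

/*
 * Hypothetical helper: build the failure message with explicit kB
 * units so raw values and percentages are easy to tell apart.
 * Returns the snprintf() result.
 */
static int format_fail(char *buf, size_t len, unsigned long memfree,
		       unsigned long tune, unsigned long min_memfree)
{
	return snprintf(buf, len,
			"MemFree %lu kB < 90%% of min_free_kbytes %lu kB "
			"(MinSeen: %lu%%) after 2s",
			memfree, tune, pct_of(min_memfree, tune));
}
```

With memfree = 800, tune = 1000 and min_memfree = 500 this yields
"MemFree 800 kB < 90% of min_free_kbytes 1000 kB (MinSeen: 50%) after 2s".
Note that value * 100 can wrap where unsigned long is 32 bits wide once
the value exceeds roughly 42.9 million kB; whether that matters for the
targets this test runs on is not something this sketch tries to answer.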
---
Verdict: Needs revision
The missing 'end' check in the inner polling loop is a regression: it
removes the prompt termination behaviour of the original sleep(2) and
can delay the test exit by up to 2 seconds after SIGUSR1 is received.
Please fix before merging. The commit message first-person wording and
the missing kB unit labels are minor and can be addressed in the same
respin.
LTP AI Reviewer