[LTP] min_free_kbytes: Handle transient memory drops in check_monitor
linuxtestproject.agent@gmail.com
linuxtestproject.agent@gmail.com
Mon Jun 1 08:42:20 CEST 2026
Hi Wei,
Thank you for the patch. A few issues to address before this can be merged.
On Sun, May 31, 2026 at 13:40:47 +0000, Wei Gao <wegao@suse.com> wrote:
> min_free_kbytes: Handle transient memory drops in check_monitor
>
> High memory pressure can cause MemFree to temporarily drop below the
> min_free_kbytes threshold before the kernel reclaimer can catch up.
> This results in intermittent test failures, particularly observed on
> openQA aarch64 machines where swap is exhausted.
>
> Implement a 2-second grace period with high-accuracy 10ms fixed polling
> in check_monitor() to allow the kernel time to reclaim memory.
>
> Introduce a 10% tolerance (90% threshold) for the MemFree check. Our
> measurements showed that under extreme pressure, MemFree can take a
> long time to recover to the exact 100% MIN_FREE_KBYTES, or may stay slightly
> below it. This tolerance prevents false positives and avoids excessive
> wait times while still ensuring memory is maintained near the required level.
The 10% tolerance weakens what the test actually verifies. The commit says
"Our measurements showed..." but no data or bounds are provided. What is the
maximum observed shortfall? If the worst case seen was, say, 3%, a 10%
tolerance is far too loose and may mask real regressions. Please quantify the
measurement and tighten the tolerance to match observed reality, or at least
document the worst-case value that motivated the 10% figure.
> Enhanced diagnostics are added to report MemAvailable and the minimum
> memory level seen during the pressure period to aid in future
> calibration.
"During the pressure period" is inaccurate. `min_memfree` is correctly
tracked inside the retry loop, but `memavail` is read *after* the loop
completes:
>+ memavail = SAFE_READ_MEMINFO("MemAvailable:");
>+
>+ if (memfree < threshold) {
>+ tst_res(TINFO, "tune=%lu, threshold=%lu", tune, threshold);
>+ tst_res(TINFO, "MemFree=%lu, MemAvailable=%lu, MinSeen=%lu (%lu%%)",
>+ memfree, memavail, min_memfree, (min_memfree * 100 / tune));
`memavail` is a snapshot taken at the *end* of the grace period, not during
it. By the time the retry loop finishes the kernel reclaimer may have already
recovered substantially. Either move SAFE_READ_MEMINFO("MemAvailable:") inside
the loop to track it properly (as is done for MemFree), or adjust the commit
message to say "after the grace period" rather than "during the pressure
period".
> @@ -177,20 +177,55 @@ static int eatup_mem(unsigned long overcommit_policy)
> static void check_monitor(void)
> {
> - unsigned long tune;
> - unsigned long memfree;
> + unsigned long tune, threshold;
> + unsigned long memfree, memavail, min_memfree;
> + int i, retry_count;
`retry_count` is redundant. Within the loop, `retry_count` is always equal
to `i / 10` (e.g. when i=10 retry_count=1, when i=50 retry_count=5). The
else-branch TINFO message reports both, conveying identical information in
two different units. Drop `retry_count` and derive the count from `i / 10`
directly, or just remove it from the diagnostic entirely since the elapsed
time in ms (`i`) is the more useful figure.
> - sleep(2);
> + usleep(100000);
> }
This changes the normal (non-pressure) monitoring interval from 2 seconds to
100 ms, making the monitor sample `/proc/meminfo` and
`/proc/sys/vm/min_free_kbytes` 20x more often during the entire test run.
This is a significant behavioral change that is completely absent from the
commit message. Please document it — including the rationale for choosing
100 ms — so reviewers and bisectors can understand the intent.
---
Verdict: Needs revision
Issues:
1. [NEEDS REVISION] Undocumented behavioral change: normal poll interval
2s → 100 ms is not mentioned in the commit message.
2. [NEEDS REVISION] Commit message inaccuracy: MemAvailable is read after
the retry loop, not "during the pressure period".
3. [NEEDS DISCUSSION] 10% tolerance lacks quantitative justification; the
"Our measurements showed..." claim needs backing data or tighter bounds.
4. [NEEDS REVISION] `retry_count` is always equal to `i / 10`; the
variable is redundant and should be removed.
LTP AI Reviewer
More information about the ltp
mailing list