[LTP] min_free_kbytes: Handle transient memory drops in check_monitor

Wed Jun 3 05:07:33 CEST 2026

On Tue, Jun 02, 2026 at 04:07:02PM +0000, Andrea Cervesato via ltp wrote:
> Hi Wei,
> 
> > > > Introduce a 10% tolerance (90% threshold) for the MemFree check. My
> > > > measurements showed that under extreme pressure, MemFree can dip as low
> > > > as ~50% to ~70% of the target. While it typically recovers above 90%
> > > > within one second, hitting the exact 100% watermark sometimes can take
> > > > significantly longer. This tolerance prevents false positives during the
> > > > slow recovery tail while still ensuring memory is maintained near the
> > > > required level.
> > > 
> > > Please rewrite in impersonal form, e.g.:
> > >   "Measurements under extreme pressure show MemFree can dip as low as
> > >    ~50% to ~70% of the target."
> > > 
> > > Commit messages become a permanent part of the project history and
> > > should avoid first-person language.
> > The measurement result is quite related with my local env, also is there any ltp
> > rule forbids the use of "My"?
> 
> Yeah, this is quite common in the commit messages, but it's not a strict
> rule don't worry about it.
> 
> > > 
> > > > -unsigned long tune;
> > > > -unsigned long memfree;
> > > > +unsigned long tune, threshold;
> > > > +unsigned long memfree, min_memfree;
> > > > +int i;
> > > >
> > > >  while (!end) {
> > > >  memfree = SAFE_READ_MEMINFO("MemFree:");
> > > >  tune = TST_SYS_CONF_LONG_GET(MIN_FREE_KBYTES);
> > > > +/*
> > > > + * Allow 10% tolerance to account for transient states.
> > > > + */
> > > > +threshold = tune * 9 / 10;
> > > >
> > > >  if (memfree < tune) {
> > > > -tst_res(TINFO, "MemFree is %lu kB, "
> > > > - "min_free_kbytes is %lu kB", memfree, tune);
> > > > -tst_res(TFAIL, "MemFree < min_free_kbytes");
> > > > +min_memfree = memfree;
> > > > +/*
> > > > + * Give it some time to reclaim. The kernel should keep
> > > > + * MemFree above min_free_kbytes, but transient drops
> > > > + * are possible under high pressure.
> > > > + * Check every 10ms for up to 2 seconds for high accuracy.
> > > > + */
> > > > +for (i = 10; i <= 2000; i += 10) {
> > > > +usleep(10000);
> > > > +memfree = SAFE_READ_MEMINFO("MemFree:");
> > > > +if (memfree < min_memfree)
> > > > +min_memfree = memfree;
> > > > +
> > > > +if (memfree >= tune)
> > > > +break;
> > > > +}
> > > 
> > > The inner polling loop does not check the 'end' flag, which is set to 1
> > > by the SIGUSR1 handler when the parent finishes. In the original code,
> > > the single sleep(2) at the bottom of the outer loop was interrupted by
> > > the signal (usleep/sleep return EINTR), so the outer while (!end) check
> > > fired promptly. With the new grace period loop, the signal can fire
> > > while we are in any of the 200 inner usleep(10000) calls; the sleep
> > > returns early but the loop body continues, and the outer while (!end)
> > > cannot be reached until the full ~2 seconds elapse.
> > > 
> > > Please add an early exit:
> > > 
> > >   for (i = 10; i <= 2000; i += 10) {
> > >       if (end)
> > >           return;
> > >       usleep(10000);
> > >       memfree = SAFE_READ_MEMINFO("MemFree:");
> > >       if (memfree < min_memfree)
> > >           min_memfree = memfree;
> > >       if (memfree >= tune)
> > >           break;
> > >   }
> > > 
> > > Alternatively a `break` is sufficient if the outer loop check is
> > > reached quickly, but `return` is cleaner given the function only has
> > > one exit point anyway.
> > Completeness is more important than saving 2 seconds of test time, especially 
> > for the final sub-test. If we jump out early, we might miss the final result of the test.
> 
> I'm not sure about this one. In general, once we have a result we
> should return as fast as possible, so we can save time while running
> other tests. 2 seconds might not seem enough, but when we sum up
> various tests with the same logic, we might have a big delay.

Since end is only set to 1 by SIGUSR1 at the very end of the entire program,
adding an end check to the inner loop would only save time during the final
teardown, and the maximum possible saving is just 2 seconds. It also introduces
a real risk: if a memory drop is active at the very end of the test and we exit
early because end == 1, we miss the final result (completeness).
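
To make the trade-off concrete, here is a minimal standalone sketch of the
grace-period loop, not the actual patch. The names poll_recovery, read_fn,
struct samples and the honor_end switch are hypothetical and exist only for
illustration; the reader callback stands in for SAFE_READ_MEMINFO("MemFree:"),
and the usleep(10000) call is left out so the sketch runs instantly.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for SAFE_READ_MEMINFO("MemFree:"). */
typedef unsigned long (*read_fn)(void *ctx);

struct poll_result {
	unsigned long last;	/* last MemFree sample, in kB */
	unsigned long min_seen;	/* lowest sample seen, in kB */
	int elapsed_ms;		/* grace period consumed so far */
	int cut_short;		/* 1 if the end flag stopped the loop */
};

/* Replays a fixed list of samples, repeating the last one forever. */
struct samples {
	const unsigned long *v;
	size_t n, pos;
};

static unsigned long next_sample(void *ctx)
{
	struct samples *s = ctx;
	unsigned long val = s->v[s->pos];

	if (s->pos + 1 < s->n)
		s->pos++;
	return val;
}

/*
 * Poll every 10 ms (simulated) for up to 2 seconds until MemFree is back
 * at or above tune. With honor_end set, a raised end flag stops the loop
 * before the next sample is taken, so the final state is never observed.
 */
static struct poll_result poll_recovery(unsigned long first,
					unsigned long tune,
					read_fn read_memfree, void *ctx,
					const volatile int *end,
					int honor_end)
{
	struct poll_result r = { first, first, 0, 0 };
	int i;

	assert(read_memfree != NULL);

	for (i = 10; i <= 2000; i += 10) {
		if (honor_end && *end) {
			r.cut_short = 1;
			break;
		}
		/* The real loop would call usleep(10000) here. */
		r.last = read_memfree(ctx);
		if (r.last < r.min_seen)
			r.min_seen = r.last;
		r.elapsed_ms = i;
		if (r.last >= tune)
			break;
	}
	return r;
}
```

With honor_end = 0 and end already set, the loop still takes all 200 samples
and reports the last one; with honor_end = 1 it returns with elapsed_ms == 0
and only the initial reading, which is the case where the final result would
be missed.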

> 
> > > 
> > > > +if (memfree < threshold) {
> > > > +tst_res(TFAIL, "MemFree %lu < 90%% of min_free_kbytes %lu (MinSeen: %lu%%) after 2s",
> > > > +memfree, tune, (min_memfree * 100 / tune));
> > > > +} else if (memfree < tune) {
> > > > +tst_res(TINFO, "MemFree (%lu) stayed within 10%% tolerance (min %lu%%) after ~2s",
> > > > +memfree, (min_memfree * 100 / tune));
> > > > +} else {
> > > > +tst_res(TINFO, "MemFree recovered to %lu (min %lu%%) after %d ms",
> > > > +memfree, (min_memfree * 100 / tune), i);
> > > > +}
> > > 
> > > Minor: the TINFO messages mix kB values and percentage values using the
> > > same format specifier %lu, which can be confusing in the output. The
> > > TFAIL line prints MemFree and tune as raw kB values but labels the third
> > > argument "(MinSeen: N%)" — the intent is clear but adding "kB" units to
> > > the first two values would be more consistent:
> > > 
> > >   "MemFree %lu kB < 90%% of min_free_kbytes %lu kB (MinSeen: %lu%%) after 2s"
> > > 
> > > Similarly for the TINFO lines.
> > Name is min_free_kbytes and the test context is very clear.
> > 
> > Unless there's a logical error or an LTP rule violation, I won't be sending another patch.
> 
> I think the agent complains that the output message is not easy
> to debug: we are printing values without the units. I would just
> add MemFree unit `kB` in this case. The rest of the message looks
> ok.
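
For reference, a small standalone sketch of the three-way check with the `kB`
unit added only to MemFree, as suggested above. This is an illustration under
assumptions, not the patch itself: classify, format_fail and the verdict enum
are hypothetical names, and snprintf stands in for tst_res so the sketch can be
compiled outside of LTP.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>	/* only needed for strcmp in the usage below */

/* Hypothetical labels for the three branches of the check. */
enum verdict { RECOVERED, WITHIN_TOLERANCE, BELOW_THRESHOLD };

/* Same arithmetic as the patch: threshold is 90% of min_free_kbytes. */
static enum verdict classify(unsigned long memfree, unsigned long tune)
{
	unsigned long threshold = tune * 9 / 10;

	if (memfree < threshold)
		return BELOW_THRESHOLD;	/* would be reported as TFAIL */
	if (memfree < tune)
		return WITHIN_TOLERANCE;	/* TINFO, inside the 10% band */
	return RECOVERED;			/* TINFO, back above the target */
}

/* Builds the failure message with the kB unit on MemFree only. */
static int format_fail(char *buf, size_t len, unsigned long memfree,
		       unsigned long tune, unsigned long min_memfree)
{
	assert(tune > 0);	/* avoid dividing by zero in the percentage */

	return snprintf(buf, len,
			"MemFree %lu kB < 90%% of min_free_kbytes %lu "
			"(MinSeen: %lu%%) after 2s",
			memfree, tune, min_memfree * 100 / tune);
}
```

Calling format_fail(buf, sizeof(buf), 850, 1000, 500) and comparing the result
with strcmp is an easy way to confirm the wording before changing the real
tst_res() call.
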
> 
> Regards,
> --
> Andrea Cervesato
> SUSE QE Automation Engineer Linux
> andrea.cervesato@suse.com
> 
> -- 
> Mailing list info: https://lists.linux.it/listinfo/ltp