[LTP] min_free_kbytes: Handle transient memory drops in check_monitor

Andrea Cervesato andrea.cervesato@suse.com
Tue Jun 2 18:07:02 CEST 2026


Hi Wei,

> > > Introduce a 10% tolerance (90% threshold) for the MemFree check. My
> > > measurements showed that under extreme pressure, MemFree can dip as low
> > > as ~50% to ~70% of the target. While it typically recovers above 90%
> > > within one second, hitting the exact 100% watermark sometimes can take
> > > significantly longer. This tolerance prevents false positives during the
> > > slow recovery tail while still ensuring memory is maintained near the
> > > required level.
> > 
> > Please rewrite in impersonal form, e.g.:
> >   "Measurements under extreme pressure show MemFree can dip as low as
> >    ~50% to ~70% of the target."
> > 
> > Commit messages become a permanent part of the project history and
> > should avoid first-person language.
> The measurement result is quite related with my local env, also is there any ltp
> rule forbids the use of "My"?

Yeah, the impersonal form is quite common in commit messages, but it's
not a strict rule, so don't worry about it.

> > 
> > > -unsigned long tune;
> > > -unsigned long memfree;
> > > +unsigned long tune, threshold;
> > > +unsigned long memfree, min_memfree;
> > > +int i;
> > >
> > >  while (!end) {
> > >  memfree = SAFE_READ_MEMINFO("MemFree:");
> > >  tune = TST_SYS_CONF_LONG_GET(MIN_FREE_KBYTES);
> > > +/*
> > > + * Allow 10% tolerance to account for transient states.
> > > + */
> > > +threshold = tune * 9 / 10;
> > >
> > >  if (memfree < tune) {
> > > -tst_res(TINFO, "MemFree is %lu kB, "
> > > - "min_free_kbytes is %lu kB", memfree, tune);
> > > -tst_res(TFAIL, "MemFree < min_free_kbytes");
> > > +min_memfree = memfree;
> > > +/*
> > > + * Give it some time to reclaim. The kernel should keep
> > > + * MemFree above min_free_kbytes, but transient drops
> > > + * are possible under high pressure.
> > > + * Check every 10ms for up to 2 seconds for high accuracy.
> > > + */
> > > +for (i = 10; i <= 2000; i += 10) {
> > > +usleep(10000);
> > > +memfree = SAFE_READ_MEMINFO("MemFree:");
> > > +if (memfree < min_memfree)
> > > +min_memfree = memfree;
> > > +
> > > +if (memfree >= tune)
> > > +break;
> > > +}
> > 
> > The inner polling loop does not check the 'end' flag, which is set to 1
> > by the SIGUSR1 handler when the parent finishes. In the original code,
> > the single sleep(2) at the bottom of the outer loop was interrupted by
> > the signal (usleep/sleep return EINTR), so the outer while (!end) check
> > fired promptly. With the new grace period loop, the signal can fire
> > while we are in any of the 200 inner usleep(10000) calls; the sleep
> > returns early but the loop body continues, and the outer while (!end)
> > cannot be reached until the full ~2 seconds elapse.
> > 
> > Please add an early exit:
> > 
> >   for (i = 10; i <= 2000; i += 10) {
> >       if (end)
> >           return;
> >       usleep(10000);
> >       memfree = SAFE_READ_MEMINFO("MemFree:");
> >       if (memfree < min_memfree)
> >           min_memfree = memfree;
> >       if (memfree >= tune)
> >           break;
> >   }
> > 
> > Alternatively a `break` is sufficient if the outer loop check is
> > reached quickly, but `return` is cleaner given the function only has
> > one exit point anyway.
> Completeness is more important than saving 2 seconds of test time, especially 
> for the final sub-test. If we jump out early, we might miss the final result of the test.

I'm not sure about this one. In general, once we have a result we
should return as fast as possible, so we can save time while running
other tests. 2 seconds might not seem like much, but when we sum up
various tests with the same logic, we might end up with a big delay.
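
To make the early-exit idea concrete, here is a rough, untested sketch of
the grace-period loop pulled into a helper. The names wait_for_recovery(),
memfree_reader and fixed_reader() are made up for illustration and are not
in the patch; the reader callback stands in for
SAFE_READ_MEMINFO("MemFree:"), and the -1 return value is just one possible
convention for "parent already finished".

```c
#include <assert.h>
#include <signal.h>
#include <unistd.h>

/* In the real test this flag is set by the SIGUSR1 handler. */
static volatile sig_atomic_t end;

/* Hypothetical callback type standing in for SAFE_READ_MEMINFO("MemFree:"). */
typedef unsigned long (*memfree_reader)(void);

/* Hypothetical stub reader that always reports 1000 kB free. */
static unsigned long fixed_reader(void)
{
	return 1000;
}

/*
 * Poll every 10 ms for up to 2 s until MemFree reaches `tune`.
 * Returns the elapsed time in ms on recovery, 2010 if the grace period
 * expired (the loop counter after the last iteration), or -1 if `end`
 * was raised before a poll.
 */
static int wait_for_recovery(memfree_reader read_memfree, unsigned long tune,
			     unsigned long *min_memfree)
{
	int i;

	assert(read_memfree && min_memfree);

	for (i = 10; i <= 2000; i += 10) {
		unsigned long memfree;

		/* Leave promptly once the parent has signalled completion. */
		if (end)
			return -1;

		usleep(10000);
		memfree = read_memfree();
		if (memfree < *min_memfree)
			*min_memfree = memfree;
		if (memfree >= tune)
			return i;
	}

	return i;
}
```

With something like this, the caller can skip the final report when the
helper returns -1, so the worst-case extra delay after the signal is one
10 ms sleep instead of the remainder of the 2 seconds.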

> > 
> > > +if (memfree < threshold) {
> > > +tst_res(TFAIL, "MemFree %lu < 90%% of min_free_kbytes %lu (MinSeen: %lu%%) after 2s",
> > > +memfree, tune, (min_memfree * 100 / tune));
> > > +} else if (memfree < tune) {
> > > +tst_res(TINFO, "MemFree (%lu) stayed within 10%% tolerance (min %lu%%) after ~2s",
> > > +memfree, (min_memfree * 100 / tune));
> > > +} else {
> > > +tst_res(TINFO, "MemFree recovered to %lu (min %lu%%) after %d ms",
> > > +memfree, (min_memfree * 100 / tune), i);
> > > +}
> > 
> > Minor: the TINFO messages mix kB values and percentage values using the
> > same format specifier %lu, which can be confusing in the output. The
> > TFAIL line prints MemFree and tune as raw kB values but labels the third
> > argument "(MinSeen: N%)" — the intent is clear but adding "kB" units to
> > the first two values would be more consistent:
> > 
> >   "MemFree %lu kB < 90%% of min_free_kbytes %lu kB (MinSeen: %lu%%) after 2s"
> > 
> > Similarly for the TINFO lines.
> Name is min_free_kbytes and the test context is very clear.
> 
> Unless there's a logical error or an LTP rule violation, I won't be sending another patch.

I think the agent is complaining that the output message is not easy
to debug: we are printing values without their units. I would just
add the `kB` unit to the MemFree value in this case. The rest of the
message looks ok.
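
For illustration only, this is roughly what the TFAIL line would produce
with the unit added to MemFree. format_fail_msg() is a hypothetical helper
that uses snprintf() so the string can be inspected; the actual patch would
keep calling tst_res() directly with the same format string.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: builds the failure message with an explicit kB
 * unit on the MemFree value. Returns the snprintf() result.
 */
static int format_fail_msg(char *buf, size_t len, unsigned long memfree,
			   unsigned long tune, unsigned long min_memfree)
{
	/* tune is the divisor of the percentage, so it must be non-zero. */
	assert(tune != 0);

	return snprintf(buf, len,
		"MemFree %lu kB < 90%% of min_free_kbytes %lu (MinSeen: %lu%%) after 2s",
		memfree, tune, min_memfree * 100 / tune);
}
```

For memfree = 600, tune = 1000 and min_memfree = 500 this gives
"MemFree 600 kB < 90% of min_free_kbytes 1000 (MinSeen: 50%) after 2s".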

Regards,
--
Andrea Cervesato
SUSE QE Automation Engineer Linux
andrea.cervesato@suse.com

