[LTP] Issue faced in memcg_stat_rss while running mainline kernels between 6.7 and 6.8

Thu Jan 16 11:35:54 CET 2025

On Thu 16-01-25 15:34:38, Harshvardhan Jha wrote:
> Hi Michal,
> 
> On 16/01/25 2:36 PM, Michal Hocko wrote:
> > On Thu 16-01-25 13:37:14, Harshvardhan Jha wrote:
> >> Hello Michal
> >> On 16/01/25 1:23 PM, Michal Hocko wrote:
> >>> Hi,
> >>>
> >>> On Wed 15-01-25 23:59:20, Petr Vorel wrote:
> >>>> Hi Harshvardhan,
> >>>>
> >>>> [ Cc cgroups@vger.kernel.org: FYI problem in recent kernel using cgroup v1 ]
> >>> It is hard to decipher the output and nail down the actual failure.
> >>> Could somebody do a TL;DR summary of the failure: since when does it
> >>> happen, and is it really v1 specific?
> >> The test ltp_memcg_stat_rss is indeed cgroup v1 specific.
> > What does this test case aim to test?
> >
> This test specifically tests the memory cgroup (memcg) subsystem,
> focusing on the RSS accounting functionality.
> 
> The test verifies how the kernel tracks and reports memory usage within
> cgroups, specifically:
> 
> - The accuracy of RSS accounting in memory cgroups
> - How the kernel updates and maintains the RSS statistics for processes
> within memory cgroups
> - The proper reporting of memory usage through the cgroup interface
> 
> The test typically:
> 
>  1. Creates a memory cgroup
>  2. Allocates various types of memory within it
>  3. Verifies that the reported RSS statistics match the expected values
>  4. Tests edge cases like shared pages and memory pressure situations
> 
> I hope I explained it right @Petr?

Thanks. Yes, this does clarify the test case. Unfortunately this could
be quite tricky to get right, especially for short-lived processes. Due
to stats accounting optimizations, not all changes to the counters might
be visible right away. So some tuning is required and, to make it worse,
that tuning might just stop working with future optimizations.
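
To illustrate the kind of tuning meant here, below is a minimal sketch of
a check that polls the counter until it settles within a tolerance,
instead of reading it once. It is not taken from LTP. The helper names,
the tolerance and the timeout are all assumptions made for illustration,
and the reader callable stands in for reading the cgroup v1 memory.stat
file of the group under test.

```python
import time
from typing import Callable, Optional


def parse_rss(stat_text: str) -> int:
    """Return the 'rss' value in bytes from cgroup v1 memory.stat contents."""
    for line in stat_text.splitlines():
        fields = line.split()
        # Match the exact key so that 'total_rss' or 'rss_huge' are skipped.
        if len(fields) == 2 and fields[0] == "rss":
            return int(fields[1])
    raise ValueError("no 'rss' line found in memory.stat contents")


def wait_for_rss(read_stat: Callable[[], str], expected: int, tolerance: int,
                 timeout: float = 5.0, interval: float = 0.1) -> Optional[int]:
    """Poll until rss is within `tolerance` bytes of `expected`.

    Returns the matching value, or None if `timeout` seconds pass first.
    The default timeout and interval are guesses, not kernel guarantees.
    """
    deadline = time.monotonic() + timeout
    while True:
        rss = parse_rss(read_stat())
        if abs(rss - expected) <= tolerance:
            return rss
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)
```

A caller would pass something like
lambda: open(stat_path).read(), where stat_path is a hypothetical path to
the test group's memory.stat. Any tolerance and timeout picked this way
remain exactly the sort of values that may need retuning later.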

All that being said, it is a question whether this specific test case
brings sufficient value to justify likely false negatives and constant
tuning to the existing kernel implementation.

If this local imprecision is a problem for real workloads, we might need
to provide a means to sync up stats (similar to what we have for
/proc/vmstat), and test cases could rely on that rather than trying to
estimate in-flight cached stats.
-- 
Michal Hocko
SUSE Labs