[LTP] [PATCH v2] syscalls/readahead02: limit max readahead to backing device max_readahead_kb

Jan Stancek jstancek@redhat.com
Thu Mar 7 10:15:09 CET 2019


----- Original Message -----
> On Thu, Mar 7, 2019 at 10:18 AM Jan Stancek <jstancek@redhat.com> wrote:
> >
> >
> >
> > ----- Original Message -----
> > > On Wed, Mar 6, 2019 at 6:43 PM Jan Stancek <jstancek@redhat.com> wrote:
> > > >
> > > > On Tue, Mar 05, 2019 at 10:44:57PM +0200, Amir Goldstein wrote:
> > > > >> > > > This is certainly better than 4K, but still feels like we are
> > > > >> > > > not
> > > > >> > > > really
> > > > >> > > > testing
> > > > >> > > > the API properly, but I'm fine with this fix.
> > > > >> > > >
> > > > >> > > > However... as follow up, how about extending the new
> > > > >> > > > tst_dev_bytes_written() utils from Sumit to cover also
> > > > >> > > > bytes_read
> > > > >> > > > and replace validation of readahead() from get_cached_size()
> > > > >> > > > diff
> > > > >> > > > to tst_dev_bytes_read()?
> > > > >> > >
> > > > >> > > There is something similar based on /proc/self/io. We could try
> > > > >> > > using
> > > > >> > > that to estimate max readahead size.
> > > > >> > >
> > > > >> > > Or /sys/class/block/$dev/stat as you suggested, not sure which
> > > > >> > > one
> > > > >> > > is
> > > > >> > > more accurate/up to date.
> > > > >> > >
> > > > >> >
> > > > >> > I believe /proc/self/io doesn't count IO performed by kernel async
> > > > >> > readahead against the process that issued the readahead, but
> > > > >> > didn't
> > > > >> > check. The test uses /proc/self/io to check how many IOs were
> > > > >> > avoided by readahead...
> > > > >>
> > > > >> We could do one readahead() on the entire file, then read
> > > > >> the file and see how much IO we didn't manage to avoid.
> > > > >> The difference between the file size and the IO we couldn't
> > > > >> avoid would be our max readahead size.
> > > >
> > > > This also doesn't seem 100% accurate.
> > > >
> > > > Any method of inspecting the side effects of readahead appears to
> > > > lead to more readahead being done by the kernel. E.g. sequential
> > > > reads lead to more async readahead started by the kernel (which
> > > > tries to stay ahead by async_size). The mmap() approach appears to
> > > > fault pages in via do_fault_around().
> > > >
> > > > MAP_NONBLOCK is gone, and mincore and pagemap don't help here.
> > > >
> > > > I'm attaching v3, where I do the reads with syscalls in reverse
> > > > order. But occasionally it still somehow leads to a couple of extra
> > > > pages being read into the cache, so it still over-estimates. On
> > > > ppc64le it's quite significant: 4 extra pages in cache, each 64k,
> > > > cause the readahead loop to miss ~10MB of data.
> > > >
> > > > /sys/class/block/$dev/ stats appear to be increased for fs metadata
> > > > as well, which can also inflate the value and make us over-estimate.
> > > >
> > > > I'm running out of ideas for something more accurate/stable than v2.
> > > >
> > >
> > > I'm trying to understand if maybe you are running off the rails with
> > > the estimation issue.
> > > What the test aims to verify is that readahead prevents waiting
> > > on IO in the future.
> > > The max readahead size estimation is a very small part
> > > of the test and not that significant IMO.
> > > Why is the test not trying to readahead more than 64MB?
> > > It's arbitrary. So my first proposal for the over-estimation was
> > > to cap the estimate at 1MB, i.e.:
> >
> > Problem is that the max allowed readahead can be smaller than 1MB,
> > so then we would still over-estimate.
> >
> > >
> > > offset += MIN(max_ra_estimate, MAX_SANE_READAHEAD_ESTIMATION);
> > >
> > > With this change we won't be testing whether a readahead of 64MB
> > > in one go works anymore - it looks like getting that to work reliably
> > > is more challenging, and I'm not sure it's worth the trouble.
> > >
> > > What you ended up implementing in v2 is not what I proposed
> > > (you just disabled estimation altogether)
> >
> > Yes, the v2/bdi limit is the highest number I'm aware of that should
> > work across all kernels without guessing.
> >
> > > I am fine with the dynamic setting on min_sane_readahead
> > > in v2, but it is independent of capping the upper limit for
> > > estimation.
> > >
> > > So what do you say about v2 + estimation (with or without reading
> > > file backwards) and upper estimation limit?
> >
> > How could we tell if 1MB upper limit is smaller than max readahead limit?
> >
> 
> Try to set the bdi limit of the test device in setup() to testfile_size
> before reading back the value?
> If that fails, try testfile_size / 2, etc.

Maybe, but we would need to start lower (2M). Kernels prior to
commit 600e19afc5f8a6c18ea49cee9511c5797db02391 will just ignore it.

I can give it a try.

> Use the value that the bdi limit ended up with.
> 
> > Another option could be v3, but reading in steps equal to "max_ra / 2":
> >
> >  *                        |<----- async_size ---------|
> >  *     |------------------- size -------------------->|
> >  *     |==================#===========================|
> >  *     ^start             ^page marked with PG_readahead
> >  *
> >  * To overlap application thinking time and disk I/O time, we do
> >  * `readahead pipelining': Do not wait until the application consumed all
> >  * readahead pages and stalled on the missing page at readahead_index;
> >  * Instead, submit an asynchronous readahead I/O as soon as there are
> >  * only async_size pages left in the readahead window. Normally async_size
> >  * will be equal to size, for maximum pipelining.
> >
> >
> 
> OK, if it works, but v2 + raising the bdi limit sounds simpler;
> if that works, I think it is better, as it minimizes speculative
> code in the test.

readahead has been tweaked a lot throughout the years,
so it all feels like speculation :-/

> 
> Thanks,
> Amir.
> 

