[LTP] LTP: futex_wake04 hangs forever on i386

Rafael David Tinoco rafael.tinoco@linaro.org
Tue Oct 9 04:05:16 CEST 2018


On 10/8/18 10:30 AM, Cyril Hrubis wrote:
> Hi!
>> LTP syscall test case futex_wake04 hangs forever on i386 kernel 
>> running on x86_64 server machine.
>> 
>> Test PASS on qemu_i386 and other devices. Test HANGs on i386 (i386
>>  kernel booting on x86_64 machine).
>> 
>> output, incrementing stop
>> futex_wake04    0  TINFO  : Hugepagesize 4194304
> 
> Looks like we are trying to fault 4GB hugepage, that may be quite 
> slow, but then, the test is a kernel reproducer as well.

This is i386 without PAE, so the hugepage here is 4 MB (4194304 bytes),
not 4 GB; with PAE it would be 2 MB. I have configured the same
environment on my side:

(k)inaddy@bug3984:~$ sudo hugeadm --hard --pool-pages-max 4M:128
(k)inaddy@bug3984:~$ sudo hugeadm --hard --pool-pages-min 4M:128
(k)inaddy@bug3984:~$ sudo hugeadm --pool-list
       Size  Minimum  Current  Maximum  Default
    4194304      128      128      128        *

And ran the test with 500 iterations, multiple times, without a single
issue.

(k)root@bug3984:~$ while true; do cat /proc/meminfo | grep -i
hugepages_free; done

Shows me the hugepages are being used:

HugePages_Free:       93
HugePages_Free:        4
HugePages_Free:      128
HugePages_Free:        0
HugePages_Free:      128
HugePages_Free:        1
HugePages_Free:        8
HugePages_Free:      128
HugePages_Free:       27
HugePages_Free:        0
HugePages_Free:      102

> Can you connect to the process with gdb and check if we happened to 
> timeout before or after the two threads that are supposed to
> trigger the kernel bug are started?
> 

Reading the test and commit 13d60f4b6a, it looks like this test
exercises futex indexes stored in hugepages (one futex in the hugepage
head and the other outside the head) and an indexing problem in which
futex() was signaled as non-contended.

Kernel code was using page->index (of the page hosting the futex) for
the futex offset and, after this commit, it started taking into account
whether the page is a hugepage (head or tail) or a compound page, in
which case it uses the compound index (compound_idx) for the futex
offset, fixing the issue.
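
Roughly the idea, paraphrased from memory (NOT verbatim kernel code;
futex_pgoff() below is a made-up name, the real helper added by the
commit is, if I recall correctly, basepage_index(), called from
get_futex_key()):

/*
 * Simplified sketch: derive the futex offset from the compound head,
 * not from the raw page->index of a tail page.
 */
static unsigned long futex_pgoff(struct page *page)
{
        struct page *head = compound_head(page);

        if (!PageHuge(head))
                return page->index;     /* regular page: old behaviour */

        /* hugepage: head index scaled to base pages, plus tail offset */
        return (page_index(head) << compound_order(head)) + (page - head);
}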

In order for the issue to happen, we would have to face a layout like this:

test thread #1                    test thread #2
--------------                    --------------
[hugepage #1 head] *futex1 offset [hugepage #1 head]
[page]                            [page] *futex2 offset
[page]                            [page]
[page]                            [page]
...                               ...

where futex2 key points to futex1.
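
Just to make the layout concrete, an illustrative sketch (hypothetical
code, not the actual futex_wake04.c sources): one MAP_HUGETLB mapping
with one futex in the hugepage head and one in a tail page of the same
hugepage. On a pre-13d60f4b6a kernel the waiter on futex2 could be keyed
off the tail page's page->index and never be found by the wake.

#define _GNU_SOURCE
#include <sys/mman.h>

#define HPS (4UL * 1024 * 1024)  /* assumption: 4 MB hugepage (i386, no PAE) */

static int *futex1, *futex2;

static int map_futexes(void)
{
        char *addr = mmap(NULL, HPS, PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

        if (addr == MAP_FAILED)
                return -1;

        futex1 = (int *)addr;              /* inside the hugepage head */
        futex2 = (int *)(addr + HPS / 2);  /* inside a tail page       */
        return 0;
}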

And the test case failure does NOT look like something intermittent
(since it is a lock ordering issue - depending on pthreads sharing
futexes - with wrong futex indexes, as described in the kernel commit).

* The test case problem would either always happen or not happen at all *

I tried to extend the number of iterations a bit and something caught my
attention... when running the same test multiple times, after some time
I would face something like:

futex_wake04    0  TINFO  :  Thread 7495 not sleeping yet
futex_wake04    0  TINFO  :  Thread 7495 not sleeping yet
futex_wake04    0  TINFO  :  Thread 7495 not sleeping yet
futex_wake04    0  TINFO  :  Thread 7495 not sleeping yet
futex_wake04    0  TINFO  :  Thread 7495 not sleeping yet
futex_wake04    0  TINFO  :  Thread 7495 not sleeping yet
futex_wake04    0  TINFO  :  Thread 7495 not sleeping yet
futex_wake04    0  TINFO  :  Thread 7495 not sleeping yet
futex_wake04    0  TINFO  :  Thread 7495 not sleeping yet

And there wasn't a second thread task id inside:

/proc/<pid>/task/{<pid>, <tid1>, <tid2>}

like there should have been.

Likely the task creation logic here (failing because of an error? check
the last part) was faster than the wait_for_threads() logic could
expect... causing a race in the test's logic.
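
To make the suspected race concrete, this is the shape of the logic I
mean (an illustrative sketch with a hypothetical helper, not the actual
LTP code): the parent scans /proc/<pid>/task and checks each thread's
state. If a thread errors out and exits before it is ever observed
sleeping, the expected TID never shows up in the scan and the "not
sleeping yet" loop can spin forever.

#include <dirent.h>
#include <stdio.h>
#include <unistd.h>

/* Count child threads currently in state 'S' (sleeping). */
static int count_sleeping_threads(void)
{
        char path[64], state;
        struct dirent *de;
        int sleeping = 0;
        DIR *dir;
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%d/task", getpid());
        dir = opendir(path);
        if (!dir)
                return -1;

        while ((de = readdir(dir))) {
                if (de->d_name[0] == '.')
                        continue;

                snprintf(path, sizeof(path), "/proc/%d/task/%s/stat",
                         getpid(), de->d_name);
                f = fopen(path, "r");
                if (!f)         /* thread already gone: this is the race */
                        continue;
                /* stat format: "tid (comm) state ...", comm w/o spaces */
                if (fscanf(f, "%*d %*s %c", &state) == 1 && state == 'S')
                        sleeping++;
                fclose(f);
        }
        closedir(dir);
        return sleeping;
}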

* This would explain the timeout error we had, right ? *

Since stdout/stderr are being redirected to a file: @Naresh, is it
possible to get the output from the stdout/err files from the same LKFT
URL you showed?

* Wait, there is more ... *

While running the test, and monitoring meminfo, I could see the 
HugePages_Total oscillating:

HugePages_Total:       1
HugePages_Free:        1
HugePages_Rsvd:        0
HugePages_Surp:        0

HugePages_Total:       1
HugePages_Free:        1
HugePages_Rsvd:        0
HugePages_Surp:        0

HugePages_Total:      54
HugePages_Free:       53
HugePages_Rsvd:        0
HugePages_Surp:        0

HugePages_Total:       1
HugePages_Free:        1
HugePages_Rsvd:        0
HugePages_Surp:        0

HugePages_Total:       1
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0

HugePages_Total:     101
HugePages_Free:      101
HugePages_Rsvd:        0
HugePages_Surp:        0

HugePages_Total:     128
HugePages_Free:      128
HugePages_Rsvd:        0
HugePages_Surp:        0

HugePages_Total:       1
HugePages_Free:        1
HugePages_Rsvd:        0
HugePages_Surp:        0

HugePages_Total:       1
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0

despite the fact that I had configured:

(k)inaddy@bug3984:~$ sudo hugeadm --hard --pool-pages-min 4M:128
(k)inaddy@bug3984:~$ sudo hugeadm --hard --pool-pages-max 4M:128

And HugePages_Total was "128" at the beginning of the test. The total
number of hugepages oscillates during a successful test and, at the very
end, HugePages_Total goes back to 128. I disabled THP completely after
this and re-ran the test with the same result.

Reading hugetlb_report_meminfo() in hugetlb.c, I am not sure (struct
hstate *h)->nr_huge_pages should change that much (through new hugepages
being requested, sysctl, or NUMA balancing, among other things). I'll
try to trace this tomorrow to see who is touching HugePages_Total during
the test.
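
For context, the values in /proc/meminfo come straight from the hstate
counters, so an oscillating HugePages_Total really means something is
changing h->nr_huge_pages underneath the test. Roughly the shape of the
function (paraphrased from mm/hugetlb.c around v4.18, not verbatim):

void hugetlb_report_meminfo(struct seq_file *m)
{
        struct hstate *h = &default_hstate;

        seq_printf(m,
                   "HugePages_Total:   %5lu\n"
                   "HugePages_Free:    %5lu\n"
                   "HugePages_Rsvd:    %5lu\n"
                   "HugePages_Surp:    %5lu\n",
                   h->nr_huge_pages,
                   h->free_huge_pages,
                   h->resv_huge_pages,
                   h->surplus_huge_pages);
}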

* Theory *

HugePages_Total fluctuates and causes one pthread to hit errors before
the test's readdir() logic on /proc/<pid>/task/X catches whether the
thread went to sleep, causing a race in the readdir() logic.

* Obs *

I still need to finish this last part of the investigation. I haven't
tried other architectures yet, only x86 with a non-PAE kernel. I have
tested a 4.18 kernel (Debian) and latest mainline, and both reproduce
this.

Will get back to this soon...



