[LTP] [PATCH] ksm: fix segfault on s390

Luiz Capitulino luizcap@redhat.com
Wed May 21 19:10:26 CEST 2025


On 2025-05-21 10:08, Li Wang wrote:
> Hi Luiz,
> 
> This is a good catch, thank you, comment inline below.

Hi Li, thanks for taking a look (answers below).

> 
> On Wed, May 21, 2025 at 4:29 PM Luiz Capitulino via ltp
> <ltp@lists.linux.it> wrote:
>>
>> Recently, we started seeing the following segfault when running ksm01
>> and ksm02 tests on an s390 KSM guest:
>>
>> """
>> [  119.302817] User process fault: interruption code 0011 ilc:3 in libc.so.6[b14ae,3ff91500000+1c9000]
>> [  119.302824] Failing address: 000003ff91400000 TEID: 000003ff91400800
>> [  119.302826] Fault in primary space mode while using user ASCE.
>> [  119.302828] AS:0000000084bec1c7 R3:00000000824cc007 S:0000000081a28001 P:0000000000000400
>> [  119.302833] CPU: 0 UID: 0 PID: 5578 Comm: ksm01 Kdump: loaded Not tainted 6.15.0-rc6+ #8 NONE
>> [  119.302837] Hardware name: IBM 3931 LA1 400 (KVM/Linux)
>> [  119.302839] User PSW : 0705200180000000 000003ff915b14ae
>> [  119.302841]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:1 AS:0 CC:2 PM:0 RI:0 EA:3
>> [  119.302843] User GPRS: cccccccccccccccd 000000000007efff 000003ff91400000 000003ff814ff010
>> [  119.302845]            0000000007ffffff 0000000000000000 0000000000000000 000003ff00000000
>> [  119.302847]            0000000000000063 0000000000100000 00000000023db500 0000000008000000
>> [  119.302848]            0000000000000063 0000000000000080 00000000010066da 000003ffd7777e20
>> [  119.302855] User Code: 000003ff915b149e: a784ffee            brc     8,000003ff915b147a
>>                            000003ff915b14a2: e31032000036        pfd     1,512(%r3)
>>                           #000003ff915b14a8: e31022000036        pfd     1,512(%r2)
>>                           >000003ff915b14ae: d5ff30002000        clc     0(256,%r3),0(%r2)
>>                            000003ff915b14b4: a784ffef            brc     8,000003ff915b1492
>>                            000003ff915b14b8: b2220020            ipm     %r2
>>                            000003ff915b14bc: eb220022000d        sllg    %r2,%r2,34
>>                            000003ff915b14c2: eb22003e000a        srag    %r2,%r2,62
>> [  119.302867] Last Breaking-Event-Address:
>> [  119.302868]  [<000003ff915b14b4>] libc.so.6[b14b4,3ff91500000+1c9000]
>> """
>>
>> This segfault is triggered by the memcmp() call in verify():
>>
>> """
>> memcmp(memory[start], s, (end - start) * (end2 - start2))
>> """
>>
>> In the default case, this call checks if the memory area starting in
>> memory[0] (since start=0 by default) matches 's' for 128MB. IOW, this
>> assumes that the memory areas in memory[] are contiguous. This is wrong,
>> since create_ksm_child() allocates 128 individual areas of 1MB each. As,
>> in this particular case, memory[0] happens to be the last 1MB area in
>> the VMA created by the kernel, we hit a segfault at the first byte beyond
>> memory[0].
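>>
>> To illustrate (a minimal sketch with made-up names, not the actual
>> create_ksm_child() code): each area is mapped on its own, so nothing
>> guarantees that the byte right after memory[0] belongs to any mapping
>> at all:
>>
>> """
>> #include <sys/mman.h>
>>
>> #define AREAS		128
>> #define AREA_SIZE	(1024 * 1024)	/* 1MB per area */
>>
>> static char *memory[AREAS];
>>
>> static void alloc_areas(void)
>> {
>> 	int j;
>>
>> 	/* 128 independent mappings, not one contiguous 128MB region;
>> 	 * error handling omitted for brevity */
>> 	for (j = 0; j < AREAS; j++)
>> 		memory[j] = mmap(NULL, AREA_SIZE, PROT_READ | PROT_WRITE,
>> 				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>> }
>>
>> /* WRONG: may read far past the end of memory[0]'s 1MB mapping */
>> /* memcmp(memory[0], s, AREAS * AREA_SIZE); */
>>
>> /* OK: never compares more than one area at a time */
>> /* memcmp(memory[j], s, AREA_SIZE); */
>> """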
>>
>> Now, the question is how this has worked for so long and why it may still
>> work on arm64 and x86 (even on s390 it occasionally works).
>>
>> For the s390 case, the reason is upstream kernel commit efa7df3e3bb5
>> ("mm: align larger anonymous mappings on THP boundaries"). Before this
>> commit, the kernel would always map a library right after the memory[0]
>> area in the process address space. This causes memcmp() to return
>> non-zero when reading the first byte beyond memory[0], which in turn
>> causes the nested loop in verify() to execute. The nested loop is correct
>> (ie. it doesn't assume the memory areas in memory[] are contiguous) so
>> the test doesn't fail. The mentioned upstream commit causes the first byte
>> beyond memory[0] not to be mapped most of the time on s390, which may
>> result in a segfault.
>>
>> Now, as it turns out, on arm64 and x86 the kernel still maps a library right
>> after memory[0], which causes the test to succeed as explained above (this
>> can be easily verified by printing the return value for memcmp()).
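>>
>> Another way to see the difference, without relying on memcmp()'s return
>> value, is to probe whether the page right after memory[0] is mapped at
>> all, e.g. with mincore(). A sketch (probe_next_page() is only an
>> illustration, not test code):
>>
>> """
>> #include <errno.h>
>> #include <stdio.h>
>> #include <sys/mman.h>
>>
>> /* 'area' comes from mmap(), so 'area + area_size' is page-aligned */
>> static void probe_next_page(char *area, size_t area_size)
>> {
>> 	unsigned char vec;
>> 	char *next = area + area_size;	/* first byte beyond the area */
>>
>> 	/* mincore() fails with ENOMEM if the probed page is not mapped */
>> 	if (mincore(next, 1, &vec) == -1 && errno == ENOMEM)
>> 		printf("%p unmapped: reading past the area faults\n", next);
>> 	else
>> 		printf("%p mapped (e.g. a library): memcmp() just reads it\n", next);
>> }
>> """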
>>
>> This commit fixes verify() to do a byte-by-byte check on each individual
>> memory area. This also simplifies verify() a lot, which is what we want
>> in order to avoid this kind of issue in the future.
>>
>> Signed-off-by: Luiz Capitulino <luizcap@redhat.com>
>> ---
>>   testcases/kernel/mem/ksm/ksm_test.h | 21 +++++++--------------
>>   1 file changed, 7 insertions(+), 14 deletions(-)
>>
>> diff --git a/testcases/kernel/mem/ksm/ksm_test.h b/testcases/kernel/mem/ksm/ksm_test.h
>> index 0db759d5a..cbad147d4 100644
>> --- a/testcases/kernel/mem/ksm/ksm_test.h
>> +++ b/testcases/kernel/mem/ksm/ksm_test.h
>> @@ -74,22 +74,15 @@ static inline void verify(char **memory, char value, int proc,
>>                      int start, int end, int start2, int end2)
>>   {
>>          int i, j;
>> -       void *s = NULL;
>> -
>> -       s = SAFE_MALLOC((end - start) * (end2 - start2));
>>
>>          tst_res(TINFO, "child %d verifies memory content.", proc);
>> -       memset(s, value, (end - start) * (end2 - start2));
>> -       if (memcmp(memory[start], s, (end - start) * (end2 - start2))
>> -           != 0)
>> -               for (j = start; j < end; j++)
>> -                       for (i = start2; i < end2; i++)
>> -                               if (memory[j][i] != value)
>> -                                       tst_res(TFAIL, "child %d has %c at "
>> -                                                "%d,%d,%d.",
>> -                                                proc, memory[j][i], proc,
>> -                                                j, i);
>> -       free(s);
>> +
>> +       for (j = start; j < end; j++)
>> +               for (i = start2; i < end2; i++)
>> +                       if (memory[j][i] != value)
>> +                               tst_res(TFAIL, "child %d has %c at "
>> +                                       "%d,%d,%d.",
>> +                                       proc, memory[j][i], proc, j, i);
>>   }
> 
> Or, can we optimize the verify() function by using memcmp() per memory
> block, rather than falling back to the slow nested loop that checks each
> byte individually?

As I understand it, the nested loop is there to tell us which byte failed
the verification. And that's good information, IMHO.

Now, as for keeping both, as the code is written today: why should we do it?
Meaning, what does the optimization intend to improve? Is it test run-time?
If yes, do we have measurements to justify it?

Also, for the kernels that I tested (with my particular .config), that code
was always falling through to the nested loop anyway after memcmp() failed
for the 1MB area. So, if there's any perceived difference, this patch should
make it faster, not slower.

IMO, test code should be simple and direct.

- Luiz


> 
> Something like:
> ------------------
> 
> ...
> char *expected = SAFE_MALLOC(end2 - start2);
> memset(expected, value, end2 - start2);
> 
> for (j = start; j < end; j++) {
>      if (memcmp(&memory[j][start2], expected, end2 - start2) != 0)
>      ...
> 
> 
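
For completeness, a fleshed-out version of the per-block variant sketched
above could look like the following. This is only a sketch, not a proposed
patch; it assumes the existing ksm_test.h helpers (SAFE_MALLOC(), tst_res())
and keeps the byte loop for a mismatching block, so the failing offset is
still reported:

"""
static inline void verify(char **memory, char value, int proc,
			  int start, int end, int start2, int end2)
{
	int i, j;
	char *expected = SAFE_MALLOC(end2 - start2);

	memset(expected, value, end2 - start2);

	tst_res(TINFO, "child %d verifies memory content.", proc);

	for (j = start; j < end; j++) {
		/* fast path: compare one whole area at a time */
		if (memcmp(&memory[j][start2], expected, end2 - start2) == 0)
			continue;

		/* slow path: pinpoint the byte(s) that differ */
		for (i = start2; i < end2; i++)
			if (memory[j][i] != value)
				tst_res(TFAIL, "child %d has %c at %d,%d,%d.",
					proc, memory[j][i], proc, j, i);
	}

	free(expected);
}
"""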


