[LTP] [PATCH] ksm: fix segfault on s390
Luiz Capitulino
luizcap@redhat.com
Thu May 22 16:08:00 CEST 2025
On 2025-05-22 01:29, Li Wang wrote:
> On Thu, May 22, 2025 at 7:27 AM Li Wang <liwang@redhat.com> wrote:
>>
>> On Thu, May 22, 2025 at 1:10 AM Luiz Capitulino <luizcap@redhat.com> wrote:
>>>
>>> On 2025-05-21 10:08, Li Wang wrote:
>>>> Hi Luiz,
>>>>
>>>> This is a good catch, thank you, comment inline below.
>>>
>>> Hi Li, thanks for taking a look (answers below).
>>>
>>>>
>>>> On Wed, May 21, 2025 at 4:29 PM Luiz Capitulino via ltp
>>>> <ltp@lists.linux.it> wrote:
>>>>>
>>>>> Recently, we started seeing the following segfault when running ksm01
>>>>> and ksm02 tests on an s390 KSM guest:
>>>>>
>>>>> """
>>>>> [ 119.302817] User process fault: interruption code 0011 ilc:3 in libc.so.6[b14ae,3ff91500000+1c9000]
>>>>> [ 119.302824] Failing address: 000003ff91400000 TEID: 000003ff91400800
>>>>> [ 119.302826] Fault in primary space mode while using user ASCE.
>>>>> [ 119.302828] AS:0000000084bec1c7 R3:00000000824cc007 S:0000000081a28001 P:0000000000000400
>>>>> [ 119.302833] CPU: 0 UID: 0 PID: 5578 Comm: ksm01 Kdump: loaded Not tainted 6.15.0-rc6+ #8 NONE
>>>>> [ 119.302837] Hardware name: IBM 3931 LA1 400 (KVM/Linux)
>>>>> [ 119.302839] User PSW : 0705200180000000 000003ff915b14ae
>>>>> [ 119.302841] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:1 AS:0 CC:2 PM:0 RI:0 EA:3
>>>>> [ 119.302843] User GPRS: cccccccccccccccd 000000000007efff 000003ff91400000 000003ff814ff010
>>>>> [ 119.302845] 0000000007ffffff 0000000000000000 0000000000000000 000003ff00000000
>>>>> [ 119.302847] 0000000000000063 0000000000100000 00000000023db500 0000000008000000
>>>>> [ 119.302848] 0000000000000063 0000000000000080 00000000010066da 000003ffd7777e20
>>>>> [ 119.302855] User Code: 000003ff915b149e: a784ffee brc 8,000003ff915b147a
>>>>> 000003ff915b14a2: e31032000036 pfd 1,512(%r3)
>>>>> #000003ff915b14a8: e31022000036 pfd 1,512(%r2)
>>>>> >000003ff915b14ae: d5ff30002000 clc 0(256,%r3),0(%r2)
>>>>> 000003ff915b14b4: a784ffef brc 8,000003ff915b1492
>>>>> 000003ff915b14b8: b2220020 ipm %r2
>>>>> 000003ff915b14bc: eb220022000d sllg %r2,%r2,34
>>>>> 000003ff915b14c2: eb22003e000a srag %r2,%r2,62
>>>>> [ 119.302867] Last Breaking-Event-Address:
>>>>> [ 119.302868] [<000003ff915b14b4>] libc.so.6[b14b4,3ff91500000+1c9000]
>>>>> """
>>>>>
>>>>> This segfault is triggered by the memcmp() call in verify():
>>>>>
>>>>> """
>>>>> memcmp(memory[start], s, (end - start) * (end2 - start2))
>>>>> """
>>>>>
>>>>> In the default case, this call checks whether the memory area starting at
>>>>> memory[0] (since start=0 by default) matches 's' for 128MB. IOW, this
>>>>> assumes that the memory areas in memory[] are contiguous. This is wrong,
>>>>> since create_ksm_child() allocates 128 individual areas of 1MB each. As,
>>>>> in this particular case, memory[0] happens to be the last 1MB area in
>>>>> the VMA created by the kernel, we hit a segfault at the first byte beyond
>>>>> memory[0].
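
(Side note, to make the layout easier to picture: below is a rough sketch of
what the commit message describes, not the actual LTP code -- the names, mmap
flags and constants are only illustrative.)

"""
#include <sys/mman.h>

#define AREAS		128			/* 128 individual areas...	*/
#define AREA_SIZE	(1 * 1024 * 1024)	/* ...of 1MB each		*/

static char *memory[AREAS];

static void alloc_areas(void)
{
	int i;

	/* Each area is its own anonymous mapping; nothing guarantees that
	 * memory[i] + AREA_SIZE == memory[i + 1] in the address space. */
	for (i = 0; i < AREAS; i++)
		memory[i] = mmap(NULL, AREA_SIZE, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}

/*
 * The old check in verify() did, in the default case, the equivalent of:
 *
 *	memcmp(memory[0], s, AREAS * AREA_SIZE);
 *
 * i.e. it read 128MB starting at memory[0], even though memory[0] only
 * covers a single 1MB mapping -- everything past the first area is
 * whatever (if anything) happens to be mapped there.
 */
"""
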
>>>>>
>>>>> Now, the question is how this has worked for so long and why it may still
>>>>> work on arm64 and x86 (even on s390 it occasionally works).
>>>>>
>>>>> For the s390 case, the reason is upstream kernel commit efa7df3e3bb5
>>>>> ("mm: align larger anonymous mappings on THP boundaries"). Before this
>>>>> commit, the kernel would always map a library right after the memory[0]
>>>>> area in the process address space. This causes memcmp() to return
>>>>> non-zero when reading the first byte beyond memory[0], which in turn
>>>>> causes the nested loop in verify() to execute. The nested loop is correct
>>>>> (ie. it doesn't assume the memory areas in memory[] are contiguous) so
>>>>> the test doesn't fail. The mentioned upstream commit causes the first byte
>>>>> beyond memory[0] not to be mapped most of the time on s390, which may
>>>>> result in a segfault.
>>>>>
>>>>> Now, as it turns out, on arm64 and x86 the kernel still maps a library right
>>>>> after memory[0], which causes the test to succeed as explained above (this
>>>>> can easily be verified by printing the return value of memcmp()).
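
(If you want to double-check the mapping layout itself rather than memcmp()'s
return value, one quick debugging aid -- not something I'm proposing to add to
the test -- is to ask the kernel whether the page right after a given area is
mapped at all:)

"""
#include <errno.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>

/* mincore() fails with ENOMEM when the range contains unmapped addresses.
 * 'area' is assumed to be page-aligned (mmap() guarantees that) and
 * 'area_size' to be a multiple of the page size (1MB is). */
static void check_next_page(char *area, size_t area_size)
{
	unsigned char vec;

	if (mincore(area + area_size, getpagesize(), &vec) == -1 &&
	    errno == ENOMEM)
		printf("page right after the area is NOT mapped\n");
	else
		printf("page right after the area is mapped\n");
}
"""
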
>>>>>
>>>>> This commit fixes verify() to do a byte-by-byte check on each individual
>>>>> memory area. This also simplifies verify() a lot, which is what we want in
>>>>> order to avoid this kind of issue in the future.
>>>>>
>>>>> Signed-off-by: Luiz Capitulino <luizcap@redhat.com>
>>>>> ---
>>>>> testcases/kernel/mem/ksm/ksm_test.h | 21 +++++++--------------
>>>>> 1 file changed, 7 insertions(+), 14 deletions(-)
>>>>>
>>>>> diff --git a/testcases/kernel/mem/ksm/ksm_test.h b/testcases/kernel/mem/ksm/ksm_test.h
>>>>> index 0db759d5a..cbad147d4 100644
>>>>> --- a/testcases/kernel/mem/ksm/ksm_test.h
>>>>> +++ b/testcases/kernel/mem/ksm/ksm_test.h
>>>>> @@ -74,22 +74,15 @@ static inline void verify(char **memory, char value, int proc,
>>>>> int start, int end, int start2, int end2)
>>>>> {
>>>>> int i, j;
>>>>> - void *s = NULL;
>>>>> -
>>>>> - s = SAFE_MALLOC((end - start) * (end2 - start2));
>>>>>
>>>>> tst_res(TINFO, "child %d verifies memory content.", proc);
>>>>> - memset(s, value, (end - start) * (end2 - start2));
>>>>> - if (memcmp(memory[start], s, (end - start) * (end2 - start2))
>>>>> - != 0)
>>>>> - for (j = start; j < end; j++)
>>>>> - for (i = start2; i < end2; i++)
>>>>> - if (memory[j][i] != value)
>>>>> - tst_res(TFAIL, "child %d has %c at "
>>>>> - "%d,%d,%d.",
>>>>> - proc, memory[j][i], proc,
>>>>> - j, i);
>>>>> - free(s);
>>>>> +
>>>>> + for (j = start; j < end; j++)
>>>>> + for (i = start2; i < end2; i++)
>>>>> + if (memory[j][i] != value)
>>>>> + tst_res(TFAIL, "child %d has %c at "
>>>>> + "%d,%d,%d.",
>>>>> + proc, memory[j][i], proc, j, i);
>>>>> }
>>>>
>>>> Or, can we optimize the verify() function by using memcmp() per memory
>>>> block, rather than falling back to the slow nested loop that checks each
>>>> byte individually?
>>>
>>> As I understand it, the nested loop is there to tell us which byte failed
>>> the verification. And that's good information, IMHO.
>>>
>>> Now, as for keeping both, as the code does today, why should we do it?
>>> Meaning, what is the optimization intended to improve? Is it test run-time?
>>> If yes, do we have measurements to justify it?
>>
>> The original design uses memcmp() for bulk memory comparisons,
>> which is fast and optimized. When memcmp() fails, it falls back to a
>> nested loop that performs byte-by-byte comparisons to provide detailed
>> failure information.
>>
>> However, as you've pointed out, this approach is flawed because it
>> assumes that the memory starting at memory[start] is a contiguous
>> region, which it is not. Each memory[start] points to a separately
>> allocated memory block (via mmap()), so treating them as a single
>> contiguous block leads to undefined behavior, and on some architectures
>> like s390, it even causes a segmentation fault.
>>
>> That said, we can still retain the performance benefits of bulk comparison
>> by using memcmp() in a block-by-block manner: checking each block starting
>> at &memory[j][start2] individually. Since each memory[j] points to a
>> contiguous region (e.g., 1MB), using memcmp() within each block is both
>> safe and efficient.
>>
>> This would allow us to:
>> - Preserve correctness across all platforms and memory layouts
>> - Avoid unnecessary per-byte comparisons when memory is correct
>> - Provide detailed diagnostics only when a mismatch is detected
>>
>> So yes, while the original design had a performance goal in mind,
>> a refined per-block check using memcmp() can achieve similar speed
>> benefits without sacrificing correctness or portability.
>>
>> WDYT?
>>
>> --- a/testcases/kernel/mem/ksm/ksm_test.h
>> +++ b/testcases/kernel/mem/ksm/ksm_test.h
>> @@ -74,15 +74,28 @@ static inline void verify(char **memory, char value, int proc,
>> int start, int end, int start2, int end2)
>> {
>> int i, j;
>> + size_t block_size = end2 - start2;
>> + char *expected = SAFE_MALLOC(block_size);
>> +
>> + memset(expected, value, block_size);
>>
>> tst_res(TINFO, "child %d verifies memory content.", proc);
>>
>> - for (j = start; j < end; j++)
>> - for (i = start2; i < end2; i++)
>> - if (memory[j][i] != value)
>> - tst_res(TFAIL, "child %d has %c at "
>> - "%d,%d,%d.",
>> - proc, memory[j][i], proc, j, i);
>> + for (j = start; j < end; j++) {
>> + if (memcmp(&memory[j][start2], expected, block_size) != 0) {
>> +
>> +                        tst_res(TINFO, "====> DEBUG: usually not reach here <===");
>> +
>> + for (i = start2; i < end2; i++) {
>> + if (memory[j][i] != value) {
>> +                                tst_res(TFAIL, "child %d has %c at %d,%d,%d.",
>> + proc, memory[j][i], proc, j, i);
>> + }
>> + }
>> + }
>> + }
>> +
>> + free(expected);
>> }
>>
>> struct ksm_merge_data {
>
> I might be a bit too picky :). So I compared the two approaches on a
> 2-CPU KVM x86_64 system:
>
> Per-block checking cost time:
> real 0m5.862s
> user 0m1.098s
> sys 0m1.505s
>
> Per-byte checking cost time:
> real 0m6.819s
> user 0m2.498s
> sys 0m1.495s
>
> From the data, block-by-block checking can reduce the total execution
> time by about 14% and CPU usage by more than 35%, especially
> in user space. The numbers may not be large, but considering
> that tests are frequently run in CI, I think it would be a good thing if we
> could save 1 second each time :).

Just to make sure I understand: you measured total test run-time, correct?
How many times did you run it?

In any case, I'm not sure a 1 second difference in run-time (or even in CPU
utilization) matters that much. This is test code; you shouldn't expect
otherwise unless you hit a really bad case (say, something that takes several
hours to complete).

The trade-off is more complex code, with bugs that can hide for 10+ years and
take developer time to debug. Also, higher memory utilization: 's' doubles the
memory usage per child only to do that check.
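(With the defaults from the commit message -- 128 areas of 1MB each -- that is
an extra 128MB malloc()'d per child just for that single comparison.)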

So, I suggest we stick to the simpler code. Or, get this fix merged now (since
it fixes a bug and possibly makes the code _faster_) and then you can optimize
on top of it later if you like.