[LTP] [bug] problems with migration of huge pages with v4.20-10214-ge1ef035d272e

Mike Kravetz <mike.kravetz@oracle.com>
Thu Jan 3 02:44:33 CET 2019


On 1/2/19 1:24 PM, Mike Kravetz wrote:
> On 1/2/19 12:30 PM, Jan Stancek wrote:
>> Hi,
>>
>> LTP move_pages12 [1] started failing recently.
>>
>> The test maps/unmaps some anonymous private huge pages
>> and migrates them between 2 nodes. This now reliably
>> hits NULL ptr deref:
>>
>> [  194.819357] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
>> [  194.864410] #PF error: [WRITE]
>> [  194.881502] PGD 22c758067 P4D 22c758067 PUD 235177067 PMD 0
>> [  194.913833] Oops: 0002 [#1] SMP NOPTI
>> [  194.935062] CPU: 0 PID: 865 Comm: move_pages12 Not tainted 4.20.0+ #1
>> [  194.972993] Hardware name: HP ProLiant SL335s G7/, BIOS A24 12/08/2012
>> [  195.005359] RIP: 0010:down_write+0x1b/0x40
>> [  195.028257] Code: 00 5c 01 00 48 83 c8 03 48 89 43 20 5b c3 90 0f 1f 44 00 00 53 48 89 fb e8 d2 d7 ff ff 48 89 d8 48 ba 01 00 00 00 ff ff ff ff <f0> 48 0f c1 10 85 d2 74 05 e8 07 26 ff ff 65 48 8b 04 25 00 5c 01
>> [  195.121836] RSP: 0018:ffffb87e4224fd00 EFLAGS: 00010246
>> [  195.147097] RAX: 0000000000000030 RBX: 0000000000000030 RCX: 0000000000000000
>> [  195.185096] RDX: ffffffff00000001 RSI: ffffffffa69d30f0 RDI: 0000000000000030
>> [  195.219251] RBP: 0000000000000030 R08: ffffe7d4889d8008 R09: 0000000000000003
>> [  195.258291] R10: 000000000000000f R11: ffffe7d4889d8008 R12: ffffe7d4889d0008
>> [  195.294547] R13: ffffe7d490b78000 R14: ffffe7d4889d0000 R15: ffff8be9b2ba4580
>> [  195.332532] FS:  00007f1670112b80(0000) GS:ffff8be9b7a00000(0000) knlGS:0000000000000000
>> [  195.373888] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [  195.405938] CR2: 0000000000000030 CR3: 000000023477e000 CR4: 00000000000006f0
>> [  195.443579] Call Trace:
>> [  195.456876]  migrate_pages+0x833/0xcb0
>> [  195.478070]  ? __ia32_compat_sys_migrate_pages+0x20/0x20
>> [  195.506027]  do_move_pages_to_node.isra.63.part.64+0x2a/0x50
>> [  195.536963]  kernel_move_pages+0x667/0x8c0
>> [  195.559616]  ? __handle_mm_fault+0xb95/0x1370
>> [  195.588765]  __x64_sys_move_pages+0x24/0x30
>> [  195.611439]  do_syscall_64+0x5b/0x160
>> [  195.631901]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> [  195.657790] RIP: 0033:0x7f166f5ff959
>> [  195.676365] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 17 45 2c 00 f7 d8 64 89 01 48
>> [  195.772938] RSP: 002b:00007ffd8d77bb48 EFLAGS: 00000246 ORIG_RAX: 0000000000000117
>> [  195.810207] RAX: ffffffffffffffda RBX: 0000000000000400 RCX: 00007f166f5ff959
>> [  195.847522] RDX: 0000000002303400 RSI: 0000000000000400 RDI: 0000000000000360
>> [  195.882327] RBP: 0000000000000400 R08: 0000000002306420 R09: 0000000000000004
>> [  195.920017] R10: 0000000002305410 R11: 0000000000000246 R12: 0000000002303400
>> [  195.958053] R13: 0000000002305410 R14: 0000000002306420 R15: 0000000000000003
>> [  195.997028] Modules linked in: sunrpc amd64_edac_mod ipmi_ssif edac_mce_amd kvm_amd ipmi_si igb ipmi_devintf k10temp kvm pcspkr ipmi_msghandler joydev irqbypass sp5100_tco dca hpwdt hpilo i2c_piix4 xfs libcrc32c radeon i2c_algo_bit drm_kms_helper ttm ata_generic pata_acpi drm serio_raw pata_atiixp
>> [  196.134162] CR2: 0000000000000030
>> [  196.152788] ---[ end trace 4420ea5061342d3e ]---
>>
>> Suspected commit is:
>>   b43a99900559 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
>> which adds to unmap_and_move_huge_page():
>> +               struct address_space *mapping = page_mapping(hpage);
>> +
>> +               /*
>> +                * try_to_unmap could potentially call huge_pmd_unshare.
>> +                * Because of this, take semaphore in write mode here and
>> +                * set TTU_RMAP_LOCKED to let lower levels know we have
>> +                * taken the lock.
>> +                */
>> +               i_mmap_lock_write(mapping);
>>
>> If I'm reading this right, 'mapping' will be NULL for anon mappings.
> 
> Not exactly.
> 

Well, yes exactly.  Sorry, I was confusing this with something else.

That commit does cause BUGs for migration and page poisoning of anon huge
pages.  The patch was trying to take care of i_mmap_rwsem locking outside
the try_to_unmap infrastructure.  This is because try_to_unmap takes the
semaphore in read mode (for file mappings), and we really need it to be
taken in write mode in case huge_pmd_unshare is called.
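
For reference, the failing pattern can be reproduced outside of LTP with
something like the sketch below (my simplification of what move_pages12
exercises, not the actual test; huge page size, page count and target
node are assumptions).  It maps anonymous private huge pages, faults
them in, and asks move_pages(2) to migrate them, which is enough to
reach unmap_and_move_huge_page() with a NULL page_mapping():

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <numaif.h>		/* move_pages(), link with -lnuma */

#define HPAGE_SIZE	(2UL * 1024 * 1024)	/* assumes 2MB huge pages */
#define NR_PAGES	4

int main(void)
{
	void *pages[NR_PAGES];
	int nodes[NR_PAGES], status[NR_PAGES], i;
	char *area;

	area = mmap(NULL, NR_PAGES * HPAGE_SIZE, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (area == MAP_FAILED) {
		perror("mmap");	/* needs preallocated huge pages */
		return 1;
	}

	for (i = 0; i < NR_PAGES; i++) {
		/* Touch each page so it faults in as an anon huge page. */
		memset(area + i * HPAGE_SIZE, 0, 1);
		pages[i] = area + i * HPAGE_SIZE;
		nodes[i] = 1;	/* migration target; assumes node 1 exists */
	}

	/* pid 0 means the calling process. */
	if (move_pages(0, NR_PAGES, pages, nodes, status, MPOL_MF_MOVE))
		perror("move_pages");

	munmap(area, NR_PAGES * HPAGE_SIZE);
	return 0;
}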

The patch below continues to take the semaphore outside try_to_unmap for
the file mapping case.  For anon mappings, the locking is done as a special
case in try_to_unmap_one.  This is something I was trying to avoid, as it
is harder to follow/understand.  Any suggestions on how to restructure this
or make it more clear are welcome.
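
To make the resulting locking easier to follow, the call chain with the
patch applied looks roughly like this (a pseudocode summary of the diff
below, not actual source):

unmap_and_move_huge_page()
	mapping = page_mapping(hpage);	/* NULL for anon huge pages */
	if (mapping) {			/* file mapping */
		i_mmap_lock_write(mapping);
		try_to_unmap(hpage, ...|TTU_RMAP_LOCKED);
		i_mmap_unlock_write(mapping);
	} else {			/* anon mapping */
		try_to_unmap(hpage, ...);
		  -> try_to_unmap_one()
			i_mmap_lock_write(vma->vm_file->f_mapping);
			/* huge_pmd_unshare() may be called here */
			i_mmap_unlock_write(vma->vm_file->f_mapping);
	}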

Adding Andrew on Cc as he already sent the commit causing the BUGs upstream.

From: Mike Kravetz <mike.kravetz@oracle.com>

hugetlbfs: fix migration and poisoning of anon huge pages

Expanded use of i_mmap_rwsem for pmd sharing synchronization incorrectly
used page_mapping() of anon huge pages to get to the address_space
i_mmap_rwsem.  Since page_mapping() is NULL for pages of anon mappings,
an "unable to handle kernel NULL pointer" BUG would occur with a stack
similar to:

RIP: 0010:down_write+0x1b/0x40
Call Trace:
 migrate_pages+0x81f/0xb90
 __ia32_compat_sys_migrate_pages+0x190/0x190
 do_move_pages_to_node.isra.53.part.54+0x2a/0x50
 kernel_move_pages+0x566/0x7b0
 __x64_sys_move_pages+0x24/0x30
 do_syscall_64+0x5b/0x180
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

To fix, only use page_mapping() for file (non-anon) pages.  For anon
pages, wait until we find a vma in which the page is mapped and get the
address_space from vm_file.

Fixes: b43a99900559 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/memory-failure.c | 15 +++++++++++----
 mm/migrate.c        | 34 +++++++++++++++++++++++-----------
 mm/rmap.c           | 15 ++++++++++++---
 3 files changed, 46 insertions(+), 18 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 93558fb981fb..f229cbd0b347 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1032,14 +1032,21 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
 		unmap_success = try_to_unmap(hpage, ttu);
 	} else if (mapping) {
 		/*
-		 * For hugetlb pages, try_to_unmap could potentially call
-		 * huge_pmd_unshare.  Because of this, take semaphore in
-		 * write mode here and set TTU_RMAP_LOCKED to indicate we
-		 * have taken the lock at this higer level.
+		 * For file mappings, take semaphore in write mode here and
+		 * set TTU_RMAP_LOCKED to let lower levels know we have taken
+		 * the lock.  This is in case lower levels call
+		 * huge_pmd_unshare.  Without this, try_to_unmap would only
+		 * take the semaphore in read mode.
 		 */
 		i_mmap_lock_write(mapping);
 		unmap_success = try_to_unmap(hpage, ttu|TTU_RMAP_LOCKED);
 		i_mmap_unlock_write(mapping);
+	} else {
+		/*
+		 * For huge page anon mappings, try_to_unmap_one will take the
+		 * i_mmap_rwsem before calling huge_pmd_unshare if necessary.
+		 */
+		unmap_success = try_to_unmap(hpage, ttu);
 	}
 	if (!unmap_success)
 		pr_err("Memory failure: %#lx: failed to unmap page (mapcount=%d)\n",
diff --git a/mm/migrate.c b/mm/migrate.c
index 725edaef238a..45d7dd0c9479 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1309,17 +1309,29 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
 	if (page_mapped(hpage)) {
 		struct address_space *mapping = page_mapping(hpage);

-		/*
-		 * try_to_unmap could potentially call huge_pmd_unshare.
-		 * Because of this, take semaphore in write mode here and
-		 * set TTU_RMAP_LOCKED to let lower levels know we have
-		 * taken the lock.
-		 */
-		i_mmap_lock_write(mapping);
-		try_to_unmap(hpage,
-			TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS|
-			TTU_RMAP_LOCKED);
-		i_mmap_unlock_write(mapping);
+		if (mapping) {
+			/*
+			 * For file mappings, take semaphore in write mode here
+			 * and set TTU_RMAP_LOCKED to let lower levels know we
+			 * have taken the lock.  This is in case lower levels
+			 * call huge_pmd_unshare.  Without this, try_to_unmap
+			 * would only take the semaphore in read mode.
+			 */
+			i_mmap_lock_write(mapping);
+			try_to_unmap(hpage,
+				TTU_MIGRATION|TTU_IGNORE_MLOCK|
+				TTU_IGNORE_ACCESS|TTU_RMAP_LOCKED);
+			i_mmap_unlock_write(mapping);
+		} else {
+			/*
+			 * For anon mappings, try_to_unmap_one will take the
+			 * i_mmap_rwsem before calling huge_pmd_unshare if
+			 * necessary.
+			 */
+			try_to_unmap(hpage,
+				TTU_MIGRATION|TTU_IGNORE_MLOCK|
+				TTU_IGNORE_ACCESS);
+		}
 		page_was_mapped = 1;
 	}

diff --git a/mm/rmap.c b/mm/rmap.c
index c566bd552535..b267cc084f92 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1375,11 +1375,17 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		/*
 		 * If sharing is possible, start and end will be adjusted
 		 * accordingly.
-		 *
-		 * If called for a huge page, caller must hold i_mmap_rwsem
-		 * in write mode as it is possible to call huge_pmd_unshare.
 		 */
 		adjust_range_if_pmd_sharing_possible(vma, &start, &end);
+
+		/*
+		 * If called for a huge page file mapping, caller will hold
+		 * i_mmap_rwsem in write mode.  For anon mappings, we must
+		 * take the semaphore here.  All this is necessary because
+		 * it is possible huge_pmd_unshare will 'unshare' a pmd.
+		 */
+		if (PageAnon(page))
+			i_mmap_lock_write(vma->vm_file->f_mapping);
 	}
 	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);

@@ -1655,6 +1661,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	}

 	mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
+	/* For anon huge pages, we must unlock. */
+	if (PageHuge(page) && PageAnon(page))
+		i_mmap_unlock_write(vma->vm_file->f_mapping);

 	return ret;
 }
-- 
2.17.2


