[LTP] [PATCH 1/2] SUNRPC: Don't allow waiting for exiting tasks

Mon Jul 28 11:34:19 CEST 2025

On Mon, 28 Jul 2025, Harshvardhan Jha wrote:
> On 27/07/25 10:20 AM, NeilBrown wrote:
> > On Fri, 25 Jul 2025, Harshvardhan Jha wrote:
> >> On 23/07/25 1:37 PM, NeilBrown wrote:
> >>> On Wed, 23 Jul 2025, Harshvardhan Jha wrote:
> >>>> On 08/04/25 4:01 PM, Mark Brown wrote:
> >>>>> On Fri, Mar 28, 2025 at 01:40:44PM -0400, trondmy@kernel.org wrote:
> >>>>>> From: Trond Myklebust <trond.myklebust@hammerspace.com>
> >>>>>>
> >>>>>> Once a task calls exit_signals() it can no longer be signalled. So do
> >>>>>> not allow it to do killable waits.
> >>>>> We're seeing the LTP acct02 test failing in kernels with this patch
> >>>>> applied, testing on systems with NFS root filesystems:
> >>>>>
> >>>>> 10271 05:03:09.064993  tst_test.c:1900: TINFO: LTP version: 20250130-1-g60fe84aaf
> >>>>> 10272 05:03:09.076425  tst_test.c:1904: TINFO: Tested kernel: 6.15.0-rc1 #1 SMP PREEMPT Sun Apr  6 21:18:14 UTC 2025 aarch64
> >>>>> 10273 05:03:09.076733  tst_kconfig.c:88: TINFO: Parsing kernel config '/proc/config.gz'
> >>>>> 10274 05:03:09.087803  tst_test.c:1722: TINFO: Overall timeout per run is 0h 01m 30s
> >>>>> 10275 05:03:09.088107  tst_kconfig.c:88: TINFO: Parsing kernel config '/proc/config.gz'
> >>>>> 10276 05:03:09.093097  acct02.c:63: TINFO: CONFIG_BSD_PROCESS_ACCT_V3=y
> >>>>> 10277 05:03:09.093400  acct02.c:240: TINFO: Verifying using 'struct acct_v3'
> >>>>> 10278 05:03:10.053504  <6>[   98.043143] Process accounting resumed
> >>>>> 10279 05:03:10.053935  <6>[   98.043143] Process accounting resumed
> >>>>> 10280 05:03:10.064653  acct02.c:193: TINFO: == entry 1 ==
> >>>>> 10281 05:03:10.064953  acct02.c:84: TINFO: ac_comm != 'acct02_helper' ('acct02')
> >>>>> 10282 05:03:10.076029  acct02.c:133: TINFO: ac_exitcode != 32768 (0)
> >>>>> 10283 05:03:10.076331  acct02.c:141: TINFO: ac_ppid != 2466 (2461)
> >>> It seems that the acct02 process got logged..
> >>> Maybe the vfork attempt (trying to run acct02_helper) got half way an
> >>> aborted.
> >>> It got far enough that accounting got interested.
> >>> It didn't get far enough to update the ppid.
> >>> I'd be surprised if that were even possible....
> >>>
> >>> If you would like to help debug this, changing the
> >>>
> >>> +       if (unlikely(current->flags & PF_EXITING))
> >>>
> >>> to
> >>>
> >>> +       if (unlikely(WARN_ON(current->flags & PF_EXITING)))
> >>>
> >>> would provide stack traces so we can wee where -EINTR is actually being
> >>> returned.  That should provide some hints.
> >>>
> >>> NeilBrown
> >> Hi Neil,
> >>
> >> Upon this addition I got this in the logs
> > Thanks for testing.  Was there anything new in the kernel logs?  I was
> > expecting a WARNING message followed by a "Call Trace".
> >
> > If there wasn't, then this patch cannot have caused the problem.
> > If there was, then I need to see it.
> >
> > Thanks,
> > NeilBrown
> 
> This is what the dmesg contains:
> 
> [  678.814887] LTP: starting acct02
> [  679.831232] ------------[ cut here ]------------
> [  679.833500] WARNING: CPU: 6 PID: 88930 at net/sunrpc/sched.c:279
> rpc_wait_bit_killable+0x76/0x90 [sunrpc]
> [  679.837308] Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs
> netfs rpcrdma rdma_cm iw_cm ib_cm ib_core nfsd auth_rpcgss nfs_acl lockd
> grace loop nft_redir ipt_REJECT xt_comment xt_owner nft_compat
> nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib rfkill nft_reject_inet
> nf_reject_
> ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack
> nf_defrag_ipv6 nf_defrag_ipv4 ip_set cuse vfat fat intel_rapl_msr
> intel_rapl_common kvm_amd ccp kvm drm_shmem_helper irqbypass i2c_piix4
> drm_kms_helper pcspkr pvpanic_mmio i2c_smbus pvpanic drm fuse xfs
> crc32c_generic
>  nvme_tcp nvme_fabrics nvme_core nvme_keyring nvme_auth sd_mod
> virtio_net sg net_failover virtio_scsi failover ata_generic pata_acpi
> ata_piix ghash_clmulni_intel libata sha512_ssse3 virtio_pci sha256_ssse3
> virtio_pci_legacy_dev sha1_ssse3 virtio_pci_modern_dev serio_raw
> dm_multipath btrfs
>  blake2b_generic xor zstd_compress raid6_pq sunrpc dm_mirror
> dm_region_hash dm_log dm_mod be2iscsi bnx2i cnic uio cxgb4i cxgb4 tls
> cxgb3i cxgb3 mdio libcxgbi libcxgb
> [  679.837524]  qla4xxx iscsi_tcp libiscsi_tcp libiscsi
> scsi_transport_iscsi iscsi_ibft iscsi_boot_sysfs qemu_fw_cfg aesni_intel
> crypto_simd cryptd [last unloaded: kheaders]
> [  679.873316] CPU: 6 UID: 0 PID: 88930 Comm: acct02_helper Kdump:
> loaded Not tainted 6.15.8-1.el9.rc2.x86_64 #1 PREEMPT(voluntary)
> [  679.877769] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS 1.6.4 02/27/2023
> [  679.880782] RIP: 0010:rpc_wait_bit_killable+0x76/0x90 [sunrpc]
> [  679.883189] Code: 01 b8 00 fe ff ff 75 d5 48 8b 85 48 0d 00 00 5b 5d
> 48 c1 e8 08 83 e0 01 f7 d8 19 c0 25 00 fe ff ff 31 d2 31 f6 e9 8a e6 c4
> d4 <0f> 0b b8 fc ff ff ff 5b 5d 31 d2 31 f6 e9 78 e6 c4 d4 0f 1f 84 00
> [  679.889976] RSP: 0018:ffffaf47811a7770 EFLAGS: 00010202
> [  679.892196] RAX: ffff97be48e00330 RBX: ffffaf47811a77c0 RCX:
> 0000000000000000
> [  679.894978] RDX: 0000000000000001 RSI: 0000000000002102 RDI:
> ffffaf47811a77c0
> [  679.897786] RBP: ffff97be61588000 R08: 0000000000000000 R09:
> 0000000000000000
> [  679.900600] R10: 0000000000000000 R11: 0000000000000000 R12:
> 0000000000002102
> [  679.903432] R13: ffffffff96408ea0 R14: ffffaf47811a77d8 R15:
> ffffffffc07568e0
> [  679.906233] FS:  00007fc2563f8600(0000) GS:ffff97c5c890f000(0000)
> knlGS:0000000000000000
> [  679.909289] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  679.911736] CR2: 00007fc2561fba70 CR3: 00000003bce3a000 CR4:
> 00000000003506f0
> [  679.914555] Call Trace:
> [  679.915918]  <TASK>
> [  679.917215]  __wait_on_bit+0x31/0xa0
> [  679.918932]  out_of_line_wait_on_bit+0x93/0xc0
> [  679.920914]  ? __pfx_wake_bit_function+0x10/0x10
> [  679.922944]  __rpc_execute+0x109/0x310 [sunrpc]
> [  679.925024]  rpc_execute+0x137/0x160 [sunrpc]
> [  679.927020]  rpc_run_task+0x107/0x170 [sunrpc]
> [  679.929032]  nfs4_call_sync_sequence+0x74/0xc0 [nfsv4]
> [  679.931319]  _nfs4_proc_statfs+0xc7/0x100 [nfsv4]
> [  679.933520]  ? srso_return_thunk+0x5/0x5f
> [  679.935391]  nfs4_proc_statfs+0x6b/0xb0 [nfsv4]
> [  679.937367]  nfs_statfs+0x7e/0x1e0 [nfs]
> [  679.939138]  statfs_by_dentry+0x67/0xa0
> [  679.940887]  vfs_statfs+0x1c/0x40
> [  679.942596]  check_free_space+0x71/0x110

Thanks.  I'm not sure why this causes a problem as if vfs_statfs() fail,
check_free_space() assumes there is still free space.
However it does strongly suggest that we still need to NFS to work in
processes where signals have been shutdown.

Could you change rpc_wait_bit_killable() to be the following and retest?
I intention is that when the process is exiting, we wait up to 5 seconds
for each request and then fail.  It's a bit ugly, but it is a rather
strange situation.  It blocking forever that we really want to avoid
here, not blocking at all.

Thanks,
NeilBrown

static int rpc_wait_bit_killable(struct wait_bit_key *key, int mode)
{
	if (unlikely(current->flags & PF_EXITING)) {
		if (schedule_timeout(5*HZ) > 0)
			/* timed out */
			return 0;
		return -EINTR;
	}
	schedule();
	if (signal_pending_state(mode, current))
		return -ERESTARTSYS;
	return 0;
}

> [  679.944433]  acct_write_process+0x45/0x180
> [  679.946313]  acct_process+0xff/0x180
> [  679.948003]  do_exit+0x216/0x480
> [  679.949799]  ? srso_return_thunk+0x5/0x5f
> [  679.951621]  do_group_exit+0x30/0x80
> [  679.953329]  __x64_sys_exit_group+0x18/0x20
> [  679.955217]  x64_sys_call+0xfdb/0x14f0
> [  679.956971]  do_syscall_64+0x82/0x7a0
> [  679.958717]  ? srso_return_thunk+0x5/0x5f
> [  679.960550]  ? ___pte_offset_map+0x1b/0x1a0
> [  679.962434]  ? srso_return_thunk+0x5/0x5f
> [  679.964261]  ? __alloc_frozen_pages_noprof+0x18d/0x340
> [  679.966389]  ? srso_return_thunk+0x5/0x5f
> [  679.968183]  ? srso_return_thunk+0x5/0x5f
> [  679.969945]  ? __mod_memcg_lruvec_state+0xb6/0x1b0
> [  679.971977]  ? srso_return_thunk+0x5/0x5f
> [  679.973690]  ? __lruvec_stat_mod_folio+0x83/0xd0
> [  679.975671]  ? srso_return_thunk+0x5/0x5f
> [  679.977392]  ? srso_return_thunk+0x5/0x5f
> [  679.979079]  ? set_ptes.isra.0+0x36/0x90
> [  679.980771]  ? srso_return_thunk+0x5/0x5f
> [  679.982375]  ? srso_return_thunk+0x5/0x5f
> [  679.984052]  ? wp_page_copy+0x333/0x730
> [  679.985648]  ? srso_return_thunk+0x5/0x5f
> [  679.987220]  ? __handle_mm_fault+0x397/0x6f0
> [  679.988818]  ? srso_return_thunk+0x5/0x5f
> [  679.990411]  ? __count_memcg_events+0xbb/0x150
> [  679.992111]  ? srso_return_thunk+0x5/0x5f
> [  679.993689]  ? count_memcg_events.constprop.0+0x26/0x50
> [  679.995590]  ? srso_return_thunk+0x5/0x5f
> [  679.997177]  ? handle_mm_fault+0x245/0x350
> [  679.998807]  ? srso_return_thunk+0x5/0x5f
> [  680.000339]  ? do_user_addr_fault+0x221/0x690
> [  680.002042]  ? srso_return_thunk+0x5/0x5f
> [  680.003553]  ? arch_exit_to_user_mode_prepare.isra.0+0x1e/0xd0
> [  680.005643]  ? srso_return_thunk+0x5/0x5f
> [  680.007202]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [  680.009025] RIP: 0033:0x7fc2560d985d
> [  680.010510] Code: Unable to access opcode bytes at 0x7fc2560d9833.
> [  680.012660] RSP: 002b:00007ffde591df68 EFLAGS: 00000246 ORIG_RAX:
> 00000000000000e7
> [  680.015355] RAX: ffffffffffffffda RBX: 00007fc2561f59e0 RCX:
> 00007fc2560d985d
> [  680.017749] RDX: 00000000000000e7 RSI: ffffffffffffff88 RDI:
> 0000000000000080
> [  680.020292] RBP: 0000000000000080 R08: 0000000000000000 R09:
> 0000000000000020
> [  680.022729] R10: 00007ffde591de10 R11: 0000000000000246 R12:
> 00007fc2561f59e0
> [  680.025174] R13: 00007fc2561faf20 R14: 0000000000000001 R15:
> 00007fc2561faf08
> [  680.027593]  </TASK>
> [  680.028661] ---[ end trace 0000000000000000 ]---
> 
> 
> Thanks & Regards,
> Harshvardhan
> 
> >
> >> <<<test_start>>>
> >> tag=acct02 stime=1753444172
> >> cmdline="acct02"
> >> contacts=""
> >> analysis=exit
> >> <<<test_output>>>
> >> tst_kconfig.c:88: TINFO: Parsing kernel config
> >> '/lib/modules/6.15.8-1.bug38227970.el9.rc2.x86_64/config'
> >> tst_tmpdir.c:316: TINFO: Using /tmpdir/ltp-w1ozKKlJ6n/LTP_acc4RRfLh as
> >> tmpdir (nfs filesystem)
> >> tst_test.c:2004: TINFO: LTP version: 20250530-105-gda73e1527
> >> tst_test.c:2007: TINFO: Tested kernel:
> >> 6.15.8-1.bug38227970.el9.rc2.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Jul 25
> >> 02:03:04 PDT 2025 x86_64
> >> tst_kconfig.c:88: TINFO: Parsing kernel config
> >> '/lib/modules/6.15.8-1.bug38227970.el9.rc2.x86_64/config'
> >> tst_test.c:1825: TINFO: Overall timeout per run is 0h 00m 30s
> >> tst_kconfig.c:88: TINFO: Parsing kernel config
> >> '/lib/modules/6.15.8-1.bug38227970.el9.rc2.x86_64/config'
> >> acct02.c:61: TINFO: CONFIG_BSD_PROCESS_ACCT_V3=y
> >> acct02.c:238: TINFO: Verifying using 'struct acct_v3'
> >> acct02.c:191: TINFO: == entry 1 ==
> >> acct02.c:82: TINFO: ac_comm != 'acct02_helper' ('acct02')
> >> acct02.c:131: TINFO: ac_exitcode != 32768 (0)
> >> acct02.c:139: TINFO: ac_ppid != 88929 (88928)
> >> acct02.c:181: TFAIL: end of file reached
> >>
> >> HINT: You _MAY_ be missing kernel fixes:
> >>
> >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4d9570158b626
> >>
> >> Summary:
> >> passed   0
> >> failed   1
> >> broken   0
> >> skipped  0
> >> warnings 0
> >> incrementing stop
> >> <<<execution_status>>>
> >> initiation_status="ok"
> >> duration=1 termination_type=exited termination_id=1 corefile=no
> >> cutime=0 cstime=20
> >>
> >> <<<test_end>>>
> >>
> >>
> >> Thanks & Regards,
> >>
> >> Harshvardhan
>