[SERVER-39728] Unittests on Windows debug can crash after returning from main() Created: 21/Feb/19  Updated: 29/Oct/23  Resolved: 04/Apr/19

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 4.1.10

Type: Task Priority: Major - P3
Reporter: Henrik Edin Assignee: Henrik Edin
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
Backwards Compatibility: Fully Compatible
Sprint: Dev Tools 2019-03-11, Dev Tools 2019-03-25, Dev Tools 2019-04-08
Participants:
Linked BF Score: 12

 Description   

We currently experience spurious crashes of unittests on Windows debug builds with exit code C0000005 / -1073741819 Windows Access Violation even though we returned 0 from main().

The reason is that we don't shutdown various subsystems cleanly before returning from main causing races between new threads starting up and instantiating thread_locals that allocates memory while we're tearing down the debug CRT.

Even "single threaded" unittests have this problem because certain Windows API's like CertCreateCertificateChainEngine uses the Windows thread pool which may spawn new threads at any time.

There's two possible solutions:
1. quickExit instead of returning from main, at least on Windows debug
2. Attempt to use the same cleanup path as embedded for unittests and find&fix the problematic systems.

Below is an example of a shutdown crash like this:

Main thread:

 	ntdll.dll!NtWaitForSingleObject()	Unknown
 	ntdll.dll!LdrpDrainWorkQueue()	Unknown
 	ntdll.dll!LdrpLoadDllInternal()	Unknown
 	ntdll.dll!LdrpLoadDll()	Unknown
 	ntdll.dll!LdrLoadDll()	Unknown
 	KERNELBASE.dll!LoadLibraryExW()	Unknown
 	shard_registry_test.exe!try_load_library_from_system_directory(const wchar_t * const name) Line 199	C++
 	shard_registry_test.exe!try_get_module(const `anonymous-namespace'::module_id id) Line 238	C++
>	shard_registry_test.exe!try_get_first_available_module(const `anonymous-namespace'::module_id * const first, const `anonymous-namespace'::module_id * const last) Line 271	C++
 	shard_registry_test.exe!try_get_proc_address_from_first_available_module(const char * const name, const `anonymous-namespace'::module_id * const first_module_id, const `anonymous-namespace'::module_id * const last_module_id) Line 289	C++
 	shard_registry_test.exe!try_get_function(const `anonymous-namespace'::function_id id, const char * const name, const `anonymous-namespace'::module_id * const first_module_id, const `anonymous-namespace'::module_id * const last_module_id) Line 326	C++
 	shard_registry_test.exe!try_get_AppPolicyGetProcessTerminationMethod() Line 377	C++
 	shard_registry_test.exe!__acrt_AppPolicyGetProcessTerminationMethodInternal(AppPolicyProcessTerminationMethod * policy) Line 737	C++
 	shard_registry_test.exe!`__acrt_get_process_end_policy'::`2'::process_end_policy_properties::appmodel_get_policy(AppPolicyProcessTerminationMethod * appmodelPolicy) Line 81	C++
 	shard_registry_test.exe!get_win_policy<`__acrt_get_process_end_policy'::`2'::process_end_policy_properties>(AppPolicyProcessTerminationMethod defaultValue) Line 26	C++
 	shard_registry_test.exe!__acrt_get_process_end_policy() Line 85	C++
 	shard_registry_test.exe!should_call_terminate_process() Line 112	C++
 	shard_registry_test.exe!exit_or_terminate_process(const unsigned int return_code) Line 134	C++
 	shard_registry_test.exe!common_exit(const int return_code, const _crt_exit_cleanup_mode cleanup_mode, const _crt_exit_return_mode return_mode) Line 282	C++
 	shard_registry_test.exe!exit(int return_code) Line 294	C++
 	shard_registry_test.exe!__scrt_common_main_seh() Line 297	C++
 	shard_registry_test.exe!__scrt_common_main() Line 331	C++
 	shard_registry_test.exe!mainCRTStartup() Line 17	C++
 	kernel32.dll!BaseThreadInitThunk()	Unknown
 	ntdll.dll!RtlUserThreadStart()	Unknown

Windows thread pool thread:

 	ntdll.dll!RtlpWaitOnCriticalSection()	Unknown
 	ntdll.dll!RtlpEnterCriticalSectionContended()	Unknown
 	ntdll.dll!RtlEnterCriticalSection()	Unknown
 	shard_registry_test.exe!__acrt_lock(__acrt_lock_id _Lock) Line 55	C++
 	shard_registry_test.exe!heap_alloc_dbg_internal(const unsigned __int64 size, const int block_use, const char * const file_name, const int line_number) Line 309	C++
 	shard_registry_test.exe!heap_alloc_dbg(const unsigned __int64 size, const int block_use, const char * const file_name, const int line_number) Line 450	C++
 	shard_registry_test.exe!_malloc_dbg(unsigned __int64 size, int block_use, const char * file_name, int line_number) Line 496	C++
 	shard_registry_test.exe!malloc(unsigned __int64 size) Line 27	C++
>	shard_registry_test.exe!operator new(unsigned __int64 size) Line 35	C++
 	[Inline Frame] shard_registry_test.exe!std::allocator<std::_Container_proxy>::allocate(const unsigned __int64) Line 997	C++
 	[Inline Frame] shard_registry_test.exe!std::_Deque_alloc<std::_Deque_base_types<std::function<void __cdecl(void)>,std::allocator<std::function<void __cdecl(void)> > > >::_Alloc_proxy() Line 787	C++
 	[Inline Frame] shard_registry_test.exe!std::_Deque_alloc<std::_Deque_base_types<std::function<void __cdecl(void)>,std::allocator<std::function<void __cdecl(void)> > > >::{ctor}() Line 731	C++
 	[Inline Frame] shard_registry_test.exe!std::deque<std::function<void __cdecl(void)>,std::allocator<std::function<void __cdecl(void)> > >::{ctor}() Line 929	C++
 	shard_registry_test.exe!`dynamic initializer for 'mongo::transport::ServiceExecutorSynchronous::_localWorkQueue''() Line 56	C++
 	shard_registry_test.exe!__dyn_tls_init(void * __formal, unsigned long dwReason, void * __formal) Line 96	C++
 	ntdll.dll!LdrpCallInitRoutine()	Unknown
 	ntdll.dll!LdrpCallTlsInitializers()	Unknown
 	ntdll.dll!LdrpInitializeThread()	Unknown
 	ntdll.dll!_LdrpInitialize()	Unknown
 	ntdll.dll!LdrInitializeThunk()	Unknown



 Comments   
Comment by Githook User [ 04/Apr/19 ]

Author:

{'email': 'henrik.edin@mongodb.com', 'name': 'Henrik Edin', 'username': 'henrikedin'}

Message: SERVER-39728 Deinit SSL Manager on Windows in unittests.

This prevents the background threads being started in the Windows thread
pool after we return from main() while we're tearing down the CRT causing
access violations even though we returned 0 from main.
Branch: master
https://github.com/mongodb/mongo/commit/79fa77c5bdac8b240fc616fef825e0cd5ff3f31d

Comment by Jonathan Reams [ 25/Mar/19 ]

henrik.edin, is there a repro you have for CertCreateCertificateChainEngine? I don't doubt that CertCreateCertificateChainEngine may start a worker thread, but the unit tests that are failing have nothing to do with TLS. TLS may be initialized anyway, but I'd expect that to happen at the beginning of the test rather than the end. When this ticket was handed off to me I didn't see anything in any of the call stacks that indicated TLS was involved, and instead saw that the CRT implementation of exit() was always involved (at least from all the BFGs I looked at) and that also may use a thread pool to load DLLs. That would happen right as the test was ending, so it seems like there's much more of a chance for that to race with shutdown than something that's being initialized (probably as a side-effect) right at the beginning of the test. It's also possible that calling CertFreeCertificateChainEngine loads the correct DLLs before the test calls exit() which would let us side-step the bug in the CRT? If you have a repro that shows CertCreateCertificateChainEngine involved somehow, I'd be happy to take another look from the security team angle, but I think there are more things in modern windows that unexpectedly use a thread pool than just the TLS implementation.

Comment by Henrik Edin [ 25/Mar/19 ]

For what it's worth, when I prototyped this I saw that CertCreateCertificateChainEngine caused these thread pool jobs to run and by adding calls to CertFreeCertificateChainEngine before retuning from main fixed the problem. I don't think there's a bug in the CRT as cleaning up seem to fix it. This is why I singled out Windows SSL,

That being said, terminating with TerminateProcess may still be the correct thing to do here, and it would certainly be robust as it would also cover any other system that we don't cleanup before returning from main.

Comment by Jonathan Reams [ 28/Feb/19 ]

After talking to mark.benvenuto, I'm pretty convinced this doesn't actually relate to SSL. It's definitely possible that the Windows SSL implementation is also starting worker threads that we didn't know were being launched, but for these unit tests, I'd guess they're being initialized at the beginning before the test starts. However, exit() in the latest version of the windows CRT actually loads DLLs by way of calling AppPolicyGetProcessTerminationMethod to determine whether to call TerminateProcess or ExitProcess to exit the program - somewhat ironically, calling TerminateProcess is the fix here because it guarantees new threads won't be created while the process is being torn down. Since in Windows 10 loading a DLL is done in parallel on the default process thread pool, calling exit() may actually create threads after the debug heap has been destroyed. I think this is likely a bug in the windows CRT, and we may want to just call quickExit() at the end of main() on Windows.

Comment by Jonathan Reams [ 27/Feb/19 ]

I can believe there are other threads being created out of the windows thread pool, but I don't see anything SSL related in the example stack trace provided. The main thread in the example stack trace looks like it's trying to load a DLL which in windows 10 involves a thread pool, and there is another thread trying to run the TLS initializers. Since minidumps and core dumps are broken for windows unit tests, it's hard to get a sense of exactly where the failure is here or which thread pool needs to be cleaned up.

Comment by Henrik Edin [ 21/Feb/19 ]

Essentially all unittests have this problem now, due to SSL being compiled in that uses the windows thread pool as stated above. If shard_registry_test.exe have additional problems I don't know, I just picked a random unittest that failed.

Comment by Kaloian Manassiev [ 21/Feb/19 ]

I vote that it should be on each of the teams to ensure their tests are shutting down the respective subsystems properly. Is shard_registry_test.exe one of these examples?

Generated at Thu Feb 08 04:52:56 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.