[SERVER-39728] Unittests on Windows debug can crash after returning from main() Created: 21/Feb/19 Updated: 29/Oct/23 Resolved: 04/Apr/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 4.1.10 |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Henrik Edin | Assignee: | Henrik Edin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||
| Sprint: | Dev Tools 2019-03-11, Dev Tools 2019-03-25, Dev Tools 2019-04-08 | ||||||||
| Participants: | |||||||||
| Linked BF Score: | 12 | ||||||||
| Description |
|
We currently experience spurious crashes of unittests on Windows debug builds with exit code C0000005 / -1073741819 Windows Access Violation even though we returned 0 from main(). The reason is that we don't shutdown various subsystems cleanly before returning from main causing races between new threads starting up and instantiating thread_locals that allocates memory while we're tearing down the debug CRT. Even "single threaded" unittests have this problem because certain Windows API's like CertCreateCertificateChainEngine uses the Windows thread pool which may spawn new threads at any time. There's two possible solutions: Below is an example of a shutdown crash like this: Main thread:
Windows thread pool thread:
|
| Comments |
| Comment by Githook User [ 04/Apr/19 ] |
|
Author: {'email': 'henrik.edin@mongodb.com', 'name': 'Henrik Edin', 'username': 'henrikedin'}Message: This prevents the background threads being started in the Windows thread |
| Comment by Jonathan Reams [ 25/Mar/19 ] |
|
henrik.edin, is there a repro you have for CertCreateCertificateChainEngine? I don't doubt that CertCreateCertificateChainEngine may start a worker thread, but the unit tests that are failing have nothing to do with TLS. TLS may be initialized anyway, but I'd expect that to happen at the beginning of the test rather than the end. When this ticket was handed off to me I didn't see anything in any of the call stacks that indicated TLS was involved, and instead saw that the CRT implementation of exit() was always involved (at least from all the BFGs I looked at) and that also may use a thread pool to load DLLs. That would happen right as the test was ending, so it seems like there's much more of a chance for that to race with shutdown than something that's being initialized (probably as a side-effect) right at the beginning of the test. It's also possible that calling CertFreeCertificateChainEngine loads the correct DLLs before the test calls exit() which would let us side-step the bug in the CRT? If you have a repro that shows CertCreateCertificateChainEngine involved somehow, I'd be happy to take another look from the security team angle, but I think there are more things in modern windows that unexpectedly use a thread pool than just the TLS implementation. |
| Comment by Henrik Edin [ 25/Mar/19 ] |
|
For what it's worth, when I prototyped this I saw that CertCreateCertificateChainEngine caused these thread pool jobs to run and by adding calls to CertFreeCertificateChainEngine before retuning from main fixed the problem. I don't think there's a bug in the CRT as cleaning up seem to fix it. This is why I singled out Windows SSL, That being said, terminating with TerminateProcess may still be the correct thing to do here, and it would certainly be robust as it would also cover any other system that we don't cleanup before returning from main. |
| Comment by Jonathan Reams [ 28/Feb/19 ] |
|
After talking to mark.benvenuto, I'm pretty convinced this doesn't actually relate to SSL. It's definitely possible that the Windows SSL implementation is also starting worker threads that we didn't know were being launched, but for these unit tests, I'd guess they're being initialized at the beginning before the test starts. However, exit() in the latest version of the windows CRT actually loads DLLs by way of calling AppPolicyGetProcessTerminationMethod to determine whether to call TerminateProcess or ExitProcess to exit the program - somewhat ironically, calling TerminateProcess is the fix here because it guarantees new threads won't be created while the process is being torn down. Since in Windows 10 loading a DLL is done in parallel on the default process thread pool, calling exit() may actually create threads after the debug heap has been destroyed. I think this is likely a bug in the windows CRT, and we may want to just call quickExit() at the end of main() on Windows. |
| Comment by Jonathan Reams [ 27/Feb/19 ] |
|
I can believe there are other threads being created out of the windows thread pool, but I don't see anything SSL related in the example stack trace provided. The main thread in the example stack trace looks like it's trying to load a DLL which in windows 10 involves a thread pool, and there is another thread trying to run the TLS initializers. Since minidumps and core dumps are broken for windows unit tests, it's hard to get a sense of exactly where the failure is here or which thread pool needs to be cleaned up. |
| Comment by Henrik Edin [ 21/Feb/19 ] |
|
Essentially all unittests have this problem now, due to SSL being compiled in that uses the windows thread pool as stated above. If shard_registry_test.exe have additional problems I don't know, I just picked a random unittest that failed. |
| Comment by Kaloian Manassiev [ 21/Feb/19 ] |
|
I vote that it should be on each of the teams to ensure their tests are shutting down the respective subsystems properly. Is shard_registry_test.exe one of these examples? |