[SERVER-57427] Avoid special-case handling in ServiceEntryPointImpl::shutdown Created: 03/Jun/21 Updated: 08/Jan/24 Resolved: 01/Mar/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Internal Code |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Amirsaman Memaripour | Assignee: | Backlog - Service Architecture |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | servicearch-wfbf-day |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Assigned Teams: | Service Arch |
| Operating System: | ALL |
| Sprint: | Service Arch 2022-1-10 |
| Participants: | |
| Linked BF Score: | 0 |
| Description |
|
Outside TSAN and ASAN builds, shutting down the service entry point immediately returns true without running any shutdown code; a separate interface, shutdownAndWait, exists to actually shut the service entry point down.
We should remove the special-case handling in shutdown and have it run the body of shutdownAndWait, which would also obviate the need for a separate shutdownAndWait. Acceptance criteria: move the #if to a higher level of the stack, and make sure this ticket is linked to the relevant shutdown project. |
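The following is a minimal, self-contained sketch of the special case and of the proposed change, not the actual server code: SANITIZER_BUILD, shutdownProposed, and the trivial shutdownAndWait body are stand-ins for illustration only, and the real signatures are more involved.

```cpp
#include <chrono>

class ServiceEntryPointImpl {
public:
    using Milliseconds = std::chrono::milliseconds;

    // Current shape: outside sanitizer builds, shutdown() returns true without
    // tearing anything down; callers that need a real teardown must use
    // shutdownAndWait() instead.
    bool shutdown(Milliseconds timeout) {
#ifdef SANITIZER_BUILD
        return shutdownAndWait(timeout);
#else
        (void)timeout;
        return true;  // special case: skip session teardown entirely
#endif
    }

    // Proposed shape: shutdown() always runs the full teardown, so the separate
    // shutdownAndWait() entry point becomes unnecessary and any sanitizer-only
    // decision moves higher up the call stack.
    bool shutdownProposed(Milliseconds timeout) {
        return shutdownAndWait(timeout);
    }

private:
    bool shutdownAndWait(Milliseconds /*timeout*/) {
        // end all sessions, then wait for the connection count to drain
        return true;
    }
};
```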
| Comments |
| Comment by Daniel Morilha (Inactive) [ 01/Mar/22 ] |
|
After investigation, this ticket was found to be addressed by |
| Comment by Daniel Morilha (Inactive) [ 14/Jan/22 ] |
|
Per discussion with blake.oler, this ticket isn't important enough to spend more time on; the fix does introduce some hangs, specifically on the Windows flavor. Here are the related error messages from just before things crashed and burned:
My brief assessment is that there is a lot of opportunity for improving the logs here. From reading the code, the thread calling shutdown asks every underlying session to "end" and then sits in a condition-variable loop until either the connection counter reaches 0 or it times out. The issue seems to occur only on Windows, and more has to be understood about ASIO on Windows and how the code ends a session. |
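A rough, self-contained sketch of the drain-wait described above, under assumed names (SessionRegistry, waitForDrain); the actual server code is structured differently:

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

struct SessionRegistry {
    std::mutex m;
    std::condition_variable cv;
    int activeConnections = 0;

    // Block until the connection count drains or the deadline passes. Returns
    // false on timeout, which is roughly the situation behind the Windows-only
    // hangs mentioned above.
    bool waitForDrain(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lk(m);
        return cv.wait_for(lk, timeout, [&] { return activeConnections == 0; });
    }

    // Called as each ended session removes itself from the registry.
    void onSessionEnded() {
        {
            std::lock_guard<std::mutex> lk(m);
            --activeConnections;
        }
        cv.notify_all();
    }
};
```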
| Comment by Daniel Morilha (Inactive) [ 13/Jan/22 ] |
|
All failing tests related to repairShardedCollectionChunksHistory are documented in a known issue in 5.1 [1], as matthew.saltz mentioned.
A log line was added to measure the total cost of a clean shutdown. In tests [2] we could see shutdownAndWait taking up to 2 seconds; amirsaman.memaripour has a nice suggestion on how to improve the code [3].
[1] [BF-24011] failed: sharding_jscore_multiversion_passthrough on enterprise-rhel-80-64-bit-multiversion [mongodb-mongo-v5.1 @ f92817a4] (views_all_commands.js) - MongoDB Jira
[2] Evergreen - Version Patch 30 by daniel.morilha (mongodb.com) |
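Not the actual log line or server code, just a minimal sketch of that kind of measurement: wrap the teardown call with a steady-clock timer and report the elapsed wall-clock time.

```cpp
#include <chrono>
#include <iostream>

int main() {
    auto start = std::chrono::steady_clock::now();
    // ... the shutdownAndWait(...) call would run here ...
    auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - start);
    std::cout << "clean shutdown took " << elapsed.count() << " ms\n";
}
```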
| Comment by Daniel Morilha (Inactive) [ 13/Jan/22 ] |
|
amirsaman.memaripour helped me understand that repairShardedCollectionChunksHistory, found in jstests/sharding/database_versioning_all_commands.js, fails after applying the suggestion described here. The next step is to make sure whether this is really related or not. If so, Sam suggested reaching out to the sharding team, who might help with understanding the context. So be it! |
| Comment by Daniel Morilha (Inactive) [ 11/Jan/22 ] |
|
After some time spent properly setting up my Ubuntu box, which came all funky (thanks ryan.egesdahl and max.hirschhorn for the various ideas), I could compile, run, and analyze the reports, and they are valid! Apparently shutdownAndWait does guarantee proper destruction. Now the question is what the priority for that is, given that tests synthetically create and destroy ServiceStateMachine instances; I would imagine this isn't the case on Atlas, for instance. |
| Comment by Daniel Morilha (Inactive) [ 07/Jan/22 ] |
|
As a side note from someone who is still learning: by adopting the suggestion found in the description I was able to see only a small number of failed tests, and one that is most likely related. The next step is to rerun the test and investigate whether I can reproduce it locally.
|
| Comment by Daniel Morilha (Inactive) [ 23/Dec/21 ] |
|
Hmm, as far as my memory serves, file descriptors may leak if you don't "properly close" the ASIO sockets. While shutting down every associated session completely might be expensive, we can certainly investigate a way to close the local socket/file descriptor such that the remote peer perceives the connection as aborted. I will have to get back and dig into the ins and outs of ASIO to confirm. |
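A small sketch of that abortive-close idea, assuming standalone Asio (not how the server's transport layer is written): enabling SO_LINGER with a zero timeout makes close() reset the connection, so the local descriptor is released immediately and the peer sees the connection as aborted rather than gracefully shut down.

```cpp
#include <asio.hpp>
#include <system_error>

void abortiveClose(asio::ip::tcp::socket& sock) {
    std::error_code ec;
    // Zero-timeout linger: close() sends an RST instead of a normal FIN handshake.
    sock.set_option(asio::socket_base::linger(/*enable=*/true, /*timeout=*/0), ec);
    sock.close(ec);  // releases the file descriptor even if setting the option failed
}
```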
| Comment by Billy Donahue [ 04/Jun/21 ] |
|
There's another thing we discussed about this function that I just wanted to capture.
This is pretty strange. If it's significant that the terminated sessions remove themselves, then I'd think we would abort if we can't achieve that in time. If we don't care about the terminated sessions removing themselves, then why wait for any of them? |
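For illustration only (hypothetical names and a stubbed-out wait, not the server code), the stricter of the two alternatives above would treat a drain timeout as a hard failure rather than something shutdown quietly tolerates:

```cpp
#include <chrono>
#include <cstdlib>
#include <iostream>

// Stub standing in for "wait until terminated sessions remove themselves".
bool waitForSessionsToRemoveThemselves(std::chrono::seconds /*deadline*/) {
    return true;
}

int main() {
    if (!waitForSessionsToRemoveThemselves(std::chrono::seconds(10))) {
        std::cerr << "sessions did not remove themselves in time\n";
        std::abort();  // fail loudly instead of pretending the shutdown was clean
    }
}
```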