[SERVER-24600] Mongos stalls during shutdown on Windows Created: 15/Jun/16 Updated: 25/Jan/17 Resolved: 07/Sep/16 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 3.2.5 |
| Fix Version/s: | 3.3.14 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | David Golub | Assignee: | Andy Schwerin |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Sprint: | Sharding 17 (07/15/16), Platforms 16 (06/24/16), Sharding 2016-09-19 |
| Description |
|
Configure a config server and a mongos that connects to it, both running as Windows services. Start both, connect to the mongos from the Mongo shell, and run the following:
The Mongo shell will stall for a few seconds while the server is unresponsive and then eventually time out. If you observe what's happening in the Windows Services window, you'll notice that the mongos remains running but unresponsive while the Mongo shell is stalled. Once the Mongo shell stops trying to communicate with it, the mongos immediately stops. (Please note that the Windows Services window does not automatically refresh, so it is necessary to repeatedly press F5 in order to see this.) This behavior is a regression that was introduced in MongoDB 3.2.5. If you do the same thing with earlier versions of MongoDB, the Windows service stops immediately and the Mongo shell just displays some error messages without stalling. The issue was detected because it is causing an Automation Agent test, FickleUtilSuite.TestIsLowestMongosUpInCluster, to fail on Windows only. |
| Comments |
| Comment by Andy Schwerin [ 07/Sep/16 ] | |||||
|
The root cause of this behavior is that shutdown will not complete while certain subsystems on mongos are targeting an operation to a config server primary. Those targeting operations almost always time out after 20 seconds, which is why a 30-second wait in the automation agent improves its ability to wait for mongos shutdown. I have just committed a patch on master that improves the interruptibility of replica set targeting, and it ought to eliminate this shutdown symptom. | |||||
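To illustrate the difference between a wait that runs out its full timeout and one that can be interrupted by shutdown, here is a minimal Python sketch. It is not MongoDB's code: `find_primary`, `probe`, `shutdown_requested`, and `ShutdownInProgress` are hypothetical names chosen for illustration, and the 20-second default only mirrors the timeout mentioned in the comment above.

```python
import threading
import time


class ShutdownInProgress(Exception):
    """Raised when the wait is abandoned because shutdown was requested."""


def find_primary(probe, shutdown_requested, timeout_s=20.0, retry_interval_s=0.5):
    """Hypothetical stand-in for a targeting call; not MongoDB's actual code.

    probe: callable returning a host string, or None if no primary is known yet.
    shutdown_requested: threading.Event set when the process starts shutting down.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        host = probe()
        if host is not None:
            return host
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("no primary found before the deadline")
        # Event.wait returns True as soon as the event is set, so a shutdown
        # request ends the wait right away instead of after the full timeout.
        if shutdown_requested.wait(min(retry_interval_s, remaining)):
            raise ShutdownInProgress("targeting interrupted by shutdown")
```

If the retry loop slept with a plain `time.sleep` and never checked a shutdown signal, a shutdown request could be held up for as long as the timeout, which matches the symptom described in this ticket.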
| Comment by Githook User [ 07/Sep/16 ] | |||||
|
Author: Andy Schwerin (andy10gen) <schwerin@mongodb.com>
Message: By making more calls of RemoteCommandTargeter::findHost interruptible, this | |||||
| Comment by Spencer Brody (Inactive) [ 25/Aug/16 ] | |||||
|
I believe that this has gotten worse in 3.4 as it now affects shards as well as mongos. | |||||
| Comment by Andy Schwerin [ 14/Jul/16 ] | |||||
|
The mongos shutdown behavior is strange. If you start a mongos with a bad --configdb argument, killing with SIGINT takes an awfully long time on Unix, too, now. Feel free to bounce this to the sharding backlog, and mark the ticket "needs triage". | |||||
| Comment by David Golub [ 06/Jul/16 ] | |||||
|
OK, so there was a timeout of 20 seconds within which the Automation Agent expected the mongod to terminate. I temporarily raised that to 30 seconds, and it got the test to pass. I personally feel that such a huge increase in the amount of time that it takes to stop constitutes a bug, but I'll defer to your team to make the final judgement on that. If it's deemed not to be a bug, I can raise the timeout in the Automation Agent. | |||||
| Comment by Mark Benvenuto [ 06/Jul/16 ] | |||||
|
I have been able to shut down mongos services successfully via the shutdown command. Do you only see a problem in your test framework? On my local Windows 10 machine, using 3.2.5, I did the following steps:
The commands I listed above took ~23 seconds to complete. It took some time for the shell to detect the network disconnect, but I could confirm the service was cleanly stopped by querying the service control manager (via sc query mongos). | |||||
| Comment by David Golub [ 06/Jul/16 ] | |||||
|
I'm not clear on what the expected behavior following this change is. I tried modifying the automation test to ignore timeout errors and treat them as success, but it then determines that the mongos process is still up after the timeout. On the other hand, if I let it retry the shutdown command, it continues to repeatedly time out. How is one supposed to send the shutdown command and know when the process has actually stopped? | |||||
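One possible approach to the question above, sketched in Python under stated assumptions: treat an error or timeout from the shutdown command as inconclusive, send the command only once, and then poll an independent liveness signal until a deadline. `wait_until_stopped` and `is_running` are hypothetical names; `is_running` could wrap whatever check the caller trusts, such as the service state reported by `sc query mongos` mentioned elsewhere in this ticket. The 30-second default reflects the value that let the test pass, not a documented guarantee.

```python
import time


def wait_until_stopped(is_running, timeout_s=30.0, poll_interval_s=0.5,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll is_running() until it returns False or timeout_s elapses.

    Returns True if the process was observed stopped, False on timeout.
    clock and sleep are injectable so the helper can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if not is_running():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)
```

Separating "send the command" from "confirm the process is gone" avoids re-sending shutdown to a server that is already stopping, which is the retry loop described above.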
| Comment by Mark Benvenuto [ 27/Jun/16 ] | |||||
|
david.golub I am planning to close this as works as designed. Let me know if you have any concerns. | |||||
| Comment by Mark Benvenuto [ 23/Jun/16 ] | |||||
|
A change in behavior for shutdown on mongos was made as part of | |||||
| Comment by David Golub [ 16/Jun/16 ] | |||||
|
As I said in the ticket description, it only happens with 3.2.5 and later, not with earlier versions. | |||||
| Comment by Mark Benvenuto [ 16/Jun/16 ] | |||||
|
david.golub Does this repro with 3.2.4 or did it just start failing in 3.2.5? |