[SERVER-24600] Mongos stalls during shutdown on Windows Created: 15/Jun/16  Updated: 25/Jan/17  Resolved: 07/Sep/16

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.2.5
Fix Version/s: 3.3.14

Type: Bug Priority: Major - P3
Reporter: David Golub Assignee: Andy Schwerin
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-22950 mongos shutdown is non-deterministic ... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Sharding 17 (07/15/16), Platforms 16 (06/24/16), Sharding 2016-09-19
Participants:

 Description   

Configure a config server and a mongos that connects to it, both running as Windows services. Start both, connect to the mongos from the Mongo shell, and run the following:

use admin
db.runCommand({ shutdown: 1 })

The Mongo shell stalls for a few seconds while the server is unresponsive and then eventually times out. If you watch the Windows Services window, you'll notice that the mongos remains running but unresponsive while the Mongo shell is stalled. Once the Mongo shell stops trying to communicate with it, the mongos immediately stops. (Please note that the Windows Services window does not refresh automatically, so it is necessary to press F5 repeatedly to see this.) This behavior is a regression that was introduced in MongoDB 3.2.5. If you do the same thing with earlier versions of MongoDB, the Windows service stops immediately and the Mongo shell just displays some error messages without stalling. The issue was detected because it causes an Automation Agent test, FickleUtilSuite.TestIsLowestMongosUpInCluster, to fail on Windows only.

CC mark.benvenuto



 Comments   
Comment by Andy Schwerin [ 07/Sep/16 ]

The root cause of this behavior is that shutdown will not complete while certain subsystems on mongos are targeting an operation to a config server primary. Those targeting operations pretty much always time out after 20 seconds, which is why a 30-second wait in the automation agent improves its ability to wait for mongos shutdown. I have just committed a patch on master that improves the interruptibility of replica set targeting and ought to eliminate this shutdown symptom.
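
To illustrate the idea of an interruptible targeting wait, here is a minimal sketch. It is not the code from the patch: findHostInterruptible and all of its parameters (tryFindHost, isShuttingDown, now, sleep, timeoutMs, pollMs) are hypothetical names chosen for illustration. The clock and sleep functions are passed in so the loop can be exercised without real delays.

```javascript
// Hypothetical sketch; none of these names come from the MongoDB source.
// Polls tryFindHost() until it yields a host, the timeout elapses,
// or isShuttingDown() reports that shutdown has begun.
function findHostInterruptible({ tryFindHost, isShuttingDown, now, sleep, timeoutMs, pollMs }) {
  const deadline = now() + timeoutMs;
  while (true) {
    // Checking the shutdown flag on every iteration is what makes
    // the wait interruptible instead of blocking for the full timeout.
    if (isShuttingDown()) {
      return { status: "interrupted", host: null };
    }
    const host = tryFindHost();
    if (host !== null) {
      return { status: "ok", host };
    }
    if (now() >= deadline) {
      return { status: "timeout", host: null };
    }
    // Never sleep past the deadline.
    sleep(Math.min(pollMs, deadline - now()));
  }
}
```

With a 20-second timeout and no reachable host, a wait that never checks the shutdown flag runs for the full 20 seconds. With the check, it returns at the first poll after shutdown begins.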

Comment by Githook User [ 07/Sep/16 ]

Author: Andy Schwerin (andy10gen) <schwerin@mongodb.com>

Message: SERVER-24600 Increase interruptibility of RemoteCommandTargeter::findHost.

By making more calls of RemoteCommandTargeter::findHost interruptible, this
change speeds up the shutdown of mongos when no config servers are discoverable.
Branch: master
https://github.com/mongodb/mongo/commit/645a77b3fa5b28d29d245e30cc195fd5a8eda049

Comment by Spencer Brody (Inactive) [ 25/Aug/16 ]

I believe that this has gotten worse in 3.4 as it now affects shards as well as mongos.

Comment by Andy Schwerin [ 14/Jul/16 ]

The mongos shutdown behavior is strange. If you start a mongos with a bad --configdb argument, killing with SIGINT takes an awfully long time on Unix, too, now. Feel free to bounce this to the sharding backlog, and mark the ticket "needs triage".

Comment by David Golub [ 06/Jul/16 ]

OK, so there was a timeout of 20 seconds within which the Automation Agent expected the mongod to terminate. I temporarily raised that to 30 seconds, and it got the test to pass. I personally feel that such a huge increase in the amount of time that it takes to stop constitutes a bug, but I'll defer to your team to make the final judgement on that. If it's deemed not to be a bug, I can raise the timeout in the Automation Agent.

Comment by Mark Benvenuto [ 06/Jul/16 ]

I have been able to shut down mongos services via the shutdown command. Do you only see a problem in your test framework?

On my local Windows 10 machine, using 3.2.5, I did the following:

  1. I set up a simple sharded cluster by using s = new ShardingTest() in the mongo shell.
  2. Ran: mongos.exe --install --configdb test-configRS/%COMPUTERNAME%:20002,%COMPUTERNAME%:20003,%COMPUTERNAME%:20004 -v --chunkSize 50 --port 20007 --setParameter enableTestCommands=1 --logpath=d:\tmp\325\a.log
  3. Started mongos with: sc start mongos
  4. Waited 20 seconds
  5. Shut down mongos via the mongo shell by pasting the following commands as a batch in the shell:

    use admin
    start = Date.now()
    db.runCommand({shutdown:1})
    end = Date.now()
    end - start
    

The commands I listed above took ~23 seconds to complete. It took some time for the shell to detect the network disconnect, but I could confirm the service was cleanly stopped by querying the service control manager (via sc query mongos).

Comment by David Golub [ 06/Jul/16 ]

I'm not clear on what the expected behavior following this change is. I tried modifying the automation test to ignore timeout errors and treat them as success, but it then determines that the mongos process is still up after the timeout. On the other hand, if I let it retry the shutdown command, it continues to repeatedly time out. How is one supposed to send the shutdown command and know when the process has actually stopped?
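
One possible pattern, sketched below under stated assumptions, is to send the shutdown command once, treat a dropped connection or client-side timeout as expected, and then poll an out-of-band signal until the process is gone or a deadline passes. This is an illustration only, not the Automation Agent's code: shutdownAndWait and its parameters are hypothetical, and isProcessUp stands in for whatever check the caller has available, such as querying the Windows service control manager.

```javascript
// Hypothetical helper; the caller supplies every dependency.
// Returns true if the process was observed to stop before the deadline,
// false if it was still up when the deadline passed.
function shutdownAndWait({ sendShutdown, isProcessUp, now, sleep, timeoutMs, pollMs }) {
  try {
    sendShutdown();
  } catch (err) {
    // A network error or timeout is expected here: the server may close
    // the connection before replying. Do not resend the command.
  }
  const deadline = now() + timeoutMs;
  // Decide success from the out-of-band check, not from the command reply.
  while (isProcessUp()) {
    if (now() >= deadline) {
      return false;
    }
    sleep(pollMs);
  }
  return true;
}
```

If the process takes about 23 seconds to exit, a 20-second deadline makes this return false and a 30-second deadline makes it return true, which is consistent with the timings reported elsewhere in this ticket.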

Comment by Mark Benvenuto [ 27/Jun/16 ]

david.golub I am planning to close this as "works as designed". Let me know if you have any concerns.

Comment by Mark Benvenuto [ 23/Jun/16 ]

A change in behavior for shutdown on mongos was made as part of SERVER-22950 to make shutdown of mongos cleaner.

Comment by David Golub [ 16/Jun/16 ]

As I said in the ticket description, it only happens with 3.2.5 and later, not with earlier versions.

Comment by Mark Benvenuto [ 16/Jun/16 ]

david.golub Does this repro with 3.2.4 or did it just start failing in 3.2.5?
