[SERVER-4661] Mongos doesn't detect primary change if old primary lost network connectivity Created: 11/Jan/12  Updated: 10/Dec/14  Resolved: 15/Mar/13

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.0.2
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Spencer Brody (Inactive) Assignee: Spencer Brody (Inactive)
Resolution: Done Votes: 6
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-5175 Need "failpoints" system to facilitat... Closed
depends on SERVER-5642 Make mongobridge work with ShardingTe... Closed
depends on SERVER-7573 Add tests for network connectivity lo... Closed
Related
related to SERVER-4094 better mongos handling of state where... Closed
related to SERVER-4505 Don't assume old primary is still pri... Closed
Operating System: ALL
Participants:

 Description   

If the primary of a shard loses all network connectivity, the secondary will take over as primary, but the mongos may keep trying to reconnect to the former primary.
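
For illustration, a minimal mongo-shell sketch of how the mismatch can surface; the hostnames (mo01/mo02) follow the reproduction report below, and the exact connPoolStats output shape is an assumption about the 2.x mongos, not something captured in this ticket:

// On a surviving replica set member (connected directly, not via mongos),
// the set reports that a new primary has been elected:
rs.status().members.filter(function (m) { return m.stateStr === "PRIMARY"; })
// e.g. [ { "name" : "mo02:27018", "stateStr" : "PRIMARY", ... } ]

// On the mongos, its cached view of the set may still point at the old primary,
// so operations routed to that shard hang while it retries mo01:
db.adminCommand({ connPoolStats: 1 }).replicaSets
// may still list mo01:27018 as master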



 Comments   
Comment by auto [ 18/Mar/13 ]

Author: Spencer T Brody <spencer@10gen.com>
Date: 2013-03-18T18:18:16Z

Message: Revert "SERVER-7573 SERVER-4661 Add test for mongos detecting RS failover when primary loses network connectivity"

This reverts commit 8ed4f87153afe99609898c6af7e1b58327e6335f.
Branch: master
https://github.com/mongodb/mongo/commit/a6046309af1b8e7751976e68f46152574e4f078a

Comment by Spencer Brody (Inactive) [ 15/Mar/13 ]

I have tested with 2.4.0-RC3 and cannot reproduce the problem. A lot of the connection management code in mongos has changed over the last two versions, so it's quite possible that this was a problem in 2.0 and no longer is. Can you please re-test on a 2.4.0 RC (or the official 2.4.0 once it is released in the very near future)? I'm closing this ticket for now, but if this is still a problem for you in 2.4.0, please re-open.

Comment by auto [ 15/Mar/13 ]

Author: Spencer T Brody <spencer@10gen.com>
Date: 2013-03-15T16:10:09Z

Message: SERVER-7573 SERVER-4661 Add test for mongos detecting RS failover when primary loses network connectivity
Branch: master
https://github.com/mongodb/mongo/commit/8ed4f87153afe99609898c6af7e1b58327e6335f

Comment by Spencer Brody (Inactive) [ 07/Nov/12 ]

SERVER-5642 or SERVER-7573 could be used to test this, but I think SERVER-7573 is the better approach, since it more closely represents what happens in the wild.

Comment by Sergey Zubkovsky [ 19/Sep/12 ]

Does this bug affect 2.2.x versions?

Comment by Spencer Brody (Inactive) [ 10/Apr/12 ]

Need failpoints from SERVER-5175 (or something similar) to be able to test network connectivity loss.
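
As a rough sketch of what such a test hook could look like once failpoints exist, using the configureFailPoint admin command with a hypothetical failpoint name (the actual name and behavior would come from SERVER-5175/SERVER-7573):

// "simulateNetworkPartition" is a hypothetical failpoint name;
// only the configureFailPoint mechanism itself is assumed here.
db.adminCommand({ configureFailPoint: "simulateNetworkPartition", mode: "alwaysOn" })
// ... exercise the mongos against the shard while the old primary is unreachable ...
db.adminCommand({ configureFailPoint: "simulateNetworkPartition", mode: "off" })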

Comment by Andy Gayton [ 13/Jan/12 ]

We've just hit an issue running a fire drill that sounds very similar to this.

We're still running 2.0.1, but I couldn't see anything in the 2.0.2 release notes that looks like it addresses this issue.

Our setup is a single replica set of 3 nodes: mo01, mo02, mo03. Each of these nodes also runs a config server and has the DNS name mc01, mc02, mc03 respectively, so we can move the config servers later on.

Our application nodes have a mongos running with:

bin/mongos --configdb mc01:27019,mc02:27019,mc03:27019

And the single shard is configured with:

db.adminCommand({addShard: 'shard1/mo01:27018,mo02:27018,mo03:27018'})

The fire drill is to terminate the EC2 node running the current master, mo01, expect the app to cleanly retry while mo02 or mo03 is elected as the new master, and then fire up a fresh EC2 node with a recent EBS snapshot attached to replace mo01.

However, once mo01 is terminated, the local mongos on the app nodes becomes completely unresponsive. bin/mongo hangs for a long time before it eventually brings up a console, and then db.printShardingStatus() hangs more or less forever. Connecting directly to mo02/mo03 works fine.

The mongos logs look like this:

2012-01-13_19:12:32.71920 Fri Jan 13 19:12:32 [ReplicaSetMonitorWatcher] reconnect mo01:27018 failed couldn't connect to server mo01:27018
2012-01-13_19:12:33.94754 Fri Jan 13 19:12:33 [conn4] SyncClusterConnection connecting to [mc01:27019]
2012-01-13_19:12:42.76191 Fri Jan 13 19:12:42 [ReplicaSetMonitorWatcher] trying reconnect to mo01:27018
2012-01-13_19:12:47.75921 Fri Jan 13 19:12:47 [ReplicaSetMonitorWatcher] reconnect mo01:27018 failed couldn't connect to server mo01:27018
2012-01-13_19:12:57.80280 Fri Jan 13 19:12:57 [ReplicaSetMonitorWatcher] trying reconnect to mo01:27018
2012-01-13_19:13:02.79923 Fri Jan 13 19:13:02 [ReplicaSetMonitorWatcher] reconnect mo01:27018 failed couldn't connect to server mo01:27018
2012-01-13_19:13:12.81288 Fri Jan 13 19:13:12 [ReplicaSetMonitorWatcher] trying reconnect to mo01:27018
2012-01-13_19:13:17.80929 Fri Jan 13 19:13:17 [ReplicaSetMonitorWatcher] reconnect mo01:27018 failed couldn't connect to server mo01:27018
2012-01-13_19:13:27.82288 Fri Jan 13 19:13:27 [ReplicaSetMonitorWatcher] trying reconnect to mo01:27018
2012-01-13_19:13:32.81919 Fri Jan 13 19:13:32 [ReplicaSetMonitorWatcher] reconnect mo01:27018 failed couldn't connect to server mo01:27018
2012-01-13_19:13:42.65965 Fri Jan 13 19:13:42 [mongosMain] connection accepted from 127.0.0.1:38212 #5
2012-01-13_19:13:42.66249 Fri Jan 13 19:13:42 [conn5] end connection 127.0.0.1:38212
2012-01-13_19:13:42.83266 Fri Jan 13 19:13:42 [ReplicaSetMonitorWatcher] trying reconnect to mo01:27018
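
As a sketch (not taken from the original report), the failover itself can be confirmed directly against a surviving member with standard shell helpers; the expected output values shown are assumptions:

// Run from a mongo shell connected directly to mo02:27018 (not through mongos).
db.isMaster().primary
// expected: "mo02:27018" or "mo03:27018" once the election has completed
rs.status().members.map(function (m) { return m.name + ": " + m.stateStr; })
// mo01:27018 would show as unreachable while mo02/mo03 report PRIMARY/SECONDARY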
