[SERVER-4661] Mongos doesn't detect primary change if old primary lost network connectivity Created: 11/Jan/12 Updated: 10/Dec/14 Resolved: 15/Mar/13 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 2.0.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Spencer Brody (Inactive) | Assignee: | Spencer Brody (Inactive) |
| Resolution: | Done | Votes: | 6 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: | |
| Operating System: | ALL |
| Participants: | |
| Description |
|
If the primary of a shard loses all network connectivity, the secondary will take over as primary, but the mongos may keep trying to reconnect to the former primary. |
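How connectivity is lost matters when reproducing this: a cleanly killed mongod closes its TCP connections, while a node that silently drops off the network leaves mongos's pooled connections dangling with no RST or FIN. Below is a minimal repro sketch along those lines; it assumes a Linux host for the primary, the shard port 27018 used in the report further down, an app node `app-node` running mongos on 27017, and iptables being available. None of these details are prescribed by the ticket itself.

```
# On the current primary's host: silently drop all shard traffic instead of
# shutting mongod down, so neither the secondaries nor mongos see a socket close.
iptables -A INPUT  -p tcp --dport 27018 -j DROP
iptables -A OUTPUT -p tcp --sport 27018 -j DROP

# After the replica set elects a new primary, probe mongos from a client host;
# on affected versions this is where the hang described below shows up.
mongo app-node:27017 --eval 'printjson(db.adminCommand({ping: 1}))'

# Undo the partition when the test is done.
iptables -D INPUT  -p tcp --dport 27018 -j DROP
iptables -D OUTPUT -p tcp --sport 27018 -j DROP
```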
| Comments |
| Comment by auto [ 18/Mar/13 ] |
|
Author: Spencer T Brody <spencer@10gen.com>, 2013-03-18T18:18:16Z. Message: Revert " This reverts commit 8ed4f87153afe99609898c6af7e1b58327e6335f. |
| Comment by Spencer Brody (Inactive) [ 15/Mar/13 ] |
|
Have tested with 2.4.0-RC3 and cannot reproduce the problem. A lot of the connection management code in mongos has changed in the last two versions, so it's quite possible that this was a problem in 2.0 and no longer is. Can you please re-test this on a 2.4.0 RC (or the official 2.4.0 once it is released in the very near future)? I'm closing this ticket for now, but if this is still a problem for you in 2.4.0, please re-open. |
| Comment by auto [ 15/Mar/13 ] |
|
Author: Spencer T Brody <spencer@10gen.com>, 2013-03-15T16:10:09Z. Message: |
| Comment by Spencer Brody (Inactive) [ 07/Nov/12 ] |
|
|
| Comment by Sergey Zubkovsky [ 19/Sep/12 ] |
|
Does this bug affect 2.2.x versions? |
| Comment by Spencer Brody (Inactive) [ 10/Apr/12 ] |
|
Need failpoints from |
| Comment by Andy Gayton [ 13/Jan/12 ] |
|
We've just hit an issue running a fire drill that sounds very similar to this. We're still running 2.0.1, but I couldn't see anything in the 2.0.2 release notes that looks like it addresses this issue.

Our setup is a single replica set of 3 nodes: mo01, mo02, mo03. Each of these nodes also runs a config server, reachable via the DNS names mc01, mc02, mc03 respectively, so we can move the config servers later on. Our application nodes each run a mongos started with:

bin/mongos --configdb mc01:27019,mc02:27019,mc03:27019

And the single shard is configured with:

db.adminCommand( {addShard:'shard1/mo01:27018,mo02:27018,mo03:27018'})

The fire drill is to terminate the EC2 node running the current master, mo01, expect the app to cleanly retry while mo02 or mo03 is elected as the new master, then to fire up a fresh EC2 node and attach a recent EBS snapshot to replace mo01.

However, once mo01 is terminated, the local mongos on the app nodes becomes completely unresponsive. bin/mongo just hangs and won't bring up the console for a long time; eventually you can get a console, but then db.printShardingStatus() hangs, pretty much forever. Connecting directly to mo02/mo03 works OK. The mongos logs look like this:

2012-01-13_19:12:32.71920 Fri Jan 13 19:12:32 [ReplicaSetMonitorWatcher] reconnect mo01:27018 failed couldn't connect to server mo01:27018 |
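For anyone hitting this on 2.0.x in the meantime, the only mitigation suggested by this ticket's symptoms is restarting the wedged mongos so it rebuilds its replica set view; the awkward part is detecting the wedge, since connections hang rather than fail. A rough watchdog sketch follows, assuming a mongos on localhost:27017, GNU coreutils `timeout`, and an init script at /etc/init.d/mongos. These paths, ports, and the restart mechanism are illustrative assumptions, not anything from this ticket.

```
#!/bin/sh
# Probe the local mongos with a short timeout: a healthy mongos answers
# ping almost immediately, while a wedged one (as described above) hangs.
if ! timeout 5 mongo localhost:27017 --quiet \
        --eval 'db.adminCommand({ping: 1})' >/dev/null 2>&1; then
    # mongos is unresponsive: restart it so it re-discovers the
    # replica set's current primary from a clean slate.
    /etc/init.d/mongos restart
fi
```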