[SERVER-8273] Mongos view of Replica Set can become stale forever if members change completely between refresh Created: 22/Jan/13 Updated: 19/Apr/18 Resolved: 19/Apr/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 2.2.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | James Blackburn | Assignee: | Randolph Tan |
| Resolution: | Done | Votes: | 1 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: |
Linux RHEL 5.5 |
||
| Attachments: |
|
| Operating System: | Linux |
| Participants: |
| Description |
|
The ReplicaSetMonitor refreshes its view of the replica set every 10 seconds, or when a new operation is requested on a replica set connection that had errored out. The problem arises when the membership of the set changes so completely that none of the current members are among the members the ReplicaSetMonitor knows about. As a direct consequence, the monitor has no way to contact any of the new members. The only workaround for this issue is to manually edit the config.shards collection and restart all mongos instances. Attached is a script, test.patch, that demonstrates this problem. Original bug report:
|
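The failure mode in the Description can be illustrated with a small simulation. This is only a sketch under stated assumptions: `ToyMonitor`, its `refresh` method, and the host names are hypothetical stand-ins and are not MongoDB's actual ReplicaSetMonitor code. It assumes the monitor can learn the current member list only from a host that it already knows and that is still a member of the set.

```python
class ToyMonitor:
    """Hypothetical stand-in for a replica set monitor (not MongoDB code)."""

    def __init__(self, seeds):
        # Hosts the monitor currently believes are members of the set.
        self.known = set(seeds)

    def refresh(self, live_members):
        """Try to learn the current member list from a known host.

        live_members is the set of hosts that are in the replica set now.
        Assumption: only a host that is still a member can report the
        current configuration. Returns True if the view was updated.
        """
        for host in sorted(self.known):
            if host in live_members:
                self.known = set(live_members)
                return True
        # No known host is still a member, so the view cannot change.
        return False


monitor = ToyMonitor({"a:27017", "b:27017", "c:27017"})

# Partial change: "a" is still a member, so the monitor learns "d" and "e".
partial_ok = monitor.refresh({"a:27017", "d:27017", "e:27017"})

# Complete change between refreshes: none of the known hosts remain.
complete_ok = monitor.refresh({"x:27017", "y:27017", "z:27017"})

# The view is stuck on the previous membership.
stale_view = sorted(monitor.known)
```

In this toy model the second refresh can never succeed, no matter how often it is retried, which mirrors the "stale forever" behavior in the title. Only new seed information from outside the monitor would break the loop.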
| Comments |
| Comment by Gregory McKeon (Inactive) [ 19/Apr/18 ] |
|
This is also an issue in drivers. We don't see a good solution beyond using SRV records in 3.6, so we're closing this as WaD (works as designed) for now. |
| Comment by Randolph Tan [ 27/Feb/13 ] |
|
My apologies, I got sidetracked when I found the issue. You are correct that the ERROR log message was relevant. It only appears when mongos tries to update the config server with a new seed list. Newer mongos versions log a more detailed message explaining why the update failed. We don't encourage changing the config.shards collection manually, as it is easy to make mistakes. |
| Comment by James Blackburn [ 16/Feb/13 ] |
Note this isn't the case in this bug report. The new interface, cn54-ib, is an interface on the same host, cn54. The mongods are contactable through both the old and the new DNS name. Looking at the logs, there appears to be some internal confusion: the mongos instances are clearly seeing the hosts at the new address, they're just unable to update the config db. Also, we did a lot of googling and couldn't find any docs for recovering from a situation like this. It would be great if there was a doc that said it was safe to change config.shards manually. |
| Comment by Eliot Horowitz (Inactive) [ 16/Feb/13 ] |
|
Not sure there is any real solution to this. |
| Comment by James Blackburn [ 30/Jan/13 ] |
|
It's happened again. We've removed a host from one of the shards, and the config servers haven't noticed. Meanwhile the mongos instances have loads of this:
What's the best way to fix this? |
| Comment by James Blackburn [ 23/Jan/13 ] |
|
I've attached the output of:
on that log file. |
| Comment by James Blackburn [ 23/Jan/13 ] |
|
Looking in one of the mongos logs from last Thursday afternoon:
Perhaps:
is relevant? |
| Comment by James Blackburn [ 22/Jan/13 ] |
|
Thanks! The cluster is attached to MMS under the HAL group name. Some of the earlier pings no longer appear, but if you have history of older pings from this morning or before, you can see the problem. |
| Comment by Scott Hernandez (Inactive) [ 22/Jan/13 ] |
|
Great, thanks for the info and report. We have made some changes which I believe will improve this in 2.4, but they may not have been, or may not be able to be, back-ported. I am asking one of the developers working on those changes to check whether this case is covered by them. |
| Comment by James Blackburn [ 22/Jan/13 ] |
|
Yes, the replica sets were reconfigured in the middle of last week. Moreover, the entire cluster was restarted over the weekend for BIOS firmware upgrades. Today, too, we killed and restarted the mongos instances a few times, hoping they would pick up the change. Oddly, they did pick up the change to one of the replica sets last week. |
| Comment by Scott Hernandez (Inactive) [ 22/Jan/13 ] |
|
Did you restart any mongos instances since the replica set reconfig? |