[SERVER-17668] Mongos Fail over did not work as expected Created: 19/Mar/15  Updated: 19/Mar/15  Resolved: 19/Mar/15

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.6.8
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: joe fang
Assignee: Andy Schwerin
Resolution: Duplicate
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-17617 One config server being down can bloc... Closed
Operating System: ALL
Participants:

 Description   

Several weeks ago we ran a failover test against the MongoDB config servers. We ramped up load on the system and then stopped the config server services one at a time, in the order master, slave1, slave2.

According to the following passage from the MongoDB knowledge base, the system should continue to function without performance degradation:
If all three config servers are unavailable, you can still use the cluster if you do not restart the mongos instances until after the config servers are accessible again. If you restart the mongos instances before the config servers are available, the mongos will be unable to route reads and writes.

However, the findings from our load test do not match what the knowledge base led us to expect:

  1. When one config server service was stopped, overall throughput dropped from 1008.52 to 995.9.
  2. When two config server services were stopped, overall throughput dropped from 995.9 to 652.
  3. When all three config server services were shut down, all requests failed.
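
For scale, the relative size of each drop can be worked out from the figures above. The following is a minimal Python sketch; the helper name `percent_drop` is illustrative only and not part of any MongoDB tooling, and the throughput unit is left unspecified because the report does not state it.

```python
def percent_drop(before: float, after: float) -> float:
    """Return the relative decrease from `before` to `after`, in percent."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100.0


# Throughput figures taken from the findings above (unit not stated in the report).
baseline = 1008.52   # all three config servers running
one_down = 995.9     # one config server stopped
two_down = 652.0     # two config servers stopped

steps = [
    ("one config server stopped", baseline, one_down),
    ("two config servers stopped", one_down, two_down),
    ("two stopped, relative to baseline", baseline, two_down),
]

for label, before, after in steps:
    print(f"{label}: {percent_drop(before, after):.2f}% drop")
```

On these numbers, the first stop costs a little over 1% of throughput, while the second costs roughly a third.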

Is this a known bug?



 Comments   
Comment by Andy Schwerin [ 19/Mar/15 ]

Duplicate of SERVER-17617.

Comment by Andy Schwerin [ 19/Mar/15 ]

It is by design that routing of requests eventually fails when all config servers are unavailable. Furthermore, when any config server is unavailable, metadata changes (such as chunk migrations, sharding collections and creating databases) are expected to fail.

The slowdown you experience when one config server is stopped varies depending on which config server you stop, and the same applies when a second one is stopped. This is a known issue in the config server protocols. Our long-term plan is to implement SERVER-1448 and replace config servers with replica sets, and to improve the replica set secondary-read behavior. In the interim, we attempt to mitigate specific performance problems, but the config server protocols offer limited recourse.
