[SERVER-7271] On transient config server failure, mongod should not abort Created: 05/Oct/12 Updated: 11/Jul/16 Resolved: 02/Apr/13 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | 2.4.2, 2.5.0 |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Greg Studer | Assignee: | Alberto Lerner |
| Resolution: | Done | Votes: | 4 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Description |
|
Certain failure modes of the SyncClusterConnection to the config server are safe failures (we are sure nothing has been written). It is incorrect to fail by aborting mongod in these cases, since our state is fully known (the migration failed). Roll back and continue instead.

Original description: If the config server is offline for a short time during the critical section, causing a write error, but we are able to reconnect later, we should try to A) verify the write failure on all servers and B) locally roll back the new version, aborting the migration. Currently we fail hard and rely on the mongod restart for the config reload.

Note that this is different from the case where the config servers go down and stay down: in that case we are unable to read the state of the metadata and so have to terminate. |
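The decision described above can be sketched roughly as follows. This is a minimal illustration in Python, not the mongod implementation: `Action`, `ConfigUnreachable`, `unreachable_reader`, and `decide_after_write_error` are hypothetical names, and treating any server that reports something other than the old version as a reason to terminate is an assumption made here to stay conservative.

```python
from enum import Enum
from typing import Callable, Sequence


class Action(Enum):
    ROLLBACK_AND_CONTINUE = "roll back the new version locally and abort the migration"
    TERMINATE = "metadata state unknown: terminate"


class ConfigUnreachable(Exception):
    """Hypothetical error raised when a config server cannot be read."""


def unreachable_reader() -> int:
    """Hypothetical reader simulating a config server that stays down."""
    raise ConfigUnreachable("config server is down")


def decide_after_write_error(
    version_readers: Sequence[Callable[[], int]],
    old_version: int,
) -> Action:
    """Decide what to do after the config write reported an error.

    version_readers holds one callable per config server; each returns the
    chunk version that server currently stores (hypothetical interface).
    """
    if not version_readers:
        # Nothing to verify against, so the state cannot be confirmed.
        return Action.TERMINATE
    for read_version in version_readers:
        try:
            observed = read_version()
        except ConfigUnreachable:
            # Servers that stay down: the metadata cannot be read.
            return Action.TERMINATE
        if observed != old_version:
            # Assumption: anything but the old version is treated as an
            # unconfirmed failure and handled conservatively.
            return Action.TERMINATE
    # (A) the write failure is verified on all servers, so
    # (B) the new version can be rolled back locally.
    return Action.ROLLBACK_AND_CONTINUE
```

With three readers that all return the old version, the sketch returns `Action.ROLLBACK_AND_CONTINUE`; if any reader is `unreachable_reader`, it returns `Action.TERMINATE`.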
| Comments |
| Comment by auto [ 09/Apr/13 ] |
|
Author: Alberto Lerner <alerner@10gen.com>, 2013-04-09T19:59:22Z. Message: |
| Comment by auto [ 09/Apr/13 ] |
|
Author: Alberto Lerner <alerner@10gen.com>, 2013-04-09T19:59:22Z. Message: |
| Comment by auto [ 02/Apr/13 ] |
|
Author: Dan Pasette <dan@10gen.com>, 2013-04-02T16:06:20Z. Message: |
| Comment by auto [ 02/Apr/13 ] |
|
Author: Alberto Lerner <alerner@10gen.com>, 2013-03-31T20:08:37Z. Message: Conflicts: src/mongo/client/distlock.cpp |
| Comment by auto [ 01/Apr/13 ] |
|
Author: Alberto Lerner <alerner@10gen.com>, 2013-03-31T20:08:37Z. Message: |
| Comment by Shaun Verch [ 05/Oct/12 ] |
|
Setup: 9 servers, 3 shards with 3 replica set members each, running MongoDB 2.2. One member (member1) of a set went down because of a server crash; this server also runs 1 of the 3 config servers. After that, another member (member2) crashed, possibly because it couldn't reach the crashed server? This is the message on member2:
In this case, a config server went down in the middle of a moveChunk command, and the remaining config servers went read-only. The moveChunk command attempted to write the new config with applyOps, read the old config, got the old version, and then aborted. |
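The sequence in this comment can be reproduced with a toy model. The sketch below is only an illustration in Python under stated assumptions: `ConfigServer`, `MirroredConfig`, and `attempt_commit` are hypothetical names, and the model assumes that a write succeeds only when every config server is up while a read can be served by any server that is up.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ConfigServer:
    version: int      # chunk version stored on this server
    up: bool = True   # whether the server is reachable


class MirroredConfig:
    """Toy stand-in for a mirrored set of config servers (hypothetical)."""

    def __init__(self, servers: List[ConfigServer]) -> None:
        self.servers = servers

    def write_version(self, new_version: int) -> bool:
        # Assumption: with any member down, the set is read-only.
        if not all(s.up for s in self.servers):
            return False
        for s in self.servers:
            s.version = new_version
        return True

    def read_version(self) -> Optional[int]:
        # Reads are served by the first reachable member, if any.
        for s in self.servers:
            if s.up:
                return s.version
        return None


def attempt_commit(config: MirroredConfig, new_version: int) -> Tuple[bool, Optional[int]]:
    """Try to write the new version, then read back what is stored."""
    written = config.write_version(new_version)
    return written, config.read_version()
```

With three servers at version 5 and one of them down, `attempt_commit(config, 6)` returns `(False, 5)`: the write is rejected and the read-back still shows the old version, which is the state the moveChunk command observed before aborting.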