[SERVER-7472] Replication lag can cause cluster to hang in migration critical section Created: 25/Oct/12 Updated: 11/Jul/16 Resolved: 16/Nov/12 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 2.2.0 |
| Fix Version/s: | 2.2.2, 2.3.1 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Spencer Brody (Inactive) | Assignee: | Spencer Brody (Inactive) |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Description |
|
In the critical section of a migration, we check that a majority of secondaries have received the migration writes. If the secondaries fall behind at that point, the migration can stay stuck in the critical section for a long time (we time out after 5 minutes). During that time, the whole cluster can become unusable, because setShardVersion commands block until the shard is out of the critical section. |
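To illustrate the behavior described above, here is a minimal sketch of a bounded wait for majority replication. It is not the server's actual code: `wait_for_majority`, `caught_up_count`, and all parameters are hypothetical names, and the majority is computed over a simple node count as an assumption.

```python
import time
from typing import Callable


def wait_for_majority(
    caught_up_count: Callable[[], int],  # hypothetical: nodes that have the migration writes
    node_count: int,                     # hypothetical: nodes that count toward the majority
    timeout_s: float,
    poll_interval_s: float = 0.05,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until a majority of nodes are caught up, or give up at the deadline."""
    majority = node_count // 2 + 1
    deadline = clock() + timeout_s
    while True:
        if caught_up_count() >= majority:
            return True   # replication caught up; the critical section can finish
        if clock() >= deadline:
            return False  # timed out; the caller should abort and leave the critical section
        sleep(poll_interval_s)


if __name__ == "__main__":
    # Only 1 of 3 nodes is caught up, so the wait gives up after about 0.2 s.
    print(wait_for_majority(lambda: 1, node_count=3, timeout_s=0.2))
```

In this sketch, anything blocked on the critical section waits at most `timeout_s`, so a shorter timeout bounds how long replication lag can stall other operations.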
| Comments |
| Comment by auto [ 16/Nov/12 ] |
|
Author: Spencer T Brody <spencer@10gen.com> Date: 2012-10-25T02:56:06Z Message: Decrease timeout on replication catching up in migration
| Comment by Spencer Brody (Inactive) [ 25/Oct/12 ] |
|
We make sure all the main writes for the migration have happened before entering the critical section, so this can only happen if writes have been coming into that chunk during the migration AND those writes take a while to be replicated during the _transferMods phase of the migration. |