[SERVER-7472] Replication lag can cause cluster to hang in migration critical section Created: 25/Oct/12  Updated: 11/Jul/16  Resolved: 16/Nov/12

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.2.0
Fix Version/s: 2.2.2, 2.3.1

Type: Bug Priority: Major - P3
Reporter: Spencer Brody (Inactive) Assignee: Spencer Brody (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-7500 Set socket timeout on connection used... Closed
Related
related to SERVER-7298 thousands of "waiting till out of cri... Closed
is related to SERVER-7034 timeouts for all connections in migra... Closed
is related to SERVER-7493 Possible for read starvation to cause... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   

In the critical section of a migration, we check that a majority of secondaries have received the migration writes. If the secondaries fall behind at that point, the migration can stay stuck in the critical section for a long time (we time out after 5 minutes). During that time, the whole cluster can become unusable, because setShardVersion commands block until the shard is out of the critical section.
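
To make the failure mode concrete, here is a minimal sketch of a bounded wait for majority replication. It is not the server's actual code: the function names, the integer "optime" representation, and the polling approach are all assumptions made for illustration. It only shows why the length of the timeout directly bounds how long other operations can be blocked behind the critical section.

```python
import time
from typing import Callable, Sequence


def majority_caught_up(secondary_optimes: Sequence[int], target_optime: int) -> bool:
    """Return True when a strict majority of secondaries have reached target_optime.

    Optimes are modeled as plain integers here (an assumption for illustration).
    """
    caught_up = sum(1 for optime in secondary_optimes if optime >= target_optime)
    return caught_up * 2 > len(secondary_optimes)


def wait_for_majority(
    get_optimes: Callable[[], Sequence[int]],
    target_optime: int,
    timeout_secs: float,
    poll_secs: float = 0.01,
) -> bool:
    """Poll until a majority has replicated target_optime or the timeout expires.

    While this loop runs inside a (hypothetical) critical section, anything that
    waits on the critical section is blocked, so timeout_secs is the worst-case
    stall seen by the rest of the cluster.
    """
    deadline = time.monotonic() + timeout_secs
    while True:
        if majority_caught_up(get_optimes(), target_optime):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_secs)
```

With a 5-minute timeout, a lagging set of secondaries keeps the loop (and everything blocked behind it) waiting for up to 5 minutes; a shorter timeout lets the migration give up sooner and release the critical section.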



 Comments   
Comment by auto [ 16/Nov/12 ]

Author: Spencer T Brody (spencer@10gen.com), 2012-10-25T02:56:06Z

Message: Decrease timeout on replication catching up in migration SERVER-7472
Branch: v2.2
https://github.com/mongodb/mongo/commit/cb2e7e34d5a2dddeba4eaffece4af7fadcf615a2

Comment by auto [ 16/Nov/12 ]

Author: Spencer T Brody (spencer@10gen.com), 2012-10-25T02:56:06Z

Message: Decrease timeout on replication catching up in migration SERVER-7472
Branch: master
https://github.com/mongodb/mongo/commit/35ca4d0f94dd5a6cbc091d18060b842ac2dff2ce

Comment by Spencer Brody (Inactive) [ 25/Oct/12 ]

We make sure all the main writes for the migration have happened before entering the critical section, so this can only happen if writes keep coming into that chunk during the migration AND those writes take a while to replicate during the _transferMods phase of the migration.

Generated at Thu Feb 08 03:14:38 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.