[SERVER-84135] Chunk Migration Failure in Shard “error”:”OperationFailed: Data transfer error: migrate failed: WriteConcernFailed: waiting for replication timed out” Created: 13/Dec/23  Updated: 13/Dec/23  Resolved: 13/Dec/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Madhu Sai Vavilala Assignee: Unassigned
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Duplicate
Operating System: ALL
Participants:

 Description   

Hi Team,

I would like to bring to your attention an issue we have been encountering in one of our sharded environments during chunk migration. The issue first appeared after we upgraded MongoDB from v4.4.25 to v5.0.21.

 

Here is a summary of the error logs we’ve observed:

 

{"t":{"$date":"2023-10-30T19:12:11.717+05:30"},"s":"I", "c":"SHARDING", "id":21872, "ctx":"Balancer","msg":"Migration failed","attr":{"migrateInfo":"DB.Coll: [{ ID: MinKey }, { ID: -92188298389644630XX }), from Shard3, to Shard6","error":"CommandFailed: commit clone failed :: caused by :: startCommit timed out waiting for the catch up completion. Sender's session is Shaed3_Shard6_653fb0640cb752fb246bd6b6. Current session is Shaed3_Shard6_653fb0640cb752fb246bd6b6"}}

{"t":{"$date":"2023-10-30T19:22:58.782+05:30"},"s":"I", "c":"SHARDING", "id":21872, "ctx":"Balancer","msg":"Migration failed","attr":{"migrateInfo":"DB.Coll: [{ ID: MinKey }, { ID: -92188298389644630XX }), from Shard3, to Shard4","error":"OperationFailed: Data transfer error: migrate failed: WriteConcernFailed: waiting for replication timed out"}}
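
To gauge how often each failure type occurs, the structured log entries can be grouped by the leading error name. The following is a minimal sketch, not a tool we actually ran: the function name `summarizeMigrationFailures` is hypothetical, the sample lines are abbreviated versions of the entries above, and it assumes one JSON log entry per line with log id 21872 marking a failed migration.

```javascript
// Count balancer "Migration failed" entries (log id 21872) by error name.
// Hypothetical helper; assumes one JSON document per log line.
function summarizeMigrationFailures(logLines) {
  const counts = {};
  for (const line of logLines) {
    let entry;
    try {
      entry = JSON.parse(line);
    } catch (e) {
      continue; // skip lines that are not valid JSON
    }
    if (!entry || entry.id !== 21872 || !entry.attr) continue;
    // The error string starts with the error name, e.g. "CommandFailed: ..."
    const errorName = String(entry.attr.error).split(":")[0];
    counts[errorName] = (counts[errorName] || 0) + 1;
  }
  return counts;
}

// Abbreviated sample entries (error text shortened for illustration).
const sampleLines = [
  '{"t":{"$date":"2023-10-30T19:12:11.717+05:30"},"s":"I","c":"SHARDING","id":21872,"ctx":"Balancer","msg":"Migration failed","attr":{"error":"CommandFailed: commit clone failed"}}',
  '{"t":{"$date":"2023-10-30T19:22:58.782+05:30"},"s":"I","c":"SHARDING","id":21872,"ctx":"Balancer","msg":"Migration failed","attr":{"error":"OperationFailed: Data transfer error"}}',
  'this line is not JSON and is skipped'
];

const summary = summarizeMigrationFailures(sampleLines);
console.log(JSON.stringify(summary));
```

In practice the input would be the lines of the config server primary's log file rather than a hard-coded array.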

FYI:

  • We have stopped the balancer as a temporary solution.
  • The default write concern is set to {w: 1}, and _secondaryThrottle is also set to {w: 1}.
  • These errors occur persistently during balancer-initiated chunk migrations. However, when we manually migrate the same chunk, it completes without any errors.
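
For reference, a manual migration of this kind can be issued with the `moveChunk` admin command. The sketch below is illustrative only and is not the exact command that was run: the helper name `buildMoveChunkCommand` is hypothetical, the shard key value is a placeholder (the real chunk bounds are elided in the logs), and the option values are assumptions chosen to mirror the {w: 1} settings listed above.

```javascript
// Build a moveChunk admin command document. All values are illustrative.
function buildMoveChunkCommand(namespace, shardKeyQuery, toShard) {
  return {
    moveChunk: namespace,      // full namespace, e.g. "DB.Coll"
    find: shardKeyQuery,       // any shard key value inside the chunk to move
    to: toShard,               // name of the destination shard
    _secondaryThrottle: true,  // apply the write concern below to each copied document
    writeConcern: { w: 1 },    // assumption: mirrors the {w: 1} settings in this report
    _waitForDelete: false      // do not wait for range deletion on the donor shard
  };
}

// Placeholder: replace with a shard key value that falls inside the chunk.
const placeholderKey = { ID: "PLACEHOLDER" };
const cmd = buildMoveChunkCommand("DB.Coll", placeholderKey, "Shard6");

// In a shell connected to a mongos, the command would be run as:
//   db.adminCommand(cmd)
console.log(JSON.stringify(cmd));
```

Building the command document separately makes it easy to compare the options of a manual migration with the balancer settings before running anything against the cluster.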

 

mongos> db.settings.find()
{ "_id" : "balancer", "mode" : "off", "stopped" : true, "_secondaryThrottle" : { "w" : 1 } }
{ "_id" : "autosplit", "enabled" : false }
{ "_id" : "ReadWriteConcernDefaults", "defaultWriteConcern" : { "w" : 1, "wtimeout" : 0 }, "updateOpTime" : Timestamp(1698308057, 1884), "updateWallClockTime" : ISODate("2023-10-26T08:14:17.727Z") }

 

If anyone has encountered a similar error or has suggestions on how to mitigate this issue, please share your insights. We are actively seeking a resolution to this matter.

Thank you for your attention and support.

