- Type: Bug
- Resolution: Unresolved
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Cluster Scalability
The following is a moveCollection failure completion log from a patch run:
[js_test:resharding_skip_cloning_and_applying] c20049| 2025-05-06T19:33:25.516+00:00 I RESHARD 7763800 [S] [ReshardingCoordinatorService-3] "Resharding complete","attr":{"info":{"uuid":{"uuid":{"$uuid":"b2d301bc-c59c-4759-b085-5e449c9c14d1"}},"status":"failed","statistics":{"ns":"testDb.testColl","provenance":"moveCollection","sourceUUID":{"uuid":{"$uuid":"91ff900a-6809-4e13-aff6-93852aef412f"}},"oldShardKey":"{ _id: 1 }","newShardKey":"{ _id: 1 }","startTime":{"$date":"2025-05-06T19:19:22.407Z"},"endTime":{"$date":"2025-05-06T19:33:25.515Z"},"operationDurationMs":843108,"numberOfSourceShards":1,"numberOfDestinationShards":1,"donors":{"0":{"shardName":"resharding_skip_cloning_and_applying-rs0","bytesToClone":2900,"documentsToClone":100,"indexCount":2,"phaseDurations":{},"writesDuringCriticalSection":0}},"recipients":{"0":{"shardName":"resharding_skip_cloning_and_applying-rs0","bytesCloned":0,"documentsCloned":0,"oplogsFetched":0,"oplogsApplied":0,"indexCount":0},"1":{"shardName":"resharding_skip_cloning_and_applying-rs1","bytesCloned":0,"documentsCloned":100,"oplogsFetched":0,"oplogsApplied":0,"indexCount":0}},"totals":{"copyDurationMs":417947,"applyDurationMs":103,"criticalSectionDurationMs":424553,"totalBytesToClone":2900,"totalDocumentsToClone":100,"averageDocSize":29,"totalBytesCloned":0,"totalDocumentsCloned":100,"totalOplogsFetched":0,"totalOplogsApplied":0,"maxDonorIndexes":2,"maxRecipientIndexes":0,"numberOfIndexesDelta":-2},"criticalSection":{"interval":{"start":{"$date":"2025-05-06T19:26:20.963Z"}},"expiration":{"$date":"2025-05-07T19:26:20.892Z"},"totalWritesDuringCriticalSection":0}}}}
Questions:
- If the operation failed, why did the coordinator not persist an abortReason?
- Ideally, there should be an abort reason attached any time a reshard/move/unshard collection fails.
- How are the following cloning metrics possible: {"bytesCloned":0,"documentsCloned":100}?
- Both recipients fetched and applied zero oplog entries, so why does the coordinator report time spent applying ({"applyDurationMs":103})?
- Could it be the time taken to transition the coordinator state document (writing to disk, waiting for majority, etc.)?
- Note this is a moveCollection operation, so there is only one real recipient. The other is the dbPrimary, added as a recipient by default, but it will skip the cloning and applying phases.
- Are these metric abnormalities specific to moveCollection?
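To illustrate the cloning-metrics question above, here is a minimal sanity check (the field values are copied from the completion log; the consistency rule itself is an assumption, not a documented invariant): if 100 documents were cloned at an averageDocSize of 29 bytes, totalBytesCloned should be about 2900, not 0.

```python
# Hypothetical consistency check on the "totals" section of the completion
# log above. The expectation bytesCloned ~= documentsCloned * averageDocSize
# is an assumption for illustration, not a guarantee made by the server.
totals = {
    "totalBytesCloned": 0,
    "totalDocumentsCloned": 100,
    "averageDocSize": 29,
}

expected_bytes = totals["totalDocumentsCloned"] * totals["averageDocSize"]
print("expected bytes cloned:", expected_bytes)       # 2900
print("reported bytes cloned:", totals["totalBytesCloned"])  # 0

# The reported value contradicts the expectation, which is the anomaly
# flagged in the questions above.
inconsistent = (
    totals["totalDocumentsCloned"] > 0
    and totals["totalBytesCloned"] == 0
)
print("inconsistent:", inconsistent)  # True
```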