Investigate correctness of moveCollection completion log metrics


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Cluster Scalability
    • ALL
    • 3
    • TBD

      The following is the completion log of a failed moveCollection operation from a patch run:

      [js_test:resharding_skip_cloning_and_applying] c20049| 2025-05-06T19:33:25.516+00:00 I  RESHARD  7763800 [S] [ReshardingCoordinatorService-3] "Resharding complete","attr":{"info":{"uuid":{"uuid":{"$uuid":"b2d301bc-c59c-4759-b085-5e449c9c14d1"}},"status":"failed","statistics":{"ns":"testDb.testColl","provenance":"moveCollection","sourceUUID":{"uuid":{"$uuid":"91ff900a-6809-4e13-aff6-93852aef412f"}},"oldShardKey":"{ _id: 1 }","newShardKey":"{ _id: 1 }","startTime":{"$date":"2025-05-06T19:19:22.407Z"},"endTime":{"$date":"2025-05-06T19:33:25.515Z"},"operationDurationMs":843108,"numberOfSourceShards":1,"numberOfDestinationShards":1,"donors":{"0":{"shardName":"resharding_skip_cloning_and_applying-rs0","bytesToClone":2900,"documentsToClone":100,"indexCount":2,"phaseDurations":{},"writesDuringCriticalSection":0}},"recipients":{"0":{"shardName":"resharding_skip_cloning_and_applying-rs0","bytesCloned":0,"documentsCloned":0,"oplogsFetched":0,"oplogsApplied":0,"indexCount":0},"1":{"shardName":"resharding_skip_cloning_and_applying-rs1","bytesCloned":0,"documentsCloned":100,"oplogsFetched":0,"oplogsApplied":0,"indexCount":0}},"totals":{"copyDurationMs":417947,"applyDurationMs":103,"criticalSectionDurationMs":424553,"totalBytesToClone":2900,"totalDocumentsToClone":100,"averageDocSize":29,"totalBytesCloned":0,"totalDocumentsCloned":100,"totalOplogsFetched":0,"totalOplogsApplied":0,"maxDonorIndexes":2,"maxRecipientIndexes":0,"numberOfIndexesDelta":-2},"criticalSection":{"interval":{"start":{"$date":"2025-05-06T19:26:20.963Z"}},"expiration":{"$date":"2025-05-07T19:26:20.892Z"},"totalWritesDuringCriticalSection":0}}}} 

       

      Questions:

      • If the operation failed, why did the coordinator not persist an abortReason? 
        • Ideally, there should be an abort reason attached any time a reshard/move/unshard collection fails.
      • How are the following cloning metrics possible: {"bytesCloned":0,"documentsCloned":100}?
      • Both recipients fetched and applied zero oplog entries, so why did the coordinator spend time applying ({"applyDurationMs":103})?
        • Could it be the time to transition the coordinator state document (writing to disk, waiting for majority, etc.)?
        • Note that this is a moveCollection operation, so there is only one real recipient. The other is the dbprimary, which is added as a recipient by default but skips the cloning and applying phases.
      • Are these metrics abnormalities specific to moveCollection? 
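The questions above can be restated as mechanical checks against the logged "info" object, which may help when scanning other patch runs for the same pattern. The following is a minimal sketch, not part of the server or its test suite: the `find_anomalies` helper is hypothetical, the `INFO` dict is a hand-trimmed copy of the fields from the log line above, and it assumes that an abort reason, if persisted, would appear under an `abortReason` key in "info".

```python
# Hand-trimmed copy of the "info" object from the log line above.
INFO = {
    "status": "failed",
    "statistics": {
        "provenance": "moveCollection",
        "recipients": {
            "0": {"shardName": "resharding_skip_cloning_and_applying-rs0",
                  "bytesCloned": 0, "documentsCloned": 0,
                  "oplogsFetched": 0, "oplogsApplied": 0},
            "1": {"shardName": "resharding_skip_cloning_and_applying-rs1",
                  "bytesCloned": 0, "documentsCloned": 100,
                  "oplogsFetched": 0, "oplogsApplied": 0},
        },
        "totals": {"applyDurationMs": 103,
                   "totalOplogsFetched": 0, "totalOplogsApplied": 0},
    },
}


def find_anomalies(info):
    """Return a list of human-readable descriptions of suspicious metrics.

    Hypothetical helper; the key name "abortReason" is an assumption.
    """
    anomalies = []
    stats = info["statistics"]

    # Question 1: a failed operation should carry an abort reason.
    if info["status"] == "failed" and "abortReason" not in info:
        anomalies.append("status is 'failed' but no abortReason is present")

    # Question 2: documents cannot be cloned without cloning any bytes.
    for rec in stats["recipients"].values():
        if rec["documentsCloned"] > 0 and rec["bytesCloned"] == 0:
            anomalies.append(
                f"{rec['shardName']}: documentsCloned="
                f"{rec['documentsCloned']} but bytesCloned=0"
            )

    # Question 3: time was attributed to applying although nothing was applied.
    totals = stats["totals"]
    if totals["totalOplogsApplied"] == 0 and totals["applyDurationMs"] > 0:
        anomalies.append(
            f"applyDurationMs={totals['applyDurationMs']} "
            "but totalOplogsApplied=0"
        )
    return anomalies


if __name__ == "__main__":
    for line in find_anomalies(INFO):
        print(line)
```

Running the same checks against completion logs from reshardCollection and unshardCollection runs would be one way to approach the last question, whether these abnormalities are specific to moveCollection.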

            Assignee:
            Unassigned
            Reporter:
            Kruti Shah
            Votes:
            0
            Watchers:
            2
