- Type: Bug
- Resolution: Unresolved
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Cluster Scalability
The following is a moveCollection failure completion log from a patch run:
[js_test:resharding_skip_cloning_and_applying] c20049| 2025-05-06T19:33:25.516+00:00 I RESHARD 7763800 [S] [ReshardingCoordinatorService-3] "Resharding complete","attr":{"info":{"uuid":{"uuid":{"$uuid":"b2d301bc-c59c-4759-b085-5e449c9c14d1"}},"status":"failed","statistics":{"ns":"testDb.testColl","provenance":"moveCollection","sourceUUID":{"uuid":{"$uuid":"91ff900a-6809-4e13-aff6-93852aef412f"}},"oldShardKey":"{ _id: 1 }","newShardKey":"{ _id: 1 }","startTime":{"$date":"2025-05-06T19:19:22.407Z"},"endTime":{"$date":"2025-05-06T19:33:25.515Z"},"operationDurationMs":843108,"numberOfSourceShards":1,"numberOfDestinationShards":1,"donors":{"0":{"shardName":"resharding_skip_cloning_and_applying-rs0","bytesToClone":2900,"documentsToClone":100,"indexCount":2,"phaseDurations":{},"writesDuringCriticalSection":0}},"recipients":{"0":{"shardName":"resharding_skip_cloning_and_applying-rs0","bytesCloned":0,"documentsCloned":0,"oplogsFetched":0,"oplogsApplied":0,"indexCount":0},"1":{"shardName":"resharding_skip_cloning_and_applying-rs1","bytesCloned":0,"documentsCloned":100,"oplogsFetched":0,"oplogsApplied":0,"indexCount":0}},"totals":{"copyDurationMs":417947,"applyDurationMs":103,"criticalSectionDurationMs":424553,"totalBytesToClone":2900,"totalDocumentsToClone":100,"averageDocSize":29,"totalBytesCloned":0,"totalDocumentsCloned":100,"totalOplogsFetched":0,"totalOplogsApplied":0,"maxDonorIndexes":2,"maxRecipientIndexes":0,"numberOfIndexesDelta":-2},"criticalSection":{"interval":{"start":{"$date":"2025-05-06T19:26:20.963Z"}},"expiration":{"$date":"2025-05-07T19:26:20.892Z"},"totalWritesDuringCriticalSection":0}}}}
Questions:
- If the operation failed, why did the coordinator not persist an abortReason?
- Ideally, there should be an abort reason attached any time a reshard/move/unshard collection fails.
- How are the following cloning metrics possible: {"bytesCloned":0,"documentsCloned":100}?
- Both recipients fetched and applied zero oplog entries, so why does the coordinator report time spent applying ({"applyDurationMs":103})?
- Could it be the time taken to transition the coordinator state document (writing to disk, waiting for majority, etc.)?
- Note this is a moveCollection operation, so there is only one real recipient. The other is the dbPrimary, added as a recipient by default, but it will skip the cloning and applying phases.
- Are these metric abnormalities specific to moveCollection?
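To illustrate the cloning-metrics question above, here is a minimal sanity check (the field values are copied from the completion log; the consistency rule itself is an assumption, not a documented invariant): if 100 documents were cloned at an averageDocSize of 29 bytes, totalBytesCloned should be about 2900, not 0.

```python
# Hypothetical consistency check on the "totals" section of the completion
# log above. The expectation bytesCloned ~= documentsCloned * averageDocSize
# is an assumption for illustration, not a guarantee made by the server.
totals = {
    "totalBytesCloned": 0,
    "totalDocumentsCloned": 100,
    "averageDocSize": 29,
}

expected_bytes = totals["totalDocumentsCloned"] * totals["averageDocSize"]
print("expected bytes cloned:", expected_bytes)       # 2900
print("reported bytes cloned:", totals["totalBytesCloned"])  # 0

# The reported value contradicts the expectation, which is the anomaly
# flagged in the questions above.
inconsistent = (
    totals["totalDocumentsCloned"] > 0
    and totals["totalBytesCloned"] == 0
)
print("inconsistent:", inconsistent)  # True
```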