[SERVER-71626] Failed to Presplit and create chunks in Sharded Cluster Created: 25/Nov/22  Updated: 12/Apr/23  Resolved: 12/Apr/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Rajesh Vinayagam Assignee: Jordi Serra Torrens
Resolution: Duplicate Votes: 0
Labels: chunks, sharded-cluster, sharding-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

STG


Attachments: Text File repro-server-71626.patch    
Issue Links:
Duplicate
duplicates SERVER-54979 Calling move/split/mergeChunk after o... Backlog
duplicates SERVER-68485 Merge and Split commands should not u... Closed
Assigned Teams:
Sharding EMEA
Operating System: ALL
Sprint: Sharding EMEA 2023-04-17
Participants:

 Description   

Summary

PreSplit and allocate the required chunks in a sharded cluster

Motivation

Who is the affected end user?

Developers working in a sharded cluster environment

How does this affect the end user?

The developers are currently blocked and are using the sharded cluster without pre-splitting.

How likely is it that this problem or use case will occur?

It occurs frequently.

If the problem does occur, what are the consequences and how severe are they?

Not able to pre-split and create chunks in a sharded cluster.

Is this issue urgent?

Yes.

Database name: chasetest

Collection name: documents

 

Steps:

 

sh.disableBalancing("chasetest.documents")
sh.startBalancer()
sh.enableSharding("chasetest")
sh.shardCollection("chasetest.documents", { "SHARD_KEY": 1 })

for (let x = 0; x < 840; x++) {
    print(`sh.splitAt("chasetest.documents", { "SHARD_KEY": ${x} })`)
    printjson(sh.splitAt("chasetest.documents", { "SHARD_KEY": x }))
}
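
One way to confirm whether the splits were applied is to count the chunks for the collection on the config server. This is a minimal sketch, assuming a 5.0+ cluster where config.chunks is keyed by the collection UUID rather than the namespace:

// Look up the collection's UUID, then count its chunks in config.chunks.
// On pre-5.0 clusters, filter config.chunks by { ns: "chasetest.documents" } instead.
const coll = db.getSiblingDB("config").collections.findOne({ _id: "chasetest.documents" })
const numChunks = db.getSiblingDB("config").chunks.countDocuments({ uuid: coll.uuid })
// Starting from a single initial chunk, 840 successful splits should yield 841 chunks.
print(`chunks for chasetest.documents: ${numChunks}`)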

 

 

 

MongoError: split failed :: caused by :: chunk operation commit failed: version 4|3||637fee532a5d9628d64e06eb||Timestamp(1669328467, 4) doesn't exist in namespace: chasetest.documents. Unable to save chunk ops. Command: { applyOps: [ { op: "u", b: true, ns: "config.chunks", o: { _id: ObjectId('637fee87f998b1216f78a0ef'), uuid: UUID("61c91778-d277-44fe-bb42-f1ca936b7a14"), min: { SHARD_KEY: 18 }, max: { SHARD_KEY: 20 }, shard: "dev-65148-csi-gms-ingestion-api-dev-rs1", lastmod: Timestamp(4, 2), history: [ { validAfter: Timestamp(1669328467, 4), shard: "dev-65148-csi-gms-ingestion-api-dev-rs1" } ] }, o2: { _id: ObjectId('637fee87f998b1216f78a0ef') } }, { op: "u", b: true, ns: "config.chunks", o: { _id: ObjectId('637fee8df998b1216f78a180'), uuid: UUID("61c91778-d277-44fe-bb42-f1ca936b7a14"), min: { SHARD_KEY: 20 }, max: { SHARD_KEY: MaxKey }, shard: "dev-65148-csi-gms-ingestion-api-dev-rs1", lastmod: Timestamp(4, 3), history: [ { validAfter: Timestamp(1669328467, 4), shard: "dev-65148-csi-gms-ingestion-api-dev-rs1" } ] }, o2: { _id: ObjectId('637fee8df998b1216f78a180') } } ], preCondition: [ { ns: "config.chunks", q: { query: { min: { SHARD_KEY: 18 }, max: { SHARD_KEY: MaxKey }, uuid: UUID("61c91778-d277-44fe-bb42-f1ca936b7a14") }, orderby: { lastmod: -1 } }, res: { uuid: UUID("61c91778-d277-44fe-bb42-f1ca936b7a14"), shard: "dev-65148-csi-gms-ingestion-api-dev-rs1" } } ], writeConcern: { w: 1, wtimeout: 0 } }. Result: { got: {}, whatFailed: { ns: "config.chunks", q: { query: { min: { SHARD_KEY: 18 }, max: { SHARD_KEY: MaxKey }, uuid: UUID("61c91778-d277-44fe-bb42-f1ca936b7a14") }, orderby: { lastmod: -1 } }, res: { uuid: UUID("61c91778-d277-44fe-bb42-f1ca936b7a14"), shard: "dev-65148-csi-gms-ingestion-api-dev-rs1" } }, ok: 0.0, errmsg: "preCondition failed", code: 2, codeName: "BadValue", $gleStats: { lastOpTime: { ts: Timestamp(1669328525, 6), t: 4 }, electionId: ObjectId('7fffffff0000000000000004') }, lastCommittedOpTime: Timestamp(1669328525, 6), $clusterTime: { clusterTime: Timestamp(1669328525, 6), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } }, $configTime: Timestamp(1669328525, 6), $topologyTime: Timestamp(1658235601, 4), operationTime: Timestamp(1669328525, 6) } :: caused by :: preCondition failed

 



 Comments   
Comment by Jordi Serra Torrens [ 12/Apr/23 ]

It's difficult to tell conclusively exactly what interleaving led to this split failure, given that the report does not mention the server version or include logs or a config dump. However, from the error message above I can infer that this must be 5.0 or greater (because the config.chunks documents have the 'uuid' field).

I have not been able to reproduce the issue after following the steps in the report. I tried 5.0, 6.0 and master (~7.0).

However, after code inspection I found one possible scenario where this can happen in 5.0, which I have been able to reproduce by using failpoints (attaching the repro for the record: repro-server-71626.patch). This scenario involves splitting chunks while the balancer is moving chunks to other shards. Certain interleavings can lead to the split being sent to a shard that no longer owns that chunk, which results in the exact same symptom as the report. This was fixed in 5.1 by SERVER-54979 (later reworked by SERVER-68485).

For this reason, I'm going to mark this as a duplicate of SERVER-54979/SERVER-68485. rajesh.vinayagam@mongodb.com please reopen if this did not happen on 5.0.

Additionally, I'd like to point out that `sh.disableBalancing` must be run after `sh.shardCollection`. Otherwise `disableBalancing` has no effect because the sharded collection does not yet exist, which could lead to the hypothesized migrations running concurrently with the split. If you reorder the commands, I expect the split failures to stop happening.
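
A minimal sketch of the reordered setup, reusing the namespace and shard key from the report (the 840-way split loop is kept as in the original steps):

// Shard the collection first, then disable balancing for it so the balancer
// cannot migrate chunks while the manual splits run.
sh.enableSharding("chasetest")
sh.shardCollection("chasetest.documents", { "SHARD_KEY": 1 })
sh.disableBalancing("chasetest.documents")

for (let x = 0; x < 840; x++) {
    printjson(sh.splitAt("chasetest.documents", { "SHARD_KEY": x }))
}

// Re-enable balancing for the collection once the pre-split is done.
sh.enableBalancing("chasetest.documents")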

Comment by Jordi Serra Torrens [ 31/Mar/23 ]

rajesh.vinayagam@mongodb.com can you share what server version this is on?
