[SERVER-69142] A retry of _shardsvrReshardCollection can cause DuplicateKey error (resharding_nonblocking_coordinator_rebuild.js) Created: 25/Aug/22  Updated: 29/Oct/23  Resolved: 01/Sep/22

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 5.0.13

Type: Bug Priority: Major - P3
Reporter: Abdul Qadeer Assignee: Abdul Qadeer
Resolution: Fixed Votes: 0
Labels: sharding-nyc-subteam1
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Sharding 2022-09-05
Participants:
Linked BF Score: 138
Story Points: 3

 Description   

Per the following logs in the failing test:

[js_test:resharding_nonblocking_coordinator_rebuild] c20779| 2022-08-15T22:34:31.656+00:00 W  CONTROL  23829   [conn1] "Set failpoint","attr":{"failPointName":"reshardingPauseCoordinatorBeforeCompletion","failPoint":{"mode":0,"data":{},"timesEntered":0}}
 
[js_test:resharding_nonblocking_coordinator_rebuild] c20779| 2022-08-15T22:34:33.138+00:00 I  RESHARD  4956902 [ReshardingCoordinatorService-0] "Resharding failed","attr":{"namespace":"reshardingDb.coll","newShardKeyPattern":{"newKey":1},"error":{"code":5356800,"codeName":"Location5356800","errmsg":"Recipient shard shard2-recipient0 reached an unrecoverable error :: caused by :: Resharding completed with non-empty stash collections"}}
..
..
[js_test:resharding_nonblocking_coordinator_rebuild] c20779| 2022-08-15T22:34:34.792+00:00 I  RESHARD  5343001 [ReshardingCoordinatorService-1] "Transitioned resharding coordinator state","attr":{"newState":"initializing","oldState":"unused","namespace":"reshardingDb.coll","collectionUUID":{"uuid":{"$uuid":"98d3baa8-562a-4c71-8156-8cb71202bc58"}},"reshardingUUID":{"uuid":{"$uuid":"afc0b895-e087-436d-a9d8-b141ea128266"}}}
..
..
[js_test:resharding_nonblocking_coordinator_rebuild] d20770| 2022-08-15T22:34:37.057+00:00 I  COMMAND  51803   [conn64] "Slow query","attr":{"type":"command","ns":"admin.$cmd","appName":"MongoDB Shell","command":{"_shardsvrReshardCollection":"reshardingDb.coll","key":{"newKey":1},"_presetReshardedChunks":[{"recipientShardId":"shard2-recipient0","min":{"newKey":{"$minKey":1}},"max":{"newKey":{"$maxKey":1}}}],"writeConcern":{"w":"majority","wtimeout":60000},"databaseVersion":{"uuid":{"$uuid":"76d42834-8a2e-4942-bc32-d520dae1731c"},"timestamp":{"$timestamp":{"t":1660602860,"i":3}},"lastMod":1},"lsid":{"id":{"$uuid":"62f0a515-4f10-4caa-844c-a07714eadb7c"},"uid":{"$binary":{"base64":"u4nTF1+wmByGgmwndZCCo3FgRx9gUEtGEkFRhsYwq3A=","subType":"0"}}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1660602870,"i":25}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"$configTime":{"$timestamp":{"t":1660602870,"i":24}},"$topologyTime":{"$timestamp":{"t":1660602860,"i":1}},"$audit":{"$impersonatedUsers":[{"user":"__system","db":"local"}],"$impersonatedRoles":[]},"$client":{"application":{"name":"MongoDB Shell","pid":"32822"},"driver":{"name":"MongoDB Internal Client","version":"5.0.11-6-ga2642ba"},"os":{"type":"Linux","name":"Ubuntu","architecture":"x86_64","version":"18.04"},"mongos":{"host":"ip-10-128-0-163:20782","client":"10.128.0.163:49632","version":"5.0.11-6-ga2642ba"}},"$configServerState":{"opTime":{"ts":{"$timestamp":{"t":1660602870,"i":24}},"t":-1}},"$db":"admin"},"numYields":0,"ok":0,"errMsg":"Recipient shard shard2-recipient0 reached an unrecoverable error :: caused by :: E11000 duplicate key error collection: reshardingDb.system.resharding.98d3baa8-562a-4c71-8156-8cb71202bc58 index: _id_ dup key: { _id: 0.0 }","errName":"DuplicateKey","errCode":11000,"reslen":682,"locks":{"ParallelBatchWriterMode":{"acquireCount":{"r":1}},"FeatureCompatibilityVersion":{"acquireCount":{"r":1}},"ReplicationStateTransition":{"acquireCount":{"w":1}},"Global":{"acquireCount":{"r":1}},"Database":{"acquireCount":{"r":1}},"Mutex":{"acquireCount":{"r":1}}},"writeConcern":{"w":"majority","wtimeout":60000,"provenance":"clientSupplied"},"remote":"10.128.0.163:44006","protocol":"op_msg","durationMillis":2359}

It can be seen that the first _configsvrReshardCollection command fails as it is supposed to with code 5356800 during oplog application. However a subsequent retry of the same command after the first one finishes causes DuplicateKey (11000) due to the global id constraint violation found in cloning phase for second reshard collection command rather than in oplog application phase. From the following logs, it can be seen that it is the  _shardsvrReshardCollection command that is being retried, unlike BF-26040:

[js_test:resharding_nonblocking_coordinator_rebuild] d20770| 2022-08-15T22:34:30.663+00:00 I  COMMAND  51803   [conn64] "Slow query","attr":{"type":"command","ns":"admin.$cmd","appName":"MongoDB Shell","command":{"_shardsvrReshardCollection":"reshardingDb.coll","key":{"newKey":1},"_presetReshardedChunks":[{"recipientShardId":"shard2-recipient0","min":{"newKey":{"$minKey":1}},"max":{"newKey":{"$maxKey":1}}}],"writeConcern":{"w":"majority","wtimeout":60000},"databaseVersion":{"uuid":{"$uuid":"76d42834-8a2e-4942-bc32-d520dae1731c"},"timestamp":{"$timestamp":{"t":1660602860,"i":3}},"lastMod":1},"lsid":{"id":{"$uuid":"62f0a515-4f10-4caa-844c-a07714eadb7c"},"uid":{"$binary":{"base64":"u4nTF1+wmByGgmwndZCCo3FgRx9gUEtGEkFRhsYwq3A=","subType":"0"}}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1660602862,"i":29}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"$configTime":{"$timestamp":{"t":1660602862,"i":29}},"$topologyTime":{"$timestamp":{"t":1660602860,"i":1}},"$audit":{"$impersonatedUsers":[{"user":"__system","db":"local"}],"$impersonatedRoles":[]},"$client":{"application":{"name":"MongoDB Shell","pid":"32822"},"driver":{"name":"MongoDB Internal Client","version":"5.0.11-6-ga2642ba"},"os":{"type":"Linux","name":"Ubuntu","architecture":"x86_64","version":"18.04"},"mongos":{"host":"ip-10-128-0-163:20782","client":"10.128.0.163:49632","version":"5.0.11-6-ga2642ba"}},"$configServerState":{"opTime":{"ts":{"$timestamp":{"t":1660602862,"i":29}},"t":-1}},"$db":"admin"},"numYields":0,"ok":0,"errMsg":"Not Primary when trying to create a new instance of PrimaryOnlyService ReshardingCoordinatorService","errName":"NotWritablePrimary","errCode":10107,"reslen":574,"locks":{"ParallelBatchWriterMode":{"acquireCount":{"r":1}},"FeatureCompatibilityVersion":{"acquireCount":{"r":1}},"ReplicationStateTransition":{"acquireCount":{"w":1}},"Global":{"acquireCount":{"r":1}},"Database":{"acquireCount":{"r":1}},"Mutex":{"acquireCount":{"r":1}}},"writeConcern":{"w":"majority","wtimeout":60000,"provenance":"clientSupplied"},"remote":"10.128.0.163:44006","protocol":"op_msg","durationMillis":6497}
...
[js_test:resharding_nonblocking_coordinator_rebuild] d20770| 2022-08-15T22:34:37.057+00:00 I  COMMAND  51803   [conn64] "Slow query","attr":{"type":"command","ns":"admin.$cmd","appName":"MongoDB Shell","command":{"_shardsvrReshardCollection":"reshardingDb.coll","key":{"newKey":1},"_presetReshardedChunks":[{"recipientShardId":"shard2-recipient0","min":{"newKey":{"$minKey":1}},"max":{"newKey":{"$maxKey":1}}}],"writeConcern":{"w":"majority","wtimeout":60000},"databaseVersion":{"uuid":{"$uuid":"76d42834-8a2e-4942-bc32-d520dae1731c"},"timestamp":{"$timestamp":{"t":1660602860,"i":3}},"lastMod":1},"lsid":{"id":{"$uuid":"62f0a515-4f10-4caa-844c-a07714eadb7c"},"uid":{"$binary":{"base64":"u4nTF1+wmByGgmwndZCCo3FgRx9gUEtGEkFRhsYwq3A=","subType":"0"}}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1660602870,"i":25}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"$configTime":{"$timestamp":{"t":1660602870,"i":24}},"$topologyTime":{"$timestamp":{"t":1660602860,"i":1}},"$audit":{"$impersonatedUsers":[{"user":"__system","db":"local"}],"$impersonatedRoles":[]},"$client":{"application":{"name":"MongoDB Shell","pid":"32822"},"driver":{"name":"MongoDB Internal Client","version":"5.0.11-6-ga2642ba"},"os":{"type":"Linux","name":"Ubuntu","architecture":"x86_64","version":"18.04"},"mongos":{"host":"ip-10-128-0-163:20782","client":"10.128.0.163:49632","version":"5.0.11-6-ga2642ba"}},"$configServerState":{"opTime":{"ts":{"$timestamp":{"t":1660602870,"i":24}},"t":-1}},"$db":"admin"},"numYields":0,"ok":0,"errMsg":"Recipient shard shard2-recipient0 reached an unrecoverable error :: caused by :: E11000 duplicate key error collection: reshardingDb.system.resharding.98d3baa8-562a-4c71-8156-8cb71202bc58 index: _id_ dup key: { _id: 0.0 }","errName":"DuplicateKey","errCode":11000,"reslen":682,"locks":{"ParallelBatchWriterMode":{"acquireCount":{"r":1}},"FeatureCompatibilityVersion":{"acquireCount":{"r":1}},"ReplicationStateTransition":{"acquireCount":{"w":1}},"Global":{"acquireCount":{"r":1}},"Database":{"acquireCount":{"r":1}},"Mutex":{"acquireCount":{"r":1}}},"writeConcern":{"w":"majority","wtimeout":60000,"provenance":"clientSupplied"},"remote":"10.128.0.163:44006","protocol":"op_msg","durationMillis":2359}

max.hirschhorn@mongodb.com pointed out about changes in how _shardsvrReshardCollection is handled between v5.0 and later branches with SERVER-62064.

The NotWritablePrimary error is being propagated to mongos via ShardingDDLCoordinator_NORESILIENT after exhausting _configsvReshardCollection command retries which in turn leads to a retry of _shardsvrReshardCollection from mongos. Essentially, if the command is retried or sent again to primary shard from mongos then it waits to acquire ScopedDistLock before creating an instance of ReshardCollectionCoordinator. This happens after the previous operation completes fully, not allowing a retry to possibly join the existing operation.

A fix here would be to recommit SERVER-61607 and accept DuplicateKey as a possible error code.

 

 



 Comments   
Comment by Githook User [ 01/Sep/22 ]

Author:

{'name': 'Max Hirschhorn', 'email': 'max.hirschhorn@mongodb.com', 'username': 'visemet'}

Message: SERVER-69142 Add DuplicateKey as acceptable error in resharding test.
Branch: v5.0
https://github.com/mongodb/mongo/commit/82f8c7d7999603f7d8e5e03a49e25b3c4eba244e

Generated at Thu Feb 08 06:12:41 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.