[SERVER-30615] _configsvrShardCollection can retry upsert after getting writeConcern timeout and result in duplicate key error
Created: 11/Aug/17  Updated: 30/Oct/23  Resolved: 14/Sep/17

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.5.11
Fix Version/s: 3.6.0-rc0

Type: Bug
Priority: Major - P3
Reporter: Randolph Tan
Assignee: Dianna Hohensee (Inactive)
Resolution: Fixed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related: is related to SERVER-30559 Sharding tests which run under contin... (Closed)
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Sharding 2017-09-11, Sharding 2017-10-02
Participants:
Linked BF Score: 0

Comments
Comment by Ramon Fernandez Marina [ 14/Sep/17 ]

Author: Dianna Hohensee (DiannaHohensee) <dianna.hohensee@10gen.com>
Message: SERVER-30615 Fix duplicate key error in shardCollection by using local read concern instead of majority
Branch: master
https://github.com/mongodb/mongo/commit/e1cbf344efcaa72453f96f27a770919ad93b0b4f

Comment by Randolph Tan [ 08/Sep/17 ]

Per offline discussion, the direction we currently want to go is to make these reads use readConcern local.

Comment by Randolph Tan [ 01/Sep/17 ]

I think I have found the problem. Sequence of events:

1. _configsvrShardCollection begins.
2. The command queries config.databases with readConcern majority (ShardingCatalogClientImpl::getDatabase -> ShardingCatalogClientImpl::_exhaustiveFindOnConfig).
3. For ShardLocal::_exhaustiveFindOnConfig, this effectively calls opCtx->recoveryUnit()->setReadFromMajorityCommittedSnapshot();
4. The command tries to upsert the new config.collections document and succeeds.
5. The command times out waiting for write concern. Shard::runBatchWriteCommand has built-in retry, so it retries the command.
6. The retry triggers the duplicate key error: the fetcher did not find the document, so the update stage attempted to insert it again.

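The problem here is that the updates in #4 and #6 were both using the majority committed snapshot view, which was set by the earlier read in #3. Since the setting is held by the operationContext, and the same operationContext is used through the entire command, the updates end up using the same setting unintentionally. The document written in #4 is not yet visible in that snapshot when #6 runs, so the retry does not find it and tries to insert it a second time.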
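
The sequence above can be sketched as a small toy model. This is not MongoDB code: ToyCollection, ToyOpCtx, set_read_concern, upsert and run_shard_collection are all hypothetical names made up for illustration, and the model assumes that the majority committed snapshot simply does not advance between the first upsert and the retry. It only shows how a sticky "read from majority snapshot" setting on a shared operation context turns a retried upsert into a duplicate key error, and why reading with local avoids it.

```python
class DuplicateKeyError(Exception):
    """Raised when an insert collides with an existing _id (toy unique index)."""


class ToyCollection:
    """Hypothetical stand-in for config.collections on one node."""

    def __init__(self):
        self.local = {}     # latest data written on this node
        self.majority = {}  # majority committed snapshot; lags behind local


class ToyOpCtx:
    """Hypothetical stand-in for the operation context shared by the command."""

    def __init__(self):
        self.read_from_majority_snapshot = False


def set_read_concern(op_ctx, read_concern):
    # Steps 2-3: a majority read flips a setting that stays on the context.
    if read_concern == "majority":
        op_ctx.read_from_majority_snapshot = True


def upsert(coll, op_ctx, _id, doc):
    # The "fetch" half of the upsert reads through whatever view the context
    # selects; the write half always goes to the latest (local) data.
    view = coll.majority if op_ctx.read_from_majority_snapshot else coll.local
    if _id in view:
        coll.local[_id] = doc
        return "updated"
    if _id in coll.local:
        # The fetch missed the document, but the unique index still has it.
        raise DuplicateKeyError(_id)
    coll.local[_id] = doc
    return "inserted"


def run_shard_collection(read_concern):
    coll = ToyCollection()
    op_ctx = ToyOpCtx()
    set_read_concern(op_ctx, read_concern)               # steps 2-3
    results = [upsert(coll, op_ctx, "db.coll", {"v": 1})]  # step 4
    # Step 5: write concern wait times out (assumed: coll.majority is not
    # advanced), so the same upsert is retried on the same context.
    try:
        results.append(upsert(coll, op_ctx, "db.coll", {"v": 1}))  # step 6
    except DuplicateKeyError:
        results.append("duplicate key")
    return results


print(run_shard_collection("majority"))  # ['inserted', 'duplicate key']
print(run_shard_collection("local"))     # ['inserted', 'updated']
```

In this toy model, the retry under readConcern local sees the document from the first attempt and becomes a plain update, which matches the direction of the fix described in the later comments.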
