[SERVER-53550] Resharding coordinator doesn't handle possibility of QueryPlanKilled error when triggering refresh Created: 02/Jan/21  Updated: 29/Oct/23  Resolved: 07/Jan/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 4.9.0

Type: Bug Priority: Major - P3
Reporter: Max Hirschhorn Assignee: Max Hirschhorn
Resolution: Fixed Votes: 0
Labels: PM-234-M2, PM-234-T-lifecycle
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-53267 Add additional resharding fuzzer task... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Sharding 2021-01-11
Participants:
Linked BF Score: 19
Story Points: 1

 Description   

Hit this issue regularly on Windows DEBUG when trying to enable the resharding_fuzzer_shuffle in SERVER-53267. The config.cache.chunks.test_reshard.reshard_coll collection is being dropped by the donor and recipient shards as a result of the collection epoch for test_reshard.reshard_coll being changed. The _flushRoutingTableCacheUpdatesWithWriteConcern command sent by the resharding coordinator can end up failing with a QueryPlanKilled error response if the shard was already in the process of dropping the config.cache.chunks.test_reshard.reshard_coll collection due to a refresh having been triggered already by something else.

[js_test:resharding_fuzzer-505f-1609523439870-1] 2021-01-01T17:52:12.712+0000 d20026| 2021-01-01T17:52:12.713+00:00 I  COMMAND  518070  [ShardServerCatalogCacheLoader::runCollAndChunksTasks] "CMD: drop","attr":{"namespace":"config.cache.chunks.test_reshard.reshard_coll"}
...
[js_test:resharding_fuzzer-505f-1609523439870-1] 2021-01-01T17:52:12.730+0000 d20026| 2021-01-01T17:52:12.730+00:00 I  STORAGE  20314   [ShardServerCatalogCacheLoader::runCollAndChunksTasks] "dropCollection: storage engine will take ownership of drop-pending collection","attr":{"namespace":"config.cache.chunks.test_reshard.reshard_coll","uuid":{"uuid":{"$uuid":"aee9e556-c159-472d-88ac-be15f1f66266"}},"dropOpTime":{"ts":{"$timestamp":{"t":0,"i":0}},"t":-1},"commitTimestamp":{"$timestamp":{"t":0,"i":0}}}
...
[js_test:resharding_fuzzer-505f-1609523439870-1] 2021-01-01T17:52:12.731+0000 d20026| 2021-01-01T17:52:12.732+00:00 W  QUERY    23798   [ShardServerCatalogCacheLoader::getChunksSince] "Plan executor error during find command","attr":{"error":{"code":175,"codeName":"QueryPlanKilled","errmsg":"collection dropped. UUID aee9e556-c159-472d-88ac-be15f1f66266"},"stats":{"stage":"FETCH","nReturned":1,"works":1,"advanced":1,"needTime":0,"needYield":0,"saveState":1,"restoreState":1,"isEOF":0,"docsExamined":1,"alreadyHasObj":0,"inputStage":{"stage":"IXSCAN","nReturned":1,"works":1,"advanced":1,"needTime":0,"needYield":0,"saveState":1,"restoreState":1,"isEOF":0,"keyPattern":{"lastmod":1},"indexName":"lastmod_1","isMultiKey":false,"multiKeyPaths":{"lastmod":[]},"isUnique":false,"isSparse":false,"isPartial":false,"indexVersion":2,"direction":"forward","indexBounds":{"lastmod":["[Timestamp(0, 0), Timestamp(4294967295, 4294967295)]"]},"keysExamined":1,"seeks":1,"dupsTested":0,"dupsDropped":0}},"cmd":{"find":"cache.chunks.test_reshard.reshard_coll","filter":{"lastmod":{"$gte":{"$timestamp":{"t":0,"i":0}}}},"sort":{"lastmod":1},"readConcern":{},"$db":"config"}}
[js_test:resharding_fuzzer-505f-1609523439870-1] 2021-01-01T17:52:12.732+0000 d20026| 2021-01-01T17:52:12.732+00:00 I  STORAGE  20320   [ShardServerCatalogCacheLoader::runCollAndChunksTasks] "createCollection","attr":{"namespace":"config.cache.chunks.test_reshard.reshard_coll","uuidDisposition":"generated","uuid":{"uuid":{"$uuid":"55eb32c1-524e-4e3c-ab90-d17111d05350"}},"options":{}}
[js_test:resharding_fuzzer-505f-1609523439870-1] 2021-01-01T17:52:12.733+0000 d20026| 2021-01-01T17:52:12.734+00:00 I  SH_REFR  4619903 [CatalogCache-0] "Error refreshing cached collection","attr":{"namespace":"test_reshard.reshard_coll","durationMillis":51,"error":"QueryPlanKilled: Failed to load local metadata. :: caused by :: Executor error during find command :: caused by :: collection dropped. UUID aee9e556-c159-472d-88ac-be15f1f66266"}
...
[js_test:resharding_fuzzer-505f-1609523439870-1] 2021-01-01T17:52:12.752+0000 c20032| 2021-01-01T17:52:12.753+00:00 I  COMMAND  4956902 [ReshardingCoordinatorService-1] "Resharding failed","attr":{"namespace":"test_reshard.reshard_coll","newShardKeyPattern":{"recipient":1,"slot":1},"error":{"code":175,"codeName":"QueryPlanKilled","errmsg":"Unable to _flushRoutingTableCacheUpdatesWithWriteConcern for namespace test_reshard.reshard_coll on shard3-recipient0 :: caused by :: Failed to load local metadata. :: caused by :: Executor error during find command :: caused by :: collection dropped. UUID aee9e556-c159-472d-88ac-be15f1f66266"}}
...
[js_test:resharding_fuzzer-505f-1609523439870-1] 2021-01-01T17:52:12.835+0000 	"errmsg" : "Expected to match 1 docs, but only matched 0 for write request { update: \"collections\", bypassDocumentValidation: false, ordered: true, updates: [ { q: { _id: \"test_reshard.system.resharding.e02ba0ab-8e3d-4763-9850-af410777e804\" }, u: { $set: { reshardingFields.state: \"error\", lastmod: new Date(1609523532796) } }, multi: false, upsert: false } ] }",
[js_test:resharding_fuzzer-505f-1609523439870-1] 2021-01-01T17:52:12.835+0000 	"code" : 5030401,

https://logkeeper.mongodb.org/lobster/build/32e6a99e0e300f46dfc9d2b2cc0076c2/test/5fef612cbe07c43a2f26aae7#bookmarks=0%2C20258%2C25353%2C25366%2C25371%2C25372%2C25373%2C25384%2C25434%2C25435%2C30754&l=1



 Comments   
Comment by Githook User [ 07/Jan/21 ]

Author:

{'name': 'Max Hirschhorn', 'email': 'max.hirschhorn@mongodb.com', 'username': 'visemet'}

Message: SERVER-53550 Retry on QueryPlanKilled errors when triggering refresh.

Introduces a new kIdempotentOrCursorInvalidated retry policy.
Branch: master
https://github.com/mongodb/mongo/commit/4ecb403e84f4d9a746d838df60f6aeabef914ed0

Generated at Thu Feb 08 05:31:14 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.