[SERVER-74195] Transaction failed after version upgrade Created: 20/Feb/23  Updated: 27/Oct/23  Resolved: 23/Feb/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Asel Magzhanova Assignee: Tommaso Tocci
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-46679 Transaction test fails with Stale Sha... Closed
Assigned Teams:
Sharding EMEA
Operating System: ALL
Participants:

 Description   

We have a sharded cluster with a single shard. After we upgraded from version 4.4 to 5, the application received the following error:

Command failed with error 13388 (StaleConfig): 'Transaction d5d7e6d6-3a64-48a1-bb4b-89742bc77ff8:5 was aborted on statement 7 due to: an error from cluster data placement change :: caused by :: findAndModify :: caused by :: sharding status of collection list-ks.orgnumbers is not currently available for description and needs to be recovered from the config server' on server sel-mongo01.imp.dks.lanit.ru:27017. The full response is {"ok": 0.0, "errmsg": "Transaction d5d7e6d6-3a64-48a1-bb4b-89742bc77ff8:5 was aborted on statement 7 due to: an error from cluster data placement change :: caused by :: findAndModify :: caused by :: sharding status of collection list-ks.orgnumbers is not currently available for description and needs to be recovered from the config server", "code": 13388, "codeName": "StaleConfig", "ns": "list-ks.orgnumbers", "vReceived": {"$timestamp": {"t": 0, "i": 0}}, "vReceivedEpoch": {"$oid": "000000000000000000000000"}, "shardId": "rs0", "$clusterTime": {"clusterTime": {"$timestamp": {"t": 1676462531, "i": 2}}, "signature": {"hash": {"$binary": {"base64": "TdIK36bailsmfPg8EbZ8Jh6TEM8=", "subType": "00"}}, "keyId": 7155588327440121859}}, "operationTime": {"$timestamp": {"t": 1676462531, "i": 2}}, "errorLabels": ["TransientTransactionError"]} 

The logs contained only the following entries:

{"t":{"$date":"2023-02-15T12:02:11.217+00:00"},"s":"I",  "c":"SH_REFR",  "id":4619902, "ctx":"CatalogCache-3713","msg":"Collection has found to be unsharded after refresh","attr":{"namespace":"list-ks.orgnumbers","durationMillis":2}}
{"t":{"$date":"2023-02-15T12:02:11.217+00:00"},"s":"I",  "c":"SHARDING", "id":21917,   "ctx":"RecoverRefreshThread","msg":"Marking collection as unsharded","attr":{"namespace":"list-ks.orgnumbers"}}
{"t":{"$date":"2023-02-15T12:02:11.217+00:00"},"s":"I",  "c":"COMMAND",  "id":518070,  "ctx":"ShardServerCatalogCacheLoader::runCollAndChunksTasks","msg":"CMD: drop","attr":{"namespace":"config.cache.chunks.list-ks.orgnumbers"}}

Could the version upgrade be the cause of the error? If not, what else could cause it?



 Comments   
Comment by Marcos José Grillo Ramirez [ 28/Feb/23 ]

Hi asik_asek@list.ru,

As pointed out, and as described in our official documentation, transaction statements should be called inside a retry loop in order to handle TransientTransactionError errors. Drivers usually provide a withTransaction() helper function that automatically handles the retry attempts.

In SERVER-46679 we added the ability to automatically retry the first statement of a transaction, but this might not be enough for multi-statement transactions, so we strongly encourage you to use your driver's helper function to prevent this from happening again.
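
For illustration, the retry loop recommended above could look roughly like the sketch below. It is a minimal, driver-independent Python example, not the driver's implementation: `TransientError` and `run_with_retry` are hypothetical names, and the exception class only mimics a driver error that exposes its error labels. The only detail taken from this ticket is the "TransientTransactionError" label. In a real application the driver's own helper (for example `with_transaction` on a PyMongo session) should be preferred.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for a driver exception that carries error labels."""

    def __init__(self, message, labels=()):
        super().__init__(message)
        self._labels = set(labels)

    def has_error_label(self, label):
        return label in self._labels


def run_with_retry(txn_body, max_attempts=5, backoff_seconds=0.0):
    """Run txn_body() and restart it from the first statement whenever it
    fails with an error labeled TransientTransactionError."""
    for attempt in range(1, max_attempts + 1):
        try:
            # txn_body is assumed to start, run, and commit one whole transaction.
            return txn_body()
        except TransientError as exc:
            transient = exc.has_error_label("TransientTransactionError")
            if not transient or attempt == max_attempts:
                raise  # not retryable, or attempts exhausted
            time.sleep(backoff_seconds * attempt)  # simple linear backoff


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_transaction():
        # Simulates a transaction that is aborted twice (as with StaleConfig)
        # and then succeeds.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise TransientError("StaleConfig", labels=["TransientTransactionError"])
        return "committed"

    print(run_with_retry(flaky_transaction), "after", attempts["count"], "attempts")
```

The key point is that the whole transaction body is re-run, not just the failed statement, because an abort on statement 7 (as in the error above) discards the earlier statements as well.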

Comment by Yuan Fang [ 22/Feb/23 ]

Hi asik_asek@list.ru,

Thank you for bringing this to our attention. I've passed this along to the team for investigation. Please stay tuned for further updates.

Regards,
Yuan
