[SERVER-38628] Investigate failures in indexed_insert workloads in sharded transaction concurrency suites Created: 13/Dec/18  Updated: 29/Oct/23  Resolved: 19/Dec/18

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 4.1.7

Type: Task Priority: Major - P3
Reporter: Jack Mulrow Assignee: Jack Mulrow
Resolution: Fixed Votes: 0
Labels: ShardedTxn:Testing
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Backwards Compatibility: Fully Compatible
Sprint: Sharding 2018-12-17, Sharding 2018-12-31
Participants:

 Description   

The FSM workload indexed_insert_multikey.js and other indexed_insert.* workloads began to fail in the concurrency_sharded_replication_multi_stmt_txn(_with_balancer) suites, which were added as part of SERVER-38026. The failures started at some point after commit e0932edfcb, on which an Evergreen patch build succeeded. The purpose of this ticket is to determine the root cause of the failures and, once it is known, to update the blacklists in concurrency_sharded_replication_multi_stmt_txn(_with_balancer).



 Comments   
Comment by Githook User [ 19/Dec/18 ]

Author: Jack Mulrow (jsmulrow) <jack.mulrow@mongodb.com>

Message: SERVER-38628 Concurrency txn passthroughs shouldn't modify thread local state until after commit
Branch: master
https://github.com/mongodb/mongo/commit/14e2a9697475191807f8ece0843876109ec25d5d

Comment by Jack Mulrow [ 19/Dec/18 ]

After some investigation, the problem seems to be in the concurrency framework, not the server. The concurrency multi-statement transaction suites run existing FSM state functions inside transactions, using withTxnAndAutoRetry to retry the entire function on a transient error. To avoid prematurely modifying thread-local state, the state function runs against a copy of the thread's data object (bound to the state function as this), and the copy is swapped with the real data after the function completes. The bug is that the swap happens after the state function finishes but before the transaction it ran inside commits. If the commit fails with a transient error, the entire state function is retried, but the data has already been modified, which can lead to failures in workloads that assert on its values, like the indexed_insert*.js workloads.
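
The ordering problem described above can be illustrated with a minimal, self-contained sketch. This is not the concurrency framework's actual code: withTxnAndRetrySketch, TransientTxnError, insertState, runBuggy, runFixed and the nInserted field are all hypothetical names invented for this illustration, and the "transaction" is simulated by a counter of commit failures rather than by a real server.

```javascript
// Hypothetical stand-in for a transient commit error; not the server's real error type.
class TransientTxnError extends Error {}

// Simulated transaction runner (an assumption, not the real withTxnAndAutoRetry):
// runs body(), then "commits". The first `failures` commits throw a transient
// error, which causes the whole body to be retried. onCommit runs only after
// a successful commit.
function withTxnAndRetrySketch(body, failures, onCommit = () => {}) {
    let remaining = failures;
    for (;;) {
        try {
            body();
            if (remaining > 0) {
                remaining--;
                throw new TransientTxnError("simulated transient commit failure");
            }
            onCommit();
            return;
        } catch (e) {
            if (!(e instanceof TransientTxnError)) {
                throw e;
            }
            // Transient error: loop around and retry the entire body.
        }
    }
}

// A state function that mutates thread-local data through `this`.
function insertState() {
    this.nInserted += 1;
}

// Buggy ordering: the copy is swapped into the real data before the commit.
function runBuggy(data, failures) {
    withTxnAndRetrySketch(() => {
        const copy = Object.assign({}, data);
        insertState.call(copy);
        Object.assign(data, copy);  // swap before the commit is known to succeed
    }, failures);
    return data;
}

// Fixed ordering: the copy is swapped in only after the commit succeeds.
function runFixed(data, failures) {
    let copy;
    withTxnAndRetrySketch(() => {
        copy = Object.assign({}, data);  // every attempt starts from untouched data
        insertState.call(copy);
    }, failures, () => Object.assign(data, copy));
    return data;
}

const buggy = runBuggy({nInserted: 0}, 1);
const fixed = runFixed({nInserted: 0}, 1);
console.log(buggy.nInserted, fixed.nInserted);  // prints "2 1"
```

With one simulated transient commit failure, the buggy ordering counts the insert twice because the retry starts from already-modified data, while the fixed ordering counts it once. A workload that asserts on such a counter would fail only under the buggy ordering.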

Between the commit of the successful Evergreen run in the description and master, SERVER-37884 was committed, which significantly slowed down two-phase commit and led to more transient failures when committing transactions, exposing this bug.
