[SERVER-49019] refineShardKey can hit getLastError invariant during stepdown Created: 22/Jun/20  Updated: 29/Oct/23  Resolved: 24/Aug/20

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 4.4.0-rc10
Fix Version/s: 4.7.0, 4.4.2

Type: Bug
Priority: Major - P3
Reporter: Randolph Tan
Assignee: Janna Golden
Resolution: Fixed
Votes: 0
Labels: sharding-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v4.4
Sprint: Sharding 2020-09-07
Participants:
Linked BF Score: 10

 Description   

This invariant

Sequence:

1. refineShardKey has finished all updates and is about to run commitTransaction.
2. A stepDown occurs, and the stepDown thread enqueues a request for the RSTL in MODE_X.
3. commitTransaction tries to acquire the RSTL inside beginOrContinue and throws a LockTimeout error.
4. The commitTransaction thread calls waitForWriteConcern inside the catch block and encounters a NotMaster error.
5. The server tries to create the response object and hits the invariant in getErrorLabels, because LockTimeout is a transient error and the NotMaster writeConcern error is retryable.
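
The conflict in step 5 can be illustrated with a small toy model. This is only a sketch and not the server's code: it assumes the invariant rejects a response that would qualify for both the TransientTransactionError and the RetryableWriteError label, and the function name, the error-code sets, and the InvariantFailure class are all hypothetical.

```python
from typing import List, Optional

# Hypothetical groupings for illustration; the real server defines its own
# error categories.
TRANSIENT_TXN_CODES = {"LockTimeout"}
RETRYABLE_CODES = {"NotMaster"}


class InvariantFailure(Exception):
    """Stands in for a server invariant, which would abort the process."""


def get_error_labels(command_error: Optional[str],
                     write_concern_error: Optional[str]) -> List[str]:
    """Toy label computation: the two labels are assumed mutually exclusive."""
    is_transient = command_error in TRANSIENT_TXN_CODES
    is_retryable = write_concern_error in RETRYABLE_CODES

    # Assumed shape of the invariant: a response must not carry both labels.
    if is_transient and is_retryable:
        raise InvariantFailure(
            f"{command_error} is transient and {write_concern_error} is retryable")

    labels = []
    if is_transient:
        labels.append("TransientTransactionError")
    if is_retryable:
        labels.append("RetryableWriteError")
    return labels
```

In this model, a LockTimeout command error alone or a NotMaster writeConcern error alone produces a single label, while the combination from steps 3 and 4 raises InvariantFailure.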

Note: this issue does not affect a normal transaction because in a normal commitTransaction the opCtx is fresh and the lock timeout has not been set yet (it is set only when the TxnResource is unstashed, which happens after beginOrContinue). The refineShardKey command, by contrast, uses the same AlternativeClientRegion opCtx for all the previous writes in the transaction and for the commit command, so the opCtx still has the 5ms timeout set by the previous writes.
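
The difference between a reused and a fresh opCtx can be sketched as follows. This is a simplified model, not the server's implementation: the OpCtx class, its max_lock_timeout_ms field, and both helper functions are hypothetical, and the model does not capture what happens to the commit after it waits for the lock.

```python
from typing import Optional


class OpCtx:
    """Toy operation context that tracks only a maximum lock-wait time."""

    def __init__(self) -> None:
        # None means no lock timeout has been set on this context.
        self.max_lock_timeout_ms: Optional[int] = None


def run_transaction_write(op_ctx: OpCtx) -> None:
    """Model a transaction write: it leaves the 5ms lock timeout on the opCtx."""
    op_ctx.max_lock_timeout_ms = 5


def try_acquire_rstl(op_ctx: OpCtx, blocked_for_ms: int) -> str:
    """Model acquiring the RSTL while a stepDown holds it for blocked_for_ms."""
    timeout = op_ctx.max_lock_timeout_ms
    if timeout is not None and blocked_for_ms > timeout:
        return "LockTimeout"
    return "waited"


# Reused opCtx, as in refineShardKey before the fix: the timeout carries over.
reused = OpCtx()
run_transaction_write(reused)
reused_result = try_acquire_rstl(reused, blocked_for_ms=50)

# Fresh opCtx, as in a normal commitTransaction: no timeout is set yet.
fresh = OpCtx()
fresh_result = try_acquire_rstl(fresh, blocked_for_ms=50)
```

Under these assumptions, the reused context yields "LockTimeout" and the fresh one does not, which matches the approach named in the commit message: run commitTransaction on a fresh opCtx.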



 Comments   
Comment by Githook User [ 08/Sep/20 ]

Author:

{'name': 'jannaerin', 'email': 'golden.janna@gmail.com', 'username': 'jannaerin'}

Message: SERVER-49019 Run commitTransaction for refineShardKey on fresh opCtx

(cherry picked from commit 7c8935c12a4d6b6ae6af9a570870450475f8c3e9)
Branch: v4.4
https://github.com/mongodb/mongo/commit/24434044778d2d2ff858c50d0de81c51ed88f4c4

Comment by Githook User [ 21/Aug/20 ]

Author:

{'name': 'jannaerin', 'email': 'golden.janna@gmail.com', 'username': 'jannaerin'}

Message: SERVER-49019 Run commitTransaction for refineShardKey on fresh opCtx
Branch: master
https://github.com/mongodb/mongo/commit/7c8935c12a4d6b6ae6af9a570870450475f8c3e9

Comment by Randolph Tan [ 22/Jun/20 ]

Attached a simple test and diff to show where to put sleeps to get more consistent reproduction of this issue.
