[SERVER-39139] Remove testing support for secondary transactions Created: 22/Jan/19  Updated: 29/Oct/23  Resolved: 06/Mar/19

Status: Closed
Project: Core Server
Component/s: Replication, Testing Infrastructure
Affects Version/s: None
Fix Version/s: 4.1.9

Type: Task Priority: Major - P3
Reporter: Judah Schvimer Assignee: Vesselina Ratcheva (Inactive)
Resolution: Fixed Votes: 0
Labels: prepare_durability
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-38876 Ensure secondary user operations cann... Closed
Duplicate
is duplicated by SERVER-38660 fuzzer can cause tests to timeout whe... Closed
Related
related to SERVER-38660 fuzzer can cause tests to timeout whe... Closed
related to SERVER-39372 Make secondary lock acquisition for D... Closed
related to SERVER-39169 Add testing-only support for doing sn... Closed
related to SERVER-37944 Update causally consistent txn passth... Closed
related to SERVER-39321 Re-enable the CheckReplDBHashInBackgr... Closed
is related to SERVER-39096 Prepared transactions and DDL operati... Closed
Backwards Compatibility: Fully Compatible
Sprint: Repl 2019-02-11, Repl 2019-02-25, Repl 2019-03-11
Participants:
Linked BF Score: 18

 Description   

We currently support secondary transactions when test commands are enabled for background "dbhash". This is not going to work due to SERVER-39096 and is a constant tax to try to keep working in the face of prepare without a wholistic design.



 Comments   
Comment by Githook User [ 06/Mar/19 ]

Author:

{'name': 'Vesselina Ratcheva', 'username': 'vessy-mongodb', 'email': 'vesselina.ratcheva@10gen.com'}

Message: SERVER-39139 Disallow starting transactions on secondaries

This reverts commit a74b2f39025fee2c59aa5437deea6e06f05e18ca.
Branch: master
https://github.com/mongodb/mongo/commit/d989f45b6f9a802e3d4fe5c4cb34b5ddc28ed4eb

Comment by Githook User [ 06/Mar/19 ]

Author:

{'name': 'Vesselina Ratcheva', 'email': 'vesselina.ratcheva@10gen.com', 'username': 'vessy-mongodb'}

Message: SERVER-39139 Make TransactionParticipant acquire RSTL interruptibly
Branch: master
https://github.com/mongodb/mongo/commit/e50a4cb38a22e95c33bc1e803bb05be11e8643cb

Comment by Githook User [ 06/Mar/19 ]

Author:

{'name': 'Vesselina Ratcheva', 'email': 'vesselina.ratcheva@10gen.com', 'username': 'vessy-mongodb'}

Message: SERVER-39139 Make lock mode for ReplicationStateTransitionLockGuard configurable
Branch: master
https://github.com/mongodb/mongo/commit/1fe923eec6c301f554ec5a5cfb9a40ebf4318e13

Comment by Githook User [ 28/Feb/19 ]

Author:

{'name': 'Vesselina Ratcheva', 'username': 'vessy-mongodb', 'email': 'vesselina.ratcheva@10gen.com'}

Message: Revert "SERVER-39139 Disallow starting transactions on secondaries"

This reverts commit 58fad7d7efa275beb4a6a83f90d3dd222bbb534b.
Branch: master
https://github.com/mongodb/mongo/commit/a74b2f39025fee2c59aa5437deea6e06f05e18ca

Comment by Githook User [ 26/Feb/19 ]

Author:

{'name': 'Vesselina Ratcheva', 'username': 'vessy-mongodb', 'email': 'vesselina.ratcheva@10gen.com'}

Message: SERVER-39139 Disallow starting transactions on secondaries
Branch: master
https://github.com/mongodb/mongo/commit/58fad7d7efa275beb4a6a83f90d3dd222bbb534b

Comment by Githook User [ 11/Feb/19 ]

Author:

{'name': 'Vesselina Ratcheva', 'email': 'vesselina.ratcheva@10gen.com', 'username': 'vessy-mongodb'}

Message: SERVER-39139 Make canAcceptWritesFor, canAcceptWritesForDatabase and canServeReadsFor check RSTL
Branch: master
https://github.com/mongodb/mongo/commit/302a4f91a54a77221f7408b95fcbb988a9366d03

Comment by Githook User [ 06/Feb/19 ]

Author:

{'name': 'Vesselina Ratcheva', 'email': 'vesselina.ratcheva@10gen.com', 'username': 'vessy-mongodb'}

Message: SERVER-39139 Disallow commitTransaction and abortTransaction commands on secondaries
Branch: master
https://github.com/mongodb/mongo/commit/133fa7293bc729ff95cacd9384ad49c928e8e0a8

Comment by Githook User [ 04/Feb/19 ]

Author:

{'name': 'Vesselina Ratcheva', 'email': 'vesselina.ratcheva@10gen.com', 'username': 'vessy-mongodb'}

Message: SERVER-39139 Remove usages of the CheckReplDBHashInBackground hook
Branch: master
https://github.com/mongodb/mongo/commit/939ecfe7a93365145bb876bcab6584501e407fdb

Comment by Esha Maharishi (Inactive) [ 31/Jan/19 ]

Just to be clear, I would like to re-state that implementing the ban this way is just to future-proof the code:

I believe this wouldn't happen today, because secondary oplog application checks out the session and puts the transaction into prepare before checking the session back in. So, by the time the user request is able to call beginOrContinue on the transaction, the transaction should no longer be abortable.

Comment by Esha Maharishi (Inactive) [ 31/Jan/19 ]

judah.schvimer, ah, I did not know about that check for canAcceptWritesForDatabase in abortActiveTransaction. As far as I can tell, that should prevent the race for a direct abortTransaction command.

Perhaps Vessy can implement the ban under this ticket by doing a similar check in beginOrContinue, and I can write tests for both the higher transaction number and direct abortTransaction command under SERVER-38876?

Comment by Judah Schvimer [ 31/Jan/19 ]

I think beginOrContinue acquiring the RSTL in mode IX before aborting the transaction makes sense. The lock acquisition will be an interrupt point already. With the RSTL held we need to check canAcceptWritesForDatabase(opCtx, "admin") to see if we're still a primary.

For abortTransaction, we already check if we're primary here. esha.maharishi, does that not account for the race you're concerned about for the abortTransaction command?

Comment by Siyuan Zhou [ 31/Jan/19 ]

Talked with esha.maharishi offline. We agree that an abortTransaction on prepared transaction on secondary is also possible. Aborting an unprepared or prepared transaction by either a higher txnNumber or abortTransaction command are faced with the same race mentioned by Esha. One possible solution is to have beginOrContinue hold in RSTL IX lock and check for interrupt.

Esha also mentioned that it's possible for a participant primary to get an abortTransaction from the user even though the coordinator is about to commit the transaction. This will corrupt the data, but that's intentional since we want to give users a way to fix the transactions when the coordinator is down.

Comment by Siyuan Zhou [ 31/Jan/19 ]

esha.maharishi, does your argument apply to prepared transactions on secondary? As you said, the window where a transaction is in-progress is very short and protected by the session checkout, but aborting a prepared transaction by a user request on secondary also seems possible.

Comment by Esha Maharishi (Inactive) [ 31/Jan/19 ]

One thing to consider for #1 is that commandCanRunHere is checked here in ServiceEntryPointCommon::execCommandDatabase, but the node may transition to secondary in between that and calling beginOrContinue on the TransactionParticipant.

From my understanding of SERVER-38876, this would still leave open the possibility for a user request with a higher transaction number to abort a transaction that is in progress in secondary oplog application (because _beginOrContinueRetryableWrite and _beginMultiDocumentTransaction call _setNewTxnNumber). This is based on the assumption that aborting the older transaction would not fail even though the opCtx had been interrupted due to stepdown. The user request would, of course, fail to later do any work that required acquiring locks.

I believe this wouldn't happen today, because secondary oplog application checks out the session and puts the transaction into prepare before checking the session back in. So, by the time the user request is able to call beginOrContinue on the transaction, the transaction should no longer be abortable.

However, to future-proof the code, another option for implementing the ban could be to hold the global IX lock in TransactionParticipant::beginOrContinue across checking whether the node is primary and calling _setNewTxnNumber. (If the node is not primary, beginOrContinue should uassert.)

Since implementing the ban this way would address both issues, it may be worth doing just what I suggested instead of doing both.

Comment by Vesselina Ratcheva (Inactive) [ 30/Jan/19 ]

tess.avitabile that's a good point. Thanks!

max.hirschhorn that sounds very reasonable to me. I should be able to remove the usages in a separate commit, which I hope would make it easier to put them back in in the future.

Comment by Max Hirschhorn [ 30/Jan/19 ]

2) Disable background dbhash testing altogether, but without really touching the code for it so as to not collide with the work for SERVER-39169. I’m divided between removing the usages from the yml files (which I find clean), or having the hook just return early as a no-op (which could better highlight the idea that it is not currently being used).

I'd vote for removing the CheckReplDBHashInBackground hook from the YAML suite files for the time being. The background thread continuously spawns a mongo shell process to run run_check_repl_dbhash_background.js, i.e. it spawns a new one as soon as the earlier one exits. Putting an early return in will cause that happen more frequently than it already does, which feels wasteful and likely chattier for our logs.

Comment by Tess Avitabile (Inactive) [ 30/Jan/19 ]

Looks great! I think we may not need to do (3), since transactions against secondaries will fail after (1).

Comment by Vesselina Ratcheva (Inactive) [ 30/Jan/19 ]

I have so far identified several work items:

1) Disallow secondary transactions on the server side even when test commands are enabled. I intend to leverage the code here and remove the exception for test commands. I will also update the AllowedOnSecondary rules for commit and abort here.
2) Disable background dbhash testing altogether, but without really touching the code for it so as to not collide with the work for SERVER-39169. I’m divided between removing the usages from the yml files (which I find clean), or having the hook just return early as a no-op (which could better highlight the idea that it is not currently being used).
3) Ban running transactions against secondaries in the fuzzer. I’m happy to file a separate TIG ticket for this, as per Max’s request in the comments for SERVER-38660.
4) Also take care of any incidental tests that perform transactions on secondaries. I hope that the changes on #1 will help identify the suspects, but I can spend a bit of time surveying for those too.

Feel free to chime in if you feel there's anything missing here.

Comment by Judah Schvimer [ 23/Jan/19 ]

This ticket should first disable background dbhash testing. Background dbhash code shouldn't be completely removed, since it will be replaced with a separate snapshot read mechanism, and we'll leave it to the STM team to remove the code they want removed.

Generated at Thu Feb 08 04:51:09 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.