[SERVER-36498] Remove 'config.transactions' entries with 'state' field on shutdown in fCV 4.0 Created: 07/Aug/18  Updated: 29/Oct/23  Resolved: 20/Feb/19

Status: Closed
Project: Core Server
Component/s: Replication, Upgrade/Downgrade
Affects Version/s: None
Fix Version/s: 4.1.9

Type: Task Priority: Major - P3
Reporter: Judah Schvimer Assignee: Jack Mulrow
Resolution: Fixed Votes: 0
Labels: prepare_upgrade_downgrade
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Documented
is documented by DOCS-12503 Docs for SERVER-36498: Remove 'config... Closed
Related
related to SERVER-79269 Invariant that we don't check FCV in ... Open
related to SERVER-35882 Wait for all prepared transactions to... Closed
related to SERVER-36643 Figure out mixed version replication ... Closed
related to SERVER-39691 Remove FCV checks around the state fi... Closed
related to SERVER-39950 Merge two oplog batching code paths Closed
related to SERVER-40013 upgrade downgrade support for config.... Closed
Backwards Compatibility: Fully Compatible
Sprint: Sharding 2019-02-11, Sharding 2019-02-25
Participants:

 Comments   
Comment by Githook User [ 20/Feb/19 ]

Author:

{'name': 'Jack Mulrow', 'username': 'jsmulrow', 'email': 'jack.mulrow@mongodb.com'}

Message: SERVER-36498 Gate "state" field in config.transactions on FCV and remove entries with state on downgrade
Branch: master
https://github.com/mongodb/mongo/commit/0d79175a88ee958722c0ffb276606949e695d028

Comment by Siyuan Zhou [ 11/Feb/19 ]

> I don't think there is a simple way to differentiate retryable writes config.transactions entries from committed transaction entries

Traversing the oplog chain and checking for applyOps will help us differentiate them. 4.0 version of transaction commit is an applyOps command, but applyOps is not allowed for retryable writes. Does that help us here?

For large transactions, we will have a new field for those oplog entries in transactions to differentiate them from retryable writes.

Comment by Jack Mulrow [ 05/Feb/19 ]

kaloian.manassiev, I realized a problem with using checkOutSessionForKill() is the method throws if the session isn't in the SessionCatalog. That means it can't be used to check out a session that has an entry in config.transactions but hasn't been loaded into memory on the current primary.

How do you think I should get around this? I can use an AlternativeClientRegion to check out the session with MongodOperationContextSession which creates the session if it doesn't exist in-memory, but I know you'd like to avoid the ACR. Instead, I could add a way to check out a session without installing it onto the opCtx that also creates the session if necessary (possibly without killing it too), but I'm worried that might make it easier to misuse the SessionCatalog.

Comment by Judah Schvimer [ 04/Feb/19 ]

pavithra.vetriselvan should be fixing the kAborted case in SERVER-35872.

Comment by Jack Mulrow [ 04/Feb/19 ]

Is the problem there that setFCV can be called on a session (which is different from the session you are operating on)? 

Yeah exactly. To get around that I either need to get a new opCtx or use checkOutSessionForKill, but that requires a KillToken, which as far as I know can only be gotten by killing the session. It seems possible we could add a third way to check out a session that doesn't require killing or installing it on the opCtx, but that feels unnecessary.

For 3.1.2.3 - it is because of this that the transaction participant will indicate that it is part of transaction or retryable write, right?

Yeah, that should guarantee the session's txnState is kCommitted if the session was most recently used to commit a transaction, with prepare or without. I'm realizing we don't check if the transaction record's state is "kAborted" though, so if the session aborted after prepare, refreshing won't transition the session to the kAborted state. I'll add a test case for this to know for sure, but I think I'll have to fix that as part of this patch or distinguish aborted sessions from retryable writes in a different way.

Comment by Judah Schvimer [ 04/Feb/19 ]

This all seems good to me. I'd like to avoid killing sessions on upgrade, if possible. I'd also prefer not to use checkOutForKill if we're not killing. That seems confusing and misleading.

3.2.5 invalidating makes sense.

Comment by Kaloian Manassiev [ 04/Feb/19 ]

Why invalidate the TransactionParticipant? Aren't we putting it into its correct state? ... Kaloian Manassiev, do you know if invalidation is necessary?

From your description jack.mulrow, you are just modifying the on-disk state, but not actually putting the participant in the correct state, expecting the next refresh to do that. And I think that's the better way to do it. Otherwise you need to write new code for "patching-up" the participant, which I think is wasteful given that we have the refresh logic.

Comment by Kaloian Manassiev [ 04/Feb/19 ]

For 2.2.1 - you are suggesting to use killSession because of problem (1) that you listed, right? Is the problem there that setFCV can be called on a session (which is different from the session you are operating on)? I don't love it that we unnecessarily kill sessions though, but I prefer killing the session over using AlternativeClientRegion - I think the latter is very error prone.

For 3.1.2.3 - it is because of this that the transaction participant will indicate that it is part of transaction or retryable write, right?

For question (3) - killing a session just interrupts the operation context, but doesn't have anything to do with invalidation.

Other than that, jack.mulrow's logic seems good to me.

judah.schvimer, for your question (2) - scanSessions only scans what's in memory, but there could be sessions on disk, which have not been accessed and need to be fetched in memory, which requires check out.

Comment by Jack Mulrow [ 04/Feb/19 ]

2.2.1) At what point are we actually calling killSession?

I was planning on calling it after getting a set of all sessions that potentially need to have their entries modified - for upgrade: all sessions, for downgrade: just sessions that have a "state" field in their config.transactions entry. This all happens after FCV is set to downgrading.

2.2.3) If the state is kAborted, it should also still be deleted, correct?

Yeah, good point. We'll have to delete for any txnState that means the session's config.transactions entry points to an oplog entry not supported by 4.0, so kAbort and kCommit. I don't think it's possible for the session to be in kPrepared, since they're checked out after the node is in the downgrading state, so I might add an invariant for that.

3.2.1) Why are we checking out the session for kill and not normal checkout?

The main reason is code simplicity. Checking out for kill doesn't install the session on the opCtx, which makes modifying the transaction table entry while the session is checked out a lot simpler. I was able to make things work using the normal checkout, but I had to create new opCtxs in AlternateClientRegions for both the checkout and then then the modification (to work around session id invariants), which was a little awkward. If you'd rather not kill the sessions though, I'm fine going with the ACR approach for now and we can come back to this in the CR if we don't like how it ends up.

3.2.5) Why invalidate the TransactionParticipant? Aren't we putting it into its correct state?

I'm actually not sure if we do need to invalidate the session after the refactor to the TransactionParticipant. I remember in 3.6 for retryable writes the session needed to be invalidated after on-disk changes to force the next thread that checks out the session to refresh its in-memory state, but that may not still be true/required. kaloian.manassiev, do you know if invalidation is necessary?

1) I think killing all sessions on upgrade is undesirable, but fine if needed. I'm not sure why it's needed though. Downgrade it seems totally fine.

Got it, I'll try not to kill on upgrade for the first review patch, since like I put above it's just for code simplicity.

2) Does scanSessions not accomplish what you want?

Not really, since there might be entries in config.transactions that aren't in the session catalog when we're upgrading/downgrading, so I think we have to scan the whole collection to be sure we don't miss a session.

3) I think invalidating it is safer, since killing the session will just abort the transaction, which isn't really accurate to the on-disk state.

Yeah that was my thinking too. And if it turns out it's unnecessary, it will be easy to remove.

Comment by Judah Schvimer [ 02/Feb/19 ]

2.2.1) At what point are we actually calling kilSession?
2.2.3) If the state is kAborted, it should also still be deleted, correct?
3.2.1) Why are we checking out the session for kill and not normal checkout?
3.2.5) Why invalidate the TransactionParticipant? Aren't we putting it into its correct state?

1) I think killing all sessions on upgrade is undesirable, but fine if needed. I'm not sure why it's needed though. Downgrade it seems totally fine.
2) Does scanSessions not accomplish what you want?
3) I think invalidating it is safer, since killing the session will just abort the transaction, which isn't really accurate to the on-disk state.

Comment by Jack Mulrow [ 01/Feb/19 ]

Ok it sounds like we're fine with the tradeoffs, so I've been thinking about how to implement this, and here's what I have so far. The main idea is to guard the "state" field by FCV and instead of removing the field on downgrade, we delete the transaction table documents that contain it.

Proposed implementation:

  1. Only write the "state" field in config.transactions documents if FCV is 4.2 or upgrading to 4.2
    1. Add an FCV check in logApplyOpsForTransaction() so we only pass a txnState to onWriteOpCompleted() when FCV >= kUpgradingTo42
  2. Remove config.transactions entries that have a "state" field (and therefore can point to oplog entries in the 4.2 format) during downgrade.
    1. In the setFCV command, after FCV is set to kDowngradingTo40 and the global S lock barrier, do the following:
      1. Scan config.transactions for all entries with a state field ({state: {$exists: true}}) and place their session ids into a set. Because we've passed the barrier, there should be no in-progress prepared transactions and any concurrent transactions will not write a "state" field.
      2. Iterate the set of session ids and for each:
        1. Check out its corresponding session using SessionCatalog::checkOutSessionForKill() with the killToken from calling SessionCatalog::killSession() using the session's id.
        2. Get the session's TransactionParticipant with the checked out session and refresh it from storage manually since checking out for kill doesn't automatically refresh
        3. Verify the session's entry still needs to be deleted by checking its txnState == kCommitted, skipping the session if it does not. This prevents any retryable writes that ran between the scan and this step from having their entries deleted.
        4. Using a DBDirectClient with setFCV's opCtx, delete the session's entry from config.transactions.
        5. Invalidate the TransactionParticipant.
  3. Set state="committed" in each config.transactions table entry that corresponds to a committed transaction during upgrade, since only committed transactions have transaction table entries in 4.0.
    1. In setFCV, after FCV is set to kUpgradingTo42 and the global S lock barrier, do the following:
      1. I don't think there is a simple way to differentiate retryable writes config.transactions entries from committed transaction entries, so collection scan config.transactions and place every session id into a set. We're not fully upgraded to FCV 4.2 and we've passed the barrier, so there should be no in-progress prepared transactions and any concurrent transactions will write a "state" field.
      2. Iterate the set of session ids and for each:
        1. Check out its corresponding session using SessionCatalog::checkOutSessionForKill() with the killToken from calling SessionCatalog::killSession() using the session's id.
        2. Get the session's TransactionParticipant with the checked out session and refresh it from storage manually since checking out for kill doesn't automatically refresh
        3. Verify the session's entry is not a retryable write by checking its txnState != kNone, skipping the session if it is. This check won't catch transactions that commit after the global S lock barrier (which will have "state"), but updating them shouldn't be harmful.
        4. Using a DBDirectClient with setFCV's opCtx, update the session's entry in config.transactions using {$set: {state: "committed"}}.
        5. Invalidate the TransactionParticipant.

judah.schvimer, kaloian.manassiev, does this sound reasonable? If so, I have a couple implementation questions:

  1. I ran into problems if I checked out the session using MongoDOperationContextSession (I got around them with AlternativeClientRegions, but that felt like overkill), so I'd like to use the checkOutSessionForKill() interface, which requires interrupting any active operations on those sessions - is that OK? Upgrade is rare so it seems acceptable to me, but it's not great, especially in the upgrade case where I'm killing all sessions in the transactions table.
  2. For part 3.1, loading each session id from config.transactions into a set seems to work, but I'm worried about using too much memory - do you think that will be a problem? The footprint of each id is pretty small and upgrade is a rare operation, so I don't think it will be that bad.
  3. Do I need to invalidate the TransactionParticipant after I kill its session (parts 2.2.5 and 3.2.5) or is that redundant?
Comment by Judah Schvimer [ 30/Jan/19 ]

1) I think losing retryability on downgrade is fine. It's a rare operation known to have limitations.
2) Could it lead to committing a transaction on one shard but aborting on another? It sounds plausible but I would have to spend more time to feel confident or to find a dangerous case. I don't think that clearing "state" entries on shutdown solves this problem either. It also sounds like this would only occur if a driver were misbehaving, which is something I think I'm okay with us trusting won't happen.
3) Losing retryability on downgrade is fine.
4) I think giving them a "state" field makes sense. I had originally anticipated the server understanding both formats for commit w/o prepare (with and without a"state" field), however adding a "state" field seems better for 4.4 always seeing a "state" field for transactions. I would check out the session. It is probably necessary, or at least recommended, for modifying the transaction document anyways, which invalidates the session.
5) If we never add "state" for commit w/o prepare, then we still need code to distinguish retryable writes from transactions without a "state" field, right? That would leave us in the same position as we are with not changing existing entries, but writing committing w/o prepare with a "state" field in fCV 4.2. There's not really a difference between committed w/o prepare and committed w/ prepare once you've committed, so it seems that keeping those two more similar would make more sense and differentiating the retryable writes case.

Comment by Jack Mulrow [ 30/Jan/19 ]

judah.schvimer, I've started looking into removing entries with "state" on FCV downgrade and only adding "state" in FCV 4.2, and I have a few questions:

1) If we delete these records on downgrade, we'll lose the ability to retry commit for their corresponding transactions after we upgrade. Is that acceptable behavior? If not, can you think of a way around this?

2) If we delete the records, clients will be able to re-use txnNumbers for the sessions associated with them, e.g. a client could commit a transaction in 4.2, downgrade, then re-use that txnId (or use a lower txnNumber) for some other operation. Is that ok? I don't think this would necessarily be dangerous, but it is odd behavior.

3) It looks like we also give a "state" field to entries for transactions that committed w/o prepare, so when we downgrade, we'll delete them and lose the ability to retry commit for them too. Is that acceptable? Or should I find a way to avoid deleting them (like one of the options I give in 4))?

4) To make sure the entries for transactions that commit w/o prepare match the 4.2 format, when we upgrade FCV, we'll have to give them a "state" field (or delete them entirely). The only ways I can think of to distinguish them from retryable writes entries are to manually look up the oplog entry the transaction record points to or to check out its corresponding session. Do you have a preference between these options, or can you think of a better way?

5) Alternatively, we could address 3) and 4) by never adding "state" to the entries for transactions that commit w/o prepare, so they're not affected by the 4.2 upgrade/downgrade at all. From what I can tell, "state" only matters for prepared transactions, but I'm not familiar enough with how we use it to say for sure. What do you think?

Comment by Judah Schvimer [ 07/Jan/19 ]

An alternative to consider is removing the entries on fCV downgrade and only adding a 'state' field in fCV 4.2

Comment by Judah Schvimer [ 14/Nov/18 ]

renctan pointed out that any entries that reference problematic oplog entries will contain a 'state' field. I cannot recall why we decided to clear the entire table originally. It seems safe to just remove entries with a 'state' field on shutdown in fCV 4.0 or on fCV downgrade if we change the code to only add a 'state' field in fCV 4.2.

Comment by Judah Schvimer [ 14/Nov/18 ]

A problem with not clearing them all is that the 'config.transactions' table could reference 'commitTransaction' or 'abortTransaction' oplog entries that 4.0 does not recognize.

Comment by Randolph Tan [ 14/Nov/18 ]

judah.schvimer This sounds like a breaking change. What if we just remove config.transactions documents that have a state field?

Comment by Judah Schvimer [ 12/Nov/18 ]

Yes. The idea was that shutting down a 4.2 server in fCV 4.0 will mean you will no longer be able to retry those writes.

I think you're right that this could lead to a user/driver thinking it's safe to retry a write when it is not safe. That said, if an entry has "state=committed" its pointer could point to a 'commitTransaction' oplog entry that a 4.0 binary would crash trying to parse. If an entry has "state=aborted" it's unclear what we would change it to for 4.0 to properly understand it since we have no record of the previous version of the entry, which is what a 4.0 binary would have (since 4.0 doesn't update the transactions table on abort).

renctan, any ideas for how to make it clear to a 4.0 node that writes on a given session are unsafe to retry?

Comment by Randolph Tan [ 12/Nov/18 ]

Does empty mean removing all documents in config.transactions? Does this mean you will no longer be able to safely retry writes when a shutdown occurs?

Comment by Judah Schvimer [ 24/Sep/18 ]

When complete, please file a ticket to remove any upgrade/downgrade code unnecessary in MongoDB 4.4 with the "4.3 Required" fixVersion and the "replication-42-fcv-checks" label.

Generated at Thu Feb 08 04:43:17 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.