[SERVER-31769] Retrying findAndModify can error if oplog write was rolled back Created: 30/Oct/17  Updated: 06/Dec/22  Resolved: 29/Jan/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.6.4
Fix Version/s: 4.1 Desired

Type: Bug Priority: Major - P3
Reporter: Randolph Tan Assignee: [DO NOT USE] Backlog - Sharding Team
Resolution: Duplicate Votes: 0
Labels: former-quick-wins
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Duplicate
duplicates SERVER-36004 SessionUpdateTracker should ignore no... Closed
Related
is related to SERVER-36004 SessionUpdateTracker should ignore no... Closed
Assigned Teams:
Sharding
Operating System: ALL
Backport Requested:
v4.0, v3.6
Sprint: Sharding 2018-08-13
Participants:

 Description   

Retrying a findAndModify can error if the oplog entry for its write was rolled back but the pre/post image oplog entry was not.

For retryable findAndModify, the pre/post image is stored as a separate oplog entry with op type 'n'. This entry is written to the oplog before the actual update/remove oplog entry. If a secondary was able to replicate up to the pre/post image but not the actual update/remove, it will be in an inconsistent state if it becomes the new primary at that point. Attempting to retry will fail because the server thinks that it has already executed the write, but it cannot properly fetch the oplog entries because it expects both the pre/post image and the update/remove oplog entries to exist.



 Comments   
Comment by Randolph Tan [ 10/May/18 ]

Note: Took a stab at this and it looks like we can't simply remove the fields without breaking the previous major release. This is because of checks like this and this. If we proceed with this approach, we will need to relax the checks in the next major release and completely take out the fields in the major release after that.

Comment by Randolph Tan [ 02/Nov/17 ]

Based on offline discussion, we believe that removing the txnNumber, lsid and statementId fields from the pre/post image oplog entry will make this problem go away. These fields are not really needed; they are only there for diagnosability and sanity checks.

Comment by Andy Schwerin [ 02/Nov/17 ]

I propose the correct solution is that pre/post image "n" entries shouldn't update the transaction table when processed on secondaries.
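
A minimal sketch of what this proposal could look like, written as a toy Python model rather than actual server code. The `is_image` flag and the `update_transaction_table` function are hypothetical names introduced only for illustration; the thread does not say how the server would tell a pre/post image no-op apart from other 'n' entries.

```python
def update_transaction_table(table, entry):
    """Toy secondary-side hook: record the last write per session.

    `entry` is a plain dict standing in for an oplog entry. The
    `is_image` key is a hypothetical marker for pre/post image no-ops.
    """
    if entry["op"] == "n" and entry.get("is_image", False):
        # Proposed behavior: pre/post image entries leave the table untouched.
        return table
    table[entry["lsid"]] = entry["ts"]
    return table


table = {}
# Pre/post image no-op arrives first and is skipped.
update_transaction_table(table, {"ts": 1, "op": "n", "lsid": "s1", "is_image": True})
print(table)
# The actual update entry is what moves the session's last write forward.
update_transaction_table(table, {"ts": 2, "op": "u", "lsid": "s1"})
print(table)
```

In this model, any other 'n' entry (one without the hypothetical flag) still updates the table.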

Comment by Randolph Tan [ 02/Nov/17 ]

schwerin The reason we have this problem is a combination of these things:

1. findAndModify write history comes in pairs of oplog entries. You need both for the history to be complete.
2. Secondary oplog application updates config.transactions even for type 'n' entries. This is because migrated sessions come in as type 'n' oplog entries, so the secondaries cannot just ignore them.
3. We added an optimization in the Session class to mark a statementId as completed as long as we see the statement id while traversing through the history. Because of #2, config.transactions will point to the pre/post image oplog entry. In addition, the pre/post image oplog entry has a valid statement id, lsid and txnNum, which can 'fool' this optimization.

In more detail: The problem is that we have 2 oplog entries: 1 for the pre/post image and 1 for the actual write. We insert the pre/post image oplog entry first and the actual write second. If the oplog fetch batch boundary happens to include the pre/post image but not the actual write, the secondary's config.transactions lastWriteOpTime will end up pointing to the pre/post image. If this secondary becomes the new primary without seeing the other oplog entry, it will end up in a state where it thinks that it has completed the write (because of the optimization), but it errors out when it actually tries to fetch the write history and finds that the history is incomplete.
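
The sequence above can be reproduced in a toy Python model. This is only a sketch under simplifying assumptions, not server code: `OplogEntry`, `Node`, and their methods are hypothetical names, the optime is reduced to an integer, and the history traversal is collapsed to a look at the single entry that the transaction table points to.

```python
from dataclasses import dataclass


@dataclass
class OplogEntry:
    ts: int          # simplified optime
    op: str          # 'n' = no-op (pre/post image), 'u' = update
    lsid: str
    txn_number: int
    stmt_id: int


class Node:
    def __init__(self):
        self.oplog = {}          # ts -> OplogEntry
        self.last_write_ts = {}  # stand-in for config.transactions lastWriteOpTime

    def apply(self, entry):
        self.oplog[entry.ts] = entry
        # Point 2: the table is updated even for type 'n' entries.
        self.last_write_ts[(entry.lsid, entry.txn_number)] = entry.ts

    def statement_executed(self, lsid, txn_number, stmt_id):
        # Point 3: seeing the statement id in the history marks it completed.
        ts = self.last_write_ts.get((lsid, txn_number))
        return ts is not None and self.oplog[ts].stmt_id == stmt_id

    def retry_find_and_modify(self, lsid, txn_number, stmt_id):
        if not self.statement_executed(lsid, txn_number, stmt_id):
            return "execute as new write"
        entry = self.oplog[self.last_write_ts[(lsid, txn_number)]]
        if entry.op == "n":
            # Point 1: the history needs both entries; only the image is here.
            raise RuntimeError("write history incomplete: image entry without update/remove entry")
        return "return stored result"


image = OplogEntry(ts=1, op="n", lsid="s1", txn_number=0, stmt_id=0)
write = OplogEntry(ts=2, op="u", lsid="s1", txn_number=0, stmt_id=0)

healthy = Node()          # replicated both entries
healthy.apply(image)
healthy.apply(write)
print(healthy.retry_find_and_modify("s1", 0, 0))

lagging = Node()          # batch boundary fell between the two entries
lagging.apply(image)
try:
    lagging.retry_find_and_modify("s1", 0, 0)
except RuntimeError as exc:
    print("retry failed:", exc)
```

The node that saw both entries answers the retry from its history, while the node that saw only the image believes the statement ran and then fails when it looks for the write.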

Comment by Andy Schwerin [ 02/Nov/17 ]

I'm confused by this description, renctan. If the new primary only has the "n" entry and not the actual description of the write, won't its entry in the transaction table be as though the write never happened? That seems correct. Can you take another crack at explaining this?
