[SERVER-31769] Retrying findAndModify can error if oplog write was rolled back Created: 30/Oct/17 Updated: 06/Dec/22 Resolved: 29/Jan/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 3.6.4 |
| Fix Version/s: | 4.1 Desired |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Randolph Tan | Assignee: | [DO NOT USE] Backlog - Sharding Team |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | former-quick-wins |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Assigned Teams: | Sharding |
| Operating System: | ALL |
| Backport Requested: | v4.0, v3.6 |
| Sprint: | Sharding 2018-08-13 |
| Participants: | |
| Description |
|
Retrying a findAndModify can fail if the oplog entry for its write was rolled back but the pre/post image oplog entry was not. For a retryable findAndModify, the pre/post image is stored as a separate oplog entry with op type 'n', which is written to the oplog before the actual update/remove oplog entry. If a secondary replicated up to the pre/post image entry but not the actual update/remove, it will be in an inconsistent state if it becomes the new primary at that point. A retry will then fail: the server thinks it has already executed the write, but it cannot fetch the oplog entries because it expects both the pre/post image entry and the update/remove entry to exist. |
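The failure described above can be reproduced in a toy model. This is only a sketch, not server code: `OplogEntry`, `apply_batch`, `fetch_write_history`, the single `session` string (standing in for lsid plus txnNumber) and the `image_ts` link are all hypothetical names chosen for illustration, and optimes are reduced to integers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OplogEntry:
    ts: int                          # simplified optime
    op: str                          # 'n' = no-op carrying the image, 'u' = update
    session: Optional[str] = None    # hypothetical stand-in for lsid + txnNumber
    image_ts: Optional[int] = None   # hypothetical link from the write to its image entry


class RetryError(Exception):
    pass


def apply_batch(entries, oplog, txn_table):
    """Model of a secondary applying a batch: any entry tagged with a
    session advances that session's last-write pointer."""
    for e in entries:
        oplog[e.ts] = e
        if e.session is not None:
            txn_table[e.session] = e.ts


def fetch_write_history(session, oplog, txn_table):
    """Model of the retry path: the session is recorded as executed,
    so both the write entry and its image entry must be present."""
    last = oplog[txn_table[session]]
    if last.op == "n":
        raise RetryError("last write pointer refers to an image entry with no write entry")
    return oplog[last.image_ts], last


# The primary writes the image entry first, then the actual update.
primary_oplog = [
    OplogEntry(ts=1, op="n", session="s1"),
    OplogEntry(ts=2, op="u", session="s1", image_ts=1),
]

# The secondary only replicated the image entry before taking over.
secondary_oplog, secondary_txn_table = {}, {}
apply_batch(primary_oplog[:1], secondary_oplog, secondary_txn_table)

try:
    fetch_write_history("s1", secondary_oplog, secondary_txn_table)
    outcome = "ok"
except RetryError as exc:
    outcome = f"retry failed: {exc}"
print(outcome)
```

In this model the retry fails only when replication stops between the two entries; applying both entries lets `fetch_write_history` return the image and the write as a pair.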
| Comments |
| Comment by Randolph Tan [ 10/May/18 ] |
|
Note: I took a stab at this, and it looks like we can't simply remove the fields without breaking the previous major release. This is because of checks like this and this. If we proceed with this approach, we will need to relax the checks in the next major release and remove the fields completely in the release after that. |
| Comment by Randolph Tan [ 02/Nov/17 ] |
|
Based on offline discussion, we believe that removing the txnNumber, lsid and statementId fields from the pre/post image oplog entry will make this problem go away. These fields are not really needed; they are only there for diagnosability and sanity checks. |
| Comment by Andy Schwerin [ 02/Nov/17 ] |
|
I propose the correct solution is that pre/post image "n" entries shouldn't update the transaction table when processed on secondaries. |
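A rough sketch of what this proposal would change, under heavy simplification: `update_txn_table` and the `skip_image_noops` flag are hypothetical, entries are plain dicts, and every 'n' entry is treated as a pre/post image entry, which is an assumption made only for this illustration.

```python
def update_txn_table(entries, skip_image_noops):
    """Build a toy transaction table (session id -> last write timestamp)
    from replicated oplog entries.

    With skip_image_noops=True, 'n' entries never move the pointer,
    which is the behaviour proposed in the comment above.
    """
    table = {}
    for entry in entries:
        if entry.get("lsid") is None:
            continue  # not part of a retryable write
        if skip_image_noops and entry["op"] == "n":
            continue  # image entry: leave the transaction table untouched
        table[entry["lsid"]] = entry["ts"]
    return table


image = {"ts": 1, "op": "n", "lsid": "s1"}   # pre/post image entry
write = {"ts": 2, "op": "u", "lsid": "s1"}   # actual update entry

# Only the image entry was replicated before the failover.
current_partial = update_txn_table([image], skip_image_noops=False)
proposed_partial = update_txn_table([image], skip_image_noops=True)

# Both entries were replicated.
proposed_full = update_txn_table([image, write], skip_image_noops=True)

print(current_partial, proposed_partial, proposed_full)
```

With the flag set, a secondary that saw only the image entry has no record for the session, so a retry would be treated as a write that never happened; once the update entry arrives, the table points at it as before.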
| Comment by Randolph Tan [ 02/Nov/17 ] |
|
schwerin The reason we have this problem is a combination of these things: 1. findAndModify write history comes in pairs of oplog entries, and both are needed for it to be complete. In more detail: there are 2 oplog entries, 1 for the pre/post image and 1 for the actual write. We insert the pre/post image entry first and the actual write second. If an oplog fetch batch boundary happens to include the pre/post image but not the actual write, the secondary's config.transactions lastWriteOpTime will end up pointing to the pre/post image. When this secondary becomes the new primary without having seen the other oplog entry, it ends up in a state where it thinks it has completed the write (because of the optimization), but it errors out when it tries to fetch the write history and finds that something is wrong. |
| Comment by Andy Schwerin [ 02/Nov/17 ] |
|
I'm confused by this description, renctan. If the new primary only has the "n" entry and not the actual description of the write, won't its entry in the transaction table be as though the write never happened? That seems correct. Can you take another crack at explaining this? |