[SERVER-49949] Reconstructing prepared transactions containing multi-key writes crashes the initial syncing node. Created: 28/Jul/20  Updated: 29/Oct/23  Resolved: 14/Aug/20

Status: Closed
Project: Core Server
Component/s: Replication, Storage
Affects Version/s: None
Fix Version/s: 4.4.1, 4.7.0

Type: Bug
Priority: Major - P3
Reporter: Suganthi Mani
Assignee: Lingzhi Deng
Resolution: Fixed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Related
related to SERVER-42251 Cannot timestamp multikey write durin... Closed
related to SERVER-48010 Substitute ghost timestamp with no-op... Closed
is related to SERVER-53932 Multikey write during recovery of pre... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v4.4, v4.2
Sprint: Repl 2020-08-24
Participants:
Linked BF Score: 20

 Description   

The on-disk catalog multi-key update for a multi-statement transaction happens in a side transaction block. SERVER-48010 replaced the ghost timestamp (determined by the logical clock) with a no-op write for timestamping the on-disk catalog writes on primaries. Because initial sync doesn't set the transaction's recoveryPrepareOpTime while reconstructing prepared transactions, it wrongly takes the primary no-op write code path to acquire a timestamp for the catalog write. As a result, we hit this invariant failure, which crashes the initial syncing node.

This happens in both 4.4 and master.
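
The failure mode described above can be modeled roughly as follows. This is a minimal Python sketch, not the server's C++ code: `PreparedTxn`, `catalog_write_timestamp`, `can_accept_writes`, and `InvariantFailure` are hypothetical names, and the sketch assumes the invariant being hit is that the no-op write path runs only on a node that can accept writes (the ticket does not spell out which invariant fires).

```python
from dataclasses import dataclass
from itertools import count
from typing import Optional


class InvariantFailure(Exception):
    """Stands in for a server invariant; a real node would abort here."""


@dataclass
class PreparedTxn:
    # Hypothetical model of the transaction state. Startup/rollback recovery
    # sets this field; initial sync (before the fix) leaves it as None.
    recovery_prepare_ts: Optional[int] = None


_noop_clock = count(100)  # hypothetical source of timestamps for no-op writes


def catalog_write_timestamp(txn: PreparedTxn, can_accept_writes: bool) -> int:
    """Pick the timestamp for the side-transaction multi-key catalog write."""
    if txn.recovery_prepare_ts is not None:
        # Recovery path: reuse the prepare timestamp.
        return txn.recovery_prepare_ts
    # Primary path: the timestamp comes from writing a no-op oplog entry.
    # Assumption: that path is only valid on a node that can accept writes.
    if not can_accept_writes:
        raise InvariantFailure("no-op write path taken on a non-primary node")
    return next(_noop_clock)
```

In this model, startup/rollback recovery corresponds to `PreparedTxn(recovery_prepare_ts=...)`, which returns the prepare timestamp, while an initial syncing node corresponds to `PreparedTxn()` with `can_accept_writes=False`, which raises.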



 Comments   
Comment by Samyukta Lanka [ 19/Jan/21 ]

The original problem can't happen on 4.2, but we're seeing that the prepareTimestamp can be earlier than the oldestTimestamp during startup recovery, which will cause an issue while trying to set the timestamp for the catalog multi-key update. Requesting a backport to 4.2 since the rounding logic will solve this issue as well.

Comment by Githook User [ 24/Aug/20 ]

Author: Lingzhi Deng (email: lingzhi.deng@mongodb.com, username: ldennis)

Message: SERVER-49949: Fix reconstructing prepared transactions containing multi-key writes after initial sync

(cherry picked from commit 5821e5bc1e2e8c4ed3e791a60a104d57f104caf1)
Branch: v4.4
https://github.com/mongodb/mongo/commit/4419a5c2a0d099b40b2155dd4def37802286cb7c

Comment by Suganthi Mani [ 19/Aug/20 ]

tess.avitabile, 4.2 is not affected by this issue. SERVER-48010, which introduced this bug, is only available from 4.4 onwards.

Comment by Tess Avitabile (Inactive) [ 19/Aug/20 ]

lingzhi.deng, suganthi.mani, do you know if this affects 4.2?

Comment by Githook User [ 14/Aug/20 ]

Author: Lingzhi Deng (email: lingzhi.deng@mongodb.com, username: ldennis)

Message: SERVER-49949: Fix reconstructing prepared transactions containing multi-key writes after initial sync
Branch: master
https://github.com/mongodb/mongo/commit/5821e5bc1e2e8c4ed3e791a60a104d57f104caf1

Comment by Lingzhi Deng [ 12/Aug/20 ]

After taking a look at the problem, I think this is probably not a one-liner fix because we could be reconstructing a prepared transaction behind the oldest timestamp during initial sync. And if we use the timestamp of the prepare entry to do the multikey write, then we would be writing behind the oldest timestamp. And I think that was why we needed setRoundUpPreparedTimestamps. So I think we will need some kind of roundup as well for the multikey writes. I tried using the max of (prepare timestamp, oldest timestamp) for the multikey write and it seems to work. daniel.gottlieb, do you think this is a reasonable approach?

Another idea is to use the initialDataTimestamp. I think this is also safe, as no reader is allowed at a timestamp earlier than that after the initial sync completes. So it doesn't seem to matter which timestamp we use, as long as it is <= initialDataTimestamp.
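
The round-up idea in the comment above, taking the max of the prepare timestamp and the oldest timestamp, can be illustrated with a minimal Python sketch. It is not the committed fix: `round_up_multikey_timestamp` is a hypothetical name, and timestamps are modeled as `(seconds, increment)` tuples, which compare lexicographically.

```python
from typing import Tuple

# Timestamps modeled as (seconds, increment); tuples compare lexicographically.
Timestamp = Tuple[int, int]


def round_up_multikey_timestamp(prepare_ts: Timestamp, oldest_ts: Timestamp) -> Timestamp:
    """Return a timestamp for the multikey write that is never behind oldest_ts."""
    # If the prepare entry is behind the oldest timestamp, round up to the
    # oldest timestamp; otherwise keep the prepare timestamp unchanged.
    return max(prepare_ts, oldest_ts)


# Prepare entry behind the oldest timestamp: the write is rounded up.
behind = round_up_multikey_timestamp((10, 1), (20, 0))
# Prepare entry at or after the oldest timestamp: the prepare timestamp is kept.
ahead = round_up_multikey_timestamp((30, 2), (20, 0))
```

Here `behind` evaluates to `(20, 0)` and `ahead` to `(30, 2)`, so the multikey write never lands behind the oldest timestamp.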

Comment by Suganthi Mani [ 03/Aug/20 ]

kelsey.schubert, I just want to bring this ticket to your attention, as this bug exists on 4.4. I think that, per the "4.4 Backports Post GA schedule by team", the replication team's backports are scheduled for August 21. Do you see any urgency for this ticket to be backported before August 21?

Comment by Suganthi Mani [ 29/Jul/20 ]

We don't hit this problem for startup/rollback recovery because we set recoveryPrepareOpTime to prepareTimestamp when the node's oplog application mode is recovering. As a result, this prepareTimestamp is used to timestamp the catalog multi-key update. We should have a similar solution for the initial sync case. It's actually safe to use the prepare timestamp for the catalog multi-key update during initial sync and startup/rollback recovery, because that timestamp is guaranteed to be less than or equal to the commit timestamp of the transaction, and the contract is that the multi-key write must occur at a time <= the first write that makes an index multi-key (see this comment).

Note: Secondary oplog application doesn't use the side transaction block for the catalog multi-key update because the opCtx that performs the catalog write is different from the opCtx that applies the prepared transaction.

The fix should be a one-line change. tess.avitabile, do you know what the priority of this ticket would be? We consistently crash the initial syncing node when reconstructing prepared transactions containing multi-key writes on 4.4.

CC daniel.gottlieb
