[SERVER-60969] ReplClientInfo::getLastOp may not be updated before being used to wait for replication in tenant migration Created: 25/Oct/21  Updated: 29/Oct/23  Resolved: 11/Nov/21

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 5.2.0, 5.1.1

Type: Bug Priority: Major - P3
Reporter: Wenbin Zhu Assignee: A. Jesse Jiryu Davis
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Problem/Incident
Related
related to SERVER-61404 Set rejectReadsBeforeTimestamp only once Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v5.1, v5.0
Sprint: Server Serverless 2021-11-01, Server Serverless 2021-11-15
Participants:
Linked BF Score: 167

Description

waitUntilMigrationReachesReturnAfterReachingTimestamp updates the rejectReadsBeforeTimestamp in the recipient state doc, and then uses ReplClientInfo::getLastOp() to get the latest opTime written by this client. That opTime is later used to wait for replication. However, in some cases the opTime may never have been initialized, so the wait for replication returns immediately. This causes the recipientSyncData command to return before the state doc updates have replicated to a majority of nodes, which is incorrect because the donor might then wrongly conclude that the migration is finished.

One scenario in which this can happen:

  1. The recipient reaches the consistent state.
  2. The donor sends a recipientSyncData command with returnAfterReachingTimestamp to the recipient.
  3. The donor starts failing over, and the new donor primary sends another recipientSyncData command with the same returnAfterReachingTimestamp.
  4. The first recipientSyncData command updates the rejectReadsBeforeTimestamp in the state doc, but fails to wait for replication due to a stepdown.
  5. The second recipientSyncData command also starts updating the rejectReadsBeforeTimestamp, but since the state doc was already updated in memory with the same rejectReadsBeforeTimestamp, the update is a no-op and nothing is written to the oplog. As a result, ReplClientInfo::getLastOp() is never updated, so we end up waiting on an uninitialized opTime (no writes ever happened on this client), and the wait for replication returns immediately.

This pattern (getting the opTime from ReplClientInfo::getLastOp() and waiting on that opTime) is used in multiple places in tenant migration, so this issue may occur elsewhere.
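A minimal sketch of the problematic pattern, where updateStateDocRejectReadsBeforeTimestamp() is a hypothetical stand-in for the real state-doc update in the recipient service; the actual code differs and construction details are simplified:

    #include "mongo/db/repl/repl_client_info.h"
    #include "mongo/db/repl/replication_coordinator.h"

    // Hypothetical stand-in for the real state-doc update (sketch only).
    void updateStateDocRejectReadsBeforeTimestamp(OperationContext* opCtx);

    void waitForStateDocMajority(OperationContext* opCtx) {
        // If the in-memory state doc already carries the same
        // rejectReadsBeforeTimestamp, this update is a no-op: nothing is
        // written to the oplog and this client's lastOp never advances.
        updateStateDocRejectReadsBeforeTimestamp(opCtx);

        // For a client that has never written, getLastOp() returns a
        // default-constructed (null) OpTime.
        auto lastOp = repl::ReplClientInfo::forClient(opCtx->getClient()).getLastOp();

        // Waiting for a null OpTime to become majority committed returns
        // immediately, so recipientSyncData can reply before the state doc
        // update is actually majority committed.
        WriteConcernOptions majority;  // construction simplified for the sketch
        majority.w = WriteConcernOptions::kMajority;
        uassertStatusOK(repl::ReplicationCoordinator::get(opCtx)
                            ->awaitReplication(opCtx, lastOp, majority)
                            .status);
    }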



Comments
Comment by Githook User [ 16/Nov/21 ]

Author: A. Jesse Jiryu Davis <jesse@mongodb.com> (ajdavis)

Message: SERVER-60969 Race in tenant migration state machine, try 2

(cherry picked from commit 65b6cd2fc68021388b992fad6d46fa349324f2a2)
Branch: v5.1
https://github.com/mongodb/mongo/commit/71e85a2029d7e93f2915191ea28be10e2d869859

Comment by Githook User [ 11/Nov/21 ]

Author: A. Jesse Jiryu Davis <jesse@mongodb.com> (ajdavis)

Message: SERVER-60969 Race in tenant migration state machine, try 2
Branch: master
https://github.com/mongodb/mongo/commit/65b6cd2fc68021388b992fad6d46fa349324f2a2

Comment by Githook User [ 05/Nov/21 ]

Author: A. Jesse Jiryu Davis <jesse@mongodb.com> (ajdavis)

Message: Revert "SERVER-60969 Race in tenant migration state machine"

This reverts commit 0eda431672cedbbe62e3b487d2389b1f5c5695df.
Branch: master
https://github.com/mongodb/mongo/commit/cb50e527b2d886d68ab831abe6d8dc61d63237bd

Comment by Githook User [ 05/Nov/21 ]

Author: A. Jesse Jiryu Davis <jesse@mongodb.com> (ajdavis)

Message: SERVER-60969 Race in tenant migration state machine
Branch: master
https://github.com/mongodb/mongo/commit/0eda431672cedbbe62e3b487d2389b1f5c5695df

Comment by Suganthi Mani [ 03/Nov/21 ]

Here are a *few updates*:

I was wondering whether the fix needs to be applied to the donor side as well. Thinking about it made me doubt the scenario in the problem description, since it describes the new recipient primary as already having the state document w/ rejectReadsBeforeTimestamp set (non-majority committed), resulting in a no-op write, the recipientSyncData command response being returned without ensuring the state doc is majority committed, and eventually a rollback of the rejectReadsBeforeTimestamp update.

To my understanding, the POS (PrimaryOnlyService) starts the recipient instance from the majority-committed state document. That means there is no possibility that the recipientSyncData command response was returned before the state doc w/ the rejectReadsBeforeTimestamp update was majority committed.

Spoke with wenbin.zhu and we realized the scenario that caused the BF failure is the following (it also aligns with the BF-22917 logs):
1) Donor D1 (primary) sends recipientSyncData w/ returnAfterReachingDonorTimestamp to Recipient R1 (primary).
2) Donor D1 steps down.
3) Donor D2 steps up and becomes primary; it also sends recipientSyncData w/ returnAfterReachingDonorTimestamp to Recipient R1.
4) R1 receives two concurrent recipientSyncData cmds, one from the stale primary D1 and the other from the active primary D2. waitUntilMigrationReachesReturnAfterReachingTimestamp() is not entirely serialized: both commands wait for the block timestamp to be majority committed and then try to update the state doc w/ rejectReadsBeforeTimestamp. The cmd from stale primary D1 updated the document, so the cmd from active primary D2 did a no-op.
5) As a result, the cmd from active primary D2 didn't wait for the state doc to be majority replicated and returned a response to the active primary D2.
6) Now, the cmd from stale primary D1 waits for the state doc to be majority committed, but R1 steps down, eventually leading to a rollback of the rejectReadsBeforeTimestamp update.

The problem is that waitUntilMigrationReachesReturnAfterReachingTimestamp(), which updates the state document, is not concurrency-safe.

The description notes that this pattern (getting the opTime from ReplClientInfo::getLastOp() and waiting on that opTime) is used in multiple places in tenant migration, so the issue may happen elsewhere.

Having said that, this pattern is currently safe everywhere else in the tenant migration code (both donor and recipient) except this one place, where we update the state doc to set rejectReadsBeforeTimestamp, because that is the only place we update the state doc outside the future chain. I believe we initially wrote the tenant migration code with the assumption that the state doc is modified only via the chain; setting rejectReadsBeforeTimestamp seems to be the exception, where the state doc gets updated in the command caller path.
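One way to close that hole for the caller-path update, sketched purely as an illustration (not necessarily what the eventual fix does; the update helper is hypothetical): detect the no-op and advance this client's lastOp before waiting.

    auto& replClient = repl::ReplClientInfo::forClient(opCtx->getClient());
    const auto opTimeBefore = replClient.getLastOp();

    updateStateDocRejectReadsBeforeTimestamp(opCtx);  // hypothetical helper

    if (replClient.getLastOp() == opTimeBefore) {
        // The update was a no-op (a concurrent command already wrote the
        // same rejectReadsBeforeTimestamp). Advance lastOp to the system's
        // last opTime so the majority wait also covers the earlier write.
        replClient.setLastOpToSystemLastOpTime(opCtx);
    }
    // ... then wait for replClient.getLastOp() with majority write concern ...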

Comment by Lingzhi Deng [ 25/Oct/21 ]

FWIW, client writes use the LastOpFixer to handle noop writes. We may want to consider using that for internal writes in tenant migration.
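For reference, a conceptual sketch of LastOpFixer's RAII behavior (simplified; the real class lives in the server's write path and its signatures differ):

    class LastOpFixerSketch {
    public:
        explicit LastOpFixerSketch(OperationContext* opCtx)
            : _opCtx(opCtx),
              _opTimeAtStart(
                  repl::ReplClientInfo::forClient(opCtx->getClient()).getLastOp()) {}

        ~LastOpFixerSketch() {
            auto& replClient = repl::ReplClientInfo::forClient(_opCtx->getClient());
            if (replClient.getLastOp() == _opTimeAtStart) {
                // The guarded write turned out to be a no-op: advance lastOp
                // to the system's last opTime so a later wait for write
                // concern waits on a real opTime instead of returning
                // immediately.
                replClient.setLastOpToSystemLastOpTime(_opCtx);
            }
        }

    private:
        OperationContext* _opCtx;
        repl::OpTime _opTimeAtStart;
    };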

Comment by Wenbin Zhu [ 25/Oct/21 ]

cc lingzhi.deng

Comment by Wenbin Zhu [ 25/Oct/21 ]

I used a hack to quickly repro this issue: setting the rejectReadsBeforeTimestamp directly to the returnAfterReachingTimestamp instead of to the result of selectRejectReadsBeforeTimestamp, which returns the maximum of finalRecipientWriteTimestamp and returnAfterReachingTimestamp. In reality, returnAfterReachingTimestamp could be greater than finalRecipientWriteTimestamp, so the repro is valid.
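A sketch of that hack, using the function shape described above (signatures simplified):

    #include <algorithm>

    // Normal behavior per the ticket: take the max of the two timestamps.
    Timestamp selectRejectReadsBeforeTimestamp(Timestamp returnAfterReachingTimestamp,
                                               Timestamp finalRecipientWriteTimestamp) {
        return std::max(finalRecipientWriteTimestamp, returnAfterReachingTimestamp);
    }

    // Repro hack: bypass the max() and use returnAfterReachingTimestamp
    // directly, so two recipientSyncData commands carrying the same
    // returnAfterReachingTimestamp always compute identical
    // rejectReadsBeforeTimestamp values and the second state-doc update
    // reliably becomes a no-op.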
