[SERVER-51246] Write a noop into the oplog buffer after each batch to ensure tenant applier reaches stop timestamp Created: 30/Sep/20  Updated: 29/Oct/23  Resolved: 29/Oct/20

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.9.0

Type: Task Priority: Major - P3
Reporter: Judah Schvimer Assignee: Judah Schvimer
Resolution: Fixed Votes: 0
Labels: pm-1791_milestone-B
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-51227 Make find/getMore cmd with $_requestR... Closed
is depended on by SERVER-51734 Enable tenant migration recipient tes... Closed
Related
related to SERVER-61440 Race in tenant_migration_recipient_cu... Closed
is related to SERVER-52628 Tenant migration recipient can give a... Closed
is related to SERVER-49897 Insert no-op entries into oplog buffe... Closed
Backwards Compatibility: Fully Compatible
Sprint: Repl 2020-10-05, Repl 2020-10-19, Repl 2020-11-02
Participants:

 Description   

Consider the following scenario:

  1. We start migrating tenant X
  2. The migration sets a start timestamp of TS(100)
  3. When the tenant cloners complete, the last write on the donor for tenant X is at TS(90) and the last write for tenant Y is at TS(150)
  4. TS(150) is the read concern majority optime on the donor, and thus is the ‘lastVisibleOpTime’ that the recipient receives. The recipient thus sets its ‘stopTimestamp’ to TS(150).
  5. The last oplog entry fetched on the recipient is at TS(90)

The recipient will never apply an oplog entry with a timestamp greater than or equal to TS(150), and thus will never think it’s consistent.

To fix this, we make sure that the tenant oplog applier writes a noop oplog entry into its oplog buffer whenever it receives a batch. We must be careful, however, that the timestamp of this noop entry is not too high. If the recipient wrote the ‘lastVisibleOpTime’ as a noop while it was lagged, that noop could make the recipient appear more up to date than it actually is. The correct value is the “latest oplog timestamp the donor sees when doing its oplog query”. This is exactly what the TRACK_LATEST_OPLOG_TS query parameter includes in the query response, with the postBatchResumeToken.
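A minimal sketch of this idea in Python, not the server's actual implementation. All names (`OplogEntry`, `Batch`, `latest_oplog_ts`, `buffer_batch`, `reached_stop`) are hypothetical, and timestamps are simplified to plain integers. It replays the scenario above with and without the noop, and adds a lagged fetch to show why the noop must carry the timestamp the donor's query actually saw rather than the ‘lastVisibleOpTime’.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class OplogEntry:
    ts: int                # simplified timestamp (assumption: plain integer)
    is_noop: bool = False


@dataclass
class Batch:
    entries: List[OplogEntry]  # tenant-filtered entries returned by the donor
    latest_oplog_ts: int       # hypothetical field: latest oplog timestamp the donor's query saw


def buffer_batch(buffer: List[OplogEntry], batch: Batch, write_noop: bool) -> None:
    """Append a fetched batch to the buffer, optionally followed by a noop."""
    buffer.extend(batch.entries)
    if write_noop:
        buffer.append(OplogEntry(ts=batch.latest_oplog_ts, is_noop=True))


def reached_stop(buffer: List[OplogEntry], stop_ts: int) -> bool:
    """True once some buffered entry is at or past the stop timestamp."""
    return any(entry.ts >= stop_ts for entry in buffer)


STOP_TS = 150  # the recipient's stop timestamp in the scenario

# Tenant X's last write is at TS(90); the donor's query scanned up to TS(150).
caught_up = Batch(entries=[OplogEntry(ts=90)], latest_oplog_ts=150)

without_noop: List[OplogEntry] = []
buffer_batch(without_noop, caught_up, write_noop=False)

with_noop: List[OplogEntry] = []
buffer_batch(with_noop, caught_up, write_noop=True)

# A lagged fetch whose query only saw up to TS(120): the noop stays at 120,
# so the recipient does not wrongly conclude it has reached TS(150).
lagged = Batch(entries=[OplogEntry(ts=90)], latest_oplog_ts=120)
lagged_buffer: List[OplogEntry] = []
buffer_batch(lagged_buffer, lagged, write_noop=True)
```

Under these assumptions, only `with_noop` satisfies `reached_stop(..., STOP_TS)`; the buffer without a noop stalls at TS(90), and the lagged buffer stops at TS(120).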

We write these noops for empty batches as well, since it should be simple to ignore duplicate timestamps in the oplog buffer, and doing so ensures the recipient does not need to rescan, on recovery, oplog entries that it previously filtered out.
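One way the duplicate timestamps could be ignored is sketched below in Python. This is an illustration under assumptions, not the applier's real logic: the `(timestamp, is_noop)` tuple shape and the `drop_redundant_noops` helper are hypothetical. A noop is kept only if it advances past the highest timestamp already seen, so repeated noops from consecutive empty batches collapse to one.

```python
from typing import Iterable, List, Optional, Tuple

# Each buffered item is (timestamp, is_noop); both fields are simplified stand-ins.
Entry = Tuple[int, bool]


def drop_redundant_noops(entries: Iterable[Entry]) -> List[Entry]:
    """Keep all real entries; keep a noop only if it advances the last timestamp seen."""
    result: List[Entry] = []
    last_ts: Optional[int] = None
    for ts, is_noop in entries:
        if is_noop and last_ts is not None and ts <= last_ts:
            continue  # duplicate or stale noop carries no new information
        result.append((ts, is_noop))
        last_ts = ts if last_ts is None else max(last_ts, ts)
    return result


# Two empty batches in a row both yield a noop at TS(150); a later batch
# yields a real entry at TS(160) followed by a noop at the same timestamp.
buffered = [(90, False), (150, True), (150, True), (160, False), (160, True)]
deduped = drop_redundant_noops(buffered)
```

Here the second noop at TS(150) and the noop at TS(160) are dropped, while every real entry is preserved.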

Resharding faces an analogous problem, but is solving it in aggregation since it uses aggregation rather than find commands. We must correctly expose this resume token for find commands in SERVER-51227, and then write and process the noops in this ticket.



 Comments   
Comment by Githook User [ 29/Oct/20 ]

Author:

{'name': 'Judah Schvimer', 'email': 'judah@mongodb.com', 'username': 'judahschvimer'}

Message: SERVER-51246 Write a noop into the oplog buffer after each batch to ensure tenant applier reaches stop timestamp
Branch: master
https://github.com/mongodb/mongo/commit/0ccc9275efc3e4c36850bd4cc297c90152b7a7e6
