Consider the following scenario:
- We start migrating tenant X
- The migration sets a start timestamp of TS(100)
- When the tenant cloners complete, the last write on the donor for tenant X is at TS(90), while the last write for tenant Y is at TS(150)
- TS(150) is the read concern majority optime on the donor, and thus is the 'lastVisibleOpTime' that the recipient receives. The recipient therefore sets its 'stopTimestamp' to TS(150)
- The last oplog entry fetched on the recipient is at TS(90)
As a result, the recipient will never apply an oplog entry with a timestamp greater than or equal to TS(150), and thus will never consider itself consistent.
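The stall in the scenario above can be sketched as follows. This is an illustrative model, not actual server code; the function name and timestamps are hypothetical stand-ins for the recipient's consistency check.

```python
# Hypothetical model of the recipient's consistency check: the recipient
# declares itself consistent only once it has applied an oplog entry with
# a timestamp at or beyond its stopTimestamp.
def recipient_is_consistent(applied_through_ts, stop_ts):
    return applied_through_ts >= stop_ts

stop_ts = 150          # stopTimestamp, taken from the donor's lastVisibleOpTime
last_tenant_x_ts = 90  # highest tenant-X oplog entry the recipient ever fetches

# Without noop entries, the applier's applied-through timestamp is capped
# at TS(90), so the recipient stalls forever short of consistency:
assert not recipient_is_consistent(last_tenant_x_ts, stop_ts)
```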
To fix this, we make sure that the tenant oplog applier writes a noop oplog entry into its oplog buffer whenever it receives a batch. We must be careful, however, that this noop's timestamp is not too high. If the recipient wrote the 'lastVisibleOpTime' as a noop, then a lagged recipient's noop could make it appear more up to date than it actually is. The correct value is the latest oplog timestamp the donor sees when running its oplog query. This is exactly what the TRACK_LATEST_OPLOG_TS query parameter includes in the query response, via the postBatchResumeToken.
We write these noops for empty batches as well: ignoring duplicate timestamps in the oplog buffer should be simple, and doing so ensures that on recovery the recipient does not need to rescan oplog entries it previously filtered out.
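The batch handling described above can be sketched like this. Again, this is a hedged illustration under assumed names (`push_batch`, a list-of-dicts buffer), not the server's actual oplog buffer API: each fetched batch is followed by a noop carrying the donor's postBatchResumeToken timestamp, and duplicate timestamps from empty batches are simply skipped.

```python
# Illustrative sketch: append a fetched batch to the oplog buffer, then
# write a noop at the donor's postBatchResumeToken timestamp, skipping
# the noop only when that timestamp is already present (e.g. after an
# empty batch that returned the same resume token).
def push_batch(buffer, batch, post_batch_resume_ts):
    buffer.extend(batch)
    if not buffer or buffer[-1]["ts"] < post_batch_resume_ts:
        buffer.append({"op": "n", "ts": post_batch_resume_ts})

buf = []
push_batch(buf, [{"op": "i", "ts": 90}], 150)  # real batch, noop written at TS(150)
push_batch(buf, [], 150)                       # empty batch, duplicate noop skipped
```

With the noop in the buffer, the applier's applied-through timestamp can advance to TS(150), so the recipient reaches its stopTimestamp and becomes consistent even though tenant X's last real write was at TS(90).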
Resharding faces an analogous problem, but solves it in aggregation, since it uses aggregation rather than find commands. We must correctly expose this resume token for find commands in SERVER-51227, and then write and process the noops in this ticket.