- Type: Task
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Fully Compatible
- Server Serverless 2023-03-06, Server Serverless 2023-03-20, Server Serverless 2023-04-03, Server Serverless 2023-04-17, Server Serverless 2023-05-01
Supporting pre-images in the event of a failover after the completion of a Shard Merge is difficult. Consider the following scenario:
- During the oplog catchup phase of a shard merge, TenantOplogApplier applies a write on the recipient primary. The write carries a donor timeline timestamp of TS = 100, and a pre-image is inserted at that same timestamp via writeChangeStreamPreImage in db/repl/oplog.cpp. The write operation itself is logged on the recipient timeline at TS = 150.
- ChangeStreamPreImageCollectionManager::insertPreImage is then called via the op observer as a result of the applied write above, but with an optime on the recipient timeline at TS = 150.
- On the recipient secondary, replication happens for the write at TS = 150, resulting in writeChangeStreamPreImage being called, this time on the recipient timeline at TS = 150.
At the end of this sequence of events, the primary holds two pre-image entries for the same write, identical except for their timestamps (TS = 100 and TS = 150), while the secondary holds a single pre-image entry at TS = 150. If we attempt to resume a change stream on the recipient primary with a resume token from before the migration, the resume succeeds because the pre-image entry on the donor timeline exists. The same resume fails on the secondary because only the recipient timeline pre-image entry exists there.
Ideally, we should also try to fix the "duplicate" pre-image entry issue described above so that the entries on primary and secondaries are consistent.
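The scenario above can be replayed with a toy model to make the primary/secondary divergence concrete. This is only an illustrative sketch, not server code: the `Node` class, its methods, and the `"doc1"` identifier are hypothetical, and the real pre-image collection is keyed on more than a timestamp and a document id.

```python
from dataclasses import dataclass, field

# Timestamps taken from the scenario in this ticket.
DONOR_TS = 100      # donor timeline timestamp of the original write
RECIPIENT_TS = 150  # recipient timeline timestamp of the re-logged write


@dataclass
class Node:
    """Hypothetical stand-in for one replica set member's pre-image collection."""
    name: str
    pre_images: set = field(default_factory=set)  # entries are (ts, doc_id)

    def insert_pre_image(self, ts: int, doc_id: str) -> None:
        self.pre_images.add((ts, doc_id))

    def can_resume(self, ts: int, doc_id: str) -> bool:
        # Simplification: a resume succeeds only if a pre-image exists
        # at the timestamp the resume token refers to.
        return (ts, doc_id) in self.pre_images


primary = Node("recipient-primary")
secondary = Node("recipient-secondary")

# Step 1: the oplog applier writes the pre-image at the donor timestamp.
primary.insert_pre_image(DONOR_TS, "doc1")
# Step 2: the op observer writes it again at the recipient timestamp.
primary.insert_pre_image(RECIPIENT_TS, "doc1")
# Step 3: replication on the secondary only sees the recipient timestamp.
secondary.insert_pre_image(RECIPIENT_TS, "doc1")

for node in (primary, secondary):
    print(node.name, sorted(node.pre_images),
          "resume at donor TS:", node.can_resume(DONOR_TS, "doc1"))
```

In this model the primary ends up with two entries and the secondary with one, so a resume token that points at the donor timestamp only resolves on the primary. A fix would need both members to end up with the same set of entries.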