[SERVER-79024] Avoid deleting pre-image/change collection entries before allDurable/lastApplied timestamps Created: 17/Jul/23  Updated: 06/Sep/23  Resolved: 06/Sep/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Jordi Olivares Provencio Assignee: Jordi Olivares Provencio
Resolution: Won't Do Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Assigned Teams:
Storage Execution EMEA
Backport Requested:
v7.1, v7.0, v6.0
Sprint: Execution EMEA Team 2023-09-04, Execution EMEA Team 2023-09-18
Participants:

 Description   

As change collections and pre-images are implicitly replicated collections, only deletes are propagated through the oplog; the inserts happen implicitly on each node. This leads to an oplog applier race between the delete and the insert, since both operations could be applied at the same time. To avoid this issue, we should make sure that an entry is deleted only once the lastApplied/allDurable timestamp has advanced past it. That way we can ensure that the insert will always occur before the delete.

Note that this only occurs with replicated deletes. For unreplicated truncates, this issue was already solved in SERVER-77972.
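
The rule described above can be sketched as follows. This is a minimal illustration, not the server's implementation: the names `Entry`, `deletable`, `expired_up_to` and `last_applied` are hypothetical, and timestamps are modelled as plain integers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    ts: int  # hypothetical logical timestamp at which the entry was inserted


def deletable(entries, expired_up_to, last_applied):
    """Return the entries that are both expired and already applied.

    expired_up_to: latest timestamp considered expired (assumption).
    last_applied:  stand-in for the lastApplied/allDurable timestamp.
    """
    # Never delete past last_applied, so the insert is known to have happened.
    cutoff = min(expired_up_to, last_applied)
    return [e for e in entries if e.ts <= cutoff]


entries = [Entry(1), Entry(5), Entry(9)]
safe = deletable(entries, expired_up_to=10, last_applied=5)
```

Here the entry at timestamp 9 is expired but is left alone, because its insert may not have been applied yet.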



 Comments   
Comment by Jordi Olivares Provencio [ 06/Sep/23 ]

Closing this as Won't Do. This will be fixed in SERVER-80791 by ensuring implicitly replicated collections do not cause data consistency issues.

Comment by Jordi Olivares Provencio [ 31/Aug/23 ]

Requesting backports to 6.0 since this affects pre-image correctness.

Comment by Jordi Olivares Provencio [ 24/Jul/23 ]

I've rewritten the ticket, as we realised that the original approach only kicked the can down the road. As SERVER-77972 already fixed this issue for unreplicated deletes, we will retarget this ticket to replicate that behaviour for replicated deletes.

Comment by Josef Ahmad [ 18/Jul/23 ]

This is only a problem when the TTL period of these internal tables is set to an unreasonably low value (less than a few seconds). A customer choosing to do so is more likely to experience more fundamental problems – e.g. change stream falling off 'the oplog' – than the inconsistency described here.

Comment by Josef Ahmad [ 18/Jul/23 ]

This can also be a problem in the absence of lagging secondaries when using an unreasonably low expireAfterSeconds approximating an oplog hole's duration. With cleanup based on (replicated) multi-deletes and expireAfterSeconds=0, the secondary races the application of replicated deletes with the (implicit) application of inserts to these change stream tables: because the inserts are implicit, there's no way for the applier threads to serialise deletes with inserts on these tables. As a result, we've observed these tables become inconsistent. Jordi's proposal to set a reasonable floor (10s) to expireAfterSeconds eradicates this corner case.
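
The floor proposed in this comment could look roughly like the sketch below. It is an assumption-laden illustration only: `MIN_EXPIRE_AFTER_SECONDS` and `effective_expire_after_seconds` are hypothetical names, and only the 10-second value comes from the comment.

```python
# Floor taken from the comment above; the constant name is hypothetical.
MIN_EXPIRE_AFTER_SECONDS = 10


def effective_expire_after_seconds(configured):
    """Clamp a configured expireAfterSeconds value to the proposed floor."""
    return max(configured, MIN_EXPIRE_AFTER_SECONDS)
```

With such a clamp, a configured value of 0 would be treated as 10 seconds, while larger values pass through unchanged.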
