[SERVER-61933] Interleaved operations during sharded collection drop violate change stream guarantees Created: 06/Dec/21  Updated: 01/Feb/23

Status: Backlog
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Bernard Gorman Assignee: Backlog - Query Execution
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-61026 Invert order of shards receiving drop... Closed
Assigned Teams:
Query Execution
Sprint: Sharding EMEA 2021-12-27, Sharding EMEA 2022-01-10
Participants:

 Description   

Following a conversation with max.hirschhorn, it appears that while dropping a sharded collection, it is possible for CRUD or DDL events to interleave with the collection drops from the individual shards, based on their respective clusterTimes:

        Shard 1    Shard 2    Shard 3
  T1    drop
  T2               insert
  T3               drop
  T4                          drop

In the above scenario, a single-collection change stream will break in two significant ways.

In order for change stream semantics to remain robust through collection drops, there cannot be any CRUD or DDL events on that collection at or later than the clusterTime of the earliest drop event across all shards. Max tells me that this is already the case for sharded renameCollection, and that it may be possible to apply a similar approach to drop. The individual collection drops performed by the dropDatabase command should similarly obey the above constraints.
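
As a rough consumer-side illustration of this constraint (not part of the original report), the sketch below uses PyMongo to watch for events on the collection and flag anything that arrives at or after the first observed drop. The connection string and namespace are placeholders, and a database-level stream is used only because a single-collection stream would be invalidated by the drop itself:

    # Sketch only: flags any CRUD/DDL event on the namespace that is observed at or
    # after the clusterTime of the first drop event -- the violation described above.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")    # mongos of the sharded cluster
    WATCHED_NS = {"db": "test", "coll": "coll"}          # placeholder namespace

    first_drop_time = None
    with client["test"].watch() as stream:               # database-level change stream
        for event in stream:
            if event.get("ns") != WATCHED_NS:
                continue
            op = event["operationType"]
            cluster_time = event["clusterTime"]          # bson.Timestamp of the event
            if op == "drop":
                if first_drop_time is None:
                    first_drop_time = cluster_time
            elif first_drop_time is not None and cluster_time >= first_drop_time:
                print("constraint violated:", op, "at", cluster_time)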



 Comments   
Comment by Jordi Serra Torrens [ 11/Jan/22 ]

> For whether it is guaranteed to happen before any new events on the same namespace: I am certain that we must have such a guarantee, because without it we could potentially wipe out a recreated collection after a step-down in the middle of a drop. jordi.serra-torrens, do you happen to know how exactly we ensure that? Is it a matter of us not releasing the critical section before we have deleted the coordinator document?

ShardingDDLCoordinator first removes the coordinator document for the drop and then releases the distlocks. This guarantees that the coordinator document is indeed removed before any other coordinator for that ns can run. However, I'm having a hard time seeing how we prevent an implicit (unsharded) collection creation from sneaking in at that point, since we don't even take the critical section in the dropCollection coordinator.
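
A loose sketch of that teardown ordering (illustrative only; the real ShardingDDLCoordinator is C++ server code, and every name below is invented for the purpose of the example):

    # Illustrative ordering only, not the real implementation: participants finish and
    # majority-commit, then the coordinator document is removed, and only then are the
    # DDL locks released, so no other coordinator for the same namespace can start
    # in between.
    def run_drop_coordinator(ns, participant_shards, ddl_lock, coordinator_docs):
        ddl_lock.acquire(ns)                  # serializes DDL on this namespace
        try:
            for shard in participant_shards:
                shard.local_drop(ns)          # each shard drops and majority-commits
            coordinator_docs.remove(ns)       # durable marker that the DDL completed
        finally:
            ddl_lock.release(ns)              # released only after the document is gone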

Comment by Kaloian Manassiev [ 11/Jan/22 ]

bernard.gorman, apologies for the delayed reply. The DDL coordinator doesn't emit any special event to mark the end of the operation, because we never thought we would need to. However, the last thing every coordinator does is delete the coordinator document. This event is guaranteed to happen after all participant operations (e.g., the drop events on the shards) have completed and been majority committed.

For whether it is guaranteed to happen before any new events on the same namespace: I am certain that we must have such a guarantee, because without it we could potentially wipe out a recreated collection after a step-down in the middle of a drop. jordi.serra-torrens, do you happen to know how exactly we ensure that? Is it a matter of us not releasing the critical section before we have deleted the coordinator document?

Comment by Bernard Gorman [ 21/Dec/21 ]

Hey kaloian.manassiev: does the DDL coordinator emit a separate oplog event to mark the end of the entire operation? If so, then I think this could work, assuming that both of the following points are guaranteed:

  • This event will be written into the oplog with a later optime than the drop events on any of the shards
  • This event will appear in the stream BEFORE any further operations on the same namespace - e.g. it's not possible for a collection re-creation to sneak in between the final drop and the DDL co-ordinator end-of-operation event.

Alternatively, we could mark the final drop event with some special field in the oplog entry, e.g. {o2: {finalDropAcrossShards: true}}.
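
If such a marker were eventually surfaced on the change event itself (purely hypothetical; neither the field nor this event shape exists today), a consumer waiting for the collection to be fully dropped might look something like:

    # Hypothetical: assumes the proposed marker is exposed on the drop change event.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")    # placeholder connection string
    with client["test"]["coll"].watch() as stream:
        for event in stream:
            if event["operationType"] == "drop" and event.get("finalDropAcrossShards"):
                # Only now have all shards completed their local drops.
                print("collection fully dropped at", event["clusterTime"])
                break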

Comment by Kaloian Manassiev [ 16/Dec/21 ]

bernard.gorman, the concurrency behaviour of the sharded DDL operations was decided as part of the scope for PM-1965, and it explicitly specified the behaviour that you are describing.

If this is a problem for change streams, we have two ways to address it:

  1. We make every DDL operation 2-Phase with respect to the critical section (which is really expensive just for change streams).
  2. We make change streams listen for the "end" event from the DDL coordinators and only use this for drop/rename/etc events.

I slightly prefer that we do (2). What do you think?

Comment by Max Hirschhorn [ 07/Dec/21 ]

> Max tells me that this is already the case for sharded renameCollection, and that it may be possible to apply a similar approach to drop.

I looked back over the code and saw that the _shardsvrRenameCollectionParticipant command (1) acquires the critical section to block writes, (2) acquires the critical section to block reads, and (3) locally renames (or renames + drops) the collection. The acquisition of the critical section isn't synchronized across all shards before any of them performs the local rename. This means rename change events face the same issue Bernard described for drop change events.
