[SERVER-84361] Check that CreateCollectionCoordinatorLegacy terminates on error Created: 21/Dec/23  Updated: 22/Jan/24  Resolved: 22/Jan/24

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Romans Kasperovics Assignee: Pol Pinol
Resolution: Works as Designed Votes: 0
Labels: car-investigation
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Catalog and Routing
Sprint: CAR Team 2024-01-22
Participants:

 Description   

Check that throwing in CreateCollectionCoordinatorLegacy::_runImpl() before the coordinator document is persisted by _buildPhaseHandler() does not lead to an undead/zombie coordinator instance that is never removed by PrimaryOnlyServiceOpObserver.



 Comments   
Comment by Pol Pinol [ 22/Jan/24 ]

I ran a custom test that triggers an exception in the first phase of the legacy coordinator, which runs without buildPhaseHandler, and compared the results with throwing an exception in the following phase, kCommit, which is registered by the buildPhaseHandler.

When an exception is thrown, both runs continue into the .onCompletion phase of the future chain. After that, because they must return an error, a cleanup is performed. The same log traces appear in both runs, which confirms that resources are being released.

The only difference between the two runs is where the coordinator instance is removed from the registry (config.system.sharding_ddl_coordinators).

If the exception is thrown in the first phase, before the buildPhaseHandler has executed, the coordinator document will not exist, and this will be executed to remove the instance.

On the other hand, if the buildPhaseHandler has been executed, we will delete the coordinator document, and the PrimaryOnlyServiceOpObserver will be responsible for removing the instance from the registry.

Finally, both runs release the remaining resources, i.e. DDL locks.
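
For illustration only, the two cleanup paths described above can be modelled with a small self-contained sketch. Everything in it (ToyService, runToyCoordinator, ThrowAt, and the three sets) is a hypothetical stand-in invented for this example and not the server's real classes or APIs. It only assumes what the runs showed: with no persisted document the instance is removed directly, otherwise deleting the document makes the observer remove the instance.

```cpp
#include <set>
#include <stdexcept>
#include <string>

// Hypothetical toy model; these are not MongoDB's real classes or APIs.
struct ToyService {
    std::set<std::string> persistedDocs;  // stand-in for coordinator documents
    std::set<std::string> instances;      // stand-in for the in-memory instance registry
    std::set<std::string> ddlLocks;       // stand-in for held DDL locks

    // Stand-in for the op observer: deleting a document also drops its instance.
    void deleteDoc(const std::string& id) {
        if (persistedDocs.erase(id) > 0) {
            instances.erase(id);
        }
    }

    // Stand-in for the direct removal used when no document was ever persisted.
    void releaseInstance(const std::string& id) { instances.erase(id); }
};

enum class ThrowAt { kFirstPhase, kCommit };

// Runs one toy coordinator that fails at the requested point, then cleans up.
// Returns true if the failure was observed by the completion step.
bool runToyCoordinator(ToyService& svc, const std::string& id, ThrowAt where) {
    svc.instances.insert(id);
    svc.ddlLocks.insert(id);
    bool failed = false;
    try {
        if (where == ThrowAt::kFirstPhase) {
            throw std::runtime_error("failure before the document is persisted");
        }
        svc.persistedDocs.insert(id);  // the phase handler persists the document first
        throw std::runtime_error("failure in the commit phase");
    } catch (const std::runtime_error&) {
        failed = true;
    }
    // Completion step: choose the removal path based on whether a document exists.
    if (svc.persistedDocs.count(id) > 0) {
        svc.deleteDoc(id);        // observer path
    } else {
        svc.releaseInstance(id);  // direct path
    }
    svc.ddlLocks.erase(id);  // both paths release the locks last
    return failed;
}
```

In this model, calling runToyCoordinator with either ThrowAt value leaves instances, persistedDocs, and ddlLocks empty, which mirrors the observation that neither path leaves a zombie instance behind.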

To summarize, although the two paths use different implementations to remove the instance, I don’t see a place where throwing without installing the coordinator document can lead to zombie coordinator instances. If there are no other concerns, I'm closing this ticket.
