[SERVER-59328] Renaming a sharded collection might hang if the destination collection is previously dropped Created: 12/Aug/21  Updated: 31/Aug/21  Resolved: 31/Aug/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Marcos José Grillo Ramirez Assignee: Pierlauro Sciarelli
Resolution: Won't Do Votes: 0
Labels: sharding-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Operating System: ALL
Sprint: Sharding EMEA 2021-08-23, Sharding EMEA 2021-09-06
Participants:
Linked BF Score: 140

 Description   

When renaming a sharded collection, there is a phase in which CRUD operations are re-enabled on all involved shards by sending them a command. However, a previous drop of the destination collection can create a scenario in which the next refresh on the recipient fails with a "QueryPlanKilled" error. Because a retry policy is in place, the command never terminates. We could prevent the error the same way it is done in the first phase, or by checking for QueryPlanKilled errors specifically.
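The hang can be illustrated with a small Python model. This is only a sketch, not the server's C++ code: `run_with_retries`, `make_failing_refresh` and both policy functions are hypothetical names, and `max_attempts` exists only so the demonstration terminates. Treating QueryPlanKilled as non-retriable is one assumed reading of "checking for QueryPlanKilled errors specifically".

```python
class QueryPlanKilled(Exception):
    """Stand-in for the server's QueryPlanKilled error (toy model only)."""


def run_with_retries(op, *, is_retriable, max_attempts=None):
    """Call op() until it succeeds or raises an error the policy rejects.

    With max_attempts=None and a policy that accepts every error, this loops
    forever when op() fails deterministically, which models the hang.
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            return op()
        except Exception as exc:
            if not is_retriable(exc):
                raise  # the policy says retrying cannot help
            if max_attempts is not None and attempt >= max_attempts:
                raise  # demo-only safety cap


def retry_everything(exc):
    """Assumed shape of the problematic policy: every error is retried."""
    return True


def retry_unless_query_plan_killed(exc):
    """Assumed mitigation: single out QueryPlanKilled and stop retrying."""
    return not isinstance(exc, QueryPlanKilled)


def make_failing_refresh():
    """Build a fake refresh that always fails, plus a call counter."""
    calls = {"n": 0}

    def refresh():
        calls["n"] += 1
        raise QueryPlanKilled("destination collection was dropped")

    return refresh, calls


if __name__ == "__main__":
    for policy in (retry_everything, retry_unless_query_plan_killed):
        refresh, calls = make_failing_refresh()
        try:
            run_with_retries(refresh, is_retriable=policy, max_attempts=5)
        except QueryPlanKilled:
            print(policy.__name__, "gave up after", calls["n"], "attempt(s)")
```

Without the cap, the first policy would never return, because each retry hits the same error.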



 Comments   
Comment by Pierlauro Sciarelli [ 31/Aug/21 ]

Closing in favor of SERVER-40865 that will solve the underlying issue.

Comment by Pierlauro Sciarelli [ 24/Aug/21 ]

The error has the same root cause as many failures waiting for the bug fix in SERVER-40865. The real problem is that a catalog cache refresh can race with a catalog cache "drop" (when the epoch changes).

I think we may simply add a call to waitForCollectionFlush after renaming the collection in each participant.

I believe it's much safer than removing the DDL coordinator document, which could cause the coordinator to shut down in an unclean way (e.g. without unblocking all participants).
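The ordering proposed in this comment (rename on the participant, wait for the flush, then acknowledge) can be sketched with a toy Python model. It is not the server implementation: `CatalogCacheModel`, `schedule_flush` and `rename_on_participant` are hypothetical names, `wait_for_collection_flush` is only loosely named after waitForCollectionFlush, and the timer merely simulates an asynchronous flush.

```python
import threading


class CatalogCacheModel:
    """Toy model of a shard's catalog cache with asynchronous flushes."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = {}  # namespace -> list of threading.Event

    def schedule_flush(self, ns):
        """Register a pending flush for ns; the caller sets the event when done."""
        done = threading.Event()
        with self._lock:
            self._pending.setdefault(ns, []).append(done)
        return done

    def wait_for_collection_flush(self, ns, timeout=None):
        """Block until every flush currently pending for ns has completed."""
        with self._lock:
            events = list(self._pending.get(ns, []))
        for event in events:
            if not event.wait(timeout):
                return False  # timed out waiting for a flush
        with self._lock:
            remaining = [e for e in self._pending.get(ns, []) if not e.is_set()]
            self._pending[ns] = remaining
        return True


def rename_on_participant(cache, target_ns, log):
    """Rename, then wait for the flush before acknowledging (assumed ordering)."""
    log.append("renamed")
    flush_done = cache.schedule_flush(target_ns)

    def finish_flush():
        log.append("flushed")
        flush_done.set()

    # Simulate the flush completing slightly later on another thread.
    worker = threading.Timer(0.05, finish_flush)
    worker.start()
    ok = cache.wait_for_collection_flush(target_ns, timeout=5)
    log.append("acknowledged")
    worker.join()
    return ok


if __name__ == "__main__":
    events = []
    rename_on_participant(CatalogCacheModel(), "db.target", events)
    print(events)
```

In this model the participant acknowledges only after the flush has been recorded, so a refresh that starts after the acknowledgement cannot overlap with that flush.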

Generated at Thu Feb 08 05:46:57 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.