[SERVER-59328] Renaming a sharded collection might hang if the destination collection is previously dropped Created: 12/Aug/21 Updated: 31/Aug/21 Resolved: 31/Aug/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Marcos José Grillo Ramirez | Assignee: | Pierlauro Sciarelli |
| Resolution: | Won't Do | Votes: | 0 |
| Labels: | sharding-wfbf-day | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||
| Operating System: | ALL | ||||
| Sprint: | Sharding EMEA 2021-08-23, Sharding EMEA 2021-09-06 | ||||
| Participants: | |||||
| Linked BF Score: | 140 | ||||
| Description |
|
When renaming a sharded collection, there is a phase in which CRUD operations are re-enabled on all the involved shards by sending them a command. If the destination collection was dropped beforehand, the next refresh on the recipient can fail with a "QueryPlanKilled" error, and because a retry policy is in place, the command never terminates. We could prevent the error the same way the first phase does, or check for QueryPlanKilled errors specifically. |
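The second option (checking for QueryPlanKilled specifically) can be illustrated with a minimal sketch. This is not the server's code: `run_with_retries`, both policy functions, and the exception classes are hypothetical stand-ins, and the attempt cap exists only so that the "retry everything" case terminates in the demo.

```python
class QueryPlanKilled(Exception):
    """Hypothetical stand-in for the server's QueryPlanKilled error."""


class TransientError(Exception):
    """Hypothetical stand-in for an error that is worth retrying."""


def run_with_retries(command, is_retriable, max_attempts):
    """Call command() until it succeeds, fails with a non-retriable
    error, or uses up max_attempts. Returns (result, attempts_used)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return command(), attempt
        except Exception as exc:
            # Give up on non-retriable errors or on the last attempt.
            if not is_retriable(exc) or attempt == max_attempts:
                raise


def retry_everything(exc):
    """Policy that retries any error (the behaviour that hangs)."""
    return True


def retry_except_query_plan_killed(exc):
    """Policy that treats QueryPlanKilled as terminal (assumed fix)."""
    return not isinstance(exc, QueryPlanKilled)


def make_failing_refresh(calls):
    """Build a fake refresh that always fails with QueryPlanKilled."""
    def refresh():
        calls.append(1)
        raise QueryPlanKilled("refresh interrupted by an earlier drop")
    return refresh


def count_attempts(policy, max_attempts=5):
    """Return how many times the failing refresh ran under a policy."""
    calls = []
    try:
        run_with_retries(make_failing_refresh(calls), policy, max_attempts)
    except QueryPlanKilled:
        pass
    return len(calls)
```

Under these assumptions, `count_attempts(retry_everything)` uses every allowed attempt, which corresponds to the hang when no cap exists, while `count_attempts(retry_except_query_plan_killed)` stops after the first failure.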
| Comments |
| Comment by Pierlauro Sciarelli [ 31/Aug/21 ] |
|
Closing in favor of |
| Comment by Pierlauro Sciarelli [ 24/Aug/21 ] |
|
The error has the same root cause as many failures waiting for the bug fix on
I think we may simply add a call to waitForCollectionFlush after renaming the collection in each participant. I believe this is much safer than removing the DDL coordinator document, which could let the coordinator shut down in an unclean way (e.g. without unblocking all participants). |