[SERVER-31643] Make drop collection in concurrency workloads retry LockBusy errors Created: 19/Oct/17 Updated: 06/Dec/22 Resolved: 03/Aug/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Dianna Hohensee (Inactive) | Assignee: | [DO NOT USE] Backlog - Sharding Team |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Assigned Teams: | Sharding |
| Operating System: | ALL |
| Sprint: | Sharding 2017-11-13 |
| Participants: | |
| Linked BF Score: | 29 |
| Description |
|
Sharding metadata commands often take collection distributed locks (distlocks); drop collection on a sharded collection is one such command. The concurrency suites do not handle the LockBusy errors that can arise from concurrent metadata commands, and drop collection is the most common offender in our current concurrency tests. view_catalog_cycle_with_drop.js is a frequently failing workload; kill_aggregation.js, kill_rooted_or.js, and the drop_collection.js workload have failed as well. Concurrency workloads should not run drop collection commands themselves. Dropping a collection during setup between concurrency tests is fine, since there is no concurrency at that point, but not inside the workloads. UPDATE: Rather than removing drop collection from concurrency workloads, make LockBusy on drop collection/database commands retryable in the concurrency suites. See the comments for similar cases we've handled in the JS. |
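The retry approach proposed in the update could look roughly like the following. This is only a minimal sketch: `runDropWithRetry` and the inline `ErrorCodes` object are illustrative names, not the actual FSM test-library API, and the LockBusy code value is taken from the server's error code list.

```javascript
// Hypothetical sketch: retry a drop command when it fails with LockBusy.
// The helper name and ErrorCodes object are illustrative, not real FSM helpers.
const ErrorCodes = {LockBusy: 46};  // LockBusy's code in the server's error list.

function runDropWithRetry(runDropFn, maxAttempts = 10) {
    let lastError;
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            // Attempt the drop; on success, return its result immediately.
            return runDropFn();
        } catch (e) {
            if (e.code !== ErrorCodes.LockBusy) {
                // Only LockBusy is treated as transient distlock contention.
                throw e;
            }
            lastError = e;
        }
    }
    // Give up after maxAttempts consecutive LockBusy failures.
    throw lastError;
}
```

The key design point is that only LockBusy is swallowed: any other error still surfaces, so the suite keeps catching real bugs while tolerating distlock contention from concurrent metadata commands.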
| Comments |
| Comment by Janna Golden [ 03/Aug/18 ] |
|
This should be fixed by taking the NamespaceSerializer lock before taking the distlocks - marking as dupe of |
| Comment by Kaloian Manassiev [ 20/Oct/17 ] |
|
Dianna will figure out with Max whether this is something they can do in the passthrough infrastructure or if we should pick it up. |
| Comment by Dianna Hohensee (Inactive) [ 19/Oct/17 ] |
|
That seems reasonable. Another example is the ManualInterventionRequired error handling committed here. I'll update the ticket title / description. |
| Comment by Max Hirschhorn [ 19/Oct/17 ] |
|
If it is always safe to retry on a LockBusy error response, then I'd rather make that change. There's prior art for doing this in the concurrency suite with the DatabaseDropPending error response. The changes from b5014cc as part of |
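The DatabaseDropPending prior art Max mentions follows a pattern of allow-listing specific error codes as transient. A hedged sketch of that pattern, with illustrative names (the helper and the code values are assumptions for illustration, not the real concurrency-suite helpers):

```javascript
// Hypothetical sketch of an allow-list of transient error codes that a
// command-assertion helper retries on. Names and values are illustrative.
const kRetryableCodes = new Set([
    46,   // LockBusy
    215,  // DatabaseDropPending
]);

function assertCommandWorkedWithRetries(runCmd, maxAttempts = 5) {
    let lastRes;
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        lastRes = runCmd();
        if (lastRes.ok === 1) {
            return lastRes;  // Command succeeded.
        }
        if (!kRetryableCodes.has(lastRes.code)) {
            break;  // Non-transient failure: do not retry.
        }
    }
    throw new Error("command failed with code " + lastRes.code);
}
```

Adding LockBusy to such an allow-list would mirror how the DatabaseDropPending response was handled, rather than requiring per-workload changes.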
| Comment by Dianna Hohensee (Inactive) [ 19/Oct/17 ] |
|
max.hirschhorn, drop collection now takes two distlocks, which is new in v3.6. I would expect that to increase the frequency of LockBusy errors by slowing things down. It's fundamentally unsafe for the concurrency suites to run concurrent metadata commands without special handling for LockBusy errors, such as retrying them. |
| Comment by Max Hirschhorn [ 19/Oct/17 ] |
|
dianna.hohensee, did we change the behavior of the distributed lock in MongoDB 3.6? For example, could it be that we're no longer releasing it as quickly? I would appreciate some investigation/clarification as to why we only see these failures on the master branch, since the FSM workloads themselves are identical to the versions running on the 3.4 branch. |