[SERVER-31643] Make drop collection in concurrency workloads retry LockBusy errors Created: 19/Oct/17  Updated: 06/Dec/22  Resolved: 03/Aug/18

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Dianna Hohensee (Inactive) Assignee: [DO NOT USE] Backlog - Sharding Team
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-36322 NamespaceSerializer lock should be us... Closed
Related
related to SERVER-31668 checkUUIDsConsistentAcrossCluster mus... Closed
Assigned Teams:
Sharding
Operating System: ALL
Sprint: Sharding 2017-11-13
Participants:
Linked BF Score: 29

 Description   

Sharding metadata commands often take collection distributed locks, and dropping a sharded collection is one such command. The concurrency suites cannot (and do not) handle the LockBusy errors that can arise when metadata commands run concurrently, and drop collection appears more often than any other metadata command in our current concurrency tests.

view_catalog_cycle_with_drop.js is a frequently failing workload; kill_aggregation.js (and kill_rooted_or.js) have failed as well, as has the drop_collection.js workload.

Concurrency workloads should not be running drop collection commands. Dropping collections during setup between concurrency tests is fine, since nothing runs concurrently there, but drops do not belong in the workloads themselves.

UPDATE:

Rather than removing drop collection from concurrency workloads, make LockBusy errors from drop collection/database commands retryable in the concurrency suites. See the comments for similar cases we've handled in the JS.
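
A minimal sketch of what workload-level retry handling could look like, assuming a mongo shell context; the helper name, attempt bound, and backoff values below are illustrative, not the actual patch:

    // Illustrative only: retries a drop that fails with LockBusy because a
    // concurrent metadata command holds the collection distlock.
    function dropCollectionRetryOnLockBusy(db, collName, maxAttempts) {
        maxAttempts = maxAttempts || 10;
        for (let attempt = 1; attempt <= maxAttempts; attempt++) {
            const res = db.runCommand({drop: collName});
            if (res.ok === 1 || res.codeName === "NamespaceNotFound") {
                return res;  // Dropped, or the collection is already gone.
            }
            if (res.codeName !== "LockBusy") {
                throw new Error("unexpected drop failure: " + tojson(res));
            }
            sleep(100 * attempt);  // Back off and let the lock holder finish.
        }
        throw new Error("drop of " + collName + " still LockBusy after " +
                        maxAttempts + " attempts");
    }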



 Comments   
Comment by Janna Golden [ 03/Aug/18 ]

This should be fixed by taking the NamespaceSerializer lock before taking the distlocks - marking as dupe of SERVER-36322.

Comment by Kaloian Manassiev [ 20/Oct/17 ]

Dianna will figure out with Max whether this is something they can do in the passthrough infrastructure or whether we should pick it up.

Comment by Dianna Hohensee (Inactive) [ 19/Oct/17 ]

That seems reasonable. Another example is the ManualInterventionRequired error handling committed here. I'll update the ticket title / description.

Comment by Max Hirschhorn [ 19/Oct/17 ]

If it is always safe to retry on a LockBusy error response, then I'd rather make that change. There's prior art for doing this in the concurrency suite with the DatabaseDropPending error response. The changes from b5014cc as part of SERVER-31567 could probably be undone in that case, too.
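
A suite-level variant of the same idea, with a hypothetical predicate name, might look like the following sketch (this is not the actual DatabaseDropPending handling, just the shape a centralized check could take):

    // Hypothetical: a single place for the suite to decide which metadata
    // command errors are safe to retry rather than fail the workload.
    function isRetryableMetadataError(res) {
        return res.ok === 0 &&
               (res.codeName === "LockBusy" ||
                res.codeName === "DatabaseDropPending");
    }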

Comment by Dianna Hohensee (Inactive) [ 19/Oct/17 ]

max.hirschhorn, drop collection now takes two distlocks, which is new in v3.6. I would expect that to increase the frequency of LockBusy errors by slowing things down. It's fundamentally unsafe for the concurrency suites to run concurrent metadata commands without special handling for LockBusy errors, say by retrying them.

Comment by Max Hirschhorn [ 19/Oct/17 ]

dianna.hohensee, did we change the behavior of the distributed lock in MongoDB 3.6? For example, could it be the case that we're not releasing it as quickly anymore? I would appreciate some investigation/clarification as to why we only see these failures on the master branch, since the FSM workloads themselves are identical to the versions running on the 3.4 branch.
