[SERVER-70340] Range deleter can cause migration lock acquisition timeouts Created: 07/Oct/22 Updated: 30/Nov/22 Resolved: 01/Nov/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 6.0.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Allison Easton | Assignee: | Tommaso Tocci |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
| Operating System: | ALL |
| Steps To Reproduce: | Create a two-shard cluster. Insert all the data while the collection is unsharded, create 10 indexes, and then shard the collection. The data will be migrated from one shard to the other, and after the 15-minute range deletion timeout we will start seeing critical section releases fail on the donor shard because the critical section cannot be acquired. |
| Sprint: | Sharding EMEA 2022-10-17, Sharding EMEA 2022-10-31, Sharding EMEA 2022-11-14 |
| Participants: | |
| Case: | (copied to CRM) |
| Description |
|
This ticket is to investigate the lock timeouts caused by the range deleter on v6.0. This is a problem on 6.0 because of the acquisition of the ScopedRangeDeleterLock before running the deletion. That lock is necessary to prevent the FCV upgrade orphan counter code from setting incorrect counters on the range deletions. However, the problem we are running into here is that the ScopedRangeDeleterLock acquires the DBLock on the config database, which automatically acquires the GlobalLock. But we have already acquired the DBLock on the user database, which has already automatically acquired the GlobalLock. This double acquisition means that we are recursively locking the global lock, so the yield policy is replaced with NO_YIELD.

We also cannot change the ScopedRangeDeleterLock to never acquire the global lock, because some usages of the persistUpdatedNumOrphans function during migrations are not already holding the global lock. One option would be to change the ScopedRangeDeleterLock to acquire the GlobalLock conditionally based on whether we already hold it, something like replacing this line with the code below.
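The code snippet referenced here did not survive the export. Below is a minimal sketch of what a conditional global-lock acquisition could look like; it is an assumption, not the actual proposal. The class shape follows the 6.0 ScopedRangeDeleterLock as described above, but the member names and the Locker::isLocked() check are illustrative.

```cpp
#include <boost/optional.hpp>

#include "mongo/db/concurrency/d_concurrency.h"
#include "mongo/db/namespace_string.h"
#include "mongo/db/operation_context.h"

namespace mongo {

// Sketch only: take the GlobalLock conditionally so that the range deleter,
// which already holds the user-database DBLock (and therefore the global
// lock), does not acquire the global lock recursively and force NO_YIELD.
class ScopedRangeDeleterLock {
public:
    explicit ScopedRangeDeleterLock(OperationContext* opCtx) {
        // Only take the global lock if this operation does not hold it yet.
        // (Illustrative check; the exact Locker query to use is open.)
        if (!opCtx->lockState()->isLocked()) {
            _globalLock.emplace(opCtx, MODE_IX);
        }
        // Acquire the config database and config.rangeDeletions locks as the
        // 6.0 implementation does today. Note that Lock::DBLock still takes
        // the global lock implicitly, which is part of why a simple
        // conditional acquisition is not sufficient on its own.
        _configLock.emplace(opCtx, NamespaceString::kConfigDb, MODE_IX);
        _rangeDeletionLock.emplace(
            opCtx, NamespaceString::kRangeDeletionNamespace, MODE_X);
    }

private:
    boost::optional<Lock::GlobalLock> _globalLock;
    boost::optional<Lock::DBLock> _configLock;
    boost::optional<Lock::CollectionLock> _rangeDeletionLock;
};

}  // namespace mongo
```

How to handle the implicit global-lock acquisition inside Lock::DBLock, and whether isLocked() is the right predicate, is exactly the kind of complication discussed in the comments below.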
But it may be better to find a more general solution that also addresses the underlying problem of the ScopedRangeDeleterLock in
|
| Comments |
| Comment by Tommaso Tocci [ 01/Nov/22 ] |
|
Fixed by |
| Comment by Allison Easton [ 18/Oct/22 ] |
|
Some complications came up with the solution of acquiring the global lock conditionally. We found that to migrate the chunks in config.system.sessions, we would also need to make the db lock acquisition optional. And to keep the ordering of locking for collections, we would need to add special handling for the config.system.sessions collection in v6.0. Since there have also been some complications with |
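For concreteness, here is a rough sketch of what "making the db lock acquisition optional" could mean inside the conditional ScopedRangeDeleterLock sketched in the description. The isDbLockedForMode() check and the member name are assumptions for illustration, not the change that was actually prototyped.

```cpp
// Sketch only: when migrating chunks of config.system.sessions, the range
// deleter may already hold the DBLock on the config database, so this
// acquisition would also need to become conditional (with special handling
// to keep the lock ordering for collections, as noted in the comment above).
if (!opCtx->lockState()->isDbLockedForMode(NamespaceString::kConfigDb, MODE_IX)) {
    _configLock.emplace(opCtx, NamespaceString::kConfigDb, MODE_IX);
}
```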
| Comment by Andrew Witten (Inactive) [ 11/Oct/22 ] |
|
Yes I am! Thank you for sending that. |
| Comment by Garaudy Etienne [ 11/Oct/22 ] |
|
You're saying that we need PM-1699? |
| Comment by Andrew Witten (Inactive) [ 11/Oct/22 ] |
|
I want to mention that even after we fix this, the lock contention between the range deleter and chunk migration will continue to be a performance issue. Anytime we have an unsharded collection that we then decide to shard, there will be lots of writes on the source shard because of the range deleter (which acquires an IX lock on the collection), and also lots of reads because of the actual chunk migrations (each chunk migration critical section gets an S lock on the collection). The problem is that we don't have a granular critical section for chunk migrations: we acquire an S lock on the entire collection, even though a chunk comprises only a small fraction of the collection. I don't see why the range deleter should ever contend with chunk migrations. (We can only delete a range after it has been successfully migrated.) |
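To illustrate the contention described in this comment, here is a simplified sketch of the two conflicting acquisitions. This is not code from the server; the function names are made up and the real code paths are considerably more involved.

```cpp
#include "mongo/db/catalog_raii.h"                   // AutoGetCollection
#include "mongo/db/concurrency/lock_manager_defs.h"  // MODE_IX, MODE_S
#include "mongo/db/namespace_string.h"
#include "mongo/db/operation_context.h"

namespace mongo {

// Range deleter side (simplified): deleting an orphaned range writes to the
// user collection under an intent-exclusive (IX) collection lock.
void deleteOrphanedRange(OperationContext* opCtx, const NamespaceString& nss) {
    AutoGetCollection coll(opCtx, nss, MODE_IX);
    // ... run the batched deletes for the range ...
}

// Migration critical-section side (simplified): the donor takes a shared (S)
// lock on the whole collection, even though the chunk being migrated is only
// a small fraction of it. S conflicts with IX, so a long-running range
// deletion and the critical section block each other.
void enterMigrationCriticalSection(OperationContext* opCtx,
                                   const NamespaceString& nss) {
    AutoGetCollection coll(opCtx, nss, MODE_S);
    // ... commit the migration while writes to the collection are blocked ...
}

}  // namespace mongo
```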