[SERVER-47032] Long running exclusive locks can end up blocking all operations if all tickets are exhausted Created: 20/Mar/20 Updated: 27/Oct/23 Resolved: 01/Apr/20
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Index Maintenance, Replication, Sharding, Stability |
| Affects Version/s: | 3.6.13 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Kay Agahd | Assignee: | Carl Champain (Inactive) |
| Resolution: | Works as Designed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: |
| Issue Links: |
| Operating System: | ALL |
| Steps To Reproduce: | Here are some chronological traces: The command convertToCapped was executed at 11:54:13 and terminated at 13:15:34:
The index build started at 13:15:38 and finished at 17:58:25.
The first 17% of the index build took nearly 5 hours while the rest took only a few seconds. Or was the index build eventually killed by my killOp or killSession command, executed at 14:53:25?
At around 14:20 we notice that the read and write tickets are exhausted, so we increase them from 256 to 512 (the commands we used are sketched after this section):
At around 15:00 we notice that replication is stuck on both secondaries.
At around 15:25 we shut down the 3 most heavily loaded routers (mongo-hotel-01, mongo-hotel-02, mongo-hotel-03) so that the DB no longer receives load from their clients. At about 15:50 the first secondary is back in sync, so we step down the primary because we (wrongly) think it might be a hardware problem:
At about 17:20, we promote the stepped-down server to Primary again because we no longer think it is a hardware problem. |
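A minimal mongo shell sketch of the operational steps described above (raising the WiredTiger ticket limits and later stepping down the primary). The 256-to-512 values mirror the ones mentioned in this report; the commands are shown only as an illustration, not as a transcript of the actual session:

    // Raise the WiredTiger concurrency ticket limits from 256 to 512, as described above.
    // Run against the affected mongod; setParameter takes effect without a restart.
    db.adminCommand({ setParameter: 1, wiredTigerConcurrentReadTransactions: 512 });
    db.adminCommand({ setParameter: 1, wiredTigerConcurrentWriteTransactions: 512 });

    // Later, step down the primary so a secondary can take over
    // (the default step-down period is 60 seconds).
    rs.stepDown();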
| Participants: |
| Description |
We are running a cluster consisting of:
While running the command convertToCapped, all databases were blocked. Neither the killOp nor the killSession command could kill this operation. Clients of all other databases had to wait until the command terminated. Last but not least, replication got stuck during convertToCapped, so both secondaries fell further and further behind the primary. Please see the attached screenshots of our dashboards for the primary replica set member. They show clearly when the primary was heavily or even completely blocked. I'll also upload log and diagnostic data files from all mongoDs and mongoSs. This ticket is related to |
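For illustration, a minimal mongo shell sketch of the kind of invocation and kill attempts described in this report; the collection name and size below are placeholders, not the actual values from this cluster:

    // Convert an existing collection into a capped collection (placeholder name and size).
    db.runCommand({ convertToCapped: "someCollection", size: 1024 * 1024 * 1024 });

    // From another connection: find the running convertToCapped operation and try to kill it.
    db.currentOp({ "command.convertToCapped": { $exists: true } }).inprog.forEach(function (op) {
        printjson({ opid: op.opid, secs_running: op.secs_running });
        db.killOp(op.opid);   // per this report, the operation kept running anyway
    });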
| Comments |
| Comment by Kay Agahd [ 02/Apr/20 ] |
Hi carl.champain and milkie, Many thanks for the analysis! Furthermore, the command convertToCapped was only active from 11:54 to 13:15 according to the log file, but the whole database system was essentially blocked until 17:30. Last but not least, all the killOp commands executed at 12:56, 13:00, 13:03, 14:42 and 14:43 did not kill anything but my nerves. |
| Comment by Carl Champain (Inactive) [ 01/Apr/20 ] |
Here is what happened:
We recommend you run convertToCapped during a maintenance window, or reduce the volume of reads and writes directed at the locked database while the command runs. I'm going to close this ticket as this is not a bug. Kind regards, |
| Comment by Eric Milkie [ 01/Apr/20 ] |
Note that a viable workaround for this situation is to do what Kay tried: temporarily raising the number of read tickets. This will work until the incoming connection limit is hit (or some other resource limit). |
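A small illustrative sketch (not taken from this ticket) of how ticket availability can be checked while applying this workaround; the field names are those exposed under wiredTiger.concurrentTransactions in serverStatus on the server versions discussed here:

    // Show how many WiredTiger read/write tickets are currently available vs. in use.
    var t = db.serverStatus().wiredTiger.concurrentTransactions;
    printjson({
        read:  { available: t.read.available,  out: t.read.out,  total: t.read.totalTickets },
        write: { available: t.write.available, out: t.write.out, total: t.write.totalTickets }
    });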
| Comment by Eric Milkie [ 01/Apr/20 ] |
The convertToCapped command holds a Database exclusive lock for its entire duration. The creation of the temp collection and the final rename also take Database exclusive locks, but those are nested (recursive) and thus have no bearing on the state of the lock queue for the Database.
We haven't made any changes to this behavior of the command even up to version 4.4. |
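As an illustration of the queueing behavior described above (again a sketch, not taken from this ticket), operations stuck behind such a Database exclusive lock can be listed by filtering currentOp for waiters:

    // List operations that are blocked waiting for a lock, with the lock modes involved.
    db.currentOp({ waitingForLock: true }).inprog.forEach(function (op) {
        printjson({ opid: op.opid, ns: op.ns, secs_running: op.secs_running, locks: op.locks });
    });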
| Comment by Carl Champain (Inactive) [ 23/Mar/20 ] |
We have received a total of 26 files. Thank you! |
| Comment by Kay Agahd [ 22/Mar/20 ] |
Thanks dmitry.agranat for the upload location. |
| Comment by Dmitry Agranat [ 22/Mar/20 ] |
I've created a secure upload portal for you. Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time. Thanks, |
| Comment by Kay Agahd [ 20/Mar/20 ] |
Could you please give me the link to your upload portal for uploading the log and diagnostic files? Thanks! I tried this one but it did not work:
|