[SERVER-51281] mongod live locked Created: 01/Oct/20 Updated: 06/Dec/23 Resolved: 12/Feb/21

| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 4.9.0, 4.4.5 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Mark Weaver | Assignee: | Lingzhi Deng |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v4.4 |
| Sprint: | Storage - Ra 2021-01-25, Repl 2021-02-22 |
| Participants: | |
| Case: | (copied to CRM) |
| Description |
We are encountering an issue with mongo-4.4.1 (4.4.0 also exhibited the issue) running under Kubernetes using: https://hub.docker.com/layers/mongo/library/mongo/4.4/images/sha256-31f6433f7cfcd2180483e40728cbf97142df1e85de36d80d75c93e5e7fe10405?context=explore

The problem is that occasionally the server will spin the CPU, and connections running queries that never finish pile up. It appears that most of them are waiting for a lock that is never unlocked, and they are stuck in that state – the actual TCP connections to the server get shut down by the clients, but the server does not decrement the connection count.

We can't reproduce this in a reliable fashion – it happens every few days. The previous version of mongo that we were using was 4.0.5 and we had no problems there.

If I can collect any other information please let me know (I have a core dump that I generated the backtraces from, but I can't attach that in case it contains user data).

Thanks,
Mark
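The symptoms above (a connection count that never drops, operations queued behind a lock) can be spot-checked from the shell; a minimal sketch using standard serverStatus fields:

```javascript
// Minimal sketch: spot-check the reported symptoms from the mongo shell.
// These serverStatus fields are standard; interpretation is up to the reader.
const s = db.serverStatus();
printjson({
  connections: s.connections,                 // current / available / totalCreated
  globalLockQueue: s.globalLock.currentQueue  // operations queued behind locks
});
```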
| Comments |
| Comment by Githook User [ 22/Feb/21 ] |

Author: Lingzhi Deng <lingzhi.deng@mongodb.com> (ldennis)
Message: (cherry picked from commit d7b4c6bcaa61dc54d830007657babdd6337f9cfb)
| Comment by Githook User [ 10/Feb/21 ] |

Author: Lingzhi Deng <lingzhi.deng@mongodb.com> (ldennis)
Message:
| Comment by Bruce Lucas (Inactive) [ 07/Jan/21 ] |

Is this the same underlying issue as
| Comment by Louis Williams [ 08/Oct/20 ] |

bruce.lucas, my understanding is that those WT tickets will allow more data to be evictable overall. It's not clear to me how, or to what extent, dirty content is allowed to expand beyond the default threshold. Is dirty content allowed to expand beyond the default threshold as a high water mark, or is it a hard limit? Or is the dirty content limit still bounded by the cache size?

All of the threads waiting for eviction seem to think cache eviction will allow them to make forward progress. This is what I see as the problem: assume there is a hard limit on dirty content (whether it be the configured limit or the WT cache size). In this example, every operation has uncommitted data taking up cache space. All are trying to commit, but because of a lack of cache space, they are tasked with eviction. No operations are able to evict because none can commit, which would make their data evictable (and in no case is uncommitted data evictable). The obvious problem in this example is that operations with uncommitted data fill up the cache while waiting for eviction as they try to commit, further preventing other operations from committing.

I know there is logic already in WT to roll back operations when no progress is being made for eviction. Why didn't that kick in for this scenario?
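As a rough illustration of the scenario described in the comment above, a toy model (not MongoDB or WiredTiger code; all byte counts are invented) of why no writer can make progress once uncommitted data alone reaches the dirty limit:

```javascript
// Toy model only. Eight writers each hold 8MB of uncommitted dirty data; only
// committed data is evictable, yet each writer must evict before it is allowed
// to commit, so none of them can make progress.
const dirtyLimitBytes = 50 * 1024 * 1024;   // e.g. 20% of a ~250MB cache
const writers = Array.from({ length: 8 }, () => ({ uncommittedBytes: 8 * 1024 * 1024 }));
const uncommitted = writers.reduce((sum, w) => sum + w.uncommittedBytes, 0);
const evictableBytes = 0;                   // nothing has committed, so nothing is evictable
if (uncommitted >= dirtyLimitBytes && evictableBytes === 0) {
  print("livelock: every writer is tasked with eviction, but has nothing it can evict");
}
```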
| Comment by Bruce Lucas (Inactive) [ 07/Oct/20 ] |

louis.williams, I'm not seeing how those WT tickets will help: isn't it the case that the cache content is not evictable because it is pinned by the open transaction, and that open transaction is unable to proceed because there is no space in the cache to hold its next write? In short, that single transaction wants to generate more dirty content than the cache dirty content limit allows, and it's fairly fundamental to the design of WT that that is not possible? If so then I don't see how those WT tickets would help, but I may be missing something.

However I do feel that it could be considered a flaw that we have no way to abort transactions that get stuck in this way, but instead we allow the whole instance to become wedged permanently.
| Comment by Louis Williams [ 07/Oct/20 ] |

To summarize what is going on here:

I'm not sure if there's a specific bug with either the transaction killer or the session reaper. Once an operation is inside the storage engine, there is little to nothing that the MongoDB code can do to kill it. Aside from the linked WT improvements, the only solution I can think of on our end would be to do a better job of throttling access into the storage engine. bruce.lucas, do you agree?
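For context, the nearest existing knob to the throttling mentioned above is the WiredTiger ticket limit; a sketch of how to inspect it, shown only to illustrate the kind of throttle under discussion rather than as the fix adopted for this ticket:

```javascript
// Inspect WiredTiger ticket usage and the configured write-ticket limit.
printjson(db.serverStatus().wiredTiger.concurrentTransactions);  // read/write tickets: out vs. available
db.adminCommand({ getParameter: 1, wiredTigerConcurrentWriteTransactions: 1 });
```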
| Comment by Eric Milkie [ 06/Oct/20 ] |
| Comment by Bruce Lucas (Inactive) [ 06/Oct/20 ] |

Thanks Mark, appreciate the detailed information.
| Comment by Mark Weaver [ 06/Oct/20 ] |

For gdb.log, those are two separate stack traces from two separate incidents – I uploaded the 2nd incident so that you could have matching gdb, mongo log, diagnostic data and currentOp output. For timestamps, I have 2020-10-01T15:45:08Z for the first incident and 2020-10-02T05:50:20Z for the second incident.

If it helps, I still have the core dumps so I can further interrogate those (but I very likely can't upload them as they contain 3rd party data – I'd have to get permission from the 3rd party to share that).
| Comment by Bruce Lucas (Inactive) [ 06/Oct/20 ] |

milkie, louis.williams, question: suppose an operation that is part of a transaction is stuck waiting for cache; will the expired transaction killer be able to abort the transaction? The stack traces above show the expired transaction killer is stuck in checkOutSessionForKill; could this be because there is an operation in progress on that session that is stuck waiting for cache?
| Comment by Bruce Lucas (Inactive) [ 05/Oct/20 ] |

Good point, mark.weaver@trelica.com. I've been looking into this further and I'm not sure this is really entirely a storage engine issue. The diagnostic data that you uploaded shows two incidents, and in one case there was a batch of inserts preceding the incident, but in the second case there was not. In both cases though we see something like this:

At A a transaction is started ("total started") but never commits ("total committed") and remains open ("current open"). We would expect such a transaction to be timed out after 60 seconds (at default settings, which you appear to be using). This is the thread that is responsible for killing expired transactions, and it appears to be trying to do so, but it appears to be stuck waiting to check out a session:

I'm not sure what it is stuck waiting for, but we will look into it further.

Can you clarify what is in gdb.log – it appears to be two different stack trace samples, but I'm not sure if it's two consecutive samples from the same incident, or two samples from two different incidents. Do you have the timestamps of the samples to help clarify?
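The metrics quoted above ("total started", "total committed", "current open") correspond to fields under db.serverStatus().transactions, and the 60-second figure is the default transactionLifetimeLimitSeconds; a quick sketch for reading both back on a 4.4 mongod:

```javascript
// Read back the transaction metrics and the expired-transaction timeout.
const t = db.serverStatus().transactions;
printjson({ totalStarted: t.totalStarted, totalCommitted: t.totalCommitted, currentOpen: t.currentOpen });
db.adminCommand({ getParameter: 1, transactionLifetimeLimitSeconds: 1 });
```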
| Comment by Mark Weaver [ 03/Oct/20 ] |

Yes, I can certainly give it a go with 4.4.2 (and knowing what it is, I can try using larger transactions to see if that makes the behaviour more reproducible). It certainly fits with what we observe (it usually happens early AM when some batch jobs run).

There is still a bug in there somewhere though, because mongo continues to spin the CPU even after every client has been disconnected – if it was just a cache limit I'd expect aborting all existing connections to free things up?

Thanks,
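A hypothetical sketch of the "larger transaction" experiment suggested above; the collection name and document sizes are invented, and depending on version the transaction may stall waiting on cache or be rolled back by WiredTiger rather than hang:

```javascript
// One multi-document transaction whose uncommitted writes exceed the dirty limit
// of a small (~250MB) cache. All names and sizes here are illustrative only.
const session = db.getMongo().startSession();
const coll = session.getDatabase("test").cachePressure;
session.startTransaction();
for (let i = 0; i < 100000; i++) {
  coll.insertOne({ _id: i, pad: "x".repeat(1024) });  // roughly 100MB of uncommitted data
}
session.commitTransaction();
session.endSession();
```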
| Comment by Bruce Lucas (Inactive) [ 02/Oct/20 ] |

Thanks Mark. You have a fairly small cache at ~250MB, which only allows up to 50MB of dirty content in the cache. WiredTiger is able to evict all committed content from the cache, but can't evict uncommitted content, so it is relatively easy to reach the dirty limit before the content has been committed. The stall happens immediately following a cluster of inserts, which generates enough dirty content that the cache becomes stuck full with that dirty content.

Since you were able to run your workload with 4.0, we believe that

Also additional future work in
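A sketch of how to see how close the cache is to that limit, using standard WiredTiger statistics; the 20% dirty trigger is the WiredTiger default, which is where the ~50MB figure for a ~250MB cache comes from:

```javascript
// Compare tracked dirty bytes against the default 20% dirty trigger.
const c = db.serverStatus().wiredTiger.cache;
const maxBytes = c["maximum bytes configured"];
const dirtyBytes = c["tracked dirty bytes in the cache"];
print(`dirty: ${(dirtyBytes / 1048576).toFixed(1)} MB of a ${(0.2 * maxBytes / 1048576).toFixed(1)} MB dirty trigger`);
```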
| Comment by Mark Weaver [ 02/Oct/20 ] |

@Bruce Lucas: I have put some log files in the S3 bucket using the link you provided above, from another instance of the same problem: the output of db.currentOp(), which shows a bunch of threads waiting for latches; the output of gdb, which shows roughly the same; and the mongod logs and diagnostic.data for the relevant period.

I also have a core dump locally – it's doubtful I can share that, but I can interrogate it for you if that's helpful.
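For reference, a sketch of the kind of currentOp filter that isolates the stuck operations; field availability (for example waitingForLatch) varies by server version:

```javascript
// List long-running active operations and whether they report waiting on a latch.
db.currentOp({ active: true, secs_running: { $gt: 60 } }).inprog.forEach(op => {
  printjson({ opid: op.opid, secs_running: op.secs_running, waitingForLatch: op.waitingForLatch });
});
```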
| Comment by Mark Weaver [ 01/Oct/20 ] |

I've attached the diagnostics again – the last upload failed, so I've cut it down a bit as I guess there's a size limit.
| Comment by Bruce Lucas (Inactive) [ 01/Oct/20 ] |

mark.weaver@trelica.com, thanks for the report. Can you please archive and attach to this ticket the $dbpath/diagnostic.data directory? It will also be helpful to have the timestamps of the incident(s). Also, can you attach the mongodb log? If it is too large or you don't want to post it on a public ticket you can upload it to this secure private portal.
| Comment by Mark Weaver [ 01/Oct/20 ] |

(Sorry about the mess, I accidentally hit enter so I've ended up with a half-formed bug I can't see how to edit.)

We are encountering an issue with mongo-4.4.1 (4.4.0 also exhibited the issue) running under Kubernetes using: https://hub.docker.com/layers/mongo/library/mongo/4.4/images/sha256-31f6433f7cfcd2180483e40728cbf97142df1e85de36d80d75c93e5e7fe10405?context=explore

The problem is that occasionally the server will spin the CPU, and connections running queries that never finish pile up. It appears that most of them are waiting for a lock that is never unlocked, and they are stuck in that state – the actual TCP connections to the server get shut down by the clients, but the server does not decrement the connection count.

We can't reproduce this in a reliable fashion – it happens every few days. The previous version of mongo that we were using was 4.0.5 and we had no problems there.

If I can collect any other information please let me know (I have a core dump that I generated the backtraces from, but I can't attach that in case it contains user data).

Thanks,
Mark