[SERVER-34082] MongoDB stalls randomly Created: 23/Mar/18 Updated: 23/Jul/18 Resolved: 21/Jun/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | WiredTiger |
| Affects Version/s: | 3.4.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Anton Neznaenko | Assignee: | Kelsey Schubert |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | SEW |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Operating System: | ALL |
| Participants: | |
| Description |
|
We are experiencing random MongoDB freezes on our 2 replica sets. During short time intervals (about 30-60 seconds) we are unable to get a response from the DB. That happens randomly. No CPU, disk, network, or memory spikes are observed in our Nagios monitoring. Is this problem resolved in a later MongoDB release? We are planning an update and want to make sure that the fix is already included. |
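One way to confirm stalls like this independently of external monitoring is to time a cheap command against each member in a loop and log the intervals where a node stops answering. A minimal sketch, assuming pymongo is available; the hostnames, port, and the 5-second threshold are hypothetical placeholders, not values taken from this report:
{code:python}
import time
from pymongo import MongoClient

# Hypothetical member addresses; replace with the actual replica set members.
HOSTS = ["db1.example.com:27017", "db2.example.com:27017"]
clients = {h: MongoClient(h, serverSelectionTimeoutMS=2000) for h in HOSTS}

while True:
    for host, client in clients.items():
        start = time.time()
        try:
            # "ping" is a cheap command; a healthy node answers in milliseconds.
            client.admin.command("ping")
            elapsed = time.time() - start
            if elapsed > 5:  # arbitrary threshold for "stalled"
                print(f"{time.ctime()} {host}: ping took {elapsed:.1f}s")
        except Exception as exc:
            elapsed = time.time() - start
            print(f"{time.ctime()} {host}: ping failed after {elapsed:.1f}s ({exc})")
    time.sleep(1)
{code}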
| Comments |
| Comment by Kelsey Schubert [ 21/Jun/18 ] |
|
Hi neanton, We haven’t heard back from you for some time, so I’m going to mark this ticket as resolved. Unfortunately, we do not have enough information to continue to investigate this issue. If this is still a problem for you after upgrading to MongoDB 3.6, please provide additional information and we will reopen the ticket. Kind regards, |
| Comment by Kelsey Schubert [ 02/May/18 ] |
|
Hi neanton, Unfortunately, since these fixes are speculative in nature, they do not justify a backport to the older stable branches. Kind regards, |
| Comment by Anton Neznaenko [ 16/Apr/18 ] |
|
Hi Kelsey, Is there any chance these speculative fixes could be backported to the 3.4 branch? We would then be able to upgrade our cluster and check whether the issue is resolved. |
| Comment by Kelsey Schubert [ 10/Apr/18 ] |
|
Hi neanton, After investigating this issue, we've made a few speculative fixes which will be included in MongoDB 3.6.4. Unfortunately, we haven't been able to reproduce the issue that you're observing, and so we aren't able to confirm whether these fixes resolve the issue. Would you be able to upgrade to MongoDB 3.6.4 when it is available, and let us know if it resolves the problem? Thank you, |
| Comment by Alexander Gorrod [ 08/Apr/18 ] |
|
Thanks for the investigation, keith.bostic. I can answer your question about eviction worker threads:
The statistic there is misleading - dynamic eviction worker threads are disabled in the version of MongoDB this user is running, so there will be a constant number of workers (4). |
| Comment by Keith Bostic (Inactive) [ 06/Apr/18 ] |
|
I spent some more time looking at this today. As bruce.lucas noted, there's a gap right when the problem happens, so all we can really look at is the stuff around the edges. I think it's interesting that DB2 has a big spike in memory allocation/free calls and the disk goes to 100%, but everything else pretty much shuts down. Eviction/reconciliation and logging all go to zero, and eviction/reconciliation doesn't recover for a long time, even after ftdc data is coming in again. There's a checkpoint that starts inside the gap, but whatever happened started before the checkpoint started: the checkpoint might not be helping, but it didn't cause the problem. The oplog is quiescent. One note is that during all of this, the number of eviction worker threads goes to 0 and has to ramp up again, which doesn't seem right. I don't see anything in the code that makes me think that should happen, so I'm possibly misreading that. Looking at the DB1 data (which didn't suffer the ftdc outage), it's pretty clear the system went from 0 to 100 in a few seconds; that is, everything was quiescent at the start, and then everything was slammed without any ramp-up, so I think it's a fair guess the system was in a state where it couldn't quickly adapt to a surge in traffic. I suspect the explanation for the surge in cursor restarts is that threads are repeatedly hitting pages that have split (why those pages are splitting right at this time, I have no idea). The design assumption was that a cursor restart does enough work (searching the tree, for example) that there wasn't any reason to yield the CPU. This is also on a 16-core system, which should give us some buffering from too many threads chasing too few cores. That said, I suspect the large number of cursor restarts is because a relatively small number of cursors were able to build up impressive restart counts by hitting the same page repeatedly. |
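For anyone who wants to watch the same counters outside of ftdc, a minimal sketch that samples the eviction-worker and cursor-restart statistics from serverStatus. The connection string is a hypothetical placeholder, and exact statistic names vary between WiredTiger versions, so the sketch filters key names by substring rather than hard-coding them:
{code:python}
from pymongo import MongoClient

# Hypothetical host; point this at the node being investigated (e.g. db2).
client = MongoClient("mongodb://db2.example.com:27017")

# The "wiredTiger" section of serverStatus exposes the engine statistics
# that ftdc records. Statistic names differ across versions, so filter by
# substring instead of assuming exact keys.
wt = client.admin.command("serverStatus")["wiredTiger"]

for section in ("cache", "cursor", "thread-yield"):
    for name, value in sorted(wt.get(section, {}).items()):
        lowered = name.lower()
        if "eviction worker" in lowered or "restart" in lowered:
            print(f"{section}: {name} = {value}")
{code}
Running this in a loop around a stall window would show whether the eviction worker counters drop to zero and the cursor-restart counters spike in the way described above.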
| Comment by Keith Bostic (Inactive) [ 26/Mar/18 ] |
|
alexander.gorrod@mongodb.com, michael.cahill, when cursor_restart is incremented, the restart is going to cause a search of the tree from the root, and I don't think that's likely to complete without yielding, which makes me think the cursor restart spike is a side-effect of something else. |
| Comment by Anton Neznaenko [ 23/Mar/18 ] |
|
Yep, Bruce, our app does a series of read and write operations during a single HTTP request, and some of those operations use the secondaryPreferred read preference to distribute reads to the secondaries. So I would say that if a read request to db2 hung, no further operations would run and the current HTTP request would finish with an error. We are using the default driver timeout of 30 seconds, so ALL our HTTP requests hang during this period of time. |
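As an illustration of the access pattern described above (the database, collection, and URI names are hypothetical, and which option the 30-second timeout maps to depends on the driver), a handler that mixes primary writes with secondaryPreferred reads will block for the full timeout when the preferred secondary stalls:
{code:python}
from pymongo import MongoClient, ReadPreference

client = MongoClient(
    "mongodb://db1.example.com,db2.example.com/?replicaSet=rs0",
    socketTimeoutMS=30000,  # stands in for the 30-second driver timeout mentioned above
)
orders = client.shop.orders
orders_secondary = orders.with_options(
    read_preference=ReadPreference.SECONDARY_PREFERRED
)

def handle_request(order_id, update):
    # The write goes to the primary and is unaffected by a secondary stall.
    orders.update_one({"_id": order_id}, {"$set": update})
    # The read is routed to a secondary when one is available; if that node
    # stalls, this call blocks for up to the socket timeout, and the whole
    # HTTP request hangs with it.
    return orders_secondary.find_one({"_id": order_id})
{code}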
| Comment by Bruce Lucas (Inactive) [ 23/Mar/18 ] |
|
Hi Anton, Thanks for the detailed problem report. From the metrics you uploaded, it appears to me that the problem occurred on the db2 secondary node, stalling operations there. The other nodes appeared to be unaffected, except that they weren't processing any requests during the incident. I imagine this could occur if your application depends on secondary reads from db2, so that if it stalls, all application threads stall. Is this plausible, given the way your application works? Meanwhile we have not yet identified a known issue that matches the symptoms, but we are still looking into it. Thanks, |