[SERVER-34082] MongoDB stalls randomly Created: 23/Mar/18  Updated: 23/Jul/18  Resolved: 21/Jun/18

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.4.6
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Anton Neznaenko Assignee: Kelsey Schubert
Resolution: Incomplete Votes: 0
Labels: SEW
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File keith.png     Zip Archive mongo_diag.zip     File ncs-mongo-hung-14.04.2018.tar.gz     PNG File stall.png    
Issue Links:
Related
related to WT-3997 The cursor walk code can spin without... Closed
is related to WT-4027 Yield cursor operations between resta... Closed
Operating System: ALL
Participants:

 Description   

We are experiencing random MongoDB freezes on our two replica sets. For short intervals (about 30-60 seconds) we are unable to get a response from the DB. This happens at random; no CPU, disk, network, or memory spikes are observed in our Nagios monitoring.
The MongoDB logs show that the db accepts connections and that the client disconnects 30 seconds later.
On the application side, a number of MongoCursorTimeoutExceptions were thrown by the PHP driver at March 21, 2018, 3:01:42 a.m. UTC.
I've attached diagnostic.data logs for the problematic period from our three db instances.
DB1 is the primary (master); DB2 and DB3 are secondaries (slaves). The problem started on DB2 at March 21, 2018, 3:01:12, and 30 seconds later we observed exceptions on our PHP application servers.
Digging into the diagnostic logs, I found that the problematic period is missing there (the data has a gap).

Is this problem resolved in a later MongoDB release? We are planning an upgrade and want to make sure that the fix is already included.
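
As a rough sketch of the failure mode described above (this is not the reporter's actual application code; the hostnames, database, and collection names are placeholders), a driver configured with a 30-second socket timeout and secondaryPreferred reads will block on a stalled secondary until the timeout fires, so the stall surfaces application-side as a driver timeout such as the PHP MongoCursorTimeoutException. A minimal PyMongo equivalent, under those assumptions:

    # Hypothetical sketch (PyMongo standing in for the reporter's PHP driver):
    # a client with the same 30-second timeout and secondaryPreferred reads.
    from pymongo import MongoClient
    from pymongo.errors import PyMongoError

    client = MongoClient(
        "mongodb://db1.example,db2.example,db3.example/?replicaSet=rs0",  # placeholder hosts
        socketTimeoutMS=30000,                 # matches the 30-second driver timeout in the report
        readPreference="secondaryPreferred",   # reads may be routed to DB2/DB3
    )

    try:
        # A read routed to a stalled secondary blocks here until the 30s timeout expires.
        doc = client.mydb.mycollection.find_one({"_id": 42})
    except PyMongoError as exc:
        # The PHP driver reports the same situation as MongoCursorTimeoutException.
        print("read timed out or failed:", exc)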



 Comments   
Comment by Kelsey Schubert [ 21/Jun/18 ]

Hi neanton,

We haven't heard back from you for some time, so I'm going to mark this ticket as resolved. Unfortunately, we do not have enough information to continue investigating this issue. If this is still a problem for you after upgrading to MongoDB 3.6, please provide additional information and we will reopen the ticket.

Kind regards,
Kelsey

Comment by Kelsey Schubert [ 02/May/18 ]

Hi neanton,

Unfortunately, since these fixes are speculative in nature, they do not justify a backport into the older stable branches.

Kind regards,
Kelsey

Comment by Anton Neznaenko [ 16/Apr/18 ]

Hi, Kelsey,

Is there any chance these speculative fixes could be backported to the 3.4 branch? We would then be able to upgrade our cluster and check whether the issue is resolved.
By the way, we've seen the same strange behavior at around 1:41 UTC on the 14th of April. I'm attaching the diagnostic data logs in case they give you some additional info: ncs-mongo-hung-14.04.2018.tar.gz

Comment by Kelsey Schubert [ 10/Apr/18 ]

Hi neanton,

After investigating this issue, we've made a few speculative fixes which will be included in MongoDB 3.6.4. Unfortunately, we haven't been able to reproduce the issue that you're observing, and so we aren't able to confirm whether these fixes resolve it. Would you be able to upgrade to MongoDB 3.6.4 when it is available, and let us know if it resolves the problem?

Thank you,
Kelsey

Comment by Alexander Gorrod [ 08/Apr/18 ]

Thanks for the investigation, keith.bostic. I can answer your question about eviction worker threads:

One note is that during all of this, the number of eviction worker threads goes to 0 and has to ramp up again, which doesn't seem right. I don't see anything in the code that makes me think that should happen, so I'm possibly misreading that.

The statistic there is misleading - dynamic eviction worker threads are disabled in the version of MongoDB this user is running, so there will be a constant number of workers (4).
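
For reference, a rough way to watch these eviction worker statistics from the outside is to poll serverStatus and filter the wiredTiger.cache section. The exact metric names vary between MongoDB/WiredTiger versions, so this sketch looks them up dynamically rather than assuming a specific key; the host is a placeholder:

    # Illustrative only: print every wiredTiger.cache counter whose name
    # mentions "eviction worker", whatever it is called in this version.
    from pymongo import MongoClient

    client = MongoClient("mongodb://db2.example:27017/")  # placeholder host
    status = client.admin.command("serverStatus")

    cache_stats = status.get("wiredTiger", {}).get("cache", {})
    for name, value in sorted(cache_stats.items()):
        if "eviction worker" in name:
            print(name, value)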

Comment by Keith Bostic (Inactive) [ 06/Apr/18 ]

I spent some more time looking at this today. As bruce.lucas noted, there's a gap right when the problem happens, so all we can really look at is the stuff around the edges.

I think it's interesting that DB2 has a big spike in memory allocation/free calls and the disk goes to 100%, but everything else pretty much shuts down. Eviction/reconciliation and logging all go to zero, and eviction/reconciliation doesn't recover for a long time, even after ftdc data is coming in again. There's a checkpoint that starts inside the gap, but whatever happened started before the checkpoint started: the checkpoint might not be helping, but it didn't cause the problem. The oplog is quiescent.

One note is that during all of this, the number of eviction worker threads goes to 0 and has to ramp up again, which doesn't seem right. I don't see anything in the code that makes me think that should happen, so I'm possibly misreading that.

Looking at the DB1 data (which didn't suffer the ftdc outage), it's pretty clear the system went from 0 to 100 in a few seconds, that is, everything was quiescent at the start, and then everything was slammed without any ramp-up, so I think it's a fair guess the system was in a state where it couldn't quickly adapt to a surge in traffic.

I suspect the explanation for the surge in cursor restarts is that threads are repeatedly hitting pages that have split (why those pages are splitting right at this time, I have no idea). The design assumption was that a cursor restart does enough work (searching the tree, for example) that there wasn't any reason to yield the CPU. This is also on a 16-core system, which should give us some buffering from too many threads chasing too few cores. That said, I suspect the large number of cursor restarts comes from relatively few cursors building up impressive restart counts by hitting the same page repeatedly.
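
As an illustrative way to make a restart spike like this visible from the outside (not part of the investigation above), one can poll serverStatus and print per-interval deltas for any WiredTiger cursor statistic whose name mentions "restart". The key names are looked up dynamically because they differ between versions, and the host is a placeholder:

    # Rough sketch: report how much each "restart"-related cursor counter
    # increased over each 10-second polling interval.
    import time
    from pymongo import MongoClient

    client = MongoClient("mongodb://db2.example:27017/")  # placeholder host

    def restart_counters():
        status = client.admin.command("serverStatus")
        cursor_stats = status.get("wiredTiger", {}).get("cursor", {})
        return {name: value for name, value in cursor_stats.items() if "restart" in name}

    prev = restart_counters()
    while True:
        time.sleep(10)
        current = restart_counters()
        for name, value in current.items():
            delta = value - prev.get(name, 0)
            if delta:
                print(name, "+%d in the last 10s" % delta)
        prev = current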

Comment by Keith Bostic (Inactive) [ 26/Mar/18 ]

alexander.gorrod@mongodb.com, michael.cahill, when cursor_restart is incremented, the restart is going to cause a search of the tree from the root, and I don't think that's likely to complete without yielding, which makes me think the cursor restart spike is a side-effect of something else.

Comment by Anton Neznaenko [ 23/Mar/18 ]

Yep, Bruce, our app does a series of read and write operations during a single HTTP request, and some of those operations use the secondaryPreferred read preference to distribute reads to the secondaries. So if a read request to db2 hangs, no further operations are issued and the current HTTP request finishes with an error. We are using the default driver timeout of 30 seconds, so ALL our HTTP requests hang during this period of time.
Please let us know if you need any other detailed information about our setup or some additional debugging info.

Comment by Bruce Lucas (Inactive) [ 23/Mar/18 ]

Hi Anton,

Thanks for the detailed problem report. From the metrics you uploaded, it appears to me that the problem occurred on the db2 secondary node, stalling operations there. The other nodes appeared to be unaffected, except that they weren't processing any requests during the incident. I imagine this could occur if your application depends on secondary reads from db2, so that if it stalls, all application threads stall. Is this plausible, given the way your application works?

Meanwhile we have not yet identified a known issue that matches the symptoms, but we are still looking into it.

Thanks,
Bruce
