[SERVER-35543] Secondary server got frozen with 100% CPU Created: 12/Jun/18  Updated: 27/Oct/23  Resolved: 26/Oct/18

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 3.6.2
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Isaac Cruz Assignee: Bruce Lucas (Inactive)
Resolution: Gone away Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

CentOS 7


Attachments: PNG File incident.png     Text File pstack-03a.txt     Text File pstack-03b.txt    
Operating System: ALL
Steps To Reproduce:

It has happened randomly, around 3 times in the last 3 weeks: the first two incidents were a few hours apart and on different servers (I think those affected primaries at the time), and now it has happened again on a secondary.

Participants:

 Description   

We have a sharded cluster with 8 shards, each of them a replica set with two data-bearing members plus an arbiter. Today we had a problem on one of the secondary servers: it suddenly started to use 100% CPU and did not respond to any query. It remained in that state until it was restarted.

I'm attaching a stack trace from "pstack" in case it helps. It seems most threads are waiting for a lock, except a few that may be holding the locks while consuming all the CPU (this server has 2 CPUs): threads 70, 73 and 83.



 Comments   
Comment by Kelsey Schubert [ 26/Oct/18 ]

Thanks for the update, icruz.

Comment by Isaac Cruz [ 26/Oct/18 ]

Hi Kelsey,

We have upgraded to 4.0.x and have not seen this behavior again, so this ticket can be resolved; we will reopen it if the problem happens again.

Thanks,
Isaac

Comment by Kelsey Schubert [ 25/Oct/18 ]

Hi icruz,

Is this still an issue for you? If so, I would recommend upgrading to a more recent version of MongoDB 3.6, as there have been a number of fixes in this space since this ticket was originally opened. If you're still encountering this issue after upgrading, would you please provide the stack traces while the lag is building, as Bruce requested?

Thank you,
Kelsey

Comment by Bruce Lucas (Inactive) [ 14/Jun/18 ]

"Regarding monitoring replication lag on the secondaries for the future, is it using db.printSlaveReplicationInfo() / db.printReplicationInfo() or is there an easier way?"

Yes, that's the recommended way of monitoring it; details are described here. Thanks for your help troubleshooting this problem.
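
If you want to script it rather than eyeball the printed output, here is a minimal mongo shell sketch of computing the lag from rs.status() (the member handling and output format are only illustrative):

  // Compute per-member replication lag in seconds from rs.status().
  var status = rs.status();
  var primary = status.members.filter(function (m) { return m.stateStr === "PRIMARY"; })[0];
  status.members.forEach(function (m) {
      if (primary && m.stateStr === "SECONDARY") {
          // optimeDate is the wall-clock time of the last oplog entry applied by this member
          var lagSecs = (primary.optimeDate - m.optimeDate) / 1000;
          print(m.name + " lag: " + lagSecs + "s");
      }
  });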

Bruce

Comment by Laxman P [ 13/Jun/18 ]

Uploaded mongos.log from 10.0.0.4 for that day, along with the output of some other commands I had captured at the time (in particular, currentOp on 03a).
Regarding monitoring replication lag on the secondaries for the future, is it using db.printSlaveReplicationInfo() / db.printReplicationInfo() or is there an easier way?

Comment by Bruce Lucas (Inactive) [ 13/Jun/18 ]

Thanks for the detailed timeline. Can you upload that mongos log? I'm not sure what that message is and would like to see the context.

Here's what we're seeing in the diagnostic data:

  • From about 22:16 to 22:21, a very long-running transaction held a global intent lock without yielding for about 5 minutes. This blocked replication, caused the node to lag behind the primary, and would also have prevented any read operations on the secondary from succeeding because of the stalled replication. However, there is no slow operation logged that coincides with that long transaction, nor any other trace of it in the mongod log or diagnostic data that I have been able to find (see the sketch after this list for one way to look for such operations while they are in progress).
  • When this ended at 22:21, the large lag meant the server soon encountered severe cache pressure while trying to catch up, and it went into the high-CPU mode you noted, due to SERVER-34938. That is a known issue, but rather than closing this ticket as a duplicate I will use it to investigate the preceding triggering issue.
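
A rough currentOp sketch for spotting long-running operations and the locks they hold (the 60-second threshold is arbitrary, and a purely storage-engine-level transaction like this one may not show up in currentOp at all):

  // List active operations running longer than 60 seconds, with their lock info.
  db.currentOp({ "active": true, "secs_running": { "$gt": 60 } }).inprog.forEach(function (op) {
      printjson({ opid: op.opid, secs: op.secs_running, desc: op.desc, locks: op.locks });
  });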

If you are in a position to monitor replication lag on secondaries, that may give you an indication that such an incident is in progress. If you see lag building, please capture stack traces and db.currentOp output while it is building, if possible. The hope is that we can capture this information during the triggering incident described in the first bullet above, before the node enters high-CPU mode.
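
A minimal sketch of such a capture loop, run in the mongo shell against the secondary (the 30-second threshold and 5-second poll interval are just placeholders):

  // Poll replication lag and dump currentOp whenever lag exceeds a threshold.
  var LAG_THRESHOLD_SECS = 30;
  while (true) {
      var s = rs.status();
      var primary = s.members.filter(function (m) { return m.stateStr === "PRIMARY"; })[0];
      var self = s.members.filter(function (m) { return m.self; })[0];
      if (primary && self) {
          var lagSecs = (primary.optimeDate - self.optimeDate) / 1000;
          if (lagSecs > LAG_THRESHOLD_SECS) {
              print(new Date().toISOString() + " lag " + lagSecs + "s, dumping currentOp");
              printjson(db.currentOp(true));   // true = include all operations
          }
      }
      sleep(5000);   // mongo shell helper; argument is milliseconds
  }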

In any case, if this issue recurs, whether on a primary or a secondary, please upload diagnostic.data and the mongod log for our analysis.

Thanks,
Bruce

Comment by Laxman P [ 13/Jun/18 ]

On 10.0.0.4 and a few other servers, we have mongos running.
Around that timeframe, in the mongos.log on 10.0.0.4, I only see several occurrences of this message from 22:23 to 22:33:
"mongos collstats doesn't know about: ..."

Comment by Laxman P [ 13/Jun/18 ]

Hi Bruce - here is some additional information on the sequence of events:

Note: all times are UTC.

  1. 03b was the secondary at that time
  2. CPU on 03b was at 96%+ from 22:23:00 to 23:17:00
  3. pstack on 03b was captured on Jun 11 at 23:13
  4. Restarted mongod on 03b a few seconds after capturing pstack
  5. 03a, which was the primary, also had high CPU (75%+) from 22:23 until June 12th 02:28
  6. Tried to run rs.stepDown() on 03a after the mongod restart was complete on 03b, but it failed with the error "No electable secondaries caught up" (see the note after this list)
  7. Captured pstack on 03a at 23:48
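
For reference on item 6: rs.stepDown() by default waits only 10 seconds for an electable secondary to catch up before failing with that error, so it was expected to fail while 03b was still catching up. A longer catch-up window can be requested; the values below are only illustrative:

  // rs.stepDown(stepDownSecs, secondaryCatchUpPeriodSecs)
  rs.stepDown(60, 300);   // step down for 60s, allow up to 300s for a secondary to catch up
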
Comment by Bruce Lucas (Inactive) [ 12/Jun/18 ]

Hi Isaac,

Thanks, it would certainly be helpful to understand whether there was anything unusual happening from an application perspective around that time.

One of the possible contributing factors was a very long-running storage-engine transaction, from about 22:15:51 to about 22:21:39, but I have not been able to identify its cause (it may or may not have been related to specific application activity). Coinciding with the start of that transaction was a connection from 10.0.0.4; what mongod or mongos processes are running on that machine? If you have the log files for those mongod or mongos processes covering the incident, we may be able to get more information about the cause of the long-running transaction.
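
If the incident recurs, one quick (and admittedly rough) way to see what the connections from that host are doing at the time is a $currentOp aggregation filtered by client address; this is only a sketch using the address mentioned above, and a purely storage-engine-level transaction may not appear in it at all:

  // List in-progress operations whose client connection originates from 10.0.0.4.
  db.getSiblingDB("admin").aggregate([
      { $currentOp: { allUsers: true, idleConnections: true } },
      { $match: { client: /^10\.0\.0\.4:/ } }   // "client" is "<ip>:<port>"
  ]).forEach(printjson);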

Regarding the time when the stack traces were captured: if the original pstack-03b.txt file is still on the machine where you collected the stack traces, perhaps its creation date would tell us when they were collected? As I mentioned, this would be very helpful in understanding the incident, since we see a couple of distinct phases in the diagnostic.data you uploaded, and without the timestamp I'm not sure which phase the stack traces correspond to.

Beyond that, if the incident occurs again, can you please upload diagnostic.data and logs, and if possible stack traces with a timestamp? This will help us identify elements common to the incidents.

Also, is there any possibility of upgrading to the most recent version of 3.6 in order to pick up the fix for SERVER-32876? In the data you uploaded, diagnostic information was not collected during a substantial portion of the incident because of that issue, so upgrading would give us more complete information.

Thanks,
Bruce

Comment by Isaac Cruz [ 12/Jun/18 ]

Hi Bruce,
Basically, our application continuously inserts data into a single collection (that is the main load at that time), and we create a new collection every day. Sometimes we also read, from the previous day's collection, all the data for one lineId (that read operation is done on secondaries).
I don't think there was anything special going on at that time, but I will look into it more thoroughly and update the ticket if I find anything more.
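
To make that workload concrete, here is a rough sketch with made-up collection names and values (only the lineId field name is from our actual schema):

  // Main load: continuous inserts into a per-day collection (names are made up).
  db.getCollection("data_20180611").insert({ lineId: 42, ts: new Date(), payload: "..." });

  // Occasionally: read everything for one lineId from the previous day's
  // collection, directed to a secondary.
  db.getMongo().setReadPref("secondary");
  db.getCollection("data_20180610").find({ lineId: 42 }).toArray();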

Comment by Bruce Lucas (Inactive) [ 12/Jun/18 ]

Thanks, Isaac. What information do you have about exactly when the stack traces were captured, or at least a time range? There were a couple of distinct phases to the incident, and knowing when the stack traces were collected would help us get a complete picture.

Comment by Isaac Cruz [ 12/Jun/18 ]

Hi Bruce,

I have uploaded both files. The incident happened on Jun 11, between 22:20 and 23:18, when the server was restarted (all times UTC). Unfortunately, data from the previous incidents has already been removed.

Please let me know if you need anything else.

Thanks,
Isaac


Comment by Bruce Lucas (Inactive) [ 12/Jun/18 ]

Hi Isaac,

Can you please

  • archive and upload $dbpath/diagnostic.data from the node where the most recent incident occurred? You can upload it to this private secure portal
  • upload mongod log files covering the incident
  • provide a specific timeline for the incident (including timezone) so we can find the relevant data

If diagnostic.data for the earlier incidents is still available, can you please do the same for those incidents? You can check whether it is still available by looking at the files in diagnostic.data; the name of each file reflects the beginning of the time range covered by that file.

Thanks,
Bruce
