[SERVER-35543] Secondary server got frozen with 100% CPU Created: 12/Jun/18 Updated: 27/Oct/23 Resolved: 26/Oct/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 3.6.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Isaac Cruz | Assignee: | Bruce Lucas (Inactive) |
| Resolution: | Gone away | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | CentOS 7 |
| Attachments: |
| Operating System: | ALL |
| Steps To Reproduce: | It has happened randomly, around 3 times in the last 3 weeks: the first two incidents were a few hours apart, on different servers (and I think they affected primaries at the time), and now it has happened again on a secondary. |
| Participants: |
| Description |
|
We have a sharded cluster with 8 shards, each of them a replica set with two data-bearing members plus an arbiter. Today we had a problem on one of the secondary servers: it suddenly started to use 100% CPU and did not respond to any query. It remained in that state until restarted. I'm attaching the stack traces from "pstack" in case they help; it seems most threads are waiting for a lock, except for a few which might be hoarding the locks while consuming all the CPU (this server has 2 CPUs): threads 70, 73 and 83 |
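For reference, a minimal sketch of how this kind of evidence can be gathered on CentOS 7. It assumes gdb (which provides pstack) is installed and that mongod runs as a single process; the output path is illustrative only and not from the original report:

```
# Identify which threads are burning CPU; the thread ids shown by top -H
# can be matched against the thread ids in the pstack output.
top -H -b -n 1 -p "$(pidof mongod)" | head -n 40

# Capture native stack traces of all mongod threads (pstack ships with gdb).
pstack "$(pidof mongod)" > /var/tmp/pstack-$(hostname)-$(date -u +%Y%m%dT%H%M%SZ).txt
```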
| Comments |
| Comment by Kelsey Schubert [ 26/Oct/18 ] |
|
Thanks for the update, icruz. |
| Comment by Isaac Cruz [ 26/Oct/18 ] |
|
Hi Kelsey, we have upgraded to 4.0.x and we have not seen this behavior again. So maybe we can resolve this ticket, and we will reopen it if it happens again. Thanks, |
| Comment by Kelsey Schubert [ 25/Oct/18 ] |
|
Hi icruz, Is this still an issue for you? If so, I would recommend upgrading to a more recent version of MongoDB 3.6, as there have been a number of fixes in this space since this ticket was originally opened. If after upgrading you're still encountering this issue, would you please provide the stack traces while the lag is building, as Bruce requested? Thank you, |
| Comment by Bruce Lucas (Inactive) [ 14/Jun/18 ] |
Yes, that's the recommended way of monitoring it; details are described here. Thanks for your help troubleshooting this problem. Bruce |
| Comment by Laxman P [ 13/Jun/18 ] |
|
Uploaded mongos.log from 10.0.0.4 for that day, and also the output of some other commands that I captured at the time (especially currentOp on 03a) |
| Comment by Bruce Lucas (Inactive) [ 13/Jun/18 ] |
|
Thanks for the detailed timeline. Can you upload that mongos log? I'm not sure what that message is and would like to see the context. Here's what we're seeing in the diagnostic data:
If you are in a position to monitor replication lag on secondaries, then that may give you an indication that such an incident is in progress. If you see lag building, please capture stack traces and, if possible, db.currentOp output while it is happening. The hope is that we can capture this information during the triggering incident mentioned in the first bullet point above, before it enters high-CPU mode. In any case, if there is a recurrence of this issue, whether on a primary or a secondary, please upload diagnostic.data and the mongod log for our analysis. Thanks, |
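As a rough illustration of the kind of capture Bruce is asking for, here is a minimal sketch. It assumes a bash shell on the secondary, the legacy mongo shell, and pstack (from gdb) on the PATH; the host, 30-second threshold, polling interval, and output directory are illustrative assumptions, not part of the original request:

```
#!/bin/bash
# Sketch: poll replication lag on this secondary and, once it starts building,
# capture db.currentOp() output and native stack traces for later analysis.
HOST=localhost:27017          # assumed: the secondary being watched
OUTDIR=/var/tmp/lag-capture   # assumed output location
THRESHOLD=30                  # assumed lag threshold, in seconds
mkdir -p "$OUTDIR"

while true; do
  # Approximate lag in seconds: primary optime minus this member's optime.
  LAG=$(mongo --quiet --host "$HOST" --eval '
    var s = rs.status();
    var primary = s.members.filter(function(m) { return m.stateStr === "PRIMARY"; })[0];
    var self    = s.members.filter(function(m) { return m.self; })[0];
    print(Math.floor((primary.optimeDate - self.optimeDate) / 1000));')

  if [ "${LAG:-0}" -gt "$THRESHOLD" ]; then
    TS=$(date -u +%Y%m%dT%H%M%SZ)
    # All current operations, including idle connections, while lag is building.
    mongo --quiet --host "$HOST" --eval 'printjson(db.currentOp(true))' \
      > "$OUTDIR/currentOp-$TS.json"
    # Native stack traces of mongod, the same tool used in the original report.
    pstack "$(pidof mongod)" > "$OUTDIR/pstack-$TS.txt"
  fi
  sleep 10
done
```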
| Comment by Laxman P [ 13/Jun/18 ] |
|
On 10.0.0.4 and a few other servers, we have mongos running. |
| Comment by Laxman P [ 13/Jun/18 ] |
|
Hi Bruce - Here's some additional information on the sequence of steps. Note: all times are UTC
|
| Comment by Bruce Lucas (Inactive) [ 12/Jun/18 ] |
|
Hi Isaac, Thanks, it would certainly be helpful to understand whether there was anything unusual happening from an application perspective around that time. One of the possible factors contributing to the incident was a very long-running storage-engine transaction, from about 22:15:51 to about 22:21:39, but I have not been able to identify the cause of that (it may or may not have been related to specific application activity). Coinciding with the start of that transaction was a connection from 10.0.0.4; what mongod or mongos processes are running on that machine? If you have the log files for those mongod or mongos processes covering the incident, we may be able to get more information about the cause of the long-running transaction.
Regarding the time when the stack traces were captured: if you still have the original pstack-03b.txt file on the machine where you collected the stack traces, its file creation date would tell us when they were collected. As I mentioned, this would be very helpful in understanding the incident, since we see a couple of distinct phases to the incident in the diagnostic.data that you uploaded, and without the timestamp I'm not sure which phase the stack traces correspond to.
Beyond that, if the incident occurs again can you please upload diagnostic.data and logs, and if possible stack traces with a timestamp. This will help us identify common elements across the incidents. Also, is there any possibility to upgrade to the most recent version of 3.6 in order to pick up the fix to Thanks, |
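A quick way to answer the timestamp question is sketched below; the file name is taken from the comment, but its location on the capture machine is an assumption:

```
# The file's timestamps usually pin down when pstack finished writing its output.
# On most filesystems "Modify" is the reliable field; "Birth"/creation may be absent.
stat pstack-03b.txt
ls -l --time-style=full-iso pstack-03b.txt
```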
| Comment by Isaac Cruz [ 12/Jun/18 ] |
|
Hi Bruce, |
| Comment by Bruce Lucas (Inactive) [ 12/Jun/18 ] |
|
Thanks Isaac. What information do you have about exactly when the stack traces were captured, or at least a time range? There were a couple of different phases to the incident, and knowing when the stack traces were collected would help us form a complete picture. |
| Comment by Isaac Cruz [ 12/Jun/18 ] |
|
Hi Bruce, I have uploaded both files. The incident happened on Jun 11, between 22:20 and 23:18, when the server was restarted (all times UTC). Unfortunately, data from the previous incidents has already been removed. Please let me know if you need anything else. Thanks,
|
| Comment by Bruce Lucas (Inactive) [ 12/Jun/18 ] |
|
Hi Isaac, Can you please upload an archive of the diagnostic.data directory and the mongod log covering the incident?
If diagnostic.data for the earlier incidents is still available, can you please do the same for those incidents. You can check whether it is still available by looking at the files in the diagnostic.data directory; the name of each file reflects the beginning of the time range it covers. Thanks, |
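For example, a quick way to check which time ranges are still covered; the dbPath below is an assumption (the RPM default), so substitute your actual dbPath:

```
# FTDC files are rotated and eventually deleted; each metrics.* file name encodes
# the UTC start of the interval it covers, e.g. metrics.2018-06-11T22-09-18Z-00000.
ls -l /var/lib/mongo/diagnostic.data/
```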