[SERVER-21625] WiredTiger Deadlock Created: 23/Nov/15  Updated: 01/Feb/16  Resolved: 01/Feb/16

Status: Closed
Project: Core Server
Component/s: Sharding, WiredTiger
Affects Version/s: 3.0.7
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Dan Doyle Assignee: Ramon Fernandez Marina
Resolution: Done Votes: 0
Labels: RF
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2016-01-20 at 3.09.29 pm.png     File currentOp.json     Text File hung_shard.log     Text File parsed_iostat.log     Text File parsed_ss.log    
Operating System: ALL

 Description   

We have been periodically experiencing what seems to be a deadlock in our mongo deployment. Our setup (currently all on one physical machine):

CentOS 6.6
mongodb 3.0.7
3x mongod shard servers, wiredtiger
3x mongod config servers
1x mongos

We currently have 14 databases, most with 5 or 6 collections inside them. The system is heavily write based as we're collecting timeseries data. Total data size right now according to a "show dbs" is around 1.7 TB.

What happens is that the system "randomly" grinds to a halt. The time between incidents can be as short as a few hours; the longest stretch was around 5 days. The machine itself stays perfectly responsive, but all mongo processes sit at 0% CPU and load drops to 0, all the usual symptoms you'd expect. When running "db.currentOp()" we can see a lot of queries with long "microsecs_running" values, typically all with an "opid" attribute referencing one of the shards. Attempting something like "show dbs" hangs in this state.
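
For anyone reproducing this, something like the following can be pasted into a mongos shell to list the long runners; the 60-second threshold is just an illustration, not a value we rely on:

// illustrative: list operations running longer than ~60s, as seen through mongos
db.currentOp().inprog.forEach(function (op) {
    if (op.microsecs_running && op.microsecs_running > 60 * 1000 * 1000) {
        print(op.opid + "  " + op.op + "  " + op.ns + "  " + Math.round(op.microsecs_running / 1000000) + "s");
    }
});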

A "kill -9" of the shard in question will stop the process. Upon starting it up again everything resumes normally without having to touch any of the other processes. It is always one of the mongod shard processes, never mongos or mongod config.

We enabled level 2 verbosity logging in the hope of getting a better view into what was going on. I am including an example of the "db.currentOp()" output as well as the end of the log from when the process last hung. Please note that while there are COLLSCAN plans present in the currentOp output here, we have since indexed everything to cover all of our queries and the issue persists.
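
For completeness, raising verbosity at runtime can be done with something like the following (the port here is a placeholder; setting systemLog.verbosity in the config file is the persistent equivalent):

# illustrative: bump global log verbosity to level 2 on one mongod
mongo --port 27018 --eval 'db.adminCommand({setParameter: 1, logLevel: 2})'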

Please let me know if I can provide any more information or help. This is obviously quite a big issue to us as it happens about once every 1-2 days and requires human intervention to resolve.



 Comments   
Comment by Ramon Fernandez Marina [ 01/Feb/16 ]

daldoyle, after examining the data you've uploaded we haven't been able to find evidence of a bug in the server. The main theory is that machine resources are being stretched very thin, and a small increase in resource demand on this box may be pushing the machine over the edge (e.g. increased page faults, swapping, iowait) and causing these apparent deadlocks. I'd strongly recommend you spread the load over multiple servers/VMs.
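
A quick way to confirm or rule out resource exhaustion the next time a stall happens is to grab a few snapshots with standard OS tools while the mongod processes appear idle, for example:

# sample memory, swap and I/O activity for ~10 seconds during the stall
vmstat 1 10
iostat -x 1 10
free -m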

One thing you may consider is upgrading to MongoDB 3.2.1 (the latest stable release right now), which contains a lot of performance and stability improvements. It also has the advantage that it collects diagnostic data automatically, so if this behavior happens again we can gather that data and take a closer look.

Without any further data that may point to a bug in the server I'm going to close this ticket. Please consider using more servers and upgrading to MongoDB 3.2.1, and if you experience this behavior after that we can reopen this ticket. If none of these alternatives is viable, you can post on the mongodb-user group for support assistance. See also our Technical Support page for additional support resources.

Regards,
Ramón.

Comment by Dan Doyle [ 12/Jan/16 ]

Happy New Year to all!

I was hoping we could pick up on this thread again now that the holidays are through. Is there anything else we can help with in terms of tracking this down?

Comment by Dan Doyle [ 09/Dec/15 ]

Just wanted to follow up to see if there's anything else you guys need. Thanks.

-Dan

Comment by Dan Doyle [ 02/Dec/15 ]

Bruce,

We are using a worker queue system to push data into MongoDB, primarily through updates (vs inserts), and our monitoring system is set up to monitor the overall queue length to let us know if things are backing up. This has been our primary mode of detection. All times are in UTC both in this comment and in the logs.

Also, there's an off-by-one error between the mongod shard names as mongo reports them and the proc/log names: mongo refers to the first shard as shard0, whereas the person who set up this server referred to it as shard1. I will use the terminology as mongo sees it.

12-01-2015 00:23:32
Alert that the queue size was passing the warning threshold, meaning our queries weren't completing as quickly as normal. This may or may not be a red herring as it's possible that certain other events might be causing extra load on the server and slowing things down temporarily.

12-01-2015 02:16:32
Queue size has passed the warning threshold and entered the critical threshold. Oncall engineer is called to investigate and notices that in the output of db.currentOp() there are a lot of queries blocking on shard0. A kill -9 and startup of that shard causes the system to resume processing and the queue length begins to drop, ultimately clearing the alarm within a few minutes. System remains stable after that.

Hopefully that helps. I am more than happy to answer or clarify anything else, just let me know.

Thanks,
-Dan

Comment by Bruce Lucas (Inactive) [ 02/Dec/15 ]

Thanks Dan. Can you tell us the timeline of the event from the perspective of what you observed, as accurately as you can? (Please be sure to include timezone information.) This will help ensure we are looking at the right thing in the data you've provided.

Thanks,
Bruce

Comment by Dan Doyle [ 02/Dec/15 ]

Stephen,

Thanks for creating that. I have just uploaded a .tar of the various log files gzip'd.

-Dan

Comment by Stennie Steneker (Inactive) [ 02/Dec/15 ]

Hi Dan,

I've created a custom endpoint where you can securely upload larger files for this server issue:
https://10gen-httpsupload.s3.amazonaws.com/upload_forms/fdadf054-4c52-4156-9a49-11688eeb1538.html

Please add a comment after any new files are uploaded.

Thanks,
Stephen

Comment by Dan Doyle [ 01/Dec/15 ]

We had another event last night. Is there an alternate way to upload larger files? Even gzip'd the debug log is 213M which is over the threshold on Jira. If not I can work on trimming down the head of the file until it fits, but I wanted to show everything we had as requested.

-Dan

Comment by Bruce Lucas (Inactive) [ 30/Nov/15 ]

Just the data bearing nodes, and just the primaries probably, although the secondaries might be useful as well.

Thanks,
Bruce

Comment by Dan Doyle [ 30/Nov/15 ]

To be clear, we'll need to run this N times connecting to each mongod instance? Does this include the config servers as well or just the data bearing ones?

-Dan

Comment by Dan Doyle [ 30/Nov/15 ]

Bruce,

Ha, okay. I'll turn the logging back on but we will have to wait until it happens again. When it next occurs I will upload the entire log along with the relevant mongod log.

-Dan

Comment by Bruce Lucas (Inactive) [ 30/Nov/15 ]

Hi Dan,

Thanks for uploading the logs. Unfortunately the serverStatus timeseries was collected through mongos, not mongod, so it doesn't include the key statistics we need to look at, which are only available by communicating directly with mongod. Can you adjust the mongo command that generates ss.log by adding a --port option so that it connects directly to mongod rather than mongos? Please run it as Stephen suggested with a sampling frequency of 1 s (i.e. sleep(1000)). Also please upload the complete ss.log and iostat.log even if they are large, because we can get useful information by looking at the behavior over time leading up to the event. It will also be helpful to have the corresponding mongod log and the timing of the event from your perspective so we can correlate everything.
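
As a sketch of the adjusted command, with a placeholder port for one of the data-bearing mongods and a per-shard output file name:

mongo --port 27018 --eval "while(true) {print(JSON.stringify(db.serverStatus({tcmalloc:1}))); sleep(1000)}" > ss_shard0.log &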

Thanks,
Bruce

Comment by Dan Doyle [ 30/Nov/15 ]

Hi Stephen,

Thanks for the information, we'll definitely take that into consideration.

It looks like we had another event over the holiday weekend, so I'm including the information requested as attachments. I have pruned the log files down to roughly the 10-15 minutes leading up to the event in case that's helpful. If this is still too much, let me know and I can prune further.

-Dan

Comment by Stennie Steneker (Inactive) [ 30/Nov/15 ]

Hi Dan,

Please let us know when you have further information. In your original description you suggested the problem recurs every 1-2 days, so hopefully we'll be able to progress the investigation this week.

I revised the stats collection suggestion to sample every second as that should be sufficient data:

mongo --eval "while(true) {print(JSON.stringify(db.serverStatus({tcmalloc:1}))); sleep(1000)}" > ss.log &
iostat -k -t -x 1 > iostat.log &

A few comments on your configuration:

No, there is no partitioning. We have tried cgroups in the past but mongo did not appear to play nicely with them (prior to 3.0.7, maybe fixed since then). The longer term plan here is to move the shards onto other physical machines but we're doing initial scaling and testing on a single machine for logistical / monetary purposes.

We generally don't recommend having multiple mongod processes on the same server as they are contending for the same resources. Using cgroups is definitely one approach to limit unexpected interaction, but the configuration is rather non-trivial.

It sounds like you've set cacheSizeGB conservatively enough on the data-bearing mongods to leave a reasonable amount of RAM free for the filesystem cache, but some sort of resource partitioning would still likely be useful. Scaling and testing on a single machine is not fully representative of the behaviour with multiple machines or environments with partitioned resources.
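
As an aside, a rough way to sanity-check how full each WiredTiger cache actually gets (the port is a placeholder; the field names are as they appear in serverStatus):

# illustrative: report current WiredTiger cache usage vs. the configured maximum
mongo --port 27018 --eval 'var c = db.serverStatus().wiredTiger.cache; print(c["bytes currently in the cache"] + " / " + c["maximum bytes configured"] + " bytes");'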

We are using an ext4 filesystem mounted with noatime.

FYI, there have been some reported performance issues with WiredTiger and ext4 under heavy write load (e.g. SERVER-18314). Our current recommendation is to use XFS if possible.
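
If you do switch to XFS, a hypothetical fstab entry for a dedicated data volume would look something like this (device, mount point and options are placeholders to adapt):

# example only: dedicated XFS data volume mounted with noatime
/dev/sdb1    /var/lib/mongo    xfs    defaults,noatime    0 0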

For more configuration recommendations, see the MongoDB Production Notes.

Regards,
Stephen

Comment by Dan Doyle [ 25/Nov/15 ]

Stephen,

Thanks for the reply. Sorry if this is too pedantic, but I am not actually sure whether the queries start to become slow or whether they just suddenly halt. Since this is a random occurrence we don't have a person actively watching the system at all times, and we can only observe the end result once our monitoring triggers alarms. I will start running the requested commands now and prune the output down the next time it occurs.

More info re: deployment:

Have you partitioned physical server resources using VMs/containers/cgroups for the various mongod processes?

  • No, there is no partitioning. We have tried cgroups in the past but mongo did not appear to play nicely with them (prior to 3.0.7, maybe fixed since then). The longer term plan here is to move the shards onto other physical machines but we're doing initial scaling and testing on a single machine for logistical / monetary purposes.

What filesystem are you using?

  • We are using an ext4 filesystem mounted with noatime.

What is the total RAM available?

  • Total RAM on the box is 64G.

What is your WiredTiger cacheSizeGB value for each instance?

  • The storage.wiredTiger portion of the config for each data instance is:

    wiredTiger:
      engineConfig:
        journalCompressor: "snappy"
        cacheSizeGB: 5

Comment by Stennie Steneker (Inactive) [ 25/Nov/15 ]

Hi Dan,

Thanks for reporting this issue. We'll need some additional information to investigate.

Our setup (currently all on one physical machine)

Can you describe your deployment in further detail:

  • Have you partitioned physical server resources using VMs/containers/cgroups for the various mongod processes?
  • What filesystem are you using?
  • What is the total RAM available?
  • What is your WiredTiger cacheSizeGB value for each instance?

Please let me know if I can provide any more information or help. This is obviously quite a big issue to us as it happens about once every 1-2 days and requires human intervention to resolve.

Can you collect some additional information on the mongod shard(s) when the queries start to become slow:

mongo --eval "while(true) {print(JSON.stringify(db.serverStatus({tcmalloc:1}))); sleep(1000)}" > ss.log &
iostat -k -t -x 1 > iostat.log &

The intent is to record system information at the same time you're experiencing slow queries or apparent stalls. Please continue to run these commands until you see some slow queries, after which you can interrupt the monitoring commands and upload the ss.log and iostat.log files so we can analyze them.
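
Once a stall has been captured, something like the following wraps up the collection, assuming both collectors were started as background jobs in the same shell session:

# stop the background collectors and compress the logs before uploading
kill %1 %2
gzip ss.log iostat.log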

Thanks,
Stephen
