[SERVER-21625] WiredTiger Deadlock Created: 23/Nov/15 Updated: 01/Feb/16 Resolved: 01/Feb/16 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding, WiredTiger |
| Affects Version/s: | 3.0.7 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Dan Doyle | Assignee: | Ramon Fernandez Marina |
| Resolution: | Done | Votes: | 0 |
| Labels: | RF | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Operating System: | ALL |
| Participants: |
| Description |
|
We have been periodically experiencing what seems to be a deadlock on our MongoDB setup (currently all on one physical machine, running CentOS 6.6). We currently have 14 databases, most with 5 or 6 collections inside them. The system is heavily write based as we're collecting timeseries data. Total data size right now according to a "show dbs" is around 1.7 TB.

What happens is that "randomly" the system grinds to a halt. Sometimes this takes only a few hours; the longest stretch was around 5 days. The machine is perfectly responsive, but all mongo processes are using 0% CPU. Load drops to 0, all the usual stuff you'd expect. When performing a "db.currentOp()" we can see a lot of queries with long "microsecs_running" values, typically all with an "opid" attribute referencing one of the shards. Attempting something like "show dbs" hangs in this state.

A "kill -9" of the shard in question will stop the process, and upon starting it up again everything resumes normally without having to touch any of the other processes. It is always one of the mongod shard processes, never mongos or the config mongod.

We enabled level 2 verbosity logging to hopefully get a better view into what was going on. I am including an example of the "db.currentOp()" output as well as the end of the log from when the process last hung. Please note that while there are "COLLSCAN"s present in the currentOp output here, we have since tried indexing everything to cover all of our queries and the issue persists.

Please let me know if I can provide any more information or help. This is obviously quite a big issue for us as it happens about once every 1-2 days and requires human intervention to resolve. |
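For illustration only (this sketch is not part of the original report): one way to list the long-running operations described above from a shell, assuming a mongos router on port 27017; the port and the 60-second threshold are placeholder values.

# List in-progress operations that have been running for more than 60 seconds.
mongo --port 27017 --quiet --eval '
  db.currentOp(true).inprog.forEach(function (op) {
    if (op.microsecs_running && op.microsecs_running > 60 * 1000 * 1000) {
      print(op.opid + "  " + op.op + "  " + (op.ns || "") + "  " +
            Math.round(op.microsecs_running / 1000000) + "s");
    }
  });
'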
| Comments |
| Comment by Ramon Fernandez Marina [ 01/Feb/16 ] | ||
|
daldoyle, after examining the data you've uploaded we haven't been able to find evidence of a bug in the server. The main theory is that machine resources are being stretched very thin, and a small increase in resource demand on this box may be pushing the machine over the edge (e.g. increased page faults, swapping, iowait), causing these apparent deadlocks. I'd strongly recommend you spread the load over multiple servers/VMs.

One thing you may consider is upgrading to MongoDB 3.2.1 (the latest stable release right now), which contains a lot of performance and stability improvements. It also has the advantage that it collects diagnostic data automatically, so if this behavior happens again we can collect the data and take a closer look.

Without any further data that may point to a bug in the server I'm going to close this ticket. Please consider using more servers and upgrading to MongoDB 3.2.1, and if you experience this behavior after that we can reopen this ticket. If none of these alternatives is viable you can post on the mongodb-user group for support assistance. See also our Technical Support page for additional support resources. Regards, | ||
| Comment by Dan Doyle [ 12/Jan/16 ] | ||
|
Happy New Year to all! I was hoping we could pick up on this thread again now that the holidays are through. Is there anything else we can help with in terms of tracking this down? | ||
| Comment by Dan Doyle [ 09/Dec/15 ] | ||
|
Just wanted to follow up to see if there's anything else you guys need. Thanks. -Dan | ||
| Comment by Dan Doyle [ 02/Dec/15 ] | ||
|
Bruce, We are using a worker queue system to push data into MongoDB, primarily through updates (vs. inserts), and our monitoring system watches the overall queue length to let us know if things are backing up. This has been our primary mode of detection. All times are in UTC, both in this comment and in the logs.

Also, there is an off-by-one mismatch between the mongod shard names as mongo reports them and the process/log names: mongo refers to the first shard as shard0, whereas the person who set up this server referred to it as shard1. I will use the terminology as mongo sees it.

Timeline of the event (UTC):
12-01-2015 00:23:32
12-01-2015 02:16:32

Hopefully that helps. I am more than happy to answer or clarify anything else, just let me know. Thanks, | ||
| Comment by Bruce Lucas (Inactive) [ 02/Dec/15 ] | ||
|
Thanks Dan. Can you tell us the timeline of the event from the perspective of what you observed, as accurately as you can? (Please be sure to include timezone information.) This will help ensure we are looking at the right thing in the data you've provided. Thanks, | ||
| Comment by Dan Doyle [ 02/Dec/15 ] | ||
|
Stephen, Thanks for creating that. I have just uploaded a .tar of the various log files gzip'd. -Dan | ||
| Comment by Stennie Steneker (Inactive) [ 02/Dec/15 ] | ||
|
Hi Dan, I've created a custom endpoint where you can securely upload larger files for this server issue: Please add a comment after any new files are uploaded. Thanks, | ||
| Comment by Dan Doyle [ 01/Dec/15 ] | ||
|
We had another event last night. Is there an alternate way to upload larger files? Even gzip'd, the debug log is 213 MB, which is over the attachment size limit on Jira. If not, I can work on trimming down the head of the file until it fits, but I wanted to show everything we had as requested. -Dan | ||
| Comment by Bruce Lucas (Inactive) [ 30/Nov/15 ] | ||
|
Just the data bearing nodes, and just the primaries probably, although the secondaries might be useful as well. Thanks, | ||
| Comment by Dan Doyle [ 30/Nov/15 ] | ||
|
To be clear, we'll need to run this N times connecting to each mongod instance? Does this include the config servers as well or just the data bearing ones? -Dan | ||
| Comment by Dan Doyle [ 30/Nov/15 ] | ||
|
Bruce, Ha, okay. I'll turn the logging back on but we will have to wait until it happens again. When it next occurs I will upload the entire log along with the relevant mongod log. -Dan | ||
| Comment by Bruce Lucas (Inactive) [ 30/Nov/15 ] | ||
|
Hi Dan, Thanks for uploading the logs. Unfortunately the serverStatus timeseries was collected through mongos, not mongod, so it doesn't include the key statistics we need to look at, which are only available by communicating directly with mongod. Can you adjust the mongo command that generates ss.log by adding a --port option so that it connects directly to mongod rather than mongos (roughly as in the sketch below)? Please run it as Stephen suggested, with a sampling frequency of 1 s (i.e. sleep(1000)). Also please upload the complete ss.log and iostat.log even if they are large, because we can get useful information by looking at the behavior over time leading up to the event. It will also be helpful to have the corresponding mongod log and the timing of the event from your perspective so we can correlate everything. Thanks, | ||
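As a rough illustration of the adjustment being requested (the port number and output file name are placeholders, not values from this ticket), the adjusted serverStatus sampling command could look something like:

# Run one copy per shard mongod, substituting each mongod's actual port.
mongo --port 27018 --quiet --eval 'while (true) { printjson(db.serverStatus()); sleep(1000); }' > ss-shard0.log &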
| Comment by Dan Doyle [ 30/Nov/15 ] | ||
|
Hi Stephen, Thanks for the information, we'll definitely take that into consideration. It looks like we had another event over the holiday weekend, so I'm including the information requested as attachments. I have pruned the log files down to roughly the 10-15 minutes leading up to the event in case that's helpful. If this is too much let me know and I can prune it down further. -Dan | ||
| Comment by Stennie Steneker (Inactive) [ 30/Nov/15 ] | ||
|
Hi Dan, Please let us know when you have further information. In your original description you suggested the problem recurs every 1-2 days, so hopefully we'll be able to make progress on the investigation this week. I have revised the stats collection suggestion to sample every second (i.e. sleep(1000) in the collection loop), as that should be sufficient data.
A few comments on your configuration:
We generally don't recommend having multiple mongod processes on the same server because they contend for the same resources. Using cgroups is definitely one approach to limit unexpected interaction, but the configuration is rather non-trivial. It sounds like you've set cacheSizeGB conservatively enough on the data-bearing mongods to leave a reasonable amount of RAM free for filesystem cache, but some sort of resource partitioning would likely be useful (a configuration sketch follows this comment). Scaling and testing on a single machine is not fully representative of the behaviour with multiple machines or environments with partitioned resources.
FYI, there have been some reported performance issues with WiredTiger and ext4 under heavy write load (e.g. in previously reported tickets). For more configuration recommendations, see the MongoDB Production Notes. Regards, | ||
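As an illustration of the resource-partitioning point above, a hypothetical sketch follows; the ports, paths, sizes and cgroup names are placeholders and are not taken from this deployment.

# Cap the WiredTiger cache per shard so the mongods leave RAM free for the
# filesystem cache, and (optionally) bound each process's total memory with a
# cgroup, using the libcgroup tools available on CentOS 6.
cgcreate -g memory:mongod-shard0
cgset -r memory.limit_in_bytes=16G mongod-shard0
cgexec -g memory:mongod-shard0 \
    mongod --shardsvr --port 27018 --dbpath /data/shard0 \
           --wiredTigerCacheSizeGB 8 --fork --logpath /var/log/mongodb/shard0.log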
| Comment by Dan Doyle [ 25/Nov/15 ] | ||
|
Stephen, Thanks for the reply. Sorry if this is too pedantic - I am not actually sure whether the queries start to become slow or whether they just suddenly halt. Since this is a random occurrence we don't have a person actively watching all the time, and we can only observe the end result once our monitoring triggers alarms. I will run the commands requested starting now and then prune the output down the next time it occurs. More info re: deployment:
Have you partitioned physical server resources using VMs/containers/cgroups for the various mongod processes?
What filesystem are you using?
What is the total RAM available?
What is your WiredTiger cacheSizeGB value for each instance?
| ||
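For reference, here is a hypothetical set of commands for checking these details on a CentOS 6 host; the dbpath and config file locations below are examples only, not the reporter's actual configuration.

df -T /data                             # filesystem type backing the dbpath
free -g                                 # total and used RAM in GB
grep -i cachesizegb /etc/mongod*.conf   # configured WiredTiger cache per instance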
| Comment by Stennie Steneker (Inactive) [ 25/Nov/15 ] | ||
|
Hi Dan, Thanks for reporting this issue. We'll need some additional information to investigate.
Can you describe your deployment in further detail? In particular:
Have you partitioned physical server resources using VMs/containers/cgroups for the various mongod processes?
What filesystem are you using?
What is the total RAM available?
What is your WiredTiger cacheSizeGB value for each instance?
Can you also collect some additional information on the mongod shard(s) when the queries start to become slow, sampling serverStatus to ss.log and iostat output to iostat.log (roughly along the lines of the collection sketch at the end of this comment)?
The intent is to record system information at the same time you're experiencing slow queries or apparent stalls. Please continue to run these commands until you see some slow queries, after which you can interrupt the monitoring commands and upload the ss.log and iostat.log files so we can analyze them. Thanks, |
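A minimal sketch of this kind of collection, using a one-second sampling interval; the port is a placeholder for the relevant mongod's port, and the exact flags are illustrative rather than the original suggestion.

# Sample serverStatus once per second into ss.log:
mongo --port 27018 --quiet --eval '
  while (true) {
    printjson(db.serverStatus());
    sleep(1000);
  }
' > ss.log &

# Record extended device and CPU statistics once per second into iostat.log:
iostat -xmt 1 > iostat.log &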