[SERVER-15005] Connections Spike on Secondary, Load Jumps, Server Becomes Unresponsive Created: 22/Aug/14 Updated: 01/Apr/15 Resolved: 01/Apr/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Index Maintenance |
| Affects Version/s: | 2.6.3, 2.6.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Michael Saffitz | Assignee: | Ramon Fernandez Marina |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | Amazon Linux AMI release 2014.03 |
| Attachments: |
| Issue Links: |
| Operating System: | Linux |
| Participants: | |
| Description |
|
We have a 3-shard cluster, with each shard consisting of a primary, a secondary, and a hidden secondary (used for EBS snapshots). Each of the nodes is identical to the others. We've seen the issue described below once every 2-3 weeks on 2.6.3; after upgrading to 2.6.4, we've seen it at least hourly, sometimes as frequently as every 5-10 minutes. When it occurs, our production system goes down.

The symptom is a sudden spike in the number of connections to the visible secondary on our first shard. We haven't seen it occur on the primary, nor on any of the other shards. The connections all appear to deadlock: I/O on the machine drops dramatically when this happens. I've attached a screenshot of the New Relic reporting for the machine that shows this (user CPU spiking while disk I/O goes to 0), as well as the MMS reports, which show the connections spiking while the number of operations falls dramatically. Interestingly, lock activity also spikes during this time, and it is all coming from the local database.

There is not much of interest in the logs, and certainly no smoking gun. I've attached the log; the spike in connections appears to occur at 2014-08-22T18:38:10.730+0000.

Finally, restarting the affected mongod immediately resolves the issue. |
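For context, a minimal mongo-shell sketch (not from the original report; the 60-second threshold is an arbitrary illustration) of the diagnostics that could be captured on the affected secondary while a spike is in progress:

```javascript
// Run against the affected secondary while the connection spike is in progress.

// Current vs. available connections -- sudden jumps in "current" should match the MMS graphs.
printjson(db.serverStatus().connections);

// In-flight operations, including idle and system ones, to see whether queries are
// piling up behind long-running operations. The 60-second cutoff is arbitrary.
db.currentOp(true).inprog.forEach(function (op) {
    if (op.secs_running && op.secs_running > 60) {
        printjson({ opid: op.opid, ns: op.ns, secs_running: op.secs_running, query: op.query });
    }
});
```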
| Comments |
| Comment by Ramon Fernandez Marina [ 01/Apr/15 ] |
|
m@saffitz.com, now that 2.6.9 is released and includes a fix for

Regards, |
| Comment by Ramon Fernandez Marina [ 18/Mar/15 ] |
|
m@saffitz.com, looks like this ticket fell through the cracks, my apologies. You may have seen that

If this is still an issue for you, you may want to give 2.6.9-rc0 a try and report back any issues you find. Note that a 2.6.9 release will follow this release candidate soon if no blocking issues are reported against it.

Regards, |
| Comment by Michael Saffitz [ 25/Aug/14 ] |
|
Thanks for the details. Is the query planner stable? It looks like running the query in the console results in using the `{conversation_id: 1, nonce: 1}` index, while hinting the `{_id: 1}` index definitely takes a lot longer. I expect the average number of documents matched by a query of this form to be ~30 or fewer, so it seems like I could just hint the first index to force that plan (see the sketch below) and work around this issue until
|
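A sketch of that workaround in the mongo shell; only the collection, field, and index names come from this ticket, while the filter value and the sort direction are assumptions:

```javascript
// Force the {conversation_id: 1, nonce: 1} index via hint() so the planner cannot
// fall back to the {_id: 1} index. With ~30 matching documents the in-memory sort
// on _id stays cheap. The ObjectId below is a placeholder, not a real value.
db.messages.find({ conversation_id: ObjectId("53f76ad2e4b0c8d1a2b3c4d5") })
           .sort({ _id: 1 })
           .hint({ conversation_id: 1, nonce: 1 })
```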
| Comment by Thomas Rueckstiess [ 25/Aug/14 ] |
|
Hi Michael,

The query portion of the slow queries filters on several fields, and presumably the most selective field here is "conversation_id". The sort portion is on the _id field.

The Query Planner would evaluate 3 index scan plans:
As none of the queries return any documents (nreturned:0), the amount of work during plan evaluation is the same for every plan. However, there is a tie-breaker that prefers plans that don't need to sort in memory, which here is the {_id: 1} index. Therefore we believe that you're experiencing another instance (or a slight variation) of

To avoid the problem altogether, you can run some tests with an index on {conversation_id: 1, _id: 1}, which is probably what you'd need to answer these queries efficiently. Note that index intersection does not currently work for just the sort portion of a query, so MongoDB can't combine the existing indexes for this query.

Regards, |
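A minimal sketch of that test, using the 2.6-era `ensureIndex()` shell helper (later renamed `createIndex()`); the filter value and sort direction are placeholders:

```javascript
// Build the suggested compound index; background:true keeps the build from blocking
// other operations on the node where it runs.
db.messages.ensureIndex({ conversation_id: 1, _id: 1 }, { background: true });

// Check which plan the optimizer now picks for the problematic query shape.
// With the new index, the plan should no longer rely on the {_id: 1} index scan.
db.messages.find({ conversation_id: ObjectId("53f76ad2e4b0c8d1a2b3c4d5") })  // placeholder value
           .sort({ _id: 1 })
           .explain()
```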
| Comment by Michael Saffitz [ 25/Aug/14 ] |
|
Thanks Thomas. That column is indexed, so I suspect that this is more of a symptom than the cause. Here is the full list of indexes for that collection:
The collection is not sharded. |
| Comment by Thomas Rueckstiess [ 25/Aug/14 ] |
|
Hi Michael,

I've checked your logs and it seems that a large number of slow queries on the namespace apptentive_production.messages (red dots) are building up and slowing down your system. The time these queries start queueing up matches the time you experience the performance impact.
The queries are scanning up to half a million documents and taking several minutes at times, but not matching anything. The chosen index is on {_id: 1} in order to avoid in-memory sorts (as you request a sort by _id). Can you please provide a list of the indexes that you have set up on apptentive_production.messages?

Regards, |
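For reference, a one-line shell sketch of how that index list could be pulled:

```javascript
// List every index defined on the collection in question.
db.getSiblingDB("apptentive_production").messages.getIndexes()
```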