[SERVER-15005] Connections Spike on Secondary, Load Jumps, Server Becomes Unresponsive Created: 22/Aug/14  Updated: 01/Apr/15  Resolved: 01/Apr/15

Status: Closed
Project: Core Server
Component/s: Index Maintenance
Affects Version/s: 2.6.3, 2.6.4
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Michael Saffitz Assignee: Ramon Fernandez Marina
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Amazon Linux AMI release 2014.03
r3.2xlarge instances


Attachments: PNG File Screen Shot 2014-08-22 at 11.59.52 AM.png     PDF File mms.pdf     File mongo-log.log.gz     PNG File slow_queries.png    
Issue Links:
Related
is related to SERVER-14071 For queries with .sort(), bad non-blo... Closed
Operating System: Linux
Participants:

 Description   

We have a 3-shard cluster, with each shard consisting of a primary, a secondary, and a hidden secondary (for EBS snapshots). All of the nodes are identical.

We've seen the issue described below once every 2-3 weeks on 2.6.3. After upgrading to 2.6.4, we saw it at least hourly, sometimes as frequently as every 5-10 minutes. When it occurs, our production system goes down.

The main symptom is a sudden spike in the number of connections to the visible secondary on our first shard. We haven't seen it occur on the primary, nor on any of the other shards.

The connections all seem to deadlock: I/O on the machine drops dramatically when this occurs. I've attached a New Relic screenshot for the machine that shows this (user CPU spiking while disk I/O goes to 0). I've also attached the MMS reports, which show connections spiking while the number of operations falls dramatically.

Interestingly, lock activity spikes during this time as well, and it is all coming from the local database.

There is not much of interest in the logs, and certainly no smoking gun. I've attached the log; the spike in connections appears to occur at 2014-08-22T18:38:10.730+0000.

Finally, restarting the affected mongod immediately resolves the issue.
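
For anyone triaging a similar incident, the two symptoms above (connection spike, lock spike from the local database) can be watched from a shell on the affected secondary. A minimal sketch, assuming the standard 2.6 serverStatus sections (not part of the original report):

// Connection counts; the "current" field is what spikes during the incident.
db.serverStatus().connections

// Per-database lock statistics on 2.6; the description above points at the
// entry for the "local" database.
db.serverStatus().locks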



 Comments   
Comment by Ramon Fernandez Marina [ 01/Apr/15 ]

m@saffitz.com, now that 2.6.9 has been released and includes a fix for SERVER-14071, I'm going to close this ticket. If you upgrade to 2.6.9 and the problem persists, please feel free to re-open the ticket.

Regards,
Ramón.

Comment by Ramon Fernandez Marina [ 18/Mar/15 ]

m@saffitz.com, looks like this ticket fell through the cracks – my apologies.

You may have seen that SERVER-14071 was fixed recently, and just yesterday we released a 2.6.9-rc0 release candidate containing the fix.

If this is still an issue for you, you may want to give 2.6.9-rc0 a try and report back any issues you find. Note that the 2.6.9 release will follow this release candidate shortly if no blocking issues are reported against it.

Regards,
Ramón.

Comment by Michael Saffitz [ 25/Aug/14 ]

Thanks for the details. Is the query planner stable? It looks like running the query in the console results in using the {conversation_id: 1, nonce: 1} index, while hinting the {_id: 1} index definitely takes a lot longer. I expect a query of this form to match ~30 documents or fewer, so it seems like I could just hint the first index to force that plan (see the sketch after the explain output below) and work around this issue until SERVER-14071 is addressed?

mongos-prod-1(mongos-2.6.3)[mongos] apptentive_production> db.messages.find({deleted_at: null, conversation_id: ObjectId('534d0cf835731f6b2301269e'), hidden: { $ne: true }, _id: { $gt: ObjectId('53ebab9806878e24a704828f')}} ).sort({_id: 1}).explain()
{
  "cursor": "BtreeCursor conversation_id_1_nonce_1",
  "isMultiKey": false,
  "n": 0,
  "nscannedObjects": 1,
  "nscanned": 1,
  "nscannedObjectsAllPlans": 2,
  "nscannedAllPlans": 8,
  "scanAndOrder": true,
  "indexOnly": false,
  "nYields": 0,
  "nChunkSkips": 0,
  "millis": 0,
  "indexBounds": {
    "conversation_id": [
      [
        ObjectId("534d0cf835731f6b2301269e"),
        ObjectId("534d0cf835731f6b2301269e")
      ]
    ],
    "nonce": [
      [
        {
          "$minElement": 1
        },
        {
          "$maxElement": 1
        }
      ]
    ]
  },
  "server": "mongo-prod-r1-d:27017",
  "filterSet": false,
  "millis": 0
}

mongos-prod-1(mongos-2.6.3)[mongos] apptentive_production> db.messages.find({deleted_at: null, conversation_id: ObjectId('534d0cf835731f6b2301269e'), hidden: { $ne: true }, _id: { $gt: ObjectId('53ebab9806878e24a704828f')}} ).sort({_id: 1}).hint({_id: 1}).explain()
{
  "cursor": "BtreeCursor _id_",
  "isMultiKey": false,
  "n": 0,
  "nscannedObjects": 648390,
  "nscanned": 648390,
  "nscannedObjectsAllPlans": 648390,
  "nscannedAllPlans": 648390,
  "scanAndOrder": false,
  "indexOnly": false,
  "nYields": 5975,
  "nChunkSkips": 0,
  "millis": 54447,
  "indexBounds": {
    "_id": [
      [
        ObjectId("53ebab9806878e24a704828f"),
        ObjectId("ffffffffffffffffffffffff")
      ]
    ]
  },
  "server": "mongo-prod-r1-d:27017",
  "filterSet": false,
  "millis": 54447
}
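
For reference, the hinted form of the query discussed above would look roughly like this (a sketch; whether forcing this plan is a safe long-term workaround is exactly the open question):

// Force the {conversation_id: 1, nonce: 1} index and accept an in-memory
// sort on _id over the ~30 expected matches.
db.messages.find({
  deleted_at: null,
  conversation_id: ObjectId('534d0cf835731f6b2301269e'),
  hidden: { $ne: true },
  _id: { $gt: ObjectId('53ebab9806878e24a704828f') }
}).sort({ _id: 1 }).hint({ conversation_id: 1, nonce: 1 })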

Comment by Thomas Rueckstiess [ 25/Aug/14 ]

Hi Michael,

The query portion of the slow queries is

{
    deleted_at: null,
    conversation_id: ObjectId('534d0cf835731f6b2301269e'),
    hidden: { $ne: true },
    _id: { $gt: ObjectId('53ebab9806878e24a704828f') }
}

and presumably the most selective field here is "conversation_id".

The sort portion is on

{_id:1}

The Query Planner would evaluate 3 index scan plans:

  • {conversation_id: 1, nonce: 1} - requires in-memory sort
  • {conversation_id: 1, _type: 1} - requires in-memory sort
  • {_id: 1} - does not require in-memory sort

Because none of the queries return any documents (nreturned:0), every candidate plan performs the same amount of work during plan evaluation. However, there is a tie-breaker that prefers plans that don't need to sort in memory, which here is the {_id: 1} index. We therefore believe you're experiencing another instance (or a slight variation) of SERVER-14071.
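
The tie-break is visible in verbose explain output, which lists every candidate plan under "allPlans". A sketch on the 2.6 shell, using the query shape from the logs:

db.messages.find({
  deleted_at: null,
  conversation_id: ObjectId('534d0cf835731f6b2301269e'),
  hidden: { $ne: true },
  _id: { $gt: ObjectId('53ebab9806878e24a704828f') }
}).sort({ _id: 1 }).explain(true)   // verbose explain includes the "allPlans" array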

To avoid the problem altogether, you can do some tests with an index on {conversation_id:1, _id:1}, which is probably what you'd need to efficiently answer these queries. Note that Index Intersection does not currently work for just the sort portion of a query, so MongoDB can't combine the existing indices for this query.
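
A sketch of building that test index (the options here are just the obvious choice, not something mandated by the ticket); on 2.6 the shell helper is ensureIndex, and a background build avoids blocking other operations:

db.messages.ensureIndex(
  { conversation_id: 1, _id: 1 },   // equality on conversation_id, then range/sort on _id
  { background: true }
)

With this index the equality match on conversation_id and the sort on _id can both be satisfied from the index, so the query needs neither an in-memory sort nor a full scan of the {_id: 1} index.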

Regards,
Thomas

Comment by Michael Saffitz [ 25/Aug/14 ]

Thanks Thomas. That field is indexed, so I suspect this is more of a symptom than the cause. Here is the full list of indexes for that collection:

[
  {
    "v": 1,
    "key": {
      "_id": 1
    },
    "ns": "apptentive_production.messages",
    "name": "_id_"
  },
  {
    "v": 1,
    "key": {
      "conversation_id": 1,
      "nonce": 1
    },
    "unique": true,
    "ns": "apptentive_production.messages",
    "name": "conversation_id_1_nonce_1",
    "background": true
  },
  {
    "v": 1,
    "key": {
      "conversation_id": 1,
      "_type": 1
    },
    "ns": "apptentive_production.messages",
    "name": "conversation_id_1__type_1",
    "background": true
  },
  {
    "v": 1,
    "key": {
      "app_id": 1,
      "tags._id": 1
    },
    "name": "app_id_1_tags._id_1",
    "ns": "apptentive_production.messages",
    "background": true
  }
]

The collection is not sharded.

Comment by Thomas Rueckstiess [ 25/Aug/14 ]

Hi Michael,

I've checked your logs and it seems that a large number of slow queries on the namespace apptentive_production.messages (red dots) are building up and slowing down your system. The time these queries start queueing up matches the time you experience the performance impact.

2014-08-22T18:42:24.801+0000 [conn6249] query apptentive_production.messages query: { $query: { deleted_at: null, conversation_id: ObjectId('534d0cf835731f6b2301269e'), hidden: { $ne: true }, _id: { $gt: ObjectId('53ebab9806878e24a704828f') } }, $orderby: { _id: 1 } } planSummary: IXSCAN { _id: 1 } ntoreturn:51 ntoskip:0 nscanned:465956 nscannedObjects:465956 keyUpdates:0 numYields:3581 locks(micros) r:106163477 nreturned:0 reslen:20 281918ms

The queries are scanning up to half a million documents and taking several minutes at times, but not matching anything. The chosen index is on {_id:1} in order to avoid in-memory sorts (as you request a sort by _id).
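
While such an incident is in progress, the pile-up should also be visible directly on the secondary; a sketch, assuming shell access during the event:

// Long-running operations against the affected namespace.
db.currentOp({
  active: true,
  ns: "apptentive_production.messages",
  secs_running: { $gt: 60 }
})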

Can you please provide a list of indices that you have set up on apptentive_production.messages?

Regards,
Thomas
