[SERVER-14325] Severe Performance Degradation in Mongo 2.6.2 Created: 20/Jun/14  Updated: 24/Jan/15  Resolved: 23/Jan/15

Status: Closed
Project: Core Server
Component/s: Performance, Querying
Affects Version/s: 2.6.1, 2.6.2
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Matt Kolbert Assignee: Unassigned
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Participants:

 Description   

I have a sharded cluster with 12 nodes. This is a test instance, so there are no replica sets. I am querying a collection that is sharded across the 12 servers using a shard key of { "_id" : "hashed" }. I also have an index on _id. The collection has 124 million documents.
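For reference, a minimal sketch of this setup on a running sharded cluster (the { "_id" : 1 } index is the default index present on every collection; hashed sharding adds a separate { "_id" : "hashed" } index):

   // run from a mongos
   sh.enableSharding("dbName")
   sh.shardCollection("dbName.cName", { "_id" : "hashed" })
   db.getSiblingDB("dbName").cName.getIndexes()   // lists both _id_ and _id_hashed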

I am submitting the following script via mongo shell:

   db = db.getSiblingDB('dbName')
   var myCursor=db.cName.find({},{_id: 1}).hint({_id: 1}).batchSize(5000)
   var currentTotal = 0;
   var currentCount = 0;
   var commandStart = new Date();   // start time used in the rate printout below
   while ( myCursor.hasNext() )
   {
      myCursor.next();
      ++currentTotal;
      ++currentCount;
      if (currentCount == 1000000)
      {
         currentCount = 0;
         var currentTime = new Date();
         print("Iterated through " + currentTotal + " documents in " + (currentTime.getTime() - commandStart.getTime()) + " ms");
      }
   }

I ran this using version 2.4.8 and it iterated at an average of around 200 million documents per hour. This was on a freshly started system with no warm-up. I then upgraded to version 2.6.2 and ran the same script. This time the average was less than 3 million documents per hour.
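For reference, the script's per-million print lines convert to an hourly rate as follows (the millisecond figures below are illustrative, chosen to match the rates quoted above, not taken from the actual runs):

   // docs/hour from the script's "N documents in M ms" output
   function docsPerHour(docs, ms) {
      return docs / ms * 3600 * 1000;
   }
   docsPerHour(1000000, 18000)     // ~18 s per million   => ~200 million/hour (2.4.8)
   docsPerHour(1000000, 1200000)   // ~20 min per million => ~3 million/hour (2.6.2)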

Some additional information.

First, using 2.6.2, when I run myCursor.explain() to get the plan, each of the 12 shards returns:

    "indexOnly" : false

For version 2.4.8 this is true.

When I connect directly to one of the shards running 2.6.2 and run:

   var myCursor=db.cName.find({},{_id: 1}).hint({_id: 1}).batchSize(5000)
   myCursor.explain()

The result is the same except "indexOnly" is true.
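One way to compare the per-shard values from the mongos in a single pass is to walk the explain output directly; this sketch assumes the 2.4/2.6-style sharded explain document, where "shards" maps each shard host string to an array of plan documents:

   var exp = db.cName.find({}, {_id: 1}).hint({_id: 1}).explain();
   for (var shard in exp.shards) {
      exp.shards[shard].forEach(function(plan) {
         print(shard + ": indexOnly = " + plan.indexOnly);
      });
   }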

Additionally, when I monitor the operations running on the mongos where I submitted the script, I see the following:

        "opid" : "shard0005:31",
        "active" : true,
        "secs_running" : 14,
        "microsecs_running" : NumberLong(14543587),
        "op" : "getmore",
        "ns" : "dbName.cName",
        "query" : {
 
        },
        "client_s" : "1.1.1.1:52589",
        "desc" : "conn3",
        "threadId" : "0x7e77b3d1e700",
        "connectionId" : 3,
        "waitingForLock" : false,
        "numYields" : 4260,
        "lockStats" : {
                "timeLockedMicros" : {
                        "r" : NumberLong(384559),
                        "w" : NumberLong(0)
                },
                "timeAcquiringMicros" : {
                        "r" : NumberLong(12600),
                        "w" : NumberLong(0)
                }
        }

I only ever see one of these operations running at a time. They run for each shard sequentially: shard0000, shard0001, shard0002... shard0011

I would expect that the mongos would submit all of these queries in parallel.
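One way to watch for this (a minimal sketch that filters db.currentOp() on the mongos down to the getmore operations against this namespace; with parallel cursors I would expect one entry per shard at the same time):

   db.currentOp().inprog.forEach(function(op) {
      if (op.op == "getmore" && op.ns == "dbName.cName") {
         print(op.opid + "  secs_running: " + op.secs_running + "  numYields: " + op.numYields);
      }
   });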



 Comments   
Comment by Thomas Rueckstiess [ 08/Jul/14 ]

Hi Matt,

One representative log from each type of node would be sufficient for now, that is

  • one primary shard server mongod log
  • one secondary shard server mongod log (same shard as primary)
  • one mongos log

The config server log is probably not very useful.

If you have those covering both a 2.4.x and a 2.6.x test, that would be extremely helpful. If you are going to repeat the test, setting the log level to 2 (very high verbosity) would be even better. This would allow us to see page fault exceptions in the log. Alternatively, if you have iostat and mongostat output from during the tests, those would be helpful as well.

I also recommend disabling the balancer as it can have side-effects and a significant impact during the test.
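For example, something along these lines could be run before repeating the test (a sketch; the balancer is toggled from a mongos, and the verbosity raised on each mongod):

   sh.setBalancerState(false)                          // or sh.stopBalancer(); run on a mongos
   db.adminCommand({ setParameter: 1, logLevel: 2 })   // run against each mongod to raise verbosity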

Thanks,
Thomas

Comment by Matt Kolbert [ 03/Jul/14 ]

Hi Thomas,
Here are some answers to your questions.

1. The test was run on the same set of 12 servers. The only difference was that I upgraded in between runs. I believe that the balancer was on during the test.
2. There was no activity on the system other than me running the script. I hijacked a set of servers that we are planning to add as secondaries for our performance testing environment and added a config server and a mongos. This was the only mongos pointing to these servers and only I knew about it.
3. Are you looking for the full logs on all 12 shards, config servers and mongos? Please let me know and I will see what I can cull out. The logs are intermixed as I did not save between runs. Let me know what is most useful and I can try to correlate on Monday.
Thanks
Matt

Comment by Thomas Rueckstiess [ 02/Jul/14 ]

Hi Matt,

It turns out that the indexOnly issue is a red herring. 2.4 always does a fetch before returning a document on a sharded system in order to check if the chunk owns the document (SERVER-5022), but doesn't take this into account when setting the value for "indexOnly". 2.6 also always does a fetch, but it does take that into account when setting "indexOnly". So this was just a display issue in 2.4.

So far, we were unable to reproduce the performance difference you're seeing.

A few more questions to further diagnose this:

  1. Was the data distributed the same way in both the 2.4 and 2.6 runs? Was the balancer off in both cases?
  2. Was there anything else happening on the server, or was it completely idle apart from your tests?
  3. Can you share the full log files of both tests?

Thanks,
Thomas

Comment by Ramon Fernandez Marina [ 20/Jun/14 ]

mkolbert@copyright.com, thanks for your report; we can reproduce the behavior you describe with the indexes and we're investigating.
