[SERVER-5910] slow sharding already existed collections Created: 23/May/12  Updated: 08/Mar/13  Resolved: 23/Oct/12

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.1.1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Azat Khuzhin Assignee: Spencer Brody (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File m1.medium.tar.gz    
Operating System: ALL
Participants:

 Description   

Results of migrating one chunk (default size: 64 MB) of a collection with 180 million rows

by {_id}:      ~ 245 secs
by {key}:      ~ 570 secs
by {key, _id}: ~1100 secs

where

_id - ObjectId
key - 32-character md5 hash
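
For reference, a minimal mongo-shell sketch of how each of the three test cases would typically be configured against a mongos ("test.items" is a placeholder namespace, not taken from the ticket):

// enable sharding for the database, then pick one of the three shard keys
sh.enableSharding("test");
sh.shardCollection("test.items", { _id: 1 });            // by {_id}
// sh.shardCollection("test.items", { key: 1 });         // by {key}
// sh.shardCollection("test.items", { key: 1, _id: 1 }); // by {key, _id}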

Using the following configuration:

3.75 GB memory
2 EC2 Compute Unit (1 virtual core with 2 EC2 Compute Unit)
410 GB instance storage
32-bit or 64-bit platform
I/O Performance: Moderate
API name: m1.medium

Even in the best case it is slow (245 secs).
There was no concurrent reading/writing to the database, just the sharding.

"removeshard" is also not so fast

Logs attached (iostat, mongostat, mongotop, vmstat; some tests also include the mongod/mongos logs).

Related links:
https://groups.google.com/d/msg/mongodb-user/ziPjlJIS5_w/WMG3iZSuB6kJ
https://groups.google.com/d/msg/mongodb-user/R7XS1tDME8g/ucQR63ixmy0J



 Comments   
Comment by Spencer Brody (Inactive) [ 23/Oct/12 ]

I'm closing out this ticket. If you are able to reproduce this on 2.2 with MMS and the log and other monitoring output that I mentioned earlier, feel free to re-open.

Comment by Azat Khuzhin [ 04/Sep/12 ]

Hi Spencer,

I know that, but I stopped all application instances, manually moved the data, and only restarted the application afterwards, so everything was fine.

I no longer have that configuration, but I'll try to restore it on Amazon or something similar and post the results here.

If you can start the investigation without my results, please do.
Thanks.

Comment by Spencer Brody (Inactive) [ 04/Sep/12 ]

Hi Azat,
Please be advised that we do not recommend modifying data on the shards directly like you're doing here. The sharding config data won't be updated if you move documents in this way, and this can lead to data seeming to be missing when querying through the mongos.

Before we go any further here, I noticed that you are running the 2.1.1 unstable developer release. Can you try running these tests again using the newly-released stable 2.2 release? There have been a lot of changes in 2.2 since 2.1.1, and I want to make sure we aren't chasing down a bug that has actually already been fixed. Having MMS set up for the next attempt (along with capturing mongostat and iostat as you did for the last run) and sending the mongodb logs from the test run would all help us understand what's happening here.
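
(As a supplement to MMS and the OS-level tools, per-migration timings are also recorded by the balancer in the config database; a small shell sketch, assuming a connection to a mongos:)

// list the most recent chunk-migration events and their recorded step timings
var config = db.getSiblingDB("config");
config.changelog.find({ what: /^moveChunk/ }).sort({ time: -1 }).limit(5).pretty();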

Comment by Azat Khuzhin [ 28/Aug/12 ]

I understand what happened, but it is still too slow.
Individual migrations are slow. The system has more io-wait than it has in its "normal" state, though I can understand why the io-wait increased.

But some time ago, because sharding was so slow and I didn't have enough time, I ran the "removeShard" command, and it was even slower than migrating an individual chunk.
So I just "stopped" the "removeShard" command and ran code like this from the non-primary shards:

// connect directly to the primary shard's mongod and select the same database
var primary = new Mongo('primary').getDB(db.getName());

// copy each document onto the primary shard, then remove it from this shard
db.sharded_collection.find().forEach(function(row) {
	primary.sharded_collection.insert(row);
	db.sharded_collection.remove({_id: row._id});
});

And it was about 100 times faster than "removeShard", maybe even another order of magnitude more (1000x); I don't remember exactly now.

Comment by Spencer Brody (Inactive) [ 28/Aug/12 ]

If you shard a collection when it has no data in it, then insert a bunch of data, the system will be splitting and balancing as the data is being inserted, so when all the data is inserted, the amount remaining to be balanced should be small.

If you insert a bunch of data first and then shard the collection, the data will all be living on one shard initially, and it will take a longer period of time to come into balance.
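
(A small sketch of checking how far that balancing has progressed, by counting chunks per shard for the collection; "test.items" is a placeholder namespace:)

var config = db.getSiblingDB("config");
var counts = {};
config.chunks.find({ ns: "test.items" }).forEach(function (c) {
	counts[c.shard] = (counts[c.shard] || 0) + 1;  // chunks currently living on each shard
});
printjson(counts);  // prints a { shard name : chunk count } map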

To clarify, are you saying that the system overall runs slowly, that the total time to come into balance is long, or that the individual migrations themselves are slow?

Comment by Azat Khuzhin [ 28/Aug/12 ]

My English is not so good.

When I shard an already existing collection, it is very slow,
but if I configure sharding before inserting rows into the collection, and then insert the rows, it is fast.

Comment by Spencer Brody (Inactive) [ 27/Aug/12 ]

I'm sorry, perhaps I misunderstand what you're doing. Are you saying that you're seeing these long migration times even on new collections with no data in them? Can you clarify what you mean when you say the speed isn't comparable when sharding an existing collection vs. when you shard the collection from the beginning?

Comment by Azat Khuzhin [ 27/Aug/12 ]

Hi Spencer,

I understand that it needs I/O seeks and has to read documents from disk.
But the speed of sharding from the beginning is not comparable with sharding already existing data.

Comment by Spencer Brody (Inactive) [ 27/Aug/12 ]

Hi Azat,
Looking at the mongostat and iostat output you attached, it seems like you are getting a lot of page faults during your migrations (because the documents being moved aren't in memory, so they need to be fetched from disk), and that this is increasing your disk utilization. You have relatively small documents, which means that each migration will have to move a lot of documents from different locations on disk.

It makes sense that the migrations take even longer when sharded on the md5 hash "key" field than when using the ObjectId _id: finding the documents in a chunk range for a migration on a shard key that is a random hash causes a lot of random I/O, creating many disk seeks and making the migration take more time overall. The ObjectId, however, correlates with insertion time, which usually has a rough correlation to on-disk order, so the migration will be more sequential I/O, which is faster. That said, even for ObjectIds there is a lot of random I/O, so it isn't that surprising that your migrations are taking a while.

Lowering the chunk size may be helpful for you, as it means fewer documents need to be moved for each migration, so you'll have fewer disk seeks.
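
(For reference, a minimal sketch of how the cluster-wide chunk size was typically lowered in this era, via the settings collection in the config database, run against a mongos; the 32 MB value is only an example:)

var config = db.getSiblingDB("config");
config.settings.save({ _id: "chunksize", value: 32 });  // value is in MB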

Comment by Azat Khuzhin [ 24/May/12 ]

It was a virtual machine, an Amazon EC2 m1.medium.
But on a physical server the results were not much better (see the google groups threads).

Comment by Eliot Horowitz (Inactive) [ 24/May/12 ]

What was the underlying hardware when running the test?

Comment by Azat Khuzhin [ 24/May/12 ]

I've already shut down those two instances.
If I repeat such tests, I'll install MMS.

Comment by Eliot Horowitz (Inactive) [ 24/May/12 ]

Can you install MMS for this cluster with munin so we can see where the bottleneck is?
