[SERVER-7686] Chunk migration and cleanup performance Created: 16/Nov/12  Updated: 27/Oct/15  Resolved: 04/Dec/12

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.2.1
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Aristarkh Zagorodnikov Assignee: Ian Daniel
Resolution: Done Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Linux 64-bit


Issue Links:
Related
related to DOCS-461 Tutorial: Configure Balancer Settings Closed
is related to DOCS-640 Balancer configuration of _secondaryT... Closed
Backwards Compatibility: Fully Compatible
Participants:

 Description   

In our environment we generally use MongoDB in two modes:
1. "Small" (100s GiB) read/write-intensive (10^4 reads, 10^3 writes/sec) data on SSDs: data+oplog+journal spread over an array of Intel 320s (since they are both cheap and have the capacitor that allows for enabled write caching without a dedicated controller with a battery-backed cache)
2. "Large" (10s TiB) read-relaxed/write-rarely (10^2 reads, 10^1 writes/sec) data (GridFS actually) on rotational disks: data+oplog on consumer-grade 3TB drives in software RAID1 with disabled write caching along with several journal on a RAID1 of Intel 320 SSD.
The first mode gives no problems due to nature of SSDs and high percentage of data being memory-resident.
The second mode works fine also (while being extremely cheap), until the chunk migrations come.
Since this is GridFS, there are only two kinds of data to migrate: fs.chunks and fs.files. The fs.chunks migrate easily, since there there are about 600-800 documents per migration chunk (30-60MiB). The fs.files, on the other hand, have about 200-300k documents per migration chunk and migrating these almost kills the server to the point that other nodes consider it gone (doesn't accept connections). I wrote a small tool that analyses primaries logs and outputs the migration statistics, see at the excerpts of its output below (date, ns -> bytes/documents, migration time+cleanup time to target):
Fri Nov 16 03:07:51 +0400 2012 a.fs.chunks -> 34M/620d in 19s+10s to driveFS-1
Fri Nov 16 03:08:22 +0400 2012 a.fs.chunks -> 45M/688d in 24s+12s to driveFS-1
Fri Nov 16 03:09:04 +0400 2012 a.fs.chunks -> 38M/639d in 19s+11s to driveFS-1
Fri Nov 16 03:09:37 +0400 2012 a.fs.chunks -> 51M/853d in 26s+14s to driveFS-1
Fri Nov 16 03:10:20 +0400 2012 a.fs.files -> 30M/249983d in 298s+1258s to driveFS-4
Fri Nov 16 08:22:04 +0400 2012 a.fs.files -> 23M/194827d in 623s+920s to driveFS-1

You can clearly see the difference between the fs.chunks and fs.files migrations.
I understand that since migration (and replication) is done per-document, the overhead (especially in the delete case) comes from additional disk seeks for data and metadata (index) operations. Actually, the problem is not the high load itself, it's the short time this load is compressed into, bringing the server (actually the disks) to its knees. Limiting the balancer activity window (example below) helps a bit because it moves the load spikes from prime time to off hours, but a replica set failing over because instances stop responding during a chunk migration is still no good.
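For reference, limiting the balancer activity window is done against a mongos like this (the 23:00-06:00 window below is only an illustration, not our actual schedule):

use config
db.settings.update({ "_id" : "balancer" }, { $set : { activeWindow : { start : "23:00", stop : "06:00" } } }, true)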
I guess the problem might be alleviated a little by adding more memory, but I am afraid the data being moved would be cold, so caching won't help much. Also, adding memory won't help with writes. Another solution might be adding some kind of intermediate SSD-based write cache (either software like FlashCache or hardware like CacheCade), but these solutions add (in my opinion unnecessary) complexity to installation, maintenance and administration, and are usually unavailable on cloud-based hosting (not our case, but as far as I know many people run MongoDB on AWS/Azure/Rackspace Cloud/etc., where you get either low-IOPS hardware or expensive high-IOPS hardware).
I believe a better solution lies in introducing a [tunable] per-shard (or at least per-balancer) parameter that allows throttling of migration (both transfer and cleanup), preferably in documents/sec, with some kind of intelligent statistics-based throttling engine (to prevent a large chunk of documents being migrated at 100% speed and then waiting for the rest of the "time slice", which would still create spiky load on the cluster), as sketched below.
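A minimal sketch of what such a setting and its pacing logic might look like (purely hypothetical: neither a _migrationDocsPerSec setting nor a processBatch() helper exists in MongoDB; sleep() here is the mongo shell helper):

// Hypothetical balancer setting, not an actual MongoDB option:
// use config
// db.settings.update({ "_id" : "balancer" }, { $set : { _migrationDocsPerSec : 5000 } }, true)

// Pacing sketch: spread transfers/deletes evenly over each second instead of
// bursting at full speed and then idling for the rest of the time slice.
function paceMigration(docs, docsPerSec) {
    var batch = Math.max(1, Math.floor(docsPerSec / 10)); // 10 sub-slices per second
    for (var i = 0; i < docs.length; i += batch) {
        var started = new Date();
        processBatch(docs.slice(i, i + batch));           // placeholder for the transfer or delete work
        var elapsedMs = new Date() - started;
        var budgetMs = 1000 * batch / docsPerSec;         // time allowed per batch
        if (elapsedMs < budgetMs) sleep(budgetMs - elapsedMs);
    }
}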



 Comments   
Comment by Ian Daniel [ 04/Dec/12 ]

Hi Aristarkh,

I am glad that your problem is solved and that _secondaryThrottle helped somewhat. There is already an issue for documenting using it as a balancer configuration item: DOCS-640. I have linked this issue to that one.

Kind regards,
Ian

Comment by Aristarkh Zagorodnikov [ 29/Nov/12 ]

David, it looks like our problem is solved, at least in part by enabling the "_secondaryThrottle" parameter. I wanted to suggest making this the default, but found out that there is already SERVER-7779 for that. I would still like to suggest mentioning this configuration option somewhere on http://docs.mongodb.org/manual/administration/sharding/ (it is noted there, but only as a parameter to the moveChunk command, not as a balancer configuration item), and maybe writing something along the lines of "if you have migration-caused I/O peaks that do not cooperate with other workloads, try enabling _secondaryThrottle".

Comment by David Hows [ 29/Nov/12 ]

Aristarkh,

Is there anything more we can do for you on this issue?

Or do you believe the combination of things you have done has helped?

Cheers,

David

Comment by Aristarkh Zagorodnikov [ 27/Nov/12 ]

As we're still recovering from SERVER-7732, I did not have time for a detailed review, but, in short, yes, it probably helped.

We haven't (yet) observed the problem since we did the following:
1. Moved from CFQ to deadline scheduler
2. Enabled disk write caching
3. Set the vm.dirty_bytes to 64MiB
4. Enabled _secondaryThrottle

(1) and (3) should increase responsiveness and prevent blocking, so if the problem persisted we would still see I/O peaks. As we have none, and the config.changelog collection indicates that migrations on the problematic collections are still being done (see the query below), it's either (2) or (4). I don't believe much in (2), since the disk caches are only 64MB and there is much more data transferred, along with truly random I/O, so it appears that _secondaryThrottle really helps. I also observed chunk migrations that took considerably longer than originally observed, still without bringing the disks to their knees.
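For reference, a quick way to check this from a mongos is to look at config.changelog for recent moveChunk events (the "moveChunk" event names and the what/ns/time fields are standard; a.fs.files is just our namespace):

use config
db.changelog.find({ what : /^moveChunk/, ns : "a.fs.files" }).sort({ time : -1 }).limit(5)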

Comment by Ian Daniel [ 27/Nov/12 ]

Hi Aristarkh,

Do you have any indication yet of whether the _secondaryThrottle option is helping?

Kind regards,
Ian

Comment by Aristarkh Zagorodnikov [ 19/Nov/12 ]

Thanks for the advice, I'll enable it now, but since fs.files migrations are not done very often it may take several days to find out whether it helps.
Also, there is a typo in the command; the second line should read:

db.settings.update({ "_id" : "balancer"}, {$set:{_secondaryThrottle : true }}, true)

Comment by Eliot Horowitz (Inactive) [ 18/Nov/12 ]

There is a new option we're experimenting with in 2.2 called _secondaryThrottle.

If you do this against a mongos:

use config;
db.settings.update ({ "_id" : "balancer"}, {$set:{_secondaryThrottle" : true }, true})

Migrations will then throttle themselves based on replication, which has the effect of elongating the migration window and reducing its impact.

Can you try that?
