Priority: Major - P3
Affects Version/s: 2.2.1
Fix Version/s: None
In our environment we generally use MongoDB in two modes:
1. "Small" (100s GiB) read/write-intensive (10^4 reads, 10^3 writes/sec) data on SSDs: data+oplog+journal spread over an array of Intel 320s (since they are both cheap and have the capacitor that allows for enabled write caching without a dedicated controller with a battery-backed cache)
2. "Large" (10s TiB) read-relaxed/write-rarely (10^2 reads, 10^1 writes/sec) data (GridFS actually) on rotational disks: data+oplog on consumer-grade 3TB drives in software RAID1 with write caching disabled, along with several journals on a RAID1 of Intel 320 SSDs.
The first mode gives no problems due to the nature of SSDs and the high percentage of data being memory-resident.
The second mode also works fine (while being extremely cheap) until chunk migrations start.
Since this is GridFS, there are only two kinds of data to migrate: fs.chunks and fs.files. The fs.chunks collection migrates easily, since there are only about 600-800 documents per migration chunk (30-60MiB). The fs.files collection, on the other hand, has about 200-300k documents per migration chunk, and migrating these almost kills the server, to the point that other nodes consider it gone (it doesn't accept connections). I wrote a small tool that analyses the primaries' logs and outputs migration statistics; see the excerpts of its output below (date, ns -> bytes/documents, migration time+cleanup time to target):
Fri Nov 16 03:07:51 +0400 2012 a.fs.chunks -> 34M/620d in 19s+10s to driveFS-1
Fri Nov 16 03:08:22 +0400 2012 a.fs.chunks -> 45M/688d in 24s+12s to driveFS-1
Fri Nov 16 03:09:04 +0400 2012 a.fs.chunks -> 38M/639d in 19s+11s to driveFS-1
Fri Nov 16 03:09:37 +0400 2012 a.fs.chunks -> 51M/853d in 26s+14s to driveFS-1
Fri Nov 16 03:10:20 +0400 2012 a.fs.files -> 30M/249983d in 298s+1258s to driveFS-4
Fri Nov 16 08:22:04 +0400 2012 a.fs.files -> 23M/194827d in 623s+920s to driveFS-1
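The difference between fs.chunks and fs.files migrations is obvious: an fs.files migration moves less data (23-30M) yet takes 298-623s to transfer and 920-1258s to clean up, versus 19-26s and 10-14s for fs.chunks.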
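To put numbers on that difference, here is a minimal Python sketch that parses lines in the format shown above and derives per-document rates. It is not the tool I mentioned; the regular expression, the `parse_line` helper and the reading of `M` as MiB are assumptions made for illustration.

```python
import re

# Matches the tail of a line such as:
#   "... a.fs.files -> 30M/249983d in 298s+1258s to driveFS-4"
# Assumption: "M" means MiB, "d" means documents.
LINE_RE = re.compile(
    r"(?P<ns>\S+) -> (?P<mib>\d+)M/(?P<docs>\d+)d "
    r"in (?P<move>\d+)s\+(?P<cleanup>\d+)s to (?P<target>\S+)$"
)


def parse_line(line):
    """Return a dict of migration statistics, or None if the line does not match."""
    m = LINE_RE.search(line.strip())
    if m is None:
        return None
    docs = int(m.group("docs"))
    move_s = int(m.group("move"))
    cleanup_s = int(m.group("cleanup"))
    return {
        "ns": m.group("ns"),
        "target": m.group("target"),
        "docs": docs,
        # Average document size in bytes (MiB figure is rounded, so this is approximate).
        "bytes_per_doc": int(m.group("mib")) * 1024 * 1024 / docs,
        # Documents per second during transfer and during cleanup (delete) phases.
        "move_docs_per_sec": docs / move_s if move_s else float("inf"),
        "cleanup_docs_per_sec": docs / cleanup_s if cleanup_s else float("inf"),
    }


if __name__ == "__main__":
    samples = [
        "Fri Nov 16 03:07:51 +0400 2012 a.fs.chunks -> 34M/620d in 19s+10s to driveFS-1",
        "Fri Nov 16 03:10:20 +0400 2012 a.fs.files -> 30M/249983d in 298s+1258s to driveFS-4",
    ]
    for s in samples:
        r = parse_line(s)
        print(
            f"{r['ns']}: {r['bytes_per_doc']:.0f} B/doc, "
            f"{r['move_docs_per_sec']:.0f} docs/s transfer, "
            f"{r['cleanup_docs_per_sec']:.0f} docs/s cleanup"
        )
```

On the two sample lines, the fs.files migration works out to hundreds of small documents per second for many minutes, while the fs.chunks migration handles only a few dozen large documents per second for a few seconds, which is consistent with the per-document overhead being the issue rather than the byte volume.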
I understand that since migration (and replication) is done per-document, the overhead (especially in the delete case) comes from additional disk seeks for data and metadata (index) operations. The problem is not the high load itself but the short time this load is compressed into, which brings the server (actually, its disks) to its knees. Limiting the balancer activity window helps a bit because it moves the load spikes from prime time to the off hours, but a replica set failing over because instances stop responding during a chunk migration is still not acceptable.
I guess the problem might be alleviated a little by adding more memory, but I'm afraid the data being moved would be cold, so caching won't help much. Also, adding memory won't help with writes. Another solution might be adding some kind of intermediate SSD-based write cache (either software like FlashCache or hardware like CacheCade), but such solutions add (unnecessary, in my opinion) complexity to installation, maintenance and administration, and are usually unavailable on cloud-based hosting (not our case, but as far as I know many people run MongoDB on AWS/Azure/Rackspace Cloud/etc., where you get either low-IOPS hardware or expensive high-IOPS hardware).
I believe the better solution lies in introducing a [tunable] per-shard (or at least per-balancer) parameter that allows throttling of migration (both transfer and cleanup), preferably in documents/sec, with some kind of intelligent statistics-based throttling engine (to prevent a large batch of documents from being migrated at 100% speed and then waiting out the rest of the "time-slice", which would still create spiky load on the cluster).
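To illustrate the kind of pacing I mean, here is a minimal Python sketch of a token-bucket limiter expressed in documents/sec with a small burst cap, so that work is spread evenly rather than done in one burst followed by a long idle wait. This is only an illustration of the idea, not MongoDB code: `DocumentPacer`, `acquire` and `max_burst` are hypothetical names, and a real engine would presumably adjust the rate from observed statistics.

```python
import time


class DocumentPacer:
    """Token bucket measured in documents (hypothetical helper, not a MongoDB API).

    Allows at most `docs_per_sec` documents per second on average and never
    more than `max_burst` documents back to back.
    """

    def __init__(self, docs_per_sec, max_burst=1,
                 clock=time.monotonic, sleep=time.sleep):
        if docs_per_sec <= 0 or max_burst <= 0:
            raise ValueError("docs_per_sec and max_burst must be positive")
        self.rate = float(docs_per_sec)
        self.capacity = float(max_burst)
        self.tokens = float(max_burst)  # start with a full bucket
        self.clock = clock
        self.sleep = sleep
        self.last = clock()

    def _refill(self):
        # Credit tokens earned since the last call, capped at the burst size.
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now

    def acquire(self, n=1):
        """Block until `n` documents may be transferred or deleted."""
        if n > self.capacity:
            raise ValueError("cannot acquire more than max_burst at once")
        self._refill()
        if self.tokens < n:
            # Sleep just long enough to earn the missing tokens.
            self.sleep((n - self.tokens) / self.rate)
            self._refill()
        self.tokens -= n
```

A migration or cleanup loop would call `pacer.acquire(len(batch))` before each small batch. With, say, `DocumentPacer(200, max_burst=20)`, the ~250k-document fs.files chunk from the log above would be spread over roughly 20 minutes in steady 20-document steps instead of arriving as one spike. The small `max_burst` is what avoids the "full speed, then wait out the time-slice" pattern.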