[SERVER-14704] Chunk migration becomes (linearly?) slower as collection grows Created: 27/Jul/14  Updated: 10/Dec/14  Resolved: 19/Aug/14

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.6.3
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Vincent Assignee: Ramon Fernandez Marina
Resolution: Done Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File log extract.log    
Participants:

 Description   

In my use case I have multiple databases with the same "schemas" and types of data. I've noticed that chunk migration becomes slower and slower as the collection size grows.

For small databases/collections, migrating a chunk generally takes less than 20 seconds, while for my bigger collections it takes 1800 seconds on average (sometimes more than 1 hour), with everything in between (I have about 35 identical databases, of all sizes). Chunks have roughly the same size and number of documents in all cases, with exactly the same indexes.

Updates/inserts are happening, but at a slow pace (I'd say fewer than 10 updates/inserts per hour on the chunk being migrated).
My chunks are 256MB and each document has an average size of 2810 bytes (about 50,000 documents per chunk / 140MB, as it seems chunks aren't "full"). The cluster doesn't receive a lot of writes (globally about 30 updates and 5 inserts per second) and I moved as many reads as possible to secondaries. Almost no deletes are happening, cluster-wide.
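For context, the ~50,000 documents / ~140MB figure follows from the average document size; a minimal mongo shell sketch of the arithmetic (database and collection names are placeholders):

    // Rough estimate of documents per chunk from the average document size.
    // "mydb" / "mycoll" are placeholder names used for illustration only.
    var stats = db.getSiblingDB("mydb").mycoll.stats();
    var chunkSizeBytes = 256 * 1024 * 1024;               // configured chunk size
    print("avgObjSize: " + stats.avgObjSize + " bytes");  // ~2810 bytes in this case
    print("docs per full chunk: " + Math.floor(chunkSizeBytes / stats.avgObjSize));
    // In practice the chunks here hold ~50,000 docs (~140MB), i.e. they aren't "full".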

All disks are regular SATA (because of dataset size).

Example of a slow migration (step times in milliseconds):
"step 1 of 6" : 119,
"step 2 of 6" : 3266,
"step 3 of 6" : 1618,
"step 4 of 6" : 2597284,
"step 5 of 6" : 2733,
"step 6 of 6" : 0

Data does not fit in RAM (but indexes do).
When I look at the logs of the "sender", I can see that "cloned"/"clonedBytes" increase very slowly and pause every 16MB or so for a few seconds.
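For reference, migration progress can also be spotted via currentOp on a shard primary; a minimal sketch, assuming the operation shows up with "moveChunk"/"migrateThread" in its description (field contents vary by version, so the filter is illustrative):

    // Sketch: surface in-progress migration operations from currentOp output.
    db.currentOp().inprog.forEach(function (op) {
        var text = (op.desc || "") + " " + (op.msg || "");
        if (/moveChunk|migrateThread/i.test(text)) {
            printjson(op);
        }
    });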

iotop tells me that both the sender and the recipient are performing a lot of writes (both stuck at 100%), orders of magnitude more than what is being transmitted.

  • The sender
    It's a basic server with 16GB of RAM and software RAID 1 SATA disks.
    On the sender I'd expect high reads/low writes (as the range deleter removes the previously transmitted chunks). Due to data locality I'd probably expect reads to be slower in big collections, but I definitely don't expect that amount of writes.
    Typical "atop" output:
    DSK | sda | busy 100% | read 130 | write 2635 | MBr/s 0.13 | MBw/s 2.07 | avio 3.62 ms |
    DSK | sdb | busy 81% | read 83 | write 2613 | MBr/s 0.09 | MBw/s 2.04 | avio 3.00 ms |
  • The recipient
    96 GB of RAM / hardware RAID 1 SATA disks.
    I'm moving all my data to this new server (I'll end up with a cluster with a single shard... but this server has twice as much RAM as the previous 3 shards combined: 3x16=48GB vs 96GB).
    On the recipient I'd expect writes in correlation with the chunk data being migrated. This server was synced from its replica set about 1 week ago, so its data locality is very clean, with no holes in the files (it wasn't "bootstrapped").

DSK | sda | busy 100% | read 152 | write 2664 | MBr/s 0.36 | MBw/s 11.84 | avio 3.55 ms |

You can probably find more insights in my MMS account: https://mms.mongodb.com/host/cluster/51a2dc5c7fe227e9f188c509/52bb9a10e4b0256ace50e0d3

Have a look at the attached log extract for a typical overview of chunk migration speed.



 Comments   
Comment by Vincent [ 14/Nov/14 ]

For reference, I solved this issue by switching to servers with less RAM (32GB instead of 96GB) but equipped with SSD disks. It dramatically improved the performance of my application, and chunk migrations are now blazing fast, even though my data doesn't fit in RAM (not even my indexes).

Comment by Ramon Fernandez Marina [ 19/Aug/14 ]

tubededentifrice, thanks for sending your config database. There are several factors that can affect migrations, for example:

  • Network traffic
  • Amount of data in each chunk; note that a chunk is a logical range, and the actual data in each range may be very different from the chunk size
  • Misconfigured disks: data writes involve lots of random I/O, so an aggressive read-ahead configuration may harm performance
  • Choice of shard key: sharding on {_id:1} causes all writes to go to the same shard, thus slowing down migrations involving that shard (see the sketch after this list)
  • Application I/O load on the sending and/or receiving end, even with large amounts of RAM
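On the shard-key point, a minimal illustration with a hypothetical namespace; an ascending { _id: 1 } key sends all inserts to the shard holding the top chunk, whereas a hashed key spreads them:

    // Hedged illustration only; "mydb.mycoll" is a placeholder namespace.
    sh.enableSharding("mydb")
    sh.shardCollection("mydb.mycoll", { _id: "hashed" })  // hashed key spreads inserts across shards
    // versus sh.shardCollection("mydb.mycoll", { _id: 1 }), where monotonically
    // increasing _id values all land on the shard owning the max-key chunk.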

After examining the data you sent we haven't found any evidence of a bug in MongoDB. Since the SERVER project is for reporting bugs or feature suggestions for the MongoDB server and tools, I would recommend that you post your questions on the mongodb-user group, where you can reach a wide audience of MongoDB experts.

Regards,
Ramón.

Comment by Vincent [ 30/Jul/14 ]

It looks like the writes on the receiving side are caused by application updates, not by the chunk migration; my apologies.
However, when the application is stopped and no writes are happening, the chunk migration isn't any faster (and the sender side is still producing a lot of reads+writes that are, for sure, related to the chunk migration in progress).

Comment by Vincent [ 30/Jul/14 ]

Note that I already had this problem when I switched from 1 to 5 and then from 5 to 3 shards (the migration took forever, with about 1 chunk (64 MB at the time...) moved per 24 hours; the hardware used was much less powerful at the time).
I ended up renting a few very big SSD cloud servers just to speed up the migrations!
But here, the hardware should not be the limiting factor on the receiving side: 96GB of RAM, about 60GB total index size for the entire dataset (including data not yet on that shard), hardware RAID with 512MB of cache with BBU, and it still performs far too many NON-SEQUENTIAL writes.

Comment by Vincent [ 29/Jul/14 ]

Hi Ramon,
1. I did the dump; how can I send it to you privately? I don't want to disclose architecture/site data publicly. This would also allow me to share things with you without having to anonymize them first. Maybe I could send you a dump of the sender/receiver logs as well?

2. Nope, initially it was a simple replica set without sharding. Then I had 5 shards, then 3, and now I'm moving all the data to have only 1 shard (maybe 2 shards later, etc.). Chunks are moved using moveChunk commands (because I can only have 1 draining shard, and it would be dumb to have the chunks moved to the other shard I want to remove!)
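For reference, a rough sketch of such a manual move issued through a mongos (the namespace, key value and shard name are placeholders, not the real ones):

    // Hedged sketch of a manual chunk migration via mongos.
    // "mydb.mycoll" and "shardToKeep" are placeholder names.
    sh.moveChunk("mydb.mycoll", { _id: MinKey }, "shardToKeep")
    // equivalent admin command form:
    db.adminCommand({ moveChunk: "mydb.mycoll", find: { _id: MinKey }, to: "shardToKeep" })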

3. I had some "jumbo" chunks in the past, which were able to move with a 256MB chunk size. Besides this, the docs state(d?) that chunk migration is more efficient with big chunks and puts less stress on mongos (I had an issue with this too...), at the cost of a less evenly balanced cluster and more "painful" migrations, which I don't really care about. (Moving a 256MB chunk takes less time than moving 4x 64MB chunks, doesn't it?)
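(The 256MB value itself is just the cluster-wide chunksize setting in the config database; roughly, as a sketch:)

    // Sketch: how a 256MB cluster-wide chunk size is typically set (value in MB).
    db.getSiblingDB("config").settings.save({ _id: "chunksize", value: 256 })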

4. I keep a "tail -f" on the logs => shardKeyPattern: { _id: 1.0 }, state: "clone", counts: { cloned: 24470, clonedBytes: 73203286, catchup: 0, steady: 0 }; the numbers I wrote may not be 100% accurate but give a good idea of the reality.

Edit: I've attached the dump to this ticket, restricted to project users

Comment by Ramon Fernandez Marina [ 29/Jul/14 ]

tubededentifrice, we'll need more information to determine whether there's a bug here:

  1. Can you send us a dump of your config metadata so we can investigate further? You can get it using mongodump against a mongos as follows:

    mongodump -d config --host <mongos host:port>

    This metadata contains the chunk migration history, so we can correlate migration speed with the information in MMS.

  2. Did you pre-split your chunks before inserting data?
  3. Can you elaborate on the reason for choosing your chunksize?
  4. How do you compute the amount of data in one chunk?
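(On question 4, one common way to measure the data in a single chunk's range is the dataSize command; a minimal sketch with a placeholder namespace and bounds:)

    // Sketch: measure the actual data within one chunk's range.
    // Namespace and bounds are placeholders; use a real chunk's min/max.
    db.adminCommand({
        dataSize: "mydb.mycoll",
        keyPattern: { _id: 1 },
        min: { _id: MinKey },
        max: { _id: MaxKey }
    })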

Thanks,
Ramón.

Comment by Vincent [ 27/Jul/14 ]

I forgot to mention: the 3rd shard (the one that is not yet involved in chunk migration but still holds ~30% of the "big" collection I'm moving) is idle:
DSK | sda | busy 8% | read 0 | write 231 | KiB/r 0 | KiB/w 5 | MBr/s 0.00 | MBw/s 0.13 | avq 54.11 | avio 3.43 ms
which makes me conclude it's not a matter of what other operations are being performed on the database, but only a matter of the chunks being migrated.
(this shard behaves exactly as the others when it sends a chunk)

And also: all filesystems are ext4.
