Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-14704

Chunk migration become (linearly?) slower as collection grows

    • Type: Icon: Improvement Improvement
    • Resolution: Done
    • Priority: Icon: Major - P3 Major - P3
    • None
    • Affects Version/s: 2.6.3
    • Component/s: Sharding
    • None

      In my use case I've multiple databases with the same "schemas" and type of data. I've noticed that chunk migration becomes slower and slower, in correlation with the collection size.

      For small databases/collections, migrating a chunk is generally done in less than 20 seconds while for my bigger collections it takes 1800 seconds in average (sometimes more than 1 hour), with all nuances between them (I've about 35 identical databases, with all sizes). Chunks have roughly the same size and number of documents in all cases, with exactly the same indexes.

      Updates/Inserts are happening, but at a slow pace (I'd say less than 10 updates/inserts per hour are happening on the chunk being migrated).
      My chunks are 256MB and each document have an average size of 2810 bytes (about 50,000 documents per chunk / 140MB as it seems chunks aren't "full"). The cluster doesn't receive a lot of writes (globally about 30 updates and 5 inserts per second) and I transferred as many reads as possible to secondaries. Almost 0 deletes are happening, cluster wide

      All disks are regular SATA (because of dataset size).

      Exemple of a low migration:
      "step 1 of 6" : 119,
      "step 2 of 6" : 3266,
      "step 3 of 6" : 1618,
      "step 4 of 6" : 2597284,
      "step 5 of 6" : 2733,
      "step 6 of 6" : 0

      Data do not fit in RAM (but indexes does).
      When I look at the logs of the "sender", I can see that "cloned"/"clonedBytes" are increasing very slowly and pauses every 16MB or so for few seconds.

      iotop tells me that both the sender and the recipient are performing a lot of writes (both stuck at 100%). Magnitudes more than what is being transmitted.

      • The sender *
        It's a basic 16GB of RAM / soft RAID 1 SATA disks server
        On the sender I'd expect high reads/low writes (as the range deleter removes the previously transmitted chunks). Due to data locality I'd probably expect reads to be slower in big collections, but definitively don't expect that amount of writes.
        Typical "atop" output:
        DSK | sda | busy 100% | read 130 | write 2635 | MBr/s 0.13 | MBw/s 2.07 | avio 3.62 ms |
        DSK | sdb | busy 81% | read 83 | write 2613 | MBr/s 0.09 | MBw/s 2.04 | avio 3.00 ms |
      • The recipient *
        96 GB of RAM / hard RAID 1 SATA disks
        I'm moving all my data to this new server (I'll end with a cluster with a single shard... but this server have 2 times more RAM than the previously combine 3 shards - 3x16=48GB vs 96 GB)
        On the recipient I'd expect writes in correlation with the chunk data being migrated. This server was synced from its replicaset about 1 week ago, so it's very clean in data locality, no holes in files (it wasn't "bootstraped").

      DSK | sda | busy 100% | read 152 | write 2664 | MBr/s 0.36 | MBw/s 11.84 | avio 3.55 ms |

      You can probably find more insights in my MMS account: https://mms.mongodb.com/host/cluster/51a2dc5c7fe227e9f188c509/52bb9a10e4b0256ace50e0d3

      Have a look to the log extract for a typical overview of chunk migration speed.

            Assignee:
            ramon.fernandez@mongodb.com Ramon Fernandez Marina
            Reporter:
            tubededentifrice Vincent
            Votes:
            1 Vote for this issue
            Watchers:
            5 Start watching this issue

              Created:
              Updated:
              Resolved: