Core Server / SERVER-70602

Handle faulty balancerCompliant reporting by waiting for some no-op balancing rounds

    • Fully Compatible
    • ALL
    • v6.1, v6.0
    • Sharding EMEA 2022-10-31, Sharding EMEA 2022-11-14
    • 35

      The tests are already double-checking the collection balancerCompliant flag (SERVER-69136) through the awaitCollectionBalance function, due to the "Won't Fix" bug SERVER-68777.

      However, in SERVER-68777 we only considered the possibility of a wrong collection size on the donor, not on the recipient.
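      As a simplified model of that double check (a hypothetical sketch, not the actual awaitCollectionBalance helper; the callback parameters are assumptions), the logic amounts to polling the compliance flag, then the orphan count, then the flag again:

```python
def await_collection_balanced(is_balancer_compliant, count_orphans,
                              max_polls=100):
    """Declare the collection balanced only when the compliance flag and the
    orphan count agree (hypothetical sketch of the double check)."""
    for _ in range(max_polls):
        if (is_balancer_compliant()            # first poll of balancerCompliant
                and count_orphans() == 0       # no pending orphans...
                and is_balancer_compliant()):  # ...and the flag still holds
            return True
    return False


# Fakes reproducing this ticket's failure: a stale compliance flag plus a
# not-yet-updated orphanCounter make every check pass while a migration is
# still in flight.
result = await_collection_balanced(
    is_balancer_compliant=lambda: True,  # stale size report (SERVER-68777)
    count_orphans=lambda: 0)             # orphanCounter not yet bumped
print(result)  # True -- returns even though the data has not settled yet
```

      With both inputs stale, the double check offers no protection, which is what the failure example below walks through step by step.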

      Why might the recipient report a wrong collection size?

      Because the actual migration of the documents is not executed atomically with the update of the orphanCounter on the recipient side. So the recipient shard reports a larger collection size than the actual one if the request is served after the actual migration of the documents but before the update of the orphanCounter.

      NOTE: The recipient clears the 'orphan' tag of the received chunk once the migration commit state is reached.
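      The non-atomic sequence above can be modeled with a short simulation (an illustrative Python sketch, not server code; the class, its field names, and the chunk size derived from the figures in the example below are assumptions):

```python
class RecipientShard:
    """Toy model of the recipient shard's size accounting (not server code)."""

    def __init__(self, owned_bytes):
        self.doc_bytes = owned_bytes   # bytes physically stored on the shard
        self.orphan_bytes = 0          # bytes tracked by the orphanCounter

    def receive_chunk_documents(self, chunk_bytes):
        # Step 1: the documents are copied in (still orphans until commit)...
        self.doc_bytes += chunk_bytes

    def update_orphan_counter(self, chunk_bytes):
        # Step 2: ...the orphanCounter is bumped in a SEPARATE, later write.
        self.orphan_bytes += chunk_bytes

    def reported_size(self):
        # Owned size = physical size minus whatever the counter calls orphaned.
        return self.doc_bytes - self.orphan_bytes


shard = RecipientShard(owned_bytes=5243010)
chunk = 1048602                        # 6291612 - 5243010, from the example

shard.receive_chunk_documents(chunk)   # documents have landed...
size_mid_race = shard.reported_size()  # ...a reader here sees an inflated size

shard.update_orphan_counter(chunk)     # counter catches up afterwards
size_after = shard.reported_size()

print(size_mid_race)  # 6291612 -- over-reported, as in step 3 below
print(size_after)     # 5243010 -- the size actually owned mid-migration
```

      Any size request served between the two writes observes the inflated value, which is exactly the window the balancerCompliant check can hit.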

      Example of failure

      1. The first balancerCompliant check returns too early because the donor shard is reporting a lower value for the collection size due to SERVER-68777. At that moment the last migration is still in progress.

      DONOR size                        RECIPIENT size
      7340214 MB (actual: 8388816 MB)   5243010 MB

      2. At that moment there are no orphans, so we don't have to wait for the orphan count to reach 0.

      3. The second balancerCompliant check returns too early as well, because the recipient has already received the documents but hasn't updated the orphan counter yet. So we wrongly return from awaitCollectionBalance.

      DONOR size   RECIPIENT size
      8388816 MB   6291612 MB (actual: 5243010 MB)

      4. Once the migration ends, we finally get the proper size on each shard:

      DONOR size   RECIPIENT size
      7340214 MB   6291612 MB
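      The mitigation in the title (waiting for some no-op balancing rounds) can be sketched as requiring several consecutive rounds in which the balancer schedules no migrations before trusting the result. A hypothetical Python sketch, where await_stable_balance, run_balancer_round, and the round counts are assumptions rather than the actual test code:

```python
def await_stable_balance(run_balancer_round, required_noop_rounds=3,
                         max_rounds=100):
    """Trust the balancer only after several consecutive rounds did no work
    (hypothetical sketch; run_balancer_round returns migrations scheduled)."""
    consecutive_noops = 0
    for _ in range(max_rounds):
        if run_balancer_round() == 0:
            consecutive_noops += 1
            if consecutive_noops >= required_noop_rounds:
                return True        # stable: repeated rounds agree nothing moves
        else:
            consecutive_noops = 0  # any migration resets the stability window
    return False


# Round sequence mirroring the failure above: two spurious quiet rounds caused
# by stale sizes, then the real (delayed) migration, then genuine stability.
rounds = iter([0, 0, 1, 0, 0, 0])
result = await_stable_balance(lambda: next(rounds, 0))
print(result)  # True, but only after the post-migration no-op rounds
```

      Under this scheme the two spurious quiet rounds alone no longer satisfy the check, so a single stale balancerCompliant answer cannot end the wait early.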

      Assignee: Silvia Surroca (silvia.surroca@mongodb.com)
      Reporter: Silvia Surroca (silvia.surroca@mongodb.com)