Core Server / SERVER-10024

cluster can end up with large chunks that did not get split and will time out on migration


    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major - P3
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: Backlog
    • Component/s: Sharding
    • Labels:
      None

      Description

      Consider the case where:

      • large volume of insertions
      • migration is slow due to slow hardware and many indexes (e.g. 20)
      • consequently a moveChunk operation takes a long time (e.g. 1 min)
      • consequently any split fails during that window since the namespace is locked, and chunks grow larger
      • consequently chunks take even longer to move; this downward spiral makes things worse and worse
      • eventually chunks cannot be moved at all: the migration aborts after a few minutes with no progress, yet the system stays busy the whole time trying to migrate those documents
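The feedback loop above can be sketched numerically. This is a toy model, not MongoDB code: all names and rates are invented, and it only assumes that migration time scales with chunk size and that the namespace lock blocks splits while a migration runs.

```python
# Toy model of the downward spiral: while one chunk migrates, splits are
# blocked, so the next candidate chunk keeps absorbing inserts and grows.
def simulate(insert_rate_mb_per_s, migrate_rate_mb_per_s, chunk_mb, steps):
    """Return the chunk size seen at each migration round (MB)."""
    sizes = [chunk_mb]
    for _ in range(steps):
        migration_secs = sizes[-1] / migrate_rate_mb_per_s
        # inserts that accumulate while the namespace lock blocks any split
        growth = insert_rate_mb_per_s * migration_secs
        sizes.append(sizes[-1] + growth)
    return sizes

sizes = simulate(insert_rate_mb_per_s=2, migrate_rate_mb_per_s=1,
                 chunk_mb=64, steps=5)
# each round the chunk is strictly larger, so each migration takes longer
assert all(b > a for a, b in zip(sizes, sizes[1:]))
```

With inserts arriving faster than data migrates out, the sizes grow without bound, which matches the ticket's observation that migrations eventually always time out.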

      I think we need several server improvements:

      A. any chunk migration that aborts due to a timeout should result in a split; if anything, the split won't hurt. Right now the split seems to happen only in one specific case.
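Proposal A amounts to a small change in the balancer's retry path. A hypothetical sketch (the function names `migrate` and `split` are stand-ins, not actual server internals):

```python
def migrate_with_split_on_abort(chunk, migrate, split):
    """If a migration times out, always split the oversized chunk before
    the next retry, instead of splitting only in one special case."""
    try:
        return migrate(chunk)
    except TimeoutError:
        # a split can't hurt: two smaller halves migrate faster next round
        return split(chunk)

calls = []
def migrate(chunk):
    raise TimeoutError("migration timed out")
def split(chunk):
    calls.append(chunk)
    return (chunk + "-lo", chunk + "-hi")

result = migrate_with_split_on_abort("chunk-1", migrate, split)
assert calls == ["chunk-1"]
```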

      B. ideally the migration process would avoid retrying the same chunk over and over. It may need some amount of randomization when choosing candidate chunks.
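One hedged sketch of proposal B: track chunks whose migrations recently failed and pick randomly among the rest, falling back to the full set only when every candidate has failed. All names here are illustrative, not server code:

```python
import random

def pick_chunk(candidates, recently_failed, rng=random):
    """Prefer chunks that have not recently failed to migrate; fall back
    to a random choice among all candidates if every one has failed."""
    fresh = [c for c in candidates if c not in recently_failed]
    pool = fresh or candidates
    return rng.choice(pool)

failed = {"chunk-a"}
picks = {pick_chunk(["chunk-a", "chunk-b", "chunk-c"], failed)
         for _ in range(50)}
# the recently-failed chunk is never retried while alternatives exist
assert "chunk-a" not in picks
```

Randomizing over the fresh pool avoids the pathological loop where the balancer burns all its time on one unmovable chunk.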

      C. when mongos fails to split because the namespace is locked, it should mark the chunk metadata as "needs split" for later. Ideally all "needs split" flags would be cleared before the next migration is attempted.
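Proposal C is essentially a deferred-work queue on the chunk metadata. A minimal sketch, assuming an invented `ChunkMetadata` holder (nothing here reflects actual server data structures):

```python
class ChunkMetadata:
    """Tracks chunks whose splits were refused so they can run later."""

    def __init__(self):
        self.needs_split = set()

    def on_split_refused(self, chunk_id):
        # mongos could not split because the namespace was locked;
        # remember the chunk instead of dropping the request
        self.needs_split.add(chunk_id)

    def pending_splits(self):
        # drain the backlog so all deferred splits can be performed
        # before the next migration is attempted
        pending, self.needs_split = self.needs_split, set()
        return sorted(pending)

meta = ChunkMetadata()
meta.on_split_refused("chunk-b")
meta.on_split_refused("chunk-a")
assert meta.pending_splits() == ["chunk-a", "chunk-b"]
assert meta.pending_splits() == []
```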

      This is all to avoid the bad catch-22 problem where large chunks end up clogging the whole system.


      People

    • Assignee: backlog-server-sharding (Backlog - Sharding Team)
    • Reporter: antoine (Antoine Girbal)
    • Votes: 1
    • Watchers: 4

      Dates

    • Created:
    • Updated: