DOCS-13200

Investigate changes in SERVER-42273: Introduce a "force" option to `moveChunk` to allow migrating jumbo chunks

      Description

      Downstream Change Summary

      This ticket changes the behavior of chunk migration so that it is now possible to "force" a jumbo chunk to be migrated. There are changes to both the 'moveChunk' command and the balancer configuration settings.

      Changes to moveChunk command:
      The command accepts a new optional boolean parameter, 'forceJumbo', that defaults to false. If it is set to true and the chunk would otherwise have been deemed too large to move, the donor shard will enter the critical section early and writes will be blocked during the cloning phase. This is important to note because operations on this collection can be blocked for a long period of time.
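
      A minimal sketch of a 'moveChunk' command document that sets 'forceJumbo'. The helper name, namespace, shard key value, and shard name are hypothetical placeholders; only the 'moveChunk', 'find', 'to', and 'forceJumbo' fields come from the command itself.

```javascript
// Build a moveChunk command document.
// "test.orders", { customerId: 42 }, and "shard0001" are illustrative values only.
function buildMoveChunkCommand(namespace, shardKeyQuery, toShard, forceJumbo = false) {
  // The command name must be the first key of the document.
  const cmd = { moveChunk: namespace, find: shardKeyQuery, to: toShard };
  // Include forceJumbo only when requested; omitting it keeps the default (false).
  if (forceJumbo) {
    cmd.forceJumbo = true;
  }
  return cmd;
}

const moveCmd = buildMoveChunkCommand("test.orders", { customerId: 42 }, "shard0001", true);

// In a shell connected to a mongos, the document would be passed to:
//   db.adminCommand(moveCmd)
console.log(JSON.stringify(moveCmd));
```

      Because writes are blocked while the chunk is cloned, a command like this is best run during a period of low write activity on the collection.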

      Changes to balancer configuration settings:
      The 'balancer' document in the config.settings collection has a new field, 'attemptToBalanceJumboChunks'. This is a boolean field that defaults to false. The document will now look something like:

      {"_id": "balancer", "mode": "full", "stopped": false, "attemptToBalanceJumboChunks": false}
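
      A minimal sketch of how the new field could be turned on. The helper name is hypothetical, and using an upsert is an assumption made here so that the update also works when the 'balancer' document does not exist yet.

```javascript
// Build the filter, update, and options for enabling or disabling the setting.
function buildBalancerJumboSetting(enabled) {
  return {
    filter: { _id: "balancer" },
    update: { $set: { attemptToBalanceJumboChunks: enabled } },
    // Assumption: upsert so the document is created if it is missing.
    options: { upsert: true },
  };
}

const setting = buildBalancerJumboSetting(true);

// In a shell connected to a mongos, the pieces would be passed to:
//   db.getSiblingDB("config").settings.updateOne(
//     setting.filter, setting.update, setting.options)
console.log(JSON.stringify(setting));
```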

      If 'attemptToBalanceJumboChunks' is set to true, the balancer will schedule migrations that attempt to move large chunks, as long as the chunk is not marked 'jumbo' in config.chunks. A chunk is marked 'jumbo' only after an attempt to split or move a large chunk has failed because of its size or the size of the transfer mods queue. To avoid the risk of scheduling the same migration forever, the balancer should not keep trying to migrate a chunk that has previously failed for either of these reasons. A user can run 'clearJumboFlag' so that the balancer will schedule this migration in the future, or they can use the moveChunk command to move the chunk manually.
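
      A minimal sketch of a 'clearJumboFlag' command document for a chunk that was previously marked 'jumbo'. The helper name, namespace, and shard key value are hypothetical placeholders.

```javascript
// Build a clearJumboFlag command document that identifies the chunk
// by a shard key value it contains.
function buildClearJumboFlagCommand(namespace, shardKeyQuery) {
  return { clearJumboFlag: namespace, find: shardKeyQuery };
}

const clearCmd = buildClearJumboFlagCommand("test.orders", { customerId: 42 });

// In a shell connected to a mongos, the document would be passed to:
//   db.adminCommand(clearCmd)
console.log(JSON.stringify(clearCmd));
```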

      Unlike the new 'forceJumbo' behavior of the moveChunk command described above, in a balancer-scheduled migration the donor shard will not enter the critical section early. If the transfer mods queue (the queue of writes that modify any documents being migrated) exceeds 500MB of memory, the migration will fail. This avoids unintended downtime for a user who was unaware that moving a large chunk can block operations on this collection for a long period of time.

      Changes to shard removal:
      If a shard is in draining mode, meaning its removal has been started, the balancer will also attempt to schedule migrations of any large chunks that currently belong to this shard. The balancer will behave as if 'attemptToBalanceJumboChunks' were set to true (described above).
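
      The scheduling rules described above can be modeled as a small predicate. This is only an illustration of the described behavior, not the server's implementation; the function name and the shape of its arguments are assumptions, loosely mirroring documents in config.settings, config.shards, and config.chunks.

```javascript
// Return true if, per the rules described in this ticket, the balancer
// may schedule a migration for a large chunk.
function balancerMayScheduleLargeChunk(balancerSettings, donorShard, chunk) {
  // A chunk already marked jumbo is never rescheduled automatically.
  if (chunk.jumbo === true) {
    return false;
  }
  const optedIn = balancerSettings.attemptToBalanceJumboChunks === true;
  // A draining shard is treated as if the setting were true.
  const draining = donorShard.draining === true;
  return optedIn || draining;
}

console.log(balancerMayScheduleLargeChunk({}, { draining: true }, {}));
console.log(balancerMayScheduleLargeChunk({ attemptToBalanceJumboChunks: true }, {}, { jumbo: true }));
```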

      Description of Linked Ticket

      Currently, if a chunk is larger than the maximum chunk size (64MB by default, 1GB at most), the balancer will mark it as jumbo and will refuse to move it.

      It is possible to manually issue a moveChunk command and pass the unsupported and undocumented maxChunkSizeBytes parameter, which overrides the check for max chunk size. Even so, given sufficient write load on the chunk being migrated, memory usage on the donor shard could exceed 500MB, in which case the migration will still fail.

      This ticket proposes adding a new forceJumbo option to the moveChunk command to allow large chunks to be migrated, at the possible expense of blocking writes to the owning collection on the shard in question. With this option, migration will deviate from its current behavior in the following ways:

      1. It will skip the step that sorts the cloned chunk's document ids and will instead give out the documents in shard key order (this means it will never return a 'jumbo chunk' error)
      2. If the memory usage exceeds 500MB, it will enter the critical section instead of failing the migration (this means that writes to the collection being migrated may block for a longer period of time)
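
      The second point can be illustrated with a small decision function. This is a sketch of the described behavior only, not the server's code; the function name, the returned labels, and the interpretation of 500MB as 500 * 1024 * 1024 bytes are assumptions.

```javascript
// Assumption: "500MB" is taken to mean 500 * 1024 * 1024 bytes.
const TRANSFER_MODS_LIMIT_BYTES = 500 * 1024 * 1024;

// Decide what the donor shard does given the current memory usage of the
// transfer mods queue and whether forceJumbo was requested.
function nextMigrationStep(queueBytes, forceJumbo) {
  if (queueBytes <= TRANSFER_MODS_LIMIT_BYTES) {
    return "continue-cloning";
  }
  // Over the limit: forceJumbo blocks writes instead of aborting.
  return forceJumbo ? "enter-critical-section" : "fail-migration";
}

console.log(nextMigrationStep(600 * 1024 * 1024, true));
console.log(nextMigrationStep(600 * 1024 * 1024, false));
```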

      Scope of changes

      Impact to Other Docs

      MVP (Work and Date)

      Resources (Scope or Design Docs, Invision, etc.)

            Assignee:
            kay.kim@mongodb.com Kay Kim (Inactive)
            Reporter:
            backlog-server-pm Backlog - Core Eng Program Management Team

              Created:
              Updated:
              Resolved:
              3 years, 50 weeks, 2 days ago