  1. Core Server
  2. SERVER-74647

Resharding state machine creation should be retried after interruption

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 7.0.0-rc0, 6.0.6, 5.0.17
    • Affects Version/s: 5.0.15, 6.2.0-rc6, 6.0.5, 6.2.1, 6.3.0-rc2
    • Component/s: None
    • None
    • Sharding NYC
    • Fully Compatible
    • ALL
    • v6.3, v6.2, v6.0, v5.0
    • 105

      createReshardingStateMachine is the function in charge of:

      1. Writing the state machine document on disk
      2. Creating the related POS instance

      Unfortunately, this function is not idempotent. If the opCtx is interrupted between (1.) and (2.) during the first execution, every subsequent execution will retry (1.), fail with a DuplicateKey error, and never reach (2.). In this scenario the state machine document is written on disk, but the POS instance for the recipient/donor is never installed or executed, leaving the resharding operation in a hung state.

      createReshardingStateMachine is called as part of the shard version recovery procedure. The operation context of this procedure is interrupted every time a thread enters the collection critical section, for instance as part of a chunk migration.

      One possible solution would be to attempt the creation of the POS instance even if we hit the DuplicateKey error on insertion.
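
      To make the proposal concrete, here is a minimal, self-contained sketch of the retry-safe flow. It is not the server code: FakeShard, DuplicateKeyError, Interrupted and createStateMachineIdempotent are hypothetical stand-ins invented for illustration, and the sketch assumes that installing an instance that already exists is a harmless no-op.

```cpp
#include <set>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for this sketch; none of these are real server types.
struct DuplicateKeyError : std::runtime_error {
    using std::runtime_error::runtime_error;
};
struct Interrupted : std::runtime_error {
    using std::runtime_error::runtime_error;
};

struct FakeShard {
    std::set<std::string> documentsOnDisk;   // persisted state machine documents
    std::set<std::string> runningInstances;  // installed POS instances
    bool interruptAfterInsert = false;       // simulates one opCtx interruption

    // Step (1.): throws if the document already exists.
    void insertStateDocument(const std::string& id) {
        if (!documentsOnDisk.insert(id).second) {
            throw DuplicateKeyError("duplicate key: " + id);
        }
    }

    // Step (2.): assumed to be a no-op when the instance already exists.
    void getOrCreateInstance(const std::string& id) {
        runningInstances.insert(id);
    }

    // Throws once if an interruption has been armed, then disarms it.
    void checkForInterrupt() {
        if (interruptAfterInsert) {
            interruptAfterInsert = false;
            throw Interrupted("operation interrupted");
        }
    }
};

// Retry-safe variant: a DuplicateKey on step (1.) is taken to mean that a
// previous attempt already wrote the document, so step (2.) still runs.
void createStateMachineIdempotent(FakeShard& shard, const std::string& id) {
    try {
        shard.insertStateDocument(id);
    } catch (const DuplicateKeyError&) {
        // Document already on disk from an interrupted attempt; fall through.
    }
    shard.checkForInterrupt();  // the window between (1.) and (2.)
    shard.getOrCreateInstance(id);
}
```

      In this sketch, a first call that is interrupted between the two steps leaves the document on disk with no instance, and a second call swallows the DuplicateKey error and still installs the instance.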

            Assignee:
            brett.nawrocki@mongodb.com Brett Nawrocki
            Reporter:
            tommaso.tocci@mongodb.com Tommaso Tocci
            Votes:
            0
            Watchers:
            4
