Consider increasing MigrationDestinationManager::startCommit timeout



      If a network error occurs while the moveChunk receiver is communicating with the config server, the operation hangs for 30 seconds and then fails, because the startCommit timeout is equal to the delay before a failed network request is retried.

      Detailed explanation

      In the moveChunk flow, on the receiver side, the migrateThread calls MigrationDestinationManager::_migrateDriver to perform the necessary steps. Once that returns, the thread notifies the _isActiveCV condition variable, on which startCommit waits for at most 30 seconds.

      After each step of MigrationDestinationManager::_migrateDriver, the state is logged on the CSRS through the MoveTimingHelper, which calls into the ShardingLogger to insert a config document. As highlighted in SERVER-51397, if a network partition happens during a CatalogClient request, the first retry happens after 30 seconds. That is too late, because the startCommit timeout is also exactly 30 seconds.
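
The timing problem described above can be illustrated with a small model. This is only a sketch under stated assumptions, not MongoDB code: `completionTime`, `commitSucceeds`, and `checkScenarios` are hypothetical names, the 50 ms of "useful work" and the 60-second alternative timeout are made-up figures, and the model assumes a single failed config-server write that succeeds on its first retry.

```cpp
#include <cassert>
#include <chrono>

using Millis = std::chrono::milliseconds;

// Hypothetical model: the moment the receiver would notify the condition
// variable, given the time spent on real work and the delay before a failed
// config-server request is retried. Assumes at most one failure.
constexpr Millis completionTime(Millis workTime, Millis retryDelay, bool networkError) {
    return networkError ? workTime + retryDelay : workTime;
}

// Hypothetical model: the commit succeeds only if the notification arrives
// strictly before the wait on the condition variable times out.
constexpr bool commitSucceeds(Millis commitTimeout, Millis workTime,
                              Millis retryDelay, bool networkError) {
    return completionTime(workTime, retryDelay, networkError) < commitTimeout;
}

// Walks through the three scenarios discussed in the ticket.
inline void checkScenarios() {
    const Millis retryDelay{30000};  // 30 s retry delay, as stated in the ticket
    const Millis work{50};           // assumed figure, for illustration only

    // No network error: the receiver finishes well within 30 s.
    assert(commitSucceeds(Millis{30000}, work, retryDelay, false));

    // One network error with timeout == retry delay: the retry cannot
    // complete before the wait expires, so the commit fails.
    assert(!commitSucceeds(Millis{30000}, work, retryDelay, true));

    // A timeout larger than the retry delay (60 s is an arbitrary choice)
    // leaves room for one retry.
    assert(commitSucceeds(Millis{60000}, work, retryDelay, true));
}
```

In this model, any timeout that is not strictly greater than the retry delay plus the remaining work fails after a single network error, which is why the two 30-second values are the issue.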

              Assignee:
              Pierlauro Sciarelli
              Reporter:
              Pierlauro Sciarelli
