Core Server / SERVER-94151

Improve resharding cloner resilience to prevent potential data loss during refresh failures

    • Cluster Scalability
    • Fully Compatible
    • ALL
    • v8.0
    • 200

      The resharding cloner logic currently has a flaw that could lead to data loss if an exception occurs during the refresh phase. This was spotted after the introduction (and subsequent reversion) of SERVER-92530. While the immediate risk has been mitigated by reverting SERVER-92530, the underlying issue in the cloner logic still needs to be addressed to improve its resilience and prevent data loss in future scenarios.

      Issue Summary:

      • The cloner fetches the list of documents from the donor and writes them to a temporary collection.
      • The write operation is managed by resharding::data_copy::withOneStaleConfigRetry, which forces a refresh and re-calls the failed callback in case of a StaleConfig error.
      • In such cases, _writeOnceWithNaturalOrder is executed twice.
      • However, the dispatchResult query gets moved during the first attempt, so the second attempt operates on an empty batch.
      • This can result in an empty temporary collection and, as a consequence, data loss (see the sketch after this list).
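
      To make the failure mode concrete, here is a minimal C++ sketch of the retry-after-move pattern. It is not the actual server code: withOneStaleConfigRetry and writeOnceWithNaturalOrder below are simplified stand-ins for the real functions, and the batch is modeled as a plain std::vector.

{code:cpp}
#include <iostream>
#include <stdexcept>
#include <utility>
#include <vector>

// Stand-in for the StaleConfig error that triggers a refresh-and-retry.
struct StaleConfigException : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Simplified analogue of resharding::data_copy::withOneStaleConfigRetry:
// on StaleConfig, force a refresh (elided here) and re-invoke the callback.
template <typename Callback>
void withOneStaleConfigRetry(Callback&& cb) {
    try {
        cb();
    } catch (const StaleConfigException&) {
        // ... refresh routing/filtering metadata here ...
        cb();  // second attempt re-runs the same callback
    }
}

// Simplified analogue of _writeOnceWithNaturalOrder: consumes the batch by move.
void writeOnceWithNaturalOrder(std::vector<int> batch, bool throwStale) {
    if (throwStale) {
        throw StaleConfigException("stale shard version");
    }
    std::cout << "wrote " << batch.size() << " documents\n";
}

int main() {
    std::vector<int> dispatchResult{1, 2, 3};  // documents fetched from the donor

    bool firstAttempt = true;
    withOneStaleConfigRetry([&] {
        // BUG illustrated: the batch is moved into the first attempt; when the
        // retry re-invokes this lambda, dispatchResult is already in a
        // moved-from (typically empty) state, so the second attempt writes
        // an empty batch.
        bool throwStale = std::exchange(firstAttempt, false);
        writeOnceWithNaturalOrder(std::move(dispatchResult), throwStale);
    });
    // Prints "wrote 0 documents": the temporary collection ends up empty.
}
{code}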

      Currently, _writeOnceWithNaturalOrder is protected from hitting a StaleConfig error because the cloner thread created by the function itself follows the same logic: in case of StaleConfig, it is the cloner thread that refreshes.
      SERVER-92530 introduced the possibility for a refresh to fail. In case of such a failure, _writeOnceWithNaturalOrder would find stale metadata and behave as described above. Even though SERVER-92530 has been reverted, we are still planning to re-commit it once the related issues are fixed.
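
      One purely illustrative way to harden the cloner (not necessarily the approach the eventual fix will take) is to make each retry attempt obtain its own batch instead of re-using a value that was moved into the first attempt. The sketch below assumes a hypothetical writeWithStaleConfigRetry helper that takes a batch factory:

{code:cpp}
#include <iostream>
#include <stdexcept>
#include <vector>

struct StaleConfigException : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical hardened retry helper: each attempt obtains its own copy of
// the batch from a factory, so a retry after StaleConfig never observes a
// moved-from container.
template <typename BatchFactory, typename WriteFn>
void writeWithStaleConfigRetry(BatchFactory makeBatch, WriteFn write) {
    try {
        write(makeBatch());
    } catch (const StaleConfigException&) {
        // ... force a metadata refresh here ...
        write(makeBatch());  // fresh batch for the second attempt
    }
}

int main() {
    const std::vector<int> source{1, 2, 3};  // documents fetched from the donor

    bool firstAttempt = true;
    writeWithStaleConfigRetry(
        [&] { return source; },  // factory: copies (or re-reads) the batch per attempt
        [&](std::vector<int> batch) {
            if (firstAttempt) {
                firstAttempt = false;
                throw StaleConfigException("stale shard version");
            }
            std::cout << "wrote " << batch.size() << " documents\n";  // prints 3
        });
}
{code}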

            Assignee:
            kruti.shah@mongodb.com Kruti Shah
            Reporter:
            enrico.golfieri@mongodb.com Enrico Golfieri
            Votes:
            0
            Watchers:
            13
