Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-37051

ShardServerCatalogCacheLoader does not check the internal term after reading from the task queue

    XMLWordPrintable

    Details

    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Backport Requested:
      v4.0, v3.6
    • Sprint:
      Sharding 2018-09-24
    • Linked BF Score:
      57

      Description

      There is a race condition in ShardServerCatalogCacheLoader, where if a shard node is running as a primary and a step-down happens, it may read in-memory task queue and persisted cache state, which is not consistent.

      Specifically, consider a node which is a primary and found some data reading from the config server here. It will then schedule this data to be persisted to the cache collections and then will proceed to do a merge of the task queue + what's already persisted in order to produce a list of the changed chunks.

      In a stepdown-free case, this would work fine. However, if by the time it got to read what it persisted and what is on the queue, the node stepped down, neither the write to the cache collections could have happened, nor anything remained on the task queue because of the change in term. That way it could come back with incomplete data (which would be a data loss) or it could come back with an empty list, which will invariant.

      In order to fix it, after reading from the task queue + persisted cache, we should check if the term has changed here and throw ConflictingOperationInProgress error so the load can be retried as secondary.

        Attachments

          Activity

            People

            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: