Core Server / SERVER-67078

Advancing just the minor version on the primary of a shard should not stall the secondaries

    • Type: Bug
    • Resolution: Won't Do
    • Priority: Major - P3
    • None
    • Affects Version/s: 3.6.23, 4.0.28, 4.4.13, 4.2.20, 5.0.9, 6.0.0-rc8
    • Component/s: None
    • Labels:
    • Catalog and Routing
    • ALL

      The following sequence is possible:

      1. The primary of a shard performs a split or commits a chunk migration. (For the purposes of this ticket we consider only splits: they are more problematic because they are more frequent and bump only the minor version.)
      2. The split bumps only the minor version of the collection, but the primary still advances its filtering metadata (shardVersion).
      3. This causes the newly split chunks to be written to the config.cache.chunks.XXX collection. Because that write is not atomic, the primary first writes a refreshing:true entry and clears it once all the changes have been written.
      4. Upon seeing the first write from the previous step, the secondary throws out its filtering metadata (shardVersion). Any read that now arrives at that secondary stalls until the primary has completed at least one refresh from the CSRS and cleared the refreshing flag.
      5. The secondary remains stalled until the primary completes one round of refresh from the CSRS.
      6. By the time the secondary loops around to read the new metadata, however, the primary might have committed another split, which would have generated yet another refreshing:true entry.
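
Steps 3 and 4 can be illustrated with a small toy model. This is only a sketch: `CachedCollection`, `Secondary`, and `persist_split` are hypothetical names invented for illustration and do not correspond to actual server classes or functions. The model assumes the secondary observes each of the primary's writes immediately.

```python
from dataclasses import dataclass, field


@dataclass
class CachedCollection:
    """Toy stand-in for the persisted routing cache (hypothetical model)."""
    refreshing: bool = False
    chunks: list = field(default_factory=list)


@dataclass
class Secondary:
    """Toy secondary that reacts to replicated writes to the cache."""
    has_filtering_metadata: bool = True

    def on_replicated_flag(self, refreshing: bool) -> None:
        if refreshing:
            # Step 4: seeing refreshing:true discards the filtering metadata.
            self.has_filtering_metadata = False

    def try_read(self, cache: CachedCollection) -> bool:
        """Return True if a read can proceed, False if it must wait."""
        if self.has_filtering_metadata:
            return True
        if cache.refreshing:
            return False  # the primary has not finished persisting yet
        self.has_filtering_metadata = True  # reload from the persisted cache
        return True


def persist_split(cache, secondary, new_chunks):
    """Yield after each non-atomic write so the caller can interleave reads."""
    cache.refreshing = True
    secondary.on_replicated_flag(True)
    yield "flag_set"
    cache.chunks.extend(new_chunks)
    yield "chunks_written"
    cache.refreshing = False
    secondary.on_replicated_flag(False)
    yield "flag_cleared"


cache, sec = CachedCollection(), Secondary()
observed = []
for stage in persist_split(cache, sec, ["[0, 50)", "[50, 100)"]):
    # Attempt a secondary read between each of the primary's writes.
    observed.append((stage, sec.try_read(cache)))
print(observed)
```

In this model a read attempted on the secondary is refused at every stage until the flag has been cleared, even though the split changed only the minor version.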

      This loop potentially has a liveness problem: if too many splits (or merges and moves) happen back to back on the primary, the loop might never complete. Moves happen much less frequently, so for them this has normally not been a problem, but for splits it definitely is.
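
The liveness concern can be made concrete with a discrete-time sketch. Everything here is an assumption made for illustration: the function name, the three tick parameters, and the premise that the secondary needs a fixed number of uninterrupted flag-free ticks to reload its metadata. It is not a measurement or a description of actual server timing.

```python
from typing import Optional


def ticks_until_secondary_unblocks(
    split_interval: int,
    persist_ticks: int,
    reload_ticks: int,
    max_ticks: int = 10_000,
) -> Optional[int]:
    """Simulate a stalled secondary read in discrete time steps.

    All parameters are hypothetical knobs of this toy model:
      split_interval: ticks between consecutive splits on the primary
      persist_ticks:  ticks the refreshing flag stays set per split
      reload_ticks:   uninterrupted flag-free ticks the secondary needs
    Returns the tick at which the read proceeds, or None if it never
    does within max_ticks.
    """
    quiet = 0  # consecutive ticks observed with the refreshing flag clear
    for tick in range(max_ticks):
        refreshing = (tick % split_interval) < persist_ticks
        if refreshing:
            quiet = 0  # a new refreshing:true restarts the secondary's wait
        else:
            quiet += 1
            if quiet >= reload_ticks:
                return tick
    return None


infrequent = ticks_until_secondary_unblocks(
    split_interval=100, persist_ticks=2, reload_ticks=3)
back_to_back = ticks_until_secondary_unblocks(
    split_interval=4, persist_ticks=2, reload_ticks=3)
print(infrequent, back_to_back)
```

With widely spaced splits the read unblocks shortly after the first flag is cleared. When the gap between splits is shorter than the time the secondary needs to reload, the model never unblocks, which is the starvation scenario described above.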

            Assignee:
            backlog-server-catalog-and-routing [DO NOT USE] Backlog - Catalog and Routing
            Reporter:
            kaloian.manassiev@mongodb.com Kaloian Manassiev
            Votes:
            0
            Watchers:
            18
