  1. Core Server
  2. SERVER-33954

CatalogCache refresh methods are not causally consistent


    Details

    • Backwards Compatibility:
      Fully Compatible
    • Backport Requested:
      v4.0, v3.6
    • Sprint:
      Sharding 2018-04-09, Sharding 2018-04-23, Sharding 2018-05-07, Sharding 2018-05-21, Sharding 2018-06-04, Sharding 2018-06-18
    • Linked BF Score:
      73

      Description

      The CatalogCache refresh methods simply 'join' an in-progress refresh if another thread is already refreshing the same collection. As a result, a refresh requested after a change made at Tn can miss that change if the in-progress refresh it joins started at Tn-1.

      Here's a concrete example based on the build failure:
      1. moveChunk command is called.
      2. At the end of the migration, source shardA sends setShardVersion command to recipient shardB asynchronously.
      3. setShardVersion command triggers a CatalogCache refresh at shardB.
      4. Since step 3 is happening asynchronously, the test can proceed to dropping the collection.
      5. The final step of the drop is to send setShardVersion (0, 0) to the shards.
      6. When that setShardVersion arrives at shardB while the refresh from step 3 is still in progress, it simply joins that refresh and gets stale information in which the collection has not been dropped yet.
      7. setShardVersion on drop fails since version (0, 0) does not match the version found in the refresh.
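
      The sequence above can be modelled with a small single-threaded sketch. It is only an illustration of the join behaviour, not the server's code: `ConfigServer`, `Refresh`, `CatalogCacheModel`, `request_refresh` and `run_scenario` are hypothetical names, and the non-joining branch is one possible way to make a refresh observe everything that happened before it was requested, not necessarily the fix adopted for this ticket.

```python
DROPPED = (0, 0)  # version the ticket uses for a dropped collection


class ConfigServer:
    """Stand-in for the authoritative metadata; not a real MongoDB API."""

    def __init__(self, version):
        self.version = version


class Refresh:
    """One refresh round; it reads the authoritative version when it starts."""

    def __init__(self, config):
        self.seen = config.version


class CatalogCacheModel:
    """Hypothetical model of a cache with at most one refresh in flight."""

    def __init__(self, config, join_in_flight):
        self.config = config
        self.join_in_flight = join_in_flight
        self.in_flight = None

    def request_refresh(self):
        """Return the refresh round whose result this caller will observe."""
        if self.in_flight is not None and self.join_in_flight:
            # Joining: the round may have started before this request,
            # so it can miss changes made in between.
            return self.in_flight
        # Otherwise use a round that starts after this request. (A real
        # implementation would wait for the older round to finish first.)
        self.in_flight = Refresh(self.config)
        return self.in_flight


def run_scenario(join_in_flight):
    config = ConfigServer(version=(3, 0))
    cache = CatalogCacheModel(config, join_in_flight)
    cache.request_refresh()           # step 3: refresh triggered after moveChunk
    config.version = DROPPED          # steps 4-5: the collection is dropped
    second = cache.request_refresh()  # step 6: setShardVersion (0, 0) arrives
    return second.seen


stale = run_scenario(join_in_flight=True)
fresh = run_scenario(join_in_flight=False)
```

      With joining enabled the second caller observes the pre-drop version, which is the mismatch described in step 7; when the second caller gets a round that starts after its own request, it observes (0, 0).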

      Original description:

      After `test.foo` is dropped, the drop is not reflected in the catalogCache. After refreshing, the shard logs that it refreshed from the old version:

      Refresh for collection test.foo took 92 ms and found version 3|0||5aac36f988c1f30185b7b1df

      and the test fails because when trying to setShardVersion during the drop, the correct version is sent (version 0|0||000000000000000000000000), but the incorrect version is found.
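
      To make the mismatch concrete, the two version strings quoted above can be compared field by field. This is a minimal sketch that assumes the `major|minor||epoch` layout visible in the log line; `ChunkVersionModel` and `parse_version` are hypothetical helpers, not server APIs.

```python
from typing import NamedTuple


class ChunkVersionModel(NamedTuple):
    """Hypothetical holder for the fields seen in the log output."""
    major: int
    minor: int
    epoch: str


def parse_version(text):
    # Assumed layout, taken from the log line: "major|minor||epoch".
    counters, epoch = text.split("||")
    major, minor = counters.split("|")
    return ChunkVersionModel(int(major), int(minor), epoch)


found = parse_version("3|0||5aac36f988c1f30185b7b1df")  # from the stale refresh
sent = parse_version("0|0||000000000000000000000000")   # sent during the drop
versions_match = found == sent
```

      Here `versions_match` is false: both the counters and the epoch differ, so the shard rejects the setShardVersion sent by the drop.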

      Attached is the test to reproduce the error as well as the logs.

        Attachments

        1. repro.diff
          1 kB
        2. stale_mongos_updates_and_removes.js
          10 kB
        3. stalemongosfail.txt
          1.22 MB
        4. test.js
          1 kB


              People

              Assignee:
              matthew.saltz Matthew Saltz
              Reporter:
              janna.golden Janna Golden