SERVER-36154

Shard's in-memory CSS is not refreshed after upgrading from 3.4 to 3.6, causing a UUID mismatch on $changeStream operations

    • Sharding
    • ALL

      This is a slight modification to the uuid_propagated_to_shards_on_setFCV_3_6.js test which reproduces the issue:

      (function() {
          let st = new ShardingTest({shards: {rs0: {nodes: 1}}, other: {config: 3}});
      
          load('jstests/libs/uuid_util.js');
      
          // Start in fcv=3.4.
          assert.commandWorked(st.s.adminCommand({setFeatureCompatibilityVersion: "3.4"}));
      
          let db1 = "test1";
          assert.commandWorked(st.s.adminCommand({enableSharding: db1}));
          st.ensurePrimaryShard(db1, st.shard0.shardName);
      
          assert.commandWorked(st.s.adminCommand({shardCollection: db1 + ".foo0", key: {_id: 1}}));
      
          jsTest.log("upgrade the cluster to fcv=3.6");
          assert.commandWorked(st.s.adminCommand({setFeatureCompatibilityVersion: "3.6"}));
      
          st.checkUUIDsConsistentAcrossCluster();
          assert.commandWorked(st.shard0.getDB('admin').runCommand({forceRoutingTableRefresh: "test1.foo0"}));
      
          let db = st.s.getDB(db1);
          let cs = db.foo0.watch();
      
          assert.writeOK(db.foo0.insert({_id: 0}));
          assert.writeOK(db.foo0.insert({_id: 1}));
      
          assert.soon(() => cs.hasNext());
          assert.eq(cs.next().operationType, "insert");
      })();
      
    • Sharding 2018-10-08, Sharding 2018-11-05

      Issue Status as of Mar 11, 2019

      ISSUE DESCRIPTION AND IMPACT
      After upgrading to MongoDB 3.6 and attempting to initialize a change stream against a sharded collection, users may encounter the following error:

      errmsg: 'Collection foo.bar UUID differs from UUID on change stream operations',
      

      This error occurs when at least one shard that owns data for the collection received an operation for that collection before the upgrade to Feature Compatibility Version 3.6. In that case, the shard's sharding cache entries are persisted without a UUID. Once a shard's cache reaches this state, subsequent refreshes of the cache will not add a UUID, regardless of the Feature Compatibility Version.

      DIAGNOSIS AND AFFECTED VERSIONS
      This can occur after upgrading a MongoDB sharded cluster to 3.6.x.

      To confirm the situation, run the following query directly against the shard Primary that encountered the error and check whether the returned document has a UUID associated with it.

      db.getSiblingDB("config").cache.collections.find({_id:<namespace>})
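
      The check on the returned document can be sketched as a small helper. This is a minimal sketch, not an official tool: the function name classifyCacheEntry is hypothetical, and it assumes the UUID is stored under a field named "uuid" in the cache entry.

```javascript
// Classifies a document read from config.cache.collections.
// Assumption: the collection UUID, when present, is stored in a field named "uuid".
function classifyCacheEntry(entry) {
    if (entry === null || entry === undefined) {
        // No cached entry for this namespace on this shard.
        return "no-entry";
    }
    if (entry.uuid === null || entry.uuid === undefined) {
        // Entry was persisted without a UUID: the state described in this issue.
        return "missing-uuid";
    }
    return "has-uuid";
}

// Hypothetical usage in the mongo shell, connected directly to the shard Primary:
//   var entry = db.getSiblingDB("config").cache.collections.findOne({_id: "<namespace>"});
//   print(classifyCacheEntry(entry));
```

      A result of "missing-uuid" would indicate that the shard is in the affected state and that the remediation steps apply.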
      

      REMEDIATION AND WORKAROUNDS
      To resolve this issue, perform the following steps:
      1. Connect to the shard Primary directly, not through the mongos.

      mongo --port <shardport>
      

      2. Remove the document in config.cache.collections that matches the problem namespace.

      db.getSiblingDB("config").cache.collections.remove({_id:<namespace>}, {writeConcern: {w:"majority"}})
      

      3. Drop the config.cache.chunks collection that matches your namespace. If you are on 4.0, you can pass a write concern of "majority" to the drop statement to ensure it becomes majority-committed before proceeding. If you are on 3.6, run the command without the write concern and check the Secondaries to confirm that the drop has become majority-committed.

      db.getSiblingDB("config").cache.chunks.<namespace>.drop({writeConcern: {w:"majority"}})
      

      4. Restart the affected shards by performing a rolling restart.
      5. Perform a query that touches all of the shards that contain the problem collection.

      db.getSiblingDB("<database>").<collection>.find().toArray().length
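
      Steps 2 and 3 above can be combined into one helper, sketched below under stated assumptions. The function name clearShardRoutingCache and its parameters are hypothetical; db is assumed to be a mongo shell database handle on a direct connection to the shard Primary, and ns is the full namespace (for example "test1.foo0"). The rolling restart in step 4 and the query in step 5 still have to be performed separately.

```javascript
// Sketch of remediation steps 2 and 3 for one namespace on one shard.
// `dropAcceptsWriteConcern` should be true only on 4.0, per step 3.
function clearShardRoutingCache(db, ns, dropAcceptsWriteConcern) {
    var majority = {writeConcern: {w: "majority"}};
    var configDB = db.getSiblingDB("config");

    // Step 2: remove the cached collection entry for the problem namespace.
    var removeResult = configDB.getCollection("cache.collections").remove({_id: ns}, majority);

    // Step 3: drop the cached chunks collection for the same namespace.
    var chunks = configDB.getCollection("cache.chunks." + ns);
    var dropResult = dropAcceptsWriteConcern ? chunks.drop(majority) : chunks.drop();

    return {removeResult: removeResult, dropResult: dropResult};
}

// Hypothetical usage on 3.6 (no write concern on drop):
//   clearShardRoutingCache(db, "test1.foo0", false);
```

      On 3.6 the drop is issued without a write concern, so the Secondaries still need to be checked manually before restarting the shards.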
      
      Original description

      This was found as part of the investigation for SERVER-35999, where a user tries to open a change stream against a sharded collection just after upgrading to 3.6. Currently, the setFCV command does attempt to propagate the newly generated UUIDs for existing collections; however, the in-memory cache will still be stale.

      Change streams verify that the UUID from the oplog matches the UUID in the CSS (CollectionShardingState), failing if there is a mismatch or if the UUID does not exist. While bouncing the shards is a valid workaround, it would be nice from a usability standpoint if the setFCV flow also forced a refresh of the in-memory CSS.
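
      The failure mode can be illustrated with a conceptual model. This is not the server's implementation: the function checkChangeStreamUuid is hypothetical, UUIDs are represented as plain strings, and the message text is taken from the error quoted in the issue status.

```javascript
// Conceptual model of the UUID check; not the actual server code.
// `oplogUuid` is the UUID seen on the oplog entry, `cssUuid` the UUID held
// in the shard's in-memory CSS (undefined or null when no UUID is cached).
function checkChangeStreamUuid(ns, oplogUuid, cssUuid) {
    var missing = cssUuid === undefined || cssUuid === null;
    if (missing || cssUuid !== oplogUuid) {
        throw new Error("Collection " + ns + " UUID differs from UUID on change stream operations");
    }
    return true;
}

// A stale CSS without a UUID fails even though the oplog entry carries one:
//   checkChangeStreamUuid("test1.foo0", "uuid-a", undefined);  // throws
```

      In this model, a shard whose CSS was populated before the upgrade fails the check until its cache is refreshed with the new UUID.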

            Assignee:
            backlog-server-sharding [DO NOT USE] Backlog - Sharding Team
            Reporter:
            nicholas.zolnierz@mongodb.com Nicholas Zolnierz
            Votes:
            0
            Watchers:
            12
