SERVER-33714 (Core Server)

Downgrading FCV from 3.6 to 3.4 leaves an admin.system.keys collection on shards that on upgrade is orphaned and renamed without a UUID

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: 3.6.3
    • Fix Version/s: 3.6.5
    • Component/s: Sharding
    • Labels:
      None
    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Steps To Reproduce:

      The way to reproduce this issue is to add two lines into the test (updates_in_heterogeneous_repl_set.js):

      replTest.awaitSecondaryNodes();
       
      + replTest.stepUp(replTest.nodes[2]);
      + replTest.awaitSecondaryNodes();
       
      // Set the replica set feature compatibility version to 3.6.
      primary = replTest.getPrimary();
      

      This elects as primary the node that was originally added to the replica set as a v3.4 binary, so it never cloned admin.system.keys during initial sync, and then has it run setFCV(3.6), which creates the admin.system.keys collection during upgrade.
    • Sprint:
      Sharding 2018-04-23
    • Linked BF Score:
      45

      Description

      This applies to v3.6 only.

      The "admin.system.keys" collection was introduced in v3.6, and a v3.4 node cannot clone it from the primary during initial sync, because system collections must be whitelisted for cloning. As a result, a mixed v3.6 and v3.4 binary replica set at FCV 3.4 can contain v3.4 nodes that lack a collection the v3.6 nodes have. This can happen on shards but not on config servers, because config servers drop the collection on downgrade, whereas shards do not.

      If the v3.4 binary is then upgraded to v3.6, elected primary, and runs setFCV 3.6, it creates admin.system.keys, which the secondaries already have. Each secondary then renames its original admin.system.keys collection to a temporary collection and creates a new admin.system.keys. The 3.6 nodes are left with an orphaned collection "admin.tmpxxxxx.create" without a UUID.

      This was caught by the UUID validation code. The downgrade to FCV 3.4 in the test strips the UUIDs. The upgrade to FCV 3.6 via the originally v3.4 node then sends a createCollection for admin.system.keys with a UUID through the oplog to the secondaries, which already have the collection. They rename their original collection, which has no UUID, to admin.tmpxxxxx.create, and that collection is left orphaned.
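
The sequence described above can be sketched as a toy model in plain JavaScript. This is not MongoDB code: makeNode, downgradeToFcv34, applyCreate and collectionsWithoutUuid are hypothetical helpers, the UUID strings are made up, and the rename-on-conflict rule is an assumption based only on the behaviour this ticket describes. The "tmp00000" suffix stands in for the "tmpxxxxx" placeholder.

```javascript
// Toy model only: plain objects stand in for mongod nodes.
// None of these helpers are MongoDB APIs.

// A node's catalog maps a namespace to collection metadata ({ uuid }).
function makeNode(name, collections) {
  return { name, catalog: new Map(Object.entries(collections)) };
}

// Downgrading to FCV 3.4 strips collection UUIDs (as the description states).
function downgradeToFcv34(node) {
  for (const meta of node.catalog.values()) meta.uuid = null;
}

// Assumed secondary-side handling of a replicated "create" entry: if the
// namespace is already taken by a collection with a different (or no) UUID,
// the existing collection is renamed to a temporary namespace first.
let tmpCounter = 0;
function applyCreate(node, ns, uuid) {
  const existing = node.catalog.get(ns);
  if (existing && existing.uuid === uuid) return; // same collection, nothing to do
  if (existing) {
    const [db] = ns.split(".");
    const suffix = String(tmpCounter++).padStart(5, "0");
    node.catalog.delete(ns);
    node.catalog.set(`${db}.tmp${suffix}.create`, existing);
  }
  node.catalog.set(ns, { uuid });
}

// Roughly what a UUID validation pass would flag at FCV 3.6.
function collectionsWithoutUuid(node) {
  return [...node.catalog]
    .filter(([, meta]) => meta.uuid === null)
    .map(([ns]) => ns);
}

// Scenario from the description.
const secondary = makeNode("originally-v3.6", {
  "admin.system.keys": { uuid: "uuid-A" },
});
const newPrimary = makeNode("originally-v3.4", {}); // never cloned admin.system.keys

downgradeToFcv34(secondary); // UUID of the existing collection is stripped

// The former v3.4 node, now on a v3.6 binary and elected primary, runs
// setFCV 3.6 and creates the collection with a fresh UUID...
newPrimary.catalog.set("admin.system.keys", { uuid: "uuid-B" });
// ...and the create is replicated to the secondary.
applyCreate(secondary, "admin.system.keys", "uuid-B");

console.log(collectionsWithoutUuid(secondary)); // [ 'admin.tmp00000.create' ]
```

In this model the secondary ends with a new admin.system.keys carrying the primary's UUID plus a renamed leftover with no UUID, which is the orphan the validation reported.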

              People

              Assignee:
              jack.mulrow Jack Mulrow
              Reporter:
              xiangyu.yao Xiangyu Yao (Inactive)