[SERVER-31019] Changing fCV during initial sync leads to divergent data across replica set members Created: 10/Sep/17  Updated: 30/Oct/23  Resolved: 09/Oct/17

Status: Closed
Project: Core Server
Component/s: Replication, Storage
Affects Version/s: None
Fix Version/s: 3.6.0-rc0

Type: Bug
Priority: Major - P3
Reporter: Max Hirschhorn
Assignee: Judah Schvimer
Resolution: Fixed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-31189 fassert if feature compatibility vers... Closed
Duplicate
is duplicated by SERVER-31102 Clone admin.system.version first in i... Closed
Related
related to SERVER-31387 oplog application conflates upserting... Closed
related to SERVER-28151 Authentication database should be syn... Closed
related to SERVER-31254 Fail initial sync if fCV targetVersio... Closed
is related to SERVER-31384 applyOps should propagate oplog appli... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

I've only been able to reproduce this issue with the MMAPv1 storage engine, not with the WiredTiger or EphemeralForTest storage engines; it isn't clear to me why the issue would be storage engine-specific.

python buildscripts/resmoke.py --suites=no_server repro_server31019.js --storageEngine=mmapv1 --repeat=5

repro_server31019.js

(function() {
    "use strict";
 
    const verbositySettings = tojson({
        verbosity: 1,
        replication: 2,
        storage: 2,
    });
 
    const rst = new ReplSetTest({
        nodes: 1,
        nodeOptions: {
            setParameter: {logComponentVerbosity: verbositySettings},
        }
    });
 
    rst.startSet();
    rst.initiate();
 
    const primaryDB = rst.getPrimary().getDB("test");
 
    rst.add({
        setParameter: {
            "failpoint.initialSyncHangBeforeCopyingDatabases": tojson({mode: "alwaysOn"}),
            logComponentVerbosity: verbositySettings
        }
    });
 
    // We disallow the secondary node from voting so that the primary's featureCompatibilityVersion
    // can be modified while the secondary node is still waiting to complete its initial sync.
    {
        const replSetConfig = rst.getReplSetConfigFromNode(0);
        replSetConfig.members = rst.getReplSetConfig().members;
        replSetConfig.members[1].priority = 0;
        replSetConfig.members[1].votes = 0;
        ++replSetConfig.version;
        assert.commandWorked(primaryDB.adminCommand({replSetReconfig: replSetConfig}));
    }
 
    // We set the primary's featureCompatibilityVersion to "3.4" and implicitly create a collection
    // without a UUID via an insert operation.
    {
        assert.commandWorked(primaryDB.adminCommand({setFeatureCompatibilityVersion: "3.4"}));
 
        primaryDB.mycoll.drop();
        assert.writeOK(primaryDB.mycoll.insert({_id: "while in fCV=3.4"}));
    }
 
    // Next, we set the primary's featureCompatibilityVersion to "3.6" and drop the collection that
    // was previously created. We then implicitly create another collection of the same name (but
    // with a UUID this time) via an insert operation.
    {
        assert.commandWorked(primaryDB.adminCommand({setFeatureCompatibilityVersion: "3.6"}));
 
        primaryDB.mycoll.drop();
        assert.writeOK(primaryDB.mycoll.insert({_id: "while in fCV=3.6"}));
    }
 
    // Finally, we allow the secondary node to proceed with its initial sync. It should end up with
    // only the document that was inserted into the collection when the primary's
    // featureCompatibilityVersion was "3.6".
    const secondaryDB = rst.getSecondary().getDB("test");
    assert.commandWorked(secondaryDB.adminCommand({
        configureFailPoint: "initialSyncHangBeforeCopyingDatabases",
        mode: "off",
    }));
 
    rst.checkReplicatedDataHashes();
    rst.stopSet();
})();

Sprint: Repl 2017-10-02, Repl 2017-10-23
Participants:
Linked BF Score: 0

 Description   

The node performing the initial sync appears to retain documents that were inserted before the collection was dropped and re-created after the featureCompatibilityVersion was changed to 3.6. This issue is related to UUIDs and their impact on oplog application, so it doesn't affect the 3.2 or 3.4 branches.

2017-09-10T19:21:06.495-0400 The following documents are missing on the primary:
2017-09-10T19:21:06.495-0400 {  "_id" : "while in fCV=3.4" }
...
2017-09-10T19:21:06.498-0400 checkReplicatedDataHashes, the primary and secondary have a different hash for the test database: {
2017-09-10T19:21:06.498-0400 	"master" : {
2017-09-10T19:21:06.499-0400 		"host" : "hanamizu:20010",
2017-09-10T19:21:06.499-0400 		"collections" : {
2017-09-10T19:21:06.499-0400 			"mycoll" : "09aabf5621c57d91db16b98b365d8e65"
2017-09-10T19:21:06.499-0400 		},
2017-09-10T19:21:06.499-0400 		"md5" : "2105eeb0b1ec2ade59f08fa1f3f40ba9",
2017-09-10T19:21:06.499-0400 		"timeMillis" : 0,
2017-09-10T19:21:06.499-0400 		"ok" : 1,
2017-09-10T19:21:06.499-0400 		"operationTime" : Timestamp(1505085665, 18)
2017-09-10T19:21:06.499-0400 	},
2017-09-10T19:21:06.499-0400 	"slaves" : [
2017-09-10T19:21:06.499-0400 		{
2017-09-10T19:21:06.500-0400 			"host" : "hanamizu:20011",
2017-09-10T19:21:06.500-0400 			"collections" : {
2017-09-10T19:21:06.500-0400 				"mycoll" : "b8b6211fb0b559d95ae6df5cc4071420"
2017-09-10T19:21:06.500-0400 			},
2017-09-10T19:21:06.500-0400 			"md5" : "072bbaef3649d98b3270e6a2a6eac21f",
2017-09-10T19:21:06.500-0400 			"timeMillis" : 0,
2017-09-10T19:21:06.500-0400 			"ok" : 1,
2017-09-10T19:21:06.500-0400 			"operationTime" : Timestamp(1505085665, 18)
2017-09-10T19:21:06.500-0400 		}
2017-09-10T19:21:06.500-0400 	]
2017-09-10T19:21:06.500-0400 }



 Comments   
Comment by Githook User [ 09/Oct/17 ]

Author:

{'email': 'judah@mongodb.com', 'name': 'Judah Schvimer', 'username': 'judahschvimer'}

Message: SERVER-31019 fail initial sync if fCV changes during oplog application
Branch: master
https://github.com/mongodb/mongo/commit/d7a30a716243db13644a16618a939df6bc1344fc

Comment by Spencer Brody (Inactive) [ 15/Sep/17 ]

To fix this we should just fail initial sync if the featureCompatibilityVersion changes in the middle of it. To do this, we should make sure that the very first collection we clone is admin.system.version, so that we know the FCV of the sync source at the beginning of initial sync. Then during initial sync oplog application, we should fail and restart initial sync if we replicate a change to the FCV.

Generated at Thu Feb 08 04:25:45 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.