[SERVER-60606] Race condition during initial sync when index builds start in data cloning phase Created: 11/Oct/21  Updated: 29/Oct/23  Resolved: 12/Oct/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 5.2.0, 4.4.11, 5.0.4, 5.1.0-rc2

Type: Bug
Priority: Major - P3
Reporter: Yuhong Zhang
Assignee: Yuhong Zhang
Resolution: Fixed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Related
related to SERVER-46659 Make initial sync work with two phase... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v5.1, v5.0, v4.4
Steps To Reproduce:

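/**
 * Reproduces a race during initial sync: the new node starts an index build that is
 * still unfinished on the sync source, then replays an 'emptycapped' oplog entry
 * that conflicts with that in-progress build.
 */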
 
 (function() {
    'use strict';
    
    load("jstests/libs/fail_point_util.js");
    load('jstests/replsets/rslib.js');
    const basename = 'initial_sync_rename_database_before_cloning';
    
    jsTestLog('Bring up a replica set');
    const rst = new ReplSetTest({name: basename, nodes: 1});
    rst.startSet();
    rst.initiate();
    
    const dbName = "test";
    
    const primary = rst.getPrimary();
    const primaryDb = primary.getDB(dbName);
    
    jsTestLog("Create collections on primary");
    const collName = "coll";
    
    jsTestLog('Waiting for replication');
    rst.awaitReplication();
    
    jsTestLog('Bring up a new node');
    const secondary = rst.add({setParameter: 'numInitialSyncAttempts=1'});
    
    // Add a fail point that causes the secondary's initial sync to hang before
    // copying databases.
    const failPoint = configureFailPoint(secondary, 'initialSyncHangBeforeCopyingDatabases');
    
    jsTestLog('Begin initial sync on secondary');
    let conf = rst.getPrimary().getDB('admin').runCommand({replSetGetConfig: 1}).config;
    conf.members.push({_id: 1, host: secondary.host, priority: 0, votes: 0});
    conf.version++;
    assert.commandWorked(rst.getPrimary().getDB('admin').runCommand({replSetReconfig: conf}));
    assert.eq(primary, rst.getPrimary(), 'Primary changed after reconfig');
    
    // Confirm that initial sync started on the secondary node.
    jsTestLog('Waiting for initial sync to start');
    failPoint.wait();
 
    assert.commandWorked(primaryDb.getCollection(collName).insert({}));
    assert.commandWorked(primaryDb.runCommand({emptycapped: collName}));
    assert.commandWorked(primaryDb.getCollection(collName).insert({_id: 0, a: 1}));
    
    jsTestLog('Build index on the sync source');
    const fpIndexSrc = configureFailPoint(primary, 'hangAfterStartingIndexBuild');
    
    const awaitIndex = startParallelShell(() => {
      assert.commandWorked(db.getCollection('coll').createIndex({a: 1}));
    }, primary.port);
    
    jsTestLog('Pause index build on sync source');
    fpIndexSrc.wait();
    
    const fpIndexInit = configureFailPoint(secondary, 'hangAfterStartingIndexBuild');
    
    jsTestLog('Resume the initial sync');
    failPoint.off();
 
    jsTestLog('Pause index build on initial sync node');
    fpIndexInit.wait();
    
    jsTestLog('Resume index build on sync source to finish');
    fpIndexSrc.off();
 
    awaitIndex();
 
    jsTestLog('Resume index build on initial sync node to finish');
    fpIndexInit.off();
    
    jsTestLog('Wait for both nodes to be up-to-date');
 
    rst.awaitSecondaryNodes();
    rst.awaitReplication();
 
    rst.checkReplicatedDataHashes();
    rst.checkOplogs();
    rst.stopSet();
 })();

Sprint: Execution Team 2021-10-18
Participants:
Linked BF Score: 145

 Description   

Currently, initial sync kicks off an index build on the new node if it observes that the index is unfinished on the sync source. However, the new node can get stuck if both of the following hold:

  1. the index build on the sync source then finishes and is unregistered, AND
  2. an oplog entry that precedes "commitIndexBuild" fails with "BackgroundOperationInProgressForNamespace" because an index build is still running on the collection

We already handle a "BackgroundOperationInProgressForNamespace" error during oplog replay by aborting the conflicting index build, but the emptycapped command was left out of that handling. Adding it should solve the issue.
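
The abort-and-retry behaviour described above can be sketched as a small standalone model. This is only an illustration under stated assumptions, not the server's implementation (which is C++): every name here other than the "BackgroundOperationInProgressForNamespace" error and the emptycapped command is hypothetical, and the model only captures the idea that an eligible command aborts the conflicting index build and is then retried.

```javascript
// Hypothetical, simplified model of the handling described in this ticket.
// Only the error name and the "emptycapped" command name come from the ticket.
const BACKGROUND_OP_ERROR = "BackgroundOperationInProgressForNamespace";

// Commands that abort conflicting index builds and retry during initial sync.
// Assumption: before the fix, "emptycapped" was missing from the equivalent list.
const abortAndRetryCommands = new Set(["emptycapped"]);

class Collection {
    constructor() {
        this.docs = [];
        this.inProgressIndexBuilds = new Set();
    }
}

// Models applying an "emptycapped" oplog entry: it cannot run while an index
// build is in progress on the same collection.
function applyEmptyCapped(coll) {
    if (coll.inProgressIndexBuilds.size > 0) {
        const err = new Error("index build in progress on collection");
        err.codeName = BACKGROUND_OP_ERROR;
        throw err;
    }
    coll.docs = [];
}

// Applies a command; on a background-operation conflict for an eligible
// command, aborts the in-progress index builds and retries once.
function applyCommandDuringInitialSync(coll, commandName, applyFn) {
    try {
        applyFn(coll);
        return {aborted: []};
    } catch (err) {
        const eligible = abortAndRetryCommands.has(commandName);
        if (err.codeName !== BACKGROUND_OP_ERROR || !eligible) {
            throw err;
        }
        const aborted = [...coll.inProgressIndexBuilds];
        coll.inProgressIndexBuilds.clear();
        applyFn(coll);
        return {aborted};
    }
}

// Usage: an index build on {a: 1} is running when "emptycapped" is replayed.
const coll = new Collection();
coll.docs.push({});
coll.inProgressIndexBuilds.add("a_1");
const result = applyCommandDuringInitialSync(coll, "emptycapped", applyEmptyCapped);
```

In this model, a command that is not in the eligible set rethrows the conflict error instead of aborting the build, which roughly corresponds to the gap the ticket describes for emptycapped before the fix.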



 Comments   
Comment by Githook User [ 19/Oct/21 ]

Author:

{'name': 'Yuhong Zhang', 'email': 'danielzhangyh@gmail.com', 'username': 'YuhongZhang98'}

Message: SERVER-60606 Abort index builds conflicting with emptycapped during initial sync

(cherry picked from commit a4ad66c348822a19bbad38fb8485edd884b89a1a)
Branch: v5.0
https://github.com/mongodb/mongo/commit/d7d4733e6095f6be4f8a2bdc29354e46a5771d42

Comment by Githook User [ 19/Oct/21 ]

Author:

{'name': 'Yuhong Zhang', 'email': 'danielzhangyh@gmail.com', 'username': 'YuhongZhang98'}

Message: SERVER-60606 Abort index builds conflicting with emptycapped during initial sync

(cherry picked from commit a4ad66c348822a19bbad38fb8485edd884b89a1a)
Branch: v4.4
https://github.com/mongodb/mongo/commit/d1c187be228c7dab44b00f3665a83c8d1e834e9f

Comment by Githook User [ 19/Oct/21 ]

Author:

{'name': 'Yuhong Zhang', 'email': 'danielzhangyh@gmail.com', 'username': 'YuhongZhang98'}

Message: SERVER-60606 Abort index builds conflicting with emptycapped during initial sync

(cherry picked from commit a4ad66c348822a19bbad38fb8485edd884b89a1a)
Branch: v5.1
https://github.com/mongodb/mongo/commit/6c52f240a479fe3c20064be2101f2c0bfac59941

Comment by Githook User [ 12/Oct/21 ]

Author:

{'name': 'Yuhong Zhang', 'email': 'danielzhangyh@gmail.com', 'username': 'YuhongZhang98'}

Message: SERVER-60606 Abort index builds conflicting with emptycapped during initial sync
Branch: master
https://github.com/mongodb/mongo/commit/a4ad66c348822a19bbad38fb8485edd884b89a1a
