[SERVER-48518] Rollback via refetch (EMRC = false) can make readers to see the rolled back data even after the rollback node catches up to primary. Created: 01/Jun/20  Updated: 29/Oct/23  Resolved: 01/Aug/20

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.4.1, 4.7.0, 4.2.10, 4.0.21

Type: Bug Priority: Major - P3
Reporter: Suganthi Mani Assignee: A. Jesse Jiryu Davis
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
depends on SERVER-47844 Update _setStableTimestampForStorage ... Closed
Problem/Incident
causes SERVER-50183 Copy _awaitPrimaryAppliedSurpassesRol... Closed
Related
related to SERVER-46721 Step up may cause reads at PIT with h... Closed
related to SERVER-38341 Remove Parallel Batch Writer Mutex Closed
related to SERVER-38925 Rollback via refetch can cause _id du... Closed
related to SERVER-47866 Secondary readers do not need to reac... Closed
is related to SERVER-48603 Rollback via refetch can result in ou... Backlog
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.4, v4.2, v4.0, v3.6
Steps To Reproduce:

(function() {
"use strict";
 
load("jstests/core/txns/libs/prepare_helpers.js");
load("jstests/replsets/libs/rollback_test.js");
 
const dbName = jsTest.name();
const collName = "coll";
 
const rst = new ReplSetTest(
    {name: collName, nodes: 3, useBridge: true, nodeOptions: {enableMajorityReadConcern: "false"}});
 
rst.startSet();
const config = rst.getReplSetConfig();
config.members[2].priority = 0;
config.settings = {
    chainingAllowed: false
};
rst.initiateWithHighElectionTimeout(config);
 
const rollbackTest = new RollbackTest(collName, rst);
const primary = rollbackTest.getPrimary();
const secondary = rollbackTest.getSecondary();
const tiebreaker = rst.getSecondaries()[1];
let testDB = primary.getDB(dbName);
const testColl = testDB.getCollection(collName);
 
 
jsTestLog("Do a document write");
assert.commandWorked(testColl.insert({_id: 0}, {"writeConcern": {"w": "majority"}}));
 
rollbackTest.transitionToRollbackOperations();
assert.commandWorked(testColl.insert({_id: 1}, {"writeConcern": {"w": 1}}));
 
jsTestLog(`Isolating the primary ${primary.host} from tiebreaker`);
primary.disconnect([tiebreaker]);
 
jsTestLog(`Reconnecting the secondary ${secondary.host} to the tiebreaker`);
secondary.reconnect([tiebreaker]);
 
jsTestLog(`stepup the secondary ${secondary.host} as the new primary`);
assert.soonNoExcept(() => {
            const res = secondary.adminCommand({replSetStepUp: 1});
            return res.ok;
        }, `failed to step up node ${secondary.host}`, ReplSetTest.kDefaultTimeoutMS);
const newPrimary = secondary;
checkLog.containsJson(newPrimary, 21331);	
 
assert.commandWorked(testDB["anotherCollection"].insert({_id: 0}, {"writeConcern": {"w": 1}}));	
assert.commandWorked(testColl.remove({_id: 1}));
assert.commandWorked(testColl.remove({_id: 0}));
 
jsTestLog(`Reconnecting the stale primary ${primary.host} so it'll go into rollback`);
primary.reconnect([newPrimary]);
 
const stalePrimary = primary;
checkLog.containsJson(stalePrimary, 21103);	
checkLog.containsJson(stalePrimary, 21682);	
 
jsTestLog(`Wait for the stale primary ${primary.host} to transition into secondary from rollback`);
rst.waitForState(stalePrimary, ReplSetTest.State.SECONDARY);
 
let itCount = testColl.find().itcount();
let fastCount = testColl.count();
 
// FastCount shows 1, itCount shows 3 - [ { "_id" : 0 }, { "_id" : 1 }, { "_id" : 0 } ]
// Expected result of itCount is 1 =  [ { "_id" : 0 }]
jsTestLog("Check collection count");
assert.eq(itCount, fastCount, "test collection, fastCount: " + fastCount + ", itCount: " + itCount +  ", coll data : " + tojson(testColl.find().toArray()));
 
rollbackTest.stop();
}());

Sprint: Repl 2020-06-29, Repl 2020-07-13, Repl 2020-08-10
Participants:
Linked BF Score: 96

 Description   

Below problems can be seen only for EMRC=false environment.

Problem 1: Find command on secondaries can list the rollbacked records. 
Assume, rollback node is Node A and current primary is node B. Common point is TS(10). And, Node B’s (current primary) lastAppliedOpTime  is TS(13).
1. Also, assume the following oplog entries in Node A will be rolled back,

  • Insert  {_id:1} to collection foo at TS(11) .
  • Delete {_id:1} from  collection foo at TS(15).
    Since {_id:1} is not present in node B, the rollback via refetch algorithm effectively doesn't do anything for {_id:1} document on node A.

2. Before rollback, the node A’s lastApplied opTime is TS(15). When the rollback completes, the lastApplied opTime of node A is set to TS(10) = common point.
3. Node A after transitioning to secondary from rollback state, it catches up to primary and so node A's lastApplied opTime will be forwarded to TS(13).

Now, if a user tries runs a find command on collection ‘foo’ on node A (secondary), the read source of that find command will be set to kNoOverlap (default read source for secondaries). As a result readTimestamp for that find command will be set as TS(13) = min(lastAppliedopTime, allDurable). Because of which find command would be able to list the rollbacked data record {id:1} .

Problem 2: Find command on secondaries can also list records with duplicated _id. 
Assume, rollback node is Node A and current primary is node B. Common point is TS(10). And, Node B’s (current primary) lastAppliedOpTime  is TS(13).
1. Lets's say at TS(5), the user previously inserted a document {_id:1} collection foo with RecordId=1.
2. Now, assume the below oplog entry in Node A will be rolled back,

  • Delete {_id:1} from  collection foo at TS(15).
    To rollback the delete operation, the rollback via refetch algorithm will refetch the document {_id:1} document from node B and insert the {_id:1} with RecordId=2 into node A and it will be a * non-timestamped write* (which is equivalent to TS(0)).

3. Before rollback, the node A’s lastApplied opTime is TS(15). When the rollback completes, the lastApplied opTime of node A is set to TS(10) = common point.
4. Node A after transitioning to secondary from rollback state, it catches up to primary and so node A's lastApplied opTime will be forwarded to TS(13).

Now, if a user tries runs a find command on collection ‘foo’ on node A (secondary), the read source will be  kNoOverlap and read timestamp will be TS(13) = min(lastAppliedopTime, allDurable). Because of which find command would be able to list *two {_id:1} documents, one being RecordId=1 (whose TS(5)) and other being non-timestamped write RecordId=2 written during rollback.

I have put the repro steps to demonstrate the above 2 issues. Also, to be noted, both the above issues causes cursor count (itcount) and fast count mismatch( which was the symptom seen in the build failures).

Root cause:
The main problem for EMRC= false is that  it's not safe to read on secondaries at a timestamp <  the local node's top of oplog's timestamp prior to entering rollback. 



 Comments   
Comment by Githook User [ 07/Oct/20 ]

Author:

{'name': 'A. Jesse Jiryu Davis', 'email': 'jesse@mongodb.com', 'username': 'ajdavis'}

Message: SERVER-48518 Fix rollback via refetch anomaly, try 2

Includes the following partial backports:

SERVER-47844 Add the ability to get the initialDataTimestamp from the storage engine interface

(cherry picked from commit 1408e1b8a5392a9001ee598b5cec66afc4e1cf77)
(cherry picked from commit 329d8c517d8b3c3fb4bcb63eecf6064ac9a007cf)

SERVER-48518 Fix rollback via refetch anomaly

(cherry picked from commit eee49c64cdeb8fa95704b9a316b779eb5eb9800c)
(cherry picked from commit 88c0265e057f0e5581306f294d1ca2bda19760e4)

SERVER-50183 Copy _awaitPrimaryAppliedSurpassesRollbackApplied function from RollbackTest to RollbackTestDeluxe

(cherry picked from commit 252251d38915b9e6722186b9742cc914a045d589)
(cherry picked from commit d4b960b5f3f4a7a2b18b48d7fb14251704a8bda8)
Branch: v4.0
https://github.com/mongodb/mongo/commit/ece51101e58dfaf7e455c8c96df6ade42b99515c

Comment by Githook User [ 02/Oct/20 ]

Author:

{'name': 'A. Jesse Jiryu Davis', 'email': 'jesse@mongodb.com', 'username': 'ajdavis'}

Message: Revert "SERVER-48518 Fix rollback via refetch anomaly"

This reverts commit 09c14216c0f4adae7d12c27dc034ffbf4e9b7001.
Branch: v4.0
https://github.com/mongodb/mongo/commit/5726924e5c92e019a1730afb8b0724f0cb0e3473

Comment by Githook User [ 29/Sep/20 ]

Author:

{'name': 'A. Jesse Jiryu Davis', 'email': 'jesse@mongodb.com', 'username': 'ajdavis'}

Message: SERVER-48518 Fix rollback via refetch anomaly
Branch: v4.0
https://github.com/mongodb/mongo/commit/09c14216c0f4adae7d12c27dc034ffbf4e9b7001

Comment by Githook User [ 15/Sep/20 ]

Author:

{'name': 'A. Jesse Jiryu Davis', 'email': 'jesse@mongodb.com', 'username': 'ajdavis'}

Message: Revert "SERVER-48518 Fix rollback via refetch anomaly"

This reverts commit b07f80de5850c665e75dc259def6b8999d1077dd.
Branch: v4.0
https://github.com/mongodb/mongo/commit/856c3e24c4fc0bf36998e95b4c7b72b075be8e6b

Comment by Githook User [ 14/Sep/20 ]

Author:

{'name': 'A. Jesse Jiryu Davis', 'email': 'jesse@mongodb.com', 'username': 'ajdavis'}

Message: SERVER-48518 Improvements to rollback_via_refetch_anomaly.js
Branch: master
https://github.com/mongodb/mongo/commit/0f035d926b879b4a32a3b277dc2e9096fd8f2137

Comment by Githook User [ 14/Sep/20 ]

Author:

{'name': 'A. Jesse Jiryu Davis', 'email': 'jesse@mongodb.com', 'username': 'ajdavis'}

Message: SERVER-48518 Improvements to rollback_via_refetch_anomaly.js
Branch: v4.4
https://github.com/mongodb/mongo/commit/259777fa196cc78f855683259d5a40a4515a7184

Comment by Githook User [ 14/Sep/20 ]

Author:

{'name': 'A. Jesse Jiryu Davis', 'email': 'jesse@mongodb.com', 'username': 'ajdavis'}

Message: SERVER-48518 Improvements to rollback_via_refetch_anomaly.js
Branch: v4.2
https://github.com/mongodb/mongo/commit/5a68185f00cd8fafe967c273c4eba4d7123f62f8

Comment by Githook User [ 08/Sep/20 ]

Author:

{'name': 'William Schultz', 'email': 'william.schultz@mongodb.com', 'username': 'will62794'}

Message: SERVER-48518 Fix rollback via refetch anomaly

Includes the following partial backports:

SERVER-47844 Add the ability to get the initialDataTimestamp from the storage engine interface

(cherry picked from commit 1408e1b8a5392a9001ee598b5cec66afc4e1cf77)
(cherry picked from commit 329d8c517d8b3c3fb4bcb63eecf6064ac9a007cf)

SERVER-48518 Fix rollback via refetch anomaly

(cherry picked from commit eee49c64cdeb8fa95704b9a316b779eb5eb9800c)
(cherry picked from commit 88c0265e057f0e5581306f294d1ca2bda19760e4)

SERVER-50183 Copy _awaitPrimaryAppliedSurpassesRollbackApplied function from RollbackTest to RollbackTestDeluxe

(cherry picked from commit 252251d38915b9e6722186b9742cc914a045d589)
(cherry picked from commit d4b960b5f3f4a7a2b18b48d7fb14251704a8bda8)
Branch: v4.0
https://github.com/mongodb/mongo/commit/b07f80de5850c665e75dc259def6b8999d1077dd

Comment by Githook User [ 25/Aug/20 ]

Author:

{'name': 'A. Jesse Jiryu Davis', 'email': 'jesse@mongodb.com', 'username': 'ajdavis'}

Message: SERVER-48518 Fix rollback via refetch anomaly

(cherry picked from commit eee49c64cdeb8fa95704b9a316b779eb5eb9800c)
(cherry picked from commit 88c0265e057f0e5581306f294d1ca2bda19760e4)

SERVER-50183 Copy _awaitPrimaryAppliedSurpassesRollbackApplied function from RollbackTest to RollbackTestDeluxe

(cherry picked from commit 252251d38915b9e6722186b9742cc914a045d589)
(cherry picked from commit d4b960b5f3f4a7a2b18b48d7fb14251704a8bda8)
Branch: v4.2
https://github.com/mongodb/mongo/commit/e41d7e8da4527f2ee1f227593d874140055e68ad

Comment by Githook User [ 20/Aug/20 ]

Author:

{'name': 'A. Jesse Jiryu Davis', 'email': 'jesse@mongodb.com', 'username': 'ajdavis'}

Message: SERVER-48518 Fix rollback via refetch anomaly

(cherry picked from commit eee49c64cdeb8fa95704b9a316b779eb5eb9800c)
Branch: v4.4
https://github.com/mongodb/mongo/commit/88c0265e057f0e5581306f294d1ca2bda19760e4

Comment by Githook User [ 01/Aug/20 ]

Author:

{'name': 'A. Jesse Jiryu Davis', 'email': 'jesse@mongodb.com', 'username': 'ajdavis'}

Message: SERVER-48518 Fix rollback via refetch anomaly
Branch: master
https://github.com/mongodb/mongo/commit/eee49c64cdeb8fa95704b9a316b779eb5eb9800c

Comment by A. Jesse Jiryu Davis [ 22/Jul/20 ]

I reverted my first attempt, 74ab0cdac56, because it caused BFs on the ephemeralForTest storage engine.

Comment by Suganthi Mani [ 15/Jul/20 ]

jesse, just noticed that this commit added the file "jstests/replsets/rollback_via_refetch_anomaly.js" under "replica_sets_jscore_multiversion_passthrough" section in etc/backports_required_for_multiversion_tests.yml. My understanding is that "replica_sets_jscore_multiversion_passthrough" section will hold only tests under jstests/core directory. I think we should include the jstests/replsets/ directory files under "replica_sets_multiversion" section.

Comment by Githook User [ 13/Jul/20 ]

Author:

{'name': 'A. Jesse Jiryu Davis', 'email': 'jesse@mongodb.com', 'username': 'ajdavis'}

Message: Revert "SERVER-48518 Fix rollback via refetch anomaly"

This reverts commit 74ab0cdac56e1cc62fc96d3ca3be3ddfa54f2bcb.
Branch: master
https://github.com/mongodb/mongo/commit/692b37ddb002a4f1ea1e07a145e1f3ff1e6c9e6e

Comment by Githook User [ 11/Jul/20 ]

Author:

{'name': 'A. Jesse Jiryu Davis', 'email': 'jesse@mongodb.com', 'username': 'ajdavis'}

Message: SERVER-48518 Fix rollback via refetch anomaly
Branch: master
https://github.com/mongodb/mongo/commit/74ab0cdac56e1cc62fc96d3ca3be3ddfa54f2bcb

Comment by Judah Schvimer [ 10/Jul/20 ]

I think for now we want to do both, with the expectation that SERVER-48048 will remove one of them.

Comment by A. Jesse Jiryu Davis [ 10/Jul/20 ]

Thanks. The new test I'm writing is in replsets/, should I exclude it in backports_required_for_multiversion_tests.yml or in replica_sets_multiversion.yml? I'm finding the instructions hard to follow.

Comment by A. Jesse Jiryu Davis [ 10/Jul/20 ]

The test I'm developing on master is tagged multiversion_incompatible. Before closing this ticket, we must backport to some/all supported versions and adjust/remove the tag.

Comment by Suganthi Mani [ 09/Jun/20 ]

Spoke with daniel.gottlieb to see if we can have some solution (similar to SERVER-48603's proposed solution) like this "When delete document oplog entries are rolled back, add 0-timestamped tombstone to the update chain for that doc irrespective of whether the sync source have that document or not". But then, it seems the fix is complex and WT doesn't help in anyway.

So, as per our agreement, we would be implementing solution#3 "node will not transition from recovery to secondary until the node’s last applied opTime >= max (minValid, initialDataTimestamp).". Basically, this check in ReplicationCoordinatorImpl::finishRecoveryIfEligible() should be modified to take "initialDataTimestamp" also into account.

Note: We also decided not to update the minValid to set as max(currentMinValid, initialDataTimestamp) because that would make node's recovery to take a longer time even on node restart. To be noted, minvalid documents are persisted (checkpointed).

Comment by Judah Schvimer [ 02/Jun/20 ]

But the problem with this approach is that when read source is set as kUnset ( which is equivalent to kNoTimestamp) is that we would need PBWM lock. This would contradict the work SERVER-38341 SERVER-47866.

I think it would be ok to block SERVER-38341 on removing eMRC=F and as a result removing rollback via refetch.

I personally vote Solution #3, it is easy to reason about and should be straightforward to implement. I think we can just set minValid to the max(currentMinValid, initialDataTimestamp).

Comment by Daniel Gottlieb (Inactive) [ 02/Jun/20 ]

suganthi.mani do you believe this problem dates back to 3.6 (when secondaries could start reading at last applied)?

Comment by Suganthi Mani [ 02/Jun/20 ]

Solution # 1

We don’t take stable checkpoints until stable timestamp reaches initialDataTimestamp (= max(localTopOfOplog, syncSourceTopOfOplog), set during rollback via refetch) to avoid data corruption/duplicate data.  I think we should have some kind of similar constraint while choosing readTimestamp on secondaries which has just transitioned from rollback, something like below

readTimestamp = min(lastAppliedopTime, allDurable);
if (EMRC == false) {
   if (readTimestamp < initialDataTimestamp) {
      // Change the read source from koverlap to unset.
   }
}

Alternatively, we can also play by  setting the collectionMinSnapshot to lastAppliedopTime of the rollback node that was prior to entering rollback, for all collections that were touched during rolllback via refetch. So that we hit this check for a find command and it forces to change our read source from overlap to unset.
 
But the problem with this approach is that when read source is set as kUnset ( which is equivalent to kNoTimestamp) is that we would need PBWM lock. This would contradict the work SERVER-38341 SERVER-47866.

Solution # 2
If readTimestamp < initialDataTimestamp, then we can make the readers to read at common point (that's currently not been recorded, we would need to record it if we go with this fix) which would be the stale data. Not sure, it's ok to read the stale data???

Solution # 3
We should not allow reads (or delay transitioning to secondaries from rollback) until the node's lastAppliedTimestamp < initialDataTimestamp.

Solution # 4
secondaries before running find command, they should run makeNoopWriteIfNeeded() (which is currently used only for shared environment) to make sure they have reached a target read timestamp >= initialDataTimestamp (proposed by louis.williams).

Note : My understanding is that we support at/afterClusterTime reads for EMRC false. So, we would also face problems for at/afterClusterTime reads.
CC tess.avitabile 

Generated at Thu Feb 08 05:17:21 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.