[SERVER-48518] Rollback via refetch (EMRC = false) can make readers to see the rolled back data even after the rollback node catches up to primary. Created: 01/Jun/20 Updated: 29/Oct/23 Resolved: 01/Aug/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | 4.4.1, 4.7.0, 4.2.10, 4.0.21 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Suganthi Mani | Assignee: | A. Jesse Jiryu Davis |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Operating System: | ALL | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Backport Requested: |
v4.4, v4.2, v4.0, v3.6
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Steps To Reproduce: |
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Sprint: | Repl 2020-06-29, Repl 2020-07-13, Repl 2020-08-10 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Participants: | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Linked BF Score: | 96 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Description |
|
Below problems can be seen only for EMRC=false environment. Problem 1: Find command on secondaries can list the rollbacked records.
2. Before rollback, the node A’s lastApplied opTime is TS(15). When the rollback completes, the lastApplied opTime of node A is set to TS(10) = common point. Now, if a user tries runs a find command on collection ‘foo’ on node A (secondary), the read source of that find command will be set to kNoOverlap (default read source for secondaries). As a result readTimestamp for that find command will be set as TS(13) = min(lastAppliedopTime, allDurable). Because of which find command would be able to list the rollbacked data record {id:1} . Problem 2: Find command on secondaries can also list records with duplicated _id.
3. Before rollback, the node A’s lastApplied opTime is TS(15). When the rollback completes, the lastApplied opTime of node A is set to TS(10) = common point. Now, if a user tries runs a find command on collection ‘foo’ on node A (secondary), the read source will be kNoOverlap and read timestamp will be TS(13) = min(lastAppliedopTime, allDurable). Because of which find command would be able to list *two {_id:1} documents, one being RecordId=1 (whose TS(5)) and other being non-timestamped write RecordId=2 written during rollback. I have put the repro steps to demonstrate the above 2 issues. Also, to be noted, both the above issues causes cursor count (itcount) and fast count mismatch( which was the symptom seen in the build failures). Root cause: |
| Comments |
| Comment by Githook User [ 07/Oct/20 ] | ||||||
|
Author: {'name': 'A. Jesse Jiryu Davis', 'email': 'jesse@mongodb.com', 'username': 'ajdavis'}Message: Includes the following partial backports:
(cherry picked from commit 1408e1b8a5392a9001ee598b5cec66afc4e1cf77)
(cherry picked from commit eee49c64cdeb8fa95704b9a316b779eb5eb9800c)
(cherry picked from commit 252251d38915b9e6722186b9742cc914a045d589) | ||||||
| Comment by Githook User [ 02/Oct/20 ] | ||||||
|
Author: {'name': 'A. Jesse Jiryu Davis', 'email': 'jesse@mongodb.com', 'username': 'ajdavis'}Message: Revert " This reverts commit 09c14216c0f4adae7d12c27dc034ffbf4e9b7001. | ||||||
| Comment by Githook User [ 29/Sep/20 ] | ||||||
|
Author: {'name': 'A. Jesse Jiryu Davis', 'email': 'jesse@mongodb.com', 'username': 'ajdavis'}Message: | ||||||
| Comment by Githook User [ 15/Sep/20 ] | ||||||
|
Author: {'name': 'A. Jesse Jiryu Davis', 'email': 'jesse@mongodb.com', 'username': 'ajdavis'}Message: Revert " This reverts commit b07f80de5850c665e75dc259def6b8999d1077dd. | ||||||
| Comment by Githook User [ 14/Sep/20 ] | ||||||
|
Author: {'name': 'A. Jesse Jiryu Davis', 'email': 'jesse@mongodb.com', 'username': 'ajdavis'}Message: | ||||||
| Comment by Githook User [ 14/Sep/20 ] | ||||||
|
Author: {'name': 'A. Jesse Jiryu Davis', 'email': 'jesse@mongodb.com', 'username': 'ajdavis'}Message: | ||||||
| Comment by Githook User [ 14/Sep/20 ] | ||||||
|
Author: {'name': 'A. Jesse Jiryu Davis', 'email': 'jesse@mongodb.com', 'username': 'ajdavis'}Message: | ||||||
| Comment by Githook User [ 08/Sep/20 ] | ||||||
|
Author: {'name': 'William Schultz', 'email': 'william.schultz@mongodb.com', 'username': 'will62794'}Message: Includes the following partial backports:
(cherry picked from commit 1408e1b8a5392a9001ee598b5cec66afc4e1cf77)
(cherry picked from commit eee49c64cdeb8fa95704b9a316b779eb5eb9800c)
(cherry picked from commit 252251d38915b9e6722186b9742cc914a045d589) | ||||||
| Comment by Githook User [ 25/Aug/20 ] | ||||||
|
Author: {'name': 'A. Jesse Jiryu Davis', 'email': 'jesse@mongodb.com', 'username': 'ajdavis'}Message: (cherry picked from commit eee49c64cdeb8fa95704b9a316b779eb5eb9800c)
(cherry picked from commit 252251d38915b9e6722186b9742cc914a045d589) | ||||||
| Comment by Githook User [ 20/Aug/20 ] | ||||||
|
Author: {'name': 'A. Jesse Jiryu Davis', 'email': 'jesse@mongodb.com', 'username': 'ajdavis'}Message: (cherry picked from commit eee49c64cdeb8fa95704b9a316b779eb5eb9800c) | ||||||
| Comment by Githook User [ 01/Aug/20 ] | ||||||
|
Author: {'name': 'A. Jesse Jiryu Davis', 'email': 'jesse@mongodb.com', 'username': 'ajdavis'}Message: | ||||||
| Comment by A. Jesse Jiryu Davis [ 22/Jul/20 ] | ||||||
|
I reverted my first attempt, 74ab0cdac56, because it caused BFs on the ephemeralForTest storage engine. | ||||||
| Comment by Suganthi Mani [ 15/Jul/20 ] | ||||||
|
jesse, just noticed that this commit added the file "jstests/replsets/rollback_via_refetch_anomaly.js" under "replica_sets_jscore_multiversion_passthrough" section in etc/backports_required_for_multiversion_tests.yml. My understanding is that "replica_sets_jscore_multiversion_passthrough" section will hold only tests under jstests/core directory. I think we should include the jstests/replsets/ directory files under "replica_sets_multiversion" section. | ||||||
| Comment by Githook User [ 13/Jul/20 ] | ||||||
|
Author: {'name': 'A. Jesse Jiryu Davis', 'email': 'jesse@mongodb.com', 'username': 'ajdavis'}Message: Revert " This reverts commit 74ab0cdac56e1cc62fc96d3ca3be3ddfa54f2bcb. | ||||||
| Comment by Githook User [ 11/Jul/20 ] | ||||||
|
Author: {'name': 'A. Jesse Jiryu Davis', 'email': 'jesse@mongodb.com', 'username': 'ajdavis'}Message: | ||||||
| Comment by Judah Schvimer [ 10/Jul/20 ] | ||||||
|
I think for now we want to do both, with the expectation that | ||||||
| Comment by A. Jesse Jiryu Davis [ 10/Jul/20 ] | ||||||
|
Thanks. The new test I'm writing is in replsets/, should I exclude it in backports_required_for_multiversion_tests.yml or in replica_sets_multiversion.yml? I'm finding the instructions hard to follow. | ||||||
| Comment by A. Jesse Jiryu Davis [ 10/Jul/20 ] | ||||||
|
The test I'm developing on master is tagged multiversion_incompatible. Before closing this ticket, we must backport to some/all supported versions and adjust/remove the tag. | ||||||
| Comment by Suganthi Mani [ 09/Jun/20 ] | ||||||
|
Spoke with daniel.gottlieb to see if we can have some solution (similar to SERVER-48603's proposed solution) like this "When delete document oplog entries are rolled back, add 0-timestamped tombstone to the update chain for that doc irrespective of whether the sync source have that document or not". But then, it seems the fix is complex and WT doesn't help in anyway. So, as per our agreement, we would be implementing solution#3 "node will not transition from recovery to secondary until the node’s last applied opTime >= max (minValid, initialDataTimestamp).". Basically, this check in ReplicationCoordinatorImpl::finishRecoveryIfEligible() should be modified to take "initialDataTimestamp" also into account. Note: We also decided not to update the minValid to set as max(currentMinValid, initialDataTimestamp) because that would make node's recovery to take a longer time even on node restart. To be noted, minvalid documents are persisted (checkpointed). | ||||||
| Comment by Judah Schvimer [ 02/Jun/20 ] | ||||||
I think it would be ok to block I personally vote Solution #3, it is easy to reason about and should be straightforward to implement. I think we can just set minValid to the max(currentMinValid, initialDataTimestamp). | ||||||
| Comment by Daniel Gottlieb (Inactive) [ 02/Jun/20 ] | ||||||
|
suganthi.mani do you believe this problem dates back to 3.6 (when secondaries could start reading at last applied)? | ||||||
| Comment by Suganthi Mani [ 02/Jun/20 ] | ||||||
|
Solution # 1 We don’t take stable checkpoints until stable timestamp reaches initialDataTimestamp (= max(localTopOfOplog, syncSourceTopOfOplog), set during rollback via refetch) to avoid data corruption/duplicate data. I think we should have some kind of similar constraint while choosing readTimestamp on secondaries which has just transitioned from rollback, something like below
Alternatively, we can also play by setting the collectionMinSnapshot to lastAppliedopTime of the rollback node that was prior to entering rollback, for all collections that were touched during rolllback via refetch. So that we hit this check for a find command and it forces to change our read source from overlap to unset. Solution # 2 Solution # 3 Solution # 4 Note : My understanding is that we support at/afterClusterTime reads for EMRC false. So, we would also face problems for at/afterClusterTime reads. |