[SERVER-31606] Primary removes drop-pending collection on step up before drop is replicated to a majority Created: 17/Oct/17 Updated: 30/Oct/23 Resolved: 23/Oct/17
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | 3.6.0-rc1 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Benety Goh | Assignee: | William Schultz (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | drop_collections_two_phase_step_down.js |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Sprint: | Repl 2017-10-23, Repl 2017-11-13 |
| Participants: | |
| Description |
See the attached repro script drop_collections_two_phase_step_down.js. In the repro script, we have a 2-node replica set in which oplog application has been paused on the secondary to keep the replica set commit point from advancing; this ensures that drop-pending collections are not cleaned up. While oplog application is paused on the secondary, we step down the primary using {force: true} and wait for it to regain PRIMARY status. On regaining PRIMARY status, the node erroneously notifies the drop-pending collection reaper that the commit point has advanced past the drop-pending collection's drop optime. This results in the drop-pending collection being dropped on the primary even though the secondary has not yet transitioned the original collection to a drop-pending state. A sketch of the scenario is shown below.
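The following is a minimal jstest-style sketch of that scenario, not the attached script itself; the use of the 'rsSyncApplyStop' failpoint to pause oplog application, the stepdown timeout, and the collection name are assumptions.

```js
// Minimal sketch of the repro scenario (not the attached script).
// Assumptions: the 'rsSyncApplyStop' failpoint pauses oplog application on the
// secondary; the collection name and timeouts are arbitrary.
(function() {
    "use strict";

    const rst = new ReplSetTest({nodes: 2});
    rst.startSet();
    rst.initiate();

    const primary = rst.getPrimary();
    const secondary = rst.getSecondary();
    const testDB = primary.getDB("test");

    assert.commandWorked(testDB.createCollection("toDrop"));
    rst.awaitReplication();

    // Pause oplog application on the secondary so the commit point cannot advance.
    assert.commandWorked(secondary.adminCommand(
        {configureFailPoint: "rsSyncApplyStop", mode: "alwaysOn"}));

    // Two-phase drop: the collection stays drop-pending until the drop is
    // majority committed, which cannot happen while the secondary is paused.
    assert(testDB.toDrop.drop());

    // Force the primary to step down, then wait for it to regain PRIMARY status.
    assert.throws(function() {
        primary.adminCommand({replSetStepDown: 5, force: true});
    });
    rst.waitForState(primary, ReplSetTest.State.PRIMARY);

    // Bug: on step up, the node reaps the drop-pending collection even though
    // the drop has not been replicated to a majority.

    assert.commandWorked(secondary.adminCommand(
        {configureFailPoint: "rsSyncApplyStop", mode: "off"}));
    rst.stopSet();
})();
```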
| Comments |
| Comment by Githook User [ 23/Oct/17 ] |

Author: {'email': 'william.schultz@mongodb.com', 'name': 'William Schultz', 'username': 'will62794'}
Message:
| Comment by Spencer Brody (Inactive) [ 17/Oct/17 ] |

I believe this will not affect the correctness of majority read or write concern: the snapshot that the storage engine uses to serve majority reads is based only on a timestamp, not an OpTime, and the value that the primary transmits through the spanning tree to tell secondaries to mark operations as committed is the lastCommittedOpTime, which is still maintained correctly. So I think the only actual impact of this is dropping drop-pending collections before they should actually be dropped, which could result in an unrecoverable rollback that requires a full resync.
| Comment by Spencer Brody (Inactive) [ 17/Oct/17 ] |

Okay, I believe I have identified the problem.

Notice that the 'readConcernMajorityOpTime' has the timestamp of the 'lastCommittedOpTime' but the term of the 'appliedOpTime'; it is set to an OpTime that doesn't correspond to any actual oplog entry. I believe the proper fix would be to replace the '_stableTimestampCandidates' list with a '_stableOpTimeCandidates' list that includes the term for each candidate optime. Then, only at the last minute (in what is currently called StorageInterfaceImpl::setStableTimestamp), we strip off the term before passing the value into the storage engine code that maintains the stable timestamp, since the storage engine doesn't care about terms.
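For context, the mismatch described above can be observed from a mongo shell connected to the primary by comparing the optimes reported by replSetGetStatus; this is only an illustrative check, with field names as reported by 3.6-era servers.

```js
// Illustrative check, assuming a shell connection to the primary.
// In the buggy state, readConcernMajorityOpTime carries the timestamp of
// lastCommittedOpTime but the term of appliedOpTime, so it can name an
// optime that no oplog entry actually has.
const optimes = db.adminCommand({replSetGetStatus: 1}).optimes;
printjson({
    readConcernMajorityOpTime: optimes.readConcernMajorityOpTime,
    lastCommittedOpTime: optimes.lastCommittedOpTime,
    appliedOpTime: optimes.appliedOpTime
});
```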
| Comment by Benety Goh [ 17/Oct/17 ] |

Stack trace at the breakpoint in replication_coordinator_external_state_impl.cpp:824 when the replication coordinator advances the commit point prematurely: