[SERVER-31606] Primary removes drop-pending collection on step up before drop is replicated to a majority Created: 17/Oct/17  Updated: 30/Oct/23  Resolved: 23/Oct/17

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 3.6.0-rc1

Type: Bug Priority: Major - P3
Reporter: Benety Goh Assignee: William Schultz (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File drop_collections_two_phase_step_down.js    
Issue Links:
Related
is related to SERVER-30638 Change setReadFromMajorityCommittedSn... Closed
is related to SERVER-29127 Improve behavior of tailable awaitDat... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Repl 2017-10-23, Repl 2017-11-13
Participants:

 Description   

See the attached repro script drop_collections_two_phase_step_down.js.

In the repro script, we set up a 2-node replica set and pause oplog application on the secondary to keep the replica set commit point from advancing. This ensures that drop-pending collections are not cleaned up.

While oplog application is paused on the secondary, we step down the primary with {force: true} and wait for it to regain PRIMARY status. On stepping back up, the node erroneously notifies the drop-pending collection reaper that the commit point has advanced past the drop-pending collection's drop optime.

This results in the drop-pending collection being dropped on the primary even though the secondary has not transitioned the original collection to a drop-pending state.
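For reference, the following is a minimal sketch of the repro flow in the style of the attached test. It is not the attached script itself; the 'rsSyncApplyStop' failpoint name and the visibility of drop-pending namespaces (of the form system.drop.<dropOpTime>.<collName>) via getCollectionNames() are assumptions about this release.

(function() {
    "use strict";

    const replTest = new ReplSetTest({nodes: 2});
    replTest.startSet();
    replTest.initiate();

    const primary = replTest.getPrimary();
    const secondary = replTest.getSecondary();
    const testDB = primary.getDB("test");

    assert.commandWorked(testDB.createCollection("toDrop"));
    replTest.awaitReplication();

    // Pause oplog application on the secondary so the commit point cannot
    // advance past the upcoming drop.
    assert.commandWorked(secondary.adminCommand(
        {configureFailPoint: "rsSyncApplyStop", mode: "alwaysOn"}));

    // With two-phase drops, the drop renames the collection to a
    // drop-pending namespace until the drop optime is majority committed.
    assert(testDB.getCollection("toDrop").drop());
    assert(testDB.getCollectionNames().some((name) => name.startsWith("system.drop.")),
           "expected a drop-pending collection on the primary");

    // Step down the primary; the command may close the connection, so
    // tolerate a network error, then wait for the node to step back up.
    try {
        primary.adminCommand({replSetStepDown: 5, force: true});
    } catch (e) {
        print("replSetStepDown threw (expected if the connection was closed): " + e);
    }
    replTest.waitForState(primary, ReplSetTest.State.PRIMARY);

    // Correct behavior: the drop-pending collection must still exist because
    // the drop has not been replicated to a majority. On a buggy build this
    // assertion fails because the collection was reaped on step-up.
    assert(testDB.getCollectionNames().some((name) => name.startsWith("system.drop.")),
           "drop-pending collection was removed before the drop was majority committed");

    assert.commandWorked(secondary.adminCommand(
        {configureFailPoint: "rsSyncApplyStop", mode: "off"}));
    replTest.stopSet();
}());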



 Comments   
Comment by Githook User [ 23/Oct/17 ]

Author:

{'email': 'william.schultz@mongodb.com', 'name': 'William Schultz', 'username': 'will62794'}

Message: SERVER-31606 Make sure primary doesn't drop collection before it is committed
Branch: master
https://github.com/mongodb/mongo/commit/1bea66bcf55e597c0dd4511cd8befcebd7e507cb

Comment by Spencer Brody (Inactive) [ 17/Oct/17 ]

I believe this will not affect the correctness of majority read or write concern: the snapshot that the storage engine uses to serve majority reads is based only on a timestamp, not an OpTime, and the value that the primary transmits through the spanning tree to tell secondaries to mark operations as committed is the lastCommittedOpTime, which is still maintained correctly. So I think the only actual impact of this is dropping drop-pending collections before they should actually be dropped, which could result in an unrecoverable rollback that requires a full resync.

Comment by Spencer Brody (Inactive) [ 17/Oct/17 ]

Okay, I believe I have identified the problem.
I added a log statement to print the output of replSetGetStatus right after the assert in Benety's repro fails. The "optimes" field of the replSetGetStatus output from the primary was:

[js_test:drop_collections_two_phase_step_down] 2017-10-17T16:56:49.906-0400     "optimes" : {
[js_test:drop_collections_two_phase_step_down] 2017-10-17T16:56:49.906-0400             "lastCommittedOpTime" : {
[js_test:drop_collections_two_phase_step_down] 2017-10-17T16:56:49.906-0400                     "ts" : Timestamp(1508273791, 1),
[js_test:drop_collections_two_phase_step_down] 2017-10-17T16:56:49.907-0400                     "t" : NumberLong(1)
[js_test:drop_collections_two_phase_step_down] 2017-10-17T16:56:49.907-0400             },
[js_test:drop_collections_two_phase_step_down] 2017-10-17T16:56:49.907-0400             "readConcernMajorityOpTime" : {
[js_test:drop_collections_two_phase_step_down] 2017-10-17T16:56:49.907-0400                     "ts" : Timestamp(1508273791, 1),
[js_test:drop_collections_two_phase_step_down] 2017-10-17T16:56:49.907-0400                     "t" : NumberLong(2)
[js_test:drop_collections_two_phase_step_down] 2017-10-17T16:56:49.907-0400             },
[js_test:drop_collections_two_phase_step_down] 2017-10-17T16:56:49.907-0400             "appliedOpTime" : {
[js_test:drop_collections_two_phase_step_down] 2017-10-17T16:56:49.907-0400                     "ts" : Timestamp(1508273809, 1),
[js_test:drop_collections_two_phase_step_down] 2017-10-17T16:56:49.907-0400                     "t" : NumberLong(2)
[js_test:drop_collections_two_phase_step_down] 2017-10-17T16:56:49.907-0400             },
[js_test:drop_collections_two_phase_step_down] 2017-10-17T16:56:49.908-0400             "durableOpTime" : {
[js_test:drop_collections_two_phase_step_down] 2017-10-17T16:56:49.908-0400                     "ts" : Timestamp(1508273809, 1),
[js_test:drop_collections_two_phase_step_down] 2017-10-17T16:56:49.908-0400                     "t" : NumberLong(2)
[js_test:drop_collections_two_phase_step_down] 2017-10-17T16:56:49.908-0400             }
[js_test:drop_collections_two_phase_step_down] 2017-10-17T16:56:49.908-0400     },

Notice that 'readConcernMajorityOpTime' has the timestamp of 'lastCommittedOpTime' but the term of 'appliedOpTime': it is set to an OpTime that doesn't correspond to any actual oplog entry.
I believe the culprit is this line.
We take the stable timestamp, attach the current term, and pass the result down as the committed snapshot time.
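A quick way to confirm that from the shell is to look the reported optime up in the oplog (a hedged sketch, not part of the original repro; 'primary' is assumed to be a connection to the primary node):

// In the buggy state, no oplog entry has both this timestamp and this term.
const status = assert.commandWorked(primary.adminCommand({replSetGetStatus: 1}));
const majorityOpTime = status.optimes.readConcernMajorityOpTime;
const entry = primary.getDB("local").getCollection("oplog.rs")
                  .findOne({ts: majorityOpTime.ts, t: majorityOpTime.t});
assert.neq(null, entry,
           "readConcernMajorityOpTime " + tojson(majorityOpTime) +
               " does not correspond to any oplog entry");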

I believe the proper fix would be to replace the '_stableTimestampCandidates' list with a '_stableOpTimeCandidates' list that includes the terms for each candidate optime. Then only at the last minute (in what is currently called StorageInterfaceImpl::setStableTimestamp), we strip off the term before passing into the storage engine code for maintaining the stable timestamp, since the storage engine doesn't care about terms.

Comment by Benety Goh [ 17/Oct/17 ]

Stack trace at a breakpoint in replication_coordinator_external_state_impl.cpp:824, hit when the replication coordinator advances the commit point prematurely:

(lldb) breakpoint set --file replication_coordinator_external_state_impl.cpp --line 824
Breakpoint 1: where = mongod`mongo::repl::ReplicationCoordinatorExternalStateImpl::notifyOplogMetadataWaiters(mongo::repl::OpTime const&) + 206 at replication_coordinator_external_state_impl.cpp:824, address = 0x000000010dd9430e
 
* thread #48, name = 'rsSync', stop reason = breakpoint 1.1
    frame #0: 0x000000010dd9430e mongod`mongo::repl::ReplicationCoordinatorExternalStateImpl::notifyOplogMetadataWaiters(this=0x00007fb2145a4cc0, committedOpTime=0x000070000abd03c0) at replication_coordinator_external_state_impl.cpp:824
   821 	    // optimes before or at the committed optime.
   822 	    if (auto earliestDropOpTime = _dropPendingCollectionReaper->getEarliestDropOpTime()) {
   823 	        if (committedOpTime >= *earliestDropOpTime) {
-> 824 	            auto reaper = _dropPendingCollectionReaper;
   825 	            scheduleWork(
   826 	                _taskExecutor.get(),
   827 	                [committedOpTime, reaper](const executor::TaskExecutor::CallbackArgs& args) {
(lldb) thread backtrace
* thread #48, name = 'rsSync', stop reason = breakpoint 1.1
  * frame #0: 0x000000010dd9430e mongod`mongo::repl::ReplicationCoordinatorExternalStateImpl::notifyOplogMetadataWaiters(this=0x00007fb2145a4cc0, committedOpTime=0x000070000abd03c0) at replication_coordinator_external_state_impl.cpp:824
    frame #1: 0x000000010dd94218 mongod`mongo::repl::ReplicationCoordinatorExternalStateImpl::updateCommittedSnapshot(this=0x00007fb2145a4cc0, newCommitPoint=SnapshotInfo @ 0x000070000abd03c0) at replication_coordinator_external_state_impl.cpp:809
    frame #2: 0x000000010e4de87f mongod`mongo::repl::ReplicationCoordinatorImpl::_updateCommittedSnapshot_inlock(this=0x00007fb215039e00, newCommittedSnapshot=SnapshotInfo @ 0x000070000abd0510) at replication_coordinator_impl.cpp:3548
    frame #3: 0x000000010e4c6199 mongod`mongo::repl::ReplicationCoordinatorImpl::_setStableTimestampForStorage_inlock(this=0x00007fb215039e00) at replication_coordinator_impl.cpp:3218
    frame #4: 0x000000010e4bdafa mongod`mongo::repl::ReplicationCoordinatorImpl::_setMyLastAppliedOpTime_inlock(this=0x00007fb215039e00, opTime=0x00007fb214718908, isRollbackAllowed=false) at replication_coordinator_impl.cpp:1117
    frame #5: 0x000000010e4c50b9 mongod`mongo::repl::ReplicationCoordinatorImpl::setMyLastAppliedOpTimeForward(this=0x00007fb215039e00, opTime=0x00007fb214718908) at replication_coordinator_impl.cpp:1045
    frame #6: 0x000000010e02ecb1 mongod`mongo::repl::_logOpsInner(this=0x00007fb2147188f8)::$_18::operator()() const at oplog.cpp:399
    frame #7: 0x000000010e02ec59 mongod`void mongo::RecoveryUnit::onCommit<mongo::repl::_logOpsInner(mongo::OperationContext*, mongo::NamespaceString const&, mongo::DocWriter const* const*, mongo::Timestamp*, unsigned long, mongo::Collection*, mongo::repl::OpTime)::$_18>(this=0x00007fb2147188f0)::$_18)::OnCommitChange::commit() at recovery_unit.h:219
    frame #8: 0x000000010d77778f mongod`mongo::WiredTigerRecoveryUnit::_commit(this=0x00007fb2145ccff0) at wiredtiger_recovery_unit.cpp:82
    frame #9: 0x000000010d778279 mongod`mongo::WiredTigerRecoveryUnit::commitUnitOfWork(this=0x00007fb2145ccff0) at wiredtiger_recovery_unit.cpp:122
    frame #10: 0x000000010d6b1cd8 mongod`mongo::WriteUnitOfWork::commit(this=0x000070000abd0ed8) at operation_context.h:541
    frame #11: 0x000000010dd98550 mongod`mongo::repl::ReplicationCoordinatorExternalStateImpl::onTransitionToPrimary(this=0x000070000abd1040)::$_4::operator()() const at replication_coordinator_external_state_impl.cpp:450
    frame #12: 0x000000010dd90770 mongod`_ZN5mongo18writeConflictRetryIZNS_4repl39ReplicationCoordinatorExternalStateImpl21onTransitionToPrimaryEPNS_16OperationContextEbE3$_4EEDaS4_NS_10StringDataES7_OT_(opCtx=0x00007fb2145ccee0, opStr=(_data = "logging transition to primary to oplog", _size = 38), ns=(_data = "local.oplog.rs", _size = 14), f=0x000070000abd1040) at write_conflict_exception.h:88
    frame #13: 0x000000010dd90521 mongod`mongo::repl::ReplicationCoordinatorExternalStateImpl::onTransitionToPrimary(this=0x00007fb2145a4cc0, opCtx=0x00007fb2145ccee0, isV1ElectionProtocol=true) at replication_coordinator_external_state_impl.cpp:444
    frame #14: 0x000000010e4c3f6b mongod`mongo::repl::ReplicationCoordinatorImpl::signalDrainComplete(this=0x00007fb215039e00, opCtx=0x00007fb2145ccee0, termWhenBufferIsEmpty=3) at replication_coordinator_impl.cpp:979
    frame #15: 0x000000010de57d70 mongod`mongo::repl::SyncTail::oplogApplication(this=0x000070000abd1c40, replCoord=0x00007fb215039e00) at sync_tail.cpp:850
    frame #16: 0x000000010ddaa390 mongod`mongo::repl::RSDataSync::_run(this=0x00007fb21460eb20) at rs_sync.cpp:82
    frame #17: 0x000000010ddaae4d mongod`void* std::__1::__thread_proxy<std::__1::tuple<void (mongo::repl::RSDataSync::*)(), mongo::repl::RSDataSync*> >(void*) [inlined] decltype(__f=0x00007fb21460e970, __a0=0x00007fb21460e980)).*fp(std::__1::forward<>(fp1))) std::__1::__invoke<void (mongo::repl::RSDataSync::*)(), mongo::repl::RSDataSync*, void>(void (mongo::repl::RSDataSync::*&&)(), mongo::repl::RSDataSync*&&) at __functional_base:383
    frame #18: 0x000000010ddaadce mongod`void* std::__1::__thread_proxy<std::__1::tuple<void (mongo::repl::RSDataSync::*)(), mongo::repl::RSDataSync*> >(void*) [inlined] void std::__1::__thread_execute<void (mongo::repl::RSDataSync::*)(), mongo::repl::RSDataSync*, 1ul>(__t=0x00007fb21460e970)(), mongo::repl::RSDataSync*>&, std::__1::__tuple_indices<1ul>) at thread:347
    frame #19: 0x000000010ddaada6 mongod`void* std::__1::__thread_proxy<std::__1::tuple<void (mongo::repl::RSDataSync::*)(), mongo::repl::RSDataSync*> >(__vp=0x00007fb21460e970) at thread:357
    frame #20: 0x00007fffe06ac93b libsystem_pthread.dylib`_pthread_body + 180
    frame #21: 0x00007fffe06ac887 libsystem_pthread.dylib`_pthread_start + 286
    frame #22: 0x00007fffe06ac08d libsystem_pthread.dylib`thread_start + 13
