[SERVER-46218] Race between removal and shutdown in arbiter Created: 18/Feb/20  Updated: 29/Oct/23  Resolved: 20/Feb/20

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.2.4, 4.3.4

Type: Bug Priority: Major - P3
Reporter: A. Jesse Jiryu Davis Assignee: A. Jesse Jiryu Davis
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.2
Sprint: Repl 2020-02-24
Participants:

 Description   

If an arbiter is shut down soon after it is removed from the replica set by a reconfig, the arbiter crashes and logs:

[ReplCoord-2] This node is not a member of the config
[ReplCoord-2] transition to REMOVED from ARBITER
[ReplCoord-2] terminate() called. An exception is active; attempting to gather more information
[ReplCoord-2] DBException::toString(): ShutdownInProgress: aborting KeysCollectionManager::PeriodicRunner::setFunc because node is shutting down
Actual exception type: mongo::error_details::ExceptionForImpl<(mongo::ErrorCodes::Error)91, mongo::ExceptionForCat<(mongo::ErrorCategory)6>, mongo::ExceptionForCat<(mongo::ErrorCategory)7>, mongo::ExceptionForCat<(mongo::ErrorCategory)13> >
----- BEGIN BACKTRACE -----
 mongod(_ZN5mongo15printStackTraceERNS_14StackTraceSinkE+0xB4) [0x562227EC2114]
 mongod(_ZN5mongo15printStackTraceERSo+0x2F) [0x562227EC2E2F]
 mongod(+0x2AD2686) [0x562227EC1686]
 mongod(_ZN10__cxxabiv111__terminateEPFvvE+0x6) [0x562228033266]
 mongod(+0x2CD8589) [0x5622280C7589]
 mongod(__gxx_personality_v0+0x2C5) [0x562228032C85]
 libgcc_s.so.1(+0x10613) [0x7F876CB02613]
 libgcc_s.so.1(_Unwind_Resume+0x125) [0x7F876CB02E95]
 mongod(+0xD73B00) [0x562226162B00]
 mongod(_ZN5mongo10ThreadPool10_doOneTaskEPSt11unique_lockINS_5LatchEE+0xFF) [0x562226C5140F]
 mongod(_ZN5mongo10ThreadPool13_consumeTasksEv+0x91) [0x562226C53CD1]
 mongod(_ZN5mongo10ThreadPool17_workerThreadBodyEPS0_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x12E) [0x562226C54C7E]
 mongod(+0x1865E93) [0x562226C54E93]
 mongod(+0x2C5FCCF) [0x56222804ECCF]
 libpthread.so.0(+0x76DB) [0x7F876C8DA6DB]
 libc.so.6(clone+0x3F) [0x7F876C60388F]
-----  END BACKTRACE  -----

The sequence on the arbiter is:

  • ReplicationCoordinatorImpl::_heartbeatReconfigFinish
  • ReplicationCoordinatorImpl::_performPostMemberStateUpdateAction with action=kActionRollbackOrRemoved
  • ReplicationCoordinatorExternalStateImpl::shardingOnStepDownHook (despite the name, this hook doesn't only run on stepdown)
  • KeysCollectionManager::enableKeyGenerator with doEnable=false
  • KeysCollectionManager::PeriodicRunner::setFunc is called with a lambda
  • The PeriodicRunner throws a shutdown error, which is uncaught and terminates mongod

I can only reproduce this with an arbiter, not with a data node. I'm not sure why.

Proposed fix: KeysCollectionManager::PeriodicRunner::setFunc catches and logs shutdown errors.
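
A minimal sketch of the proposed fix, under stated assumptions: the classes below are simplified, hypothetical stand-ins and not the real KeysCollectionManager::PeriodicRunner or MongoDB's exception types. The names ShutdownInProgress (as a plain struct), setFuncOrThrow, hasFunc, and shutdown are invented for illustration. It only shows the general idea of catching the shutdown error at the setFunc boundary and logging it, so the exception cannot escape from a thread pool task and reach terminate().

```cpp
#include <functional>
#include <iostream>
#include <mutex>
#include <stdexcept>
#include <utility>

// Hypothetical stand-in for a ShutdownInProgress error (not MongoDB's real type).
struct ShutdownInProgress : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical, simplified runner; the real PeriodicRunner is more involved.
class PeriodicRunner {
public:
    using Func = std::function<void()>;

    // Throwing variant: models the behavior described in the report, where
    // installing a function after shutdown has begun raises an error.
    void setFuncOrThrow(Func f) {
        std::lock_guard<std::mutex> lk(_mutex);
        if (_inShutdown) {
            throw ShutdownInProgress("aborting setFunc because node is shutting down");
        }
        _func = std::move(f);
    }

    // Tolerant variant: catches the shutdown error and logs it instead of
    // letting it propagate. Returns true if the function was installed.
    bool setFunc(Func f) {
        try {
            setFuncOrThrow(std::move(f));
            return true;
        } catch (const ShutdownInProgress& ex) {
            std::cerr << "ignoring setFunc during shutdown: " << ex.what() << '\n';
            return false;
        }
    }

    // Marks the runner as shutting down; later setFunc calls are rejected.
    void shutdown() {
        std::lock_guard<std::mutex> lk(_mutex);
        _inShutdown = true;
    }

    // Reports whether a function is currently installed.
    bool hasFunc() const {
        std::lock_guard<std::mutex> lk(_mutex);
        return static_cast<bool>(_func);
    }

private:
    mutable std::mutex _mutex;
    bool _inShutdown = false;
    Func _func;
};
```

In this sketch, a caller on the reconfig path would use setFunc and treat a false return as "the node is shutting down, nothing to do", while setFuncOrThrow remains available to callers that can handle the exception themselves. Whether the real fix catches the error inside setFunc or avoids throwing altogether is not stated here.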



 Comments   
Comment by Githook User [ 21/Feb/20 ]

Author:

{'name': 'A. Jesse Jiryu Davis', 'username': 'ajdavis', 'email': 'jesse@mongodb.com'}

Message: SERVER-46218 Fix removal/shutdown race in arbiter

If an arbiter is shut down soon after it is removed from the replica set
by a reconfig, the arbiter crashes due to a race between shutdown and
reconfig in KeysCollectionManager::enableKeyGenerator.

(cherry picked from commit 32f47846d78a4fdae9564b7ebb442d53e737d845)
Branch: v4.2
https://github.com/mongodb/mongo/commit/2c53a56910033ae757b19747edb4e6f2de59e130

Comment by Githook User [ 20/Feb/20 ]

Author:

{'username': 'ajdavis', 'name': 'A. Jesse Jiryu Davis', 'email': 'jesse@mongodb.com'}

Message: SERVER-46218 Fix removal/shutdown race in arbiter

If an arbiter is shut down soon after it is removed from the replica set
by a reconfig, the arbiter crashes due to a race between shutdown and
reconfig in KeysCollectionManager::enableKeyGenerator.
Branch: master
https://github.com/mongodb/mongo/commit/32f47846d78a4fdae9564b7ebb442d53e737d845

Comment by A. Jesse Jiryu Davis [ 18/Feb/20 ]

The bug was apparently introduced between 4.2.1 and 4.2.2, but I haven't bisected it to a specific commit yet. This fix should be backported to 4.2.

I'm putting this in the Safe Replica Set Reconfig epic since it's blocking testing for that project.

Generated at Thu Feb 08 05:10:51 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.