[SERVER-35407] ReplicationCoordinatorExternalState and data replication must not be started after shutdown Created: 05/Jun/18  Updated: 29/Oct/23  Resolved: 20/Dec/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.2.3, 4.3.3, 4.0.15

Type: Bug Priority: Major - P3
Reporter: Matthew Russotto Assignee: Lingzhi Deng
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Duplicate
is duplicated by SERVER-35408 ReplicationCoordinatorExternalState a... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.2, v4.0
Sprint: Repl 2019-11-04, Repl 2019-12-30
Participants:
Linked BF Score: 53

 Description   

It is possible for the ReplicationCoordinatorImpl to call startThreads on the ReplicationCoordinatorExternalState after calling shutdown() on the external state. In this case the ReplicationCoordinatorExternalState should refuse to start.

Additionally, ReplicationCoordinatorImpl::_startDataReplication may be called after shutdown().
This calls _externalState->startSteadyStateReplication, which should also do nothing after shutdown.

Further, if initial sync is required, ReplicationCoordinatorImpl::_startDataReplication should avoid creating the initial syncer after shutdown.



 Comments   
Comment by Githook User [ 14/Jan/20 ]

Author:

{'name': 'Lingzhi Deng', 'email': 'lingzhi.deng@mongodb.com', 'username': 'ldennis'}

Message: SERVER-35407: Refuse to start threads and initial sync if replication is shutting down

(cherry picked from commit 9eb31a596db5cdf94e9660ba9a8eb178f483e329)
Branch: v4.2
https://github.com/mongodb/mongo/commit/f304776caf5e2113347c3ea52a05eee396b9ce66

Comment by Githook User [ 14/Jan/20 ]

Author:

{'name': 'Lingzhi Deng', 'email': 'lingzhi.deng@mongodb.com', 'username': 'ldennis'}

Message: SERVER-35407: Refuse to start threads and initial sync if replication is shutting down

(cherry picked from commit 9eb31a596db5cdf94e9660ba9a8eb178f483e329)
Branch: v4.0
https://github.com/mongodb/mongo/commit/a5da3c438438194bab72bd6d98118fa6f7e02699

Comment by Githook User [ 20/Dec/19 ]

Author:

{'name': 'Lingzhi Deng', 'email': 'lingzhi.deng@mongodb.com', 'username': 'ldennis'}

Message: SERVER-35407: Refuse to start threads and initial sync if replication is shutting down
Branch: master
https://github.com/mongodb/mongo/commit/9eb31a596db5cdf94e9660ba9a8eb178f483e329

Comment by Lingzhi Deng [ 18/Dec/19 ]

Apparently SERVER-39776 also added code avoid creating InitialSyncer if _inShutDown. But SERVER-42165 accidentally removed it.

Comment by Lingzhi Deng [ 18/Dec/19 ]

Yes, it is possible for initialSyncerCopy to startup during the other thread (shutdown thread) calling shutdown on the same initialSyncer. But I think it should still be safe – 1. if InitialSyncer::shutdown runs first, it will set _state to State::kComplete and InitialSyncer::startup will just fail; 2. If InitialSyncer::startup runs first, InitialSyncer::shutdown should shut it down.

Comment by Matthew Russotto [ 17/Dec/19 ]

Yes, that looks right. The first bug is apparently SERVER-39776, I added the log statement but missed the return. But even if that was fixed, the second bug can get us, since we release the lock after clearing _initialSyncer.

So if you add the return, _startThreads will properly do nothing.

_externalState->startSteadyStateReplication already correctly does nothing in the case where initial sync is not needed..

And making sure _initialSyncer isn't created while in shutdown should handle that case.

This still allows the initialSyncerCopy to start while in shutdown, but I think it will just fail; might be worth checking.

Comment by Lingzhi Deng [ 16/Dec/19 ]

It looks like we did at one point want ReplicationCoordinatorExternalStateImpl::startThreads to do nothing after ReplicationCoordinatorExternalStateImpl::shutdown. But it seems like we are missing a "return" here?

And ReplicationCoordinatorExternalStateImpl::startSteadyStateReplication already has code to skip starting up if _inShutdown. So this should be safe?

So my understand of this ticket is to:
1. Add an early return here at ReplicationCoordinatorExternalStateImpl::startThreads.
2. Fix ReplicationCoordinatorImpl::_startDataReplication to avoid creating InitialSyncer here if _inShutdown.

matthew.russotto, does this cover the problem discussed here? Or is there anything else that I need to be careful with?

Comment by Tess Avitabile (Inactive) [ 30/Oct/19 ]

Got it, thank you for explaining.

Comment by Matthew Russotto [ 30/Oct/19 ]

I believe the issue in the bug is that while we're in shutdown(), we can have reached kConfigSteadyState and not wait for startup to complete, but startup isn't really complete.

Suppose we're here in one thread

https://github.com/mongodb/mongo/blob/604cc100d1e12422aa3243385bc994651e0e19d2/src/mongo/db/repl/replication_coordinator_impl.cpp#L598

and blocked right here on another thread (the shutdown thread)

https://github.com/mongodb/mongo/blob/604cc100d1e12422aa3243385bc994651e0e19d2/src/mongo/db/repl/replication_coordinator_impl.cpp#L817

Now finishLoadLocalConfig reaches here, where it has released locks

https://github.com/mongodb/mongo/blob/604cc100d1e12422aa3243385bc994651e0e19d2/src/mongo/db/repl/replication_coordinator_impl.cpp#L610

Because setCurrentRSConfig has been run, the config state is now steady state.

Now the external state shutdown can run. Then we can hop back on the finishLoadLocalConfig thread and call _startDataReplication.

https://github.com/mongodb/mongo/blob/604cc100d1e12422aa3243385bc994651e0e19d2/src/mongo/db/repl/replication_coordinator_impl.cpp#L621

Then we exit finishLoadLocalConfig and shutdown finishes, but there are still potentially replication components which haven't been shut down.

Comment by Tess Avitabile (Inactive) [ 30/Oct/19 ]

matthew.russotto, I believe the code structure you describe is intentional. InĀ SERVER-31514, startSteadyStateReplication was moved outside of the mutex and shutdown handling was implemented in ReplicationCoordinatorExternalStateImpl. Also, when the ReplicationCoordinatorImpl is shut down, we wait for startup to complete before shutting down, so it is supposed to be acceptable for startup to continue after the shutdown has been signaled. However, it is concerning that you saw WT shut down before replication finished shutting down. Unfortunately, we no longer have logs, so I'm not sure how to diagnose the issue. Do you recall any more details on how the bug manifested?

Generated at Thu Feb 08 04:39:44 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.