[SERVER-35407] ReplicationCoordinatorExternalState and data replication must not be started after shutdown Created: 05/Jun/18 Updated: 29/Oct/23 Resolved: 20/Dec/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | 4.2.3, 4.3.3, 4.0.15 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Matthew Russotto | Assignee: | Lingzhi Deng |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||
| Operating System: | ALL | ||||||||||||||||
| Backport Requested: |
v4.2, v4.0
|
||||||||||||||||
| Sprint: | Repl 2019-11-04, Repl 2019-12-30 | ||||||||||||||||
| Participants: | |||||||||||||||||
| Linked BF Score: | 53 | ||||||||||||||||
| Description |
|
It is possible for the ReplicationCoordinatorImpl to call startThreads on the ReplicationCoordinatorExternalState after calling shutdown() on the external state. In this case the ReplicationCoordinatorExternalState should refuse to start. Additionally, ReplicationCoordinatorImpl::_startDataReplication may be called after shutdown(). Further, if initial sync is required, ReplicationCoordinatorImpl::_startDataReplication should avoid creating the initial syncer after shutdown. |
| Comments |
| Comment by Githook User [ 14/Jan/20 ] |
|
Author: {'name': 'Lingzhi Deng', 'email': 'lingzhi.deng@mongodb.com', 'username': 'ldennis'}Message: (cherry picked from commit 9eb31a596db5cdf94e9660ba9a8eb178f483e329) |
| Comment by Githook User [ 14/Jan/20 ] |
|
Author: {'name': 'Lingzhi Deng', 'email': 'lingzhi.deng@mongodb.com', 'username': 'ldennis'}Message: (cherry picked from commit 9eb31a596db5cdf94e9660ba9a8eb178f483e329) |
| Comment by Githook User [ 20/Dec/19 ] |
|
Author: {'name': 'Lingzhi Deng', 'email': 'lingzhi.deng@mongodb.com', 'username': 'ldennis'}Message: |
| Comment by Lingzhi Deng [ 18/Dec/19 ] |
|
Apparently |
| Comment by Lingzhi Deng [ 18/Dec/19 ] |
|
Yes, it is possible for initialSyncerCopy to startup during the other thread (shutdown thread) calling shutdown on the same initialSyncer. But I think it should still be safe – 1. if InitialSyncer::shutdown runs first, it will set _state to State::kComplete and InitialSyncer::startup will just fail; 2. If InitialSyncer::startup runs first, InitialSyncer::shutdown should shut it down. |
| Comment by Matthew Russotto [ 17/Dec/19 ] |
|
Yes, that looks right. The first bug is apparently So if you add the return, _startThreads will properly do nothing. _externalState->startSteadyStateReplication already correctly does nothing in the case where initial sync is not needed.. And making sure _initialSyncer isn't created while in shutdown should handle that case. This still allows the initialSyncerCopy to start while in shutdown, but I think it will just fail; might be worth checking. |
| Comment by Lingzhi Deng [ 16/Dec/19 ] |
|
It looks like we did at one point want ReplicationCoordinatorExternalStateImpl::startThreads to do nothing after ReplicationCoordinatorExternalStateImpl::shutdown. But it seems like we are missing a "return" here? And ReplicationCoordinatorExternalStateImpl::startSteadyStateReplication already has code to skip starting up if _inShutdown. So this should be safe? So my understand of this ticket is to: matthew.russotto, does this cover the problem discussed here? Or is there anything else that I need to be careful with? |
| Comment by Tess Avitabile (Inactive) [ 30/Oct/19 ] |
|
Got it, thank you for explaining. |
| Comment by Matthew Russotto [ 30/Oct/19 ] |
|
I believe the issue in the bug is that while we're in shutdown(), we can have reached kConfigSteadyState and not wait for startup to complete, but startup isn't really complete. Suppose we're here in one thread and blocked right here on another thread (the shutdown thread) Now finishLoadLocalConfig reaches here, where it has released locks Because setCurrentRSConfig has been run, the config state is now steady state. Now the external state shutdown can run. Then we can hop back on the finishLoadLocalConfig thread and call _startDataReplication. Then we exit finishLoadLocalConfig and shutdown finishes, but there are still potentially replication components which haven't been shut down. |
| Comment by Tess Avitabile (Inactive) [ 30/Oct/19 ] |
|
matthew.russotto, I believe the code structure you describe is intentional. InĀ |