[SERVER-25593] ReplicationCoordinatorImpl should hold DataReplicator with shared pointer Created: 12/Aug/16 Updated: 19/Nov/16 Resolved: 26/Sep/16 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | 3.3.15 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Judah Schvimer | Assignee: | Scott Hernandez (Inactive) |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||
| Operating System: | ALL | ||||||||
| Sprint: | Repl 2016-08-29, Repl 2016-09-19, Repl 2016-10-10 | ||||||||
| Participants: | |||||||||
| Description |
|
The ReplicationCoordinatorImpl currently uses a unique pointer to hold onto the DataReplicator. When we call doInitialSync, the pointer is not locked, so another thread could reset the pointer before the call, like resync. Using a shared pointer will fix this and make it so we no longer needed to synchronize on the pointer use, only when replaced/swapped. Each user of the current DataReplicator would make a copy of the shared_ptr which will stay valid. The sequence of events Tests |
| Comments |
| Comment by Githook User [ 14/Oct/16 ] |
|
Author: {u'username': u'scotthernandez', u'name': u'Scott Hernandez', u'email': u'scotthernandez@gmail.com'}Message: |
| Comment by Githook User [ 26/Sep/16 ] |
|
Author: {u'username': u'scotthernandez', u'name': u'Scott Hernandez', u'email': u'scotthernandez@gmail.com'}Message: |
| Comment by Scott Hernandez (Inactive) [ 13/Sep/16 ] |
|
As discussed in person, this can happen with concurrent resync commands. I will update the description to make it more clear and what tests need to be included. |
| Comment by Eric Milkie [ 29/Aug/16 ] |
|
Can someone explain the situation where we have one DR that is still shutting down while a new one has started? |
| Comment by Andy Schwerin [ 29/Aug/16 ] |
|
I still don't understand. If you have to wait for the old DR to be fully shut down before starting a new one, you shouldn't need shared ownership. Keeping it by shared pointer seems like it's going to lead to bugs with two running DRs. |
| Comment by Judah Schvimer [ 15/Aug/16 ] |
|
I don't think that would be safe, but we will make sure that we wait for the old DataReplicator to shutdown before starting the new one. Using a shared_ptr will allow the old DataReplicator to remain alive to shutdown after we swap the pointer for the new DataReplicator without having to maintain a lock over the DataReplicator. That way when a resync resets the pointer, there is no race, and the call to doInitialSync operates on a non-null DataReplicator (and then immediately is shutdown). |
| Comment by Andy Schwerin [ 12/Aug/16 ] |
|
But is it safe to start a new Data Replicator before you know the old one has stopped doing writes? |
| Comment by Judah Schvimer [ 12/Aug/16 ] |
|
The goal is to allow one thread to shutdown the old data replicator and create a new one while the old one is still running. The old data replicator will then terminate since shutdown has been called and be destroyed. |
| Comment by Andy Schwerin [ 12/Aug/16 ] |
|
I'm wary of using a shared_ptr. Who owns the repjlicator when the other thread resets the repl coord's data replicator pointer? Could this other thread then create a new data replicator while the old one still exists? Is that meaningful? Safe? |