[SERVER-25593] ReplicationCoordinatorImpl should hold DataReplicator with shared pointer Created: 12/Aug/16  Updated: 19/Nov/16  Resolved: 26/Sep/16

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 3.3.15

Type: Bug Priority: Major - P3
Reporter: Judah Schvimer Assignee: Scott Hernandez (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-26355 Ensure concurrent resync works on sec... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Repl 2016-08-29, Repl 2016-09-19, Repl 2016-10-10
Participants:

 Description   

The ReplicationCoordinatorImpl currently uses a unique pointer to hold onto the DataReplicator. When we call doInitialSync, the pointer is not locked, so another thread could reset the pointer before the call, like resync.

Using a shared pointer will fix this and make it so we no longer needed to synchronize on the pointer use, only when replaced/swapped. Each user of the current DataReplicator would make a copy of the shared_ptr which will stay valid. The sequence of events

Tests
A test to prove correctness will be concurrent resync requests during initial sync, or at all. Each resync will stop an existing initial-sync/resync and start a new initial sync, optionally waiting till completion.



 Comments   
Comment by Githook User [ 14/Oct/16 ]

Author:

{u'username': u'scotthernandez', u'name': u'Scott Hernandez', u'email': u'scotthernandez@gmail.com'}

Message: SERVER-25593: check shutdown/failure status during initial sync when acquiring lock.
Branch: master
https://github.com/mongodb/mongo/commit/886db25408492ced7d84b1cc06956d1e8613b29e

Comment by Githook User [ 26/Sep/16 ]

Author:

{u'username': u'scotthernandez', u'name': u'Scott Hernandez', u'email': u'scotthernandez@gmail.com'}

Message: SERVER-25593: use shared_ptr for DataReplicator
Branch: master
https://github.com/mongodb/mongo/commit/1e703b173478fd3dbef9611ec03d1d702ad5de02

Comment by Scott Hernandez (Inactive) [ 13/Sep/16 ]

As discussed in person, this can happen with concurrent resync commands. I will update the description to make it more clear and what tests need to be included.

Comment by Eric Milkie [ 29/Aug/16 ]

Can someone explain the situation where we have one DR that is still shutting down while a new one has started?

Comment by Andy Schwerin [ 29/Aug/16 ]

I still don't understand. If you have to wait for the old DR to be fully shut down before starting a new one, you shouldn't need shared ownership. Keeping it by shared pointer seems like it's going to lead to bugs with two running DRs.

Comment by Judah Schvimer [ 15/Aug/16 ]

I don't think that would be safe, but we will make sure that we wait for the old DataReplicator to shutdown before starting the new one. Using a shared_ptr will allow the old DataReplicator to remain alive to shutdown after we swap the pointer for the new DataReplicator without having to maintain a lock over the DataReplicator. That way when a resync resets the pointer, there is no race, and the call to doInitialSync operates on a non-null DataReplicator (and then immediately is shutdown).

Comment by Andy Schwerin [ 12/Aug/16 ]

But is it safe to start a new Data Replicator before you know the old one has stopped doing writes?

Comment by Judah Schvimer [ 12/Aug/16 ]

The goal is to allow one thread to shutdown the old data replicator and create a new one while the old one is still running. The old data replicator will then terminate since shutdown has been called and be destroyed.

Comment by Andy Schwerin [ 12/Aug/16 ]

I'm wary of using a shared_ptr. Who owns the repjlicator when the other thread resets the repl coord's data replicator pointer? Could this other thread then create a new data replicator while the old one still exists? Is that meaningful? Safe?

Generated at Thu Feb 08 04:09:38 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.