[SERVER-46962] Test new sync source in quiesce mode Created: 18/Mar/20 Updated: 29/Oct/23 Resolved: 26/Jun/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | 4.7.0 |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Tess Avitabile (Inactive) | Assignee: | Pavithra Vetriselvan |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
||||||||||||
| Issue Links: |
|
||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||
| Sprint: | Repl 2020-06-01, Repl 2020-06-15, Repl 2020-06-29 | ||||||||||||
| Participants: | |||||||||||||
| Description |
|
Test that fetching oplog from a new sync source that is in quiesce mode fails to establish a connection, causing the server to reenter sync source selection. |
| Comments |
| Comment by Githook User [ 26/Jun/20 ] | |||||||
|
Author: {'name': 'Pavi Vetriselvan', 'email': 'pvselvan@umich.edu', 'username': 'pvselvan'}Message: | |||||||
| Comment by Githook User [ 22/Jun/20 ] | |||||||
|
Author: {'name': 'Pavi Vetriselvan', 'email': 'pvselvan@umich.edu', 'username': 'pvselvan'}Message: | |||||||
| Comment by Pavithra Vetriselvan [ 28/May/20 ] | |||||||
|
tess.avitabile Yep, you're right! Got it, I'll look into using replSetSyncFrom, thanks for the input everyone! | |||||||
| Comment by Tess Avitabile (Inactive) [ 28/May/20 ] | |||||||
|
Thanks for the investigation, pavithra.vetriselvan! The changes you propose sound right to me. For (1), I wonder if the right place to prevent the OplogFetcher from retrying in response to ShutdownInProgress is in OplogFetcherRestartDecisionDefault::shouldContinue(). I'd recommend testing that we cannot choose a sync source that is in quiesce mode by running replSetSyncFrom to sync from the node that is in quiesce mode and making sure that the node instead eventually syncs from the primary (in this case, we would not create a partition between the node and the primary). | |||||||
| Comment by Samyukta Lanka [ 27/May/20 ] | |||||||
|
The way sync source selection works is that the Sync Source Resolver will request a candidate from the topology coordinator by calling chooseNewSyncSource. When that function chooses a new sync source candidate, it will set the _syncSource field which will be used to populate the syncSourceHost in replSetGetStatus. So even though we fail sync from that node because we couldn't create an oplog fetcher, until we either clear the sync source or find a new one, replsetGetStatus will report that node as its sync source. I think we'll need to find a different way to check that the node fails to sync from the node in quiesce mode. | |||||||
| Comment by Pavithra Vetriselvan [ 27/May/20 ] | |||||||
|
Logs where we do not blacklist node1 after oplog fetcher errors: https://logkeeper.mongodb.org/lobster/logdrop#bookmarks=1751%2C1752%2C1753%2C1759%2C1760%2C1761%2C1762&f~=000~sdlfkj&l=1 Logs where we blacklist the node in the sync source resolver: https://logkeeper.mongodb.org/lobster/logdrop#bookmarks=1607%2C2380%2C3177&f~=000~Blackli&l=1 | |||||||
| Comment by Pavithra Vetriselvan [ 27/May/20 ] | |||||||
|
tess.avitabile samy.lanka While writing the test for this ticket (test file attached), I ran into some unexpected behavior with sync source selection. To start, this ticket is distinct from In my test, I try to force this scenario by creating a 3 node replica set. node0 is the primary, node1 is the syncSource, and node2 is the syncingNode. I disconnect the primary from node2 and then stop server replication on node2 (from my understanding, this prevents us from running through the _produce loop in bgsync). This way, node2 will only be able to choose node1 as its sync source when we do restart replication. After doing a couple writes on node1 so that it is ahead of node2, we cause it to enter quiesce mode. We then restart replication on node2. From the following logs, we can see that the oplog fetcher gets a ShutdownInProgress error:
The logs I added correspond to the lines here and here. One thing I noticed is that we never enter the _checkConnection function in DBClientConnection, which attempts to re-establish connections. I think this is likely because we're not in a failed state here (_failed defaults to false)? In my test, I would have expected the call to rst.awaitSyncSource to hang since we got this error from the oplog fetcher, but it looks like we are listing node1 as our sync source. awaitSyncSource checks the syncSourceHost field from replSetGetStatus. The test is passing, and I'm not sure if it's intended behavior that we still list node1 as our sync source even though the oplog fetcher returned a ShutdownInProgress error. Something else I noticed while looking through the code is that we will blacklist candidates if we encounter an error in the sync source resolver here. So, I think the only changes we need to make are: Once these changes are made, I assume that my test will no longer pass because node1 will never become node2's sync source. Samy, do the theories/changes sound correct to you? |