[SERVER-26494] remove unreachable else-branch in sync_source_feedback.cpp logic Created: 05/Oct/16  Updated: 13/Jul/19  Resolved: 28/Feb/17

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 3.5.4

Type: Bug Priority: Major - P3
Reporter: Spencer Brody (Inactive) Assignee: Benety Goh
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-16441 syncSourceFeedback can spin, with net... Closed
is related to SERVER-25702 add support to OplogFetcher for resta... Closed
is related to SERVER-16272 SyncSourceFeedback spams log on errors Closed
is related to SERVER-18029 Data Replicator Reporter - Extend Rep... Closed
is related to SERVER-23085 Data Replicator Reporter - Integrate ... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Repl 2016-12-12, Repl 2017-01-23, Repl 2017-02-13, Repl 2017-03-06
Participants:

 Description   

This block of code has 2 if-statements which will cover all cases, preventing the else-branch which blacklists the current sync source from ever being reached.

This code is new in 3.3



 Comments   
Comment by Siyuan Zhou [ 13/Jul/19 ]

The following commit is part of this ticket but has a wrong SERVER number, so I posted it here.

Author:

{u'username': u'benety', u'name': u'Benety Goh', u'email': u'benety@mongodb.com'}

Message: SERVER-26484 removed unreachable code in SyncSourceFeedback. Blacklisting of reporter sync source will be done in BackgroundSync and SyncSourceResolver.
Branch: master
https://github.com/mongodb/mongo/commit/b690ca960b5062ee5b8af034ffe338051a228431

Comment by Githook User [ 28/Feb/17 ]

Author:

{u'username': u'benety', u'name': u'Benety Goh', u'email': u'benety@mongodb.com'}

Message: SERVER-26494 removed unused function arguments from SyncSourceFeedback
Branch: master
https://github.com/mongodb/mongo/commit/d63633540dd66d7b1830a257fd178c2fd38272b9

Comment by Spencer Brody (Inactive) [ 23/Feb/17 ]

Okay then, feel free to update this ticket to be about removing the dead code rather than fixing it

Comment by Benety Goh [ 23/Feb/17 ]

I can confirm that NotMasterOrSecondary errors caused by replSetMaintenance are handled correctly in BackgroundSync and SyncSourceFeedback:

On a chaining replica set setup n1 (pri) <- n2 (sec) <- n3 (sec),

running

{ replSetMaintenance: true }

on n2 will cause the oplog fetcher on n3 to fail with the following message:

2017-02-23T15:13:54.144-0500 W REPL     [rsBackgroundSync] Fetcher stopped querying remote oplog with error: InvalidSyncSource: sync source <n2> (config version: 5; last applied optime: { ts: Timestamp 1487880823000|1, t: 3 }; sync source index: -1; primary index: 0) is no longer valid

BackgroundSync clears its sync source and this leads to the Reporter in SyncSourceFeedback returning with an error when it detects that the sync source is no longer valid:

2017-02-23T15:13:54.162-0500 I REPL     [SyncSourceFeedback] SyncSourceFeedback error sending update to <n2>: InvalidSyncSource: Sync source was cleared. Was <n2>

Eventually, n3 selects n1 as its sync source and we go back to having a stable replica set:

2017-02-23T15:15:17.392-0500 I REPL     [rsBackgroundSync] sync source candidate: <n1>

Comment by Spencer Brody (Inactive) [ 23/Feb/17 ]

Yeah, okay, I'm mostly convinced. The only last thing I'd want to confirm is that NotMasterOrSecondary errors caused by something other than a config change (for example by a node having replSetMaintenance called on it) will also be correctly handled outside of sync_source_feedback

Comment by Benety Goh [ 22/Feb/17 ]

It seems that the common cause for the remote replSetUpdatePosition failing with those error codes NodeNotFound,NotMasterOrSecondary and InvalidReplicaSetConfig would be a change in config. Config changes currently will cause the oplog fetcher to fail because it will detect a change in the config version contained in the oplog query metadata. Subsequently, after exiting the oplog fetcher, we will go back to the sync source resolver which has its own blacklisting logic.

IMO, it'd be beneficial to centralize our blacklisting criteria in one location.

Comment by Spencer Brody (Inactive) [ 13/Feb/17 ]

Should we just swallow network exceptions like we do InvalidSyncSource and NodeNotFound? Conversely, is there a specific set of codes we should be looking for and blacklisting based on?

From a very cursory skim of the code for the replSetUpdatePosition command, it looks like it can fail with NodeNotFound, NotMasterOrSecondary, and InvalidReplicaSetConfig at least, and the latter two of those at least seem worthwhile to blacklist based on? Actually, we should probably also blacklist on NodeNotFound, the problem is we can't currently distinguish between a NodeNotFound error that we generated when trying to construct the update position command and a NodeNotFound received over the network from the sync source from it processing the update position command.

Comment by Judah Schvimer [ 08/Feb/17 ]

I think it's probably fine to rely on BackgroundSync to do blacklisting.

Comment by Benety Goh [ 08/Feb/17 ]

Blacklist the sync source on a failed Reporter command may interfere with the retry logic added to the OplogFetcher in SERVER-25702. We may prematurely blacklist a sync source due to an intermittent network issue. If the sync source really is unreachable, we would blacklist it anyway when looking for a new sync source in BackgroundSync after exiting OplogFetcher.

spencer, judah.schvimer, thoughts?

Comment by Benety Goh [ 10/Jan/17 ]

Those != should be ==.

Comment by Spencer Brody (Inactive) [ 07/Nov/16 ]

Benety, can you take a look at this and see if there's a potential bug here?

Generated at Thu Feb 08 04:12:19 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.