Type: Bug
Resolution: Done
Priority: Major - P3
Fix Version/s: 2.5.5
Affects Version/s: None
Component/s: None
Labels: 26qa
Operating System: ALL
Confidence Status: None
Work Order: 3
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

This is a list of known issues (mostly surrounding the _check method) that should potentially be tackled at the same time.

When selectAndCheckNode or getMaster can't find a suitable node, it calls _check to update the local view of the set, which in turn calls _checkConnection and waits on _checkConnectionLock. As a result, only one thread at a time can issue isMaster calls. Each thread waits its turn to issue an isMaster call to every node in the set, even if another thread already refreshed the view while it was waiting. This can cause significant latency: a request can block while thousands of threads each perform up to 24 network round-trips (2 per node in the set).
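
One possible way to avoid the redundant round-trips described above is to let a waiting thread notice that the view was already refreshed while it was blocked. The following is a minimal Python sketch of that idea, not the actual ReplicaSetMonitor code: the `SetView` class, the generation counter, and the `probe` callable are all hypothetical names introduced for illustration.

```python
import threading


class SetView:
    """Hypothetical stand-in for the monitor's local view of one replica set."""

    def __init__(self, probe):
        self._probe = probe            # assumed callable that does the isMaster round-trips
        self._lock = threading.Lock()  # plays the role of _checkConnectionLock
        self._generation = 0           # bumped after every completed refresh
        self.probe_calls = 0           # counts how many refreshes actually ran

    def generation(self):
        return self._generation

    def refresh_if_stale(self, seen_generation):
        """Refresh only if no other thread refreshed since the caller looked."""
        with self._lock:
            if self._generation != seen_generation:
                return False           # someone else already updated the view
            self._probe()
            self.probe_calls += 1
            self._generation += 1
            return True
```

A caller would read `generation()` before deciding the view is unsuitable and pass that value to `refresh_if_stale`. Threads that queued behind the lock then return immediately instead of repeating the scan.
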
_check unconditionally sleeps for 1 second and then retries if no master is found.
- selectAndCheckNode has to wait out this sleep even if there is a suitable secondary that it could return.
- This can delay the ReplicaSetMonitor background thread (which calls checkAll -> check -> _check), preventing it from checking other sets, even though it will retry with that set again soon.
- The sleep also happens after the retry, even though no further retry follows.
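
The sleep-related points above can be illustrated with a small Python sketch. It is an assumption about what a fixed loop might look like, not the shipped fix: `check_with_retry`, `find_master`, `find_secondary`, and `accept_secondary` are hypothetical names, and the `sleep` parameter exists only so the behavior can be observed without waiting.

```python
import time


def check_with_retry(find_master, find_secondary, accept_secondary,
                     retry_delay=1.0, sleep=time.sleep):
    """Return a usable node, or None after one scan plus one retry."""
    for attempt in range(2):
        master = find_master()
        if master is not None:
            return master
        if accept_secondary:
            secondary = find_secondary()
            if secondary is not None:
                return secondary       # no reason to wait for a master
        if attempt == 0:
            sleep(retry_delay)         # sleep only when a retry will follow
    return None
```

In this sketch a caller that can use a secondary never sleeps, and the final failed attempt returns immediately rather than sleeping a second time.
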
_check iterates over _nodes but indirectly calls _checkHosts which can mutate _nodes, causing hosts to be skipped. There is some protection against this from _checkConnMatch_inlock, but it isn't complete.
- Relatedly, _check will alternate between adding and removing a host if two nodes disagree about who is in the set. This might cause problems in a split-brain scenario where one reachable host doesn't think the current master is in the set (especially since _master is just an index into _nodes).
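
The skipped-host problem is the general hazard of mutating a sequence while iterating over it. The Python sketch below reproduces the effect and shows one common mitigation (iterating over a snapshot). The function names and the `drop_stale` callback are hypothetical stand-ins for _check and _checkHosts, not the real implementation.

```python
def visit_live(nodes, on_visit):
    """Iterate the live list; removals made by on_visit shift later entries."""
    visited = []
    for node in nodes:
        visited.append(node)
        on_visit(nodes, node)
    return visited


def visit_snapshot(nodes, on_visit):
    """Iterate a copy, skipping entries that an earlier callback removed."""
    visited = []
    for node in list(nodes):
        if node not in nodes:
            continue
        visited.append(node)
        on_visit(nodes, node)
    return visited


def drop_stale(nodes, node):
    # Hypothetical: checking "a" reveals that "a" is no longer in the set.
    if node == "a":
        nodes.remove("a")
```

With `["a", "b", "c"]` as input, `visit_live` never visits `"b"`, while `visit_snapshot` visits all three. The same shift is why a positional index such as _master can end up referring to a different host after a removal.
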
Updates due to needing a master should probably take advantage of the "primary" field in isMaster responses to find the master rather than linearly scanning _nodes.
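
As a rough illustration of the suggestion above, the following Python sketch follows the "primary" hint from an isMaster reply instead of probing every node in order. The `ismaster` and `primary` field names come from the isMaster response; the `find_master` function and the `is_master` callable (assumed to return a reply dict, or None for an unreachable host) are hypothetical.

```python
def find_master(nodes, is_master):
    """Locate the master, jumping to the host named in a reply's "primary" field."""
    for host in nodes:
        reply = is_master(host)
        if reply is None:
            continue                    # unreachable, try the next node
        if reply.get("ismaster"):
            return host
        hinted = reply.get("primary")
        if hinted and hinted != host:
            confirm = is_master(hinted)  # verify the hint before trusting it
            if confirm is not None and confirm.get("ismaster"):
                return hinted
    return None
```

If the first reachable node reports the correct primary, this needs two probes regardless of set size. A stale or wrong hint simply falls back to continuing the scan.
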

Not bugs, but changes that could make the class easier to reason about:

Both of the getSlave methods are unused and can be deleted.
If getMaster() just became a wrapper around selectAndCheckNode(PrimaryOnly), all node selection and retry logic would be in a single path.
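
The single-path idea can be sketched as follows. This is an illustrative Python model, not the C++ class: the `Monitor` class, its node representation, and the `ReadPreference` enum are assumptions made for the example, and only the PrimaryOnly mode is taken from the text above.

```python
from enum import Enum


class ReadPreference(Enum):
    PRIMARY_ONLY = "PrimaryOnly"
    SECONDARY_ONLY = "SecondaryOnly"   # included only to show a second mode


class Monitor:
    """Hypothetical model: nodes are (host, is_primary) pairs."""

    def __init__(self, nodes):
        self._nodes = nodes

    def select_and_check_node(self, preference):
        # All selection logic lives here; retries would be added here too.
        for host, is_primary in self._nodes:
            if preference is ReadPreference.PRIMARY_ONLY and is_primary:
                return host
            if preference is ReadPreference.SECONDARY_ONLY and not is_primary:
                return host
        return None

    def get_master(self):
        # Thin wrapper: no separate selection or retry path to maintain.
        return self.select_and_check_node(ReadPreference.PRIMARY_ONLY)
```
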

is depended on by
- SERVER-7937 Write more test for ReplicaSetMonitor (Closed)
- SERVER-10304 Don't hold mutex while trying to establish connection to replica sets (Closed)

is duplicated by
- SERVER-6703 Uneven distribution of request are being sent to one node if all nodes are over localThreshold (Closed)
- SERVER-7274 Check on connect() for DBClientRS? (Closed)
- SERVER-10686 Selection of replicas in mongos not conforming to documentation (Closed)
- SERVER-12221 Sleep in ReplicaSetMonitor::_check is causing latency for slaveOk() queries in sharded cluster when there is no primary (Closed)
- SERVER-9021 Make sure that at most one thread at a time in mongos is making calls to the shard replSets to update the health of the nodes (Closed)
- SERVER-5496 Refactor ReplicaSetMonitor to avoid duplicate work (Closed)

is related to
- SERVER-12635 ReplicaSetMonitor should return Scoped connections to pool (Closed)

related to
- SERVER-10304 Don't hold mutex while trying to establish connection to replica sets (Closed)


Assignee: Mathias Stearn
Reporter: Mathias Stearn
Participants: Andy Schwerin, Githook User, Mathias Stearn, Scott Hernandez
Votes: 0
Watchers: 7

Created: Jan 07 2014 08:40:14 PM UTC
Updated: Jul 11 2016 05:19:26 PM UTC
Resolved: Jan 30 2014 12:21:55 AM UTC
Confidence Status Last Update: 10/Jan/14 9:33 PM
