[SERVER-12284] ReplicaSetMonitor is broken Created: 07/Jan/14  Updated: 11/Jul/16  Resolved: 30/Jan/14

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 2.5.5

Type: Bug Priority: Major - P3
Reporter: Mathias Stearn Assignee: Mathias Stearn
Resolution: Done Votes: 0
Labels: 26qa
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-7937 Write more test for ReplicaSetMonitor Closed
is depended on by SERVER-10304 Don't hold mutex while trying to esta... Closed
Duplicate
is duplicated by SERVER-6703 Uneven distribution of request are be... Closed
is duplicated by SERVER-7274 Check on connect() for DBClientRS? Closed
is duplicated by SERVER-10686 Selection of replicas in mongos not c... Closed
is duplicated by SERVER-12221 Sleep in ReplicaSetMonitor::_check is... Closed
is duplicated by SERVER-9021 Make sure that at most one thread at ... Closed
is duplicated by SERVER-5496 Refactor ReplicaSetMonitor to avoid d... Closed
Related
related to SERVER-10304 Don't hold mutex while trying to esta... Closed
is related to SERVER-12635 ReplicaSetMonitor should return Scope... Closed
Operating System: ALL
Participants:

 Description   

This is a list of known issues (mostly surrounding the _check method) that should potentially be tackled at the same time.

  • When selectAndCheckNode or getMaster can't find a suitable node, it calls _check to update the local view of the set, which in turn calls _checkConnection and waits on _checkConnectionLock. As a result, only one thread at a time can issue isMaster calls: each thread waits its turn to issue an isMaster call to every node in the set, even if another thread already updated the view while it was waiting. This can cause significant latency, since a request can block behind thousands of threads, each doing up to 24 network round-trips (2 per node in the set).
  • _check unconditionally retries after sleeping for 1 second if no master is found.
    • selectAndCheckNode has to wait here even if there is a suitable secondary that it could return.
    • This can delay the ReplicaSetMonitor background thread (which calls checkAll -> check -> _check) preventing it from checking other sets, even though it will retry with that set again soon.
    • The sleep also happens after the final retry, even though no further retry will occur.
  • _check iterates over _nodes but indirectly calls _checkHosts which can mutate _nodes, causing hosts to be skipped. There is some protection against this from _checkConnMatch_inlock, but it isn't complete.
    • There is a related issue that it will alternate between adding and removing a host if two nodes disagree about who is in the set. This might cause an issue in a split-brain scenario where one reachable host doesn't think the current master is in the set (especially since _master is just an index into _nodes).
  • Updates due to needing a master should probably take advantage of the "primary" field in isMaster responses to find the master rather than linearly scanning _nodes.
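One way to address the first bullet (every waiting thread serially re-running a full isMaster sweep) is a scan-generation handshake: a thread that failed to find a host only performs a scan itself if no other thread has completed one since it started waiting. The sketch below is purely illustrative (the class and method names are assumptions, not the actual ReplicaSetMonitor API):

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Hypothetical sketch: instead of every thread serially re-running isMaster
// against all nodes under _checkConnectionLock, track a scan generation.
// A thread only scans if nobody else finished a scan after it began waiting.
class ScanCoordinator {
public:
    // Returns true if the caller should perform the scan itself, false if
    // another thread finished a scan in the meantime and the caller can just
    // re-check the (now refreshed) local view of the set.
    bool shouldScan(long observedGeneration) {
        std::unique_lock<std::mutex> lk(_mutex);
        if (_generation != observedGeneration)
            return false;  // someone else already refreshed the view
        if (_scanInProgress) {
            _done.wait(lk, [&] { return _generation != observedGeneration; });
            return false;
        }
        _scanInProgress = true;  // this thread takes responsibility for the scan
        return true;
    }

    void scanFinished() {
        std::lock_guard<std::mutex> lk(_mutex);
        _scanInProgress = false;
        ++_generation;
        _done.notify_all();  // wake waiters; the view has been refreshed
    }

    long generation() {
        std::lock_guard<std::mutex> lk(_mutex);
        return _generation;
    }

private:
    std::mutex _mutex;
    std::condition_variable _done;
    bool _scanInProgress = false;
    long _generation = 0;
};
```

With this pattern a request blocks for at most one in-flight sweep of the set rather than one sweep per queued thread.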

Not bugs, but changes that could make the class easier to reason about:

  • Both of the getSlave methods are unused and can be deleted.
  • If getMaster() just became a wrapper around selectAndCheckNode(PrimaryOnly) all node selection and retry logic would be in a single path.
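The suggested getMaster() refactor can be sketched as below. This is a toy illustration only: the ReadPreference enum, the HostAndPort layout, and the stubbed selection body are assumptions made for the example, not the real dbclient_rs code.

```cpp
#include <cassert>
#include <string>

struct HostAndPort {
    std::string hostAndPort;
};

enum class ReadPreference { PrimaryOnly, SecondaryPreferred };

class ReplicaSetMonitor {
public:
    // Single path containing all node-selection and retry logic.
    HostAndPort selectAndCheckNode(ReadPreference pref) {
        // (stubbed out: the real logic would scan the set as described above)
        return pref == ReadPreference::PrimaryOnly ? HostAndPort{"primary:27017"}
                                                   : HostAndPort{"secondary:27017"};
    }

    // getMaster() becomes a trivial wrapper, so there is no duplicate
    // retry logic to keep in sync.
    HostAndPort getMaster() {
        return selectAndCheckNode(ReadPreference::PrimaryOnly);
    }
};
```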


 Comments   
Comment by Githook User [ 29/Jan/14 ]

Author: Mathias Stearn (RedBeard0531) <mathias@10gen.com>

Message: SERVER-12284 Make test wait for mongos to be aware of secondary.
Branch: master
https://github.com/mongodb/mongo/commit/a195fdda02361abf7e74f8a99dc80550f46c2f84

Comment by Githook User [ 29/Jan/14 ]

Author: Mathias Stearn (RedBeard0531) <mathias@10gen.com>

Message: SERVER-12284 Simplify TagSet class

It is now little more than a well-typed BSONArray that has a more useful
default constructor.

Since the new RSM no longer uses the iterator portion of the API, it isn't
worth the cost of maintaining it. Additionally, TagSet being its own iterator
was more than a little odd.
Branch: master
https://github.com/mongodb/mongo/commit/98a4ac46bfcc5cbaf7b6d0ab1530d3fd96202a0d

Comment by Githook User [ 29/Jan/14 ]

Author: Mathias Stearn (RedBeard0531) <mathias@10gen.com>

Message: SERVER-12284 Fix dbtests/replica_set_monitor_test.cpp for RSM rewrite
Branch: master
https://github.com/mongodb/mongo/commit/d30a0aec7c9f8e45d0b2b95f9e55fafe3bedf170

Comment by Githook User [ 29/Jan/14 ]

Author: Mathias Stearn (RedBeard0531) <mathias@10gen.com>

Message: SERVER-12284 Rewrite ReplSetMonitor
Branch: master
https://github.com/mongodb/mongo/commit/e031dffb83c2a6f9f64a7d055493a8664af997d5

Comment by Githook User [ 29/Jan/14 ]

Author: Mathias Stearn (RedBeard0531) <mathias@10gen.com>

Message: SERVER-12284 Rip ReplicaSetMonitor out of dbclient_rs files

This is pure code motion.
Branch: master
https://github.com/mongodb/mongo/commit/cf3753ffe909b3f8cc45072933ee5f0426d2329d

Comment by Githook User [ 29/Jan/14 ]

Author: Mathias Stearn (RedBeard0531) <mathias@10gen.com>

Message: SERVER-12284 Improve HostAndPort comparisons

Comment by Scott Hernandez (Inactive) [ 24/Jan/14 ]

It would be good for the "nearest/local" threshold parameter (which defaults to 15ms, I believe) to be live for all replica set monitors, rather than copied once at monitor-creation time. Then the threshold could be changed dynamically without restarting the process.
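One way to get the "live" behavior described here is to keep the threshold in a process-wide atomic that every monitor reads at selection time instead of copying it at construction. The names below are hypothetical, chosen just for the sketch:

```cpp
#include <atomic>
#include <cassert>

// Hypothetical: a single process-wide threshold that a setParameter-style
// command could update at runtime. Every monitor re-reads it on each
// selection decision instead of caching a copy at construction.
std::atomic<int> localThresholdMillis{15};  // 15ms default, per the comment

// A secondary is "near enough" if its ping time is within the threshold
// of the best ping time seen in the set.
bool isNearEnough(int pingTimeMillis, int bestPingMillis) {
    return pingTimeMillis <= bestPingMillis + localThresholdMillis.load();
}
```

Changing localThresholdMillis then takes effect on the very next selection, with no restart needed.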

Comment by Mathias Stearn [ 10/Jan/14 ]

This is a sketch of the behavior we decided to move forward with. This algorithm is written assuming a single actor, but has a clear path to allow multiple threads.

STATE:
    Node
        HostAndPort host
        bool isUp = false
        bool isMaster = false
        bool knowTags = false
        set<string> tags
 
    ScanState
        deque<HostAndPort> hostsToScan
        set<HostAndPort> possibleNodes
        bool foundUpMaster = false
        bool foundAnyUpNodes = false
        set<HostAndPort> triedNodes
        set<Node> quarantine
 
    SetState
        string name
        set<Node> nodes
        set<HostAndPort> seedNodes
        optional<ScanState> scanState
        optional<HostAndPort> previousMaster
 
 
STEPS: (only the first 3 are entry points. start must be run exactly once before any other step)
    start(name, seeds)
        set = new SetState
        set.name = name
        set.seedNodes = seeds
 
    getHost(criteria)
        if any host in set.nodes is up and meets criteria
            return host
        return NONE
 
    getHostWithRetry(criteria)
        host = getHost(criteria)
        if host != NONE
            return host
 
        host = scanHostsUntil(criteria)
        if host != NONE
            return host
 
        sleep 1 second
        return scanHostsUntil(criteria) // may be NONE
 
    scanHostsUntil(criteria)
        host = getNextHostToScan
        if (host == NONE) // last scan ended on the final node but didn't pop the NONE
            host = getNextHostToScan
 
        for (; host != NONE; host = getNextHostToScan)
            reply = callIsMaster(host)
            if failed
                failedHost(host)
                contine
 
            receivedIsMaster(host, reply)
            node = getHost(criteria)
            if (node)
                return node
            
        return NONE
 
    getNextHostToScan
        if (!have scanState) // start a new scan
            set.scanState = new scanState
            set.scanState.hostsToScan = set.nodes
            randomly shuffle hostsToScan
            move down nodes to end of hostsToScan
            move set.previousMaster to front of hostsToScan (even if down)
 
        if (hostsToScan.empty())
            if (!foundUpMaster)
                add all possibleNodes not in triedNodes to hostsToScan
 
        if (hostsToScan.empty())
            if (!foundAnyUpNodes) 
                add all seedNodes not in triedNodes to hostsToScan
 
        if (hostsToScan.empty())
            if (!foundUpMaster)
                add all quarantined nodes to set.nodes
            set.scanState.reset()
            return NONE
 
        return hostsToScan.pop()
 
    failedHost(host)
        triedNodes.insert(host)
        if host in set.nodes
            reset node entry for host to default state (leaving host set)
        
    receivedIsMaster(host, reply)
        triedNodes.insert(host)
 
        if (reply.name != set.name || !(reply.isNormalMemberAndUp()))
            failedHost(host)
            return
 
        myNode = NULL
        if (reply.isMaster)
            if foundUpMaster
                mark old master as not master
 
            remove nodes from set.nodes and hostsToScan not in reply.allNormalHosts
            for each host in reply.allNormalHosts
                if node not in set.nodes
                    if node in quarantine
                        move to set.nodes
                    else
                        add to set.nodes
 
            myNode = nodes.find(host)
            foundUpMaster = true
            set.previousMaster = host
        else
            for each host in reply.allNormalHosts
                if node not in set.nodes
                    add to possibleNodes
 
            if reply.primary not in triedNodes
                move primary to front of hostsToScan (inserting if needed)
 
            if host in set.nodes
                myNode = set.nodes.find(host)
            else
                myNode = quarantine.addNew()
                myNode.host = host
 
        foundAnyUpNodes = true
        myNode.isUp = true
        myNode.knowTags = true
        myNode.tags = reply.tags
        myNode.isMaster = reply.isMaster
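The ordering rules in getNextHostToScan above (shuffle, down nodes last, previous master first) can be sketched concretely. This is a minimal, simplified illustration under stated assumptions: hosts are plain strings, and the quarantine/seed-list fallbacks are omitted:

```cpp
#include <algorithm>
#include <cassert>
#include <deque>
#include <random>
#include <set>
#include <string>
#include <vector>

// Simplified stand-in for the Node record in the sketch above.
struct Node {
    std::string host;
    bool isUp = false;
};

// Build the initial hostsToScan queue for a new scan.
std::deque<std::string> buildScanOrder(const std::vector<Node>& nodes,
                                       const std::string& previousMaster) {
    std::deque<std::string> order;
    for (const auto& n : nodes)
        order.push_back(n.host);

    // Randomly shuffle so scans don't always hit the same host first.
    std::shuffle(order.begin(), order.end(), std::mt19937{std::random_device{}()});

    // Move known-down nodes to the end of the queue.
    std::set<std::string> up;
    for (const auto& n : nodes)
        if (n.isUp)
            up.insert(n.host);
    std::stable_partition(order.begin(), order.end(),
                          [&](const std::string& h) { return up.count(h) > 0; });

    // Scan the previous master first, even if it was down.
    auto it = std::find(order.begin(), order.end(), previousMaster);
    if (it != order.end()) {
        order.erase(it);
        order.push_front(previousMaster);
    }
    return order;
}
```

Checking the previous master first means the common case (an unchanged primary) completes after a single isMaster round-trip.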

Comment by Andy Schwerin [ 10/Jan/14 ]

redbeard0531, please attach the proposed behavior description we discussed today, maybe just in a comment in the ticket.
