[SERVER-2471] Issue with slaveok failover for mongos Created: 02/Feb/11  Updated: 12/Jul/16  Resolved: 02/Mar/11

Status: Closed
Project: Core Server
Component/s: Replication, Sharding, Stability
Affects Version/s: None
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Greg Studer
Assignee: Greg Studer
Resolution: Done
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Linux 2.6.35-25-generic #44-Ubuntu SMP Fri Jan 21 17:40:44 UTC 2011 x86_64 GNU/Linux


Attachments: File shard_shutdown.js    
Operating System: ALL
Participants:

 Description   

Despite slaveok being set, mongos seems unable to use slaves for queries once the primary in a replica set goes down. A test that reproduces the issue is attached; if the test fails, an error is thrown from the final two lines ( coll.findOne() ). When the test is run with load('shard_shutdown.js') and the shell is left open, further calls to coll.findOne() cause different connection timeout errors, which are assumed to be related.

Reproduced on multiple systems (Ubuntu Linux) but not reproducible everywhere; it seems to be system-dependent.



 Comments   
Comment by Eliot Horowitz (Inactive) [ 06/Mar/11 ]

Was there a commit for this?

Comment by Greg Studer [ 07/Feb/11 ]

Sequence of events:

1. Primary server in shard replica set goes down.
2. A request for data from the shard hits a slave node because slaveOk is set to true.
3. The request for last error remembers the previous shard and checks out the replica set connection from thread-local storage, but is hardcoded never to allow checks on the slaves. It fails with an error and never checks the replica set connection back in.
4. All further requests for data from the replica set (on that particular thread at least) fail, since it is impossible to establish a new connection to the replica set without all the nodes.

You can get the same effect with any command where slaveok is not true (for example, after turning slaveok off and then on again). The error resets the thread-local connection, and new connections are not allowed while the replica set is down.
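
The sequence above can be sketched as a small toy model. This is only an illustration under stated assumptions: the class and function names (ThreadLocalReplSetConn, simulateFailover) are hypothetical and do not correspond to mongos internals, and the replica set is reduced to one primary and one secondary.

```javascript
// Toy model only: every name here is hypothetical, not a mongos internal.
class ThreadLocalReplSetConn {
  constructor(nodes) {
    this.nodes = nodes; // e.g. { primaryUp: true, secondaryUp: true }
    this.conn = {};     // connection already cached for this thread
  }

  checkout() {
    if (this.conn === null) {
      // Step 4: a brand-new connection needs every node to be reachable.
      if (!(this.nodes.primaryUp && this.nodes.secondaryUp)) {
        throw new Error("cannot connect to replica set");
      }
      this.conn = {};
    }
    return this.conn;
  }

  run(op, slaveOk) {
    this.checkout();
    if (slaveOk && this.nodes.secondaryUp) return op + " served by secondary";
    if (this.nodes.primaryUp) return op + " served by primary";
    // Step 3: the failure discards the cached connection for this thread.
    this.conn = null;
    throw new Error(op + " failed: transport error");
  }
}

function simulateFailover() {
  const nodes = { primaryUp: true, secondaryUp: true };
  const thread = new ThreadLocalReplSetConn(nodes);
  const log = [];
  const attempt = (op, slaveOk) => {
    try {
      log.push(thread.run(op, slaveOk));
    } catch (e) {
      log.push(e.message);
    }
  };
  nodes.primaryUp = false;        // step 1: primary goes down
  attempt("query", true);         // step 2: slaveOk read succeeds
  attempt("getlasterror", false); // step 3: not allowed on slaves, fails
  attempt("query", true);         // step 4: even slaveOk reads now fail
  return log;
}
```

Calling simulateFailover() returns one log entry per attempt: the first query is served by the secondary, the getlasterror call fails, and the final query fails at connection time even though the secondary is still up.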

Comment by Greg Studer [ 04/Feb/11 ]

getLastError is called by default when the output of the previous command is undefined (the variable name "db" is actually hardcoded; if you use another variable for your db you won't get this behavior). The query gets a cursor, but there seems to be an issue populating the result variable from that cursor. Looking into it.
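
A rough sketch of the shell behavior described in this comment, for illustration only. The names printResult, scope and makeFakeDb are hypothetical; this is not the shell's actual source, just a model of "undefined result plus a variable named db triggers a getLastError call".

```javascript
// Toy model only: printResult, scope and makeFakeDb are hypothetical names.
function printResult(scope, result) {
  if (typeof result !== "undefined") {
    return "printed result";
  }
  // Only a variable literally named "db" triggers the follow-up call.
  if (scope.db && typeof scope.db.getLastError === "function") {
    scope.db.getLastError();
    return "called getLastError";
  }
  return "no follow-up";
}

// Fake database handle that counts how often getLastError is invoked.
function makeFakeDb() {
  const fake = { calls: 0 };
  fake.getLastError = () => {
    fake.calls += 1;
    return null;
  };
  return fake;
}
```

With a scope of { db: makeFakeDb() } and an undefined result the fake handle records one call; with the same handle stored under any other name, such as { mydb: ... }, no call is made.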

Comment by Eliot Horowitz (Inactive) [ 04/Feb/11 ]

The shell shouldn't call getLastError for a findOne() ...
It's only supposed to do that for writes?
Can you verify?

Comment by Greg Studer [ 03/Feb/11 ]

Error:

Thu Feb 3 12:55:41 uncaught exception: getlasterror failed: {
"assertion" : "DBClientBase::findOne: transport error: ubuntu:31100 query: { getlasterror: 1.0, w: 1.0 }",
"assertionCode" : 10276,
"errmsg" : "db assertion failure",
"ok" : 0
}

On subsequent requests to coll.findOne():

dbclient error communicating with server: ubuntu:31100

Think I've managed to track down what's happening: the query returns ok, but the subsequent default mongo shell call to getLastError fails to use slaveok. This causes the assertion error and somehow borks the connection for further queries. Hardcoding the slaveok flag in ClientInfo::getLastError seems to fix this, but I'm not sure it's the best solution.
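
To illustrate why forwarding the flag changes the outcome, here is a minimal sketch of host selection with and without slaveok. It is an assumption-laden toy: chooseHost is a hypothetical helper, ubuntu:31100 is assumed to be the downed primary because it appears in the error above, and ubuntu:31101 is a made-up secondary.

```javascript
// Toy model only: chooseHost and the secondary's host name are hypothetical.
function chooseHost(hosts, slaveOk) {
  for (const h of hosts) {
    // A host is eligible if it is up and is either the primary or slaveOk is set.
    if (h.up && (h.primary || slaveOk)) return h.name;
  }
  return null; // no permitted host is reachable
}

const exampleHosts = [
  { name: "ubuntu:31100", primary: true, up: false }, // assumed primary, down
  { name: "ubuntu:31101", primary: false, up: true }, // hypothetical secondary
];

const withoutFlag = chooseHost(exampleHosts, false); // null
const withFlag = chooseHost(exampleHosts, true);     // "ubuntu:31101"
```

Without the flag the getlasterror request has nowhere to go once the primary is down; with it, the request can land on the secondary. Whether hardcoding the flag is the right fix is left open in the comment itself.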

Comment by Eliot Horowitz (Inactive) [ 03/Feb/11 ]

Can you send output when you run this?
Seems to just hang for me.

Comment by Greg Studer [ 03/Feb/11 ]

Seems like a race condition: shutting down, then waiting, then querying sometimes works.
