[JAVA-231] Failed to retrieve any result when using SlaveOK with all slaves are down Created: 11/Dec/10  Updated: 17/Mar/11  Resolved: 16/Feb/11

Status: Closed
Project: Java Driver
Component/s: Cluster Management
Affects Version/s: 2.3
Fix Version/s: 2.5

Type: Bug Priority: Major - P3
Reporter: Joseph Wang Assignee: Antoine Girbal
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

[joseph.wang@lpsdb1.la2 ~]$ uname -a
Linux lpsdb1.la2.estalea.net 2.6.18-194.17.4.el5 #1 SMP Mon Oct 25 15:50:53 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
[joseph.wang@lpsdb1.la2 ~]$ ps auwww | grep mongo
knut 1400 0.0 0.0 83116 4728 pts/0 S+ Dec10 0:00 /usr/local/mongodb-linux-x86_64-1.6.3/bin/mongo localhost:4110
805 2044 0.0 0.0 61172 720 pts/2 S+ 13:02 0:00 grep mongo


Attachments: Text File BaseTableQueryEngine.java     Text File MongoConnnection.java     Java Archive File mongo.jar    

 Description   

Java driver 2.3.

We have 3 mongo servers. Each has 48 GB of RAM and 24 CPUs. The servers are running as a replica set.

[joseph.wang@lpsdb1.la2 ~]$ /usr/local/mongodb-linux-x86_64-1.6.3/bin/mongo localhost:4110
MongoDB shell version: 1.6.3
connecting to: localhost:4110/test
> rs.status()
{
	"set" : "prod",
	"date" : "Sat Dec 11 2010 13:04:28 GMT-0800 (PST)",
	"myState" : 1,
	"members" : [
		{ "_id" : 0, "name" : "lpsdb1.la2.estalea.net:4110", "health" : 1, "state" : 1, "self" : true },
		{ "_id" : 1, "name" : "mongo-prod-mem2.lps.la2.estalea.net:4110", "health" : 1, "state" : 2, "uptime" : 11095, "lastHeartbeat" : "Sat Dec 11 2010 13:04:27 GMT-0800 (PST)" },
		{ "_id" : 2, "name" : "mongo-prod-mem3.lps.la2.estalea.net:4110", "health" : 1, "state" : 2, "uptime" : 11178, "lastHeartbeat" : "Sat Dec 11 2010 13:04:26 GMT-0800 (PST)" }
	],
	"ok" : 1
}

The connection pool was connecting to all three servers. We manually brought down two slaves to see if we could still get query results (as part of fault-tolerance testing).
There were no updates/writes, just queries. When 1 slave was down, we had no problem getting query results. When 2 slaves were down, we got no query results.
When a consumer query comes in, we fork multiple threads, each responsible for fetching data from a specific collection.

MongoConnection.java shows our singleton connection pool code.
BaseTableQueryEngine.java shows our query to one of our collections.

As you can see, we set slaveOk at the query level.

if (db != null) {
    DBCollection coll = db.getCollection(currentCollection);
    DBCursor cur = null;

    cur = coll.find(dbQuery).addOption(Bytes.QUERYOPTION_SLAVEOK);

    DBObject dbObject = db.getLastError();
    if (dbObject != null && dbObject.get("err") != null) {
        log.warn("BaseTableQueryEngine: Encounter error for query " + dbQuery.toString());
        setSuccess(false);
    }

    if (enable_debug) {
        log.debug("BaseTableQueryEngine: Run query " + dbQuery.toString());
        log.debug("BaseTableQueryEngine: Found " + cur.count() + " in "
                + (System.currentTimeMillis() - fStart));
    }

    while (cur.hasNext()) {
        BasicDBObject dbo = (BasicDBObject) cur.next();
        BaseTableRow row = new BaseTableRow(dbo);
        if (row.isValid()) {
            tuples.add(row.getTuple());
        }

        long time = (timeout - (System.currentTimeMillis() - fStart));
        if (time < 0) {
            break;
        }
    }

    if (enable_debug) {
        log.debug("BaseTableQueryEngine: tuples " + tuples.size() + " in time "
                + (System.currentTimeMillis() - fStart));
    }
}
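For reference, the addOption call above simply sets the slaveOk bit in the OP_QUERY flags field of the wire protocol. A minimal sketch of the flag arithmetic (the constants mirror the Bytes.QUERYOPTION_* values in the 2.x Java driver):

```java
public class QueryOptionFlags {
    // OP_QUERY flag bits from the MongoDB wire protocol; these match the
    // Bytes.QUERYOPTION_* constants used by DBCursor.addOption in the 2.x driver.
    static final int QUERYOPTION_TAILABLE  = 1 << 1; // 2
    static final int QUERYOPTION_SLAVEOK   = 1 << 2; // 4: allow reads from secondaries
    static final int QUERYOPTION_NOTIMEOUT = 1 << 4; // 16

    public static void main(String[] args) {
        int options = 0;
        options |= QUERYOPTION_SLAVEOK; // what cursor.addOption(Bytes.QUERYOPTION_SLAVEOK) does
        System.out.println(options);                              // 4
        System.out.println((options & QUERYOPTION_SLAVEOK) != 0); // true
    }
}
```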



 Comments   
Comment by Antoine Girbal [ 16/Feb/11 ]

I tested this case and was able to read fine from last slave, with 2 servers down from replica set.
I am assuming that this issue stemmed from other bugs fixed earlier.

Comment by Antoine Girbal [ 13/Dec/10 ]

jar from trunk

Comment by Antoine Girbal [ 13/Dec/10 ]

this is most likely related to bug
http://jira.mongodb.org/browse/JAVA-225

basically the Java driver was ignoring the slaveOk option when looking for a master.
Joseph, could you try with the latest driver from trunk (I will also attach a jar you can use)?
Note that builds from trunk should only be used for testing.

Comment by Scott Hernandez (Inactive) [ 12/Dec/10 ]

SlaveOk means that queries can be sent to the slave, not that they must be, IMO.

Comment by Joseph Wang [ 12/Dec/10 ]

If there is a way to determine that all slaves are down, I don't mind reissuing the query without slave_ok so that it goes to the primary/master.

Comment by Eliot Horowitz (Inactive) [ 12/Dec/10 ]

Correct - slave_ok means reads hit slaves, and writes hit master.
slave_ok is only relevant for reads.

The correct thing is probably to read from the master if all slaves are down.

The only issue is with queries that are really slow, which you assume are going to run on a slave.

Comment by Joseph Wang [ 12/Dec/10 ]

My understanding from 2.2 driver fix was that SlaveOK meant querying slave for query, but still hit master for write/update.

When all slaves are down, we need to have an option to specify hitting master.
1) From getLastError(), how do we determine if all slaves are down? If we can detect this condition, I can remove SlaveOK.
2) If we cannot determine whether all slaves are down, is there an option I can specify to say it is acceptable to hit the master?
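Until the driver settles the semantics, one application-side workaround for (2) is to wrap the read in a fallback: issue the slaveOk query first and, if it fails because no secondary is reachable, reissue it without the option so it goes to the master. A minimal sketch; queryWithFallback and the Callable wiring are hypothetical glue, not driver API:

```java
import java.util.concurrent.Callable;

// Hypothetical helper, not part of the driver: try the slaveOk read first,
// then fall back to a plain read against the master if no slave answers.
public class SlaveFallback {
    static <T> T queryWithFallback(Callable<T> slaveQuery, Callable<T> masterQuery)
            throws Exception {
        try {
            // e.g. coll.find(q).addOption(Bytes.QUERYOPTION_SLAVEOK)
            return slaveQuery.call();
        } catch (Exception e) {
            // e.g. a MongoException when no secondary is reachable
            return masterQuery.call(); // plain coll.find(q), routed to the primary
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulated: both slaves are down, so the slave-side call throws.
        String rows = queryWithFallback(
                () -> { throw new RuntimeException("no secondary reachable"); },
                () -> "rows-from-master");
        System.out.println(rows); // prints "rows-from-master"
    }
}
```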

Comment by Eliot Horowitz (Inactive) [ 12/Dec/10 ]

It's a tad unclear.
The general contract of slave_ok is that only slaves will be used.
So I'm not 100% sure what the right thing is here.
Either is probably technically OK - most important is that it's well documented and consistent across all drivers.

Comment by Joseph Wang [ 11/Dec/10 ]

Yes, that would be desirable. If no slave is available, query the primary/master even if SLAVE_OK is set at the query/db/collection level.

Comment by Scott Hernandez (Inactive) [ 11/Dec/10 ]

It seems like if the non-master pool is empty then the master should be used, yes?
