[SERVER-5797] Uncaught exception in count_slaveok.js Created: 09/May/12  Updated: 11/Jul/16  Resolved: 07/Jun/12

Status: Closed
Project: Core Server
Component/s: Replication, Sharding
Affects Version/s: None
Fix Version/s: 2.1.2

Type: Bug Priority: Major - P3
Reporter: Ian Whalen (Inactive) Assignee: Randolph Tan
Resolution: Done Votes: 0
Labels: 212push, buildbot
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

OS X 10.5 32-bit


Issue Links:
Duplicate
duplicates SERVER-5344 Enable parallel execution of sharded ... Closed
Related
is related to SERVER-2435 Implement count() in parallel Closed
Operating System: ALL
Participants:

 Description   

 m30999| Tue May  8 19:47:36 [conn2] sharded connection to countSlaveOk-rs0/bs-mm0.local:31100,bs-mm0.local:31101 not being returned to the pool
Tue May  8 19:47:36 uncaught exception: count failed: {
	"errmsg" : "exception: ReplicaSetMonitor no master found for set: countSlaveOk-rs0",
	"code" : 10009,
	"ok" : 0
}
failed to load: /Users/mike/buildslaves/mongo/OS_X_105_32bit/mongo/jstests/sharding/count_slaveok.js

http://buildbot.mongodb.org/builders/OS%20X%2010.5%2032-bit/builds/3697/steps/test_9/logs/stdio



 Comments   
Comment by auto [ 08/Jun/12 ]

Author: Randolph Tan <randolph@10gen.com>

Message: SERVER-5797 Updated test to fix buildbot failure caused by a recent change in ShardingTest
Branch: master
https://github.com/mongodb/mongo/commit/3db2a58dcbf890adc97da97091de32da7d63fb0f

Comment by auto [ 07/Jun/12 ]

Author: Randolph Tan <randolph@10gen.com>

Message: SERVER-5797

Make ReplicaSetMonitor fail fast when no usable master is available.
Bypass shard version checking in mongos for slaveOk ops (short term hack).
Branch: master
https://github.com/mongodb/mongo/commit/c42e156843c0b57a9ec19e44293d8739ea7698da
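
The two changes named in this commit message can be pictured with a small sketch. This is only one reading of the message, not the code in the commit: `getMaster`, `runCount`, and the shape of the `state` object are hypothetical stand-ins, and the real logic lives in the mongos C++ sources. The set name and host names are taken from the log in the description.

```javascript
// Hypothetical stand-ins for illustration; not the actual mongos code.

// "Fail fast": throw immediately when the monitor knows of no usable master,
// instead of retrying or waiting.
function getMaster(state) {
    if (state.master === null) {
        throw new Error("ReplicaSetMonitor no master found for set: " + state.setName);
    }
    return state.master;
}

// "Bypass shard version checking for slaveOk ops": only operations that are
// not slaveOk go through the step that needs the master.
function runCount(state, options) {
    if (!options.slaveOk) {
        getMaster(state); // models the version check, which needs the primary
    }
    const target = options.slaveOk ? state.secondaries[0] : state.master;
    return { target: target, n: state.docCount };
}

// A two-member set whose primary is down, as in count_slaveok.js.
const state = {
    setName: "countSlaveOk-rs0",
    master: null,
    secondaries: ["bs-mm0.local:31101"],
    docCount: 30
};

const slaveOkResult = runCount(state, { slaveOk: true });

let nonSlaveOkError = null;
try {
    runCount(state, { slaveOk: false });
} catch (e) {
    nonSlaveOkError = e.message;
}
console.log(slaveOkResult, nonSlaveOkError);
```

In this model the slaveOk count is answered by the secondary even though no master exists, while the non-slaveOk count fails right away with the error message seen in the build log.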

Comment by Randolph Tan [ 01/Jun/12 ]

Similar problem, triggered in a different code path. To make this 100% reproducible, simply comment out the code that spawns the ReplicaSetMonitorWatcher.

Comment by Ian Whalen (Inactive) [ 01/Jun/12 ]

Same failure just showed up again: http://buildbot.mongodb.org/builders/Windows%2064-bit%202008%2B/builds/366/steps/test_9/logs/stdio

Comment by auto [ 22/May/12 ]

Author: Randolph Tan <randolph@10gen.com>

Message: SERVER-5797 Uncaught exception in count_slaveok.js

Rewrote the count command in mongos to use ShardStrategy::commandOP and push handling of StaleConfigException to the caller.

Also modified ParallelSortClusteredCursor to handle SyncClusterConnection not allowing the call method to be used on commands.
Branch: master
https://github.com/mongodb/mongo/commit/4a9a294376f6d4d599244154e0b1e49d0172aed7

Comment by Randolph Tan [ 10/May/12 ]

Cause:

The primary of a two-member replica set is down (this is part of the test), and checkShardVersion calls ReplicaSetMonitor::getMaster, which asserts because there is no master. The failure is intermittent: it depends on whether the ReplicaSetMonitorWatcher has already noticed that the primary is down. To reproduce it reliably, insert a sleep in the test:

conn.setSlaveOk()
 
sleep( 20000 ); // <------------------ insert sleep here
 
// Should throw exception, since not slaveOk'd
assert.eq( 30, coll.find({ i : 0 }).count() )

Because of this bug, you can't run queries or commands on a replica set shard that has no master.
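
The timing dependence described in this comment can be shown with a toy model. It is a sketch under assumptions, not the mongos implementation: `ReplicaSetMonitorModel`, `killPrimary`, `watcherRefresh`, and `slaveOkCount` are hypothetical names, and the model assumes that a stale cached master lets the version check pass until the watcher refreshes its view.

```javascript
// Toy model only: hypothetical stand-ins, not the real mongos C++ code.
class ReplicaSetMonitorModel {
    constructor(setName, master) {
        this.setName = setName;
        this.actualMaster = master;  // ground truth about the set
        this.cachedMaster = master;  // what the monitor currently believes
    }

    // The primary goes down; the cached view is not updated yet.
    killPrimary() {
        this.actualMaster = null;
    }

    // Stand-in for a ReplicaSetMonitorWatcher pass that refreshes the view.
    watcherRefresh() {
        this.cachedMaster = this.actualMaster;
    }

    // Stand-in for ReplicaSetMonitor::getMaster.
    getMaster() {
        if (this.cachedMaster === null) {
            throw new Error("ReplicaSetMonitor no master found for set: " + this.setName);
        }
        return this.cachedMaster;
    }
}

// Stand-in for a slaveOk count routed through mongos. The version check
// asks for the master even though the read itself could go to a secondary.
function slaveOkCount(monitor, docsOnSecondary) {
    monitor.getMaster();     // models the checkShardVersion dependency
    return docsOnSecondary;  // models the secondary answering the count
}

const monitor = new ReplicaSetMonitorModel("countSlaveOk-rs0", "bs-mm0.local:31100");
monitor.killPrimary();

// Watcher has not run yet: the stale view hides the problem.
const before = slaveOkCount(monitor, 30);

// Watcher has run (what the 20-second sleep allows): the count now fails.
monitor.watcherRefresh();
let failure = null;
try {
    slaveOkCount(monitor, 30);
} catch (e) {
    failure = e.message;
}
console.log(before, failure);
```

In this model the same count succeeds or fails depending only on whether the refresh ran first, which matches the intermittent behavior seen on the buildbot.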
