|
Summary of what happened:
Passing case:
1. Primary noticed that two other nodes are down, relinquishes primary status and closes all connections.
2. When RSM::_check is called from RSMW, the _getConnWithRefresh method calls the isStillConnected method and realizes that the connection is broken so it creates one.
3. _master variable (index for _master) still points to 0.
4. getVersionable would use the RSM::masterConn method to get the same master conn at #2.
Failing case:
1. Primary noticed that two other nodes are down, relinquishes primary status and closes all connections.
2. getVersionable would use the RSM::masterConn method to get the master conn.
3. It gets a socket exception when it tried to use it since the connection is already broken -> this connection is now in failed state.
4. The next time RSM checks the _master connection it notices that it is failed state so it calls the notifyFailure method, setting _master index to -1.
5. Once _master is -1, RSM will always raise the 10009 assertion when calling checkMaster until a primary is elected in the set. As a consequence, clients won't be able to issue slaveOk queries against mongos at this point. This is the issue described in SERVER-7246.
The fix in SERVER-10453 changed the connect method from needing to acquire a mutex to none so it affected the timing with other threads, notably the RSMW and made it easier to fall on to the failure case.
|
|
Reverted 961f3761203d4c1e78f44b898b6bb2cf2b98d00b - this issue was being tickled by SERVER-10453.
Overall we need comprehensive testing of sharding's interaction with replica sets with no master, preferably before any other sharding tests are run at all. 0_replica_set_basics.js
|
|
Lots of Buildbot failures due to this last night. I counted 8.
8x shard_insert_getlasterror_w2.js
On Linux 32-bit, Linux RHEL 32-bit, Linux 64-bit, OSX 64-bit,
These all failed with one of two messages:
error: {
|
"$err" : "ReplicaSetMonitor no master found for set: repset1",
|
"code" : 10009
|
}
|
error: {
|
"$err" : "can't authenticate to server repset1/127.0.0.1:31000,127.0.0.1:31001,127.0.0.1:31002",
|
"code" : 15847
|
}
|
|
|
Perhaps I should have created a new ticket rather than reopening this one. But shard_insert_getlasterror_w2.js is failing on multiple MCI builders.
renctan and spencer - please diagnose these failures ASAP in the morning!
master rev 3696ac227caa, Wed Aug 21
Failures without auth
http://mci.10gen.com/ui/task/mongodb_mongo_master_cxx11_ubuntu1204_64_3696ac227caa1dd04aa133ad202b8250ebd7c02d_13_08_21_23_33_07_sharding_ubuntu1204
http://buildlogs.mongodb.org/mci_0.9_cxx11_ubuntu1204_64/builds/605/test/sharding_0/shard_insert_getlasterror_w2.js
error: {
|
"$err" : "ReplicaSetMonitor no master found for set: repset1",
|
"code" : 10009
|
}
|
assert.soon failed: function () { try {
|
db.foo.find().next();
|
}
|
catch(e) {
|
print(e);
|
return false;
|
}
|
return true;
|
}, msg:Queries took too long to complete correctly.
|
Error: Printing Stack Trace
|
at printStackTrace (src/mongo/shell/utils.js:37:15)
|
at doassert (src/mongo/shell/assert.js:6:5)
|
at Function.assert.soon (src/mongo/shell/assert.js:174:60)
|
at go (/data/mci/git@github.commongodb/mongo.git/master/jstests/sharding/shard_insert_getlasterror_w2.js:71:12)
|
at /data/mci/git@github.commongodb/mongo.git/master/jstests/sharding/shard_insert_getlasterror_w2.js:91:1
|
2013-08-22T01:13:26.095+0000 assert.soon failed: function () { try {
|
db.foo.find().next();
|
}
|
catch(e) {
|
print(e);
|
return false;
|
}
|
return true;
|
}, msg:Queries took too long to complete correctly. at src/mongo/shell/assert.js:7
|
failed to load: /data/mci/git@github.commongodb/mongo.git/master/jstests/sharding/shard_insert_getlasterror_w2.js
|
http://mci.10gen.com/ui/task/mongodb_mongo_master_linux_64_debug_3696ac227caa1dd04aa133ad202b8250ebd7c02d_13_08_21_23_33_07_sharding_linux_64
http://buildlogs.mongodb.org/mci_0.9_linux_64_debug/builds/605/test/sharding_0/shard_insert_getlasterror_w2.js
error: {
|
"$err" : "ReplicaSetMonitor no master found for set: repset1",
|
"code" : 10009
|
}
|
assert.soon failed: function () { try {
|
db.foo.find().next();
|
}
|
catch(e) {
|
print(e);
|
return false;
|
}
|
return true;
|
}, msg:Queries took too long to complete correctly.
|
Error: Printing Stack Trace
|
at printStackTrace (src/mongo/shell/utils.js:37:15)
|
at doassert (src/mongo/shell/assert.js:6:5)
|
at Function.assert.soon (src/mongo/shell/assert.js:174:60)
|
at go (/data/mci/git@github.commongodb/mongo.git/master/jstests/sharding/shard_insert_getlasterror_w2.js:71:12)
|
at /data/mci/git@github.commongodb/mongo.git/master/jstests/sharding/shard_insert_getlasterror_w2.js:91:1
|
2013-08-22T01:26:02.156+0000 assert.soon failed: function () { try {
|
db.foo.find().next();
|
}
|
catch(e) {
|
print(e);
|
return false;
|
}
|
return true;
|
}, msg:Queries took too long to complete correctly. at src/mongo/shell/assert.js:7
|
failed to load: /data/mci/git@github.commongodb/mongo.git/master/jstests/sharding/shard_insert_getlasterror_w2.js
|
Failures with auth
http://mci.10gen.com/ui/task/mongodb_mongo_master_linux_64_3696ac227caa1dd04aa133ad202b8250ebd7c02d_13_08_21_23_33_07_sharding_auth_linux_64
http://buildlogs.mongodb.org/mci_0.9_linux_64/builds/605/test/sharding_auth_0/shard_insert_getlasterror_w2.js
error: {
|
"$err" : "can't authenticate to server repset1/127.0.0.1:31000,127.0.0.1:31001,127.0.0.1:31002",
|
"code" : 15847
|
}
|
assert.soon failed: function () { try {
|
db.foo.find().next();
|
}
|
catch(e) {
|
print(e);
|
return false;
|
}
|
return true;
|
}, msg:Queries took too long to complete correctly.
|
Error: Printing Stack Trace
|
at printStackTrace (src/mongo/shell/utils.js:37:15)
|
at doassert (src/mongo/shell/assert.js:6:5)
|
at Function.assert.soon (src/mongo/shell/assert.js:174:60)
|
at go (/data/mci/git@github.commongodb/mongo.git/master/jstests/sharding/shard_insert_getlasterror_w2.js:71:12)
|
at /data/mci/git@github.commongodb/mongo.git/master/jstests/sharding/shard_insert_getlasterror_w2.js:91:1
|
2013-08-22T01:46:48.357+0000 assert.soon failed: function () { try {
|
db.foo.find().next();
|
}
|
catch(e) {
|
print(e);
|
return false;
|
}
|
return true;
|
}, msg:Queries took too long to complete correctly. at src/mongo/shell/assert.js:7
|
failed to load: /data/mci/git@github.commongodb/mongo.git/master/jstests/sharding/shard_insert_getlasterror_w2.js
|
http://mci.10gen.com/ui/task/mongodb_mongo_master_enterprise_ubuntu1204_64_3696ac227caa1dd04aa133ad202b8250ebd7c02d_13_08_21_23_33_07_sharding_auth_ubuntu1204
http://buildlogs.mongodb.org/mci_0.9_enterprise_ubuntu1204_64/builds/605/test/sharding_auth_0/shard_insert_getlasterror_w2.js
error: {
|
"$err" : "can't authenticate to server repset1/127.0.0.1:31000,127.0.0.1:31001,127.0.0.1:31002",
|
"code" : 15847
|
}
|
assert.soon failed: function () { try {
|
db.foo.find().next();
|
}
|
catch(e) {
|
print(e);
|
return false;
|
}
|
return true;
|
}, msg:Queries took too long to complete correctly.
|
Error: Printing Stack Trace
|
at printStackTrace (src/mongo/shell/utils.js:37:15)
|
at doassert (src/mongo/shell/assert.js:6:5)
|
at Function.assert.soon (src/mongo/shell/assert.js:174:60)
|
at go (/data/mci/git@github.commongodb/mongo.git/master/jstests/sharding/shard_insert_getlasterror_w2.js:71:12)
|
at /data/mci/git@github.commongodb/mongo.git/master/jstests/sharding/shard_insert_getlasterror_w2.js:91:1
|
2013-08-22T02:16:30.322+0000 assert.soon failed: function () { try {
|
db.foo.find().next();
|
}
|
catch(e) {
|
print(e);
|
return false;
|
}
|
return true;
|
}, msg:Queries took too long to complete correctly. at src/mongo/shell/assert.js:7
|
failed to load: /data/mci/git@github.commongodb/mongo.git/master/jstests/sharding/shard_insert_getlasterror_w2.js
|
|
|
Author:
{u'username': u'milkie', u'name': u'Eric Milkie', u'email': u'milkie@10gen.com'}
Message: Revert "SERVER-8689, SERVER-8691"
This reverts commit c2db241639fc9c26c8dfb773d29a8d2eac6e32cc.
" Temporarily disable flaky tests shard_insert_getlasterror_w2.js and
replReads.js"
Branch: master
https://github.com/mongodb/mongo/commit/42e79743e335e676f560b103da3ce73b5537a468
|
|
Author:
{u'date': u'2013-02-24T12:15:20Z', u'name': u'Dan Pasette', u'email': u'dan@10gen.com'}
Message: SERVER-8689, SERVER-8691
Temporarily disable flaky tests shard_insert_getlasterror_w2.js and
replReads.js
Branch: v2.2
https://github.com/mongodb/mongo/commit/0b6b3dd955cfb6a09f509a5038208669981f409e
|
|
Author:
{u'date': u'2013-02-24T12:15:20Z', u'name': u'Dan Pasette', u'email': u'dan@10gen.com'}
Message: SERVER-8689, SERVER-8691
Temporarily disable flaky tests shard_insert_getlasterror_w2.js and
replReads.js
Branch: master
https://github.com/mongodb/mongo/commit/c2db241639fc9c26c8dfb773d29a8d2eac6e32cc
|
|
There are a number of issues manifesting here -
1) There is SERVER-8690 - this causes the test to pass more often than would otherwise be expected, due to SERVER-7246.
2) There's a race condition in the test itself, in that the primary node is not necessarily guaranteed to step down before we start querying it - this apparently often happens on faster systems and causes the test to pass immediately.
3) There may be other codepaths in the replica set refresh logic that refresh the node at an indeterminate time, for example, while we're in a query that is refreshing the nodes itself.
|
Generated at Thu Feb 08 03:18:06 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.