[SERVER-8689] jstests/sharding/shard_insert_getlasterror_w2.js is failing Created: 23/Feb/13  Updated: 10/Dec/14  Resolved: 05/Sep/13

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Tad Marshall Assignee: Randolph Tan
Resolution: Duplicate Votes: 0
Labels: buildbot
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-7246 Mongos cannot do slaveOk queries when... Closed
Operating System: ALL
Participants:

 Description   

jstests/sharding/shard_insert_getlasterror_w2.js is failing on multiple platforms. Indications are that this may be a race condition; the failures do not seem to happen on fast machines. On Windows, they have been seen so far only in debug builds, though on Linux they have been seen in release builds.

Some failures:
http://buildbot.mongodb.org/builders/Nightly%20Linux%2032-bit/builds/1228/steps/test_11/logs/stdio
http://buildlogs.mongodb.org/Nightly%20Linux%2032-bit/builds/1228/test/sharding/shard_insert_getlasterror_w2.js

http://buildbot.mongodb.org/builders/Nightly%20Linux%20RHEL%2032-bit/builds/398/steps/test_11/logs/stdio
http://buildlogs.mongodb.org/Nightly%20Linux%20RHEL%2032-bit/builds/398/test/sharding/shard_insert_getlasterror_w2.js

http://buildbot.mongodb.org/builders/Nightly%20Linux%2032-bit/builds/1226/steps/test_11/logs/stdio
http://buildlogs.mongodb.org/Nightly%20Linux%2032-bit/builds/1226/test/sharding/shard_insert_getlasterror_w2.js

http://buildbot.mongodb.org/builders/Windows%2064-bit%202008%2B%20DEBUG/builds/637/steps/test_11/logs/stdio
http://buildlogs.mongodb.org/Windows%2064-bit%202008%2B%20DEBUG/builds/637/test/sharding/shard_insert_getlasterror_w2.js

http://buildbot.mongodb.org/builders/Windows%2064-bit%20DEBUG/builds/596/steps/test_11/logs/stdio
http://buildlogs.mongodb.org/Windows%2064-bit%20DEBUG/builds/596/test/sharding/shard_insert_getlasterror_w2.js



 Comments   
Comment by Randolph Tan [ 23/Aug/13 ]

Summary of what happened:

Passing case:
1. Primary noticed that two other nodes are down, relinquishes primary status and closes all connections.
2. When RSM::_check is called from RSMW, the _getConnWithRefresh method calls the isStillConnected method and realizes that the connection is broken so it creates one.
3. _master variable (index for _master) still points to 0.
4. getVersionable would use the RSM::masterConn method to get the same master conn at #2.

Failing case:
1. Primary noticed that two other nodes are down, relinquishes primary status and closes all connections.
2. getVersionable would use the RSM::masterConn method to get the master conn.
3. It gets a socket exception when it tried to use it since the connection is already broken -> this connection is now in failed state.
4. The next time RSM checks the _master connection it notices that it is failed state so it calls the notifyFailure method, setting _master index to -1.
5. Once _master is -1, RSM will always raise the 10009 assertion when calling checkMaster until a primary is elected in the set. As a consequence, clients won't be able to issue slaveOk queries against mongos at this point. This is the issue described in SERVER-7246.

The fix in SERVER-10453 changed the connect method from needing to acquire a mutex to none so it affected the timing with other threads, notably the RSMW and made it easier to fall on to the failure case.

Comment by Greg Studer [ 22/Aug/13 ]

Reverted 961f3761203d4c1e78f44b898b6bb2cf2b98d00b - this issue was being tickled by SERVER-10453.

Overall we need comprehensive testing of sharding's interaction with replica sets with no master, preferably before any other sharding tests are run at all. 0_replica_set_basics.js

Comment by Matt Kangas [ 22/Aug/13 ]

Lots of Buildbot failures due to this last night. I counted 8.

8x shard_insert_getlasterror_w2.js
On Linux 32-bit, Linux RHEL 32-bit, Linux 64-bit, OSX 64-bit,

These all failed with one of two messages:

	error: {
		"$err" : "ReplicaSetMonitor no master found for set: repset1",
		"code" : 10009
	}

  • OR -

	error: {
		"$err" : "can't authenticate to server repset1/127.0.0.1:31000,127.0.0.1:31001,127.0.0.1:31002",
		"code" : 15847
	}

Comment by Matt Kangas [ 22/Aug/13 ]

Perhaps I should have created a new ticket rather than reopening this one. But shard_insert_getlasterror_w2.js is failing on multiple MCI builders.

renctan and spencer - please diagnose these failures ASAP in the morning!

master rev 3696ac227caa, Wed Aug 21

Failures without auth

http://mci.10gen.com/ui/task/mongodb_mongo_master_cxx11_ubuntu1204_64_3696ac227caa1dd04aa133ad202b8250ebd7c02d_13_08_21_23_33_07_sharding_ubuntu1204
http://buildlogs.mongodb.org/mci_0.9_cxx11_ubuntu1204_64/builds/605/test/sharding_0/shard_insert_getlasterror_w2.js

error: {
	"$err" : "ReplicaSetMonitor no master found for set: repset1",
	"code" : 10009
}
assert.soon failed: function () { try {
                db.foo.find().next();
            }
            catch(e) {
                print(e);
                return false;
            }
            return true;
        }, msg:Queries took too long to complete correctly.
Error: Printing Stack Trace
    at printStackTrace (src/mongo/shell/utils.js:37:15)
    at doassert (src/mongo/shell/assert.js:6:5)
    at Function.assert.soon (src/mongo/shell/assert.js:174:60)
    at go (/data/mci/git@github.commongodb/mongo.git/master/jstests/sharding/shard_insert_getlasterror_w2.js:71:12)
    at /data/mci/git@github.commongodb/mongo.git/master/jstests/sharding/shard_insert_getlasterror_w2.js:91:1
2013-08-22T01:13:26.095+0000 assert.soon failed: function () { try {
                db.foo.find().next();
            }
            catch(e) {
                print(e);
                return false;
            }
            return true;
        }, msg:Queries took too long to complete correctly. at src/mongo/shell/assert.js:7
failed to load: /data/mci/git@github.commongodb/mongo.git/master/jstests/sharding/shard_insert_getlasterror_w2.js

http://mci.10gen.com/ui/task/mongodb_mongo_master_linux_64_debug_3696ac227caa1dd04aa133ad202b8250ebd7c02d_13_08_21_23_33_07_sharding_linux_64
http://buildlogs.mongodb.org/mci_0.9_linux_64_debug/builds/605/test/sharding_0/shard_insert_getlasterror_w2.js

error: {
	"$err" : "ReplicaSetMonitor no master found for set: repset1",
	"code" : 10009
}
assert.soon failed: function () { try {
                db.foo.find().next();
            }
            catch(e) {
                print(e);
                return false;
            }
            return true;
        }, msg:Queries took too long to complete correctly.
Error: Printing Stack Trace
    at printStackTrace (src/mongo/shell/utils.js:37:15)
    at doassert (src/mongo/shell/assert.js:6:5)
    at Function.assert.soon (src/mongo/shell/assert.js:174:60)
    at go (/data/mci/git@github.commongodb/mongo.git/master/jstests/sharding/shard_insert_getlasterror_w2.js:71:12)
    at /data/mci/git@github.commongodb/mongo.git/master/jstests/sharding/shard_insert_getlasterror_w2.js:91:1
2013-08-22T01:26:02.156+0000 assert.soon failed: function () { try {
                db.foo.find().next();
            }
            catch(e) {
                print(e);
                return false;
            }
            return true;
        }, msg:Queries took too long to complete correctly. at src/mongo/shell/assert.js:7
failed to load: /data/mci/git@github.commongodb/mongo.git/master/jstests/sharding/shard_insert_getlasterror_w2.js

Failures with auth

http://mci.10gen.com/ui/task/mongodb_mongo_master_linux_64_3696ac227caa1dd04aa133ad202b8250ebd7c02d_13_08_21_23_33_07_sharding_auth_linux_64
http://buildlogs.mongodb.org/mci_0.9_linux_64/builds/605/test/sharding_auth_0/shard_insert_getlasterror_w2.js

error: {
	"$err" : "can't authenticate to server repset1/127.0.0.1:31000,127.0.0.1:31001,127.0.0.1:31002",
	"code" : 15847
}
assert.soon failed: function () { try {
                db.foo.find().next();
            }
            catch(e) {
                print(e);
                return false;
            }
            return true;
        }, msg:Queries took too long to complete correctly.
Error: Printing Stack Trace
    at printStackTrace (src/mongo/shell/utils.js:37:15)
    at doassert (src/mongo/shell/assert.js:6:5)
    at Function.assert.soon (src/mongo/shell/assert.js:174:60)
    at go (/data/mci/git@github.commongodb/mongo.git/master/jstests/sharding/shard_insert_getlasterror_w2.js:71:12)
    at /data/mci/git@github.commongodb/mongo.git/master/jstests/sharding/shard_insert_getlasterror_w2.js:91:1
2013-08-22T01:46:48.357+0000 assert.soon failed: function () { try {
                db.foo.find().next();
            }
            catch(e) {
                print(e);
                return false;
            }
            return true;
        }, msg:Queries took too long to complete correctly. at src/mongo/shell/assert.js:7
failed to load: /data/mci/git@github.commongodb/mongo.git/master/jstests/sharding/shard_insert_getlasterror_w2.js

http://mci.10gen.com/ui/task/mongodb_mongo_master_enterprise_ubuntu1204_64_3696ac227caa1dd04aa133ad202b8250ebd7c02d_13_08_21_23_33_07_sharding_auth_ubuntu1204
http://buildlogs.mongodb.org/mci_0.9_enterprise_ubuntu1204_64/builds/605/test/sharding_auth_0/shard_insert_getlasterror_w2.js

error: {
	"$err" : "can't authenticate to server repset1/127.0.0.1:31000,127.0.0.1:31001,127.0.0.1:31002",
	"code" : 15847
}
assert.soon failed: function () { try {
                db.foo.find().next();
            }
            catch(e) {
                print(e);
                return false;
            }
            return true;
        }, msg:Queries took too long to complete correctly.
Error: Printing Stack Trace
    at printStackTrace (src/mongo/shell/utils.js:37:15)
    at doassert (src/mongo/shell/assert.js:6:5)
    at Function.assert.soon (src/mongo/shell/assert.js:174:60)
    at go (/data/mci/git@github.commongodb/mongo.git/master/jstests/sharding/shard_insert_getlasterror_w2.js:71:12)
    at /data/mci/git@github.commongodb/mongo.git/master/jstests/sharding/shard_insert_getlasterror_w2.js:91:1
2013-08-22T02:16:30.322+0000 assert.soon failed: function () { try {
                db.foo.find().next();
            }
            catch(e) {
                print(e);
                return false;
            }
            return true;
        }, msg:Queries took too long to complete correctly. at src/mongo/shell/assert.js:7
failed to load: /data/mci/git@github.commongodb/mongo.git/master/jstests/sharding/shard_insert_getlasterror_w2.js

Comment by auto [ 23/May/13 ]

Author:

{u'username': u'milkie', u'name': u'Eric Milkie', u'email': u'milkie@10gen.com'}

Message: Revert "SERVER-8689, SERVER-8691"

This reverts commit c2db241639fc9c26c8dfb773d29a8d2eac6e32cc.
" Temporarily disable flaky tests shard_insert_getlasterror_w2.js and
replReads.js"
Branch: master
https://github.com/mongodb/mongo/commit/42e79743e335e676f560b103da3ce73b5537a468

Comment by auto [ 12/Mar/13 ]

Author:

{u'date': u'2013-02-24T12:15:20Z', u'name': u'Dan Pasette', u'email': u'dan@10gen.com'}

Message: SERVER-8689, SERVER-8691

Temporarily disable flaky tests shard_insert_getlasterror_w2.js and
replReads.js
Branch: v2.2
https://github.com/mongodb/mongo/commit/0b6b3dd955cfb6a09f509a5038208669981f409e

Comment by auto [ 24/Feb/13 ]

Author:

{u'date': u'2013-02-24T12:15:20Z', u'name': u'Dan Pasette', u'email': u'dan@10gen.com'}

Message: SERVER-8689, SERVER-8691

Temporarily disable flaky tests shard_insert_getlasterror_w2.js and
replReads.js
Branch: master
https://github.com/mongodb/mongo/commit/c2db241639fc9c26c8dfb773d29a8d2eac6e32cc

Comment by Greg Studer [ 24/Feb/13 ]

There are a number of issues manifesting here -

1) There is SERVER-8690 - this causes the test to pass more often than would otherwise be expected, due to SERVER-7246.
2) There's a race condition in the test itself, in that the primary node is not necessarily guaranteed to step down before we start querying it - this apparently often happens on faster systems and causes the test to pass immediately.
3) There may be other codepaths in the replica set refresh logic that refresh the node at an indeterminate time, for example, while we're in a query that is refreshing the nodes itself.

Generated at Thu Feb 08 03:18:06 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.