[SERVER-10111] buildbot-special: sharding/remove2.js failing on Nightly Linux 64-bit SSL Amazon AMI
Created: 05/Jul/13  Updated: 11/Jul/16  Resolved: 13/Jul/13

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: 2.5.1

Type: Bug
Priority: Major - P3
Reporter: Matt Kangas
Assignee: Greg Studer
Resolution: Done
Votes: 0
Labels: buildbot
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

buildbot-special: Nightly Linux 64-bit SSL Amazon AMI
buildbot-special: Nightly Linux 64-bit Subscription Ubuntu 1204


Issue Links:
Duplicate
is duplicated by SERVER-10120 sharding/remove2.js failed on OS X 10... Closed
Related
related to SERVER-6071 Use command on local.slaves instead o... Closed
Operating System: ALL
Steps To Reproduce:

scons -j1 --no-glibc-check --ssl --distmod=amzn64-ssl --sharedclient --release mongosTest smokeSharding

Participants:

 Description   

Failed on the two most recent builds:

Jul 05 02:56	5f949c19a260...	failure	#541	Failed test_4
Jul 04 09:29	5f949c19a260...	failure	#540	Failed test_4
Jul 03 22:51	5f949c19a260...	success	#539	Build successful

Build #540 July 04

http://buildlogs.mongodb.org/Nightly%20Linux%2064-bit%20SSL%20Amazon%20AMI/builds/540/test/sharding/remove2.js?mode=raw

assert.soon failed: function () {
        res = st.admin.runCommand( { removeshard: replTest.name } );
        printjson(res);
        return res.ok && res.msg == 'removeshard completed successfully';
    }, msg:failed to remove shard
Error: Printing Stack Trace
    at printStackTrace (src/mongo/shell/utils.js:37:15)
    at doassert (src/mongo/shell/assert.js:6:5)
    at Function.assert.soon (src/mongo/shell/assert.js:174:60)
    at removeShard (/data/buildslaves/Linux_64bit_SSL_Amazon_AMI_Nightly/mongo/jstests/sharding/remove2.js:22:12)
    at /data/buildslaves/Linux_64bit_SSL_Amazon_AMI_Nightly/mongo/jstests/sharding/remove2.js:139:1
Thu Jul  4 21:16:25.396 assert.soon failed: function () {
        res = st.admin.runCommand( { removeshard: replTest.name } );
        printjson(res);
        return res.ok && res.msg == 'removeshard completed successfully';
    }, msg:failed to remove shard at src/mongo/shell/assert.js:7
failed to load: /data/buildslaves/Linux_64bit_SSL_Amazon_AMI_Nightly/mongo/jstests/sharding/remove2.js

Build #541 (July 05) fails in the same way.



 Comments   
Comment by J Rassi [ 13/Jul/13 ]

Confirmed that the SERVER-6071 commits introduced this failure. As of today, those commits have been reverted in master (6486b403). sharding/remove2.js now passes on this machine, and I've re-enabled the buildslave.

Looking further into the failure, I discovered that slaveTracking::_slaves on the primary would sometimes list the secondary as one oplog entry behind the primary, long after the secondary had fully caught up. As a result, the cleanup thread was never signaled to continue. As described in my earlier comment, this stalled the secondary-throttle mechanism and made migrations extremely slow. It seems that the SERVER-6071 changes would sometimes cause the primary to miss the secondary's updates on its position in the oplog.
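
To illustrate the stale-position problem described above, here is a minimal sketch in JavaScript. It is not the server's C++ implementation: the `SlaveTracker` class, its methods, and the integer oplog positions are all hypothetical names and simplifications chosen for illustration. It shows only the general shape of the bug: a waiter is resumed when a position report arrives, so a report that never reaches the primary leaves the waiter blocked even though the secondary has caught up.

```javascript
// Hypothetical model; class and method names are illustrative, not MongoDB's.
class SlaveTracker {
    constructor() {
        this.positions = new Map();   // secondary id -> last reported oplog position
        this.waiters = [];            // { target, resume } entries for blocked threads
    }

    // Called when a secondary reports how far it has applied the oplog.
    reportPosition(secondaryId, position) {
        this.positions.set(secondaryId, position);
        this.wakeSatisfiedWaiters();
    }

    // Registers a callback to run once every known secondary reaches `target`.
    waitForReplication(target, resume) {
        this.waiters.push({ target, resume });
        this.wakeSatisfiedWaiters();
    }

    // Resumes every waiter whose target has been reached; keeps the rest queued.
    wakeSatisfiedWaiters() {
        const minPosition = Math.min(...this.positions.values());
        const pending = this.waiters;
        this.waiters = [];
        for (const waiter of pending) {
            if (minPosition >= waiter.target) {
                waiter.resume();
            } else {
                this.waiters.push(waiter);
            }
        }
    }
}

const tracker = new SlaveTracker();
tracker.reportPosition("secondary", 9);

let resumed = false;
tracker.waitForReplication(10, () => { resumed = true; });

// Suppose the secondary has applied entry 10 but that report was missed:
// the primary still records position 9, so the waiter stays blocked.
console.log(resumed);   // false

// Only when a report actually arrives is the waiter resumed.
tracker.reportPosition("secondary", 10);
console.log(resumed);   // true
```

In this model nothing re-checks the recorded position on its own, so a single missed report is enough to leave the waiter queued until some other mechanism, such as a timeout, wakes it.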

Comment by J Rassi [ 12/Jul/13 ]

My analysis: while the cleanup is waiting for secondary-throttle replication, the thread queues itself up on the _threadsWaitingForReplication condition variable, but sometimes doesn't get signaled even when the secondary has caught up. The thread only gets woken up after the one-minute timed wait, which exceeds the 30-second assert.soon() default, generating the test failure. I would guess that this failure is related to the recent slaveTracking changes. It's not clear to me yet whether the lack of signaling is due to a deadlock or some other timing-related bug.
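
The timing argument above can be sketched with a simulated clock. This is an illustration only: the function names and the 200 ms polling interval are assumptions, and the only figures taken from the analysis are the one-minute timed wait and the 30-second assert.soon() default. The sketch compares when the waiting thread resumes against the deadline of a polling assertion.

```javascript
// Hypothetical simulation: none of these names come from the MongoDB source.
const ASSERT_SOON_TIMEOUT_MS = 30 * 1000;   // assert.soon() default cited above
const TIMED_WAIT_MS = 60 * 1000;            // one-minute timed wait cited above
const POLL_INTERVAL_MS = 200;               // assumed polling interval

// Returns the simulated time at which the waiting thread resumes.
// If the signal is delivered, the thread wakes when the secondary catches up;
// if it is missed, the thread sleeps until the timed wait expires.
function wakeTimeMs(caughtUpAtMs, signalDelivered) {
    return signalDelivered ? caughtUpAtMs : TIMED_WAIT_MS;
}

// Polls a condition against a simulated clock, in the style of assert.soon().
// Returns true if the condition became true before the timeout.
function pollUntil(conditionAt, timeoutMs, intervalMs) {
    for (let now = 0; now <= timeoutMs; now += intervalMs) {
        if (conditionAt(now)) {
            return true;
        }
    }
    return false;
}

// Models the test's wait: does the work finish before the polling deadline?
function removeShardSucceeds(caughtUpAtMs, signalDelivered) {
    const doneAt = wakeTimeMs(caughtUpAtMs, signalDelivered);
    return pollUntil((now) => now >= doneAt, ASSERT_SOON_TIMEOUT_MS, POLL_INTERVAL_MS);
}

console.log(removeShardSucceeds(500, true));    // true: signaled promptly
console.log(removeShardSucceeds(500, false));   // false: woken only after 60 s
```

Under these assumptions, a missed signal always pushes completion past the polling deadline, which matches the test failing consistently rather than merely running slowly.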

I've taken the bs-e-amzn64-2 buildslave process offline, and I can reproduce the issue on that machine. I redefined the assert.soon() function in this test to hang on failure, so I'm able to drop into a gdb session for further work (and, helpfully, the test still fails on the debug build).

Comment by Matt Kangas [ 11/Jul/13 ]

Same builder, different error message?

Nightly Linux 64-bit SSL Amazon AMI Build #548

http://buildbot-special.10gen.com/builders/Nightly%20Linux%2064-bit%20SSL%20Amazon%20AMI/builds/548/steps/test_4/logs/stdio

http://buildlogs.mongodb.org/Nightly%20Linux%2064-bit%20SSL%20Amazon%20AMI/builds/548/test/sharding/remove2.js

assert.soon failed: function () {
        printjson( st.s.getDB( "config" ).locks.find().toArray() )
        return !st.isAnyBalanceInFlight();
    }, msg:migrations did not end?
Error: Printing Stack Trace
    at printStackTrace (src/mongo/shell/utils.js:37:15)
    at doassert (src/mongo/shell/assert.js:6:5)
    at Function.assert.soon (src/mongo/shell/assert.js:174:60)
    at removeShard (/data/buildslaves/Linux_64bit_SSL_Amazon_AMI_Nightly/mongo/jstests/sharding/remove2.js:29:12)
    at /data/buildslaves/Linux_64bit_SSL_Amazon_AMI_Nightly/mongo/jstests/sharding/remove2.js:209:1
Thu Jul 11 00:05:01.592 assert.soon failed: function () {
        printjson( st.s.getDB( "config" ).locks.find().toArray() )
        return !st.isAnyBalanceInFlight();
    }, msg:migrations did not end? at src/mongo/shell/assert.js:7
failed to load: /data/buildslaves/Linux_64bit_SSL_Amazon_AMI_Nightly/mongo/jstests/sharding/remove2.js

Comment by Greg Studer [ 08/Jul/13 ]

Looks like a deadlock in removeRange - discussing as part of another fix.

Comment by Matt Kangas [ 08/Jul/13 ]

Ditto on Nightly Linux 64-bit Subscription Ubuntu 1204 build #157 July 07

http://buildbot-special.10gen.com/builders/Nightly%20Linux%2064-bit%20Subscription%20Ubuntu%201204/builds/157/steps/shell_3/logs/stdio
http://buildlogs.mongodb.org/Nightly%20Linux%2064-bit%20Subscription%20Ubuntu%201204/builds/157/test/sharding/remove2.js

Sun Jul  7 16:15:01.910 assert.soon failed: function () {
        res = st.admin.runCommand( { removeshard: replTest.name } );
        printjson(res);
        return res.ok && res.msg == 'removeshard completed successfully';
    }, msg:failed to remove shard at src/mongo/shell/assert.js:7
failed to load: /data/buildslaves/Linux_64bit_Subscription_Ubuntu_1204_Nightly/mongo/jstests/sharding/remove2.js
