[SERVER-10125] stepdown.js failing on buildbot-special V2.4 Linux 64-bit Subscription SUSE 11 Created: 08/Jul/13  Updated: 11/Jul/16  Resolved: 24/Jul/13

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: 2.5.2

Type: Bug Priority: Major - P3
Reporter: Matt Kangas Assignee: Randolph Tan
Resolution: Done Votes: 0
Labels: buildbot
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

buildbot-special: V2.4 Linux 64-bit Subscription SUSE 11 build #144
buildbot-special Nightly Linux 64-bit Subscription SUSE 11 Build #465


Operating System: ALL
Participants:

 Description   

http://buildbot-special.10gen.com/builders/V2.4%20Linux%2064-bit%20Subscription%20SUSE%2011/builds/144
http://buildbot-special.10gen.com/builders/V2.4%20Linux%2064-bit%20Subscription%20SUSE%2011/builds/144/steps/shell_2/logs/stdio
http://buildlogs.mongodb.org/V2.4%20Linux%2064-bit%20Subscription%20SUSE%2011/builds/144/test/replica%20sets/stepdown.js

ReplSetTest getMaster failed: "[Finding master] timed out after 60000ms ( 31 tries )"
Error: Printing Stack Trace
    at printStackTrace (src/mongo/shell/utils.js:37:15)
    at ReplSetTest.getMaster (src/mongo/shell/replsettest.js:412:5)
    at /data/buildslaves/Linux_64bit_Subscription_SUSE_11_V2.4/mongo/jstests/replsets/stepdown.js:101:10
Sun Jul  7 08:14:46.064 JavaScript execution failed: [Finding master] timed out after 60000ms ( 31 tries ) at src/mongo/shell/replsettest.js:L413
failed to load: /data/buildslaves/Linux_64bit_Subscription_SUSE_11_V2.4/mongo/jstests/replsets/stepdown.js

Preceding this are repeated authentication failures. Possibly the cause?

Error: 18 { code: 18, ok: 0.0, errmsg: "auth fails" }
...
Caught exception while authenticating connection: "[Authenticating connection: connection to ip-10-86-167-136:31001] timed out after 5000ms ( 6 tries )"



 Comments   
Comment by auto [ 18/Jul/13 ]

Author:

{u'username': u'renctan', u'name': u'Randolph Tan', u'email': u'randolph@10gen.com'}

Message: SERVER-10125 stepdown.js failing on buildbot-special V2.4 Linux 64-bit Subscription SUSE 11

replace jsTest.attempt to assert.soon
Branch: master
https://github.com/mongodb/mongo/commit/3f4169d7144fe1300129f5b0ccda1b75b8f76cef

Comment by Randolph Tan [ 16/Jul/13 ]

If my diagnosis is correct, I believe this is a separate issue.

Comment by Matt Kangas [ 15/Jul/13 ]

stepdown2.js failed on Nightly Linux 64-bit SSL SUSE 11 Build #497 July 15

Is this a different issue?

http://buildbot-special.10gen.com/builders/Nightly%20Linux%2064-bit%20SSL%20SUSE%2011/builds/497/steps/test_3/logs/stdio
http://buildlogs.mongodb.org/Nightly%20Linux%2064-bit%20SSL%20SUSE%2011/builds/497/test/replica%20sets/stepdown2.js

 m31001| Mon Jul 15 23:06:21.242 [rsHealthPoll] replset info ip-10-110-253-72:31000 heartbeat failed, retrying
 m31001| Mon Jul 15 23:06:21.243 [rsHealthPoll] replSet info ip-10-110-253-72:31000 is down (or slow to respond): 
 m31001| Mon Jul 15 23:06:21.243 [rsHealthPoll] replSet member ip-10-110-253-72:31000 is now in state DOWN
 m31001| Mon Jul 15 23:06:21.243 [rsMgr] replSet can't see a majority, will not try to elect self
 m31001| Mon Jul 15 23:06:21.577 [signalProcessingThread] got signal 15 (Terminated), will terminate after current cmd ends
 m31001| Mon Jul 15 23:06:23.243 [rsHealthPoll] replset info ip-10-110-253-72:31000 heartbeat failed, retrying
 m31001| Mon Jul 15 23:06:25.244 [rsHealthPoll] replset info ip-10-110-253-72:31000 heartbeat failed, retrying
 m31001| Mon Jul 15 23:07:40.272 [rsHealthPoll] replset info ip-10-110-253-72:31000 heartbeat failed, retrying
 m31001| Mon Jul 15 23:07:40.272 [rsMgr] replSet can't see a majority, will not try to elect self
 m31001| Mon Jul 15 23:07:42.272 [rsHealthPoll] replset info ip-10-110-253-72:31000 heartbeat failed, retrying
 m31001| Mon Jul 15 23:07:44.273 [rsHealthPoll] replset info ip-10-110-253-72:31000 heartbeat failed, retrying
 
...
 
 m31001| Mon Jul 15 23:08:34.308 [rsHealthPoll] replset info ip-10-110-253-72:31000 heartbeat failed, retrying
 m31001| Mon Jul 15 23:08:34.308 [rsHealthPoll] couldn't connect to ip-10-110-253-72:31000: couldn't connect to server ip-10-110-253-72:31000
 m31001| Mon Jul 15 23:08:34.308 [rsHealthPoll] couldn't connect to ip-10-110-253-72:31000: couldn't connect to server ip-10-110-253-72:31000
 m31001| Mon Jul 15 23:08:34.308 [rsHealthPoll] couldn't connect to ip-10-110-253-72:31000: couldn't connect to server ip-10-110-253-72:31000
Mon Jul 15 23:08:34.609 Mon Jul 15 23:08:34  process on port 31001, with pid 15108 not terminated, sending sigkill

m31001 says it "will terminate after current cmd ends" but it never actually exits?

Comment by Spencer Brody (Inactive) [ 11/Jul/13 ]

I have also not been able to reproduce this, and it has passed in future runs on the same builder...

Comment by Spencer Brody (Inactive) [ 11/Jul/13 ]

This failure is confusing me. It looks a lot like a timing bug, notice how

m31001| Sun Jul  7 08:14:45.030 [rsMgr] replSet PRIMARY

prints moments before

ReplSetTest getMaster failed: "[Finding master] timed out after 60000ms ( 31 tries )"

but the weird thing is that from the time the old primary steps down:

m31000| Sun Jul  7 08:14:12.080 [conn10] replSet info stepping down as primary secs=100

to the time the test exits after timing out waiting for the new primary:

Sun Jul  7 08:14:48.156 [signalProcessingThread] now exiting

looks to be just 36 seconds, though the test is saying that it waited 60 seconds before aborting. I have no idea why the timing of the timeout waiting for the primary seems so far off...

Comment by Matt Kangas [ 09/Jul/13 ]

Spencer, does this look like a new auth-related issue? Or just a timeout that we should quarantine?

Note that it's on V2.4

Generated at Thu Feb 08 03:22:20 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.