[SERVER-10418] remove2.js failed on special V2.4 Linux 64-bit SSL SUSE 11 Build #157 Created: 02/Aug/13  Updated: 19/May/14  Resolved: 03/Aug/13

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Matt Kangas Assignee: Greg Studer
Resolution: Won't Fix Votes: 0
Labels: buildbot
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Participants:

 Description   

V2.4 Linux 64-bit SSL SUSE 11 Build #157 Aug 2

http://buildbot-special.10gen.com/builders/V2.4%20Linux%2064-bit%20SSL%20SUSE%2011/builds/157/steps/test_3/logs/stdio
http://buildlogs.mongodb.org/V2.4%20Linux%2064-bit%20SSL%20SUSE%2011/builds/157/test/sharding/remove2.js

Fri Aug  2 18:29:58.508 [ReplicaSetMonitorWatcher] warning: No primary detected for set remove2-rs1
2013-08-02 14:30:40 EDT	
]
2013-08-02 14:31:30 EDT	
 m30999| Fri Aug  2 18:29:58.690 [Balancer] distributed lock 'balancer/ip-10-110-253-72:30999:1375468205:1804289383' acquired, ts : 51fbfaa62e22c0be3e64d233
 m30999| Fri Aug  2 18:29:58.690 [Balancer] distributed lock 'balancer/ip-10-110-253-72:30999:1375468205:1804289383' unlocked. 
 m31200| Fri Aug  2 18:29:59.372 [rsMgr] replSet PRIMARY
 m31200| Fri Aug  2 18:30:00.373 [rsHealthPoll] replSet member ip-10-110-253-72:31201 is now in state RECOVERING
 m31201| Fri Aug  2 18:30:00.785 [rsHealthPoll] replSet member ip-10-110-253-72:31200 is now in state PRIMARY
 m30999| Fri Aug  2 18:30:04.692 [Balancer] distributed lock 'balancer/ip-10-110-253-72:30999:1375468205:1804289383' acquired, ts : 51fbfaac2e22c0be3e64d234
 m30999| Fri Aug  2 18:30:04.693 [Balancer] distributed lock 'balancer/ip-10-110-253-72:30999:1375468205:1804289383' unlocked. 
 m31100| Fri Aug  2 18:31:30.245 [conn4] getMore: cursorid not found local.oplog.rs 116381552730394
 m31100| Fri Aug  2 18:31:30.245 [conn4] getmore local.oplog.rs query: { ts: { $gte: Timestamp 1375468146000|11 } } cursorid:116381552730394 ntoreturn:0 keyUpdates:0 locks(micros) r:105 nreturned:0 reslen:20 50471ms
 m31100| Fri Aug  2 18:31:30.245 [conn4] killcursors: found 0 of 1
 m31100| Fri Aug  2 18:31:30.245 [conn4] end connection 10.110.253.72:62307 (13 connections now open)
 m31100| Fri Aug  2 18:31:30.245 [conn5] getmore local.oplog.rs query: { ts: { $gte: Timestamp 1375468146000|11 } } cursorid:116889113026827 ntoreturn:0 keyUpdates:0 locks(micros) r:31 nreturned:0 reslen:20 50469ms
assert.soon failed: function () {
          replTest.getMaster(); // Reload who the current slaves are.
          var slaves = replTest.liveNodes.slaves;
          var len = slaves.length;
          var ready = true;
          for(var i=0; i<len; i++) {
              var isMaster = slaves[i].getDB("admin").runCommand({ismaster: 1});
              var arbiter = isMaster['arbiterOnly'] == undefined ? false : isMaster['arbiterOnly'];
              ready = ready && ( isMaster['secondary'] || arbiter );
          }
          return ready;
      }, msg:Awaiting secondaries
Error: Printing Stack Trace
    at printStackTrace (src/mongo/shell/utils.js:37:15)
    at doassert (src/mongo/shell/assert.js:6:5)
    at Function.assert.soon (src/mongo/shell/assert.js:110:60)
    at ReplSetTest.awaitSecondaryNodes (src/mongo/shell/replsettest.js:386:10)
    at ReplSetTest.initiate (src/mongo/shell/replsettest.js:482:10)
    at /data/buildslaves/Linux_64bit_SSL_SUSE_11_V2.4/mongo/jstests/sharding/remove2.js:139:6
Fri Aug  2 18:31:30.322 assert.soon failed: function () {
          replTest.getMaster(); // Reload who the current slaves are.
          var slaves = replTest.liveNodes.slaves;
          var len = slaves.length;
          var ready = true;
          for(var i=0; i<len; i++) {
              var isMaster = slaves[i].getDB("admin").runCommand({ismaster: 1});
              var arbiter = isMaster['arbiterOnly'] == undefined ? false : isMaster['arbiterOnly'];
              ready = ready && ( isMaster['secondary'] || arbiter );
          }
          return ready;
      }, msg:Awaiting secondaries at src/mongo/shell/assert.js:7
failed to load: /data/buildslaves/Linux_64bit_SSL_SUSE_11_V2.4/mongo/jstests/sharding/remove2.js

Is this actually related to SERVER-10120? If we backport the increased test timeout there, will this be resolved?



 Comments   
Comment by Greg Studer [ 03/Aug/13 ]

Resolving as won't fix since the issue looks to be at the host level.

Comment by Greg Studer [ 03/Aug/13 ]

Timeline:

Replica set (m31200,m31201) was restarted normally as part of the test:

rst1.startSet();
rst1.initiate();
rst1.awaitReplication();

Things seem okay at first, one node becomes primary, the other starts trying to catch up...:

 m31201| Fri Aug  2 18:29:58.373 [conn2] replSet RECOVERING
 m31201| Fri Aug  2 18:29:58.373 [conn2] replSet info voting yea for ip-10-110-253-72:31200 (0)
 m31200| Fri Aug  2 18:29:59.372 [rsMgr] replSet PRIMARY
 m31200| Fri Aug  2 18:30:00.373 [rsHealthPoll] replSet member ip-10-110-253-72:31201 is now in state RECOVERING
 m31201| Fri Aug  2 18:30:00.785 [rsHealthPoll] replSet member ip-10-110-253-72:31200 is now in state PRIMARY

Then no messages for 90s - we'd expect to see at least the balancer distributed lock messages from mongos, which is uninvolved in the syncing and which happen about every 6s:

 m30999| Fri Aug  2 18:30:04.693 [Balancer] distributed lock 'balancer/ip-10-110-253-72:30999:1375468205:1804289383' unlocked. 
 m31100| Fri Aug  2 18:31:30.245 [conn4] getMore: cursorid not found local.oplog.rs 116381552730394

Our awaitReplication() timeout is immediately afterwards, but the awaitReplication() timeout is for 60s, not 90s, so this also seems like the host itself was frozen or the time changed suddenly. Also, immediately after the freeze, the secondary happily syncs before getting shut down:

 m31201| Fri Aug  2 18:31:31.114 [rsSync] replSet set minValid=51fbfa94:b
 m31201| Fri Aug  2 18:31:31.114 [rsSync] replSet initial sync done

TLDR - looks like a buildbot issue to me, not a test issue, if it doesn't keep happening.

Generated at Thu Feb 08 03:23:07 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.