[SERVER-10341] sync_passive2.js fails on Nightly Solaris-SmartOS 64-bit, Linux 32-bit Created: 25/Jul/13  Updated: 11/Jul/16  Resolved: 26/Aug/13

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: 2.5.3

Type: Bug Priority: Major - P3
Reporter: Matt Kangas Assignee: Matt Dannenberg
Resolution: Done Votes: 0
Labels: buildbot
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Nightly Solaris-SmartOS 64-bit Build #462, #464
Solaris-SmartOS 64-bit Build #1157
V2.4 Windows 32-bit


Issue Links:
Related
related to SERVER-16156 Re-enable test from SERVER-10341 Closed
Operating System: ALL
Participants:

 Description   

Nightly Solaris-SmartOS 64-bit Build #462 July 24

http://buildbot.mongodb.org/builders/Nightly%20Solaris-SmartOS%2064-bit/builds/462/steps/test_11/logs/stdio
http://buildlogs.mongodb.org/Nightly%20Solaris-SmartOS%2064-bit/builds/462/test/replica%20sets/sync_passive2.js

1: [object Object],[object Object],[object Object],[object Object],[object Object] 2a: false 2: false 3: true 4: false
assert failed : tried 200 times, giving up on undefined
Error: Printing Stack Trace
    at printStackTrace (src/mongo/shell/utils.js:37:15)
    at doassert (src/mongo/shell/assert.js:6:5)
    at assert (src/mongo/shell/assert.js:14:5)
    at wait (jstests/replsets/rslib.js:14:9)
    at /data/buildslaves/SolarisSmartOS_64bit_Nightly/mongo/jstests/replsets/sync_passive2.js:92:1
Wed Jul 24 16:50:24.401 assert failed : tried 200 times, giving up on undefined at src/mongo/shell/assert.js:7
failed to load: /data/buildslaves/SolarisSmartOS_64bit_Nightly/mongo/jstests/replsets/sync_passive2.js

Nightly Solaris-SmartOS 64-bit Build #464 July 25 appears to fail the same way

http://buildlogs.mongodb.org/Nightly%20Solaris-SmartOS%2064-bit/builds/464/test/replica%20sets/sync_passive2.js



 Comments   
Comment by auto [ 26/Aug/13 ]

Author:

{u'username': u'dannenberg', u'name': u'matt dannenberg', u'email': u'matt.dannenberg@10gen.com'}

Message: SERVER-10341 add a sleep in the notifier thread to prevent starvation in single core builders
Branch: master
https://github.com/mongodb/mongo/commit/d5c242e1d5857ee62e339a80db4a6876a189ea43

Comment by Matt Kangas [ 22/Aug/13 ]

Commit e71ca210ed was verbally lgtm'd by Greg

Comment by auto [ 22/Aug/13 ]

Author:

{u'username': u'kangas', u'name': u'Matt Kangas', u'email': u'matt.kangas@10gen.com'}

Message: SERVER-10341 disable sync_passive2.js test for 2.5.2 release

We identified a performance regression that prevents this test from finishing
on the Win32 builder and probably any single-core host. We should reenable
the test after the performance regression is addressed.
Branch: master
https://github.com/mongodb/mongo/commit/e71ca210ed90143394cdade334d834d9c4ec9303

Comment by Eric Milkie [ 22/Aug/13 ]

With Drew's help I found what seems to be a problem at least on Windows 32-bit. It is affected the most because the builder has only one cpu and so it is more susceptible to CPU looping processes.
During the first part of the sync_passive2.js, it waits for 3 and 4 to sync up. In the failing case, 4 is prevented from making fast progress because 3 is cpu looping trying to reconnect to its missing sync source.

Comment by Matt Kangas [ 22/Aug/13 ]

Buildbot builders failing on this as of Wed Aug 21

And on MCI:

Comment by Matt Kangas [ 21/Aug/13 ]

Again on Solaris

Solaris-SmartOS 64-bit Build #1237 Aug 20
http://buildlogs.mongodb.org/Solaris-SmartOS%2064-bit/builds/1237/test/replica%20sets/sync_passive2.js

assert failed : tried 200 times, giving up on undefined
Error: Printing Stack Trace
    at printStackTrace (src/mongo/shell/utils.js:37:15)
    at doassert (src/mongo/shell/assert.js:6:5)
    at assert (src/mongo/shell/assert.js:14:5)
    at wait (jstests/replsets/rslib.js:14:9)
    at /data/buildslaves/SolarisSmartOS_64bit/mongo/jstests/replsets/sync_passive2.js:92:1

Comment by Matt Kangas [ 15/Aug/13 ]

Not dead yet

Solaris-SmartOS 64-bit Build #1224 Aug 14

http://buildlogs.mongodb.org/Solaris-SmartOS%2064-bit/builds/1224/test/replica%20sets/sync_passive2.js

Comment by Matt Kangas [ 06/Aug/13 ]

Solaris-SmartOS 64-bit Build #1205
http://buildlogs.mongodb.org/Solaris-SmartOS%2064-bit/builds/1205/test/replica%20sets/sync_passive2.js

assert failed : tried 200 times, giving up on undefined
Error: Printing Stack Trace
    at printStackTrace (src/mongo/shell/utils.js:37:15)
    at doassert (src/mongo/shell/assert.js:6:5)
    at assert (src/mongo/shell/assert.js:14:5)
    at wait (jstests/replsets/rslib.js:14:9)
    at /data/buildslaves/SolarisSmartOS_64bit/mongo/jstests/replsets/sync_passive2.js:92:1
Tue Aug  6 15:52:02.997 assert failed : tried 200 times, giving up on undefined at src/mongo/shell/assert.js:7
failed to load: /data/buildslaves/SolarisSmartOS_64bit/mongo/jstests/replsets/sync_passive2.js

Comment by Matt Kangas [ 06/Aug/13 ]

V2.4 Windows 32-bit Build #187 (Tue Aug 6, rev 3fc69c4aa)
http://buildlogs.mongodb.org/V2.4%20Windows%2032-bit/builds/187/test/replica%20sets/sync_passive2.js

ReplSetTest getMaster failed: "[Finding master] timed out after 60000ms ( 31 tries )"
Error: Printing Stack Trace
    at printStackTrace (src/mongo/shell/utils.js:37:15)
    at ReplSetTest.getMaster (src/mongo/shell/replsettest.js:412:5)
    at D:\slave\Windows_32bit_V2.4\mongo\jstests\replsets\sync_passive2.js:128:19
Tue Aug 06 09:49:38.992 [Finding master] timed out after 60000ms ( 31 tries ) at src/mongo/shell/replsettest.js:413
failed to load: D:\slave\Windows_32bit_V2.4\mongo\jstests\replsets\sync_passive2.js

Comment by Matt Kangas [ 05/Aug/13 ]

Note re: Tad's comment above:

I tried reverting commits 27c4e7fbd2ef6eeb04dccd1bcdecdb21b00522d1, 96c541693cc5b6954a5286788f7ef88f5c680bc6 and 90526f8ca838873e40cda563e77c9f4ba71936e5 .
With these three commits reverted, the test passed 800 times on the Solaris 11.1 VM, at which point I stopped the test.

These commits have not been backported to v2.4, so they cannot be the cause of failures on the v2.4 branch.

Comment by Matt Kangas [ 05/Aug/13 ]

More failures:

V2.4 Windows 32-bit Build #180 (Sat Aug 3)

ReplSetTest getMaster failed: "[Finding master] timed out after 60000ms ( 31 tries )"
Error: Printing Stack Trace
    at printStackTrace (src/mongo/shell/utils.js:37:15)
    at ReplSetTest.getMaster (src/mongo/shell/replsettest.js:412:5)
    at D:\slave\Windows_32bit_V2.4\mongo\jstests\replsets\sync_passive2.js:128:19
Sat Aug 03 09:44:32.452 [Finding master] timed out after 60000ms ( 31 tries ) at src/mongo/shell/replsettest.js:413
failed to load: D:\slave\Windows_32bit_V2.4\mongo\jstests\replsets\sync_passive2.js

V2.4 Windows 32-bit Build #184 (Sun Aug 4)
http://buildlogs.mongodb.org/V2.4%20Windows%2032-bit/builds/184/test/replica%20sets/sync_passive2.js

ReplSetTest getMaster failed: "[Finding master] timed out after 60000ms ( 31 tries )"
Error: Printing Stack Trace
    at printStackTrace (src/mongo/shell/utils.js:37:15)
    at ReplSetTest.getMaster (src/mongo/shell/replsettest.js:412:5)
    at D:\slave\Windows_32bit_V2.4\mongo\jstests\replsets\sync_passive2.js:128:19
Mon Aug 05 03:58:25.669 [Finding master] timed out after 60000ms ( 31 tries ) at src/mongo/shell/replsettest.js:413
failed to load: D:\slave\Windows_32bit_V2.4\mongo\jstests\replsets\sync_passive2.js

Solaris-SmartOS 64-bit Build #1201 (Mon Aug 5)
http://buildlogs.mongodb.org/Solaris-SmartOS%2064-bit/builds/1201/test/replica%20sets/sync_passive2.js

Mon Aug  5 18:08:01.263 assert failed : tried 200 times, giving up on undefined at src/mongo/shell/assert.js:7
failed to load: /data/buildslaves/SolarisSmartOS_64bit/mongo/jstests/replsets/sync_passive2.js

Comment by Matt Kangas [ 01/Aug/13 ]

Again:

Solaris-SmartOS 64-bit Build #1179 July 31

http://buildbot.mongodb.org/builders/Solaris-SmartOS%2064-bit/builds/1179/steps/test_9/logs/stdio
http://buildlogs.mongodb.org/Solaris-SmartOS%2064-bit/builds/1179/test/replica%20sets/sync_passive2.js

assert failed : tried 200 times, giving up on undefined
Error: Printing Stack Trace
    at printStackTrace (src/mongo/shell/utils.js:37:15)
    at doassert (src/mongo/shell/assert.js:6:5)
    at assert (src/mongo/shell/assert.js:14:5)
    at wait (jstests/replsets/rslib.js:14:9)
    at /data/buildslaves/SolarisSmartOS_64bit/mongo/jstests/replsets/sync_passive2.js:92:1
Wed Jul 31 19:34:56.286 assert failed : tried 200 times, giving up on undefined at src/mongo/shell/assert.js:7

Comment by Tad Marshall [ 31/Jul/13 ]

Another SmartOS failure:
Solaris-SmartOS 64-bit Build #1179 – 3a14b391 – http://buildbot.mongodb.org/builders/Solaris-SmartOS%2064-bit/builds/1179
http://buildbot.mongodb.org/builders/Solaris-SmartOS%2064-bit/builds/1179/steps/test_9/logs/stdio
http://buildlogs.mongodb.org/Solaris-SmartOS%2064-bit/builds/1179/test/replica%20sets/sync_passive2.js

assert failed : tried 200 times, giving up on undefined
Error: Printing Stack Trace
    at printStackTrace (src/mongo/shell/utils.js:37:15)
    at doassert (src/mongo/shell/assert.js:6:5)
    at assert (src/mongo/shell/assert.js:14:5)
    at wait (jstests/replsets/rslib.js:14:9)
    at /data/buildslaves/SolarisSmartOS_64bit/mongo/jstests/replsets/sync_passive2.js:92:1
Wed Jul 31 19:34:56.286 assert failed : tried 200 times, giving up on undefined at src/mongo/shell/assert.js:7
failed to load: /data/buildslaves/SolarisSmartOS_64bit/mongo/jstests/replsets/sync_passive2.js

Comment by Tad Marshall [ 30/Jul/13 ]

Another Linux 32-bit failure:
http://buildbot.mongodb.org/builders/Nightly%20Linux%2032-bit/builds/1423
http://buildbot.mongodb.org/builders/Nightly%20Linux%2032-bit/builds/1423/steps/test_9/logs/stdio
http://buildlogs.mongodb.org/Nightly%20Linux%2032-bit/builds/1423/test/replica%20sets/sync_passive2.js

check sync target re-evaluation
 m31000| Tue Jul 30 08:39:17.266 [conn1] replSet replSetReconfig config object parses ok, 5 members specified
 m31000| Tue Jul 30 08:39:17.266 [conn1] replSet replSetReconfig [2]
 m31000| Tue Jul 30 08:39:17.266 [conn1] replSet info saving a newer config version to local.system.replset
 m31000| Tue Jul 30 08:39:17.284 [conn1] replSet saveConfigLocally done
 m31000| Tue Jul 30 08:39:17.284 [conn1] replSet relinquishing primary state
 m31000| Tue Jul 30 08:39:17.284 [conn1] replSet SECONDARY
 m31000| Tue Jul 30 08:39:17.284 [conn1] replSet closing client sockets after relinquishing primary
Tue Jul 30 08:39:17.284 DBClientCursor::init call() failed
Tue Jul 30 08:39:17.284 trying reconnect to 127.0.0.1:31000
 m31000| Tue Jul 30 08:39:17.285 [conn10] end connection 10.255.119.66:47767 (6 connections now open)
 m31000| Tue Jul 30 08:39:17.285 [conn13] end connection 10.255.119.66:47770 (6 connections now open)
 m31000| Tue Jul 30 08:39:17.285 [conn1] replSet PRIMARY
 m31000| Tue Jul 30 08:39:17.287 [initandlisten] connection accepted from 127.0.0.1:59450 #17 (6 connections now open)
Tue Jul 30 08:39:17.287 reconnect 127.0.0.1:31000 ok
 m31000| Tue Jul 30 08:39:17.287 [conn1] replSet replSetReconfig new config saved locally
 m31000| Tue Jul 30 08:39:17.287 [conn1] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [127.0.0.1:59401] 
 m31000| Tue Jul 30 08:39:17.288 [initandlisten] connection accepted from 10.255.119.66:47795 #18 (6 connections now open)
 m31002| Tue Jul 30 08:39:17.288 [SyncSourceFeedbackThread] DBClientCursor::init call() failed
 m31002| Tue Jul 30 08:39:17.288 [SyncSourceFeedbackThread] SyncSourceFeedback error sending update: DBClientBase::findN: transport error: bs-linux32.10gen.cc:31000 ns: admin.$cmd query: { replSetUpdatePosition: 1, optimes: [ { _id: ObjectId('51f7b3ada7c0577a55303b23'), optime: Timestamp 1375187915000|1822, config: { _id: 2, host: "bs-linux32.10gen.cc:31002", priority: 0.0 } }, { _id: ObjectId('51f7b3afcca83ff15f78d3c9'), optime: Timestamp 1375187915000|1822, config: { _id: 4, host: "bs-linux32.10gen.cc:31004", priority: 0.0, hidden: true, buildIndexes: false } }, { _id: ObjectId('51f7b3b0c88f87b066d1f0fb'), optime: Timestamp 1375187957000|1, config: { _id: 3, host: "bs-linux32.10gen.cc:31003", priority: 0.0 } } ] }
ReplSetTest awaitReplication: starting: timestamp for primary, bs-linux32.10gen.cc:31000, is Timestamp(1375187957, 1)
 m31002| Tue Jul 30 08:39:17.292 [rsBackgroundSync] Socket flush send() errno:32 Broken pipe 10.255.119.66:31000
 m31002| Tue Jul 30 08:39:17.292 [rsBackgroundSync]  caught exception (socket exception [SEND_ERROR] for 10.255.119.66:31000) in destructor (~PiggyBackData)
 m31002| Tue Jul 30 08:39:17.292 [rsBackgroundSync] replSet sync source problem: 10278 dbclient error communicating with server: bs-linux32.10gen.cc:31000
 m31002| Tue Jul 30 08:39:17.292 [rsBackgroundSync] replSet syncing to: bs-linux32.10gen.cc:31000
 m31002| Tue Jul 30 08:39:17.294 [SyncSourceFeedbackThread] DBClientCursor::init call() failed
 m31002| Tue Jul 30 08:39:17.294 [SyncSourceFeedbackThread] SyncSourceFeedback error sending handshake: DBClientBase::findN: transport error: bs-linux32.10gen.cc:31000 ns: admin.$cmd query: { replSetUpdatePosition: 1, handshake: { handshake: ObjectId('51f7b3ada7c0577a55303b23'), member: 2, config: { _id: 2, host: "bs-linux32.10gen.cc:31002", priority: 0.0 } } }
 m31000| Tue Jul 30 08:39:17.290 [rsHealthPoll] replSet member bs-linux32.10gen.cc:31004 is up
 m31000| Tue Jul 30 08:39:17.291 [rsHealthPoll] replSet member bs-linux32.10gen.cc:31004 is now in state SECONDARY
 m31000| Tue Jul 30 08:39:17.291 [rsMgr] can't see a majority of the set, relinquishing primary
 m31000| Tue Jul 30 08:39:17.291 [rsMgr] replSet relinquishing primary state
 m31000| Tue Jul 30 08:39:17.291 [rsMgr] replSet SECONDARY
 m31000| Tue Jul 30 08:39:17.291 [rsMgr] replSet closing client sockets after relinquishing primary
 m31000| Tue Jul 30 08:39:17.291 [conn17] end connection 127.0.0.1:59450 (5 connections now open)
 m31000| Tue Jul 30 08:39:17.291 [rsHealthPoll] replSet member bs-linux32.10gen.cc:31003 is up
 m31000| Tue Jul 30 08:39:17.291 [rsHealthPoll] replSet member bs-linux32.10gen.cc:31003 is now in state SECONDARY
 m31000| Tue Jul 30 08:39:17.292 [rsHealthPoll] replSet member bs-linux32.10gen.cc:31002 is up
 m31000| Tue Jul 30 08:39:17.292 [initandlisten] connection accepted from 10.255.119.66:47796 #19 (6 connections now open)
 m31000| Tue Jul 30 08:39:17.292 [rsHealthPoll] replSet member bs-linux32.10gen.cc:31002 is now in state SECONDARY
 m31000| Tue Jul 30 08:39:17.293 [rsMgr] not electing self, bs-linux32.10gen.cc:31004 would veto with 'bs-linux32.10gen.cc:31000 is trying to elect itself but bs-linux32.10gen.cc:31000 is already primary and more up-to-date'
 m31000| Tue Jul 30 08:39:17.293 [rsMgr] not electing self, bs-linux32.10gen.cc:31004 would veto with 'bs-linux32.10gen.cc:31000 is trying to elect itself but bs-linux32.10gen.cc:31000 is already primary and more up-to-date'
 m31000| Tue Jul 30 08:39:17.293 [rsHealthPoll] replSet member bs-linux32.10gen.cc:31001 is up
 m31000| Tue Jul 30 08:39:17.293 [rsHealthPoll] replSet member bs-linux32.10gen.cc:31001 is now in state ARBITER
 m31000| Tue Jul 30 08:39:17.294 [rsMgr] not electing self, bs-linux32.10gen.cc:31004 would veto with 'bs-linux32.10gen.cc:31000 is trying to elect itself but bs-linux32.10gen.cc:31000 is already primary and more up-to-date'
 m31000| Tue Jul 30 08:39:17.294 [conn18] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [10.255.119.66:47795] 
 m31000| Tue Jul 30 08:39:17.295 [initandlisten] connection accepted from 10.255.119.66:47797 #20 (6 connections now open)
Tue Jul 30 08:39:17.297 DBClientCursor::init call() failed
Tue Jul 30 08:39:17.308 Error: error doing query: failed at src/mongo/shell/collection.js:54
failed to load: /mnt/slaves/Linux_32bit_Nightly/mongo/jstests/replsets/sync_passive2.js

Comment by Tad Marshall [ 28/Jul/13 ]

I was able to reproduce the failure on a Solaris 11.1 VM. On one host machine, it failed after 4 successful passes (running "buildscripts/smoke.py jstests/replsets/sync_passive2.js"), failing on the 5th run. On a different host machine, it passed 70 times and failed on the 71st pass.

I tried reverting commits 27c4e7fbd2ef6eeb04dccd1bcdecdb21b00522d1, 96c541693cc5b6954a5286788f7ef88f5c680bc6 and 90526f8ca838873e40cda563e77c9f4ba71936e5 .

With these three commits reverted, the test passed 800 times on the Solaris 11.1 VM, at which point I stopped the test.

The cause appears to be a deadlock of some kind, where one of the nodes stops responding on the network. This seems similar to SERVER-10093, which is also related to these commits.

Comment by Tad Marshall [ 26/Jul/13 ]

This time on Linux 32-bit:
Linux 32-bit Build #5853 – http://buildbot.mongodb.org/builders/Linux%2032-bit/builds/5853
http://buildbot.mongodb.org/builders/Linux%2032-bit/builds/5853/steps/test_9/logs/stdio
http://buildlogs.mongodb.org/Linux%2032-bit/builds/5853/test/replica%20sets/sync_passive2.js

Comment by Matt Kangas [ 26/Jul/13 ]

Same, again.

Solaris-SmartOS 64-bit Build #1163 July 25
http://buildlogs.mongodb.org/Solaris-SmartOS%2064-bit/builds/1163/test/replica%20sets/sync_passive2.js

1: [object Object],[object Object],[object Object],[object Object],[object Object] 2a: false 2: false 3: true 4: false
assert failed : tried 200 times, giving up on undefined
Error: Printing Stack Trace
    at printStackTrace (src/mongo/shell/utils.js:37:15)
    at doassert (src/mongo/shell/assert.js:6:5)
    at assert (src/mongo/shell/assert.js:14:5)
    at wait (jstests/replsets/rslib.js:14:9)
    at /data/buildslaves/SolarisSmartOS_64bit/mongo/jstests/replsets/sync_passive2.js:92:1
Fri Jul 26 01:12:59.265 assert failed : tried 200 times, giving up on undefined at src/mongo/shell/assert.js:7
failed to load: /data/buildslaves/SolarisSmartOS_64bit/mongo/jstests/replsets/sync_passive2.js

Comment by Matt Kangas [ 25/Jul/13 ]

Same failure on Solaris-SmartOS 64-bit Build #1157 July 24

http://buildbot.mongodb.org/builders/Solaris-SmartOS%2064-bit/builds/1157/steps/test_9/logs/stdio
http://buildlogs.mongodb.org/Solaris-SmartOS%2064-bit/builds/1157/test/replica%20sets/sync_passive2.js

1: [object Object],[object Object],[object Object],[object Object],[object Object] 2a: false 2: false 3: true 4: false
assert failed : tried 200 times, giving up on undefined
Error: Printing Stack Trace
    at printStackTrace (src/mongo/shell/utils.js:37:15)
    at doassert (src/mongo/shell/assert.js:6:5)
    at assert (src/mongo/shell/assert.js:14:5)
    at wait (jstests/replsets/rslib.js:14:9)
    at /data/buildslaves/SolarisSmartOS_64bit/mongo/jstests/replsets/sync_passive2.js:92:1
Wed Jul 24 23:32:08.584 assert failed : tried 200 times, giving up on undefined at src/mongo/shell/assert.js:7
failed to load: /data/buildslaves/SolarisSmartOS_64bit/mongo/jstests/replsets/sync_passive2.js

Comment by Matt Kangas [ 25/Jul/13 ]

Tad, can you look at this?

Generated at Thu Feb 08 03:22:54 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.