[SERVER-10414] chaining.js timeouts on V2.4 Linux 32-bit Created: 02/Aug/13  Updated: 10/Dec/14  Resolved: 07/Aug/13

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Matt Kangas Assignee: Randolph Tan
Resolution: Duplicate Votes: 0
Labels: buildbot
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

buildbot V2.4 Linux 32-bit
buildbot V2.4 Linux 64-bit
buildbot-special V2.4 Linux 64-bit Subscription RHEL 5.7


Attachments: File chaining.log    
Issue Links:
Duplicate
is duplicated by SERVER-10437 chaining.js failed on Linux 64-bit su... Closed
Operating System: ALL
Participants:

 Description   

V2.4 Linux 32-bit Build #141 Aug 2

http://buildbot.mongodb.org/builders/V2.4%20Linux%2032-bit/builds/141/steps/test_6/logs/stdio
http://buildlogs.mongodb.org/V2.4%20Linux%2032-bit/builds/141/test/replica%20sets/chaining.js

ReplSetTest awaitReplication: checking secondaries against timestamp Timestamp(1374413621, 1)
ReplSetTest awaitReplication: checking secondary #1: bs-linux32.10gen.cc:31001
ReplSetTest awaitReplication: secondary #1, bs-linux32.10gen.cc:31001, is synced
ReplSetTest awaitReplication: checking secondary #2: bs-linux32.10gen.cc:31002
ReplSetTest awaitReplication: timestamp for secondary #2, bs-linux32.10gen.cc:31002, is Timestamp(1374413616, 1) but latest is Timestamp(1374413621, 1)
ReplSetTest awaitReplication: last oplog entry (of 2) for secondary #2, bs-linux32.10gen.cc:31002, is {  "ts" : Timestamp(1374413616, 1),  "h" : NumberLong("8118325673702981649"),  "v" : 2,  "op" : "i",  "ns" : "chaining.foo",  "o" : {  "_id" : ObjectId("51ebe32fa63b195fb036f671"),  "a" : 1 } }
ReplSetTest awaitReplication: secondary #2, bs-linux32.10gen.cc:31002, is NOT synced
 m31002| Sun Jul 21 09:34:11.586 [rsHealthPoll] couldn't connect to bs-linux32.10gen.cc:31010: couldn't connect to server bs-linux32.10gen.cc:31010
 m31002| Sun Jul 21 09:34:11.586 [rsHealthPoll] couldn't connect to bs-linux32.10gen.cc:31010: couldn't connect to server bs-linux32.10gen.cc:31010
 m31002| Sun Jul 21 09:34:11.586 [rsHealthPoll] replset info bs-linux32.10gen.cc:31010 heartbeat failed, retrying
 m31002| Sun Jul 21 09:34:11.587 [rsHealthPoll] couldn't connect to bs-linux32.10gen.cc:31010: couldn't connect to server bs-linux32.10gen.cc:31010
 m31002| Sun Jul 21 09:34:11.587 [rsHealthPoll] couldn't connect to bs-linux32.10gen.cc:31010: couldn't connect to server bs-linux32.10gen.cc:31010
 m31002| Sun Jul 21 09:34:11.587 [rsHealthPoll] couldn't connect to bs-linux32.10gen.cc:31010: couldn't connect to server bs-linux32.10gen.cc:31010
 m31002| Sun Jul 21 09:34:11.625 [rsHealthPoll] couldn't connect to bs-linux32.10gen.cc:31009: couldn't connect to server bs-linux32.10gen.cc:31009
 m31002| Sun Jul 21 09:34:11.626 [rsHealthPoll] couldn't connect to bs-linux32.10gen.cc:31009: couldn't connect to server bs-linux32.10gen.cc:31009
 m31002| Sun Jul 21 09:34:11.629 [rsHealthPoll] replset info bs-linux32.10gen.cc:31009 heartbeat failed, retrying
 m31002| Sun Jul 21 09:34:11.629 [rsHealthPoll] couldn't connect to bs-linux32.10gen.cc:31009: couldn't connect to server bs-linux32.10gen.cc:31009
 m31002| Sun Jul 21 09:34:11.630 [rsHealthPoll] couldn't connect to bs-linux32.10gen.cc:31009: couldn't connect to server bs-linux32.10gen.cc:31009
 m31002| Sun Jul 21 09:34:11.630 [rsHealthPoll] couldn't connect to bs-linux32.10gen.cc:31009: couldn't connect to server bs-linux32.10gen.cc:31009
 m31000| Sun Jul 21 09:34:11.709 [rsHealthPoll] couldn't connect to bs-linux32.10gen.cc:31005: couldn't connect to server bs-linux32.10gen.cc:31005
 m31000| Sun Jul 21 09:34:11.710 [rsHealthPoll] couldn't connect to bs-linux32.10gen.cc:31005: couldn't connect to server bs-linux32.10gen.cc:31005
2013-07-21 09:34:13 EDT	
 m31000| Sun Jul 21 09:34:11.710 [rsHealthPoll] replset info bs-linux32.10gen.cc:31005 heartbeat failed, retrying
 m31000| Sun Jul 21 09:34:11.714 [rsHealthPoll] couldn't connect to bs-linux32.10gen.cc:31005: couldn't connect to server bs-linux32.10gen.cc:31005
 m31000| Sun Jul 21 09:34:11.714 [rsHealthPoll] couldn't connect to bs-linux32.10gen.cc:31005: couldn't connect to server bs-linux32.10gen.cc:31005
 m31000| Sun Jul 21 09:34:11.714 [rsHealthPoll] couldn't connect to bs-linux32.10gen.cc:31005: couldn't connect to server bs-linux32.10gen.cc:31005
Sun Jul 21 09:34:13.370 JavaScript execution failed: [awaiting replication] timed out after 30000ms ( 16 tries ) at src/mongo/shell/utils.js:L490
failed to load: /mnt/slaves/Linux_32bit_V2.4/mongo/jstests/replsets/chaining.js

This has been happening intermittently for a while:

Build #138 July 17
http://buildlogs.mongodb.org/V2.4%20Linux%2032-bit/builds/138/test/replica%20sets/chaining.js

Build #34 July 11
http://buildlogs.mongodb.org/V2.4%20Linux%2032-bit/builds/134/test/replica%20sets/chaining.js

Build #132 July 9
http://buildlogs.mongodb.org/V2.4%20Linux%2032-bit/builds/132/test/replica%20sets/chaining.js

Build #130 failed on chaining.js with a different error

Build #120
http://buildlogs.mongodb.org/V2.4%20Linux%2032-bit/builds/120/test/replica%20sets/chaining.js

Build #117 June 23
http://buildlogs.mongodb.org/V2.4%20Linux%2032-bit/builds/117/test/replica%20sets/chaining.js



 Comments   
Comment by Matt Kangas [ 06/Aug/13 ]

Same timeout, different builders:

buildbot-special V2.4 Linux 64-bit Subscription RHEL 5.7 Build #165 (Tue Aug 6)
http://buildlogs.mongodb.org/V2.4%20Linux%2064-bit%20Subscription%20RHEL%205.7/builds/165/test/replica%20sets/chaining.js

Tue Aug  6 02:36:56.864 [awaiting replication] timed out after 30000ms ( 16 tries ) at src/mongo/shell/utils.js:490
failed to load: /data/buildslaves/Linux_64bit_Subscription_RHEL_57_V2.4/mongo/jstests/replsets/chaining.js


V2.4 Linux 64-bit Build #168 (Tue Aug 6, rev 3fc69c4aaa)
http://buildlogs.mongodb.org/V2.4%20Linux%2064-bit/builds/168/test/replica%20sets/chaining.js

Tue Aug  6 06:37:22.285 [awaiting replication] timed out after 30000ms ( 16 tries ) at src/mongo/shell/utils.js:490
failed to load: /mntfast/data/slave/Linux_64bit_V2.4/mongo/jstests/replsets/chaining.js

Comment by Randolph Tan [ 05/Aug/13 ]

Looks like timing error on test. The

{ configureFailPoint: 'rsChaining[123]' }

triggers another failpoint that will throw socket exception. The problem here is that this failpoint is global and this could cause the socket exception to happen in the wrong place just like the SocketException that happened @ 08:05:08.250.

Comment by Randolph Tan [ 05/Aug/13 ]

It looks like mongobridge for some reason decided to shutdown and close the socket:

// sec2 connects to sec1 through bridge	
 m31002| Wed Jul 17 08:05:08.240 [rsSyncNotifier] replset setting oplog notifier to bs-linux32.10gen.cc:31010
 m31010| Wed Jul 17 08:05:08.240 connection accepted from 10.255.119.66:56131 #4 (1 connection now open)
 m31001| Wed Jul 17 08:05:08.241 [initandlisten] connection accepted from 10.255.119.66:42243 #6 (5 connections now open)
 
// AwaitReplication log msg...
 
// mongobridge socket problem? Note: 2 means SEND_ERROR
 m31001| Wed Jul 17 08:05:08.250 [conn6] SocketException handling request, closing client connection: 9001 socket exception [2] server [10.255.119.66:42243] 
 m31006| Wed Jul 17 08:05:08.251 connection accepted from 10.255.119.66:49617 #6 (1 connection now open)
 m31001| Wed Jul 17 08:05:08.252 [rsGhostSync] handshake between 2 and bs-linux32.10gen.cc:31006
 m31001| Wed Jul 17 08:05:08.254 [conn2] end connection 10.255.119.66:42194 (3 connections now open)
 
// sec2 tries to get response from sec1 through bridge, but gets no response since the socket was already closed (see socket exception [2] above)
// This will cause the bridge to close all sockets and shutdown!
 m31010| Wed Jul 17 08:05:08.250 closing listening socket: 34
 
 m31002| Wed Jul 17 08:05:08.250 [rsSyncNotifier] replset tracking exception: exception: 10278 dbclient error communicating with server: bs-linux32.10gen.cc:31010
 m31002| Wed Jul 17 08:05:08.254 [rsBackgroundSync] replSet db exception in producer: 10278 dbclient error communicating with server: bs-linux32.10gen.cc:31010

Comment by Matt Kangas [ 05/Aug/13 ]

Again:

V2.4 Linux 32-bit Build #159 (Aug 5, rev 392b933)

http://buildbot.10gen.cc/builders/V2.4%20Linux%2032-bit/builds/159/steps/test_6/logs/stdio
http://buildlogs.mongodb.org/V2.4%20Linux%2032-bit/builds/159/test/replica%20sets/chaining.js

Mon Aug  5 05:43:13.081 [awaiting replication] timed out after 30000ms ( 16 tries ) at src/mongo/shell/utils.js:490

Generated at Thu Feb 08 03:23:06 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.