[SERVER-11190] Flaky initial_sync1 Created: 15/Oct/13  Updated: 10/Dec/14  Resolved: 17/Dec/13

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Shaun Verch Assignee: Matt Kangas
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File linux_64_debug_duroff_1339_test_replicasets_0_initial_sync1_fail.txt     Text File linux_64_debug_duroff_1340_test_replicasets_0_initial_sync1_success.txt    
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   

Had a failure here: 9fcdfeb195c4 Linux 64-bit DEBUG DUR OFF replicasets that does not seem to be correlated to the surrounding commits



 Comments   
Comment by Shaun Verch [ 15/Oct/13 ]

Notes:

Success:
d1c2ae72c482 Linux 64-bit DEBUG DUR OFF replicasets
Attachment: linux_64_debug_duroff_1340_test_replicasets_0_initial_sync1_success.txt

Failure:
9fcdfeb195c4 Linux 64-bit DEBUG DUR OFF replicasets
Attachment: linux_64_debug_duroff_1339_test_replicasets_0_initial_sync1_fail.txt

Observations:

9. Bring #2 back up
ReplSetTest n is : 1
 m31000| 2013-10-14T21:28:48.147+0000 [rsHealthPoll] DBClientCursor::init call() failed
 m31000| 2013-10-14T21:28:48.147+0000 [rsHealthPoll] dev: lastError==0 won't report:DBClientBase::findN: transport error: ip-10-142-147-116:31001 ns: admin.$cmd query: { replSetHeartbeat: "jstests_initsync1", v: 2, pv: 1, checkEmpty: false, from: "ip-10-142-147-116:31000" }
 m31000| 2013-10-14T21:28:48.147+0000 [rsHealthPoll] replset info ip-10-142-147-116:31001 heartbeat failed, retrying
 m31000| 2013-10-14T21:28:48.148+0000 [rsHealthPoll] warning: Failed to connect to 10.142.147.116, reason: errno:111 Connection refused
 m31000| 2013-10-14T21:28:48.148+0000 [rsHealthPoll] unable to set SO_RCVTIMEO
 m31000| 2013-10-14T21:28:48.149+0000 [rsHealthPoll] unable to set SO_SNDTIMEO
 m31000| 2013-10-14T21:28:48.149+0000 [rsHealthPoll] replSet info ip-10-142-147-116:31001 is down (or slow to respond):
 m31000| 2013-10-14T21:28:48.149+0000 [rsHealthPoll] replSet member ip-10-142-147-116:31001 is now in state DOWN
 m31000| 2013-10-14T21:28:48.946+0000 [rsHealthPoll] replSet member ip-10-142-147-116:31002 is now in state SECONDARY
 m31000| 2013-10-14T21:28:49.394+0000 [conn5] end connection 10.142.147.116:58325 (4 connections now open)
 m31000| 2013-10-14T21:28:50.149+0000 [rsHealthPoll] unable to set SO_RCVTIMEO
 m31000| 2013-10-14T21:28:50.149+0000 [rsHealthPoll] unable to set SO_SNDTIMEO
 m31000| 2013-10-14T21:28:50.151+0000 [rsHealthPoll] warning: Failed to connect to 10.142.147.116, reason: errno:111 Connection refused
 m31000| 2013-10-14T21:28:50.151+0000 [rsHealthPoll] replset info ip-10-142-147-116:31001 heartbeat failed, retrying
ReplSetTest n: 1 ports: [ 31000, 31001 ]    31001 number
 m31000| 2013-10-14T21:28:50.151+0000 [rsHealthPoll] unable to set SO_RCVTIMEO
 m31000| 2013-10-14T21:28:50.151+0000 [rsHealthPoll] unable to set SO_SNDTIMEO
 m31000| 2013-10-14T21:28:50.156+0000 [rsHealthPoll] warning: Failed to connect to 10.142.147.116, reason: errno:111 Connection refused
 m31000| 2013-10-14T21:28:50.156+0000 [rsHealthPoll] unable to set SO_RCVTIMEO
 m31000| 2013-10-14T21:28:50.156+0000 [rsHealthPoll] unable to set SO_SNDTIMEO
 m31000| 2013-10-14T21:28:52.156+0000 [rsHealthPoll] unable to set SO_RCVTIMEO

The test is supposed to kill the node on port 31001 and bring it back up, but we start to see a lot of socket errors after we bring the node up. We also see from the replica set status output that the node is up, but can't see the other two nodes:

{
    "set" : "jstests_initsync1",
    "date" : ISODate("2013-10-14T21:37:10Z"),
    "myState" : 2,
    "members" : [
        {
            "_id" : 0,
            "name" : "ip-10-142-147-116:31000",
            "health" : 0,
            "state" : 8,
            "stateStr" : "(not reachable/healthy)",
            "uptime" : 0,
            "optime" : Timestamp(0, 0),
            "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
            "lastHeartbeat" : ISODate("2013-10-14T21:37:03Z"),
            "lastHeartbeatRecv" : ISODate("1970-01-01T00:00:00Z"),
            "pingMs" : 0
        },
        {
            "_id" : 1,
            "name" : "ip-10-142-147-116:31001",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 282,
            "optime" : Timestamp(1381786108, 1),
            "optimeDate" : ISODate("2013-10-14T21:28:28Z"),
            "self" : true
        },
        {
            "_id" : 2,
            "name" : "ip-10-142-147-116:31002",
            "health" : 0,
            "state" : 8,
            "stateStr" : "(not reachable/healthy)",
            "uptime" : 0,
            "optime" : Timestamp(0, 0),
            "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
            "lastHeartbeat" : ISODate("2013-10-14T21:37:05Z"),
            "lastHeartbeatRecv" : ISODate("1970-01-01T00:00:00Z"),
            "pingMs" : 0
        }
    ],
    "ok" : 1
}

We don't see either of these symptoms in the successful test.

Generated at Thu Feb 08 03:25:09 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.