[SERVER-36678] Member in replica set goes down for an unknown reason Created: 14/Aug/18  Updated: 15/Aug/18  Resolved: 15/Aug/18

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 3.4.7, 3.6.6
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: Bruce Zu Assignee: Nick Brewer
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Zip Archive diagnostic_data.zip     Zip Archive mongodlog.zip    
Participants:

 Description   

One of our secondary members, IP 172-31-66-130, went into state RS_DOWN.

The primary, IP 172-31-54-204, detected that it had become RS_DOWN at 2018-08-04T03:13:54.704+0000:
[ec2-user@ip-172-31-54-204 ~]$  sed -n '/2018-08-04/,/2018-08-05/p' /log/mongod.lo* | grep 66-130 | grep -vEi "Connecting to|Error in heartbeat|Failed to connect"

2018-08-04T03:13:54.703+0000 I ASIO     [NetworkInterfaceASIO-Replication-0] Ending connection to host ip-172-31-66-130.us-west-2.compute.internal:27017 due to bad connection status; 0 connections to that host remain open
2018-08-04T03:13:54.704+0000 I REPL     [replexec-4551] Member ip-172-31-66-130.us-west-2.compute.internal:27017 is now in state RS_DOWN
I stopped and started this secondary member's EC2 host on Aug 6 (PDT).

Before the EC2 instance was stopped and started on Aug 6, the last mongod.log message had been recorded on Aug 4:
2018-08-04T03:08:42.613+0000 I NETWORK [thread110] Successfully connected to ip-172-31-54-204.us-west-2.compute.internal:27017 (17585 connections now open to ip-172-31-54-204.us-west-2.compute.internal:27017 with a 0 second timeout)
2018-08-04T03:08:42.613+0000 I NETWORK [thread110] scoped connection to ip-172-31-54-204.us-west-2.compute.internal:27017 not being returned to the pool
2018-08-06T23:34:03.348+0000 I CONTROL [main] ***** SERVER RESTARTED *****
Nothing was logged between `2018-08-04T03:08:42.613+0000` and the `2018-08-06T23:34:03.348+0000` SERVER RESTARTED entry.
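A silent period like this can be located mechanically instead of by eye. The function below is a hypothetical helper (`find_log_gaps` is not a MongoDB tool) sketched under these assumptions: every relevant line starts with the 3.4/3.6-style `2018-08-04T03:08:42.613+0000` prefix, all lines share the same UTC offset (the offset is ignored), and a POSIX `awk` is available. It prints each pair of consecutive timestamped lines that are more than a given number of seconds apart.

```shell
#!/bin/sh
# find_log_gaps MAX_GAP_SECONDS [LOGFILE...]
# Hypothetical helper (not part of MongoDB): reports consecutive mongod log
# lines whose timestamps are more than MAX_GAP_SECONDS apart.
# Assumes the "YYYY-MM-DDTHH:MM:SS.mmm+0000" prefix and a single UTC offset.
# Reads stdin when no log files are given.
find_log_gaps() {
  max_gap="$1"
  shift
  awk -v max_gap="$max_gap" '
    # Days since 1970-01-01 for a Gregorian date (days-from-civil algorithm).
    function days(y, m, d,    era, yoe, doy, doe) {
      if (m <= 2) y--
      era = int(y / 400)
      yoe = y - era * 400
      doy = int((153 * (m > 2 ? m - 3 : m + 9) + 2) / 5) + d - 1
      doe = yoe * 365 + int(yoe / 4) - int(yoe / 100) + doy
      return era * 146097 + doe - 719468
    }
    # Only lines that begin with an ISO-8601 timestamp are considered.
    /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T[0-9][0-9]:[0-9][0-9]:[0-9][0-9]/ {
      ts = substr($0, 1, 28)
      dn = days(substr(ts, 1, 4) + 0, substr(ts, 6, 2) + 0, substr(ts, 9, 2) + 0)
      now = dn * 86400 + substr(ts, 12, 2) * 3600 + substr(ts, 15, 2) * 60 + substr(ts, 18, 2)
      if (seen && now - prev > max_gap)
        printf "gap of %d s: %s -> %s\n", now - prev, prev_ts, ts
      prev = now
      prev_ts = ts
      seen = 1
    }
  ' "$@"
}
```

For example, `find_log_gaps 600 /log/mongod.lo*` on the secondary (assuming its logs live under the same path as on the primary) would flag the roughly 68-hour gap between `2018-08-04T03:08:42.613+0000` and `2018-08-06T23:34:03.348+0000`. Sub-second precision and the `+0000` offset are deliberately ignored to keep the sketch short.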

 

The mongod.log (mongodlog.zip) and the diagnostic data (diagnostic_data.zip) are attached.

This is the second time our replica set has experienced this issue.

There are no hints to be found in /var/log.

Any help would be appreciated.

Please let me know if I need to provide any other logs or information.

Thank you!

Bruce

 



 Comments   
Comment by Bruce Zu [ 15/Aug/18 ]

Hi Nick,

I am not sure whether I have run into a bug or not, so I posted it here with the diagnostic data.

This is not expected behavior: this secondary suddenly became unreachable and froze, while the other secondary, which has the same configuration, kept working normally.

Anyway, I have posted it to the user group: https://groups.google.com/forum/#!topic/mongodb-user/HH5yY4QwNmc

Best regards,

Bruce

 

Comment by Nick Brewer [ 15/Aug/18 ]

The SERVER project is for reporting bugs or feature suggestions for the MongoDB server. For MongoDB-related support discussion please post on the mongodb-user group or Stack Overflow with the mongodb tag. A question like this involving more discussion would be best posted on the mongodb-user group.

-Nick

Generated at Thu Feb 08 04:43:47 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.