[SERVER-3308] Replication Set secondary doesn't restart replication after a network glitch
Created: 21/Jun/11  Updated: 29/Feb/12  Resolved: 02/Sep/11

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 1.8.1
Fix Version/s: None

Type: Bug
Priority: Critical - P2
Reporter: Michael D. Norman
Assignee: Kristina Chodorow (Inactive)
Resolution: Incomplete
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Linux

Operating System: ALL
Participants:

Description

The following errors in the secondary's log show that it lost contact with both of the other replica set members:

Tue Jun 21 08:58:28 [ReplSetHealthPollTask] DBClientCursor::init call() failed
Tue Jun 21 08:58:28 [ReplSetHealthPollTask] replSet info prod-c0-pacmandb2 is down (or slow to respond): DBClientBase::findOne: transport error: prod-c0-pacmandb2 query: { replSetHeartbeat: "pacman", v: 2, pv: 1, checkEmpty: false, from: "lab-c0-pacmandb1.lab" }
Tue Jun 21 08:58:30 [ReplSetHealthPollTask] DBClientCursor::init call() failed
Tue Jun 21 08:58:30 [ReplSetHealthPollTask] replSet info prod-c0-pacmandb1 is down (or slow to respond): DBClientBase::findOne: transport error: prod-c0-pacmandb1 query: { replSetHeartbeat: "pacman", v: 2, pv: 1, checkEmpty: false, from: "lab-c0-pacmandb1.lab" }
Tue Jun 21 08:59:33 [ReplSetHealthPollTask] replSet info prod-c0-pacmandb2 is up
Tue Jun 21 08:59:34 [initandlisten] connection accepted from 10.10.***.***:54941 #1492
Tue Jun 21 08:59:34 [initandlisten] connection accepted from 10.10.***.***:33786 #1493
Tue Jun 21 08:59:36 [ReplSetHealthPollTask] replSet info prod-c0-pacmandb1 is up

The heartbeats recovered about a minute later, but the secondary then failed to replicate for over an hour, and only restarting it seems to have fixed the problem. This was not a master log corruption issue, because the other secondary was syncing just fine.
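
To put numbers on the glitch, the down/up heartbeat messages in a log like the one above can be paired per host. The following is a minimal sketch, not part of the original report: the function name `outage_windows`, the regular expression, and the `year` parameter (the log timestamps carry no year) are all assumptions made for illustration.

```python
import re
from datetime import datetime

# Matches heartbeat status lines such as:
#   Tue Jun 21 08:58:28 [ReplSetHealthPollTask] replSet info <host> is down (or slow to respond): ...
#   Tue Jun 21 08:59:33 [ReplSetHealthPollTask] replSet info <host> is up
HEARTBEAT_RE = re.compile(
    r"^\w{3} (\w{3} +\d+ \d{2}:\d{2}:\d{2}) \[ReplSetHealthPollTask\] "
    r"replSet info (\S+) is (up|down)"
)


def outage_windows(lines, year=2011):
    """Return {host: [(down_time, up_time), ...]} from heartbeat log lines.

    The year is an assumption supplied by the caller, because the log
    timestamps do not include one.
    """
    down_since = {}  # host -> time of the first unmatched "is down" message
    windows = {}     # host -> list of completed (down, up) pairs
    for line in lines:
        match = HEARTBEAT_RE.match(line)
        if not match:
            continue  # skip lines that are not heartbeat status messages
        stamp, host, state = match.groups()
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        if state == "down":
            # Keep the earliest "down" time if the message repeats.
            down_since.setdefault(host, when)
        elif host in down_since:
            windows.setdefault(host, []).append((down_since.pop(host), when))
    return windows
```

Applied to the excerpt above, this gives windows of 65 seconds for prod-c0-pacmandb2 and 66 seconds for prod-c0-pacmandb1, which is short compared with the hour-long replication stall.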



Comments
Comment by Kristina Chodorow (Inactive) [ 02/Sep/11 ]

Please comment if you're still around.

Comment by Kristina Chodorow (Inactive) [ 24/Jun/11 ]

Do you have the whole log for that time period? Did you run rs.status() while it was failing to replicate?
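
For reference, the kind of check that rs.status() output allows can be sketched as below. This is an illustrative assumption rather than anything from the ticket: it takes a dictionary shaped like the replSetGetStatus result (a `members` list whose entries carry `name`, `stateStr`, and `optimeDate`), and the function name and sample hosts are hypothetical.

```python
from datetime import datetime


def secondary_lag_seconds(status):
    """Return {member_name: seconds behind the primary} for each secondary.

    `status` is assumed to look like the replSetGetStatus result, with a
    "members" list whose entries have "name", "stateStr", and "optimeDate".
    """
    members = status["members"]
    primary = next((m for m in members if m.get("stateStr") == "PRIMARY"), None)
    if primary is None:
        return {}  # without a primary there is nothing to compare against
    return {
        m["name"]: (primary["optimeDate"] - m["optimeDate"]).total_seconds()
        for m in members
        if m.get("stateStr") == "SECONDARY"
    }


# Hypothetical sample data, not taken from the ticket.
sample = {
    "members": [
        {"name": "db-a:27017", "stateStr": "PRIMARY",
         "optimeDate": datetime(2011, 6, 21, 10, 0, 0)},
        {"name": "db-b:27017", "stateStr": "SECONDARY",
         "optimeDate": datetime(2011, 6, 21, 9, 59, 58)},
        {"name": "db-c:27017", "stateStr": "SECONDARY",
         "optimeDate": datetime(2011, 6, 21, 8, 58, 0)},
    ]
}
print(secondary_lag_seconds(sample))
```

A secondary whose lag keeps growing while another stays near zero would match the symptom described in this ticket.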
