[SERVER-15799] Replication threads stuck in infinite connection loop when there is not enough remote oplog Created: 24/Oct/14  Updated: 14/Apr/15  Resolved: 14/Apr/15

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.6.5
Fix Version/s: None

Type: Bug Priority: Minor - P4
Reporter: Aristarkh Zagorodnikov Assignee: Matt Dannenberg
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: primary.log.bz2, secondary.log.bz2
Operating System: ALL
Participants:

 Description   

One of the secondary members of our replica set had to be taken offline for a prolonged period. Unfortunately, the oplog on the primary was not long enough for a proper recovery. But instead of logging the usual error (oplog too short to recover), the secondary started opening connections to the primary at a rate of about 300 per second until it exhausted local ports (roughly 30K connections to the same remote port, all hanging in TIME_WAIT). It then started losing heartbeats because it could no longer connect to the primary at all, filling the log with these messages:

2014-10-24T17:07:51.063+0400 [rsBackgroundSync] warning: Failed to connect to 10.3.1.12:27032, reason: errno:99 Cannot assign requested address
2014-10-24T17:07:51.064+0400 [rsBackgroundSync] repl: couldn't connect to server d1.s2.fs-temp.drive.bru:27032 (10.3.1.12), connection attempt failed
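One quick way to confirm the local-port exhaustion is to count client sockets stuck in TIME_WAIT toward the primary's port. A minimal sketch in Python, assuming Linux (IPv4 sockets only, via /proc/net/tcp) and the primary port 27032 from the log above:

# Count local sockets in TIME_WAIT whose remote port is the primary's port.
# Minimal sketch; assumes Linux and reads IPv4 sockets only (/proc/net/tcp).
TIME_WAIT = "06"       # TCP state code for TIME_WAIT in /proc/net/tcp
PRIMARY_PORT = 27032   # replication port of the primary, per the log above

def count_time_wait(port=PRIMARY_PORT):
    count = 0
    with open("/proc/net/tcp") as f:
        next(f)  # skip the header line
        for line in f:
            fields = line.split()
            remote_port = int(fields[2].split(":")[1], 16)  # rem_address is hex ip:port
            if fields[3] == TIME_WAIT and remote_port == port:
                count += 1
    return count

if __name__ == "__main__":
    print("TIME_WAIT sockets to port %d: %d" % (PRIMARY_PORT, count_time_wait()))

Once that count approaches the ephemeral port range (~30K here), new outbound connections fail with errno 99 (Cannot assign requested address), which is consistent with the messages above.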



 Comments   
Comment by Ramon Fernandez Marina [ 14/Apr/15 ]

Thanks onyxmaster, closing this ticket now. If you run into this again, please re-open.

Regards,
Ramón.

Comment by Aristarkh Zagorodnikov [ 14/Apr/15 ]

Hi!
Well, we haven't seen this problem in the last few months, so I guess it either got fixed or simply no longer occurs in our environment. I think this case should be closed, since there is no stable repro.

Comment by Matt Dannenberg [ 14/Apr/15 ]

Hey Aristarkh,

Apologies for the long delay in getting back to you. We were unable to reproduce this behavior on our end: in the scenario you describe, the secondary realizes it cannot catch up and does an initial sync. The steps I took when attempting the repro were:

  1. Start a three-node set with a small oplog (5 MB)
  2. Start doing inserts
  3. Take down one secondary
  4. Do more than enough inserts to overwrite the whole oplog
  5. Bring the secondary back up

At that point, the secondary noticed it was too far behind, dropped into RECOVERING, and suggested a docs page about fixing it (a scripted sketch of the steps above follows the log output):

2015-04-14T10:49:54.517-0400 [rsBackgroundSync] replSet syncing to: dannenstation.local:27017
2015-04-14T10:49:54.519-0400 [rsSync] replSet still syncing, not yet to minValid optime 552d2223:33
2015-04-14T10:49:54.519-0400 [rsBackgroundSync] replSet remoteOldestOp:    Apr 14 10:20:19 552d2223:33
2015-04-14T10:49:54.519-0400 [rsBackgroundSync] replSet lastOpTimeFetched: Apr 14 10:14:47 552d20d7:28
2015-04-14T10:49:54.519-0400 [rsBackgroundSync] replSet syncing to: dannenstation.local:27019
2015-04-14T10:49:54.521-0400 [rsBackgroundSync] replSet remoteOldestOp:    Apr 14 10:20:19 552d2223:33
2015-04-14T10:49:54.522-0400 [rsBackgroundSync] replSet lastOpTimeFetched: Apr 14 10:14:47 552d20d7:28
2015-04-14T10:49:54.522-0400 [rsBackgroundSync] replSet error RS102 too stale to catch up, at least from dannenstation.local:27019
2015-04-14T10:49:54.522-0400 [rsBackgroundSync] replSet our last optime : Apr 14 10:14:47 552d20d7:28
2015-04-14T10:49:54.522-0400 [rsBackgroundSync] replSet oldest at dannenstation.local:27019 : Apr 14 10:20:19 552d2223:33
2015-04-14T10:49:54.522-0400 [rsBackgroundSync] replSet See http://dochub.mongodb.org/core/resyncingaverystalereplicasetmember
2015-04-14T10:49:54.523-0400 [rsBackgroundSync] replSet error RS102 too stale to catch up
2015-04-14T10:49:54.523-0400 [rsBackgroundSync] replSet RECOVERING
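For completeness, here is a scripted sketch of the repro steps above, using pymongo. The host names, ports, and replica set name are hypothetical, and it assumes the three members were started with a 5 MB oplog (e.g. --oplogSize 5):

# Sketch of the repro steps above. Hypothetical hosts/ports/set name;
# assumes a three-member replica set whose members were started with --oplogSize 5.
from pymongo import MongoClient

client = MongoClient("mongodb://host1:27017,host2:27018,host3:27019/?replicaSet=rs0")
coll = client.test.padding

def churn_oplog(total_docs=20000):
    # Each insert becomes an oplog entry; ~20K padded 1 KB documents
    # generate far more than 5 MB of oplog and roll it over completely.
    for _ in range(total_docs // 1000):
        coll.insert_many([{"pad": "x" * 1024} for _ in range(1000)])

churn_oplog()                                                  # step 2: baseline inserts
input("Step 3: stop one secondary mongod, then press Enter ")  # take the secondary down
churn_oplog()                                                  # step 4: overwrite the whole oplog
input("Step 5: restart the secondary, then press Enter ")      # bring it back up
# Watch the restarted secondary's log; on 2.6 it should report
# "error RS102 too stale to catch up" and drop into RECOVERING, as shown above.

Any write workload that produces more than 5 MB of oplog while the secondary is down will do; the padded documents just make the overflow quick.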

If this is still an issue for you, is there any more information you can provide to successfully reproduce the problem on our end?

Regards,
Matt

Comment by Aristarkh Zagorodnikov [ 24/Oct/14 ]

Attaching log files. They are large (~880 MB), but I thought I'd better not filter them so you have more information.

Comment by Ramon Fernandez Marina [ 24/Oct/14 ]

onyxmaster, can you please upload logs for the primary and the affected secondary, going back to at least a little before the secondary was taken down?

Comment by Aristarkh Zagorodnikov [ 24/Oct/14 ]

Restarting the secondary multiple times didn't help. While I understand that this (a too-short oplog) is an error condition, it could be handled more gracefully (as it was before).
