[SERVER-15799] Replication threads stuck in infinite connection loop when there is not enough remote oplog Created: 24/Oct/14 Updated: 14/Apr/15 Resolved: 14/Apr/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 2.6.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor - P4 |
| Reporter: | Aristarkh Zagorodnikov | Assignee: | Matt Dannenberg |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Operating System: | ALL |
| Participants: |
| Description |
|
One of the secondary members of our replica set had to be taken offline for a prolonged period. Unfortunately, the oplog on the primary was not long enough for a proper recovery. But instead of logging the usual error message (about the oplog being too short to recover), the secondary started opening connections to the primary at a rate of about 300 per second until it exhausted its local ports (about 30K connections to the same remote port, hanging in TIME_WAIT). It then started losing heartbeats because it could not connect to the primary at all, filling the log with these messages:
|
| Comments |
| Comment by Ramon Fernandez Marina [ 14/Apr/15 ] | |||||||||||||
|
Thanks onyxmaster, closing this ticket now. If you run into this again, please re-open. Regards, | |||||||||||||
| Comment by Aristarkh Zagorodnikov [ 14/Apr/15 ] | |||||||||||||
|
Hi! | |||||||||||||
| Comment by Matt Dannenberg [ 14/Apr/15 ] | |||||||||||||
|
Hey Aristarkh, Apologies for the long delay in getting back to you. We were unable to reproduce this behavior on our end: in the scenario you describe the secondary realizes it cannot catch up and does an initial sync. The steps I took for attempting the repro were:
At that point, the secondary noticed it was too far behind, dropped into recovering, and suggested a docs page about fixing it:
If this is still an issue for you, is there any more information you can provide to successfully reproduce the problem on our end? Regards, | |||||||||||||
| Comment by Aristarkh Zagorodnikov [ 24/Oct/14 ] | |||||||||||||
|
Attaching log files. They are large (~880 MB), but I think I'd better not filter them so that you have more information. | |||||||||||||
| Comment by Ramon Fernandez Marina [ 24/Oct/14 ] | |||||||||||||
|
onyxmaster, can you please upload logs for the primary and the affected secondary going at least as far back as a bit before the secondary was taken down? | |||||||||||||
| Comment by Aristarkh Zagorodnikov [ 24/Oct/14 ] | |||||||||||||
|
Restarting the secondary multiple times didn't help. While I understand that this (short oplog) is an erroneous condition, the handling of it could be more graceful (as it was before).
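
For readers unfamiliar with the "oplog too short" condition discussed in this ticket, here is a minimal sketch of the check involved. It is not MongoDB's implementation: the function name `can_resume_from_oplog` and all timestamps are hypothetical, and it assumes only the general rule that a secondary can resume replication when the sync source's oplog still contains the point the secondary last applied.

```python
from datetime import datetime, timedelta


def can_resume_from_oplog(secondary_last_applied: datetime,
                          source_oldest_entry: datetime) -> bool:
    """Return True if the sync source's oplog still reaches back far enough.

    Hypothetical helper for illustration only: the secondary can catch up
    when its last applied operation is not older than the oldest entry
    still present in the sync source's oplog.
    """
    return secondary_last_applied >= source_oldest_entry


# Hypothetical timestamps: the primary's oplog covers the last 24 hours.
now = datetime(2014, 10, 24, 12, 0, 0)
primary_oldest = now - timedelta(hours=24)

short_outage = now - timedelta(hours=2)   # secondary was down for 2 hours
long_outage = now - timedelta(days=3)     # secondary was down for 3 days

print(can_resume_from_oplog(short_outage, primary_oldest))  # True
print(can_resume_from_oplog(long_outage, primary_oldest))   # False
```

In the second case the secondary cannot catch up from the oplog alone and would need a resync, which is the situation the reporter expected to see logged as an error.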