[SERVER-30906] Replication timeout between single replica set. Created: 31/Aug/17 Updated: 16/Nov/17 Resolved: 20/Oct/17 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 3.4.3, 3.4.4, 3.4.7 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Lucas | Assignee: | Mark Agarunov |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Operating System: | ALL |
| Participants: |
| Description |
|
I don't know how to report this bug since it was discussed in another improvement issue. I will create this here because the problem we're facing will not be resolved with this kind of improvement. I'm having the same issue as in One of them is now in version 3.4.4 and the other one in 3.4.7 (we updated trying to "solve" this problem). The composition of this set is one primary, one secondary and one arbiter. One of them holds 7k queries + 7k inserts per second (fully SSD + 10Gbps network) Firstly we were thinking the number of connections was the cause because every time our primary reached to 5k connections the replication lag began to increase (this was happening in both of those replica sets). We changed our applications pools, and now even with 3k connections this problem still occurs. We use one of this replica set from more than 3 years, and we never had this problem before. Maybe the update to 3.4.4+ is the cause? We also tried to change many configs (like connections per hosts, incoming connections, etc.), nothing change. One thing we saw in our logging is the query beeing made with 60 seconds timeout but getting error logging a timeout of 5 seconds:
I think this maybe is just a logging issue since this ExceededTimeLimit occurs 1 time per minute. I don't know how to prevent this replication lag, but I think this isn't only a replication issue, but an issue involving the primary performance. We are 100% sure that our primary can tolerate our load (3 years to prove!!). We can't easily downgrade our replica sets to older versions, and we need at least one kind of workaround. I tried to downgrade one of those secondaries to 3.4.3 and the replication still gets timeout. We can't downgrade the primary since it is beeing used in production. Thanks. |
| Comments |
| Comment by Lucas [ 24/Oct/17 ] |
|
Hello mark.agarunov. We were able to manage this problem by using our secondaries to read until we shard this particular replica set. Our team belive the number of connections (4 ~ 5k) could be the problem, since when we get a lower connection number (2k) the timeout to replicate stops. But I don't belive that MongoDB can't manage thousand of connections. Unfortunately I do not know what the real cause of the replication lag and since we tried to bunch of things to fix the problem, I do not have access to these log files, sorry. If this happens again I will provide you guys the logging. Sorry for the delay. Thanks. |
| Comment by Mark Agarunov [ 20/Oct/17 ] |
|
Hello lucasoares, We haven’t heard back from you for some time, so I’m going to mark this ticket as resolved. If this is still an issue for you, please provide additional information and we will reopen the ticket. Thanks, |
| Comment by Kelsey Schubert [ 29/Sep/17 ] |
|
Hi lucasoares, We still need additional information to diagnose the problem. If this is still an issue for you, would you please provide the logs Ramón requested? Thank you, |
| Comment by Ramon Fernandez Marina [ 09/Sep/17 ] |
|
Hi lucasoares,
Can you please attach log files from both your primary and the secondary experiencing the issue so we can take a look? Thanks, |