[SERVER-30906] Replication timeout between single replica set. Created: 31/Aug/17  Updated: 16/Nov/17  Resolved: 20/Oct/17

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.4.3, 3.4.4, 3.4.7
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Lucas Assignee: Mark Agarunov
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Participants:

 Description   

I'm not sure how best to report this bug, since it was already discussed in another improvement issue. I'm creating this ticket because the problem we're facing will not be resolved by that kind of improvement.

I'm having the same issue as in SERVER-19605 with two different replica sets (not sharded clusters).

One of them is now on version 3.4.4 and the other on 3.4.7 (we upgraded trying to "solve" this problem). Each replica set consists of one primary, one secondary and one arbiter.

One of them handles 7k queries + 7k inserts per second (all SSD + 10 Gbps network).
The other handles 5k commands/s, 700 updates/s and 41 inserts/s (all SSD + 1 Gbps network).

At first we thought the number of connections was the cause, because every time our primary reached 5k connections the replication lag began to increase (this was happening in both replica sets).

We changed our application connection pools, and even with 3k connections the problem still occurs. We have been using one of these replica sets for more than 3 years and never had this problem before. Maybe the upgrade to 3.4.4+ is the cause?
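
As an aside, this is a minimal shell sketch for correlating the connection count with the clients holding those connections; it only uses the standard serverStatus/currentOp output, and the per-client grouping is just an illustration:

    // On the primary: open incoming connections right now
    db.serverStatus().connections
    // => { "current": ..., "available": ..., "totalCreated": ... }

    // Group open connections by client address (db.currentOp(true) includes idle connections)
    var byClient = {};
    db.currentOp(true).inprog.forEach(function (op) {
        if (op.client) {
            var host = op.client.split(":")[0];
            byClient[host] = (byClient[host] || 0) + 1;
        }
    });
    printjson(byClient);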

We also tried changing many configuration options (connections per host, incoming connections, etc.), but nothing changed.

One thing we saw in our logs is the oplog query being issued with a 60-second timeout but the error reporting a 5-second timeout:

2017-08-31T13:46:28.825+0000 I REPL [replication-0] Restarting oplog query due to error: ExceededTimeLimit: Operation timed out, request was RemoteCommand 25693 -- target:10.15.187.6:7308 db:local expDate:2017-08-31T13:46:28.825+0000 cmd:{ getMore: 19638055769, collection: "oplog.rs", maxTimeMS: 5000, term: 10, lastKnownCommittedOpTime: { ts: Timestamp 0|0, t: -1 } }. Last fetched optime (with hash): { ts: Timestamp 1504129528000|245, t: 8 }[6958771872602233271]. Restarts remaining: 3

2017-08-31T13:46:28.825+0000 I REPL [replication-0] Scheduled new oplog query Fetcher source: 10.15.187.6:7308 database: local query: { find: "oplog.rs", filter: { ts: { $gte: Timestamp 1504129528000|245 } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, term: 10 } query metadata: { $replData: 1, $oplogQueryData: 1, $ssm: { $secondaryOk: true } } active: 1 timeout: 65000ms shutting down?: 0 first: 1 firstCommandScheduler: RemoteCommandRetryScheduler request: RemoteCommand 25853 -- target:10.15.187.6:7308 db:local cmd:{ find: "oplog.rs", filter: { ts: { $gte: Timestamp 1504129528000|245 } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, term: 10 } active: 1 callbackHandle.valid: 1 callbackHandle.cancelled: 0 attempt: 1 retryPolicy: RetryPolicyImpl maxAttempts: 1 maxTimeMillis: -1ms

Maybe this is just a logging issue, since this ExceededTimeLimit occurs about once per minute.
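
For anyone who wants to see the two operations side by side, this is a minimal shell sketch mirroring the commands in the log above (the timestamp values are placeholders adapted from the log, and the internal 'term' field the fetcher sends is omitted): the initial find carries maxTimeMS: 60000, while the getMore on the resulting awaitData cursor is the call that carries the 5000 ms limit reported in the error.

    // Initial oplog query, as scheduled by the fetcher (60 s maxTimeMS)
    var local = db.getSiblingDB("local");
    var res = local.runCommand({
        find: "oplog.rs",
        filter: { ts: { $gte: Timestamp(1504129528, 245) } },   // placeholder timestamp
        tailable: true,
        oplogReplay: true,
        awaitData: true,
        maxTimeMS: 60000
    });

    // Follow-up batch on the same cursor, which is where the 5 s limit appears
    local.runCommand({
        getMore: res.cursor.id,          // cursor id returned by the find above
        collection: "oplog.rs",
        maxTimeMS: 5000
    });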

I don't know how to prevent this replication lag, but I think this isn't only a replication issue; it also involves the primary's performance. We are 100% sure that our primary can tolerate our load (3 years of production prove it!).
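
As a side note, the lag itself can be tracked from the shell with the standard replica-set helpers; a minimal sketch (nothing here is specific to this deployment):

    // Human-readable report of how far each secondary is behind
    rs.printSlaveReplicationInfo();

    // Or compute the lag in seconds from rs.status()
    var s = rs.status();
    var primary = s.members.filter(function (m) { return m.stateStr === "PRIMARY"; })[0];
    s.members.forEach(function (m) {
        if (m.stateStr === "SECONDARY") {
            print(m.name + " lag: " + (primary.optimeDate - m.optimeDate) / 1000 + " s");
        }
    });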

We can't easily downgrade our replica sets to older versions, and we need at least some kind of workaround. I tried downgrading one of the secondaries to 3.4.3 and replication still times out. We can't downgrade the primary since it is being used in production.

Thanks.



 Comments   
Comment by Lucas [ 24/Oct/17 ]

Hello mark.agarunov. We were able to manage this problem by reading from our secondaries until we can shard this particular replica set. Our team believes the number of connections (4~5k) could be the problem, since with a lower connection count (2k) the replication timeouts stop. But I don't believe MongoDB can't manage thousands of connections.
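
For reference, a minimal sketch of that workaround, i.e. sending reads to the secondaries; the collection name, filter and hosts below are placeholders:

    // In the mongo shell: prefer secondaries for this connection's reads
    db.getMongo().setReadPref("secondaryPreferred");
    db.mycollection.find({ status: "active" });   // placeholder collection/filter

    // Equivalent driver-side setting via the connection string:
    // mongodb://host1:7308,host2:7308/mydb?replicaSet=rs0&readPreference=secondaryPreferred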

Unfortunately I do not know the real cause of the replication lag, and since we tried a bunch of things to fix the problem, I no longer have access to those log files, sorry.

If this happens again I will provide you the logs.

Sorry for the delay.

Thanks.

Comment by Mark Agarunov [ 20/Oct/17 ]

Hello lucasoares,

We haven’t heard back from you for some time, so I’m going to mark this ticket as resolved. If this is still an issue for you, please provide additional information and we will reopen the ticket.

Thanks,
Mark

Comment by Kelsey Schubert [ 29/Sep/17 ]

Hi lucasoares,

We still need additional information to diagnose the problem. If this is still an issue for you, would you please provide the logs Ramón requested?

Thank you,
Kelsey

Comment by Ramon Fernandez Marina [ 09/Sep/17 ]

Hi lucasoares,

SERVER-19605 is for addressing timeouts in establishing the initial oplog cursor (i.e. timeouts of the 'find' command), whereas it looks like your issue is replication lag, possibly coupled with timeouts of the 'getMore' operations on a pre-existing oplog cursor.

Can you please attach log files from both your primary and the secondary experiencing the issue so we can take a look?

Thanks,
Ramón.
