-
Type: Bug
-
Resolution: Incomplete
-
Priority: Critical - P2
-
None
-
Affects Version/s: 3.4.3, 3.4.4, 3.4.7
-
Component/s: Replication
-
None
-
ALL
I don't know how to report this bug since it was discussed in another improvement issue. I will create this here because the problem we're facing will not be resolved with this kind of improvement.
I'm having the same issue as in SERVER-19605 with two different replica sets (not sharded clusters) here.
One of them is now in version 3.4.4 and the other one in 3.4.7 (we updated trying to "solve" this problem). The composition of this set is one primary, one secondary and one arbiter.
One of them holds 7k queries + 7k inserts per second (fully SSD + 10Gbps network)
The other one 5k/s commands, 700 updates/s and 41 inserts/s (fully SSD + 1Gbps network)
Firstly we were thinking the number of connections was the cause because every time our primary reached to 5k connections the replication lag began to increase (this was happening in both of those replica sets).
We changed our applications pools, and now even with 3k connections this problem still occurs. We use one of this replica set from more than 3 years, and we never had this problem before. Maybe the update to 3.4.4+ is the cause?
We also tried to change many configs (like connections per hosts, incoming connections, etc.), nothing change.
One thing we saw in our logging is the query beeing made with 60 seconds timeout but getting error logging a timeout of 5 seconds:
2017-08-31T13:46:28.825+0000 I REPL [replication-0] Restarting oplog query due to error: ExceededTimeLimit: Operation timed out, request was RemoteCommand 25693 – target:10.15.187.6:7308 db:local expDate:2017-08-31T13:46:28.825+0000 cmd:{ getMore: 19638055769, collection: "oplog.rs", maxTimeMS: 5000, term: 10, lastKnownCommittedOpTime: { ts: Timestamp 0|0, t: -1 } }. Last fetched optime (with hash): { ts: Timestamp 1504129528000|245, t: 8 }[6958771872602233271]. Restarts remaining: 3
2017-08-31T13:46:28.825+0000 I REPL [replication-0] Scheduled new oplog query Fetcher source: 10.15.187.6:7308 database: local query: { find: "oplog.rs", filter: { ts: { $gte: Timestamp 1504129528000|245 } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, term: 10 } query metadata: { $replData: 1, $oplogQueryData: 1, $ssm: { $secondaryOk: true } } active: 1 timeout: 65000ms shutting down?: 0 first: 1 firstCommandScheduler: RemoteCommandRetryScheduler request: RemoteCommand 25853 – target:10.15.187.6:7308 db:local cmd:{ find: "oplog.rs", filter: { ts: { $gte: Timestamp 1504129528000|245 } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, term: 10 } active: 1 callbackHandle.valid: 1 callbackHandle.cancelled: 0 attempt: 1 retryPolicy: RetryPolicyImpl maxAttempts: 1 maxTimeMillis: -1ms
I think this maybe is just a logging issue since this ExceededTimeLimit occurs 1 time per minute.
I don't know how to prevent this replication lag, but I think this isn't only a replication issue, but an issue involving the primary performance. We are 100% sure that our primary can tolerate our load (3 years to prove!!).
We can't easily downgrade our replica sets to older versions, and we need at least one kind of workaround. I tried to downgrade one of those secondaries to 3.4.3 and the replication still gets timeout. We can't downgrade the primary since it is beeing used in production.
Thanks.