  Core Server / SERVER-30906

Replication timeout within a single replica set.

    • Type: Bug
    • Resolution: Incomplete
    • Priority: Critical - P2
    • Affects Version/s: 3.4.3, 3.4.4, 3.4.7
    • Component/s: Replication
    • Labels: None
    • ALL

      I'm not sure of the best way to report this bug, since it was already discussed in another improvement issue. I'm creating it here because the problem we're facing will not be resolved by that kind of improvement.

      I'm having the same issue as in SERVER-19605 with two different replica sets (not sharded clusters) here.

      One of them is now on version 3.4.4 and the other on 3.4.7 (we upgraded trying to "solve" this problem). Each set consists of one primary, one secondary, and one arbiter.

      One of them handles 7k queries/s plus 7k inserts/s (all-SSD, 10 Gbps network).
      The other handles 5k commands/s, 700 updates/s, and 41 inserts/s (all-SSD, 1 Gbps network).

      At first we thought the number of connections was the cause, because every time our primary reached 5k connections the replication lag began to increase (this happened on both replica sets).
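      For clarity, by "replication lag" I mean the gap between the primary's newest oplog entry and the last entry the secondary has applied. A back-of-envelope check of what we observe (the timestamps below are hypothetical, not from our logs; only the seconds part of a BSON Timestamp matters for lag):

```python
# BSON Timestamps are (seconds-since-epoch, increment) pairs.
# Lag is simply the difference of the seconds fields between the
# primary's latest oplog entry and the secondary's last applied entry.

primary_optime = (1504129588, 12)     # hypothetical: newest entry on the primary
secondary_optime = (1504129528, 245)  # hypothetical: last entry applied on the secondary

lag_seconds = primary_optime[0] - secondary_optime[0]
print(lag_seconds)  # 60 -> the secondary is about a minute behind
```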

      We reduced our applications' connection pools, but the problem still occurs even with 3k connections. We have been using one of these replica sets for more than 3 years and never had this problem before. Maybe the upgrade to 3.4.4+ is the cause?

      We also tried changing many configuration options (connections per host, incoming connections, etc.), but nothing changed.
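      For reference, the settings we touched were along these lines; the values here are illustrative, not our exact production values (mongod.conf fragment):

```yaml
# mongod.conf -- illustrative fragment only, values are not a recommendation
net:
  maxIncomingConnections: 20000   # cap on incoming client connections
setParameter:
  connPoolMaxConnsPerHost: 200    # cap on pooled outgoing connections per host
```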

      One thing we saw in our logs is that the query is issued with a 60-second timeout, but the error is logged with a 5-second timeout:

      2017-08-31T13:46:28.825+0000 I REPL [replication-0] Restarting oplog query due to error: ExceededTimeLimit: Operation timed out, request was RemoteCommand 25693 – target:10.15.187.6:7308 db:local expDate:2017-08-31T13:46:28.825+0000 cmd:{ getMore: 19638055769, collection: "oplog.rs", maxTimeMS: 5000, term: 10, lastKnownCommittedOpTime: { ts: Timestamp 0|0, t: -1 } }. Last fetched optime (with hash): { ts: Timestamp 1504129528000|245, t: 8 }[6958771872602233271]. Restarts remaining: 3

      2017-08-31T13:46:28.825+0000 I REPL [replication-0] Scheduled new oplog query Fetcher source: 10.15.187.6:7308 database: local query: { find: "oplog.rs", filter: { ts: { $gte: Timestamp 1504129528000|245 } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, term: 10 } query metadata: { $replData: 1, $oplogQueryData: 1, $ssm: { $secondaryOk: true } } active: 1 timeout: 65000ms shutting down?: 0 first: 1 firstCommandScheduler: RemoteCommandRetryScheduler request: RemoteCommand 25853 – target:10.15.187.6:7308 db:local cmd:{ find: "oplog.rs", filter: { ts: { $gte: Timestamp 1504129528000|245 } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, term: 10 } active: 1 callbackHandle.valid: 1 callbackHandle.cancelled: 0 attempt: 1 retryPolicy: RetryPolicyImpl maxAttempts: 1 maxTimeMillis: -1ms

      I think this may just be a logging issue, since this ExceededTimeLimit occurs once per minute.
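      My reading of these two log lines, as a simplified sketch (this is only my mental model, not the actual server code): the initial find on oplog.rs carries maxTimeMS 60000, each awaitData getMore carries 5000, and a timeout restarts the query with a limited retry budget ("Restarts remaining: 3"):

```python
FIND_MAX_TIME_MS = 60000      # initial "find" on oplog.rs (seen in the log)
GETMORE_MAX_TIME_MS = 5000    # awaitData getMore timeout (seen in the log)
MAX_RESTARTS = 3              # "Restarts remaining: 3" in the log

def fetch_oplog(get_batch):
    """Simplified oplog-fetcher loop: retry the query on timeout,
    giving up only after MAX_RESTARTS consecutive failures."""
    restarts = MAX_RESTARTS
    fetched = []
    while True:
        try:
            batch = get_batch()    # stands in for one getMore round-trip
        except TimeoutError:
            if restarts == 0:
                raise              # real server would pick a new sync source
            restarts -= 1          # "Restarting oplog query due to error"
            continue
        restarts = MAX_RESTARTS    # any successful batch resets the budget
        if batch is None:
            return fetched
        fetched.extend(batch)

# Toy source: one timeout, then a batch, then end-of-stream.
responses = iter([TimeoutError(), [1, 2, 3], None])
def get_batch():
    r = next(responses)
    if isinstance(r, Exception):
        raise r
    return r

print(fetch_oplog(get_batch))  # [1, 2, 3]
```

      If this model is right, a 5-second getMore timing out whenever the oplog is briefly idle would be harmless and only noisy in the log, which is why I suspect a logging issue rather than a real failure.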

      I don't know how to prevent this replication lag, but I think this isn't only a replication issue; it also involves the primary's performance. We are 100% sure our primary can tolerate our load (3 years to prove it!).

      We can't easily downgrade our replica sets to older versions, and we need at least some kind of workaround. I tried downgrading one of the secondaries to 3.4.3, and replication still times out. We can't downgrade the primary since it is being used in production.

      Thanks.

            Assignee:
            mark.agarunov Mark Agarunov
            Reporter:
            lucasoares Lucas
            Votes:
            0
            Watchers:
            10

              Created:
              Updated:
              Resolved: