Type: Improvement
Resolution: Done
Priority: Major - P3
Fix Version/s: 3.4.11, 3.6.0-rc0
Affects Version/s: 2.6.7, 3.0.4
Component/s: Replication
Labels: neweng
Backwards Compatibility: Fully Compatible
Backport Requested: v3.4
Sprint: Repl 2017-10-02
Case:
Linked BF Score: 0
Confidence Status: None
Work Order: 3
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

We just encountered a situation where all secondaries in two of our replica sets had stopped replicating and were 1-2 days behind the primary. This appears to have been caused in part by the initial oplog query from secondary to primary timing out after 30 seconds, while the query itself takes more than 5 minutes to run. Some searching led me to SERVER-6733, where the timeout was reduced from 10 minutes to 30 seconds.

As a workaround, we are building a custom binary with an increased oplog timeout so that the initial oplog query is allowed to complete and so our secondaries have a chance to catch up.

Ideally, this value would be configurable with a flag or configuration option to avoid the need to recompile, and to allow users to customize the timeout for their particular situation.

We have a fairly large oplog:

> db.printReplicationInfo()
configured oplog size:   143477.3826171875MB
log length start to end: 1620689secs (450.19hrs)
oplog first event time:  Wed Jul 08 2015 23:11:24 GMT+0000 (UTC)
oplog last event time:   Mon Jul 27 2015 17:22:53 GMT+0000 (UTC)
now:                     Mon Jul 27 2015 17:22:53 GMT+0000 (UTC)

Here are some sample queries issued by the secondaries that are timing out:

Mon Jul 27 16:32:44.469 [conn5987144] query local.oplog.rs query: { ts: { $gte: Timestamp 1437813467000|94 } } cursorid:1368021807027379 ntoreturn:0 ntoskip:0 nscanned:4205713 nscannedObjects:4205713 keyUpdates:0 numYields:33130 locks(micros) r:38390680 nreturned:101 reslen:25310 1361497ms
Mon Jul 27 16:32:45.037 [conn5987146] query local.oplog.rs query: { ts: { $gte: Timestamp 1437813467000|94 } } cursorid:1368020207769978 ntoreturn:0 ntoskip:0 nscanned:4205713 nscannedObjects:4205713 keyUpdates:0 numYields:33131 locks(micros) r:38186447 nreturned:101 reslen:25310 1362020ms
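
To make the gap between the query time and the timeout concrete, here is a minimal sketch that pulls the duration and scan counts out of a slow-query log line like the ones above and compares them with the 30-second timeout mentioned in the description. The function name `summarize` and the regular expression are illustrative assumptions, not part of MongoDB or its tooling, and the pattern is only meant to fit the log format shown in this ticket.

```python
import re

# Timeout for the initial oplog query, as stated in the issue description.
INITIAL_QUERY_TIMEOUT_S = 30

# Hypothetical pattern for the log format shown in this ticket:
# captures nscanned, nreturned, and the trailing "<millis>ms" duration.
LINE_RE = re.compile(
    r"nscanned:(?P<nscanned>\d+).*?nreturned:(?P<nreturned>\d+).*?\s(?P<millis>\d+)ms$"
)


def summarize(line):
    """Return timing figures for one slow-query log line, or None if it does not match."""
    m = LINE_RE.search(line.strip())
    if m is None:
        return None
    nscanned = int(m.group("nscanned"))
    seconds = int(m.group("millis")) / 1000.0
    return {
        "seconds": seconds,
        "minutes": seconds / 60.0,
        "nscanned": nscanned,
        "nreturned": int(m.group("nreturned")),
        # How many times longer the query ran than the timeout allows.
        "times_over_timeout": seconds / INITIAL_QUERY_TIMEOUT_S,
        "docs_scanned_per_second": nscanned / seconds,
    }


# First sample line from this ticket, copied verbatim.
SAMPLE = (
    "Mon Jul 27 16:32:44.469 [conn5987144] query local.oplog.rs query: "
    "{ ts: { $gte: Timestamp 1437813467000|94 } } cursorid:1368021807027379 "
    "ntoreturn:0 ntoskip:0 nscanned:4205713 nscannedObjects:4205713 keyUpdates:0 "
    "numYields:33130 locks(micros) r:38390680 nreturned:101 reslen:25310 1361497ms"
)

if __name__ == "__main__":
    info = summarize(SAMPLE)
    print("query ran for %.1f minutes" % info["minutes"])
    print("that is %.1fx the %d-second timeout"
          % (info["times_over_timeout"], INITIAL_QUERY_TIMEOUT_S))
```

For the first sample line this works out to roughly 22.7 minutes, about 45 times the 30-second timeout, with around 4.2 million oplog entries scanned to return the first 101.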

is duplicated by

SERVER-27952 Replication fails under heavy load - Oplog timeout should be configurable (Closed)

is related to

SERVER-6733 Make oplog timeout shorter (Closed)
SERVER-26106 Raise oplog socket timeout for rollback (Closed)

related to

SERVER-28005 Oplog query network timeout is less than the maxTimeMs (Closed)
SERVER-38973 Allow configuration of timeouts for getMores on oplog for replication (Backlog)

Assignee: Judah Schvimer
Reporter: Travis Redman
Participants: Asya Kamsky, Daniel Rupp, Eric Milkie, Githook User, Jonathan Kamens, Judah Schvimer, Kelsey Schubert, Lucas, Michael Brenden, Ramon Fernandez Marina, Robert Romano, Spencer Brody, Travis Redman, Ven, Vladimir
Votes: 9
Watchers: 36

Created: Jul 27 2015 05:36:51 PM UTC
Updated: Mar 06 2019 11:49:05 AM UTC
Resolved: Sep 13 2017 04:08:17 PM UTC
