[SERVER-6733] Make oplog timeout shorter Created: 08/Aug/12  Updated: 06/Feb/17  Resolved: 11/Jun/13

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 2.2.5, 2.3.2

Type: Bug Priority: Major - P3
Reporter: Kristina Chodorow (Inactive) Assignee: Eric Milkie
Resolution: Done Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
related to SERVER-4758 OplogReader has no socket timeout Closed
related to SERVER-19605 Oplog timeout should be configurable Closed
is related to SERVER-6300 Node 2 is master but attempts to repl... Closed
is related to DOCS-1002 Mention oplog timeout Closed
is related to SERVER-9707 Make oplog timeout configurable Closed
Operating System: ALL
Participants:

 Comments   
Comment by Eric Milkie [ 06/Feb/17 ]

Hi michaelbrenden
I'm sorry, but I don't think this ticket has anything to do with the problem you're experiencing (other than the words "oplog timeout" in the title). The code from 3.5 years ago has since been completely rewritten.

Comment by Michael Brenden [ 06/Feb/17 ]

The problem still occurs on 3.4.2 (Feb 2017), with the oplog timeout being far too short and causing the secondary to fail; see also SERVER-19605.

Comment by auto [ 13/Jun/13 ]

Author: Eric Milkie (milkie) <milkie@10gen.com>

Message: SERVER-6733 lower oplog socket timeout from 10 minutes to 30 seconds
Branch: v2.2
https://github.com/mongodb/mongo/commit/5a3244c1b1424f4fbeb006f501c450074f4127f2

Comment by Juho Mäkinen [ 16/May/13 ]

I've made a new issue on this: https://jira.mongodb.org/browse/SERVER-9707

Comment by Jalmari Raippalinna [ 15/May/13 ]

This actually causes problems for us, because we use an extended oplog (oplogSize=30000) with many operations.

On startup, the replica set secondary first does this:

Wed May 15 14:19:39.972 [rsBackgroundSync] repl: local.oplog.rs.find({ ts: { $gte: Timestamp 1368611880000|19631 } })

30s later we see

Wed May 15 14:20:09.972 [rsBackgroundSync] Socket recv() timeout   :27017
Wed May 15 14:20:09.972 [rsBackgroundSync] SocketException: remote:  :27017 error: 9001 socket exception [3] server [ ] 
Wed May 15 14:20:09.972 [rsBackgroundSync] DBClientCursor::init call() failed

On the replica where the sync was attempted, this comes up a bit later:

Wed May 15 14:20:30.114 [conn1343346] query local.oplog.rs query: { ts: { $gte: Timestamp 1368611880000|19631 } } cursorid:16379302284139893 ntoreturn:0 ntoskip:0 nscanned:102 keyUpdates:0 numYields: 18820 locks(micros) r:3724703 nreturned:101 reslen:19184 80124ms
Wed May 15 14:20:30.114 [conn1343346] SocketException handling request, closing client connection: 9001 socket exception [2] server [ ] 

Because it takes 80 seconds for the oplog query to respond, we seem to have intermittent problems on the server where our backups are made.

Is there anything we can do other than compile our own version with a longer timeout?

Our oplog has 91M entries currently, which might be the problem here.
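
As a rough check of the numbers in the log excerpts above, the following Python sketch compares how long the secondary waited before its socket timed out with how long the oplog query reportedly took on the other member. The helper names are hypothetical and not part of MongoDB, and the sketch assumes both secondary log lines fall on the same day, so only the time-of-day field is parsed.

```python
import re
from datetime import datetime


def time_of_day(log_line):
    """Parse the HH:MM:SS.mmm field, the fourth token of a log line."""
    return datetime.strptime(log_line.split()[3], "%H:%M:%S.%f")


def duration_seconds(log_line):
    """Read the trailing '<n>ms' duration of a slow-query log line."""
    match = re.search(r"(\d+)ms\s*$", log_line)
    return int(match.group(1)) / 1000.0


# Lines abbreviated from the excerpts quoted in this comment.
find_sent = "Wed May 15 14:19:39.972 [rsBackgroundSync] repl: local.oplog.rs.find(...)"
recv_timeout = "Wed May 15 14:20:09.972 [rsBackgroundSync] Socket recv() timeout"
slow_query = "Wed May 15 14:20:30.114 [conn1343346] query local.oplog.rs ... 80124ms"

# How long the secondary waited before giving up on the socket.
client_wait = (time_of_day(recv_timeout) - time_of_day(find_sent)).total_seconds()
# How long the query took according to the slow-query log line.
query_time = duration_seconds(slow_query)

print(f"secondary gave up after {client_wait:.3f} s")
print(f"oplog query took {query_time:.3f} s")
print(f"query overran the timeout by {query_time - client_wait:.3f} s")
```

Under these assumptions the secondary gives up after 30 seconds while the query needs a little over 80 seconds, so the request cannot complete within the new timeout.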

Comment by auto [ 08/Dec/12 ]

Author: Eric Milkie <milkie@10gen.com>, 2012-12-07T20:22:52Z

Message: SERVER-6733 lower oplog socket timeout from 10 minutes to 30 seconds
Branch: master
https://github.com/mongodb/mongo/commit/c5b21478da310c15fca13f1b69e93e83418796ea

Comment by Yuri Finkelstein [ 26/Oct/12 ]

Here is a real-life example showing why this timeout should be short. We had a replica set which was syncing in a chained manner:

Primary <--- Secondary1 <--- Secondary2 <--- Secondary3, etc.

Secondary1 had a hard failure. Secondary2 took 10 minutes to detect it. Because of this, all remaining secondaries immediately started to have replication lag. As a result, the client calls to master with getLastError (w=2, timeout=4000) immediately started to fail.

This is actually a serious bug.
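
To put rough numbers on the scenario above, here is a deliberately simplified Python sketch. It assumes that a write issued after the failure can only be acknowledged once the downstream secondary has detected the dead sync source, and that detection takes one full socket timeout; catch-up time, heartbeats, and sync-source reselection are ignored. The function name and the model itself are illustrative assumptions, not MongoDB behavior taken from the source code.

```python
def failing_write_window(detection_seconds, write_timeout_seconds):
    """Seconds after the failure during which newly issued writes time out.

    Simplified model: a write issued t seconds after the failure is
    acknowledged at about detection_seconds, so it fails whenever
    detection_seconds - t exceeds the write timeout.
    """
    return max(0.0, detection_seconds - write_timeout_seconds)


WRITE_TIMEOUT = 4.0   # getLastError timeout=4000 (ms) from the comment above
OLD_TIMEOUT = 600.0   # 10-minute socket timeout, the observed detection time
NEW_TIMEOUT = 30.0    # 30-second socket timeout introduced by this ticket

old_window = failing_write_window(OLD_TIMEOUT, WRITE_TIMEOUT)
new_window = failing_write_window(NEW_TIMEOUT, WRITE_TIMEOUT)

print(f"10-minute timeout: writes fail for about {old_window:.0f} s")
print(f"30-second timeout: writes fail for about {new_window:.0f} s")
```

In this model the window of failing writes shrinks from roughly 596 seconds to roughly 26 seconds, which is the motivation for a shorter timeout.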

Comment by Kristina Chodorow (Inactive) [ 08/Aug/12 ]

This is basically a followup to SERVER-4758: now that the network thread is separate from the replication thread, a short oplog timeout might work better (and would be nice for users).

Generated at Thu Feb 08 03:12:33 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.