[SERVER-9707] Make oplog timeout configurable Created: 16/May/13  Updated: 06/Feb/17  Resolved: 07/Mar/14

Status: Closed
Project: Core Server
Component/s: Networking, Replication
Affects Version/s: 2.4.3
Fix Version/s: None

Type: Improvement Priority: Minor - P4
Reporter: Juho Mäkinen Assignee: Unassigned
Resolution: Incomplete Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Amazon EC2 m2.4xlarge instances with 2.6.18-308.16.1.el5.centos.plusxen kernel


Issue Links:
Related
related to SERVER-6733 Make oplog timeout shorter Closed
related to SERVER-10362 yielding during read queries waiting ... Closed
Backwards Compatibility: Fully Compatible
Participants:

 Description   

Task https://jira.mongodb.org/browse/SERVER-6733 changed oplog timeout from 10 minutes to 30 seconds. We have ran into a situation in our environment where some of the oplog queries take as long as 80 seconds, which will break replication on slaves.

Our environment uses an oplog of 30GB (oplogSize=30000), which currently contains 96 million entries, covering just about seven hours of oplog.

This issue was found by the following log messages: The slave reports this log message upon startup:

Wed May 15 14:19:39.972 [rsBackgroundSync] repl: local.oplog.rs.find({ ts: { $gte: Timestamp 1368611880000|19631 } })

30 seconds later we see:

Wed May 15 14:20:09.972 [rsBackgroundSync] Socket recv() timeout   :27017
Wed May 15 14:20:09.972 [rsBackgroundSync] SocketException: remote:  :27017 error: 9001 socket exception [3] server [ ] 
Wed May 15 14:20:09.972 [rsBackgroundSync] DBClientCursor::init call() failed

On master we see this a bit later:

Wed May 15 14:20:30.114 [conn1343346] query local.oplog.rs query: { ts: { $gte: Timestamp 1368611880000|19631 } } cursorid:16379302284139893 ntoreturn:0 ntoskip:0 nscanned:102 keyUpdates:0 numYields: 18820 locks(micros) r:3724703 nreturned:101 reslen:19184 80124ms
Wed May 15 14:20:30.114 [conn1343346] SocketException handling request, closing client connection: 9001 socket exception [2] server [ ] 

As we can see the oplog query will timeout. This also makes the slave very unresponsive, so all other instances thinks that the slave is down. This can be seen in the log with messages like "host1 thinks that we are down" rendering the slave completely useless.

I confirmed the bug by compiling my own mongodb server where I modified the timeout back to 10 minutes, which solved all these problems.

I propose that we add a configure option for increasing the oplog timeout and also a separated warning message which tells if the oplog query takes longer than is expected.



 Comments   
Comment by Eric Milkie [ 06/Feb/17 ]

Hi michaelbrenden
I don't think your issue is related to this ticket, which was filed 3.5 years ago and concerns code that is much different from today.
Additionally, this ticket was problematically prescriptive, without a full understanding of the underlying issue.

Comment by Michael Brenden [ 06/Feb/17 ]

Still experiencing this issue in 3.4.2 / Feb 2017 – see SERVER-19605

Comment by Juho Mäkinen [ 26/Jan/14 ]

it very well might be. I need to gather some data on this, but we stopped using our patched version (which increased the oplog timeout) some time ago (by mistake) and we didn't notice that this bug would have reappeared. I'll check if we have gathered the deployed mongodb version into our zabbix history and try to correlate things out. I'll come back to this issue and maybe we can just close this issue altogether.

Comment by Daniel Pasette (Inactive) [ 26/Jan/14 ]

garo, I'm wondering if you were impacted by this issue: SERVER-10362, fixed in 2.4.6.

Comment by Eric Milkie [ 11/Jun/13 ]

It's unlikely that a shorter read socket timeout would have the effect of making the slave unresponsive as you describe. The heartbeat commands are running in separate threads and use no database locks, so they should be unaffected by a slow oplog query by the replication network thread. It's possible that it has a different cause.
We will consider making the socket read timeout configurable.

Generated at Thu Feb 08 03:21:14 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.