Type: Improvement
Resolution: Incomplete
Priority: Minor - P4
Fix Version/s: None
Affects Version/s: 2.4.3
Component/s: Networking, Replication
Labels: None
Environment: Amazon EC2 m2.4xlarge instances with 2.6.18-308.16.1.el5.centos.plusxen kernel
Backwards Compatibility: Fully Compatible
Confidence Status: None
Work Order: 3
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

SERVER-6733 (https://jira.mongodb.org/browse/SERVER-6733) changed the oplog timeout from 10 minutes to 30 seconds. We have run into a situation in our environment where some oplog queries take as long as 80 seconds, which breaks replication on slaves.

Our environment uses a 30 GB oplog (oplogSize=30000), which currently contains 96 million entries covering just about seven hours.
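To put those numbers in perspective, here is a rough back-of-the-envelope calculation. It is only a sketch: it assumes the oplog is completely full, that oplogSize is counted in units of 1024 * 1024 bytes, and that the window is exactly seven hours. The variable names are made up for illustration.

```python
# Figures taken from the description above.
OPLOG_SIZE_MB = 30000        # oplogSize=30000
OPLOG_ENTRIES = 96_000_000   # entries currently in the oplog
OPLOG_WINDOW_HOURS = 7       # "just about seven hours" (approximate)

# Assumption: one oplogSize unit is 1024 * 1024 bytes and the oplog is full.
oplog_bytes = OPLOG_SIZE_MB * 1024 * 1024

# Average write rate and average entry size implied by the figures.
entries_per_second = OPLOG_ENTRIES / (OPLOG_WINDOW_HOURS * 3600)
avg_entry_bytes = oplog_bytes / OPLOG_ENTRIES

print(f"{entries_per_second:.0f} entries/s, {avg_entry_bytes:.0f} bytes/entry")
```

Under these assumptions the oplog takes on the order of a few thousand small entries per second, so a query that has to walk a large part of it touches a lot of documents.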

We found the issue through the following log messages. The slave logs this message on startup:

Wed May 15 14:19:39.972 [rsBackgroundSync] repl: local.oplog.rs.find({ ts: { $gte: Timestamp 1368611880000|19631 } })

30 seconds later we see:

Wed May 15 14:20:09.972 [rsBackgroundSync] Socket recv() timeout   :27017
Wed May 15 14:20:09.972 [rsBackgroundSync] SocketException: remote:  :27017 error: 9001 socket exception [3] server [ ] 
Wed May 15 14:20:09.972 [rsBackgroundSync] DBClientCursor::init call() failed

On master we see this a bit later:

Wed May 15 14:20:30.114 [conn1343346] query local.oplog.rs query: { ts: { $gte: Timestamp 1368611880000|19631 } } cursorid:16379302284139893 ntoreturn:0 ntoskip:0 nscanned:102 keyUpdates:0 numYields: 18820 locks(micros) r:3724703 nreturned:101 reslen:19184 80124ms
Wed May 15 14:20:30.114 [conn1343346] SocketException handling request, closing client connection: 9001 socket exception [2] server [ ]

As we can see, the oplog query times out. This also makes the slave very unresponsive, so all other instances think that the slave is down. This shows up in the log as messages like "host1 thinks that we are down", rendering the slave completely useless.
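The timing can be read straight off the log lines quoted above. The following is a small sketch that does so; the helper name parse_log_time is hypothetical, and the script only assumes the timestamp layout visible in these particular lines.

```python
import re
from datetime import datetime

def parse_log_time(line: str) -> datetime:
    # Skip the weekday token. The log has no year, so strptime falls back
    # to 1900, which is fine for differences within the same day.
    month, day, clock = line.split()[1:4]
    return datetime.strptime(f"{month} {day} {clock}", "%b %d %H:%M:%S.%f")

# Lines copied verbatim from the logs above.
slave_query = "Wed May 15 14:19:39.972 [rsBackgroundSync] repl: local.oplog.rs.find({ ts: { $gte: Timestamp 1368611880000|19631 } })"
slave_failure = "Wed May 15 14:20:09.972 [rsBackgroundSync] DBClientCursor::init call() failed"
master_query = "Wed May 15 14:20:30.114 [conn1343346] query local.oplog.rs query: { ts: { $gte: Timestamp 1368611880000|19631 } } cursorid:16379302284139893 ntoreturn:0 ntoskip:0 nscanned:102 keyUpdates:0 numYields: 18820 locks(micros) r:3724703 nreturned:101 reslen:19184 80124ms"

# How long the slave waited before giving up.
slave_wait_secs = (parse_log_time(slave_failure) - parse_log_time(slave_query)).total_seconds()

# How long the master reports the query took (trailing "<n>ms" field).
master_duration_secs = int(re.search(r"(\d+)ms$", master_query).group(1)) / 1000

# How far the query overshoots the slave's timeout.
overshoot_secs = master_duration_secs - slave_wait_secs

print(f"slave gave up after {slave_wait_secs:.3f}s, master needed {master_duration_secs:.3f}s")
```

The slave gives up after exactly 30 seconds, while the master reports roughly 80 seconds for the same query, so the query cannot complete within the current timeout.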

I confirmed the bug by compiling my own MongoDB server with the timeout changed back to 10 minutes, which solved all of these problems.

I propose that we add a configuration option for increasing the oplog timeout, and also a separate warning message that reports when an oplog query takes longer than expected.
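As an illustration of the proposal only, the behaviour could look roughly like the sketch below. This is not MongoDB code: OplogTimeoutConfig, classify_oplog_query, and the warn_after_secs threshold are hypothetical names and values. The 30 second default and the 10 minute override are the two timeouts mentioned in this ticket.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OplogTimeoutConfig:
    # Hypothetical settings; neither name is an existing server option.
    timeout_secs: float = 30.0      # current hard-coded timeout per SERVER-6733
    warn_after_secs: float = 10.0   # assumed threshold for a "slow oplog query" warning

def classify_oplog_query(elapsed_secs: float, config: OplogTimeoutConfig) -> str:
    """Return 'timeout', 'warn', or 'ok' for an oplog query of the given duration."""
    if elapsed_secs >= config.timeout_secs:
        return "timeout"   # the slave would drop the connection
    if elapsed_secs >= config.warn_after_secs:
        return "warn"      # slower than expected, but replication continues
    return "ok"

# The 80124 ms query from the master log above.
default_config = OplogTimeoutConfig()
relaxed_config = OplogTimeoutConfig(timeout_secs=600.0, warn_after_secs=30.0)

print(classify_oplog_query(80.124, default_config))
print(classify_oplog_query(80.124, relaxed_config))
```

With the default settings the 80 second query is classified as a timeout; with the timeout raised to 10 minutes it would only produce a warning, which is the outcome I am asking for.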

related to:
SERVER-6733 Make oplog timeout shorter (Closed)
SERVER-10362 yielding during read queries waiting too long for fair locking (Closed)

links to:
Pull Request #615

Assignee: Unassigned
Reporter: Juho Mäkinen
Participants: Daniel Pasette, Eric Milkie, Juho Mäkinen, Michael Brenden
Votes: 1
Watchers: 9

Created: May 16 2013 07:58:23 AM UTC
Updated: Feb 06 2017 02:26:39 PM UTC
Resolved: Mar 07 2014 03:21:31 PM UTC
