Core Server / SERVER-9707

Make oplog timeout configurable

    • Type: Improvement
    • Resolution: Incomplete
    • Priority: Minor - P4
    • Affects Version/s: 2.4.3
    • Component/s: Networking, Replication
    • Labels:
    • Environment:
      Amazon EC2 m2.4xlarge instances with 2.6.18-308.16.1.el5.centos.plusxen kernel
    • Fully Compatible

      Task https://jira.mongodb.org/browse/SERVER-6733 changed the oplog timeout from 10 minutes to 30 seconds. We have run into a situation in our environment where some oplog queries take as long as 80 seconds, which breaks replication on slaves.

      Our environment uses a 30 GB oplog (oplogSize=30000), which currently holds 96 million entries covering roughly seven hours of operations.
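A quick sanity check of the numbers above (30 GB, 96 million entries, about seven hours) gives the average entry size and write rate this oplog sustains:

```python
# Sanity-check the oplog figures reported above.
oplog_mb = 30000                       # oplogSize=30000 (MB)
entries = 96_000_000
hours_covered = 7

avg_entry_bytes = oplog_mb * 1024 * 1024 / entries
writes_per_sec = entries / (hours_covered * 3600)

print(round(avg_entry_bytes))          # ~328 bytes per oplog entry
print(round(writes_per_sec))           # ~3810 entries/s
```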

      The issue was discovered through the following log messages. On startup the slave reports:

      Wed May 15 14:19:39.972 [rsBackgroundSync] repl: local.oplog.rs.find({ ts: { $gte: Timestamp 1368611880000|19631 } })

      30 seconds later we see:

      Wed May 15 14:20:09.972 [rsBackgroundSync] Socket recv() timeout   :27017
      Wed May 15 14:20:09.972 [rsBackgroundSync] SocketException: remote:  :27017 error: 9001 socket exception [3] server [ ] 
      Wed May 15 14:20:09.972 [rsBackgroundSync] DBClientCursor::init call() failed
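The "Socket recv() timeout" above is the generic behavior of a socket whose read deadline is shorter than the server's response time. A minimal, self-contained Python illustration (not MongoDB code; ports and delays are scaled down for demonstration):

```python
import socket
import threading
import time

def slow_server(port, delay):
    """Accept one connection, wait `delay` seconds, then reply."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", port))
    srv.listen(1)
    conn, _ = srv.accept()
    time.sleep(delay)          # simulate a long-running oplog query
    conn.sendall(b"reply")
    conn.close()
    srv.close()

# Server takes 0.3 s to answer; the client allows only 0.1 s per recv(),
# analogous to a 30 s socket timeout against an 80 s oplog query.
threading.Thread(target=slow_server, args=(27099, 0.3), daemon=True).start()
time.sleep(0.1)                # give the server time to start listening

cli = socket.create_connection(("127.0.0.1", 27099))
cli.settimeout(0.1)
try:
    cli.recv(16)
    timed_out = False
except socket.timeout:
    timed_out = True           # the "Socket recv() timeout" case
cli.close()
print(timed_out)
```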

      On master we see this a bit later:

      Wed May 15 14:20:30.114 [conn1343346] query local.oplog.rs query: { ts: { $gte: Timestamp 1368611880000|19631 } } cursorid:16379302284139893 ntoreturn:0 ntoskip:0 nscanned:102 keyUpdates:0 numYields: 18820 locks(micros) r:3724703 nreturned:101 reslen:19184 80124ms
      Wed May 15 14:20:30.114 [conn1343346] SocketException handling request, closing client connection: 9001 socket exception [2] server [ ] 

      As we can see, the oplog query times out. This also makes the slave very unresponsive, so all other instances think the slave is down; this shows up in the log as messages like "host1 thinks that we are down", rendering the slave completely useless.

      I confirmed the cause by building my own MongoDB server with the timeout restored to 10 minutes, which resolved all of these problems.

      I propose adding a configuration option to increase the oplog timeout, along with a separate warning message that is emitted when an oplog query takes longer than expected.
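Until such a warning exists, slow oplog queries can be spotted by scanning the mongod log for the slow-query lines shown above. A hypothetical helper (the regex targets the 2.4-era log format; names and the 30000 ms default are illustrative):

```python
import re

# Match 2.4-style slow-query lines for local.oplog.rs and capture duration.
SLOW_OPLOG = re.compile(r"query local\.oplog\.rs .* (\d+)ms\s*$")

def slow_oplog_queries(lines, threshold_ms=30000):
    """Yield (duration_ms, line) for oplog queries at or above threshold_ms."""
    for line in lines:
        m = SLOW_OPLOG.search(line)
        if m and int(m.group(1)) >= threshold_ms:
            yield int(m.group(1)), line

sample = ("Wed May 15 14:20:30.114 [conn1343346] query local.oplog.rs query: "
          "{ ts: { $gte: Timestamp 1368611880000|19631 } } cursorid:16379302284139893 "
          "ntoreturn:0 ntoskip:0 nscanned:102 keyUpdates:0 numYields: 18820 "
          "locks(micros) r:3724703 nreturned:101 reslen:19184 80124ms")

hits = list(slow_oplog_queries([sample]))
print(hits[0][0])   # 80124
```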

            Assignee: Unassigned
            Reporter: Juho Mäkinen (garo)
            Votes: 1
            Watchers: 9