Task https://jira.mongodb.org/browse/SERVER-6733 changed the oplog timeout from 10 minutes to 30 seconds. We have run into a situation in our environment where some oplog queries take as long as 80 seconds, which breaks replication on the slaves.
Our environment uses an oplog of 30GB (oplogSize=30000), which currently contains 96 million entries, covering just about seven hours of oplog.
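For a rough sense of scale, the figures above can be turned into an average entry size and write rate. This is only a back-of-the-envelope sketch: it assumes the oplog is completely full, that oplogSize is counted in units of 1024*1024 bytes, and that the window is exactly seven hours.

```python
# Figures taken from the report; the variable names are illustrative.
OPLOG_SIZE_MB = 30000        # oplogSize=30000
ENTRY_COUNT = 96_000_000     # entries currently in the oplog
WINDOW_HOURS = 7             # "just about seven hours" (approximation)

# Assumption: the oplog is full and 1 MB = 1024 * 1024 bytes.
bytes_per_entry = OPLOG_SIZE_MB * 1024 * 1024 / ENTRY_COUNT
entries_per_second = ENTRY_COUNT / (WINDOW_HOURS * 3600)

print(f"average entry size: {bytes_per_entry:.0f} bytes")
print(f"average write rate: {entries_per_second:.0f} entries/s")
```

Under these assumptions the average entry is a few hundred bytes and the oplog grows by a few thousand entries per second.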
We found this issue through the following log messages. The slave reports this message on startup:
30 seconds later we see:
On master we see this a bit later:
As we can see, the oplog query times out. This also makes the slave very unresponsive, so all other instances think that the slave is down. This shows up in the log as messages like "host1 thinks that we are down", and it renders the slave completely useless.
I confirmed the bug by compiling my own MongoDB server with the timeout changed back to 10 minutes, which solved all these problems.
I propose that we add a configuration option for increasing the oplog timeout, and also a separate warning message that reports when an oplog query takes longer than expected.
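To make the proposal concrete, here is a minimal sketch of the intended behaviour, written in Python for brevity rather than as server code. The names OplogTimeoutConfig, slow_warning_secs and classify_oplog_query are hypothetical, and the 10-second warning threshold is an assumption chosen for illustration; only the 30-second and 10-minute timeouts come from this report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OplogTimeoutConfig:
    """Hypothetical settings; not an existing MongoDB option."""
    timeout_secs: float = 30.0        # current hard-coded value per SERVER-6733
    slow_warning_secs: float = 10.0   # assumed threshold for the warning


def classify_oplog_query(elapsed_secs: float, config: OplogTimeoutConfig) -> str:
    """Return what the proposal would do for a query of this duration."""
    if elapsed_secs >= config.timeout_secs:
        return "timeout"   # query is abandoned, as happens today
    if elapsed_secs >= config.slow_warning_secs:
        return "slow"      # query succeeds, but a warning is logged
    return "ok"


# The 80-second query from this report, under both settings.
default_result = classify_oplog_query(80, OplogTimeoutConfig())
raised_result = classify_oplog_query(80, OplogTimeoutConfig(timeout_secs=600))
```

With the default 30-second timeout the 80-second query is classified as a timeout. With the timeout raised to 10 minutes it completes and only triggers the slow-query warning.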