[SERVER-9707] Make oplog timeout configurable
Created: 16/May/13  Updated: 06/Feb/17  Resolved: 07/Mar/14
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Networking, Replication |
| Affects Version/s: | 2.4.3 |
| Fix Version/s: | None |
| Type: | Improvement |
| Priority: | Minor - P4 |
| Reporter: | Juho Mäkinen |
| Assignee: | Unassigned |
| Resolution: | Incomplete |
| Votes: | 1 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | Amazon EC2 m2.4xlarge instances with 2.6.18-308.16.1.el5.centos.plusxen kernel |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Participants: | |
| Description |
Task https://jira.mongodb.org/browse/SERVER-6733 changed the oplog socket timeout from 10 minutes to 30 seconds. We have run into a situation in our environment where some oplog queries take as long as 80 seconds, which breaks replication on the slaves. Our environment uses a 30 GB oplog (oplogSize=30000), which currently contains 96 million entries and covers only about seven hours.

We found the issue through the following log messages. The slave reports this message on startup:

30 seconds later we see:

On the master we see this a bit later:

As the logs show, the oplog query times out. The timeout also makes the slave very unresponsive, so all other instances think the slave is down; this shows up in the log as messages like "host1 thinks that we are down", rendering the slave completely useless. I confirmed the cause by compiling my own mongodb server with the timeout changed back to 10 minutes, which solved all of these problems.

I propose adding a configuration option for increasing the oplog timeout, along with a separate warning message that reports when an oplog query takes longer than expected.
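The roughly seven-hour window quoted above can be estimated by comparing the timestamps of the first and last oplog entries. Below is a minimal sketch using pymongo; the connection string and output format are illustrative assumptions, not details from this ticket:

```python
# Estimate the oplog window by comparing the first and last entries
# in the local.oplog.rs capped collection.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical host
oplog = client.local["oplog.rs"]

# The oplog is a capped collection kept in insertion ($natural) order,
# so its two ends give the oldest and newest operations it still holds.
first = next(oplog.find().sort("$natural", 1).limit(1))
last = next(oplog.find().sort("$natural", -1).limit(1))

# 'ts' is a BSON Timestamp; its .time attribute is seconds since the epoch.
window_seconds = last["ts"].time - first["ts"].time
print("oplog window: %.1f hours" % (window_seconds / 3600.0))
```

The same figure is also reported by the db.printReplicationInfo() helper in the mongo shell, which prints the oplog size and the time span between the first and last events.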
| Comments |
| Comment by Eric Milkie [ 06/Feb/17 ] |
Hi michaelbrenden
| Comment by Michael Brenden [ 06/Feb/17 ] |
Still experiencing this issue in 3.4.2 / Feb 2017 – see
| Comment by Juho Mäkinen [ 26/Jan/14 ] |
It very well might be. I need to gather some data on this, but we stopped using our patched version (which increased the oplog timeout) some time ago by mistake, and we didn't notice this bug reappearing. I'll check whether we have recorded the deployed mongodb version in our Zabbix history and try to correlate things, then come back to this issue; maybe we can just close it altogether.
| Comment by Daniel Pasette (Inactive) [ 26/Jan/14 ] |
garo, I'm wondering if you were impacted by this issue:
| Comment by Eric Milkie [ 11/Jun/13 ] |
It's unlikely that a shorter read socket timeout would make the slave unresponsive in the way you describe. The heartbeat commands run in separate threads and take no database locks, so they should be unaffected by a slow oplog query on the replication network thread. It's possible the unresponsiveness has a different cause.
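To illustrate the threading argument above with a standalone sketch (in Python rather than the server's C++; the thread roles and sleep durations are made up for the demo, not taken from the server code): a heartbeat loop on its own thread keeps reporting even while a separate thread is blocked on a slow operation, as long as the two share no lock.

```python
# Illustration: a slow operation on one thread does not block
# heartbeats sent from another thread when they share no lock.
import threading
import time

def heartbeat_loop(stop_event):
    # Lightweight "I'm alive" message every 2 seconds, independent of other work.
    while not stop_event.is_set():
        print("heartbeat: still responsive at %.1f" % time.monotonic())
        time.sleep(2)

def slow_oplog_query():
    # Stand-in for an oplog query that takes far longer than the socket timeout.
    print("oplog reader: starting slow query")
    time.sleep(8)  # shortened for the demo
    print("oplog reader: query finished")

stop = threading.Event()
hb = threading.Thread(target=heartbeat_loop, args=(stop,), daemon=True)
hb.start()

slow_oplog_query()  # blocks only this thread; heartbeats keep printing
stop.set()
hb.join(timeout=3)
```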