[SERVER-6733] Make oplog timeout shorter Created: 08/Aug/12 Updated: 06/Feb/17 Resolved: 11/Jun/13 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | 2.2.5, 2.3.2 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Kristina Chodorow (Inactive) | Assignee: | Eric Milkie |
| Resolution: | Done | Votes: | 1 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
| Operating System: | ALL |
| Participants: |
| Comments |
| Comment by Eric Milkie [ 06/Feb/17 ] | ||||||
|
Hi michaelbrenden | ||||||
| Comment by Michael Brenden [ 06/Feb/17 ] | ||||||
|
Problem still present on 3.4.2 (Feb 2017), with the oplog timeout being waaay too short, causing failure of the secondary; see also | ||||||
| Comment by auto [ 13/Jun/13 ] | ||||||
|
Author: Eric Milkie (milkie) <milkie@10gen.com> Message: | ||||||
| Comment by Juho Mäkinen [ 16/May/13 ] | ||||||
|
I've made a new issue on this: https://jira.mongodb.org/browse/SERVER-9707 | ||||||
| Comment by Jalmari Raippalinna [ 15/May/13 ] | ||||||
|
This actually causes problems for us, because we use an extended oplog (oplogSize=30000) with many operations. Starting up a replica set secondary first does this:
30s later we see:
On the replica where the sync was attempted, this comes up a bit later:
Because it takes 80 seconds for the oplog query to respond, we seem to have intermittent problems on the server where our backups are made. Is there anything we can do other than compile our own version with a longer timeout? Our oplog currently has 91M entries, which might be the problem here. | ||||||
| Comment by auto [ 08/Dec/12 ] | ||||||
|
Author: Eric Milkie <milkie@10gen.com> Date: 2012-12-07T20:22:52Z Message: | ||||||
| Comment by Yuri Finkelstein [ 26/Oct/12 ] | ||||||
|
Here is a real-life example showing why this timeout should be short. We had a replica set which was syncing in a chained manner:
Secondary1 had a hard failure. Secondary2 took 10 minutes to detect it. Because of this, all remaining secondaries immediately started to have replication lag. As a result, the client calls to master with getLastError (w=2, timeout=4000) immediately started to fail. This is actually a serious bug. | ||||||
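The effect described in the comment above can be sketched with a small model. This is an illustrative simulation only, not MongoDB code: the function name `write_outcomes`, its parameters, and the assumption that no surviving secondary replicates anything until the dead sync source is detected are hypothetical simplifications. It compares the 10-minute detection delay with the 4000 ms write timeout from the report.

```python
def write_outcomes(detection_timeout_s, wtimeout_ms, write_times_s):
    """For each write issued t seconds after the sync source fails, report
    whether a second acknowledgement (w=2) could arrive before the write
    timeout expires.

    Simplifying assumption (not taken from the server code): every surviving
    secondary is stalled until the failure is detected at detection_timeout_s,
    and catches up instantly afterwards.
    """
    wtimeout_s = wtimeout_ms / 1000.0
    outcomes = []
    for t in write_times_s:
        # Earliest moment a secondary can confirm the write.
        earliest_ack_s = max(t, detection_timeout_s)
        outcomes.append(earliest_ack_s - t <= wtimeout_s)
    return outcomes


# 10-minute detection (600 s) versus a 4000 ms timeout, as in the report.
results = write_outcomes(600, 4000, [0, 60, 300, 595, 597, 600])
print(results)
```

Under these assumptions, any write issued more than 4 seconds before the failure is detected cannot gather its second acknowledgement in time, so a shorter detection timeout narrows the window of failing writes.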
| Comment by Kristina Chodorow (Inactive) [ 08/Aug/12 ] | ||||||
|
This is basically a followup to |