[SERVER-9934] Slow failovers after step down due to sleeping rsBackgroundSync Created: 13/Jun/13 Updated: 11/Jul/16 Resolved: 29/Jun/13
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 2.4.4 |
| Fix Version/s: | 2.4.6, 2.5.1 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Nelson Elhage | Assignee: | Eric Milkie |
| Resolution: | Done | Votes: | 1 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | Tested on Linux amd64, Ubuntu (10.04 Lucid and 13.04 Raring) |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Steps To Reproduce: | Run this test script with -vv: https://gist.github.com/nelhage/5777797 Observe that the failovers take 10-12s, even though the cluster is healthy, unloaded, and running across the local interface. Look at the logs for a node that has become primary. Observe that after the election completes, the server logs "replSet waiting for replication to finish before becoming primary" and does not actually step up until up to 10 seconds later. |
| Participants: | |
| Description |
With a healthy replset, failovers via rs.stepDown() take 10+ seconds, even for a moderately-loaded replset running across the local interface.

Suppose (as is usually/often the case) that the new primary was syncing from the old primary. When the old primary steps down, the rsBackgroundSync producer thread on the new primary gets a db exception, reports it via

sethbmsg(str::stream() << "db exception in producer: " << e.toString());

and then sleeps for 10 seconds before retrying. When the new primary goes to stop the rsBackgroundSync thread so that it can step up, it has to wait out that sleep, which delays the transition to primary by up to 10 seconds.
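A minimal sketch of the two patterns, assuming a simple producer back-off: the fixed sleep that causes the stall, and the interruptible timed wait on a condition variable that Nelson mentions experimenting with in the comments below. Names are illustrative, not the actual bgsync.cpp code:

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Illustrative only -- not the actual rsBackgroundSync code.
class ProducerBackoff {
public:
    // Problematic pattern: a fixed sleep that nothing can interrupt, so a
    // request to stop the thread can stall for the full 10 seconds.
    void backoffWithSleep() {
        std::this_thread::sleep_for(std::chrono::seconds(10));
    }

    // Interruptible alternative: a timed wait on a condition variable that
    // stop() can wake immediately.
    void backoffInterruptible() {
        std::unique_lock<std::mutex> lk(_mutex);
        _cond.wait_for(lk, std::chrono::seconds(10),
                       [this] { return _stopRequested; });
    }

    void stop() {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            _stopRequested = true;
        }
        _cond.notify_all();  // wakes the producer out of its back-off right away
    }

private:
    std::mutex _mutex;
    std::condition_variable _cond;
    bool _stopRequested = false;
};
```

With the second form, stop() wakes the waiting thread immediately, so stopping rsBackgroundSync no longer has to ride out the remainder of the back-off.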
| Comments |
| Comment by Aristarkh Zagorodnikov [ 26/Aug/13 ] |
The
| Comment by auto [ 02/Aug/13 ] |
Author: Eric Milkie (milkie, milkie@10gen.com)
Message: This reduces the potential delay in ending the bgsync thread from 10 seconds
| Comment by Eric Milkie [ 30/Jul/13 ] |
Hi Nelson. Yes, the Backport field is what we use to indicate that. When it's complete we will change the field to "Done" (or "Cannot").
| Comment by Nelson Elhage [ 30/Jul/13 ] |
Is this planned for backport to 2.4? It'd be really nice to have this fix in our production environment – slow failovers are pretty painful for us.
| Comment by Matt Dannenberg [ 23/Jul/13 ] |
I cherry-picked this commit on top of 2.4.5 and it made the server-9464 repro (test.sh) behave properly (get to a state with a primary).
| Comment by Eric Milkie [ 29/Jun/13 ] |
Further code changes are required to reduce the delay from 5s.
| Comment by auto [ 18/Jun/13 ] |
Author: Eric Milkie (milkie, milkie@10gen.com)
Message: This reduces the potential delay in ending the bgsync thread from 10 seconds
| Comment by Nelson Elhage [ 14/Jun/13 ] |
Ah yeah, that would do it. I was testing on a hacked mongod that replaced the sleep with a timed-wait on a condvar – that version fails over quickly in the no-ops case, but hits the double-wait if there's any work to do. The --force-sync test case is probably worthwhile anyways, since that's the more interesting case.
| Comment by Eric Milkie [ 14/Jun/13 ] |
I hit the double wait just with your original test, in fact. I was quite intrigued when I had removed the 10 second sleep and yet observed no change in the test's behavior, so I dug deeper. Thanks very much for writing up a complete test, by the way.
| Comment by Nelson Elhage [ 14/Jun/13 ] |
Not sure if you have your own test tools, but I added an option to https://gist.github.com/nelhage/5777797 to generate a bunch of writes immediately before the failover, which will force the new primary to finish syncing before stepping up, thereby exhibiting the double-wait bug.
| Comment by Eric Milkie [ 14/Jun/13 ] |
Right, I was about to update this with what I've discovered, which was exactly that. I'm going to fix the double wait. To fix the single wait, I have some ideas on how to twist it to treat a getmore wait as being already done – if _appliedBuffer is already true in stopReplicationAndFlushBuffer(), we can make it guaranteed that it will not go false without transitioning to pause mode, in which case we don't need to wait for it.
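A rough sketch of the guard described in the comment above, under the assumption stated there (once paused, the applied flag cannot go false again). The class and member names below are hypothetical stand-ins, not the actual _appliedBuffer / stopReplicationAndFlushBuffer() internals:

```cpp
#include <condition_variable>
#include <mutex>

// Illustrative only -- hypothetical names, not the real replication code.
class SyncBuffer {
public:
    // Called by the node that is about to become primary.
    void stopReplicationAndFlush() {
        std::unique_lock<std::mutex> lk(_mutex);
        _paused = true;  // producer will not enqueue more work once paused
        if (_appliedBuffer) {
            // Everything already applied; because we are paused, the flag
            // cannot go false again, so there is nothing to wait for.
            return;
        }
        // Otherwise wait for the applier to drain what is left.
        _appliedCond.wait(lk, [this] { return _appliedBuffer; });
    }

    // Called by the producer before enqueueing new ops.
    bool tryEnqueue() {
        std::lock_guard<std::mutex> lk(_mutex);
        if (_paused) {
            return false;        // refuse new work while paused
        }
        _appliedBuffer = false;  // there is now unapplied work
        return true;
    }

    // Called by the applier once the buffer is drained.
    void markApplied() {
        std::lock_guard<std::mutex> lk(_mutex);
        _appliedBuffer = true;
        _appliedCond.notify_all();
    }

private:
    std::mutex _mutex;
    std::condition_variable _appliedCond;
    bool _appliedBuffer = true;
    bool _paused = false;
};
```

The early return is only safe because tryEnqueue() refuses work after the pause, which is the invariant the comment relies on.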
| Comment by Nelson Elhage [ 14/Jun/13 ] |
(Let me know if this should be a separate issue) In the case where we do have ops to replicate, after replication is complete, we end up waiting for two GETMORE await timeouts. produce() does:

In the case where there are no more ops to read, each of those .more() calls will end up issuing a GETMORE, which means we'll wait 2 AWAIT timeouts (roughly 8-10s) before we return, also delaying the new primary from stepping up. The current architecture makes it very difficult to avoid waiting for at least one AWAIT timeout, but there's no need to wait for two.
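A hypothetical sketch of the loop shape being described, not the actual produce() code from bgsync.cpp: when the node is caught up, each of the two .more() calls blocks for a full await interval, which is where the roughly 8-10 seconds come from.

```cpp
#include <chrono>
#include <thread>

// Illustrative stand-in for a tailable, awaitData oplog cursor. When nothing
// is pending, more() simulates the server-side await by blocking before it
// returns an empty result.
struct OplogCursor {
    int pendingOps = 0;

    bool moreInCurrentBatch() { return pendingOps > 0; }

    bool more() {
        if (pendingOps == 0) {
            // A real awaitData GETMORE blocks on the server for the await
            // interval (roughly 4-5 seconds) before returning an empty batch.
            std::this_thread::sleep_for(std::chrono::seconds(4));
        }
        return pendingOps > 0;
    }

    void next() { --pendingOps; }  // consume one op
};

// The double-wait shape: once the node is caught up, the first more() waits
// out one await timeout and the second more() issues another GETMORE and
// waits out a second one, so roughly two timeouts pass before the loop exits
// and the node can proceed to step up.
void produceLoop(OplogCursor& r) {
    while (true) {
        if (!r.moreInCurrentBatch()) {
            r.more();          // first blocking GETMORE
        }
        if (!r.more()) {       // second blocking GETMORE
            break;
        }
        r.next();              // hand the op to the applier buffer
    }
}
```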