[SERVER-9934] Slow failovers after step down due to sleeping rsBackgroundSync Created: 13/Jun/13  Updated: 11/Jul/16  Resolved: 29/Jun/13

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.4.4
Fix Version/s: 2.4.6, 2.5.1

Type: Bug Priority: Major - P3
Reporter: Nelson Elhage Assignee: Eric Milkie
Resolution: Done Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Tested on Linux amd64, Ubuntu (10.04 Lucid and 13.04 Raring)


Issue Links:
Depends
Duplicate
is duplicated by SERVER-9944 assumePrimary() waits 10 seconds (for... Closed
is duplicated by SERVER-9464 On election, relax priority restricti... Closed
Related
related to SERVER-10225 Replica set failover speed improvement Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

Run this test script with -vv: https://gist.github.com/nelhage/5777797

Observe that the failovers take 10-12s, even though the cluster is healthy, unloaded, and running across the local interface.

Look at the logs for a node that has become primary. Observe that after the election completes, the server logs "replSet waiting for replication to finish before becoming primary", and does not actually step up for up to 10 seconds later.

Participants:

 Description   

With a healthy replset, failovers via rs.stepDown() take 10+
seconds, even when the election itself completes in 1-2s.

After an rs.stepDown(), a healthy, moderately loaded replset on
a good network will normally complete the election of a new
primary within 1-2s. However, before the new primary can step
up, it has to stop the background sync thread.

Suppose (as is often the case) that the new primary was syncing
from the stepped-down node. When that node stepped down, it
closed all of its connections, including the connection the new
primary was using to read the oplog. This raises a DBException
inside _producerThread, which lands in this code:

sethbmsg(str::stream() << "db exception in producer: " << e.toString());
sleepsecs(10);

When the new primary goes to stop the rsBackgroundSync thread,
the thread is caught in the middle of that sleep, so we end up
literally waiting out a sleep(10) before the cluster has a new
primary.
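
One way to avoid blocking step-up on that full back-off (a hacked
mongod doing exactly this is mentioned in the comments below) is to
replace the uninterruptible sleepsecs(10) with a timed wait on a
condition variable that the step-up/stop path can signal. A minimal
sketch, using hypothetical names (_mutex, _cv, _stopRequested) rather
than the actual rsBackgroundSync members:

#include <chrono>
#include <condition_variable>
#include <mutex>

class BackgroundSyncSketch {
public:
    // Producer thread: back off for up to 10 seconds after a db
    // exception, but wake immediately if a stop has been requested.
    void backoffAfterException() {
        std::unique_lock<std::mutex> lk(_mutex);
        _cv.wait_for(lk, std::chrono::seconds(10),
                     [this] { return _stopRequested; });
    }

    // Step-up path: interrupt the back-off instead of waiting it out.
    void requestStop() {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            _stopRequested = true;
        }
        _cv.notify_all();
    }

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    bool _stopRequested = false;
};

With something along these lines, stopping rsBackgroundSync would wake
the producer immediately instead of waiting out the rest of the sleep.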



 Comments   
Comment by Aristarkh Zagorodnikov [ 26/Aug/13 ]

SERVER-9464 wasn't actually fixed by this; see SERVER-10621.

Comment by auto [ 02/Aug/13 ]

Author: Eric Milkie (milkie) <milkie@10gen.com>

Message: SERVER-9934 remove double getmore when pulling new ops

This reduces the potential delay in ending the bgsync thread from 10 seconds
to 5 seconds. This is important because assumingPrimaryness waits on this loop
before changing state to PRIMARY after an election.
To reduce the delay lower than 5 seconds requires further redesign of the code.
Branch: v2.4
https://github.com/mongodb/mongo/commit/b89e706faca937c07cc578c37c962532fc7b751b

Comment by Eric Milkie [ 30/Jul/13 ]

Hi Nelson. Yes, the Backport field is what we use to indicate that. When it's complete we will change the field to "Done" (or "Cannot").

Comment by Nelson Elhage [ 30/Jul/13 ]

Is this planned for backport to 2.4? It'd be really nice to have this fix in our production environment – slow failovers are pretty painful for us.

Comment by Matt Dannenberg [ 23/Jul/13 ]

I cherry-picked this commit on top of 2.4.5 and it made the SERVER-9464 repro (test.sh) behave properly (it reaches a state with a primary).

Comment by Eric Milkie [ 29/Jun/13 ]

Further code changes are required to reduce the delay from 5s.

Comment by auto [ 18/Jun/13 ]

Author: Eric Milkie (milkie) <milkie@10gen.com>

Message: SERVER-9934 remove double getmore when pulling new ops

This reduces the potential delay in ending the bgsync thread from 10 seconds
to 5 seconds. This is important because assumingPrimaryness waits on this loop
before changing state to PRIMARY after an election.
To reduce the delay lower than 5 seconds requires further redesign of the code.
Branch: master
https://github.com/mongodb/mongo/commit/acbae789795b2f628712ada73b2d89684b97da9d

Comment by Nelson Elhage [ 14/Jun/13 ]

Ah yeah, that would do it. I was testing on a hacked mongod that replaced the sleep with a timed wait on a condvar; that version fails over quickly in the no-ops case, but hits the double wait if there's any work to do. The --force-sync test case is probably worthwhile anyway, since that's the more interesting case.

Comment by Eric Milkie [ 14/Jun/13 ]

I hit the double wait just with your original test, in fact. I was quite intrigued when I had removed the 10 second sleep and yet observed no change in the test's behavior, so I dug deeper. Thanks very much for writing up a complete test, by the way.

Comment by Nelson Elhage [ 14/Jun/13 ]

Not sure if you have your own test tools, but I added an option to https://gist.github.com/nelhage/5777797 to generate a bunch of writes immediately before the failover, which forces the new primary to finish syncing before stepping up, thereby exhibiting the double-wait bug.

Comment by Eric Milkie [ 14/Jun/13 ]

Right, I was about to update this with what I've discovered, which was exactly that. I'm going to fix the double wait. To fix the single wait, I have some ideas on how to treat a getmore wait as already satisfied: if _appliedBuffer is already true in stopReplicationAndFlushBuffer(), we can guarantee that it will not go false without first transitioning to pause mode, in which case we don't need to wait for it at all.
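
For illustration only, a sketch of that idea, with member names
assumed from the comment above rather than taken from the actual
BackgroundSync code:

#include <condition_variable>
#include <mutex>

struct BgSyncStopSketch {
    std::mutex _mutex;
    std::condition_variable _appliedCond;
    bool _pause = false;
    bool _appliedBuffer = true;

    void stopReplicationAndFlushBuffer() {
        std::unique_lock<std::mutex> lk(_mutex);
        _pause = true;  // producer stops fetching once it sees this

        // If the applier has already drained the buffer, and it cannot
        // refill while paused, there is nothing to flush and nothing to
        // wait for -- in particular, no pending getmore timeout.
        if (_appliedBuffer) {
            return;
        }

        // Otherwise wait until the applier signals the buffer is drained.
        _appliedCond.wait(lk, [this] { return _appliedBuffer; });
    }
};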

Comment by Nelson Elhage [ 14/Jun/13 ]

(Let me know if this should be a separate issue)

In the case where we do have ops to replicate, after replication is complete, we end up waiting for two GETMORE await timeouts. produce() does:

                if (!r.moreInCurrentBatch()) {
                    ...
                        r.more();
                    ...
                }
 
                if (!r.more())
                    break;

In the case where there are no more ops to read, each of those .more() calls will end up issuing a GETMORE, which means we'll wait 2 AWAIT timeouts (roughly 8-10s) before we return, also delaying the new primary from stepping up.

The current architecture makes it very difficult to avoid waiting for at least one AWAIT timeout, but there's no need to wait for two.
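
For illustration, the loop can be restructured so that each pass
issues at most one awaiting more() call; a sketch of the general
shape (not the committed change), where the reader type is a
stand-in for the real OplogReader:

// Sketch only: more() serves data from the current batch when
// available and only issues a GETMORE (blocking up to one await
// timeout) when the batch is exhausted, so an idle oplog tail costs
// a single await timeout instead of two before produce() returns.
template <typename OplogReaderLike>
void producePassSketch(OplogReaderLike& r) {
    while (true) {
        if (!r.more()) {
            break;  // no new ops after one await timeout: stop producing
        }
        r.nextSafe();  // in the real code: read the op and hand it to the applier
    }
}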
