[SERVER-1987] Assumption of master: arbiter should vote for most ahead replica member Created: 22/Oct/10  Updated: 12/Jul/16  Resolved: 27/Oct/10

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 1.7.1
Fix Version/s: 1.7.2

Type: Bug Priority: Major - P3
Reporter: Mark Kirkwood Assignee: Kristina Chodorow (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu lucid x86-64 mongodb compiled from src


Attachments: File dim1.dat.gz     File log-arbit27019     File log27017     File log27018     File setup1987    
Operating System: ALL
Participants:

 Description   

In the case of a 2-member + 1-arbiter replica set, e.g.:

> rs.isMaster()
{
    "setName" : "rep0",
    "ismaster" : true,
    "secondary" : false,
    "hosts" : [
        "192.168.2.63:27017",
        "192.168.2.63:27018"
    ],
    "arbiters" : [
        "192.168.2.63:27019"
    ],
    "ok" : 1
}
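
For context, a set like this can be brought up with three mongods and one rs.initiate() call. The sketch below is illustrative only: the set name, hosts and ports are the ones from this ticket, while the dbpath/logpath values are placeholders, not taken from the attached setup files.

// Start three mongods (paths are placeholders):
//   mongod --replSet rep0 --port 27017 --dbpath /data/rep0-0 --fork --logpath log27017
//   mongod --replSet rep0 --port 27018 --dbpath /data/rep0-1 --fork --logpath log27018
//   mongod --replSet rep0 --port 27019 --dbpath /data/rep0-arb --fork --logpath log-arbit27019
// Then, from a mongo shell connected to 27017:
> rs.initiate({
    "_id" : "rep0",
    "members" : [
        { "_id" : 0, "host" : "192.168.2.63:27017" },
        { "_id" : 1, "host" : "192.168.2.63:27018" },
        { "_id" : 2, "host" : "192.168.2.63:27019", "arbiterOnly" : true }
    ]
})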

Consider the following:

1/ Kill the member on port 27017, so 27018 becomes the master.
2/ Connect to the member on port 27018 and do some updates to a collection.
3/ fsync to be sure the changes are saved to disk.
4/ Start up the member on port 27017 again.
5/ Connect to the member on port 27017 and observe that the previous changes have been rolled back (a rough shell sketch of these steps follows below).
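
A minimal sketch of the above, assuming the test data lives in a collection named dim1 in the test db (names loosely taken from the attached dim1.dat.gz; the update itself is just an example):

// 1/ Kill the mongod on port 27017 (e.g. kill <pid>); 27018 takes over as primary.
// 2/ and 3/ From a mongo shell connected to 27018 (now the primary):
> use test
> db.dim1.update({}, { "$set" : { "touched" : true } }, false, true)   // example update only
> use admin
> db.runCommand({ fsync: 1 })
// 4/ Restart the mongod on port 27017 with its original options.
// 5/ Reconnect to 27017 and note that the updates from step 2 have been rolled back.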

I was (possibly wrongly) under the impression that optime was a factor in deciding who was to be primary, but clearly in this case the arbiter is voting for the furthest-behind member to become primary when it comes back up.

Log information from the member on port 27018 - it notes that the other member is down and becomes primary; note also the mention of the changes being fsync'ed:

Fri Oct 22 15:05:53 [rs_sync] replSet syncThread: 10278 dbclient error communicating with server
Fri Oct 22 15:05:53 [ReplSetHealthPollTask] replSet info 192.168.2.63:27017 is now down (or slow to respond)
Fri Oct 22 15:05:53 [rs Manager] replSet info electSelf 1
Fri Oct 22 15:05:53 [rs Manager] replSet PRIMARY
Fri Oct 22 15:06:18 [initandlisten] connection accepted from 127.0.0.1:33896 #4
Fri Oct 22 15:07:01 [conn4] end connection 127.0.0.1:33896
Fri Oct 22 15:08:28 [initandlisten] connection accepted from 127.0.0.1:34029 #5
Fri Oct 22 15:09:18 [conn5] command denied: { fsync: 1.0 }
Fri Oct 22 15:09:23 [conn5] CMD fsync: sync:1 lock:0
Fri Oct 22 15:09:26 [conn5] end connection 127.0.0.1:34029

... it now detects that the member on port 27017 is up, which is then chosen as primary, so we roll back our previous work:

Fri Oct 22 15:11:51 [ReplSetHealthPollTask] replSet info 192.168.2.63:27017 is now up
Fri Oct 22 15:11:52 [initandlisten] connection accepted from 192.168.2.63:47618 #7
Fri Oct 22 15:11:54 [rs_sync] replSet we are ahead of the primary, will try to roll back
Fri Oct 22 15:11:54 [rs_sync] replSet rollback 0
Fri Oct 22 15:11:54 [rs_sync] replSet rollback 1
Fri Oct 22 15:11:54 [rs_sync] replSet rollback 2 FindCommonPoint
Fri Oct 22 15:11:54 [rs_sync] replSet info rollback our last optime: Oct 22 15:08:34:3e7
Fri Oct 22 15:11:54 [rs_sync] replSet info rollback their last optime: Oct 22 14:18:35:218e
Fri Oct 22 15:11:54 [rs_sync] replSet info rollback diff in end of log times: 2999 seconds



 Comments   
Comment by auto [ 20/Nov/10 ]

Author:

{'login': 'kchodorow', 'name': 'Kristina Chodorow', 'email': 'kristina@10gen.com'}

Message: check for better primary before voting yea SERVER-1987
/mongodb/mongo/commit/269c53e84b5e37597b013edadb93ba028ddb3ee3

Comment by Mark Kirkwood [ 28/Oct/10 ]

Hey, my pleasure - was fun.

Comment by Kristina Chodorow (Inactive) [ 27/Oct/10 ]

Great! Thanks for all the help debugging this.

Comment by Mark Kirkwood [ 27/Oct/10 ]

Applied patch to my 1.7.1 src and rebuilt - looks fixed to me: several tries and we elect the right/best/freshest member to be primary each time. Fast fix!

Tested with latest code from git, looks fixed there too. Nice!

Comment by Kristina Chodorow (Inactive) [ 27/Oct/10 ]

Could you try the latest code (or the 1.7 nightly build tomorrow)?

Comment by auto [ 27/Oct/10 ]

Author:

{'login': 'kchodorow', 'name': 'Kristina Chodorow', 'email': 'kristina@10gen.com'}

Message: check for better primary before voting yea SERVER-1987
http://github.com/mongodb/mongo/commit/60b24f50eebe1596728e5d8e16b2ba8428032da5

Comment by Kristina Chodorow (Inactive) [ 27/Oct/10 ]

Thanks! Looks like a race condition (27017 tries to elect itself before it knows about 27018 and the arbiter goes along with it), fixing...

Comment by Mark Kirkwood [ 26/Oct/10 ]

Poking about in the code a little and enabling some debugging output for Consensus::weAreFreshest shows that when 27017 comes up it thinks it is freshest (from 27017 log):

11:13:34 [ReplSetHealthPollTask] replSet info localhost:27019 is now up
11:13:34 [ReplSetHealthPollTask] replSet localhost:27019 ARBITER
11:13:34 [rs Manager] replSet dev we are freshest of up nodes, nok:1 nTies:0
11:13:34 [rs Manager] replSet info electSelf 0
11:13:34 [rs Manager] replSet PRIMARY

Getting into this a bit more, with some extra debugging code:

14:41:22 [rs Manager] replSet freshest returns { opTime: new Date(0), fresher: false, ok: 1.0 }
14:41:22 [rs Manager] replSet dev we are freshest of up nodes, nok:1 nTies:0
14:41:22 [rs Manager] replSet info electSelf 0
14:41:22 [rs Manager] replSet PRIMARY

So it looks to me like the optime for the ARBITER (27019) is being returned here... I'm trying to figure out why that is (note: I'm stumbling through the code a bit, so I may have misunderstood what is happening here).
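
(Side note for anyone retracing this: the same freshness comparison can be watched from outside the server by printing each member's optime from rs.status() before and after restarting 27017 - the arbiter holds no data, so it never has a meaningful optime of its own. Field names below are as in rs.status() output and may vary slightly across versions:)

> rs.status().members.forEach(function (m) {
      // print host, state, and the last applied op's wall-clock time for each member
      print(m.name + "  " + m.stateStr + "  " + m.optimeDate);
  })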

Comment by Mark Kirkwood [ 25/Oct/10 ]

Hmm - weird, I tried on my home workstation (32 bit Ubuntu) and can reproduce consistently. Will attach logs.

Comment by Kristina Chodorow (Inactive) [ 25/Oct/10 ]

Hmm, still can't reproduce with 1.7.1. Can you attach or paste the arbiter's and 27017's logs?

Comment by Kristina Chodorow (Inactive) [ 25/Oct/10 ]

Thanks for the detailed instructions! I am unable to reproduce on 1.7.2-; I'm compiling 1.7.1 and will let you know. Hopefully it's been incidentally fixed.

Comment by Mark Kirkwood [ 25/Oct/10 ]

Also: if it would be helpful, I'll attach a complete 'how to reproduce' with the steps I used to set up the replicas, plus the data import and update commands. Cheers! (edit) Since I have the file(s), I'll attach 'em anyway!

Comment by Mark Kirkwood [ 25/Oct/10 ]

Good point, I went and retested to check. Both 27018 and 27017 synced (the web console shows both at 4cc100be:1). I killed 27017 again and updated some docs in a collection while connected to 27018 - which then advanced to 4cc4f4c5:3e7. I restarted 27017 and the web console now shows:

27017 - PRIMARY 4cc100be:1
27018 - RECOVERING rollback 2 error error not willing to roll back more than one hour of data 4cc4f4c5:3e7

Here is my rs.conf at that point:

> rs.conf()
{
    "_id" : "rep0",
    "version" : 1,
    "members" : [
        { "_id" : 0, "host" : "192.168.2.63:27017" },
        { "_id" : 1, "host" : "192.168.2.63:27018" },
        { "_id" : 2, "host" : "192.168.2.63:27019", "arbiterOnly" : true }
    ]
}
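
(Aside: the optimes shown by the web console, e.g. 4cc100be:1 above, are a BSON timestamp written as <seconds-since-epoch in hex>:<counter in hex>, so they can be turned back into wall-clock times from the shell:)

> new Date(parseInt("4cc100be", 16) * 1000)   // where both members were before 27017 was killed
> new Date(parseInt("4cc4f4c5", 16) * 1000)   // where 27018 had advanced to while 27017 was down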

Comment by Kristina Chodorow (Inactive) [ 22/Oct/10 ]

Sorry, rs.conf().

Comment by Kristina Chodorow (Inactive) [ 22/Oct/10 ]

It should use optime. Can you send rs.config()? Is 27018 fully synced when 27017 goes down? (You can check the web console; make sure the values in the optime column are the same.)

Fsyncs don't matter.

Comment by Mark Kirkwood [ 22/Oct/10 ]

Correction: I mentioned that the log shows when I did the fsync - that's not entirely true; it actually shows that I forgot to change to the 'admin' db before running the fsync! I did that immediately afterwards... honest...
