[SERVER-21240] First oplog entry after stepdown has new term but old index Created: 01/Nov/15  Updated: 30/Nov/15  Resolved: 30/Nov/15

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.2.0-rc0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Henrik Ingo (Inactive) Assignee: Benety Goh
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Steps To Reproduce:

<old primary>
rs.stepDown()
<new primary>
use local
db.oplog.rs.find().sort(

{ $natural : -1 }

).pretty()

Sprint: Repl C (11/20/15), Repl D (12/11/15)
Participants:

 Description   

The below oplog is from the new primary, after another node did rs.stepDown()

{
	"ts" : Timestamp(1445630479, 2),
	"t" : NumberLong(37),
	"h" : NumberLong("-8385691639698850950"),
	"v" : 2,
	"op" : "i",
	"ns" : "test.test",
	"o" : {
		"_id" : ObjectId("562a920ff4a2f080c7fb8d7d"),
		"foo" : "bar"
	}
}
{
	"ts" : Timestamp(1445630479, 1),
	"t" : NumberLong(37),
	"h" : NumberLong("1854230191406083124"),
	"v" : 2,
	"op" : "c",
	"ns" : "test.$cmd",
	"o" : {
		"create" : "test"
	}
}
{
	"ts" : Timestamp(1445629632, 1856),
	"t" : NumberLong(37),
	"h" : NumberLong("-7107662392271914394"),
	"v" : 2,
	"op" : "n",
	"ns" : "",
	"o" : {
		"msg" : "new primary"
	}
}
{
	"ts" : Timestamp(1445629632, 1854),
	"t" : NumberLong(36),
	"h" : NumberLong("5763596409837014986"),
	"v" : 2,
	"op" : "u",
	"ns" : "ycsb.usertable",
	"o2" : {
		"_id" : "user6709152380272356275"
	},
	"o" : {
		"$set" : {
			"field2" : BinData(0,"KU91Lj9wIl41KSB8M0RtJidqNk83MFQrOydmIkc1PDkqKFA3PCM4LTB4LV9rNk5tIyUwLTx0JF5rPypqOkAtLTk2I0s7JEA9NUhnMl41KlktNUVzMVdtNzRgMy8sIE1jLUc7KQ==")
		}
	}
}

Note that for the event that says "new primary", the second part of ts should be 0.

One effect of this is that when the new primary has won its election, but didn't yet commit anything, it can't step down itself:

319test:PRIMARY> rs.stepDown()
{
"ok" : 0,
"errmsg" : "No electable secondaries caught up as of 2015-10-23T19:51:01.179+0000. Please use

{force: true}

to force node to step down.",
"code" : 50
}

Once you issue any write (here an insert) the stepdown succeeds.



 Comments   
Comment by Henrik Ingo (Inactive) [ 30/Nov/15 ]

Cool. Thanks for this, at least I learned something out of this.

Comment by Eric Milkie [ 30/Nov/15 ]

Yes, indeed it is correct to say that.

Comment by Henrik Ingo (Inactive) [ 30/Nov/15 ]

Thanks milkie. A spec I read in June led me to believe for pv1 the second 32-bit part would be an integer starting from 1 for each new term. (...but it may have also just been me interpreting it that way.)

So is it correct to say, the value of the ts field is actually the same for both protocol versions? That's what the source code looks like to me.

Comment by Eric Milkie [ 30/Nov/15 ]

The index part of a Raft log event corresponds to an increasing 64-bit integer associated with MongoDB oplog entries. This integer is split up into two 32 bit values in BSON. One of the 32-bit values happens to be an approximation of the time the index was generated, most of the time.

Comment by Henrik Ingo (Inactive) [ 27/Nov/15 ]

In any case, I can't reproduce this today. I'm starting to suspect that there was some valid reason why I got the "No electable secondaries caught up..." message a month ago, such as secondaries lagging too much behind (which they did).

Because I expected the ts field to behave differently, I thought I observed a bug, but if the value of the ts field is as intended, then this can be closed as "works as designed".

Comment by Henrik Ingo (Inactive) [ 27/Nov/15 ]

Hmm... I may have misunderstood what this should look like in protocol v1. Looking at the source code now, it seems to me that the ts field is actually unchanged from previous behavior in v0? So in a v1 oplog entry, there isn't really anything resembling the index part of a Raft log event - entries are just identified by an ever increasing ts field?

Comment by Henrik Ingo (Inactive) [ 26/Nov/15 ]

milkie: If the second part of ts cannot be 0, then it should be 1. This is replication protocol v1, so I'm reading this as the index of the Raft protocol, correct?

The bug I'm (primarily) reporting in this ticket is this particular entry:

{
	"ts" : Timestamp(1445629632, 1856),
	"t" : NumberLong(37),
	"h" : NumberLong("-7107662392271914394"),
	"v" : 2,
	"op" : "n",
	"ns" : "",
	"o" : {
		"msg" : "new primary"
	}
}

It has an incremented term (37), however the index (1856) did not revert to 1 (or 0, whatever is the lowest possible). This is wrong. It is also wrong that the next entry has term=37 and index=1, since 1 < 1856.

I can file the stepdown related problem separately if you believe it is unrelated, but I believe it is related. I'll see if I can spend some time tomorrow to reproduce this more properly and then file that ticket separately.

Comment by Eric Milkie [ 25/Nov/15 ]

The "second part of ts" can never be 0, due to global_timestamp.cpp:65, so your statement in the description is incorrect.
It can, however, start over at 1 as Benety pointed out in his comment. It only does so when the clocktime seconds have changed. It does not start over at 1 after an election, or at any other time other than when the clocktime seconds have changed. Logically, the ts value is opaque and we are actively looking at making this simply a number in the future, with no connection to wallclock time.

If it's not possible to step down a newly elected primary, something else must be amiss. Can you file a separate ticket with reproduction instructions?

Comment by Henrik Ingo (Inactive) [ 25/Nov/15 ]

Hi benety.goh

Ok, I'm now done with the testing that this was related to, and not even traveling for the next few days. How can I help you?

Comment by Benety Goh [ 02/Nov/15 ]

The increment for the ts for the oplog entry is obtained from getNextOpTime() and starts at 1 (at least since 7088413).

Generated at Thu Feb 08 03:56:45 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.