[SERVER-21240] First oplog entry after stepdown has new term but old index Created: 01/Nov/15 Updated: 30/Nov/15 Resolved: 30/Nov/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 3.2.0-rc0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Henrik Ingo (Inactive) | Assignee: | Benety Goh |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Operating System: | ALL |
| Steps To Reproduce: | <old primary> ).pretty() |
| Sprint: | Repl C (11/20/15), Repl D (12/11/15) |
| Participants: |
| Description |
|
The below oplog is from the new primary, after another node did rs.stepDown()
Note that for the event that says "new primary", the second part of ts should be 0. One effect of this is that when the new primary has won its election, but didn't yet commit anything, it can't step down itself: 319test:PRIMARY> rs.stepDown() to force node to step down.", Once you issue any write (here an insert) the stepdown succeeds. |
| Comments |
| Comment by Henrik Ingo (Inactive) [ 30/Nov/15 ] | |||||||||||
|
Cool. Thanks for this, at least I learned something out of this. | |||||||||||
| Comment by Eric Milkie [ 30/Nov/15 ] | |||||||||||
|
Yes, indeed it is correct to say that. | |||||||||||
| Comment by Henrik Ingo (Inactive) [ 30/Nov/15 ] | |||||||||||
|
Thanks milkie. A spec I read in June led me to believe for pv1 the second 32-bit part would be an integer starting from 1 for each new term. (...but it may have also just been me interpreting it that way.) So is it correct to say, the value of the ts field is actually the same for both protocol versions? That's what the source code looks like to me. | |||||||||||
| Comment by Eric Milkie [ 30/Nov/15 ] | |||||||||||
|
The index part of a Raft log event corresponds to an increasing 64-bit integer associated with MongoDB oplog entries. This integer is split up into two 32 bit values in BSON. One of the 32-bit values happens to be an approximation of the time the index was generated, most of the time. | |||||||||||
| Comment by Henrik Ingo (Inactive) [ 27/Nov/15 ] | |||||||||||
|
In any case, I can't reproduce this today. I'm starting to suspect that there was some valid reason why I got the "No electable secondaries caught up..." message a month ago, such as secondaries lagging too much behind (which they did). Because I expected the ts field to behave differently, I thought I observed a bug, but if the value of the ts field is as intended, then this can be closed as "works as designed". | |||||||||||
| Comment by Henrik Ingo (Inactive) [ 27/Nov/15 ] | |||||||||||
|
Hmm... I may have misunderstood what this should look like in protocol v1. Looking at the source code now, it seems to me that the ts field is actually unchanged from previous behavior in v0? So in a v1 oplog entry, there isn't really anything resembling the index part of a Raft log event - entries are just identified by an ever increasing ts field? | |||||||||||
| Comment by Henrik Ingo (Inactive) [ 26/Nov/15 ] | |||||||||||
|
milkie: If the second part of ts cannot be 0, then it should be 1. This is replication protocol v1, so I'm reading this as the index of the Raft protocol, correct? The bug I'm (primarily) reporting in this ticket is this particular entry:
It has an incremented term (37), however the index (1856) did not revert to 1 (or 0, whatever is the lowest possible). This is wrong. It is also wrong that the next entry has term=37 and index=1, since 1 < 1856. I can file the stepdown related problem separately if you believe it is unrelated, but I believe it is related. I'll see if I can spend some time tomorrow to reproduce this more properly and then file that ticket separately. | |||||||||||
| Comment by Eric Milkie [ 25/Nov/15 ] | |||||||||||
|
The "second part of ts" can never be 0, due to global_timestamp.cpp:65, so your statement in the description is incorrect. If it's not possible to step down a newly elected primary, something else must be amiss. Can you file a separate ticket with reproduction instructions? | |||||||||||
| Comment by Henrik Ingo (Inactive) [ 25/Nov/15 ] | |||||||||||
|
Hi benety.goh Ok, I'm now done with the testing that this was related to, and not even traveling for the next few days. How can I help you? | |||||||||||
| Comment by Benety Goh [ 02/Nov/15 ] | |||||||||||
|
The increment for the ts for the oplog entry is obtained from getNextOpTime() and starts at 1 (at least since 7088413). |