[SERVER-42255] Replica set initialization writes first oplog entry with no term Created: 17/Jul/19  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Bernard Gorman Assignee: Backlog - Replication Team
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-42258 Make term field required in Oplog Ent... Backlog
Related
Assigned Teams:
Replication
Operating System: ALL
Participants:

 Description   

When a replica set is first initialized using rs.initiate(), it writes a note to that effect into the oplog as its first entry. However, this entry is written while the term is still OpTime::kUninitializedTerm, so it has no term field (t):

{ "op" : "n", "ns" : "", "o" : { "msg" : "initiating set" }, "ts" : Timestamp(1563285225, 1), "v" : NumberLong(2), "wall" : ISODate("2019-07-16T13:53:45.914Z") }

This has been the case since 3.6 and remains so on current master.
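
A minimal sketch of how such entries could be spotted, assuming oplog documents are represented as plain Python dicts. The helper name entries_missing_term and the second sample entry (its message, timestamp, and term) are hypothetical and only serve as a contrast to the "initiating set" entry shown above. Against a live deployment, a filter along the lines of {"t": {"$exists": false}} on local.oplog.rs would presumably express the same check.

```python
# Illustrative only: plain dicts stand in for oplog documents.

def entries_missing_term(entries):
    """Return the oplog entries that lack the term field "t"."""
    return [e for e in entries if "t" not in e]


# Shape taken from the entry in this ticket (timestamp simplified to a tuple).
initiating = {
    "op": "n",
    "ns": "",
    "o": {"msg": "initiating set"},
    "ts": (1563285225, 1),
    "v": 2,
}

# Hypothetical later entry written after an election, carrying a term.
later = {
    "op": "n",
    "ns": "",
    "o": {"msg": "new primary"},
    "ts": (1563285236, 1),
    "t": 1,
    "v": 2,
}

missing = entries_missing_term([initiating, later])
print([e["o"]["msg"] for e in missing])
```

Only the "initiating set" entry is reported, since it is the only one written without a term.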



 Comments   
Comment by Lingzhi Deng [ 22/Jul/19 ]

The cause is that the first oplog entry is written in initializeReplSetStorage, which eventually calls getNextOpTimes and replCoord->getTerm() to determine the term. However, initializeReplSetStorage is called before _finishReplSetInitiate, which calls TopologyCoordinator::updateConfig() to reset the term from OpTime::kUninitializedTerm (-1) to OpTime::kInitialTerm (0). We don't write the t field when the term is OpTime::kUninitializedTerm, so the t field is missing from the "initiating set" entry. If we initialized the TopologyCoordinator's term to OpTime::kInitialTerm (0) before initializeReplSetStorage is called, we could perhaps log {t: 0} for it. The first election will always start in term 1.
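
The ordering problem described in this comment can be illustrated with a toy model. This is a sketch, not the server's code: ToyCoordinator, initialize_storage, update_config, and initiate are hypothetical stand-ins for the real C++ components, and only the two term constants mirror values named in the comment.

```python
UNINITIALIZED_TERM = -1  # mirrors OpTime::kUninitializedTerm
INITIAL_TERM = 0         # mirrors OpTime::kInitialTerm


class ToyCoordinator:
    """Hypothetical stand-in for the replication coordinator's term state."""

    def __init__(self):
        self.term = UNINITIALIZED_TERM
        self.oplog = []

    def initialize_storage(self):
        # Stand-in for initializeReplSetStorage: writes the first oplog
        # entry and omits "t" while the term is still uninitialized.
        entry = {"op": "n", "o": {"msg": "initiating set"}}
        if self.term != UNINITIALIZED_TERM:
            entry["t"] = self.term
        self.oplog.append(entry)

    def update_config(self):
        # Stand-in for the term reset done via updateConfig().
        if self.term == UNINITIALIZED_TERM:
            self.term = INITIAL_TERM


def initiate(term_first):
    """Run the two steps in either order and return the first oplog entry."""
    node = ToyCoordinator()
    steps = [node.update_config, node.initialize_storage]
    if not term_first:
        steps.reverse()  # current behavior: storage first, term reset after
    for step in steps:
        step()
    return node.oplog[0]


current = initiate(term_first=False)
proposed = initiate(term_first=True)
print("t" in current, proposed.get("t"))  # prints: False 0
```

With the current ordering the entry has no t field; resetting the term first yields an entry with t equal to 0, which is the outcome the comment suggests might be achievable.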

Comment by William Schultz (Inactive) [ 22/Jul/19 ]

Another question that came up in discussion: is it possible to run concurrent replSetInitiate commands against separate replica set nodes? In that case, would both nodes write divergent "initiation" entries and would this cause any problems?

Comment by Ratika Gandhi [ 22/Jul/19 ]

We want to understand why this has no term and what happens when the oplog entry gets rolled back. 

Comment by Judah Schvimer [ 17/Jul/19 ]

This has no term because it is written before any node is elected primary. We'll need to write it in a real term for it to be safe.

We may also want to make sure this oplog entry cannot be rolled back, which would be valuable for change streams.
