[SERVER-16786] oplog can go back in time after rollback/startup, when node clocks are not in sync (mms-qa) Created: 09/Jan/15  Updated: 23/Jan/15  Resolved: 15/Jan/15

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.8.0-rc4
Fix Version/s: 3.0.0-rc6

Type: Task Priority: Major - P3
Reporter: Cailin Nelson Assignee: Eric Milkie
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
Backwards Compatibility: Fully Compatible
Participants:

 Description   

A few days ago we downgraded the mms-qa replica set to 2.6.6 (all mmapv1). (This was not related to any issues on 2.8.0-rc4.)

This evening I used MMS Automation to upgrade the replica set back to 2.8.0-rc4 (all mmapv1). I cannot guarantee that the Automation Agents did things in the "perfect" order, but nothing looks suspicious: each node was restarted only once, and the upgrade completed quickly.

Immediately after the upgrade, though, one node appeared to be caught in a loop, cycling repeatedly through ROLLBACK -> RECOVERING -> SECONDARY -> ROLLBACK.

I turned all Automation Agents off (just in case) and the cycle continued. Note that the Automation Agents consider ROLLBACK and RECOVERING to be "ok" states, so they were not attempting to interfere; I shut them down purely for the sake of simplicity.

I turned on logLevel=2 logging on both the secondary and the primary for a little while and will attach those logs. I also tried restarting the problematic secondary, but the cycle continued after the restart.
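For reference, verbosity can be raised and restored from the mongo shell without restarting the node; a minimal sketch (the exact commands used here are not recorded in this ticket):

mmsqa:SECONDARY> db.adminCommand({ setParameter: 1, logLevel: 2 })   // raise verbosity for more detailed replication logging
{ "was" : 0, "ok" : 1 }
mmsqa:SECONDARY> db.adminCommand({ setParameter: 1, logLevel: 0 })   // restore the default once the logs are captured
{ "was" : 2, "ok" : 1 }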

The replica set looks like this:

mmsqa:PRIMARY> rs.status()
{
	"set" : "mmsqa",
	"date" : ISODate("2015-01-09T03:14:21.949Z"),
	"myState" : 1,
	"members" : [
		{
			"_id" : 1,
			"name" : "ip-10-154-31-239.ec2.internal:27017",
			"health" : 1,
			"state" : 9,
			"stateStr" : "ROLLBACK",
			"uptime" : 639,
			"optime" : Timestamp(1420772106, 323),
			"optimeDate" : ISODate("2015-01-09T02:55:06Z"),
			"lastHeartbeat" : ISODate("2015-01-09T03:14:21.088Z"),
			"lastHeartbeatRecv" : ISODate("2015-01-09T03:14:20.830Z"),
			"pingMs" : 0,
			"syncingTo" : "ip-10-113-175-160.ec2.internal:27017",
			"configVersion" : 10
		},
		{
			"_id" : 3,
			"name" : "ip-10-60-97-75.ec2.internal:27018",
			"health" : 1,
			"state" : 7,
			"stateStr" : "ARBITER",
			"uptime" : 1153,
			"lastHeartbeat" : ISODate("2015-01-09T03:14:21.363Z"),
			"lastHeartbeatRecv" : ISODate("2015-01-09T03:14:21.082Z"),
			"pingMs" : 1,
			"configVersion" : 10
		},
		{
			"_id" : 4,
			"name" : "ip-10-113-175-160.ec2.internal:27017",
			"health" : 1,
			"state" : 1,
			"stateStr" : "PRIMARY",
			"uptime" : 1166,
			"optime" : Timestamp(1420773260, 4),
			"optimeDate" : ISODate("2015-01-09T03:14:20Z"),
			"electionTime" : Timestamp(1420772100, 0),
			"electionDate" : ISODate("2015-01-09T02:55:00Z"),
			"configVersion" : 10,
			"self" : true
		}
	],
	"ok" : 1
}
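As a readability aid, the seconds component of each optime Timestamp above can be converted to a wall-clock date in the shell; doing so shows the ROLLBACK member sitting roughly 19 minutes behind the primary, at about the time the primary was elected:

mmsqa:PRIMARY> new Date(1420772106 * 1000)   // ROLLBACK member's optime
ISODate("2015-01-09T02:55:06Z")
mmsqa:PRIMARY> new Date(1420773260 * 1000)   // primary's optime
ISODate("2015-01-09T03:14:20Z")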



 Comments   
Comment by Githook User [ 15/Jan/15 ]

Author: Eric Milkie (milkie) <milkie@10gen.com>

Message: SERVER-16786 set the global optime when we rollback or startup
Branch: master
https://github.com/mongodb/mongo/commit/8820ec71f6a30caf20723524426650467bdf73e2

Comment by Eric Milkie [ 14/Jan/15 ]

I found the source of the problem and will be producing a fix soon. We changed how optimes are tracked for 2.8, and we missed a corner case.
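
For anyone hitting a similar cycle, a rough way to see whether a node's reported optime has fallen behind its own oplog is to compare the newest entry in the local oplog with the optime the node reports in rs.status(); a sketch, assuming the usual local.oplog.rs collection:

mmsqa:SECONDARY> db.getSiblingDB("local").oplog.rs.find({}, { ts: 1 }).sort({ $natural: -1 }).limit(1)   // newest entry actually in this node's oplog
mmsqa:SECONDARY> rs.status().members.filter(function (m) { return m.self; })[0].optime                   // optime this node reports to the rest of the set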
