  Core Server / SERVER-16786

oplog can go back in time after rollback/startup, when node clocks are not in sync (mms-qa)

    • Type: Task
    • Resolution: Done
    • Priority: Major - P3
    • Fix Version/s: 3.0.0-rc6
    • Affects Version/s: 2.8.0-rc4
    • Component/s: Replication
    • Labels: None
    • Backwards Compatibility: Fully Compatible

      A few days ago we downgraded the mms-qa replica set to 2.6.6 (all mmapv1). (This was not related to any issues on 2.8.0-rc4.)

      This evening I used MMS Automation to upgrade the replica set back to 2.8.0-rc4 (all mmapv1). I cannot guarantee that the Automation Agents did things in the "perfect" order, but nothing looks suspicious. Each node was restarted only once; the upgrade completed quickly.

      Immediately after the upgrade, though, one node seemed to get caught in a loop, constantly cycling ROLLBACK -> RECOVERING -> SECONDARY -> ROLLBACK.
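
      For reference, this is an easy way to watch the cycle from the primary's shell (the polling loop is my own sketch, not something we ran during the incident; member _id 1 is the affected node, per the rs.status() output below):

      mmsqa:PRIMARY> while (true) {
      ...   // find the affected member by _id and print its current state
      ...   var m = rs.status().members.filter(function (x) { return x._id == 1; })[0];
      ...   print(new Date().toISOString() + "  " + m.name + "  " + m.stateStr);
      ...   sleep(5000);  // poll every 5 seconds
      ... }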

      I turned all Automation Agents off (just in case) and the cycle continued. Note that the Automation Agents consider ROLLBACK and RECOVERING to be "ok" states, so they were not attempting to interfere; I shut them down purely for the sake of simplicity.

      I turned on logLevel=2 logging on both the secondary and the primary for a little while and will attach those logs. I also tried restarting the problematic secondary, but on restart the cycle continued.
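
      (For completeness: the verbosity change was made with the standard setParameter command; shown here as a sketch rather than a transcript from the incident.)

      mmsqa:PRIMARY> db.adminCommand({ setParameter: 1, logLevel: 2 })  // raise log verbosity
      mmsqa:PRIMARY> db.adminCommand({ setParameter: 1, logLevel: 0 })  // restore the default afterwards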

      The replica set looks like this:

      mmsqa:PRIMARY> rs.status()
      {
      	"set" : "mmsqa",
      	"date" : ISODate("2015-01-09T03:14:21.949Z"),
      	"myState" : 1,
      	"members" : [
      		{
      			"_id" : 1,
      			"name" : "ip-10-154-31-239.ec2.internal:27017",
      			"health" : 1,
      			"state" : 9,
      			"stateStr" : "ROLLBACK",
      			"uptime" : 639,
      			"optime" : Timestamp(1420772106, 323),
      			"optimeDate" : ISODate("2015-01-09T02:55:06Z"),
      			"lastHeartbeat" : ISODate("2015-01-09T03:14:21.088Z"),
      			"lastHeartbeatRecv" : ISODate("2015-01-09T03:14:20.830Z"),
      			"pingMs" : 0,
      			"syncingTo" : "ip-10-113-175-160.ec2.internal:27017",
      			"configVersion" : 10
      		},
      		{
      			"_id" : 3,
      			"name" : "ip-10-60-97-75.ec2.internal:27018",
      			"health" : 1,
      			"state" : 7,
      			"stateStr" : "ARBITER",
      			"uptime" : 1153,
      			"lastHeartbeat" : ISODate("2015-01-09T03:14:21.363Z"),
      			"lastHeartbeatRecv" : ISODate("2015-01-09T03:14:21.082Z"),
      			"pingMs" : 1,
      			"configVersion" : 10
      		},
      		{
      			"_id" : 4,
      			"name" : "ip-10-113-175-160.ec2.internal:27017",
      			"health" : 1,
      			"state" : 1,
      			"stateStr" : "PRIMARY",
      			"uptime" : 1166,
      			"optime" : Timestamp(1420773260, 4),
      			"optimeDate" : ISODate("2015-01-09T03:14:20Z"),
      			"electionTime" : Timestamp(1420772100, 0),
      			"electionDate" : ISODate("2015-01-09T02:55:00Z"),
      			"configVersion" : 10,
      			"self" : true
      		}
      	],
      	"ok" : 1
      }
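
      Note the optime gap in this snapshot: the ROLLBACK member is at Timestamp(1420772106, 323) (02:55:06) while the primary is at Timestamp(1420773260, 4) (03:14:20), i.e. 1420773260 - 1420772106 = 1154 seconds (about 19 minutes) behind. A quick way to compute per-member lag from this output (my own sketch; it assumes the legacy shell's Timestamp exposes its seconds field as .t):

      mmsqa:PRIMARY> var s = rs.status();
      mmsqa:PRIMARY> var p = s.members.filter(function (m) { return m.stateStr == "PRIMARY"; })[0];
      mmsqa:PRIMARY> s.members.forEach(function (m) {
      ...   if (m.optime)  // skip the arbiter, which reports no optime
      ...     print(m.name + " (" + m.stateStr + "): " + (p.optime.t - m.optime.t) + "s behind primary");
      ... });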
      

            Assignee: Eric Milkie (milkie@mongodb.com)
            Reporter: Cailin Nelson (cailin.nelson@mongodb.com)
            Votes: 0
            Watchers: 8
