  Core Server / SERVER-16786

oplog can go back in time after rollback/startup, when node clocks are not in sync (mms-qa)

    • Type: Task
    • Resolution: Done
    • Priority: Major - P3
    • Fix Version/s: 3.0.0-rc6
    • Affects Version/s: 2.8.0-rc4
    • Component/s: Replication
    • Labels: None
    • Backwards Compatibility: Fully Compatible

      A few days ago we downgraded the mms-qa replica set to 2.6.6 (all mmapv1). (This was not related to any issues on 2.8.0-rc4.)

      This evening I used MMS Automation to upgrade the replica set back to 2.8.0-rc4 (all mmapv1). I cannot guarantee that the Automation Agents did things in the "perfect" order, but nothing looks suspicious. Each node was restarted only once; the upgrade completed quickly.

      Immediately after the upgrade, though, one node seemed to get caught in a loop, constantly cycling ROLLBACK -> RECOVERING -> SECONDARY -> ROLLBACK.
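
      For reference, this is an easy way to watch the cycle from the primary's shell (the polling loop is my own sketch, not something we ran during the incident; member _id 1 is the affected node, per the rs.status() output below):

      mmsqa:PRIMARY> while (true) {
      ...   // find the affected member by _id and print its current state
      ...   var m = rs.status().members.filter(function (x) { return x._id == 1; })[0];
      ...   print(new Date().toISOString() + "  " + m.name + "  " + m.stateStr);
      ...   sleep(5000);  // poll every 5 seconds
      ... }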

      I turned all Automation Agents off (just in case) and the cycle continued. Note that the Automation Agents consider ROLLBACK and RECOVERING to be "ok" states, so they were not attempting to interfere; I shut them down purely for the sake of simplicity.

      I turned on logLevel=2 logging on both the secondary and the primary for a little while and will attach those logs. I also tried restarting the problematic secondary, but on restart the cycle continued.
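
      (For completeness: the verbosity change was made with the standard setParameter command; shown here as a sketch rather than a transcript from the incident.)

      mmsqa:PRIMARY> db.adminCommand({ setParameter: 1, logLevel: 2 })  // raise log verbosity
      mmsqa:PRIMARY> db.adminCommand({ setParameter: 1, logLevel: 0 })  // restore the default afterwards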

      The replica set looks like this:

      mmsqa:PRIMARY> rs.status()
      {
      	"set" : "mmsqa",
      	"date" : ISODate("2015-01-09T03:14:21.949Z"),
      	"myState" : 1,
      	"members" : [
      		{
      			"_id" : 1,
      			"name" : "ip-10-154-31-239.ec2.internal:27017",
      			"health" : 1,
      			"state" : 9,
      			"stateStr" : "ROLLBACK",
      			"uptime" : 639,
      			"optime" : Timestamp(1420772106, 323),
      			"optimeDate" : ISODate("2015-01-09T02:55:06Z"),
      			"lastHeartbeat" : ISODate("2015-01-09T03:14:21.088Z"),
      			"lastHeartbeatRecv" : ISODate("2015-01-09T03:14:20.830Z"),
      			"pingMs" : 0,
      			"syncingTo" : "ip-10-113-175-160.ec2.internal:27017",
      			"configVersion" : 10
      		},
      		{
      			"_id" : 3,
      			"name" : "ip-10-60-97-75.ec2.internal:27018",
      			"health" : 1,
      			"state" : 7,
      			"stateStr" : "ARBITER",
      			"uptime" : 1153,
      			"lastHeartbeat" : ISODate("2015-01-09T03:14:21.363Z"),
      			"lastHeartbeatRecv" : ISODate("2015-01-09T03:14:21.082Z"),
      			"pingMs" : 1,
      			"configVersion" : 10
      		},
      		{
      			"_id" : 4,
      			"name" : "ip-10-113-175-160.ec2.internal:27017",
      			"health" : 1,
      			"state" : 1,
      			"stateStr" : "PRIMARY",
      			"uptime" : 1166,
      			"optime" : Timestamp(1420773260, 4),
      			"optimeDate" : ISODate("2015-01-09T03:14:20Z"),
      			"electionTime" : Timestamp(1420772100, 0),
      			"electionDate" : ISODate("2015-01-09T02:55:00Z"),
      			"configVersion" : 10,
      			"self" : true
      		}
      	],
      	"ok" : 1
      }
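
      Note the optime gap in this snapshot: the ROLLBACK member is at Timestamp(1420772106, 323) (02:55:06) while the primary is at Timestamp(1420773260, 4) (03:14:20), i.e. 1420773260 - 1420772106 = 1154 seconds (about 19 minutes) behind. A quick way to compute per-member lag from this output (my own sketch; it assumes the legacy shell's Timestamp exposes its seconds field as .t):

      mmsqa:PRIMARY> var s = rs.status();
      mmsqa:PRIMARY> var p = s.members.filter(function (m) { return m.stateStr == "PRIMARY"; })[0];
      mmsqa:PRIMARY> s.members.forEach(function (m) {
      ...   if (m.optime)  // skip the arbiter, which reports no optime
      ...     print(m.name + " (" + m.stateStr + "): " + (p.optime.t - m.optime.t) + "s behind primary");
      ... });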
      

            Assignee: Eric Milkie (milkie@mongodb.com)
            Reporter: Cailin Nelson (cailin.nelson@mongodb.com)
            Votes: 0
            Watchers: 8
