Core Server / SERVER-33571

Running stepDown may trigger a rollback


Details

    • Type: Bug
    • Resolution: Gone away
    • Priority: Major - P3
    • Fix Version/s: None
    • Affects Version/s: 3.2.11
    • Component/s: Replication
    • Labels: None
    • Operating System: ALL

    Steps To Reproduce

      I couldn't find a deterministic set of steps to trigger the problem, but following the steps below I was able to reproduce it with a good chance of success. The issue seems to happen when the server is under heavy load and we get "lucky" and the node with the lowest priority ends up as primary after a step down. All scripts I used are attached to this issue.

      Start the three mongod servers:

      $ mongod --config node1.cfg &
      $ mongod --config node2.cfg &
      $ mongod --config node3.cfg &
      

      Set up the cluster and repeatedly step down the primary:

      $ mongo rollback-bug.js
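
      The attached rollback-bug.js is the authoritative script; roughly, it does something along the lines of the sketch below, i.e. it initiates the set and keeps stepping down whoever is primary. The set name, port 27017, and the exact priority values are placeholders, not taken from the attachments.

      // Hypothetical sketch only; see the attached rollback-bug.js for the real script.
      // Assumes three mongods on 127.0.0.1:27017-27019 forming a replica set named "rs0".

      // Give the members different priorities so the same node normally wins elections.
      rs.initiate({
          _id: "rs0",
          members: [
              { _id: 0, host: "127.0.0.1:27017", priority: 3 },
              { _id: 1, host: "127.0.0.1:27018", priority: 2 },
              { _id: 2, host: "127.0.0.1:27019", priority: 1 }
          ]
      });

      // Repeatedly find the current primary and ask it to step down.
      while (true) {
          var primary = db.isMaster().primary;
          if (primary) {
              try {
                  // The stepping-down primary closes its connections, so this may throw.
                  new Mongo(primary).getDB("admin").runCommand({ replSetStepDown: 10 });
              } catch (e) {
                  // expected when the connection is dropped mid-command
              }
          }
          sleep(5000);  // give the set time to elect a new primary
      }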
      

      Running rollback-bug.js will eventually print something like the following and exit, indicating that the issue occurred:

      Thu Mar 01 2018 14:04:52 GMT+0000 (GMT) Rollback detected
      Thu Mar 01 2018 14:04:52 GMT+0000 (GMT)
      [
      	{
      		"_id" : 2,
      		"name" : "127.0.0.1:27019",
      		"health" : 1,
      		"state" : 9,
      		"stateStr" : "ROLLBACK",
      		"uptime" : 208,
      		"optime" : {
      			"ts" : Timestamp(1519913083, 423),
      			"t" : NumberLong(3)
      		},
      		"optimeDate" : ISODate("2018-03-01T14:04:43Z"),
      		"lastHeartbeat" : ISODate("2018-03-01T14:04:51.817Z"),
      		"lastHeartbeatRecv" : ISODate("2018-03-01T14:04:49.718Z"),
      		"pingMs" : NumberLong(2),
      		"syncingTo" : "127.0.0.1:27018",
      		"configVersion" : 1
      	}
      ]
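
      That output is just the rs.status() entry for the member that entered ROLLBACK (state 9). A check along the following lines is enough to detect it (a sketch; the attached rollback-bug.js may do this differently):

      // Look for any replica set member reporting the ROLLBACK state.
      var rolledBack = rs.status().members.filter(function (m) {
          return m.stateStr === "ROLLBACK";   // state 9
      });
      if (rolledBack.length > 0) {
          print(Date() + " Rollback detected");
          printjson(rolledBack);
          quit(1);
      }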
      

      After a primary is elected, I ran the write_forever.py script (it assumes pymongo is on your Python path). All the script does is start a few processes that constantly insert documents.

      $ python write_forever.py
      

      I'm afraid you may need to tune the number of processes and document size in write_forever.py depending on your settings.
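      For reference, a comparable insert load can also be generated straight from the mongo shell. The snippet below is only a rough single-connection stand-in for write_forever.py (which uses pymongo and several processes); the database name, collection name, and document size are made up.

      // Rough stand-in for write_forever.py: hammer the primary with inserts.
      // Names and sizes here are illustrative only.
      var coll = db.getSiblingDB("rollbacktest").docs;
      var padding = new Array(1024).join("x");   // ~1 KB of filler per document
      for (var i = 0; ; i++) {
          try {
              coll.insert({ seq: i, pad: padding, ts: new Date() });
          } catch (e) {
              // inserts fail briefly during elections and step downs; keep going
          }
      }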


    Description

      I got a rollback after stepping down the primary in a three-node MongoDB replica set. From MongoDB's documentation:

      The method then initiates a catchup period where it waits up to secondaryCatchUpPeriodSeconds, by default 10 seconds, for a secondary to become up-to-date with the primary. The primary only steps down if a secondary is up-to-date with the primary during the catchup period to prevent rollbacks.

      No networking issues were observed during that time, so I wouldn't expect a rollback to happen.

      It seems the rollback was caused by a primary election that happened right after the step down (in this particular replica set the nodes have different priorities). However, I don't see anything in the documentation saying that primary elections may cause rollbacks either. In fact, I was not able to reproduce the problem by only doing step downs or by only swapping priorities.
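
      For reference, the catch-up period quoted above maps to the second argument of rs.stepDown() (or the secondaryCatchUpPeriodSecs field of the replSetStepDown command); the values below are examples only:

      // Step down for 60 seconds, waiting up to 10 seconds (the default)
      // for a secondary to catch up before relinquishing primary.
      rs.stepDown(60, 10);

      // Equivalent command form:
      db.adminCommand({ replSetStepDown: 60, secondaryCatchUpPeriodSecs: 10 });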

      Attachments

        1. mongo-3.2.11-logs.tar.gz
          39 kB
        2. mongo-3.4.13-logs.tar.gz
          3.26 MB
        3. node1.cfg
          0.2 kB
        4. node2.cfg
          0.2 kB
        5. node3.cfg
          0.2 kB
        6. rollback-bug.js
          1 kB
        7. write_forever.py
          0.6 kB


          People

            Assignee: Kelsey Schubert (kelsey.schubert@mongodb.com)
            Reporter: Rafael Cunha de Almeida (rafael)
            Votes: 0
            Watchers: 9
