Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-9481

Server in replication set fluctuates between states SECONDARY and ROLLBACK if it can't rollback due to too much rollback data

    • ALL
    • Hide

      Follow this guide http://comerford.cc/2012/05/28/simulating-rollback-on-mongodb/ but use the following MongoDB command to create the data:

      for(var i = 0; i <= 100000000000000000 ; i+){for (var j = 0; j <= 1000; j+){ db.rollback_test.insert(

      {"a" : i * j}

      );} sleep(100);}

      Run this command on the primary until you are confident that you have at least 300 MB of data (e.g. by checking the output of df -h as a rough estimate).

      Show
      Follow this guide http://comerford.cc/2012/05/28/simulating-rollback-on-mongodb/ but use the following MongoDB command to create the data: for(var i = 0; i <= 100000000000000000 ; i+ ){for (var j = 0; j <= 1000; j +){ db.rollback_test.insert( {"a" : i * j} );} sleep(100);} Run this command on the primary until you are confident that you have at least 300 MB of data (e.g. by checking the output of df -h as a rough estimate).

      If you encounter a rollback situation in a replication set and the rollback data is more than 300 megabytes/30 minutes the rollback will fail. Unfortunately it is not clear from rs.status() that one node is in a state where it can't recover from. The node with the rollback problem will try to roll back again and again which can be seen in the mongodb.log:

      Fri Apr 26 13:28:55.957 [rsBackgroundSync] replSet syncing to: 10.128.128.102:27017
      Fri Apr 26 13:29:01.030 [rsBackgroundSync] replSet we are ahead of the sync source, will try to roll back
      Fri Apr 26 13:29:01.030 [rsBackgroundSync] replSet rollback 0
      Fri Apr 26 13:29:01.030 [rsBackgroundSync] replSet ROLLBACK
      Fri Apr 26 13:29:01.030 [rsBackgroundSync] replSet rollback 1
      Fri Apr 26 13:29:01.030 [rsBackgroundSync] replSet rollback 2 FindCommonPoint
      Fri Apr 26 13:29:01.031 [rsBackgroundSync] replSet rollback 2 error findcommonpoint waiting a while before trying again
      Fri Apr 26 13:29:01.873 [rsSync] replSet SECONDARY
      Fri Apr 26 13:29:10.640 [conn613] end connection 10.128.128.102:53295 (2 connections now open)
      Fri Apr 26 13:29:10.641 [initandlisten] connection accepted from 10.128.128.102:53297 #615 (3 connections now open)
      Fri Apr 26 13:29:11.031 [rsBackgroundSync] replSet syncing to: 10.128.128.102:27017
      Fri Apr 26 13:29:11.388 [conn614] end connection 10.128.128.98:41406 (2 connections now open)
      Fri Apr 26 13:29:11.389 [initandlisten] connection accepted from 10.128.128.98:41408 #616 (3 connections now open)
      Fri Apr 26 13:29:16.104 [rsBackgroundSync] replSet we are ahead of the sync source, will try to roll back
      Fri Apr 26 13:29:16.104 [rsBackgroundSync] replSet rollback 0
      Fri Apr 26 13:29:16.105 [rsBackgroundSync] replSet ROLLBACK
      Fri Apr 26 13:29:16.105 [rsBackgroundSync] replSet rollback 1
      Fri Apr 26 13:29:16.105 [rsBackgroundSync] replSet rollback 2 FindCommonPoint
      Fri Apr 26 13:29:16.105 [rsBackgroundSync] replSet rollback 2 error findcommonpoint waiting a while before trying again
      Fri Apr 26 13:29:16.876 [rsSync] replSet SECONDARY
      Fri Apr 26 13:29:26.105 [rsBackgroundSync] replSet syncing to: 10.128.128.102:27017
      Fri Apr 26 13:29:31.172 [rsBackgroundSync] replSet we are ahead of the sync source, will try to roll back
      Fri Apr 26 13:29:31.172 [rsBackgroundSync] replSet rollback 0
      Fri Apr 26 13:29:31.172 [rsBackgroundSync] replSet ROLLBACK
      Fri Apr 26 13:29:31.172 [rsBackgroundSync] replSet rollback 1
      Fri Apr 26 13:29:31.172 [rsBackgroundSync] replSet rollback 2 FindCommonPoint
      Fri Apr 26 13:29:31.172 [rsBackgroundSync] replSet rollback 2 error findcommonpoint waiting a while before trying again
      Fri Apr 26 13:29:31.879 [rsSync] replSet SECONDARY
      Fri Apr 26 13:29:40.774 [conn615] end connection 10.128.128.102:53297 (2 connections now open)
      Fri Apr 26 13:29:40.774 [initandlisten] connection accepted from 10.128.128.102:53299 #617 (3 connections now open)
      Fri Apr 26 13:29:41.173 [rsBackgroundSync] replSet syncing to: 10.128.128.102:27017
      Fri Apr 26 13:29:41.412 [conn616] end connection 10.128.128.98:41408 (2 connections now open)
      Fri Apr 26 13:29:41.413 [initandlisten] connection accepted from 10.128.128.98:41410 #618 (3 connections now open)
      Fri Apr 26 13:29:46.245 [rsBackgroundSync] replSet we are ahead of the sync source, will try to roll back
      Fri Apr 26 13:29:46.245 [rsBackgroundSync] replSet rollback 0
      Fri Apr 26 13:29:46.245 [rsBackgroundSync] replSet ROLLBACK
      Fri Apr 26 13:29:46.245 [rsBackgroundSync] replSet rollback 1
      Fri Apr 26 13:29:46.245 [rsBackgroundSync] replSet rollback 2 FindCommonPoint
      Fri Apr 26 13:29:46.247 [rsBackgroundSync] replSet info rollback our last optime: Apr 26 10:52:56:f23
      Fri Apr 26 13:29:46.248 [rsBackgroundSync] replSet info rollback their last optime: Apr 26 09:47:21:1000
      Fri Apr 26 13:29:46.248 [rsBackgroundSync] replSet info rollback diff in end of log times: 3935 seconds
      Fri Apr 26 13:29:46.248 [rsBackgroundSync] replSet rollback too long a time period for a rollback.
      Fri Apr 26 13:29:46.248 [rsBackgroundSync] replSet rollback 2 error error not willing to roll back more than 30 minutes of data
      Fri Apr 26 13:29:46.881 [rsSync] replSet SECONDARY

      When you start the mongo shell on a replication set node and repeatedly execute the rs.status() command you will see that the affected secondary goes into the ROLLBACK state for a few seconds max, just to flip back into SECONDARY until it enters the next iteration of the rollback loop.

      Actually I would expect the affected node to stay in ROLLBACK until the rollback succeeds. If we are in a situation like this where manual intervention is necessary, maybe it would a good idea for the affected instance to go to the FATAL state.

            Assignee:
            matt.dannenberg Matt Dannenberg
            Reporter:
            aheck Andreas Heck
            Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

              Created:
              Updated:
              Resolved: