Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Critical - P2
    • Resolution: Fixed
    • Affects Version/s: 1.8.1
    • Fix Version/s: None
    • Component/s: Replication/Pairing
    • Labels:
      None
    • Environment:
      as in SERVER-3125
    • Backport:
      No
    • # Replies:
      9
    • Last comment by Customer:
      true

      Description

      After changing the oplog size in the config and restarting mongod on each node (in order: node 1 - secondary, node 2 - secondary, node 3 - primary):

      Set name: testreplset
      Majority up: yes
      Member                   | id | Up | cctime   | Last heartbeat | Votes | Priority | State      | Messages                                                             | optime      | skew
      172.17.0.251:27017       | 0  | 1  | 3.3 mins | 1 sec ago      | 1     | 1        | RECOVERING | error RS102 too stale to catch up                                    | 4dd6539b:81 |
      172.17.0.252:27017       | 1  | 1  | 3.3 mins | 1 sec ago      | 1     | 1        | PRIMARY    |                                                                      | 4dd6546d:7c |
      172.17.0.253:27017 (me)  | 2  | 1  | 3.4 mins |                | 1     | 1        | ROLLBACK   | rollback 2 error findcommonpoint waiting a while before trying again | 4dd65853:8  |

      Recent replset log activity:

      Fri May 20 14:15:09 [startReplSets] couldn't connect to localhost:27017: couldn't connect to server localhost:27017
      14:15:09 [startReplSets] replSet STARTUP2
      14:15:09 [replica set sync] replSet SECONDARY
      14:15:09 [rs Manager] replSet can't see a majority, will not try to elect self
      14:15:09 [ReplSetHealthPollTask] replSet info 172.17.0.252:27017 is up
      14:15:09 [ReplSetHealthPollTask] replSet member 172.17.0.252:27017 PRIMARY
      14:15:11 [ReplSetHealthPollTask] replSet info 172.17.0.251:27017 is up
      14:15:11 [ReplSetHealthPollTask] replSet member 172.17.0.251:27017 RECOVERING
      14:15:12 [replica set sync] replSet we are ahead of the primary, will try to roll back
      14:15:12 [replica set sync] replSet rollback 0
      14:15:12 [replica set sync] replSet ROLLBACK
      14:15:12 [replica set sync] replSet rollback 1
      14:15:12 [replica set sync] replSet rollback 2 FindCommonPoint
      14:15:12 [replica set sync] replSet info rollback our last optime: May 20 14:02:27:8
      14:15:12 [replica set sync] replSet info rollback their last optime: May 20 13:45:49:7c
      14:15:12 [replica set sync] replSet info rollback diff in end of log times: 998 seconds
      14:15:12 [replica set sync] replSet rollback error RS101 reached beginning of local oplog
      14:15:12 [replica set sync] replSet them: 172.17.0.252:27017 scanned: 58620
      14:15:12 [replica set sync] replSet theirTime: May 20 13:45:49 4dd6546d:7c
      14:15:12 [replica set sync] replSet ourTime: May 20 13:57:33 4dd6572d:1dd
      14:15:12 [replica set sync] replSet rollback 2 error RS101 reached beginning of local oplog [2]
      14:15:25 [replica set sync] replSet we are ahead of the primary, will try to roll back
      14:15:25 [replica set sync] replSet rollback 0
      14:15:25 [replica set sync] replSet rollback 1
      14:15:25 [replica set sync] replSet rollback 2 FindCommonPoint
      14:15:25 [replica set sync] replSet rollback 2 error findcommonpoint waiting a while before trying again
      14:15:39 .....
      14:15:52 .....
      14:16:05 .....
      14:16:18 [replica set sync] replSet we are ahead of the primary, will try to roll back
      14:16:18 [replica set sync] replSet rollback 0
      14:16:18 [replica set sync] replSet rollback 1
      14:16:18 [replica set sync] replSet rollback 2 FindCommonPoint
      14:16:18 [replica set sync] replSet info rollback our last optime: May 20 14:02:27:8
      14:16:18 [replica set sync] replSet info rollback their last optime: May 20 13:45:49:7c
      14:16:18 [replica set sync] replSet info rollback diff in end of log times: 998 seconds
      14:16:18 [replica set sync] replSet rollback error RS101 reached beginning of local oplog
      14:16:18 [replica set sync] replSet them: 172.17.0.252:27017 scanned: 58620
      14:16:18 [replica set sync] replSet theirTime: May 20 13:45:49 4dd6546d:7c
      14:16:18 [replica set sync] replSet ourTime: May 20 13:57:33 4dd6572d:1dd
      14:16:18 [replica set sync] replSet rollback 2 error RS101 reached beginning of local oplog [2]
      14:16:32 [replica set sync] replSet we are ahead of the primary, will try to roll back
      14:16:32 [replica set sync] replSet rollback 0
      14:16:32 [replica set sync] replSet rollback 1
      14:16:32 [replica set sync] replSet rollback 2 FindCommonPoint
      14:16:32 [replica set sync] replSet rollback 2 error findcommonpoint waiting a while before trying again
      14:16:45 .....
      14:16:58 .....
      14:17:11 .....
      14:17:24 [replica set sync] replSet we are ahead of the primary, will try to roll back
      14:17:24 [replica set sync] replSet rollback 0
      14:17:24 [replica set sync] replSet rollback 1
      14:17:24 [replica set sync] replSet rollback 2 FindCommonPoint
      14:17:24 [replica set sync] replSet info rollback our last optime: May 20 14:02:27:8
      14:17:24 [replica set sync] replSet info rollback their last optime: May 20 13:45:49:7c
      14:17:24 [replica set sync] replSet info rollback diff in end of log times: 998 seconds
      14:17:25 [replica set sync] replSet rollback error RS101 reached beginning of local oplog
      14:17:25 [replica set sync] replSet them: 172.17.0.252:27017 scanned: 58620
      14:17:25 [replica set sync] replSet theirTime: May 20 13:45:49 4dd6546d:7c
      14:17:25 [replica set sync] replSet ourTime: May 20 13:57:33 4dd6572d:1dd
      14:17:25 [replica set sync] replSet rollback 2 error RS101 reached beginning of local oplog [2]
      14:17:38 [replica set sync] replSet we are ahead of the primary, will try to roll back
      14:17:38 [replica set sync] replSet rollback 0
      14:17:38 [replica set sync] replSet rollback 1
      14:17:38 [replica set sync] replSet rollback 2 FindCommonPoint
      14:17:38 [replica set sync] replSet rollback 2 error findcommonpoint waiting a while before trying again
      14:17:51 .....
      14:18:04 .....

        Activity

        Kristina Chodorow (Inactive)
        added a comment -

        This:

        14:15:12 [replica set sync] replSet rollback error RS101 reached beginning of local oplog

        means you're going to have to resync this server. The servers have diverged for too long to reconcile automatically.
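
        For reference, a rough sketch of what resyncing the stale member could look like (steps are illustrative, not an exact procedure; check the docs for your version):

            // 1. Shut down the stale mongod.
            // 2. Move aside or delete the contents of its dbpath.
            // 3. Restart it with the same --replSet option; it should then perform
            //    a full initial sync from another member.
            // Progress can be watched from the mongo shell with:
            rs.status()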

        MartinS
        added a comment -

        This time it was at most 15 minutes. How can I force MongoDB not to discard the oplog until all nodes are in the same state?

        Kristina Chodorow (Inactive)
        added a comment -

        Hmm, how big are your oplogs? Can you send the output of db.printReplicationInfo() from the primary and recovering nodes?
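
        For reference, these can be run from the mongo shell on each member:

            // on the primary and on the recovering node
            db.printReplicationInfo()        // configured oplog size, first/last oplog event times
            db.printSlaveReplicationInfo()   // lag of the secondaries (most useful when run on the primary)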

        MartinS
        added a comment - - edited

        It doesn't matter how big the oplogs are.

        or: size doesn't matter :]

        The primary question is: why did a node that cannot sync become the new primary after the primary host was restarted?
        And then: why did the host with the only up-to-date database try to roll back after the restart?

        Kristina Chodorow (Inactive)
        added a comment -

        You cannot, at the moment, change oplog size by just changing the config option. You have to drop the oplog and recreate it with a new size. The oplog size matters a great deal!
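
        A minimal sketch of recreating the oplog with a new size (assuming the member has been restarted as a standalone, i.e. without --replSet; the size value is in bytes and only illustrative):

            use local
            // optionally keep the newest entry so the member retains its sync point
            var last = db.oplog.rs.find().sort({$natural: -1}).limit(1).next()
            db.oplog.rs.drop()
            db.createCollection("oplog.rs", {capped: true, size: 2 * 1024 * 1024 * 1024})  // ~2 GB
            db.oplog.rs.save(last)
            // then restart mongod with --replSet again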

        > The primary question is: why did a node that cannot sync become the new primary after the primary host was restarted?

        If you have logs from all three during the time in question, I can probably interpret them for you.

        > And then: why did the host with the only up-to-date database try to roll back after the restart?

        Basically, if you have a set with a primary & two secondaries and the primary is ahead, when it goes down, one of the secondaries will be elected primary even if it doesn't have all of the data the primary had. Then this new primary will start taking writes, so when the old primary comes back up, it will have data that it never applied to the secondaries, but it can't apply it automatically because it might conflict with writes that have happened while it was down. So it rolls back its conflicting data. See http://www.snailinaturtleneck.com/blog/2011/01/19/how-to-use-replica-set-rollbacks/ for an example.

        MartinS
        added a comment -

        The logs are in SERVER-3125.

        One important piece of information: I don't want to get the data back via rollback. That was only test data.

        And about primary election with the secondaries in an incomplete state: this should be configurable. I don't want to use MongoDB and get rollbacks from time to time. I would like to be able to configure my replica set to stop when half of the nodes are not up to date with the rest.

        And one configuration directive could be very useful: forcing the oplog to be kept if changes have not been committed on all nodes.

        Kristina Chodorow (Inactive)
        added a comment -

        I'll take a look at the logs. If you always use a write concern of w:<majority of servers>, you should never have rollbacks.

        MartinS
        added a comment -

        I don't want to recover my current data because this is only a test before putting MongoDB into production.

        But I don't want to have to use rollback; I'd like MongoDB to stop working if half of the nodes are stale (like when hosts are down).

        Kristina Chodorow (Inactive)
        added a comment -

        You can prevent rollback with w:"majority" in v2.0.
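
        A minimal shell sketch (the collection name is just an example; drivers expose the same option through their write concern settings):

            db.test.insert({x: 1})
            // wait until a majority of the set has acknowledged the write
            db.runCommand({getLastError: 1, w: "majority", wtimeout: 5000})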


          People

          • Assignee:
            Kristina Chodorow (Inactive)
            Reporter:
            MartinS
            Participants:
            Last commenter:
            Ian Whalen
          • Votes:
            0
            Watchers:
            0

            Dates

            • Created:
              Updated:
              Resolved:
              Days since reply:
              2 years, 25 weeks, 6 days ago
              Date of 1st Reply: