Core Server / SERVER-25026

Secondary aborts immediately after losing a primary election, and will not restart

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Duplicate
    • Affects Version/s: 3.2.8
    • Fix Version/s: None
    • Component/s: Replication
    • Labels: None
    • Operating System: ALL
    • Steps To Reproduce:

      Three-member replica set: m1 (priority 3), m2 (priority 2), m3 (priority 1). While operating normally, from m1:

      rs.stepDown(120); use admin; db.shutdownServer();

      Wait ~2 minutes, then restart m1.
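
      The steps above can be sketched as a shell sequence. The hostnames, port, and config path below are hypothetical placeholders, not from the ticket; adjust them to your deployment.

      ```shell
      # Hypothetical repro sketch for a 3-member replica set m1/m2/m3
      # with priorities 3/2/1.

      # From m1 (the priority-3 primary): step down for 120s, then shut down.
      mongo --host m1.example.net:27017 --eval 'rs.stepDown(120)'
      mongo --host m1.example.net:27017 admin --eval 'db.shutdownServer()'

      # Wait ~2 minutes so the new primary accepts writes in a new term.
      sleep 120

      # Restart m1; per this report, the losing node may abort with fassert 18750.
      mongod --config /etc/mongod.conf
      ```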

      Description

      After rebooting a system, the node that had been primary lost an election to the node that is now primary. During that process, and in the seconds that followed, the node logged an error, got stuck in "RECOVERING", and then crashed with a fatal assertion.

      DAY:15:38.702-0500 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 21
      DAY:15:38.703-0500 I REPL     [replExecDBWorker-1] transition to SECONDARY
      DAY:15:38.710-0500 I NETWORK  [conn244] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [5.5.5.12:58244]
      DAY:15:39.248-0500 I REPL     [ReplicationExecutor] Member mongo01.domain.us:27017 is now in state PRIMARY
      DAY:15:39.694-0500 I REPL     [ReplicationExecutor] syncing from: mongo01.domain.us:27017
      DAY:15:39.696-0500 I REPL     [SyncSourceFeedback] setting syncSourceFeedback to mongo01.domain.us:27017
      DAY:15:39.697-0500 I ASIO     [NetworkInterfaceASIO-BGSync-0] Successfully connected to mongo01.domain.us:27017
      DAY:15:39.698-0500 I REPL     [rsBackgroundSync] Starting rollback due to OplogStartMissing: our last op time fetched: (term: 20, timestamp: Jul 12 22:15:38:1ce). source's GTE: (term: 21, timestamp: Jul 12 22:15:38:1ce) hashes: (7825714044400008130/-1471538608383373916)
      DAY:15:39.698-0500 I REPL     [rsBackgroundSync] beginning rollback
      DAY:15:39.698-0500 I REPL     [rsBackgroundSync] rollback 0
      DAY:15:39.698-0500 I REPL     [ReplicationExecutor] transition to ROLLBACK
      DAY:15:39.699-0500 I REPL     [rsBackgroundSync] rollback 1
      DAY:15:39.700-0500 I REPL     [rsBackgroundSync] rollback 2 FindCommonPoint
      DAY:15:39.700-0500 I REPL     [rsBackgroundSync] rollback our last optime:   Jul 12 22:15:38:1ce
      DAY:15:39.700-0500 I REPL     [rsBackgroundSync] rollback their last optime: Jul 12 22:15:39:167
      DAY:15:39.700-0500 I REPL     [rsBackgroundSync] rollback diff in end of log times: -1 seconds
      DAY:15:39.707-0500 I QUERY    [conn545] assertion 13436 cannot read from oplog collection while in rollback ns:local.oplog.rs query:{}
      DAY:15:39.707-0500 I QUERY    [conn545]  ntoskip:0 ntoreturn:1
      DAY:15:39.809-0500 I QUERY    [conn547] assertion 13436 cannot read from oplog collection while in rollback ns:local.oplog.rs query:{}
      DAY:15:39.809-0500 I QUERY    [conn547]  ntoskip:0 ntoreturn:1
      DAY:15:39.907-0500 I REPL     [rsBackgroundSync] rollback 3 fixup
      DAY:15:39.908-0500 I REPL     [rsBackgroundSync] rollback 3.5
      DAY:15:39.908-0500 I REPL     [rsBackgroundSync] rollback 4 n:2
      DAY:15:39.908-0500 I REPL     [rsBackgroundSync] minvalid=(term: 21, timestamp: Jul 12 22:15:39:1df)
      DAY:15:39.908-0500 I REPL     [rsBackgroundSync] rollback 4.6
      DAY:15:39.908-0500 I REPL     [rsBackgroundSync] rollback 4.7
      DAY:15:39.909-0500 I REPL     [rsBackgroundSync] rollback 5 d:1 u:2
      DAY:15:39.909-0500 I REPL     [rsBackgroundSync] rollback 6
      DAY:15:39.910-0500 I REPL     [rsBackgroundSync] rollback done
      DAY:15:39.910-0500 I REPL     [rsBackgroundSync] rollback finished
      DAY:15:39.910-0500 I NETWORK  [SyncSourceFeedback] Socket say send() errno:9 Bad file descriptor 5.5.6.11:27017
      DAY:15:39.910-0500 I REPL     [SyncSourceFeedback] SyncSourceFeedback error sending update: socket exception [SEND_ERROR] for 5.5.6.11:27017
      DAY:15:39.910-0500 I REPL     [SyncSourceFeedback] updateUpstream failed: Location9001: socket exception [SEND_ERROR] for 5.5.6.11:27017, will retry
      DAY:15:39.910-0500 I QUERY    [conn548] assertion 13436 cannot read from oplog collection while in rollback ns:local.oplog.rs query:{}
      DAY:15:39.910-0500 I QUERY    [conn548]  ntoskip:0 ntoreturn:1
      DAY:15:39.920-0500 I REPL     [ReplicationExecutor] transition to RECOVERING
      DAY:15:39.920-0500 I REPL     [ReplicationExecutor] syncing from: mongo03.domain.us:27017
      DAY:15:39.921-0500 I REPL     [SyncSourceFeedback] setting syncSourceFeedback to mongo03.domain.us:27017
      DAY:15:39.922-0500 I ASIO     [NetworkInterfaceASIO-BGSync-0] Successfully connected to mongo03.domain.us:27017
      DAY:16:09.931-0500 I REPL     [ReplicationExecutor] re-evaluating sync source because our current sync source's most recent OpTime is (term: 20, timestamp: Jul 12 22:15:38:1cc) which is more than 30s behind member mongo01.domain.us:27017 whose most recent OpTime is (term: 21, timestamp: Jul 12 22:16:09:a8)
      DAY:16:09.932-0500 I REPL     [ReplicationExecutor] syncing from: mongo01.domain.us:27017
      DAY:16:09.933-0500 I REPL     [SyncSourceFeedback] setting syncSourceFeedback to mongo01.domain.us:27017
      DAY:16:09.934-0500 I REPL     [rsBackgroundSync] Starting rollback due to OplogStartMissing: our last op time fetched: (term: 20, timestamp: Jul 12 22:15:38:1cc). source's GTE: (term: 21, timestamp: Jul 12 22:15:38:1cd) hashes: (1087207337565601823/-6762428419615138830)
      DAY:16:09.935-0500 I -        [rsBackgroundSync] Fatal assertion 18750 UnrecoverableRollbackError: need to rollback, but in inconsistent state. minvalid: (term: 21, timestamp: Jul 12 22:15:39:1df) > our last optime: (term: 20, timestamp: Jul 12 22:15:38:1cc)
      DAY:16:09.935-0500 I -        [rsBackgroundSync]
       
      ***aborting after fassert() failure
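
      The fatal assertion above enforces a consistency invariant: a node may only begin rollback if its minValid point is at or before its last applied optime; otherwise it never finished applying ops from an earlier, interrupted rollback and its data is in an unknown state. A minimal Python sketch of that check (not MongoDB's actual source; optimes are modeled as (term, timestamp-counter) pairs, with the counters taken from the log above):

      ```python
      # Sketch of the invariant behind "Fatal assertion 18750
      # UnrecoverableRollbackError". An OpTime orders by (term, timestamp).
      from functools import total_ordering

      @total_ordering
      class OpTime:
          def __init__(self, term, ts):
              self.term, self.ts = term, ts
          def __eq__(self, other):
              return (self.term, self.ts) == (other.term, other.ts)
          def __lt__(self, other):
              return (self.term, self.ts) < (other.term, other.ts)

      def can_attempt_rollback(minvalid, last_optime):
          """Rollback is only safe when the node has applied everything up
          to minValid; if minvalid > last optime, the node is mid-recovery
          and must refuse (the server fasserts instead of proceeding)."""
          return minvalid <= last_optime

      # Values from the log: minvalid (term 21, ts ...:1df) is ahead of the
      # last applied optime (term 20, ts ...:1cc), so the check fails.
      minvalid = OpTime(term=21, ts=0x1df)
      last_optime = OpTime(term=20, ts=0x1cc)
      print(can_attempt_rollback(minvalid, last_optime))  # False -> fassert 18750
      ```

      Because minValid is persisted on disk, the same comparison fails again after restart, which is why the second log excerpt below shows an identical abort.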
      
      

      Followed by this on restart:

      DAY::21:36.465-0500 I FTDC     [initandlisten] Initializing full-time diagnostic data capture with directory '/mongodb/data/wt/diagnostic.data'
      DAY::21:36.465-0500 I NETWORK  [HostnameCanonicalizationWorker] Starting hostname canonicalization worker
      DAY::21:36.467-0500 I NETWORK  [initandlisten] waiting for connections on port 27017
      DAY::21:36.470-0500 I REPL     [ReplicationExecutor] New replica set config in use: { _id: "OnspringCavern01", version: 8, protocolVersion: 1, members: [ { _id: 0, host: "mongo01.domain.us:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 3.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 1, host: "mongo02.domain.us:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 2.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 2, host: "mongo03.domain.us:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true, heartbeatIntervalMillis: 1000, heartbeatTimeoutSecs: 5, electionTimeoutMillis: 6000, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 }, replicaSetId: ObjectId('57857435bb2a73b423c7eb4c') } }
      DAY::21:36.470-0500 I REPL     [ReplicationExecutor] This node is mongo02.domain.us:27017 in the config
      DAY::21:36.470-0500 I REPL     [ReplicationExecutor] transition to STARTUP2
      DAY::21:36.470-0500 I REPL     [ReplicationExecutor] Starting replication applier threads
      DAY::21:36.470-0500 I REPL     [ReplicationExecutor] transition to RECOVERING
      DAY::21:36.471-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to mongo01.domain.us:27017
      DAY::21:36.472-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to mongo03.domain.us:27017
      DAY::21:36.476-0500 I REPL     [ReplicationExecutor] New replica set config in use: { _id: "OnspringCavern01", version: 9, protocolVersion: 1, members: [ { _id: 0, host: "mongo01.domain.us:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 3.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 1, host: "mongo02.domain.us:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 2.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 2, host: "mongo03.domain.us:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: false, heartbeatIntervalMillis: 1000, heartbeatTimeoutSecs: 5, electionTimeoutMillis: 6000, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 }, replicaSetId: ObjectId('57857435bb2a73b423c7eb4c') } }
      DAY::21:36.476-0500 I REPL     [ReplicationExecutor] This node is mongo02.domain.us:27017 in the config
      DAY::21:36.477-0500 I REPL     [ReplicationExecutor] Member mongo03.domain.us:27017 is now in state SECONDARY
      DAY::21:36.477-0500 I REPL     [ReplicationExecutor] Member mongo01.domain.us:27017 is now in state PRIMARY
      DAY::21:37.002-0500 I FTDC     [ftdc] Unclean full-time diagnostic data capture shutdown detected, found interim file, some metrics may have been lost. OK
      DAY::21:37.471-0500 I REPL     [ReplicationExecutor] syncing from primary: mongo01.domain.us:27017
      DAY::21:37.475-0500 I REPL     [SyncSourceFeedback] setting syncSourceFeedback to mongo01.domain.us:27017
      DAY::21:37.476-0500 I ASIO     [NetworkInterfaceASIO-BGSync-0] Successfully connected to mongo01.domain.us:27017
      DAY::21:37.477-0500 I REPL     [rsBackgroundSync] Starting rollback due to OplogStartMissing: our last op time fetched: (term: 20, timestamp: Jul 12 22:15:38:1cc). source's GTE: (term: 21, timestamp: Jul 12 22:15:38:1cd) hashes: (1087207337565601823/-6762428419615138830)
      DAY::21:37.477-0500 I -        [rsBackgroundSync] Fatal assertion 18750 UnrecoverableRollbackError: need to rollback, but in inconsistent state. minvalid: (term: 21, timestamp: Jul 12 22:15:39:1df) > our last optime: (term: 20, timestamp: Jul 12 22:15:38:1cc)
      DAY::21:37.477-0500 I -        [rsBackgroundSync]
       
      ***aborting after fassert() failure
      

        Attachments

        1. apiBasketsCheckoutEU-cgn-mongodb01.log.gz
          16 kB
        2. apiBasketsCheckoutEU-cgn-mongodb02.log.gz
          1.42 MB
        3. apiBasketsCheckoutEU-dfw-mongodb01.log.gz
          1.34 MB
        4. apiBasketsCheckoutEU-dfw-mongodb02.log.gz
          1.34 MB
        5. apiBasketsCheckoutEU-lej-mongodb-arb01.log.gz
          648 kB
        6. cgn.mongodb01.abcEU.crash-2016-09-15.log.gz
          613 kB
        7. cgn.mongodb02.abcEU.crash-2016-09-15.log.gz
          613 kB
        8. dfw.mongodb01.abcEU.crash-2016-09-15.log.gz
          13 kB
        9. dfw.mongodb02.abcEU.crash-2016-09-15.log.gz
          582 kB
        10. flexiApiCGNArbiter-lej-mongodb-arb01.log.gz
          679 kB
        11. flexiApiCGNShardSrv-cgn-mongodb01.log.gz
          2.44 MB
        12. flexiApiCGNShardSrv-cgn-mongodb02.log.gz
          50 kB
        13. flexiApiCGNShardSrv-dfw-mongodb01.log.gz
          1.50 MB
        14. flexiApiCGNShardSrv-dfw-mongodb02.log.gz
          1.50 MB
        15. flexi-cgn.mongodb01.flexiEU.crash-2016-09-15.log.gz
          1019 kB
        16. flexi-cgn.mongodb02.flexiEU.crash-2016-09-15.log.gz
          1023 kB
        17. flexi-dfw.mongodb01.flexiEU.crash-2016-09-15.log.gz
          919 kB
        18. flexi-dfw.mongodb02.flexiEU.crash-2016-09-15.log.gz
          72 kB
        19. flexi-lej.mongodb-arb01flexiEU.crash-2016-09-15.log.gz
          391 kB
        20. lej.mongodb-arb01.abcEU.crash-2016-09-15.log.gz
          290 kB
        21. mongodb.log.2016-10-12.to.2016-10-14.arbiter01.gz
          5 kB
        22. mongodb.log.2016-10-12.to.2016-10-14.mongodb01.cgn.gz
          101 kB
        23. mongodb.log.2016-10-12.to.2016-10-14.mongodb01.dfw.gz
          152 kB
        24. mongodb.log.2016-10-12.to.2016-10-14.mongodb02.cgn.gz
          101 kB
        25. mongodb.log.2016-10-12.to.2016-10-14.mongodb02.dfw.gz
          144 kB

      Issue Links

      Activity

      People

      • Votes: 0
      • Watchers: 14

      Dates

      • Created:
      • Updated:
      • Resolved: