Core Server / SERVER-31794

REPL oplog corrupts SECONDARY members.

    • Type: Bug
    • Resolution: Incomplete
    • Priority: Major - P3
    • Affects Version/s: 3.2.3
    • Component/s: Replication
    • Operating System: ALL
    • Steps to Reproduce:

      In a simple two-member replica set with a single arbiter: while a temp collection is being written to on the PRIMARY, force a shutdown of the SECONDARY. (A shell sketch of this setup follows below.)
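      A rough mongo shell sketch of the scenario, assuming three mongod processes are already running on the hosts shown in the logs below (the replica set name, write loop, and document shape are illustrative assumptions; the original report does not include a script):

      // Run once against server2.site:33000 to form the
      // two-data-member-plus-arbiter set described above.
      rs.initiate({
        _id: "rs0",  // assumed set name
        members: [
          { _id: 0, host: "server1.site:33000" },  // the SECONDARY that is killed
          { _id: 1, host: "server2.site:33000" },  // the PRIMARY
          { _id: 2, host: "arbiter.site:33000", arbiterOnly: true }
        ]
      });

      // Sustained write load against the PRIMARY. While this runs,
      // kill -9 the SECONDARY's mongod from the OS. (The report used a
      // temp collection; a plain collection stands in for it here.)
      var payload = new Array(129).join("x");
      for (var i = 0; i < 1000000; i++) {
        db.getSiblingDB("test").Data1.insert({ i: i, payload: payload });
      }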

      When the server came back up after the hard stop, it recovered, performed a rollback, and started to replicate the PRIMARY's oplog. As it was applying the log, a writer worker exception caused a Fatal Assertion. Even after wiping the SECONDARY, replication hit the same error continually. The only solution was to nuke the entire replica set and arbiter from orbit and rebuild.
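      The rollback described here is triggered when the restarted member's newest oplog entry is no longer present on its sync source (the OplogStartMissing line in the SECONDARY log below). One way to see the divergence is to run the following against the local database of each data-bearing member (a diagnostic sketch, not part of the original report):

      // Newest oplog entry on this member; comparing ts/t/h across the
      // two data-bearing members shows where their histories diverged
      // (these are the optimes and hashes printed in the rollback banner).
      var last = db.getSiblingDB("local").oplog.rs
                   .find({}, { ts: 1, t: 1, h: 1, op: 1, ns: 1 })
                   .sort({ $natural: -1 })
                   .limit(1)
                   .next();
      printjson(last);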

      PRIMARY:
      Metadata: { acquireCount: { w: 1 } }, oplog: { acquireCount: { W: 1 } } } protocol:op_query 155ms
      2017-03-25T16:29:36.208+0000 I NETWORK [conn64] end connection 10.196.36.193:20019 (49 connections now open)
      2017-03-25T16:29:36.209+0000 I NETWORK [initandlisten] connection accepted from 10.196.36.193:20029 #69 (50 connections now open)
      2017-03-25T16:29:36.211+0000 I NETWORK [conn69] end connection 10.196.36.193:20029 (49 connections now open)
      2017-03-25T16:29:36.407+0000 I NETWORK [conn55] end connection 10.196.36.193:20003 (48 connections now open)
      2017-03-25T16:29:36.408+0000 I NETWORK [conn62] end connection 10.196.36.193:20015 (48 connections now open)
      2017-03-25T16:29:36.917+0000 I ASIO [ReplicationExecutor] dropping unhealthy pooled connection to server1.site:33000
      2017-03-25T16:29:36.918+0000 I ASIO [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
      2017-03-25T16:29:44.935+0000 I REPL [ReplicationExecutor] Error in heartbeat request to server1.site:33000; HostUnreachable: Connection refused
      ....
      2017-03-25T16:29:45.659+0000 I COMMAND [conn13] command admin.$cmd command: isMaster { ismaster: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:397 locks:{} protocol:op_query 521ms
      ...
      SECONDARY:
      2017-03-25T16:29:28.219+0000 I REPL [ReplicationExecutor] This node is server1.site:33000 in the config
      2017-03-25T16:29:28.219+0000 I REPL [ReplicationExecutor] transition to STARTUP2
      2017-03-25T16:29:28.219+0000 I REPL [ReplicationExecutor] Starting replication applier threads
      2017-03-25T16:29:28.221+0000 I REPL [ReplicationExecutor] transition to RECOVERING
      2017-03-25T16:29:28.224+0000 I ASIO [NetworkInterfaceASIO-Replication-0] Successfully connected to arbiter.site:33000
      2017-03-25T16:29:28.224+0000 I ASIO [NetworkInterfaceASIO-Replication-0] Successfully connected to server2.site:33000
      2017-03-25T16:29:28.225+0000 I REPL [ReplicationExecutor] Member arbiter.site:33000 is now in state ARBITER
      2017-03-25T16:29:28.225+0000 I REPL [ReplicationExecutor] Member server2.site:33000 is now in state PRIMARY
      2017-03-25T16:29:28.227+0000 I REPL [ReplicationExecutor] transition to SECONDARY
      ...
      2017-03-25T16:29:34.223+0000 I REPL [ReplicationExecutor] syncing from: server2.site:33000
      2017-03-25T16:29:34.227+0000 I REPL [SyncSourceFeedback] setting syncSourceFeedback to server2.site:33000
      2017-03-25T16:29:34.231+0000 I ASIO [NetworkInterfaceASIO-BGSync-0] Successfully connected to server2.site:33000
      2017-03-25T16:29:34.238+0000 I REPL [rsBackgroundSync] Starting rollback due to OplogStartMissing: our last op time fetched: (term: 4, timestamp: Mar 25 16:26:13:21). source's GTE: (term: 5, timestamp: Mar 25 16:29:25:1) hashes: (-7718240682381005025/-2391617886558021771)
      2017-03-25T16:29:34.238+0000 I REPL [rsBackgroundSync] beginning rollback
      2017-03-25T16:29:34.238+0000 I REPL [rsBackgroundSync] rollback 0
      2017-03-25T16:29:34.238+0000 I REPL [ReplicationExecutor] transition to ROLLBACK
      2017-03-25T16:29:34.239+0000 I NETWORK [conn4] end connection 10.196.36.203:20550 (5 connections now open)
      2017-03-25T16:29:34.239+0000 I REPL [rsBackgroundSync] rollback 1
      ..
      2017-03-25T16:29:34.241+0000 I REPL [rsBackgroundSync] rollback 2 FindCommonPoint
      2017-03-25T16:29:34.242+0000 I REPL [rsBackgroundSync] rollback our last optime: Mar 25 16:26:13:21
      2017-03-25T16:29:34.242+0000 I REPL [rsBackgroundSync] rollback their last optime: Mar 25 16:29:34:22
      2017-03-25T16:29:34.242+0000 I REPL [rsBackgroundSync] rollback diff in end of log times: -201 seconds
      2017-03-25T16:29:34.747+0000 I REPL [rsBackgroundSync] rollback 3 fixup
      2017-03-25T16:29:34.787+0000 I REPL [rsBackgroundSync] rollback 3.5
      2017-03-25T16:29:34.891+0000 I NETWORK [initandlisten] connection accepted from 10.196.36.195:4166 #12 (1 connection now open)
      2017-03-25T16:29:34.915+0000 I REPL [rsBackgroundSync] rollback 4 n:1
      2017-03-25T16:29:34.915+0000 I REPL [rsBackgroundSync] minvalid=(term: 5, timestamp: Mar 25 16:29:34:33)
      2017-03-25T16:29:34.916+0000 I REPL [rsBackgroundSync] rollback 4.1.1 coll resync ConfigData
      2017-03-25T16:29:34.945+0000 I INDEX [rsBackgroundSync] build index on: ConfigData properties: { v: 1, key: { _id: 1 }, name: "id", ns: "ConfigData" }
      2017-03-25T16:29:34.945+0000 I INDEX [rsBackgroundSync] building index using bulk method
      2017-03-25T16:29:34.946+0000 I INDEX [rsBackgroundSync] build index done. scanned 409 total records. 0 sec
      ....
      2017-03-25T16:29:36.155+0000 I INDEX [rsBackgroundSync] build index done. scanned 769 total records. 0 secs
      2017-03-25T16:29:36.155+0000 I REPL [rsBackgroundSync] rollback 4.2
      2017-03-25T16:29:36.156+0000 I REPL [rsBackgroundSync] minvalid=(term: 5, timestamp: Mar 25 16:29:36:1)
      2017-03-25T16:29:36.156+0000 I REPL [rsBackgroundSync] rollback 4.3
      2017-03-25T16:29:36.156+0000 I REPL [rsBackgroundSync] rollback 4.6
      2017-03-25T16:29:36.156+0000 I REPL [rsBackgroundSync] rollback drop: Data1
      2017-03-25T16:29:36.157+0000 I REPL [rsBackgroundSync] rollback drop: Data2
      2017-03-25T16:29:36.165+0000 I REPL [rsBackgroundSync] rollback drop: Data3
      2017-03-25T16:29:36.179+0000 I REPL [rsBackgroundSync] rollback 4.7
      2017-03-25T16:29:36.197+0000 I REPL [rsBackgroundSync] rollback 5 d:4 u:71
      2017-03-25T16:29:36.197+0000 I REPL [rsBackgroundSync] rollback 6
      2017-03-25T16:29:36.197+0000 I REPL [rsBackgroundSync] rollback done
      2017-03-25T16:29:36.197+0000 I REPL [rsBackgroundSync] rollback finished
      2017-03-25T16:29:36.197+0000 I NETWORK [SyncSourceFeedback] Socket say send() errno:9 Bad file descriptor 10.196.36.195:33000
      2017-03-25T16:29:36.207+0000 I REPL [ReplicationExecutor] transition to RECOVERING
      2017-03-25T16:29:36.208+0000 I REPL [ReplicationExecutor] syncing from: server2.site:33000
      2017-03-25T16:29:36.356+0000 I REPL [SyncSourceFeedback] SyncSourceFeedback error sending update: socket exception [SEND_ERROR] for 10.196.36.195:33000
      2017-03-25T16:29:36.356+0000 I REPL [SyncSourceFeedback] updateUpstream failed: Location9001: socket exception [SEND_ERROR] for 10.196.36.195:33000, will retry
      2017-03-25T16:29:36.365+0000 F REPL [repl writer worker 2] writer worker caught exception: :: caused by :: 26 Failed to apply insert due to missing collection: { ts: Timestamp 1490459371000|82, t: 5, h: 7141407996441188723, v: 2, op: "i", <collection data here> }

      2017-03-25T16:29:36.365+0000 I - [repl writer worker 2] Fatal Assertion 16360
      2017-03-25T16:29:36.365+0000 I - [repl writer worker 2]
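      The minvalid=(term, timestamp) lines above record the oplog point the member must replay to before it can leave RECOVERING; in 3.2 this lives in local.replset.minvalid. Inspecting it on the looping SECONDARY shows the target the failing apply batch was trying to reach (again a diagnostic sketch, not from the original report):

      // On the stuck SECONDARY: the oplog point that must be applied
      // before the node can become consistent again.
      printjson(db.getSiblingDB("local").replset.minvalid.findOne());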

            Assignee: Kelsey Schubert (kelsey.schubert@mongodb.com)
            Reporter: Andy Wallis (wallisr)
            Votes: 0
            Watchers: 7
