Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-23003

Recovery problems after network partition.

    • Fully Compatible
    • ALL
    • Hide

      Setup:

      • Replica set with 5 nodes (4 data node and 1 arbiter) in 3 datacenter.
      • DC1: primary, secondary
      • DC2: secondary, secondary
      • DC3: arbiter

      Disable network between DC1 and DC2.
      The primary is switching back and forth between DC1 and DC2.
      Enable network between DC1 and DC2.
      Some of the nodes crashes with the error message shown above.

      Show
      Setup: Replica set with 5 nodes (4 data node and 1 arbiter) in 3 datacenter. DC1: primary, secondary DC2: secondary, secondary DC3: arbiter Disable network between DC1 and DC2. The primary is switching back and forth between DC1 and DC2. Enable network between DC1 and DC2. Some of the nodes crashes with the error message shown above.
    • Repl 11 (03/11/16), Repl 12 (04/01/16)

      2016-03-07T09:09:29.127-0500 I REPL     [rsBackgroundSync] rollback 6
      2016-03-07T09:09:29.127-0500 I COMMAND  [rsBackgroundSync] query admin.system.roles planSummary: EOF ntoreturn:0 ntoskip:0 keysExamined:0 docsExamined:0 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:0 nreturned:0 reslen:20 locks:{ Global: { acquireCount: { r: 11, w: 2, W: 1 } }, MMAPV1Journal: { acquireCount: { r: 3, w: 5 } }, Database: { acquireCount: { r: 4, w: 1, W: 1 } }, Collection: { acquireCount: { R: 3 } }, oplog: { acquireCount: { R: 1, W: 1 } } } 0ms
      2016-03-07T09:09:29.128-0500 I REPL     [rsBackgroundSync] rollback done
      2016-03-07T09:09:29.128-0500 I NETWORK  [SyncSourceFeedback] Socket say send() errno:9 Bad file descriptor 192.168.200.4:27017
      2016-03-07T09:09:29.128-0500 I REPL     [SyncSourceFeedback] SyncSourceFeedback error sending update: socket exception [SEND_ERROR] for 192.168.200.4:27017
      2016-03-07T09:09:29.128-0500 I REPL     [ReplicationExecutor] transition to RECOVERING
      2016-03-07T09:09:29.128-0500 I REPL     [rsBackgroundSync] rollback finished
      2016-03-07T09:09:29.128-0500 I REPL     [SyncSourceFeedback] updateUpstream failed: Location9001: socket exception [SEND_ERROR] for 192.168.200.4:27017, will retry
      2016-03-07T09:09:29.128-0500 I REPL     [ReplicationExecutor] syncing from: 192.168.200.2:27017
      2016-03-07T09:09:29.129-0500 D COMMAND  [ConnectBG] BackgroundJob starting: ConnectBG
      2016-03-07T09:09:29.131-0500 I REPL     [SyncSourceFeedback] setting syncSourceFeedback to 192.168.200.2:27017
      2016-03-07T09:09:29.132-0500 D COMMAND  [ConnectBG] BackgroundJob starting: ConnectBG
      2016-03-07T09:09:29.133-0500 I ASIO     [NetworkInterfaceASIO-BGSync-0] Successfully connected to 192.168.200.2:27017
      2016-03-07T09:09:29.135-0500 I -        [rsSync] Fatal assertion 34361 OplogOutOfOrder: Attempted to apply an earlier oplog entry (ts: Mar  7 09:09:00:2) when our lastWrittenOptime was (term: 3, timestamp: Mar  7 09:09:11:2)
      2016-03-07T09:09:29.143-0500 I CONTROL  [rsSync] 
       0x12f14b2 0x128f788 0x127c722 0xf2a364 0xf1ff80 0x7fd3fcdcca40 0x7fd3fc5e9182 0x7fd3fc31647d
      ----- BEGIN BACKTRACE -----
      {"backtrace":[{"b":"400000","o":"EF14B2","s":"_ZN5mongo15printStackTraceERSo"},{"b":"400000","o":"E8F788","s":"_ZN5mongo10logContextEPKc"},{"b":"400000","o":"E7C722","s":"_ZN5mongo23fassertFailedWithStatusEiRKNS_6StatusE"},{"b":"400000","o":"B2A364","s":"_ZN5mongo4repl8SyncTail16oplogApplicationEv"},{"b":"400000","o":"B1FF80","s":"_ZN5mongo4repl13runSyncThreadEv"},{"b":"7FD3FCD1B000","o":"B1A40"},{"b":"7FD3FC5E1000","o":"8182"},{"b":"7FD3FC21C000","o":"FA47D","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.2.3", "gitVersion" : "b326ba837cf6f49d65c2f85e1b70f6f31ece7937", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "3.19.0-25-generic", "version" : "#26~14.04.1-Ubuntu SMP Fri Jul 24 21:16:20 UTC 2015", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "C1CD0F405485844DA016C6B5275C8BEF3D68DB7A" }, { "b" : "7FFE8D1C5000", "elfType" : 3, "buildId" : "01005E92EA3C0C1527396B65C7412468298D0813" }, { "b" : "7FD3FD806000", "path" : "/lib/x86_64-linux-gnu/libssl.so.1.0.0", "elfType" : 3, "buildId" : "D08DD65F97859C71BB2CBBF1043BD968EFE18AAD" }, { "b" : "7FD3FD42B000", "path" : "/lib/x86_64-linux-gnu/libcrypto.so.1.0.0", "elfType" : 3, "buildId" : "F86FA9FB4ECEB4E06B40DBDF761A4172B70A4229" }, { "b" : "7FD3FD223000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "92FCF41EFE012D6186E31A59AD05BDBB487769AB" }, { "b" : "7FD3FD01F000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "C1AE4CB7195D337A77A3C689051DABAA3980CA0C" }, { "b" : "7FD3FCD1B000", "path" : "/usr/lib/x86_64-linux-gnu/libstdc++.so.6", "elfType" : 3, "buildId" : "4BF6F7ADD8244AD86008E6BF40D90F8873892197" }, { "b" : "7FD3FCA15000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3, "buildId" : "1D76B71E905CB867B27CEF230FCB20F01A3178F5" }, { "b" : "7FD3FC7FF000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3, "buildId" : "8D0AA71411580EE6C08809695C3984769F25725B" }, { "b" : "7FD3FC5E1000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3, "buildId" : "9318E8AF0BFBE444731BB0461202EF57F7C39542" }, { "b" : "7FD3FC21C000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3, "buildId" : "30C94DC66A1FE95180C3D68D2B89E576D5AE213C" }, { "b" : "7FD3FDA65000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "9F00581AB3C73E3AEA35995A0C50D24D59A01D47" } ] }}
       mongod(_ZN5mongo15printStackTraceERSo+0x32) [0x12f14b2]
       mongod(_ZN5mongo10logContextEPKc+0x138) [0x128f788]
       mongod(_ZN5mongo23fassertFailedWithStatusEiRKNS_6StatusE+0x62) [0x127c722]
       mongod(_ZN5mongo4repl8SyncTail16oplogApplicationEv+0x1564) [0xf2a364]
       mongod(_ZN5mongo4repl13runSyncThreadEv+0x230) [0xf1ff80]
       libstdc++.so.6(+0xB1A40) [0x7fd3fcdcca40]
       libpthread.so.0(+0x8182) [0x7fd3fc5e9182]
       libc.so.6(clone+0x6D) [0x7fd3fc31647d]
      -----  END BACKTRACE  -----
      2016-03-07T09:09:29.143-0500 I -        [rsSync] 
      
      ***aborting after fassert() failure
      

        1. rs.conf
          2 kB
        2. dc3-node1.log
          185 kB
        3. dc2-node2.log
          342 kB
        4. dc2-node1.log
          351 kB
        5. dc1-node2.log
          434 kB
        6. dc1-node1.log
          728 kB

            Assignee:
            scotthernandez Scott Hernandez (Inactive)
            Reporter:
            robert.hegedus Róbert Heged?s
            Votes:
            0 Vote for this issue
            Watchers:
            9 Start watching this issue

              Created:
              Updated:
              Resolved: