Core Server / SERVER-9682

Fatal assertion crash on replica during initial sync

    • Type: Bug
    • Resolution: Incomplete
    • Priority: Major - P3
    • Fix Version/s: None
    • Affects Version/s: 2.4.3
    • Component/s: Replication
    • Environment:
      Primary: Ubuntu 10.04
      Secondary: Ubuntu 12.04
      AWS
    • Operating System: Linux

      We have been trying to upgrade our QA environment from 2.2.2 to 2.4.3. Historically, our upgrades have gone very smoothly, but this one is giving us real trouble. We first disabled replication on QA and upgraded only master to 2.4.3 to test it. Once satisfied, we upgraded the two replicas and let them catch up, but they kept failing in a way very similar to SERVER-6975. So we simplified and attempted a full resync instead. What we observe now is that the replicas lose connectivity with master several times during the resync and eventually crash in the same manner as reported in SERVER-9057 and SERVER-9199:

       
      Mon May 13 19:07:45.614 [rsHealthPoll] replSet member ip-10-71-26-90.ec2.internal:27017 is now in state SECONDARY
      Mon May 13 19:07:45.700 [initandlisten] connection accepted from 10.71.26.90:50732 #2175 (28 connections now open)
      Mon May 13 19:07:45.703 [conn2175]  authenticate db: local { authenticate: 1, nonce: "779ae9aba7a706e4", user: "__system", key: "e7f4453059498c244523f2794a04c216" }
      Mon May 13 19:07:45.705 [conn2175] replSet info voting yea for ip-10-71-26-90.ec2.internal:27017 (0)
      Mon May 13 19:07:47.617 [rsHealthPoll] replSet member ip-10-71-26-90.ec2.internal:27017 is now in state PRIMARY
      Mon May 13 19:07:59.673 [initandlisten] connection accepted from 10.71.26.90:50738 #2176 (29 connections now open)
      Mon May 13 19:07:59.677 [conn2176]  authenticate db: admin { authenticate: 1, user: "mongo", nonce: "8f94b9003dc3c980", key: "ad59ed669b28d961e6db34033c785980" }
      Mon May 13 19:07:59.691 [conn2176] end connection 10.71.26.90:50738 (28 connections now open)
      Mon May 13 19:08:13.751 [conn2175] end connection 10.71.26.90:50732 (27 connections now open)
      Mon May 13 19:08:13.753 [initandlisten] connection accepted from 10.71.26.90:50749 #2177 (28 connections now open)
      Mon May 13 19:08:13.755 [conn2177]  authenticate db: local { authenticate: 1, nonce: "f4fb648338aebbc9", user: "__system", key: "77d24c9ffe7f0e3fe2749d1c042562d0" }
      Mon May 13 19:08:15.606 [rsSync]   Fatal Assertion 16233
      0xdcf361 0xd8f0d3 0xc03b0f 0xc21811 0xc218ad 0xc21b7c 0xe17cb9 0x7f313696ce9a 0x7f3135c7fcbd
       /home/mongo/mongodb/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdcf361]
       /home/mongo/mongodb/bin/mongod(_ZN5mongo13fassertFailedEi+0xa3) [0xd8f0d3]
       /home/mongo/mongodb/bin/mongod(_ZN5mongo11ReplSetImpl17syncDoInitialSyncEv+0x6f) [0xc03b0f]
       /home/mongo/mongodb/bin/mongod(_ZN5mongo11ReplSetImpl11_syncThreadEv+0x71) [0xc21811]
       /home/mongo/mongodb/bin/mongod(_ZN5mongo11ReplSetImpl10syncThreadEv+0x2d) [0xc218ad]
       /home/mongo/mongodb/bin/mongod(_ZN5mongo15startSyncThreadEv+0x6c) [0xc21b7c]
       /home/mongo/mongodb/bin/mongod() [0xe17cb9]
       /lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7f313696ce9a]
       /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f3135c7fcbd]
      Mon May 13 19:08:15.667 [rsSync]
      
      ***aborting after fassert() failure
      
      
      Mon May 13 19:08:15.667 Got signal: 6 (Aborted).
      
      Mon May 13 19:08:15.671 Backtrace:
      0xdcf361 0x6cf729 0x7f3135bc24a0 0x7f3135bc2425 0x7f3135bc5b8b 0xd8f10e 0xc03b0f 0xc21811 0xc218ad 0xc21b7c 0xe17cb9 0x7f313696ce9a 0x7f3135c7fcbd
       /home/mongo/mongodb/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdcf361]
       /home/mongo/mongodb/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0x6cf729]
       /lib/x86_64-linux-gnu/libc.so.6(+0x364a0) [0x7f3135bc24a0]
       /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35) [0x7f3135bc2425]
       /lib/x86_64-linux-gnu/libc.so.6(abort+0x17b) [0x7f3135bc5b8b]
       /home/mongo/mongodb/bin/mongod(_ZN5mongo13fassertFailedEi+0xde) [0xd8f10e]
       /home/mongo/mongodb/bin/mongod(_ZN5mongo11ReplSetImpl17syncDoInitialSyncEv+0x6f) [0xc03b0f]
       /home/mongo/mongodb/bin/mongod(_ZN5mongo11ReplSetImpl11_syncThreadEv+0x71) [0xc21811]
       /home/mongo/mongodb/bin/mongod(_ZN5mongo11ReplSetImpl10syncThreadEv+0x2d) [0xc218ad]
       /home/mongo/mongodb/bin/mongod(_ZN5mongo15startSyncThreadEv+0x6c) [0xc21b7c]
       /home/mongo/mongodb/bin/mongod() [0xe17cb9]
       /lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7f313696ce9a]
       /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f3135c7fcbd]
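
      To make the backtrace above easier to read, here is a minimal Python sketch that pulls the mangled C++ symbols out of the frames and decodes their names. It is not a full demangler: it only handles the simple length-prefixed nested names seen in this trace and drops the argument types. The helper names `decode_nested_name` and `decode_frames` are made up for illustration.

```python
import re

# Matches the mangled symbol inside a mongod backtrace frame, e.g.
# ".../mongod(_ZN5mongo13fassertFailedEi+0xa3) [0xd8f0d3]"
FRAME_RE = re.compile(r"\((_ZN\w+)\+0x[0-9a-f]+\)")


def decode_nested_name(symbol):
    """Decode the length-prefixed parts of a '_ZN...E' nested name.

    Simplified on purpose: templates, substitutions and argument
    types are not handled.
    """
    if not symbol.startswith("_ZN"):
        return symbol
    pos, parts = 3, []
    while pos < len(symbol) and symbol[pos].isdigit():
        end = pos
        while end < len(symbol) and symbol[end].isdigit():
            end += 1
        length = int(symbol[pos:end])
        parts.append(symbol[end:end + length])
        pos = end + length
    return "::".join(parts)


def decode_frames(log_text):
    """Return decoded names for every frame that carries a mangled symbol."""
    return [decode_nested_name(m.group(1)) for m in FRAME_RE.finditer(log_text)]


if __name__ == "__main__":
    sample = (
        " /home/mongo/mongodb/bin/mongod(_ZN5mongo13fassertFailedEi+0xa3) [0xd8f0d3]\n"
        " /home/mongo/mongodb/bin/mongod(_ZN5mongo11ReplSetImpl17syncDoInitialSyncEv+0x6f) [0xc03b0f]\n"
    )
    for name in decode_frames(sample):
        print(name)
```

      Run against the frames above, this shows the fassert being raised from `mongo::ReplSetImpl::syncDoInitialSync` on the `rsSync` thread.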
      

      We have even simplified the setup to a single replica resyncing from master to reduce the load on the network, with the same result. I'll attach the full logs from both master and the crashing secondary shortly. Following the suggestions in the other tickets, I also ran tcpdump during the tests to capture any possible network issues. Those captures are huge, so I'm truncating them to the time of the crash so that I can upload them.
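
      For the text logs, trimming to the time of the crash could look roughly like the Python sketch below. It assumes the mongod 2.4 timestamp layout seen in the excerpt above (weekday, month, day, time with milliseconds, no year), so the year has to be supplied by the caller; 2013 in the usage example is an assumption. Lines without a timestamp, such as stack frames, follow the last timestamped line. The function names are hypothetical, and this applies only to text logs, not to binary tcpdump capture files.

```python
from datetime import datetime, timedelta


def parse_stamp(line, year):
    """Return the timestamp at the start of a log line, or None."""
    tokens = line.split()
    if len(tokens) < 4:
        return None
    try:
        # Skip the weekday token; the year is supplied by the caller
        # because the log lines do not carry one.
        return datetime.strptime(
            f"{year} {tokens[1]} {tokens[2]} {tokens[3]}",
            "%Y %b %d %H:%M:%S.%f",
        )
    except ValueError:
        return None


def trim_to_window(lines, crash_time, before, after, year):
    """Keep lines whose timestamp lies in [crash - before, crash + after].

    Continuation lines (no timestamp) are kept or dropped together
    with the most recent timestamped line.
    """
    keep = False
    kept = []
    for line in lines:
        stamp = parse_stamp(line, year)
        if stamp is not None:
            keep = crash_time - before <= stamp <= crash_time + after
        if keep:
            kept.append(line)
    return kept


if __name__ == "__main__":
    sample = [
        "Mon May 13 19:07:45.614 [rsHealthPoll] replSet member is now in state SECONDARY",
        "Mon May 13 19:08:15.606 [rsSync]   Fatal Assertion 16233",
        " /home/mongo/mongodb/bin/mongod(_ZN5mongo13fassertFailedEi+0xa3) [0xd8f0d3]",
    ]
    crash = datetime(2013, 5, 13, 19, 8, 15, 606000)  # year is assumed
    for line in trim_to_window(sample, crash, timedelta(seconds=5), timedelta(seconds=5), 2013):
        print(line)
```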

        1. mongodb.logs.tar.gz (1.37 MB)
        2. tcpdump.secondary.tar.gz (24.89 MB)

            Assignee: Daniel Pasette (Inactive), dan@mongodb.com
            Reporter: George P. Stathis (gstathis)
            Votes: 0
            Watchers: 6
