Core Server / SERVER-7841

Segmentation faults and lost replicaset seed on rs.remove()

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major - P3
    • None
    • Affects Version/s: 2.0.3
    • Component/s: Replication, Sharding, Stability
    • Environment:
      Ubuntu 12.04.1 LTS on a hi1.4xlarge AWS instance
    • Linux

      The description field is in narrative format with stack traces. The cluster is 2 shards, each a 3-machine replica set with one node as a 1-hour delayed slave, holding about 300 GB of actual data in total.

      We were taking an old MongoDB server out of a replica set (one of the two shards) so that it could be replaced with a newer build of the system. The machine being taken out was mongo2-1, and it was the primary.

      I first stepped it down from Primary:
      rs.stepDown();

      mongo2-4 was elected the new primary. That went very smoothly, and mongo2-4 was serving requests with little or no effect on the site from the transition.

      I let that run for a minute, and since everything looked fine, I did
      rs.remove("mongo2-1.prod.trello.local:27017") on mongo2-4.
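
The two steps above (step down, then remove) can be guarded with a small pre-check. The following is only a sketch: `checkRemoval` is a hypothetical helper, not part of the mongo shell, and it assumes its `status` argument has the shape of `rs.status()` output (a `set` name and a `members` array whose entries carry `name` and `stateStr`). In the sequence described here these checks would have passed, so this is a sanity check and not a workaround for the crash.

```javascript
// Hypothetical helper, not part of the mongo shell.
// `status` is assumed to look like rs.status() output:
//   { set: "rs2", members: [{ name: "host:port", stateStr: "PRIMARY" }, ...] }
function checkRemoval(status, host) {
  const members = status.members || [];

  // The host must currently be a member of the set.
  const target = members.find((m) => m.name === host);
  if (!target) {
    return { ok: false, reason: "host is not a member of " + status.set };
  }

  // A primary has to be stepped down before it is removed.
  if (target.stateStr === "PRIMARY") {
    return { ok: false, reason: "host is still PRIMARY; step it down first" };
  }

  // Some other member must already have been elected primary.
  const primary = members.find((m) => m.stateStr === "PRIMARY");
  if (!primary) {
    return { ok: false, reason: "no PRIMARY elected yet" };
  }

  return { ok: true, primary: primary.name };
}
```

In the shell this might be called as `checkRemoval(rs.status(), "mongo2-1.prod.trello.local:27017")` on the new primary before running `rs.remove()`.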

      The new primary (mongo2-4) logged some very strange replica set behavior:
      Mon Dec 3 21:56:59 [conn219311] starting new replica set monitor for replica set rs2 with seed of mongo2-1.prod.trello.local:27017
      Mon Dec 3 21:56:59 [conn219311] successfully connected to seed mongo2-1.prod.trello.local:27017 for replica set rs2

      Note that the seed there is the replica set member I had just removed.
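
One way to spot this kind of stale seed is to compare a shard's seed list with the current replica set members. The sketch below is illustrative only: `parseShardHost` and `staleSeeds` are hypothetical helpers, and the input is assumed to be a shard host string of the form `setName/host1:port,host2:port`, as stored in the `host` field of the `shards` collection in the config database. The example string in the usage note is made up from the hostnames in this report, not copied from the actual cluster.

```javascript
// Hypothetical helpers, not part of MongoDB.
// Assumes a shard host string of the form "setName/host1:port,host2:port".
function parseShardHost(hostString) {
  const slash = hostString.indexOf("/");
  if (slash === -1) {
    // A standalone (non replica set) shard has no "setName/" prefix.
    return { setName: null, seeds: [hostString] };
  }
  return {
    setName: hostString.slice(0, slash),
    seeds: hostString
      .slice(slash + 1)
      .split(",")
      .filter((s) => s.length > 0),
  };
}

// Returns the seeds that are no longer members of the replica set.
function staleSeeds(hostString, currentMembers) {
  const current = new Set(currentMembers);
  return parseShardHost(hostString).seeds.filter((s) => !current.has(s));
}
```

For example, `staleSeeds("rs2/mongo2-1.prod.trello.local:27017,mongo2-4.prod.trello.local:27017", ["mongo2-4.prod.trello.local:27017"])` returns an array containing only `"mongo2-1.prod.trello.local:27017"`, the member that was removed.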

      Then MongoDB on mongo2-4 seg faulted:

      Mon Dec 3 21:57:00 Invalid access at address: 0xffffffffffffffe0

      Mon Dec 3 21:57:00 Got signal: 11 (Segmentation fault).

      Mon Dec 3 21:57:00 Backtrace:
      0xa90d79 0xa91350 0x7f588eb5acb0 0x5d3f2d 0x5d9d01 0x5da4b0 0xa20db3 0xa2163b 0xa27c45 0xa2837e 0xa23b42 0xa6b06e 0x97cc14 0x97e20f 0x940e25 0x9441b1 0x8869d7 0x88df49 0xaa37d6 0x637497
      /usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0xa90d79]
      /usr/bin/mongod(_ZN5mongo24abruptQuitWithAddrSignalEiP7siginfoPv+0x220) [0xa91350]
      /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0) [0x7f588eb5acb0]
      /usr/bin/mongod(_ZN5mongo17ReplicaSetMonitor16_checkConnectionEPNS_18DBClientConnectionERSsbi+0x11ad) [0x5d3f2d]
      /usr/bin/mongod(_ZN5mongo17ReplicaSetMonitorC1ERKSsRKSt6vectorINS_11HostAndPortESaIS4_EE+0x441) [0x5d9d01]
      /usr/bin/mongod(_ZN5mongo17ReplicaSetMonitor3getERKSsRKSt6vectorINS_11HostAndPortESaIS4_EE+0x200) [0x5da4b0]
      /usr/bin/mongod(_ZN5mongo5Shard7_rsInitEv+0x133) [0xa20db3]
      /usr/bin/mongod(_ZN5mongo5Shard8_setAddrERKSs+0x13b) [0xa2163b]
      /usr/bin/mongod(_ZN5mongo15StaticShardInfo6reloadEv+0xa65) [0xa27c45]
      /usr/bin/mongod(_ZN5mongo15StaticShardInfo4findERKSs+0x14e) [0xa2837e]
      /usr/bin/mongod(_ZN5mongo5Shard5resetERKSs+0x42) [0xa23b42]
      /usr/bin/mongod(_ZN5mongo17SplitChunkCommand3runERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0xbee) [0xa6b06e]
      /usr/bin/mongod(_ZN5mongo11execCommandEPNS_7CommandERNS_6ClientEiPKcRNS_7BSONObjERNS_14BSONObjBuilderEb+0x6a4) [0x97cc14]
      /usr/bin/mongod(_ZN5mongo12_runCommandsEPKcRNS_7BSONObjERNS_11_BufBuilderINS_16TrivialAllocatorEEERNS_14BSONObjBuilderEbi+0x6ff) [0x97e20f]
      /usr/bin/mongod(_ZN5mongo11runCommandsEPKcRNS_7BSONObjERNS_5CurOpERNS_11_BufBuilderINS_16TrivialAllocatorEEERNS_14BSONObjBuilderEbi+0x35) [0x940e25]
      /usr/bin/mongod(_ZN5mongo8runQueryERNS_7MessageERNS_12QueryMessageERNS_5CurOpES1_+0x11e1) [0x9441b1]
      /usr/bin/mongod() [0x8869d7]
      /usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x559) [0x88df49]
      /usr/bin/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x76) [0xaa37d6]
      /usr/bin/mongod(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x287) [0x637497]

      After it restarted, we were consistently getting socket exceptions in Mongo clients connecting through mongos, and the mongos logs were showing:

      Mon Dec 3 21:57:36 [conn18303] SyncClusterConnection connecting to [mongo-config-1.prod.trello.local:27019]
      Mon Dec 3 21:57:36 [conn18303] SyncClusterConnection connecting to [mongo-config-2.prod.trello.local:27019]
      Mon Dec 3 21:57:36 [conn18303] SyncClusterConnection connecting to [mongo-config-3.prod.trello.local:27019]
      Mon Dec 3 21:57:37 [conn18339] Assertion: 13129:can't find shard for: mongo2-5.prod.trello.local:27017
      0x5350c2 0x75caba 0x757fd2 0x7f5297 0x5c3526 0x5c1707 0x5ef3ee 0x76fbd9 0x7b60e7 0x7c8691 0x5e8127 0x7f270a3dbe9a 0x7f27098f6cbd
      /usr/bin/mongos(_ZN5mongo11msgassertedEiPKc+0x112) [0x5350c2]
      /usr/bin/mongos(_ZN5mongo15StaticShardInfo4findERKSs+0x3aa) [0x75caba]
      /usr/bin/mongos(_ZN5mongo5Shard5resetERKSs+0x42) [0x757fd2]
      /usr/bin/mongos() [0x7f5297]
      /usr/bin/mongos(_ZN5boost6detail8function17function_invoker4IPFbRN5mongo12DBClientBaseERKSsbiEbS5_S7_biE6invokeERNS1_15function_bufferES5_S7_bi+0x16) [0x5c3526]
      /usr/bin/mongos(_ZN5mongo15ShardConnection11_finishInitEv+0x137) [0x5c1707]
      /usr/bin/mongos(_ZN5mongo27ParallelSortClusteredCursor5_initEv+0x5be) [0x5ef3ee]
      /usr/bin/mongos(_ZN5mongo13ShardStrategy7queryOpERNS_7RequestE+0xc59) [0x76fbd9]
      /usr/bin/mongos(_ZN5mongo7Request7processEi+0x187) [0x7b60e7]
      /usr/bin/mongos(_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x71) [0x7c8691]
      /usr/bin/mongos(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x287) [0x5e8127]
      /lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7f270a3dbe9a]
      /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f27098f6cbd]

      Finally, we added mongo2-1 back into the replica set, restarted all mongos and client processes, and things went back to normal. An hour later, we were able to remove mongo2-1 from the replica set and nothing exploded. All told, it was about a 15-minute site outage.

            Assignee:
            randolph@mongodb.com Randolph Tan
            Reporter:
            brettkiefer Brett Kiefer
            Votes:
            0
            Watchers:
            4
