SERVER-7093 (Core Server)

Mongos crashed because of "got not master" with signal 11

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker - P1
    • Resolution: Duplicate
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: Stability
    • Labels:
    • Environment:
      CentOS 6.3; sharded cluster with 12 replica-set shards (3 members per shard), 3 config servers, 5 mongos
    • Operating System:
      Linux

      Description

      mongos crashes repeatedly, and all 5 mongos instances crashed at almost the same time. The crash appears to be triggered by the following sequence: mongos logs "got not master for: 192.168.99.1", then "DBClientCursor::init call() failed", and then receives signal 11 (SIGSEGV).

      The version is 2.2.0.

      This bug is similar to SERVER-6539: https://jira.mongodb.org/browse/SERVER-6539

      Backtraces below:

      — 1 —
      Thu Sep 13 13:57:24 [ReplicaSetMonitorWatcher] Primary for replica set shard01 changed to SNode_S023:2012
      Thu Sep 13 13:57:34 [ReplicaSetMonitorWatcher] Primary for replica set shard01 changed to SNode_S021:2012
      Thu Sep 13 13:57:34 [ReplicaSetMonitorWatcher] Primary for replica set shard01 changed to SNode_S023:2012
      Thu Sep 13 13:57:44 [ReplicaSetMonitorWatcher] Primary for replica set shard01 changed to SNode_S021:2012
      Thu Sep 13 13:57:44 [ReplicaSetMonitorWatcher] Primary for replica set shard01 changed to SNode_S023:2012
      Thu Sep 13 13:57:54 [ReplicaSetMonitorWatcher] Primary for replica set shard01 changed to SNode_S021:2012
      Thu Sep 13 13:57:54 [ReplicaSetMonitorWatcher] Primary for replica set shard01 changed to SNode_S023:2012
      Thu Sep 13 13:57:54 [WriteBackListener-SNode_S023:2012] DBClientCursor::init call() failed
      Thu Sep 13 13:57:54 [WriteBackListener-SNode_S023:2012] WriteBackListener exception : DBClientBase::findN: transport error: SNode_S023:2012 ns: admin.$cmd query: { writebacklisten: ObjectId('50516d98ac39cc0cc08f7ad3') }
      Thu Sep 13 13:57:55 [conn584] ChunkManager: time to load chunks for infodb.docinfo: 132ms sequenceNumber: 3330 version: 3420|1||504836f4ed66ab254ec61a1e based on: 3419|5||504836f4ed66ab254ec61a1e
      Thu Sep 13 13:57:55 [conn589] ChunkManager: time to load chunks for textdb.doctext: 212ms sequenceNumber: 3331 version: 2856|3||504836f4ed66ab254ec61a1f based on: 2856|1||504836f4ed66ab254ec61a1f
      Thu Sep 13 13:57:55 [conn589] got not master for: SNode_S023:2012
      Thu Sep 13 13:57:55 [conn458] ChunkManager: time to load chunks for infodb.docinfo: 109ms sequenceNumber: 3332 version: 3420|1||504836f4ed66ab254ec61a1e based on: 3419|5||504836f4ed66ab254ec61a1e
      Received signal 11
      Backtrace: 0x8386d5 0x361f632920
      ./mongos(_ZN5mongo17printStackAndExitEi+0x75)[0x8386d5]
      /lib64/libc.so.6[0x361f632920]
      ===

      — /1 —

      and
      — 2 —
      Thu Sep 13 14:27:01 [conn5] ChunkManager: time to load chunks for textdb.doctext: 181ms sequenceNumber: 529 version: 2856|185||504836f4ed66ab254ec61a1f based on: 2856|47||504836f4ed66ab254ec61a1f
      Thu Sep 13 14:27:02 [conn8] Socket recv() errno:104 Connection reset by peer 10.9.0.23:2012
      Thu Sep 13 14:27:02 [WriteBackListener-SNode_S023:2012] DBClientCursor::init call() failed
      Thu Sep 13 14:27:02 [conn8] SocketException: remote: 10.9.0.23:2012 error: 9001 socket exception [1] server [10.9.0.23:2012]
      Thu Sep 13 14:27:02 [conn8] DBClientCursor::init call() failed
      Thu Sep 13 14:27:02 [WriteBackListener-SNode_S023:2012] WriteBackListener exception : DBClientBase::findN: transport error: SNode_S023:2012 ns: admin.$cmd query: { writebacklisten: ObjectId('5051795ec69e943c6fb769f9') }
      Thu Sep 13 14:27:02 [conn8] warning: db exception when initializing on shard01:shard01/SNode_S021:2012,SNode_S022:2012,SNode_S023:2012, current connection state is { state: { conn: "shard01/SNode_S021:2012,SNode_S022:2012,SNode_S023:2012", vinfo: "textdb.doctext @ 2856|185||504836f4ed66ab254ec61a1f", cursor: "(none)", count: 0, done: false }, retryNext: false, init: false, finish: false, errored: false } :: caused by :: 10276 DBClientBase::findN: transport error: SNode_S023:2012 ns: admin.$cmd query: { setShardVersion: "textdb.doctext", configdb: "SNode_S038:2020,SNode_S039:2020,SNode_S040:2020", version: Timestamp 2856000|173, versionEpoch: ObjectId('504836f4ed66ab254ec61a1f'), serverID: ObjectId('5051795ec69e943c6fb769f9'), shard: "shard01", shardHost: "shard01/SNode_S021:2012,SNode_S022:2012,SNode_S023:2012", $auth: {} }
      Thu Sep 13 14:27:02 [conn29] got not master for: SNode_S023:2012
      Thu Sep 13 14:27:02 [conn3] Primary for replica set shard01 changed to SNode_S021:2012
      Received signal 11
      Backtrace: 0x8386d5 0x361f632920 0xc61e30
      ./mongos(_ZN5mongo17printStackAndExitEi+0x75)[0x8386d5]
      /lib64/libc.so.6[0x361f632920]
      ./mongos(_ZTVN5mongo18DBClientConnectionE+0x10)[0xc61e30]
      ===
      — /2 —

      and
      — 3 —
      Tue Sep 18 14:36:38 [WriteBackListener-SNode_S029:2012] Socket recv() errno:104 Connection reset by peer 10.9.0.29:2012
      Tue Sep 18 14:36:38 [WriteBackListener-SNode_S029:2012] SocketException: remote: 10.9.0.29:2012 error: 9001 socket exception [1] server [10.9.0.29:2012]
      Tue Sep 18 14:36:38 [WriteBackListener-SNode_S029:2012] DBClientCursor::init call() failed
      Tue Sep 18 14:36:38 [WriteBackListener-SNode_S029:2012] WriteBackListener exception : DBClientBase::findN: transport error: SNode_S029:2012 ns: admin.$cmd query: { writebacklisten: ObjectId('50570a6870b81f72dd22c467') }
      Tue Sep 18 14:36:38 [mongosMain] connection accepted from 10.9.0.1:38044 #2861 (1062 connections now open)
      Tue Sep 18 14:36:38 [conn2861] got not master for: SNode_S029:2012
      Received signal 11
      Backtrace: 0x8386d5 0x361f632920 0x7f665e428d80
      ./mongos(_ZN5mongo17printStackAndExitEi+0x75)[0x8386d5]
      /lib64/libc.so.6[0x361f632920]
      [0x7f665e428d80]
      ===
      — /3 —
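
      All three excerpts end with a "got not master for: ..." line shortly before "Received signal 11". To check whether that holds across full mongos logs, something like the following Python sketch could be used. It is not part of the original report: the function name `crash_contexts` and the `window` parameter are hypothetical, and the only assumption about the log format is the two message strings visible in the excerpts above.

```python
import re
from collections import deque

# Message strings taken from the log excerpts in this report.
CRASH_MARKER = "Received signal 11"
NOT_MASTER = re.compile(r"got not master for: (\S+)")


def crash_contexts(lines, window=10):
    """For each crash marker, yield the hosts named in 'got not master'
    lines found among the `window` lines that precede it."""
    recent = deque(maxlen=window)  # sliding window of preceding lines
    for line in lines:
        if CRASH_MARKER in line:
            hosts = []
            for prev in recent:
                match = NOT_MASTER.search(prev)
                if match:
                    hosts.append(match.group(1))
            yield hosts
            recent.clear()  # start fresh after each crash
        else:
            recent.append(line)


# Usage with lines copied from excerpt 3.
sample = [
    "Tue Sep 18 14:36:38 [WriteBackListener-SNode_S029:2012] DBClientCursor::init call() failed",
    "Tue Sep 18 14:36:38 [conn2861] got not master for: SNode_S029:2012",
    "Received signal 11",
]
print(list(crash_contexts(sample)))  # [['SNode_S029:2012']]
```

      An empty list for a crash would mean that no "got not master" line appeared within the window, which would argue against the pattern described above.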

              People

              Assignee:
              Unassigned
              Reporter:
              matrixyy Tieying Zhang
              Votes:
              10
              Watchers:
              7
