Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-17441

mongos crash right after "not master" error

    • Type: Icon: Bug Bug
    • Resolution: Done
    • Priority: Icon: Critical - P2 Critical - P2
    • 2.6.9
    • Affects Version/s: 2.6.8
    • Component/s: Sharding
    • Labels:
      None
    • Fully Compatible
    • ALL

      Issue Status as of Mar 17, 2015

      ISSUE SUMMARY / USER IMPACT
      mongos may crash when encountering a "not master" error, which can occur as a result of cluster topology changes.

      WORKAROUNDS
      Downgrade mongos to 2.6.7. There's no need to downgrade mongod.

      AFFECTED VERSIONS
      mongos in MongoDB 2.6.8 is affected by this issue

      FIX VERSION
      The fix is included in the 2.6.9 production release.

      RESOLUTION DETAILS
      Implementation fix was to ensure clearing of the host string to primary when connection to primary is cleaned up.

      Original description

      Hello,
      Ever since I upgraded this week-end to 2.6.8, my mongos keep on crashing...
      Here is what I have in my logs before:

      2015-03-03T05:50:30.501+0100 [conn32137] ChunkManager: time to load chunks for monitoring.LinkMDB: 98ms sequenceNumber: 815 version: 2657|1||53b07a37bba6714cea30e13d based on: (empty)
      2015-03-03T05:50:30.509+0100 [conn32137] ChunkManager: time to load chunks for monitoring.LinkStats: 3ms sequenceNumber: 817 version: 70|0||54db124dd4ba96ba92f1b6a8 based on: (empty)
      2015-03-03T05:50:30.514+0100 [conn32137] ChunkManager: time to load chunks for monitoring.Twittos: 4ms sequenceNumber: 818 version: 37|0||54db12f2d4ba96ba92f1b6ab based on: (empty)
      2015-03-03T05:50:30.516+0100 [conn32156] warning: chunk manager reload forced for collection 'monitoring.LinkMDB', config version is 2657|1||53b07a37bba6714cea30e13d
      2015-03-03T05:50:30.637+0100 [conn32138] ChunkManager: time to load chunks for monitoring.LinkMDB: 112ms sequenceNumber: 819 version: 2657|1||53b07a37bba6714cea30e13d based on: (empty)
      2015-03-03T05:50:30.650+0100 [conn32138] ChunkManager: time to load chunks for monitoring.LinkStats: 7ms sequenceNumber: 820 version: 70|0||54db124dd4ba96ba92f1b6a8 based on: (empty)
      2015-03-03T05:50:30.654+0100 [conn32138] ChunkManager: time to load chunks for monitoring.Twittos: 3ms sequenceNumber: 821 version: 37|0||54db12f2d4ba96ba92f1b6ab based on: (empty)
      2015-03-03T05:50:30.660+0100 [conn32145] warning: chunk manager reload forced for collection 'monitoring.LinkMDB', config version is 2657|1||53b07a37bba6714cea30e13d
      2015-03-03T05:50:30.660+0100 [conn32147] warning: chunk manager reload forced for collection 'monitoring.LinkMDB', config version is 2657|1||53b07a37bba6714cea30e13d
      2015-03-03T05:50:30.667+0100 [conn32139] ChunkManager: time to load chunks for monitoring.LinkMDB: 13ms sequenceNumber: 822 version: 2657|1||53b07a37bba6714cea30e13d based on: 2657|1||53b07a37bba6714cea30e13d
      2015-03-03T05:50:30.686+0100 [conn32139] warning: chunk manager reload forced for collection 'monitoring.LinkMDB', config version is 2657|1||53b07a37bba6714cea30e13d
      2015-03-03T05:50:30.687+0100 [conn32150] ChunkManager: time to load chunks for monitoring.LinkMDB: 7ms sequenceNumber: 823 version: 2657|1||53b07a37bba6714cea30e13d based on: 2657|1||53b07a37bba6714cea30e13d
      2015-03-03T05:50:30.703+0100 [conn32150] warning: chunk manager reload forced for collection 'monitoring.LinkMDB', config version is 2657|1||53b07a37bba6714cea30e13d
      2015-03-03T05:50:30.716+0100 [conn32143] warning: Primary for node8/node8_p:27017,node8_s:27017 was down before, bypassing setShardVersion. The local replica set view and targeting may be stale.
      2015-03-03T05:50:30.717+0100 [conn32137] ChunkManager: time to load chunks for monitoring.LinkMDB: 21ms sequenceNumber: 824 version: 2657|1||53b07a37bba6714cea30e13d based on: 2657|1||53b07a37bba6714cea30e13d
      2015-03-03T05:50:30.760+0100 [conn32138] ChunkManager: time to load chunks for monitoring.LinkMDB: 36ms sequenceNumber: 826 version: 2657|1||53b07a37bba6714cea30e13d based on: 2657|1||53b07a37bba6714cea30e13d
      2015-03-03T05:50:30.811+0100 [conn32136] ChunkManager: time to load chunks for monitoring.LinkMDB: 92ms sequenceNumber: 825 version: 2657|1||53b07a37bba6714cea30e13d based on: (empty)
      2015-03-03T05:50:30.821+0100 [conn32136] ChunkManager: time to load chunks for monitoring.LinkStats: 5ms sequenceNumber: 827 version: 70|0||54db124dd4ba96ba92f1b6a8 based on: (empty)
      2015-03-03T05:50:30.824+0100 [conn32136] ChunkManager: time to load chunks for monitoring.Twittos: 2ms sequenceNumber: 828 version: 37|0||54db12f2d4ba96ba92f1b6ab based on: (empty)
      2015-03-03T05:50:30.827+0100 [conn32137] warning: chunk manager reload forced for collection 'monitoring.LinkMDB', config version is 2657|1||53b07a37bba6714cea30e13d
      2015-03-03T05:50:30.829+0100 [conn32138] warning: chunk manager reload forced for collection 'monitoring.LinkMDB', config version is 2657|1||53b07a37bba6714cea30e13d
      2015-03-03T05:50:30.852+0100 [conn32136] ChunkManager: time to load chunks for monitoring.LinkStats: 23ms sequenceNumber: 829 version: 70|0||54db124dd4ba96ba92f1b6a8 based on: 70|0||54db124dd4ba96ba92f1b6a8
      2015-03-03T05:50:30.853+0100 [conn32136] warning: chunk manager reload forced for collection 'monitoring.LinkStats', config version is 70|0||54db124dd4ba96ba92f1b6a8
      2015-03-03T05:50:30.924+0100 [conn32128] warning: Primary for node8/node8_p:27017,node8_s:27017 was down before, bypassing setShardVersion. The local replica set view and targeting may be stale.
      2015-03-03T05:50:31.244+0100 [conn32127] warning: Primary for node8/node8_p:27017,node8_s:27017 was down before, bypassing setShardVersion. The local replica set view and targeting may be stale.
      2015-03-03T05:50:31.602+0100 [conn32145] warning: Primary for node8/node8_p:27017,node8_s:27017 was down before, bypassing setShardVersion. The local replica set view and targeting may be stale.
      2015-03-03T05:50:31.821+0100 [conn32154] warning: Primary for node8/node8_p:27017,node8_s:27017 was down before, bypassing setShardVersion. The local replica set view and targeting may be stale.
      2015-03-03T05:50:32.501+0100 [conn32145] warning: Primary for node8/node8_p:27017,node8_s:27017 was down before, bypassing setShardVersion. The local replica set view and targeting may be stale.
      2015-03-03T05:50:32.737+0100 [conn32127] warning: Primary for node8/node8_p:27017,node8_s:27017 was down before, bypassing setShardVersion. The local replica set view and targeting may be stale.
      2015-03-03T05:50:32.884+0100 [conn32155] warning: Primary for node8/node8_p:27017,node8_s:27017 was down before, bypassing setShardVersion. The local replica set view and targeting may be stale.
      2015-03-03T05:50:33.474+0100 [conn32145] warning: Primary for node8/node8_p:27017,node8_s:27017 was down before, bypassing setShardVersion. The local replica set view and targeting may be stale.
      2015-03-03T05:50:33.697+0100 [conn32128] warning: Primary for node8/node8_p:27017,node8_s:27017 was down before, bypassing setShardVersion. The local replica set view and targeting may be stale.
      2015-03-03T05:50:33.841+0100 [conn32143] warning: Primary for node8/node8_p:27017,node8_s:27017 was down before, bypassing setShardVersion. The local replica set view and targeting may be stale.
      2015-03-03T05:50:34.091+0100 [conn32154] warning: Primary for node8/node8_p:27017,node8_s:27017 was down before, bypassing setShardVersion. The local replica set view and targeting may be stale.
      2015-03-03T05:50:34.302+0100 [conn32162] Socket recv() timeout  37.187.81.132:27017
      2015-03-03T05:50:34.310+0100 [conn32162] SocketException: remote: 37.187.81.132:27017 error: 9001 socket exception [RECV_TIMEOUT] server [37.187.81.132:27017]
      2015-03-03T05:50:34.312+0100 [conn32162] DBClientCursor::init call() failed
      2015-03-03T05:50:34.312+0100 [conn32162] Detected bad connection created at 1425358229302575 microSec, clearing pool for node8_p:27017 of 0 connections
      2015-03-03T05:50:34.765+0100 [conn32885] end connection 5.135.143.229:58055 (81 connections now open)
      2015-03-03T05:50:35.235+0100 [conn32154] warning: Primary for node8/node8_p:27017,node8_s:27017 was down before, bypassing setShardVersion. The local replica set view and targeting may be stale.
      2015-03-03T05:50:36.027+0100 [WriteBackListener-node8_p:27017] DBClientCursor::init call() failed
      2015-03-03T05:50:36.033+0100 [WriteBackListener-node8_p:27017] Detected bad connection created at 1425259483926799 microSec, clearing pool for node8_p:27017 of 2 connections
      2015-03-03T05:50:36.034+0100 [WriteBackListener-node8_p:27017] WriteBackListener exception : DBClientBase::findN: transport error: node8_p:27017 ns: admin.$cmd query: { writebacklisten: ObjectId('54f3bbc0b50bae19882a41c5') }
      2015-03-03T05:50:37.500+0100 [conn32132] warning: Primary for node8/node8_p:27017,node8_s:27017 was down before, bypassing setShardVersion. The local replica set view and targeting may be stale.
      2015-03-03T05:50:39.624+0100 [LockPinger] cluster mdb_config1:27019,mdb_config2:27019,mdb_config3:27019 pinged successfully at Tue Mar  3 05:50:39 2015 by distributed lock pinger 'mdb_config1:27019,mdb_config2:27019,mdb_config3:27019/mdb_proxy1:27017:1425259457:1804289383', sleeping for 30000ms
      2015-03-03T05:50:40.573+0100 [Balancer] Socket closed remotely, no longer connected (idle 37 secs, remote host 37.187.81.132:27017)
      2015-03-03T05:50:45.078+0100 [WriteBackListener-node8_s:27017] DBClientCursor::init call() failed
      2015-03-03T05:50:45.078+0100 [WriteBackListener-node8_s:27017] Detected bad connection created at 1425262772295994 microSec, clearing pool for node8_s:27017 of 0 connections
      2015-03-03T05:50:45.078+0100 [WriteBackListener-node8_s:27017] WriteBackListener exception : DBClientBase::findN: transport error: node8_s:27017 ns: admin.$cmd query: { writebacklisten: ObjectId('54f3bbc0b50bae19882a41c5') }
      2015-03-03T05:50:46.034+0100 [conn32868] Socket closed remotely, no longer connected (idle 5 secs, remote host 37.187.81.161:27017)
      2015-03-03T05:50:46.049+0100 [conn32868] got not master for: node8_s:27017
      2015-03-03T05:50:46.069+0100 [conn32868] Socket closed remotely, no longer connected (idle 17 secs, remote host 37.187.81.161:27017)
      2015-03-03T05:50:46.071+0100 [conn32867] DBClientCursor::init lazy say() failed
      2015-03-03T05:50:46.071+0100 [conn32867] DBClientCursor::init message from say() was empty
      2015-03-03T05:50:46.071+0100 [conn32867] slave no longer has secondary status: node8_s:27017
      2015-03-03T05:50:46.072+0100 [conn32867] Socket closed remotely, no longer connected (idle 10 secs, remote host 37.187.81.132:27017)
      2015-03-03T05:50:46.072+0100 [conn32867] Socket closed remotely, no longer connected (idle 10 secs, remote host 37.187.81.161:27017)
      2015-03-03T05:50:46.075+0100 [conn32867] warning: No primary detected for set node8
      2015-03-03T05:50:46.076+0100 [conn32868] warning: problem while initially checking shard versions on node8 :: caused by :: 10009 ReplicaSetMonitor no master found for set: node8
      2015-03-03T05:50:46.245+0100 [conn32868] SEVERE: Got signal: 6 (Aborted).
      Backtrace:0xdc39d1 0xdc2dae 0x7fb52e57a6a0 0x7fb52e57a625 0x7fb52e57be05 0x7fb52e57374e 0x7fb52e573810 0x74a385 0x74a3bc 0x753272 0x75b4ef 0x76f664 0x775f89 0xc8787f 0xc5a789 0x6e77b9 0xd78cae 0x7fb52f2979d1 0x7fb52e6308fd
       /usr/bin/mongos(_ZN5mongo15printStackTraceERSo+0x21) [0xdc39d1]
       /usr/bin/mongos() [0xdc2dae]
       /lib64/libc.so.6(+0x326a0) [0x7fb52e57a6a0]
       /lib64/libc.so.6(gsignal+0x35) [0x7fb52e57a625]
       /lib64/libc.so.6(abort+0x175) [0x7fb52e57be05]
       /lib64/libc.so.6(+0x2b74e) [0x7fb52e57374e]
       /lib64/libc.so.6(__assert_perror_fail+0) [0x7fb52e573810]
       /usr/bin/mongos() [0x74a385]
       /usr/bin/mongos(_ZN5mongo18DBClientReplicaSet19selectNodeUsingTagsEN5boost10shared_ptrINS_21ReadPreferenceSettingEEE+0x2c) [0x74a3bc]
       /usr/bin/mongos(_ZN5mongo18DBClientReplicaSet3sayERNS_7MessageEbPSs+0x5d2) [0x753272]
       /usr/bin/mongos(_ZN5mongo14DBClientCursor8initLazyEb+0x5bf) [0x75b4ef]
       /usr/bin/mongos(_ZN5mongo27ParallelSortClusteredCursor9startInitEv+0xd54) [0x76f664]
       /usr/bin/mongos(_ZN5mongo27ParallelSortClusteredCursor8fullInitEv+0x9) [0x775f89]
       /usr/bin/mongos(_ZN5mongo8Strategy7queryOpERNS_7RequestE+0xaef) [0xc8787f]
       /usr/bin/mongos(_ZN5mongo7Request7processEi+0x669) [0xc5a789]
       /usr/bin/mongos(_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x69) [0x6e77b9]
       /usr/bin/mongos(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x4ee) [0xd78cae]
       /lib64/libpthread.so.0(+0x79d1) [0x7fb52f2979d1]
       /lib64/libc.so.6(clone+0x6d) [0x7fb52e6308fd]
      

            Assignee:
            randolph@mongodb.com Randolph Tan
            Reporter:
            giudicelli Frédéric Giudicelli
            Votes:
            0 Vote for this issue
            Watchers:
            15 Start watching this issue

              Created:
              Updated:
              Resolved: