[SERVER-17441] mongos crash right after "not master" error Created: 03/Mar/15  Updated: 16/Apr/15  Resolved: 06/Mar/15

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.6.8
Fix Version/s: 2.6.9

Type: Bug Priority: Critical - P2
Reporter: Frédéric Giudicelli Assignee: Randolph Tan
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related:
related to SERVER-10904: Possible for _master and _slaveConn t... (Closed)
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

Description
Issue Status as of Mar 17, 2015

ISSUE SUMMARY / USER IMPACT
mongos may crash when encountering a "not master" error, which can occur as a result of cluster topology changes.

WORKAROUNDS
Downgrade mongos to 2.6.7. There's no need to downgrade mongod.

AFFECTED VERSIONS
Only mongos in MongoDB 2.6.8 is affected by this issue.

FIX VERSION
The fix is included in the 2.6.9 production release.

RESOLUTION DETAILS
The fix ensures that the cached host string for the primary is cleared whenever the connection to the primary is cleaned up.
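
For illustration only, here is a minimal C++ sketch of the failure mode and the fix. The member names _master and _masterHost are assumptions modeled on DBClientReplicaSet (the real change is in the v2.6 commit linked in the comments below); this is not the actual MongoDB source.

// Minimal sketch, assuming members named _master and _masterHost as in
// DBClientReplicaSet; not the actual MongoDB source.
#include <cassert>
#include <memory>
#include <string>

struct DBClientConnection {};  // stand-in for the real connection type

class ReplicaSetClientSketch {
public:
    void cleanupPrimaryConnection() {
        _master.reset();
        // The 2.6.9 fix: also clear the cached host string so no later code
        // path concludes that a primary connection still exists.
        _masterHost.clear();
    }

    DBClientConnection* primary() {
        if (!_masterHost.empty()) {
            // In 2.6.8 an invariant like this could fire (abort, signal 6),
            // because the host string survived cleanupPrimaryConnection().
            assert(_master != nullptr);
            return _master.get();
        }
        return nullptr;  // caller must (re)discover the primary
    }

private:
    std::unique_ptr<DBClientConnection> _master;  // connection to the primary
    std::string _masterHost;                      // cached "host:port" of the primary
};

int main() {
    ReplicaSetClientSketch client;
    client.cleanupPrimaryConnection();
    // With the fix, this returns nullptr instead of tripping the assert.
    return client.primary() == nullptr ? 0 : 1;
}

With both fields cleared together, a caller that reaches primary() after cleanup gets a nullptr and falls back to replica set discovery instead of aborting the mongos process.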

Original description

Hello,
Ever since I upgraded to 2.6.8 this weekend, my mongos instances keep crashing.
Here is what I have in my logs just before a crash:

2015-03-03T05:50:30.501+0100 [conn32137] ChunkManager: time to load chunks for monitoring.LinkMDB: 98ms sequenceNumber: 815 version: 2657|1||53b07a37bba6714cea30e13d based on: (empty)
2015-03-03T05:50:30.509+0100 [conn32137] ChunkManager: time to load chunks for monitoring.LinkStats: 3ms sequenceNumber: 817 version: 70|0||54db124dd4ba96ba92f1b6a8 based on: (empty)
2015-03-03T05:50:30.514+0100 [conn32137] ChunkManager: time to load chunks for monitoring.Twittos: 4ms sequenceNumber: 818 version: 37|0||54db12f2d4ba96ba92f1b6ab based on: (empty)
2015-03-03T05:50:30.516+0100 [conn32156] warning: chunk manager reload forced for collection 'monitoring.LinkMDB', config version is 2657|1||53b07a37bba6714cea30e13d
2015-03-03T05:50:30.637+0100 [conn32138] ChunkManager: time to load chunks for monitoring.LinkMDB: 112ms sequenceNumber: 819 version: 2657|1||53b07a37bba6714cea30e13d based on: (empty)
2015-03-03T05:50:30.650+0100 [conn32138] ChunkManager: time to load chunks for monitoring.LinkStats: 7ms sequenceNumber: 820 version: 70|0||54db124dd4ba96ba92f1b6a8 based on: (empty)
2015-03-03T05:50:30.654+0100 [conn32138] ChunkManager: time to load chunks for monitoring.Twittos: 3ms sequenceNumber: 821 version: 37|0||54db12f2d4ba96ba92f1b6ab based on: (empty)
2015-03-03T05:50:30.660+0100 [conn32145] warning: chunk manager reload forced for collection 'monitoring.LinkMDB', config version is 2657|1||53b07a37bba6714cea30e13d
2015-03-03T05:50:30.660+0100 [conn32147] warning: chunk manager reload forced for collection 'monitoring.LinkMDB', config version is 2657|1||53b07a37bba6714cea30e13d
2015-03-03T05:50:30.667+0100 [conn32139] ChunkManager: time to load chunks for monitoring.LinkMDB: 13ms sequenceNumber: 822 version: 2657|1||53b07a37bba6714cea30e13d based on: 2657|1||53b07a37bba6714cea30e13d
2015-03-03T05:50:30.686+0100 [conn32139] warning: chunk manager reload forced for collection 'monitoring.LinkMDB', config version is 2657|1||53b07a37bba6714cea30e13d
2015-03-03T05:50:30.687+0100 [conn32150] ChunkManager: time to load chunks for monitoring.LinkMDB: 7ms sequenceNumber: 823 version: 2657|1||53b07a37bba6714cea30e13d based on: 2657|1||53b07a37bba6714cea30e13d
2015-03-03T05:50:30.703+0100 [conn32150] warning: chunk manager reload forced for collection 'monitoring.LinkMDB', config version is 2657|1||53b07a37bba6714cea30e13d
2015-03-03T05:50:30.716+0100 [conn32143] warning: Primary for node8/node8_p:27017,node8_s:27017 was down before, bypassing setShardVersion. The local replica set view and targeting may be stale.
2015-03-03T05:50:30.717+0100 [conn32137] ChunkManager: time to load chunks for monitoring.LinkMDB: 21ms sequenceNumber: 824 version: 2657|1||53b07a37bba6714cea30e13d based on: 2657|1||53b07a37bba6714cea30e13d
2015-03-03T05:50:30.760+0100 [conn32138] ChunkManager: time to load chunks for monitoring.LinkMDB: 36ms sequenceNumber: 826 version: 2657|1||53b07a37bba6714cea30e13d based on: 2657|1||53b07a37bba6714cea30e13d
2015-03-03T05:50:30.811+0100 [conn32136] ChunkManager: time to load chunks for monitoring.LinkMDB: 92ms sequenceNumber: 825 version: 2657|1||53b07a37bba6714cea30e13d based on: (empty)
2015-03-03T05:50:30.821+0100 [conn32136] ChunkManager: time to load chunks for monitoring.LinkStats: 5ms sequenceNumber: 827 version: 70|0||54db124dd4ba96ba92f1b6a8 based on: (empty)
2015-03-03T05:50:30.824+0100 [conn32136] ChunkManager: time to load chunks for monitoring.Twittos: 2ms sequenceNumber: 828 version: 37|0||54db12f2d4ba96ba92f1b6ab based on: (empty)
2015-03-03T05:50:30.827+0100 [conn32137] warning: chunk manager reload forced for collection 'monitoring.LinkMDB', config version is 2657|1||53b07a37bba6714cea30e13d
2015-03-03T05:50:30.829+0100 [conn32138] warning: chunk manager reload forced for collection 'monitoring.LinkMDB', config version is 2657|1||53b07a37bba6714cea30e13d
2015-03-03T05:50:30.852+0100 [conn32136] ChunkManager: time to load chunks for monitoring.LinkStats: 23ms sequenceNumber: 829 version: 70|0||54db124dd4ba96ba92f1b6a8 based on: 70|0||54db124dd4ba96ba92f1b6a8
2015-03-03T05:50:30.853+0100 [conn32136] warning: chunk manager reload forced for collection 'monitoring.LinkStats', config version is 70|0||54db124dd4ba96ba92f1b6a8
2015-03-03T05:50:30.924+0100 [conn32128] warning: Primary for node8/node8_p:27017,node8_s:27017 was down before, bypassing setShardVersion. The local replica set view and targeting may be stale.
2015-03-03T05:50:31.244+0100 [conn32127] warning: Primary for node8/node8_p:27017,node8_s:27017 was down before, bypassing setShardVersion. The local replica set view and targeting may be stale.
2015-03-03T05:50:31.602+0100 [conn32145] warning: Primary for node8/node8_p:27017,node8_s:27017 was down before, bypassing setShardVersion. The local replica set view and targeting may be stale.
2015-03-03T05:50:31.821+0100 [conn32154] warning: Primary for node8/node8_p:27017,node8_s:27017 was down before, bypassing setShardVersion. The local replica set view and targeting may be stale.
2015-03-03T05:50:32.501+0100 [conn32145] warning: Primary for node8/node8_p:27017,node8_s:27017 was down before, bypassing setShardVersion. The local replica set view and targeting may be stale.
2015-03-03T05:50:32.737+0100 [conn32127] warning: Primary for node8/node8_p:27017,node8_s:27017 was down before, bypassing setShardVersion. The local replica set view and targeting may be stale.
2015-03-03T05:50:32.884+0100 [conn32155] warning: Primary for node8/node8_p:27017,node8_s:27017 was down before, bypassing setShardVersion. The local replica set view and targeting may be stale.
2015-03-03T05:50:33.474+0100 [conn32145] warning: Primary for node8/node8_p:27017,node8_s:27017 was down before, bypassing setShardVersion. The local replica set view and targeting may be stale.
2015-03-03T05:50:33.697+0100 [conn32128] warning: Primary for node8/node8_p:27017,node8_s:27017 was down before, bypassing setShardVersion. The local replica set view and targeting may be stale.
2015-03-03T05:50:33.841+0100 [conn32143] warning: Primary for node8/node8_p:27017,node8_s:27017 was down before, bypassing setShardVersion. The local replica set view and targeting may be stale.
2015-03-03T05:50:34.091+0100 [conn32154] warning: Primary for node8/node8_p:27017,node8_s:27017 was down before, bypassing setShardVersion. The local replica set view and targeting may be stale.
2015-03-03T05:50:34.302+0100 [conn32162] Socket recv() timeout  37.187.81.132:27017
2015-03-03T05:50:34.310+0100 [conn32162] SocketException: remote: 37.187.81.132:27017 error: 9001 socket exception [RECV_TIMEOUT] server [37.187.81.132:27017]
2015-03-03T05:50:34.312+0100 [conn32162] DBClientCursor::init call() failed
2015-03-03T05:50:34.312+0100 [conn32162] Detected bad connection created at 1425358229302575 microSec, clearing pool for node8_p:27017 of 0 connections
2015-03-03T05:50:34.765+0100 [conn32885] end connection 5.135.143.229:58055 (81 connections now open)
2015-03-03T05:50:35.235+0100 [conn32154] warning: Primary for node8/node8_p:27017,node8_s:27017 was down before, bypassing setShardVersion. The local replica set view and targeting may be stale.
2015-03-03T05:50:36.027+0100 [WriteBackListener-node8_p:27017] DBClientCursor::init call() failed
2015-03-03T05:50:36.033+0100 [WriteBackListener-node8_p:27017] Detected bad connection created at 1425259483926799 microSec, clearing pool for node8_p:27017 of 2 connections
2015-03-03T05:50:36.034+0100 [WriteBackListener-node8_p:27017] WriteBackListener exception : DBClientBase::findN: transport error: node8_p:27017 ns: admin.$cmd query: { writebacklisten: ObjectId('54f3bbc0b50bae19882a41c5') }
2015-03-03T05:50:37.500+0100 [conn32132] warning: Primary for node8/node8_p:27017,node8_s:27017 was down before, bypassing setShardVersion. The local replica set view and targeting may be stale.
2015-03-03T05:50:39.624+0100 [LockPinger] cluster mdb_config1:27019,mdb_config2:27019,mdb_config3:27019 pinged successfully at Tue Mar  3 05:50:39 2015 by distributed lock pinger 'mdb_config1:27019,mdb_config2:27019,mdb_config3:27019/mdb_proxy1:27017:1425259457:1804289383', sleeping for 30000ms
2015-03-03T05:50:40.573+0100 [Balancer] Socket closed remotely, no longer connected (idle 37 secs, remote host 37.187.81.132:27017)
2015-03-03T05:50:45.078+0100 [WriteBackListener-node8_s:27017] DBClientCursor::init call() failed
2015-03-03T05:50:45.078+0100 [WriteBackListener-node8_s:27017] Detected bad connection created at 1425262772295994 microSec, clearing pool for node8_s:27017 of 0 connections
2015-03-03T05:50:45.078+0100 [WriteBackListener-node8_s:27017] WriteBackListener exception : DBClientBase::findN: transport error: node8_s:27017 ns: admin.$cmd query: { writebacklisten: ObjectId('54f3bbc0b50bae19882a41c5') }
2015-03-03T05:50:46.034+0100 [conn32868] Socket closed remotely, no longer connected (idle 5 secs, remote host 37.187.81.161:27017)
2015-03-03T05:50:46.049+0100 [conn32868] got not master for: node8_s:27017
2015-03-03T05:50:46.069+0100 [conn32868] Socket closed remotely, no longer connected (idle 17 secs, remote host 37.187.81.161:27017)
2015-03-03T05:50:46.071+0100 [conn32867] DBClientCursor::init lazy say() failed
2015-03-03T05:50:46.071+0100 [conn32867] DBClientCursor::init message from say() was empty
2015-03-03T05:50:46.071+0100 [conn32867] slave no longer has secondary status: node8_s:27017
2015-03-03T05:50:46.072+0100 [conn32867] Socket closed remotely, no longer connected (idle 10 secs, remote host 37.187.81.132:27017)
2015-03-03T05:50:46.072+0100 [conn32867] Socket closed remotely, no longer connected (idle 10 secs, remote host 37.187.81.161:27017)
2015-03-03T05:50:46.075+0100 [conn32867] warning: No primary detected for set node8
2015-03-03T05:50:46.076+0100 [conn32868] warning: problem while initially checking shard versions on node8 :: caused by :: 10009 ReplicaSetMonitor no master found for set: node8
2015-03-03T05:50:46.245+0100 [conn32868] SEVERE: Got signal: 6 (Aborted).
Backtrace:0xdc39d1 0xdc2dae 0x7fb52e57a6a0 0x7fb52e57a625 0x7fb52e57be05 0x7fb52e57374e 0x7fb52e573810 0x74a385 0x74a3bc 0x753272 0x75b4ef 0x76f664 0x775f89 0xc8787f 0xc5a789 0x6e77b9 0xd78cae 0x7fb52f2979d1 0x7fb52e6308fd
 /usr/bin/mongos(_ZN5mongo15printStackTraceERSo+0x21) [0xdc39d1]
 /usr/bin/mongos() [0xdc2dae]
 /lib64/libc.so.6(+0x326a0) [0x7fb52e57a6a0]
 /lib64/libc.so.6(gsignal+0x35) [0x7fb52e57a625]
 /lib64/libc.so.6(abort+0x175) [0x7fb52e57be05]
 /lib64/libc.so.6(+0x2b74e) [0x7fb52e57374e]
 /lib64/libc.so.6(__assert_perror_fail+0) [0x7fb52e573810]
 /usr/bin/mongos() [0x74a385]
 /usr/bin/mongos(_ZN5mongo18DBClientReplicaSet19selectNodeUsingTagsEN5boost10shared_ptrINS_21ReadPreferenceSettingEEE+0x2c) [0x74a3bc]
 /usr/bin/mongos(_ZN5mongo18DBClientReplicaSet3sayERNS_7MessageEbPSs+0x5d2) [0x753272]
 /usr/bin/mongos(_ZN5mongo14DBClientCursor8initLazyEb+0x5bf) [0x75b4ef]
 /usr/bin/mongos(_ZN5mongo27ParallelSortClusteredCursor9startInitEv+0xd54) [0x76f664]
 /usr/bin/mongos(_ZN5mongo27ParallelSortClusteredCursor8fullInitEv+0x9) [0x775f89]
 /usr/bin/mongos(_ZN5mongo8Strategy7queryOpERNS_7RequestE+0xaef) [0xc8787f]
 /usr/bin/mongos(_ZN5mongo7Request7processEi+0x669) [0xc5a789]
 /usr/bin/mongos(_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x69) [0x6e77b9]
 /usr/bin/mongos(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x4ee) [0xd78cae]
 /lib64/libpthread.so.0(+0x79d1) [0x7fb52f2979d1]
 /lib64/libc.so.6(clone+0x6d) [0x7fb52e6308fd]



Comments
Comment by Githook User [ 06/Mar/15 ]

Author: Randolph Tan (renctan) <randolph@10gen.com>

Message: SERVER-17441 mongos crash right after "not master" error
Branch: v2.6
https://github.com/mongodb/mongo/commit/9af0d09d28ec052740e80706e4b8844a564e93a0

Comment by Andy Schwerin [ 03/Mar/15 ]

It should be fine to keep the mongod running 2.6.8.

Comment by Frédéric Giudicelli [ 03/Mar/15 ]

Yes, it seems to be the case.
Can I downgrade the mongos instances and keep the mongod instances on 2.6.8?

Comment by Daniel Pasette (Inactive) [ 03/Mar/15 ]

Thanks for the report. We're investigating now. In the meantime, please downgrade your mongos binaries to 2.6.7.

Just to confirm what we're seeing from the logs, this is only happening while a primary is down in one of your shards?

Comment by Frédéric Giudicelli [ 03/Mar/15 ]

From 2.6.7.

Comment by Daniel Pasette (Inactive) [ 03/Mar/15 ]

What version did you upgrade from?
