[SERVER-7093] Mongos crashed because of "got not master" with signal 11 Created: 21/Sep/12  Updated: 15/Feb/13  Resolved: 21/Sep/12

Status: Closed
Project: Core Server
Component/s: Stability
Affects Version/s: 2.2.0
Fix Version/s: None

Type: Bug Priority: Blocker - P1
Reporter: Tieying Zhang Assignee: Unassigned
Resolution: Duplicate Votes: 10
Labels: crash, mongos
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

CentOS 6.3; sharded cluster with 12 shards (each shard a 3-member replica set), 3 config servers, 5 mongos


Issue Links:
Duplicate
duplicates SERVER-7061 mongos can use invalid ptr to master ... Closed
Operating System: Linux
Participants:

 Description   

All 5 mongos processes crash repeatedly, and they crash at almost the same time. In each case mongos logs "got not master for: 192.168.99.1" and "DBClientCursor::init call() failed", then receives signal 11.

The version is 2.2.0.

This bug is similar to SERVER-6539: https://jira.mongodb.org/browse/SERVER-6539

Backtraces below:

— 1 —
Thu Sep 13 13:57:24 [ReplicaSetMonitorWatcher] Primary for replica set shard01 changed to SNode_S023:2012
Thu Sep 13 13:57:34 [ReplicaSetMonitorWatcher] Primary for replica set shard01 changed to SNode_S021:2012
Thu Sep 13 13:57:34 [ReplicaSetMonitorWatcher] Primary for replica set shard01 changed to SNode_S023:2012
Thu Sep 13 13:57:44 [ReplicaSetMonitorWatcher] Primary for replica set shard01 changed to SNode_S021:2012
Thu Sep 13 13:57:44 [ReplicaSetMonitorWatcher] Primary for replica set shard01 changed to SNode_S023:2012
Thu Sep 13 13:57:54 [ReplicaSetMonitorWatcher] Primary for replica set shard01 changed to SNode_S021:2012
Thu Sep 13 13:57:54 [ReplicaSetMonitorWatcher] Primary for replica set shard01 changed to SNode_S023:2012
Thu Sep 13 13:57:54 [WriteBackListener-SNode_S023:2012] DBClientCursor::init call() failed
Thu Sep 13 13:57:54 [WriteBackListener-SNode_S023:2012] WriteBackListener exception : DBClientBase::findN: transport error: SNode_S023:2012 ns: admin.$cmd query: { writebacklisten: ObjectId('50516d98ac39cc0cc08f7ad3') }
Thu Sep 13 13:57:55 [conn584] ChunkManager: time to load chunks for infodb.docinfo: 132ms sequenceNumber: 3330 version: 3420|1||504836f4ed66ab254ec61a1e based on: 3419|5||504836f4ed66ab254ec61a1e
Thu Sep 13 13:57:55 [conn589] ChunkManager: time to load chunks for textdb.doctext: 212ms sequenceNumber: 3331 version: 2856|3||504836f4ed66ab254ec61a1f based on: 2856|1||504836f4ed66ab254ec61a1f
Thu Sep 13 13:57:55 [conn589] got not master for: SNode_S023:2012
Thu Sep 13 13:57:55 [conn458] ChunkManager: time to load chunks for infodb.docinfo: 109ms sequenceNumber: 3332 version: 3420|1||504836f4ed66ab254ec61a1e based on: 3419|5||504836f4ed66ab254ec61a1e
Received signal 11
Backtrace: 0x8386d5 0x361f632920
./mongos(_ZN5mongo17printStackAndExitEi+0x75)[0x8386d5]
/lib64/libc.so.6[0x361f632920]
===

— /1 —

and
— 2 —
Thu Sep 13 14:27:01 [conn5] ChunkManager: time to load chunks for textdb.doctext: 181ms sequenceNumber: 529 version: 2856|185||504836f4ed66ab254ec61a1f based on: 2856|47||504836f4ed66ab254ec61a1f
Thu Sep 13 14:27:02 [conn8] Socket recv() errno:104 Connection reset by peer 10.9.0.23:2012
Thu Sep 13 14:27:02 [WriteBackListener-SNode_S023:2012] DBClientCursor::init call() failed
Thu Sep 13 14:27:02 [conn8] SocketException: remote: 10.9.0.23:2012 error: 9001 socket exception [1] server [10.9.0.23:2012]
Thu Sep 13 14:27:02 [conn8] DBClientCursor::init call() failed
Thu Sep 13 14:27:02 [WriteBackListener-SNode_S023:2012] WriteBackListener exception : DBClientBase::findN: transport error: SNode_S023:2012 ns: admin.$cmd query: { writebacklisten: ObjectId('5051795ec69e943c6fb769f9') }
Thu Sep 13 14:27:02 [conn8] warning: db exception when initializing on shard01:shard01/SNode_S021:2012,SNode_S022:2012,SNode_S023:2012, current connection state is { state: { conn: "shard01/SNode_S021:2012,SNode_S022:2012,SNode_S023:2012", vinfo: "textdb.doctext @ 2856|185||504836f4ed66ab254ec61a1f", cursor: "(none)", count: 0, done: false }, retryNext: false, init: false, finish: false, errored: false } :: caused by :: 10276 DBClientBase::findN: transport error: SNode_S023:2012 ns: admin.$cmd query: { setShardVersion: "textdb.doctext", configdb: "SNode_S038:2020,SNode_S039:2020,SNode_S040:2020", version: Timestamp 2856000|173, versionEpoch: ObjectId('504836f4ed66ab254ec61a1f'), serverID: ObjectId('5051795ec69e943c6fb769f9'), shard: "shard01", shardHost: "shard01/SNode_S021:2012,SNode_S022:2012,SNode_S023:2012", $auth: {} }
Thu Sep 13 14:27:02 [conn29] got not master for: SNode_S023:2012
Thu Sep 13 14:27:02 [conn3] Primary for replica set shard01 changed to SNode_S021:2012
Received signal 11
Backtrace: 0x8386d5 0x361f632920 0xc61e30
./mongos(_ZN5mongo17printStackAndExitEi+0x75)[0x8386d5]
/lib64/libc.so.6[0x361f632920]
./mongos(_ZTVN5mongo18DBClientConnectionE+0x10)[0xc61e30]
===
— /2 —

and
— 3 —
Tue Sep 18 14:36:38 [WriteBackListener-SNode_S029:2012] Socket recv() errno:104 Connection reset by peer 10.9.0.29:2012
Tue Sep 18 14:36:38 [WriteBackListener-SNode_S029:2012] SocketException: remote: 10.9.0.29:2012 error: 9001 socket exception [1] server [10.9.0.29:2012]
Tue Sep 18 14:36:38 [WriteBackListener-SNode_S029:2012] DBClientCursor::init call() failed
Tue Sep 18 14:36:38 [WriteBackListener-SNode_S029:2012] WriteBackListener exception : DBClientBase::findN: transport error: SNode_S029:2012 ns: admin.$cmd query: { writebacklisten: ObjectId('50570a6870b81f72dd22c467') }
Tue Sep 18 14:36:38 [mongosMain] connection accepted from 10.9.0.1:38044 #2861 (1062 connections now open)
Tue Sep 18 14:36:38 [conn2861] got not master for: SNode_S029:2012
Received signal 11
Backtrace: 0x8386d5 0x361f632920 0x7f665e428d80
./mongos(_ZN5mongo17printStackAndExitEi+0x75)[0x8386d5]
/lib64/libc.so.6[0x361f632920]
[0x7f665e428d80]
===
— /3 —
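
To check other mongos logs for the same signature as the three excerpts above (a "got not master for: <host>" line shortly before "Received signal 11"), a small scanner along these lines may help. This is only a sketch: the function name find_crashes and the 5-line window are illustrative assumptions, not part of mongos or any MongoDB tool, and the regular expressions are based solely on the log lines quoted in this ticket.

```python
import re

# Patterns taken from the log excerpts in this ticket (mongos 2.2.0 format).
NOT_MASTER = re.compile(r"\[(?P<thread>[^\]]+)\] got not master for: (?P<host>\S+)")
SIGNAL = re.compile(r"^Received signal (?P<num>\d+)")


def find_crashes(lines, window=5):
    """Return (host, signal) pairs for each 'Received signal N' line that is
    preceded, within `window` lines, by a 'got not master for: <host>' line.

    `window` is an assumed heuristic; in the excerpts above the two lines
    are at most two lines apart.
    """
    crashes = []
    recent = []  # (line_index, host) for "got not master" lines seen so far
    for i, line in enumerate(lines):
        m = NOT_MASTER.search(line)
        if m:
            recent.append((i, m.group("host")))
            continue
        s = SIGNAL.match(line)
        if s:
            hits = [host for idx, host in recent if i - idx <= window]
            if hits:
                # Report the most recent "not master" host before the signal.
                crashes.append((hits[-1], int(s.group("num"))))
            recent.clear()
    return crashes
```

For example, find_crashes(open("mongos.log").read().splitlines()) would list each shard host that mongos reported as "not master" immediately before a crash (the file name mongos.log is a placeholder).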



 Comments   
Comment by Tieying Zhang [ 21/Sep/12 ]

A similar problem, "mongos consistently crashing after Primary replica change", was reported on the mongodb-user group: https://groups.google.com/forum/?fromgroups=#!topic/mongodb-user/NeeB86n9-JU
