[SERVER-4857] Mongos crashes with a signal 11 Created: 03/Feb/12  Updated: 03/Feb/12  Resolved: 03/Feb/12

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.0.2
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Tal Kedar Assignee: Unassigned
Resolution: Duplicate Votes: 0
Labels: mongos
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

CentOS 64 bit (Linux <hostname> 2.6.32-71.29.1.el6.x86_64 #1 SMP Mon Jun 27 19:49:27 BST 2011 x86_64 x86_64 x86_64 GNU/Linux)


Operating System: ALL
Participants:

Description

Mongos crashes with a signal 11 (according to the log).

Context:
One of the shard primaries became unstable: networking problems caused high latencies, and the server appeared to flicker between UP and DOWN. This is shard "s6" in the log below, a replica set with one primary, one secondary, and one arbiter, each on a different server and all binding port 5006. Three config mongod instances run on three other servers, all binding port 5000. The mongos runs on a host called "MongosHost", binding port 7000.
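
For context, a rough sketch of how a similar topology could be brought up (host names, dbpaths, and the exact flags here are illustrative stand-ins inferred from the description, not the actual deployment commands):

# shard "s6": one primary, one secondary, one arbiter, each on its own host, all binding port 5006
mongod --shardsvr --replSet s6 --port 5006 --dbpath /data/s6        # run on each of the three s6 hosts
# three config servers, each on its own host, all binding port 5000
mongod --configsvr --port 5000 --dbpath /data/config                # run on each of the three config hosts
# mongos on MongosHost, binding port 7000 and pointing at the three config servers
mongos --port 7000 --configdb x.x.x.223:5000,x.x.x.224:5000,x.x.x.225:5000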

Mongos log (the log ends abruptly; nothing is printed after the "Backtrace" line):
Thu Feb 2 21:13:21 [Balancer] distributed lock 'balancer/MongosHost:7000:1328043036:1804289383' acquired, ts : 4f2afc7114d54bcb687ad1e5
Thu Feb 2 21:13:21 [Balancer] distributed lock 'balancer/MongosHost:7000:1328043036:1804289383' unlocked.
Thu Feb 2 21:13:32 [Balancer] distributed lock 'balancer/MongosHost:7000:1328043036:1804289383' acquired, ts : 4f2afc7b14d54bcb687ad1e6
Thu Feb 2 21:13:32 [Balancer] distributed lock 'balancer/MongosHost:7000:1328043036:1804289383' unlocked.
Thu Feb 2 21:13:42 [Balancer] distributed lock 'balancer/MongosHost:7000:1328043036:1804289383' acquired, ts : 4f2afc8614d54bcb687ad1e7
Thu Feb 2 21:13:42 [Balancer] distributed lock 'balancer/MongosHost:7000:1328043036:1804289383' unlocked.
Thu Feb 2 21:13:48 [LockPinger] cluster x.x.x.223:5000,x.x.x.224:5000,x.x.x.225:5000 pinged successfully at Thu Feb 2 21:13:48 2012 by distributed lock pinger 'x.x.x.223:5000,x.x.x.224:5000,x.x.x.225:5000/MongosHost:7000:1328043036:1804289383', sleeping for 30000ms
Thu Feb 2 21:13:52 [Balancer] distributed lock 'balancer/MongosHost:7000:1328043036:1804289383' acquired, ts : 4f2afc9014d54bcb687ad1e8
Thu Feb 2 21:13:52 [Balancer] distributed lock 'balancer/MongosHost:7000:1328043036:1804289383' unlocked.
Thu Feb 2 21:14:03 [Balancer] distributed lock 'balancer/MongosHost:7000:1328043036:1804289383' acquired, ts : 4f2afc9b14d54bcb687ad1e9
Thu Feb 2 21:14:03 [Balancer] distributed lock 'balancer/MongosHost:7000:1328043036:1804289383' unlocked.
Thu Feb 2 21:14:13 [Balancer] distributed lock 'balancer/MongosHost:7000:1328043036:1804289383' acquired, ts : 4f2afca514d54bcb687ad1ea
Thu Feb 2 21:14:13 [Balancer] distributed lock 'balancer/MongosHost:7000:1328043036:1804289383' unlocked.
Thu Feb 2 21:14:23 [Balancer] distributed lock 'balancer/MongosHost:7000:1328043036:1804289383' acquired, ts : 4f2afcaf14d54bcb687ad1eb
Thu Feb 2 21:14:23 [Balancer] distributed lock 'balancer/MongosHost:7000:1328043036:1804289383' unlocked.
Thu Feb 2 21:14:34 [Balancer] distributed lock 'balancer/MongosHost:7000:1328043036:1804289383' acquired, ts : 4f2afcb914d54bcb687ad1ec
Thu Feb 2 21:14:34 [Balancer] distributed lock 'balancer/MongosHost:7000:1328043036:1804289383' unlocked.
Thu Feb 2 21:14:44 [Balancer] distributed lock 'balancer/MongosHost:7000:1328043036:1804289383' acquired, ts : 4f2afcc414d54bcb687ad1ed
Thu Feb 2 21:14:44 [Balancer] distributed lock 'balancer/MongosHost:7000:1328043036:1804289383' unlocked.
Thu Feb 2 21:14:54 [Balancer] distributed lock 'balancer/MongosHost:7000:1328043036:1804289383' acquired, ts : 4f2afcce14d54bcb687ad1ee
Thu Feb 2 21:14:54 [Balancer] distributed lock 'balancer/MongosHost:7000:1328043036:1804289383' unlocked.
Thu Feb 2 21:15:05 [Balancer] distributed lock 'balancer/MongosHost:7000:1328043036:1804289383' acquired, ts : 4f2afcd914d54bcb687ad1ef
Thu Feb 2 21:15:05 [Balancer] distributed lock 'balancer/MongosHost:7000:1328043036:1804289383' unlocked.
Thu Feb 2 21:15:15 [Balancer] distributed lock 'balancer/MongosHost:7000:1328043036:1804289383' acquired, ts : 4f2afce314d54bcb687ad1f0
Thu Feb 2 21:15:15 [Balancer] distributed lock 'balancer/MongosHost:7000:1328043036:1804289383' unlocked.
Thu Feb 2 21:15:26 [Balancer] distributed lock 'balancer/MongosHost:7000:1328043036:1804289383' acquired, ts : 4f2afced14d54bcb687ad1f1
Thu Feb 2 21:15:26 [Balancer] distributed lock 'balancer/MongosHost:7000:1328043036:1804289383' unlocked.
Thu Feb 2 21:15:51 [conn128] DBException in process: could not initialize cursor across all shards because : not master and slaveok=false @ s6/x.x.x.130:5006,x.x.x.128:5006
Thu Feb 2 21:15:52 [conn128] DBException in process: could not initialize cursor across all shards because : not master and slaveok=false @ s6/x.x.x.130:5006,x.x.x.128:5006
Thu Feb 2 21:15:52 [conn128] DBException in process: could not initialize cursor across all shards because : not master and slaveok=false @ s6/x.x.x.130:5006,x.x.x.128:5006
Thu Feb 2 21:15:54 [conn128] DBException in process: could not initialize cursor across all shards because : not master and slaveok=false @ s6/x.x.x.130:5006,x.x.x.128:5006
Thu Feb 2 21:15:54 [WriteBackListener-x.x.x.128:5006] DBClientCursor::init call() failed
Thu Feb 2 21:15:54 [WriteBackListener-x.x.x.128:5006] WriteBackListener exception : DBClientBase::findN: transport error: x.x.x.128:5006 query: { writebacklisten: ObjectId('4f28541c14d54bcb687a9398') }
Thu Feb 2 21:15:55 [conn128] DBException in process: could not initialize cursor across all shards because : not master and slaveok=false @ s6/x.x.x.130:5006,x.x.x.128:5006
Thu Feb 2 21:15:56 [conn128] created new distributed lock for database.collection on x.x.x.223:5000,x.x.x.224:5000,x.x.x.225:5000 ( lock timeout : 900000, ping interval : 30000, process : 0 )
Thu Feb 2 21:15:56 [conn128] ChunkManager: time to load chunks for database.collection: 4ms sequenceNumber: 67 version: 109|11
Thu Feb 2 21:15:56 [conn128] autosplitted database.collection shard: ns:database.collection at: s3:s3/x.x.x.225:5003,x.x.x.223:5003 lastmod: 84|0 min: { key: "xxx382" } max: { key: "xxx808" } on: { key: "xxx231" } (splitThreshold 67108864)
Thu Feb 2 21:15:56 [conn128] update will be retried b/c sharding config info is stale, left:4 ns: database.collection query: { key: "xxx920", ts: 1324097710000 }
Thu Feb 2 21:15:56 [ReplicaSetMonitorWatcher] Socket recv() errno:104 Connection reset by peer x.x.x.128:5006
Thu Feb 2 21:15:56 [ReplicaSetMonitorWatcher] SocketException: remote: x.x.x.128:5006 error: 9001 socket exception [1] server [x.x.x.128:5006]
Thu Feb 2 21:15:56 [ReplicaSetMonitorWatcher] DBClientCursor::init call() failed
Thu Feb 2 21:15:56 [ReplicaSetMonitorWatcher] trying reconnect to x.x.x.128:5006
Thu Feb 2 21:15:56 [conn128] update will be retried b/c sharding config info is stale, left:4 ns: database.collection query: { key: "xxx051", ts: 1324078158000 }
Thu Feb 2 21:15:56 [ReplicaSetMonitorWatcher] reconnect x.x.x.128:5006 ok
Thu Feb 2 21:15:56 [conn128] update will be retried b/c sharding config info is stale, left:4 ns: database.collection query: { key: "xxx562", ts: 1324054409000 }
Thu Feb 2 21:15:57 [ReplicaSetMonitorWatcher] ReplicaSetMonitor::_checkConnection: x.x.x.130:5006 { setName: "s6", ismaster: false, secondary: true, hosts: [ "x.x.x.130:5006", "x.x.x.128:5006" ], arbiters: [ "x.x.x.129:5006" ], primary: "x.x.x.128:5006", me: "x.x.x.130:5006", maxBsonObjectSize: 16777216, ok: 1.0 }
Thu Feb 2 21:15:57 [ReplicaSetMonitorWatcher] ReplicaSetMonitor::_checkConnection: x.x.x.128:5006 { setName: "s6", ismaster: false, secondary: true, hosts: [ "x.x.x.128:5006", "x.x.x.130:5006" ], arbiters: [ "x.x.x.129:5006" ], me: "x.x.x.128:5006", maxBsonObjectSize: 16777216, ok: 1.0 }
Thu Feb 2 21:15:57 [conn128] ReplicaSetMonitor::_checkConnection: x.x.x.130:5006 { setName: "s6", ismaster: false, secondary: true, hosts: [ "x.x.x.130:5006", "x.x.x.128:5006" ], arbiters: [ "x.x.x.129:5006" ], primary: "x.x.x.128:5006", me: "x.x.x.130:5006", maxBsonObjectSize: 16777216, ok: 1.0 }
Thu Feb 2 21:15:57 [conn128] ReplicaSetMonitor::_checkConnection: x.x.x.128:5006 { setName: "s6", ismaster: false, secondary: true, hosts: [ "x.x.x.128:5006", "x.x.x.130:5006" ], arbiters: [ "x.x.x.129:5006" ], me: "x.x.x.128:5006", maxBsonObjectSize: 16777216, ok: 1.0 }
Thu Feb 2 21:15:59 [conn128] got not master for: x.x.x.130:5006
Received signal 11
Backtrace:



Comments
Comment by Eliot Horowitz (Inactive) [ 03/Feb/12 ]

This is caused by SERVER-4699

Comment by Tal Kedar [ 03/Feb/12 ]

Not sure if related to https://jira.mongodb.org/browse/SERVER-3044
