[SERVER-7270] mongos crash with signal 6 Created: 05/Oct/12  Updated: 11/Jul/16  Resolved: 12/Nov/12

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Anton V. Volokhov Assignee: Kristina Chodorow (Inactive)
Resolution: Done Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to DOCS-740 Mentions that mongod/s can only creat... Closed
Operating System: Linux
Participants:

 Description   

I'm using MongoDB 2.0.7 with two shards, three replicas each.
All crashes occurred while draining one of the shards.
Several times during the last week I experienced the following mongos crashes:

Sun Sep 30 15:23:34 [conn271] ns: musicgroup.entity could not initialize cursor across all shards because : stale config detected for ns: musicgroup.entity ParallelCursor::_init @ mongodb-sh1/<...> attempt: 0
Sun Sep 30 15:23:34 [mongosMain] connection accepted from 192.168.18.6:54797 #447
Sun Sep 30 15:23:34 [conn447]  authenticate: { authenticate: 1, user: "rating", nonce: "7642e5d689939a9c", key: "2ee0b88e7572c5fac957830fa727075d" }
Sun Sep 30 15:23:36 [conn230] ns: rating.entity could not initialize cursor across all shards because : stale config detected for ns: rating.entity ParallelCursor::_init @ mongodb-sh1/<...> attempt: 0
Sun Sep 30 15:23:40 [conn380] ns: rating.entity could not initialize cursor across all shards because : stale config detected for ns: rating.entity ParallelCursor::_init @ mongodb-sh1/<...> attempt: 0
Sun Sep 30 15:23:41 [mongosMain] connection accepted from 192.168.18.6:54904 #448
Sun Sep 30 15:23:41 [conn448]  authenticate: { authenticate: 1, user: "rating", nonce: "83c1b541f74be683", key: "510ff1afd1d2cf9d41128135fdeccd43" }
Sun Sep 30 15:23:41 [conn448] getaddrinfo("host1f.load.net") failed: Name or service not known
Sun Sep 30 15:23:41 [conn448] DBException in process: could not initialize cursor across all shards because : socket exception @ mongodb-sh1/<...>
Sun Sep 30 15:23:41 [conn448] getaddrinfo("host1f.load.net") failed: Name or service not known
Sun Sep 30 15:23:41 [conn448] DBException in process: could not initialize cursor across all shards because : socket exception @ mongodb-sh1/<...>
Sun Sep 30 15:23:41 [conn448] getaddrinfo("host1f.load.net") failed: Name or service not known
Sun Sep 30 15:23:41 [conn448] DBException in process: could not initialize cursor across all shards because : socket exception @ mongodb-sh1/<...>
Sun Sep 30 15:23:41 [conn448] getaddrinfo("host1f.load.net") failed: Name or service not known
Sun Sep 30 15:23:41 [conn448] DBException in process: could not initialize cursor across all shards because : socket exception @ mongodb-sh1/<...>
Sun Sep 30 15:23:41 [conn327] getaddrinfo("host1f.load.net") failed: Name or service not known
Sun Sep 30 15:23:41 [conn327] warning: could not get last error from a shard host1g.load.net :: caused by :: socket exception
Received signal 6
Backtrace: 0x528c54 0x7ff4a8b77af0 0x7ff4a8b77a75 0x7ff4a8b7b5c0 0x7ff4a8b70941 0x730908 0x712fa5 0x700860 0x72ad97 0x73c8d1 0x5aad20 0x7ff4a9ef89ca 0x7ff4a8c2a70d
mongos(_ZN5mongo17printStackAndExitEi+0x64)[0x528c54]
/lib/libc.so.6(+0x33af0)[0x7ff4a8b77af0]
/lib/libc.so.6(gsignal+0x35)[0x7ff4a8b77a75]
/lib/libc.so.6(abort+0x180)[0x7ff4a8b7b5c0]
/lib/libc.so.6(__assert_fail+0xf1)[0x7ff4a8b70941]
mongos(_ZN5mongo10ClientInfo12getLastErrorERKNS_7BSONObjERNS_14BSONObjBuilderEb+0x3898)[0x730908]
mongos(_ZN5mongo7Command20runAgainstRegisteredEPKcRNS_7BSONObjERNS_14BSONObjBuilderEi+0x805)[0x712fa5]
mongos(_ZN5mongo14SingleStrategy7queryOpERNS_7RequestE+0x4d0)[0x700860]
mongos(_ZN5mongo7Request7processEi+0x157)[0x72ad97]
mongos(_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x71)[0x73c8d1]
mongos(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x260)[0x5aad20]
/lib/libpthread.so.0(+0x69ca)[0x7ff4a9ef89ca]
/lib/libc.so.6(clone+0x6d)[0x7ff4a8c2a70d]
 
 
*********************
 
 
Wed Oct  3 18:28:32 [conn273] creating new connection to:host1d.load.net:27017
Wed Oct  3 18:28:32 [conn273] getaddrinfo("host1d.load.net") failed: Name or service not known
Wed Oct  3 18:28:32 [conn273] warning: could not get last error from a shard host1d.load.net:27017 :: caused by :: socket exception
Wed Oct  3 18:28:32 [conn393] Request::process ns: chords.entity msg id:169529 attempt: 0
Wed Oct  3 18:28:32 [conn393] CursorCache::get id: 4966267460045049647
Wed Oct  3 18:28:32 [conn446] Request::process ns: admin.$cmd msg id:183399 attempt: 0
Wed Oct  3 18:28:32 [conn446] single query: admin.$cmd  { ismaster: 1 }  ntoreturn: -1 options : 0
Received signal 6
Backtrace: 0x528c54 0x7fed6cc0faf0 0x7fed6cc0fa75 0x7fed6cc135c0 0x7fed6cc08941 0x730908 0x712fa5 0x700860 0x72ad97 0x73c8d1 0x5aad20 0x7fed6df909ca 0x7fed6ccc270d
Wed Oct  3 18:28:32 [conn446] Request::process ns: admin.$cmd msg id:183400 attempt: 0
Wed Oct  3 18:28:32 [conn446] single query: admin.$cmd  { ismaster: 1 }  ntoreturn: -1 options : 0
mongos(_ZN5mongo17printStackAndExitEi+0x64)[0x528c54]
/lib/libc.so.6(+0x33af0)[0x7fed6cc0faf0]
/lib/libc.so.6(gsignal+0x35)[0x7fed6cc0fa75]
/lib/libc.so.6(abort+0x180)[0x7fed6cc135c0]
/lib/libc.so.6(__assert_fail+0xf1)[0x7fed6cc08941]
 
mongos(_ZN5mongo10ClientInfo12getLastErrorERKNS_7BSONObjERNS_14BSONObjBuilderEb+0x3898)[0x730908]
mongos(_ZN5mongo7Command20runAgainstRegisteredEPKcRNS_7BSONObjERNS_14BSONObjBuilderEi+0x805)[0x712fa5]
mongos(_ZN5mongo14SingleStrategy7queryOpERNS_7RequestE+0x4d0)[0x700860]
mongos(_ZN5mongo7Request7processEi+0x157)[0x72ad97]
mongos(_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x71)[0x73c8d1]
mongos(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x260)[0x5aad20]
/lib/libpthread.so.0(+0x69ca)[0x7fed6df909ca]
/lib/libc.so.6(clone+0x6d)[0x7fed6ccc270d]
===
 CursorCache at shutdown -  sharded: 2 passthrough: 1900
 
 
**********************
 
 
Thu Oct  4 21:31:09 [conn131] reconnect host2f.load.net:27017 ok
Thu Oct  4 21:31:09 [conn249] getaddrinfo("host2f.load.net") failed: Name or service not known
Thu Oct  4 21:31:09 [conn249] warning: could not get last error from a shard host2f.load.net:27017 :: caused by :: socket exception
Received signal 6
Backtrace: 0x528c54 0x7f652e6d9af0 0x7f652e6d9a75 0x7f652e6dd5c0 0x7f652e6d2941 0x730908 0x712fa5 0x700860 0x72ad97 0x73c8d1 0x5aad20 0x7f652fa5a9ca 0x7f652e78c70d 
mongos(_ZN5mongo17printStackAndExitEi+0x64)[0x528c54]
/lib/libc.so.6(+0x33af0)[0x7f652e6d9af0]
/lib/libc.so.6(gsignal+0x35)[0x7f652e6d9a75]
/lib/libc.so.6(abort+0x180)[0x7f652e6dd5c0]
/lib/libc.so.6(__assert_fail+0xf1)[0x7f652e6d2941]
mongos(_ZN5mongo10ClientInfo12getLastErrorERKNS_7BSONObjERNS_14BSONObjBuilderEb+0x3898)[0x730908]
mongos(_ZN5mongo7Command20runAgainstRegisteredEPKcRNS_7BSONObjERNS_14BSONObjBuilderEi+0x805)[0x712fa5]
mongos(_ZN5mongo14SingleStrategy7queryOpERNS_7RequestE+0x4d0)[0x700860]
mongos(_ZN5mongo7Request7processEi+0x157)[0x72ad97]
mongos(_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x71)[0x73c8d1]
mongos(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x260)[0x5aad20]
/lib/libpthread.so.0(+0x69ca)[0x7f652fa5a9ca]
/lib/libc.so.6(clone+0x6d)[0x7f652e78c70d]
===
Logstream::get called in uninitialized state
Thu Oct  4 21:31:09 [conn188] Assertion: 13548:BufBuilder grow() > 64MB
0x512279 0x4b4bfc 0x4b64bb 0x55942a 0x559ec9 0x55a247 0x55a41e 0x75cc98 0x75e408 0x5878dc 0x6f2d38 0x6f7b18 0x6f86c3 0x72ad0d 0x73c8d1 0x5aad20 0x7f652fa5a9ca 0x7f652e78c70d 
 CursorCache at shutdown -  sharded: 2 passthrough: 967
Received signal 11
Backtrace: 0x528c54 0x7f652e6d9af0 0x4fa867 0x4eb598 0x4eeafc 0x4ef270 0x4f3be9 0x4f4638 0x6df524 0x6f8d29 0x72ad97 0x73c8d1 0x5aad20 0x7f652fa5a9ca 0x7f652e78c70d 
mongos(_ZN5mongo17printStackAndExitEi+0x64)[0x528c54]
/lib/libc.so.6(+0x33af0)[0x7f652e6d9af0]
Received signal 11
Backtrace: 0x528c54 0x7f652e6d9af0 0x54b4e6 0x5610c6 0x562e90 0x54048a 0x55102a 0x54abc0 0x54b295 0x550b4e 0x550d01 0x5495b0 0x54a97a 0x6e6877 0x5380cd 0x539984 0x53a45b 0x560f59 0x563474 0x5acb89 
mongos(_ZN5mongo17printStackAndExitEi+0x64)[0x528c54]
/lib/libc.so.6(+0x33af0)[0x7f652e6d9af0]
mongos(_ZNK5mongo10FieldRange10nontrivialEv+0x87)[0x4fa867]
mongos(_ZN5mongo10FieldRangeaNERKS0_+0x1b8)[0x4eb598]
mongos(_ZN5mongo13FieldRangeSet16processOpElementEPKcRKNS_11BSONElementEbb+0x27c)[0x4eeafc]
mongos(_ZN5mongo13FieldRangeSet17processQueryFieldERKNS_11BSONElementEb+0x220)[0x4ef270]
mongos(_ZN5mongo13FieldRangeSetC1EPKcRKNS_7BSONObjEbb+0x179)[0x4f3be9]
mongos(_ZN5mongo16OrRangeGeneratorC1EPKcRKNS_7BSONObjEb+0x48)[0x4f4638]
mongos(_ZNK5mongo12ChunkManager17getShardsForQueryERSt3setINS_5ShardESt4lessIS2_ESaIS2_EERKNS_7BSONObjE+0x34)[0x6df524]
mongos(_ZN5mongo13ShardStrategy7queryOpERNS_7RequestE+0x559)[0x6f8d29]
mongos(_ZN5mongo7Request7processEi+0x157)[0x72ad97]
mongos(_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x71)[0x73c8d1]
mongos(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x260)[0x5aad20]
/lib/libpthread.so.0(+0x69ca)[0x7f652fa5a9ca]
/lib/libc.so.6(clone+0x6d)[0x7f652e78c70d]

All mongods are up, with nothing strange in their logs.



 Comments   
Comment by Kristina Chodorow (Inactive) [ 12/Nov/12 ]

Good point, talked to the docs guy about it and filed DOCS-470.

Comment by Anton V. Volokhov [ 12/Nov/12 ]

I've found a bug in docs then:
http://docs.mongodb.org/manual/administration/ulimit/

Comment by Kristina Chodorow (Inactive) [ 12/Nov/12 ]

Yes, I'm fine, thanks!

MongoDB only allows 20,000 open connections, but glad to hear everything is working now.
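
For reference, a quick sketch (assuming pymongo and a placeholder mongos address) of how to watch how close a mongos is to that connection ceiling using serverStatus:

from pymongo import MongoClient

# Placeholder address; point this at the mongos in question.
client = MongoClient("mongodb://localhost:27017")
conns = client.admin.command("serverStatus")["connections"]
# "current" = connections currently open, "available" = how many more this
# process can accept before it hits its limit.
print("current:", conns["current"], "available:", conns["available"])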

Comment by Anton V. Volokhov [ 12/Nov/12 ]

Thank you for the response. I hope the hurricane didn't cause you significant damage.

I raised the ulimit from 10,000 to 60,000; everything looks OK now.
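
As an illustration only, a minimal sketch of what raising the open-file limit means at the process level (in practice it is set in /etc/security/limits.conf or in the init script that launches mongos, not from Python):

import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
# 60000 matches the value mentioned above; an unprivileged process can only
# raise its soft limit up to the hard limit.
if hard == resource.RLIM_INFINITY:
    target = 60000
else:
    target = min(60000, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
print("RLIMIT_NOFILE soft limit raised from", soft, "to", target)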

Comment by Kristina Chodorow (Inactive) [ 07/Nov/12 ]

Sorry about the delay, last week was kind of a loss from the hurricane.

I'm guessing that all existing connections were fine, but new connections could not be created due to bumping into file descriptor limits. DNS lookup uses a UDP socket, so running out of file descriptors prevented lookup for all new connections. The solution to that is "don't run out of file descriptors," which is easier said than done. You might want to look into raising your ulimits and changing your app server configuration to spawn fewer children. In the future, /var/log/syslog or /var/log/messages may have messages about DNS.
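
A rough, Linux-specific sketch of the check implied here (the hostname is copied from the logs above): compare open descriptors against RLIMIT_NOFILE, since getaddrinfo() itself needs a free descriptor for its UDP socket and starts failing once the process has none left:

import os
import resource
import socket

soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
open_fds = len(os.listdir("/proc/self/fd"))  # Linux-specific
print("open fds:", open_fds, "of soft limit", soft)

try:
    # Config-server hostname from the logs above; if the process is out of
    # descriptors, the lookup fails, which is the failure mode described here.
    socket.getaddrinfo("conf01f.load.net", 27018)
    print("resolved ok")
except socket.gaierror as exc:
    print("getaddrinfo failed:", exc)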

Comment by Anton V. Volokhov [ 25/Oct/12 ]

The first occurrence was right after "Too many open files". The logs before the file descriptors ran out look fine; the only occurrence of "Name or service not known" was a day before, and there are no suspicious logs before that, just stale config, chunk splits, and connection starts/ends:

Mon Oct 22 10:47:44 [conn89] warning: splitChunk failed - cmd: { splitChunk: "movie.entity_attr", keyPattern: { _id: 1.0 }, min: { _id: "http://vidstream.ru/movie/524d5335928992.html&domain-id=286172" }, max: { _id: "http://vidvidoo.com/movie/1433822-the-apparition-2012-online-streaming-in-hd-aug-24-2012/&domain-id=295708" }, from: "mongodb-sh1/host01d.load.net:27017,host01f.load.net:27017,host01g.load.net:27017", splitKeys: [ { _id: "http://vidstream.ru/movie/a0ce51289072c.html&domain-id=286172" } ], shardId: "movie.entity_attr-_id_"http://vidstream.ru/movie/524d5335928992.html&domain-id=286172"", configdb: "conf01g.load.net:27018,conf01d.load.net:27018,conf01f.load.net:27018" } result: { who: { _id: "movie.entity_attr", process: "host01d:27017:1350602021:1390778741", state: 2, ts: ObjectId('5084ec0ce23b10d5d497b8a4'), when: new Date(1350888460538), who: "host01d:27017:1350602021:1390778741:conn21057:1425296225", why: "split-{ _id: "http://vidstream.ru/movie/524d5335928992.html&domain-id=286172" }" }, errmsg: "the collection's metadata lock is taken", ok: 0.0 }
Mon Oct 22 10:47:44 [conn89] ChunkManager: time to load chunks for movie.entity_attr: 2ms sequenceNumber: 85 version: 464|7
Mon Oct 22 10:47:44 [conn412] autosplitted movie.entity_attr shard: ns:movie.entity_attr at: mongodb-sh1:mongodb-sh1/host01d.load.net:27017,host01f.load.net:27017,host01g.load.net:27017 lastmod: 395|66 min: { _id: "http://vidstream.ru/movie/524d5335928992.html&domain-id=286172" } max: { _id: "http://vidvidoo.com/movie/1433822-the-apparition-2012-online-streaming-in-hd-aug-24-2012/&domain-id=295708" } on: { _id: "http://vidstream.ru/movie/a0cf23433ae62.html&domain-id=286172" } (splitThreshold 67108864)
Mon Oct 22 10:47:49 [conn412] SyncClusterConnection connecting to [conf01g.load.net:27018]
Mon Oct 22 10:47:49 [conn412] SyncClusterConnection connecting to [conf01d.load.net:27018]
Mon Oct 22 10:47:49 [conn412] SyncClusterConnection connecting to [conf01f.load.net:27018]
Mon Oct 22 10:47:49 [conn412] getaddrinfo("conf01f.load.net") failed: Name or service not known
Mon Oct 22 10:47:49 [conn412] SyncClusterConnection connect fail to: conf01f.load.net:27018 errmsg:
Mon Oct 22 10:47:49 [conn412] trying reconnect to conf01f.load.net:27018
>>>Mon Oct 22 10:47:49 [conn412] getaddrinfo("conf01f.load.net") failed: Name or service not known
Mon Oct 22 10:47:49 [conn412] reconnect conf01f.load.net:27018 failed
Mon Oct 22 10:47:49 [conn412] DBException in process: socket exception
Mon Oct 22 10:47:49 [conn412] end connection 111.222.111.140:41915
Mon Oct 22 10:47:49 [conn115] end connection 111.222.111.140:40533
Mon Oct 22 10:47:49 [conn66] end connection 111.222.111.140:54677
Mon Oct 22 10:47:49 [conn112] end connection 111.222.111.140:58591
Mon Oct 22 10:47:49 [conn144] end connection 111.222.111.140:54549
Mon Oct 22 10:47:49 [conn130] end connection 111.222.111.140:40788
Mon Oct 22 10:47:49 [conn425] end connection 111.222.111.140:43930
Mon Oct 22 10:47:49 [conn405] end connection 111.222.111.140:40721
Mon Oct 22 10:47:49 [conn72] end connection 111.222.111.140:55364
.
. nothing suspicious
.
Tue Oct 23 03:11:39 [mongosMain] ERROR: Out of file descriptors. Waiting one second before trying to accept more connections.
.
. Out of file descriptors
.
Tue Oct 23 03:14:43 [mongosMain] ERROR: Out of file descriptors. Waiting one second before trying to accept more connections.
Tue Oct 23 03:14:43 [conn1385] end connection 111.222.111.191:59069
Tue Oct 23 03:14:44 [mongosMain] connection accepted from 111.222.111.191:59070 #1386
Tue Oct 23 03:14:44 [mongosMain] Listener: accept() returns -1 errno:24 Too many open files
Tue Oct 23 03:14:44 [mongosMain] ERROR: Out of file descriptors. Waiting one second before trying to accept more connections.
Tue Oct 23 03:14:44 [conn1386] end connection 111.222.111.191:59070
Tue Oct 23 03:14:45 [mongosMain] connection accepted from 111.222.111.65:53230 #1387
Tue Oct 23 03:16:25 [LockPinger] cluster conf01g.load.net:27018,conf01d.load.net:27018,conf01f.load.net:27018 pinged successfully at Tue Oct 23 03:16:25 2012 by distributed lock pinger 'conf01g.load.net:27018,conf01d.load.net:27018,conf01f.load.net:27018/host01g:27000:1350671783:1804289383', sleeping for 30000ms
>>>Tue Oct 23 03:18:20 [Balancer] getaddrinfo("host01d.load.net") failed: Name or service not known
Tue Oct 23 03:18:20 [Balancer] scoped connection to conf01g.load.net:27018,conf01d.load.net:27018,conf01f.load.net:27018 not being returned to the pool
Tue Oct 23 03:18:20 [Balancer] caught exception while doing balance: socket exception
Tue Oct 23 03:21:28 [conn1182] ns: movie.entity_attr could not initialize cursor across all shards because : stale config detected for ns: movie.entity_attr ParallelCursor::_init @ mongodb-sh1/host01d.load.net:27017,host01f.load.net:27017,host01g.load.net:27017 attempt: 0
Tue Oct 23 03:21:34 [LockPinger] cluster conf01g.load.net:27018,conf01d.load.net:27018,conf01f.load.net:27018 pinged successfully at Tue Oct 23 03:21:33 2012 by distributed lock pinger 'conf01g.load.net:27018,conf01d.load.net:27018,conf01f.load.net:27018/host01g:27000:1350671783:1804289383', sleeping for 30000ms
Tue Oct 23 03:26:38 [LockPinger] cluster conf01g.load.net:27018,conf01d.load.net:27018,conf01f.load.net:27018 pinged successfully at Tue Oct 23 03:26:38 2012 by distributed lock pinger 'conf01g.load.net:27018,conf01d.load.net:27018,conf01f.load.net:27018/host01g:27000:1350671783:1804289383', sleeping for 30000ms
Tue Oct 23 03:29:02 [conn153] ChunkManager: time to load chunks for movie.entity_attr: 3ms sequenceNumber: 101 version: 464|19
Tue Oct 23 03:29:02 [conn153] autosplitted movie.entity_attr shard: ns:movie.entity_attr at: mongodb-sh1:mongodb-sh1/host01d.load.net:27017,host01f.load.net:27017,host01g.load.net:27017 lastmod: 395|28 min: { _id: "http://tvfru.ru/load/fantasticheskie_filmi_smotret_online_skachat_torrent/smertelnaja_bitva_nasledie_smotret_online_v_kachestve_mortal_kombat_legacy/6..." } max: { _id: "http://twomovies.name/watch_movie/Four_Sheets_to_the_Wind&domain-id=1973122" } on: { _id: "http://tvlistings.zap2it.com/tv/me-quiero-casar/MV001245240000&domain-id=3945654" } (splitThreshold 67108864)
.
. stale config
.
Tue Oct 23 03:31:07 [conn406] ns: movie.entity_attr could not initialize cursor across all shards because : stale config detected for ns: movie.entity_attr ParallelCursor::_init @ mongodb-sh1/host01d.load.net:27017,host01f.load.net:27017,host01g.load.net:27017 attempt: 0
Tue Oct 23 03:31:36 [conn124] warning: splitChunk failed - cmd: { splitChunk: "rating.entity_attr", keyPattern: { _id: 1.0 }, min: { _id: "http://www.goodreads.ru/books/2414174/default.aspx?partner=qibet&domain-id=1969694" }, max: { _id: "http://www.grando.hu/erositos-ontapados-belso-antenna-szelvedore-ksa-02-1524125370&domain-id=1973550" }, from: "mongodb-sh1/host01d.load.net:27017,host01f.load.net:27017,host01g.load.net:27017", splitKeys: [ { _id: "http://www.goodreads.ru/podarki/items/1626309/default.aspx&domain-id=1969694" } ], shardId: "rating.entity_attr-_id_"http://www.goodreads.ru/books/2414174/default.aspx?partner=qibet&domain-id=1969694"", configdb: "conf01g.load.net:27018,conf01d.load.net:27018,conf01f.load.net:27018" } result: { who: { _id: "rating.entity_attr", process: "host01d:27017:1350602021:1390778741", state: 2, ts: ObjectId('508546a0e23b10d5d497b8b2'), when: new Date(1350911648534), who: "host01d:27017:1350602021:1390778741:conn8736:375682389", why: "migrate-{ _id: "http://catalogfirm.com.ua/1437/ZAPORIZKIY-PROFETSIYNIY-LITSEY-MODI-I-STILYU.html&domain-id=2005685" }" }, errmsg: "the collection's metadata lock is taken", ok: 0.0 }
Tue Oct 23 03:31:46 [LockPinger] cluster conf01g.load.net:27018,conf01d.load.net:27018,conf01f.load.net:27018 pinged successfully at Tue Oct 23 03:31:46 2012 by distributed lock pinger 'conf01g.load.net:27018,conf01d.load.net:27018,conf01f.load.net:27018/host01g:27000:1350671783:1804289383', sleeping for 30000ms
Tue Oct 23 03:32:21 [conn434] ns: movie.entity_attr could not initialize cursor across all shards because : stale config detected for ns: movie.entity_attr ParallelCursor::_init @ mongodb-sh1/host01d.load.net:27017,host01f.load.net:27017,host01g.load.net:27017 attempt: 0
.
. stale config, nothing about getaddrinfo
.
Tue Oct 23 03:40:01 [conn875] SyncClusterConnection connecting to [conf01f.load.net:27018]
.
. stale config and successful ping, nothing about getaddrinfo
.
Tue Oct 23 03:48:04 [conn423] ns: rating.entity_attr could not initialize cursor across all shards because : stale config detected for ns: rating.entity_attr ParallelCursor::_init @ mongodb-sh1/host01d.load.net:27017,host01f.load.net:27017,host01g.load.net:27017 attempt: 0
Tue Oct 23 03:48:25 [conn1126] ns: movie.entity_attr could not initialize cursor across all shards because : stale config detected for ns: movie.entity_attr ParallelCursor::_init @ mongodb-sh1/host01d.load.net:27017,host01f.load.net:27017,host01g.load.net:27017 attempt: 0
Tue Oct 23 03:52:11 [LockPinger] cluster conf01g.load.net:27018,conf01d.load.net:27018,conf01f.load.net:27018 pinged successfully at Tue Oct 23 03:52:11 2012 by distributed lock pinger 'conf01g.load.net:27018,conf01d.load.net:27018,conf01f.load.net:27018/host01g:27000:1350671783:1804289383', sleeping for 30000ms
Tue Oct 23 03:54:13 [mongosMain] connection accepted from 111.222.111.48:42417 #1388
Tue Oct 23 03:54:13 [conn1388] authenticate: { authenticate: 1, user: "mobile-app", nonce: "bf2f4b223b0f8aa1", key: "b2b50a4bf78c6a12d8a888dbb5b14981" }
>>>Tue Oct 23 03:54:13 [conn1388] getaddrinfo("host01d.load.net") failed: Name or service not known
Tue Oct 23 03:54:13 [conn1388] DBException in process: socket exception
Tue Oct 23 03:54:13 [conn1388] end connection 111.222.111.48:42417
Tue Oct 23 03:54:13 [conn1237] end connection 111.222.111.48:50804
Tue Oct 23 03:54:13 [conn1238] end connection 111.222.111.48:50631
Tue Oct 23 03:54:13 [conn1304] end connection 111.222.111.48:52611
Tue Oct 23 03:54:13 [conn1252] end connection 111.222.111.48:51285
Tue Oct 23 03:54:13 [conn1235] end connection 111.222.111.48:45518
.
. stale config and successful ping, not a single word about getaddrinfo. And then it just started
.
Tue Oct 23 07:32:06 [LockPinger] cluster conf01g.load.net:27018,conf01d.load.net:27018,conf01f.load.net:27018 pinged successfully at Tue Oct 23 07:32:06 2012 by distributed lock pinger 'conf01g.load.net:27018,conf01d.load.net:27018,conf01f.load.net:27018/host01g:27000:1350671783:1804289383', sleeping for 30000ms
Tue Oct 23 07:33:04 [conn828] ns: musicgroup.entity_attr could not initialize cursor across all shards because : stale config detected for ns: musicgroup.entity_attr ParallelCursor::_init @ mongodb-sh1/host01d.load.net:27017,host01f.load.net:27017,host01g.load.net:27017 attempt: 0
Tue Oct 23 07:33:10 [conn127] ns: musicgroup.entity_attr could not initialize cursor across all shards because : stale config detected for ns: musicgroup.entity_attr ParallelCursor::_init @ mongodb-sh1/host01d.load.net:27017,host01f.load.net:27017,host01g.load.net:27017 attempt: 0
Tue Oct 23 07:33:36 [Balancer] getaddrinfo("host01d.load.net") failed: Name or service not known
Tue Oct 23 07:33:36 [Balancer] scoped connection to conf01g.load.net:27018,conf01d.load.net:27018,conf01f.load.net:27018 not being returned to the pool
Tue Oct 23 07:33:36 [Balancer] caught exception while doing balance: socket exception
Tue Oct 23 07:34:06 [Balancer] SyncClusterConnection connecting to [conf01g.load.net:27018]
Tue Oct 23 07:34:06 [Balancer] SyncClusterConnection connecting to [conf01d.load.net:27018]
Tue Oct 23 07:34:06 [Balancer] SyncClusterConnection connecting to [conf01f.load.net:27018]
Tue Oct 23 07:34:06 [Balancer] getaddrinfo("conf01f.load.net") failed: Name or service not known
Tue Oct 23 07:34:06 [Balancer] SyncClusterConnection connect fail to: conf01f.load.net:27018 errmsg:
Tue Oct 23 07:34:06 [Balancer] trying reconnect to conf01f.load.net:27018
Tue Oct 23 07:34:06 [Balancer] getaddrinfo("conf01f.load.net") failed: Name or service not known
Tue Oct 23 07:34:06 [Balancer] reconnect conf01f.load.net:27018 failed
Tue Oct 23 07:34:06 [Balancer] scoped connection to conf01g.load.net:27018,conf01d.load.net:27018,conf01f.load.net:27018 not being returned to the pool
Tue Oct 23 07:34:06 [Balancer] caught exception while doing balance: socket exception
Tue Oct 23 07:34:36 [Balancer] SyncClusterConnection connecting to [conf01g.load.net:27018]
Tue Oct 23 07:34:36 [Balancer] SyncClusterConnection connecting to [conf01d.load.net:27018]
Tue Oct 23 07:34:36 [Balancer] SyncClusterConnection connecting to [conf01f.load.net:27018]
Tue Oct 23 07:34:36 [Balancer] SyncClusterConnection connecting to [conf01g.load.net:27018]
Tue Oct 23 07:34:36 [Balancer] SyncClusterConnection connecting to [conf01d.load.net:27018]
Tue Oct 23 07:34:36 [Balancer] SyncClusterConnection connecting to [conf01f.load.net:27018]
Tue Oct 23 07:34:36 [Balancer] getaddrinfo("conf01f.load.net") failed: Name or service not known
Tue Oct 23 07:34:36 [Balancer] SyncClusterConnection connect fail to: conf01f.load.net:27018 errmsg:
Tue Oct 23 07:34:36 [Balancer] trying reconnect to conf01f.load.net:27018
Tue Oct 23 07:34:36 [Balancer] getaddrinfo("conf01f.load.net") failed: Name or service not known
Tue Oct 23 07:34:36 [Balancer] reconnect conf01f.load.net:27018 failed
Tue Oct 23 07:34:36 [Balancer] scoped connection to conf01g.load.net:27018,conf01d.load.net:27018,conf01f.load.net:27018 not being returned to the pool
Tue Oct 23 07:34:36 [Balancer] caught exception while doing balance: socket exception
Tue Oct 23 07:35:06 [Balancer] SyncClusterConnection connecting to [conf01g.load.net:27018]
Tue Oct 23 07:35:06 [Balancer] SyncClusterConnection connecting to [conf01d.load.net:27018]
Tue Oct 23 07:35:06 [Balancer] SyncClusterConnection connecting to [conf01f.load.net:27018]
Tue Oct 23 07:35:06 [Balancer] SyncClusterConnection connecting to [conf01g.load.net:27018]
Tue Oct 23 07:35:06 [Balancer] SyncClusterConnection connecting to [conf01d.load.net:27018]
Tue Oct 23 07:35:06 [Balancer] SyncClusterConnection connecting to [conf01f.load.net:27018]
Tue Oct 23 07:35:06 [Balancer] getaddrinfo("conf01f.load.net") failed: Name or service not known
Tue Oct 23 07:35:06 [Balancer] SyncClusterConnection connect fail to: conf01f.load.net:27018 errmsg:
Tue Oct 23 07:35:06 [Balancer] trying reconnect to conf01f.load.net:27018
Tue Oct 23 07:35:06 [Balancer] getaddrinfo("conf01f.load.net") failed: Name or service not known
Tue Oct 23 07:35:06 [Balancer] reconnect conf01f.load.net:27018 failed
Tue Oct 23 07:35:06 [Balancer] scoped connection to conf01g.load.net:27018,conf01d.load.net:27018,conf01f.load.net:27018 not being returned to the pool
Tue Oct 23 07:35:06 [Balancer] caught exception while doing balance: socket exception

and from this point on, the getaddrinfo failures repeated every 30 seconds until the mongos crash.

Comment by Kristina Chodorow (Inactive) [ 25/Oct/12 ]

It might have to do with hitting the ulimit, but it's just weird that it couldn't even resolve the hostnames. Can you find the point in the log between 03:18:20 and 07:47:37 where it stopped being able to resolve them (you can probably look for where "Name or service not known" starts occurring)? Maybe there are more clues around the point where it starts happening.
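
A throwaway helper along those lines (the log path and the time window are assumptions) that prints the first resolution failure inside the window:

# Adjust the path to wherever mongos logs on this host.
LOG_PATH = "/var/log/mongodb/mongos.log"
START = "Tue Oct 23 03:18:20"
END = "Tue Oct 23 07:47:37"

in_window = False
with open(LOG_PATH) as log:
    for line in log:
        if START in line:
            in_window = True
        if in_window and "Name or service not known" in line:
            print("first failure in window:", line.rstrip())
            break
        if END in line:
            break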

Comment by Anton V. Volokhov [ 24/Oct/12 ]

No, I have no idea. I'm quite sure that all config servers are resolvable and reachable: I can telnet to them from this host and I can connect to them with the mongo shell. And if I restart mongos after a crash, it works fine for a couple of days, so it seems the network itself is not the problem.
Is there a chance that something went wrong when I hit the ulimit?
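
A script version of that telnet check, using the config-server hostnames from the logs (run it on the same host as the crashing mongos):

import socket

CONFIG_SERVERS = [
    ("conf01g.load.net", 27018),
    ("conf01d.load.net", 27018),
    ("conf01f.load.net", 27018),
]

for host, port in CONFIG_SERVERS:
    try:
        # create_connection() both resolves and connects, so it exercises the
        # same getaddrinfo path that fails in the mongos log.
        with socket.create_connection((host, port), timeout=5):
            print(host, port, "reachable")
    except OSError as exc:
        print(host, port, "FAILED:", exc)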

Comment by Kristina Chodorow (Inactive) [ 24/Oct/12 ]

Actually, with more context this looks like an on-purpose abort. If a shard's at a certain point in a migrate and cannot reach the config servers, it'll abort because it cannot do anything (including cancel the migrate) without them. Do you know why it couldn't reach (or even resolve the hostnames of) any of the config servers?

Tue Oct 23 07:52:30 [conn427] SyncClusterConnection connecting to [conf01d.load.net:27018]
Tue Oct 23 07:52:30 [conn427] getaddrinfo("conf01d.load.net") failed: Name or service not known
Tue Oct 23 07:52:30 [conn427] SyncClusterConnection connect fail to: conf01d.load.net:27018 errmsg: 
Tue Oct 23 07:52:30 [conn427] SyncClusterConnection connecting to [conf01f.load.net:27018]
Tue Oct 23 07:52:30 [conn427] getaddrinfo("conf01f.load.net") failed: Name or service not known
Tue Oct 23 07:52:30 [conn427] SyncClusterConnection connect fail to: conf01f.load.net:27018 errmsg: 
etc.
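
To make the "on-purpose abort" described above concrete, a purely illustrative sketch (not the actual mongos code): a process that cannot reach the config servers at a point where a migration can neither be finished nor cancelled deliberately aborts, which is what shows up as "Received signal 6" in the logs.

import os
import signal

def finish_migration(config_servers_reachable: bool) -> None:
    # Illustrative only: if the config servers are unreachable at this point,
    # abort on purpose (SIGABRT, i.e. signal 6) rather than continue with
    # possibly inconsistent chunk metadata.
    if not config_servers_reachable:
        os.kill(os.getpid(), signal.SIGABRT)
    # ...otherwise commit the chunk metadata change on the config servers...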

Comment by Anton V. Volokhov [ 24/Oct/12 ]

It seems I've found the problem.
The exceptions started right after "Too many open files". That was quite a while before the moment mongos crashed, but still, could it be the reason?
The skipped chunks of the log look like:
Tue Oct 23 *::* [Balancer] trying reconnect to conf01*:27018
Tue Oct 23 *::* [Balancer] getaddrinfo("conf01*") failed: Name or service not known

Tue Oct 23 03:14:42 [conn1384] end connection 111.222.111.191:59065
Tue Oct 23 03:14:43 [mongosMain] connection accepted from 178.154.238.191:59069 #1385
Tue Oct 23 03:14:43 [mongosMain] Listener: accept() returns -1 errno:24 Too many open files
Tue Oct 23 03:14:43 [mongosMain] ERROR: Out of file descriptors. Waiting one second before trying to accept more connections.
Tue Oct 23 03:14:43 [conn1385] end connection 111.222.111.191:59069
Tue Oct 23 03:14:44 [mongosMain] connection accepted from 178.154.238.191:59070 #1386
Tue Oct 23 03:14:44 [mongosMain] Listener: accept() returns -1 errno:24 Too many open files
Tue Oct 23 03:14:44 [mongosMain] ERROR: Out of file descriptors. Waiting one second before trying to accept more connections.
Tue Oct 23 03:14:44 [conn1386] end connection 111.222.111.191:59070
Tue Oct 23 03:14:45 [mongosMain] connection accepted from 111.222.111.65:53230 #1387
Tue Oct 23 03:16:25 [LockPinger] cluster conf01g:27018,conf01d:27018,conf01f:27018 pinged successfully at Tue Oct 23 03:16:25 2012 by distributed lock pinger 'conf01g:27018,conf01d:27018,conf01f:27018/host01g:27000:1350671783:1804289383', sleeping for 30000ms
Tue Oct 23 03:18:20 [Balancer] getaddrinfo("host01d") failed: Name or service not known
Tue Oct 23 03:18:20 [Balancer] scoped connection to conf01g:27018,conf01d:27018,conf01f:27018 not being returned to the pool
Tue Oct 23 03:18:20 [Balancer] caught exception while doing balance: socket exception
.
.
.
Tue Oct 23 07:47:37 [Balancer] SyncClusterConnection connecting to [conf01f:27018]
Tue Oct 23 07:47:37 [Balancer] getaddrinfo("conf01f") failed: Name or service not known
Tue Oct 23 07:47:37 [Balancer] SyncClusterConnection connect fail to: conf01f:27018 errmsg:
Tue Oct 23 07:47:37 [Balancer] trying reconnect to conf01f:27018
Tue Oct 23 07:47:37 [Balancer] getaddrinfo("conf01f") failed: Name or service not known
Tue Oct 23 07:47:37 [Balancer] reconnect conf01f:27018 failed
Tue Oct 23 07:47:37 [Balancer] scoped connection to conf01g:27018,conf01d:27018,conf01f:27018 not being returned to the pool
Tue Oct 23 07:47:37 [Balancer] caught exception while doing balance: socket exception
Tue Oct 23 07:47:37 [LockPinger] cluster conf01g:27018,conf01d:27018,conf01f:27018 pinged successfully at Tue Oct 23 07:47:36 2012 by distributed lock pinger 'conf01g:27018,conf01d:27018,conf01f:27018/host01g:27000:1350671783:1804289383', sleeping for 30000ms
Tue Oct 23 07:48:07 [Balancer] SyncClusterConnection connecting to [conf01g:27018]
Tue Oct 23 07:48:07 [Balancer] SyncClusterConnection connecting to [conf01d:27018]
Tue Oct 23 07:48:07 [Balancer] SyncClusterConnection connecting to [conf01f:27018]
Tue Oct 23 07:48:07 [Balancer] SyncClusterConnection connecting to [conf01g:27018]
Tue Oct 23 07:48:07 [Balancer] SyncClusterConnection connecting to [conf01d:27018]
Tue Oct 23 07:48:07 [Balancer] SyncClusterConnection connecting to [conf01f:27018]
Tue Oct 23 07:48:07 [Balancer] getaddrinfo("conf01f") failed: Name or service not known
Tue Oct 23 07:48:07 [Balancer] SyncClusterConnection connect fail to: conf01f:27018 errmsg:
Tue Oct 23 07:48:07 [Balancer] trying reconnect to conf01f:27018
Tue Oct 23 07:48:07 [Balancer] getaddrinfo("conf01f") failed: Name or service not known
Tue Oct 23 07:48:07 [Balancer] reconnect conf01f:27018 failed
Tue Oct 23 07:48:07 [Balancer] scoped connection to conf01g:27018,conf01d:27018,conf01f:27018 not being returned to the pool
Tue Oct 23 07:48:07 [Balancer] caught exception while doing balance: socket exception
Tue Oct 23 07:48:37 [Balancer] SyncClusterConnection connecting to [conf01g:27018]
.
.
.
Tue Oct 23 07:49:07 [Balancer] getaddrinfo("conf01f.load.net") failed: Name or service not known
Tue Oct 23 07:49:07 [Balancer] SyncClusterConnection connect fail to: conf01f.load.net:27018 errmsg:
Tue Oct 23 07:49:07 [Balancer] trying reconnect to conf01f.load.net:27018
Tue Oct 23 07:49:07 [Balancer] getaddrinfo("conf01f.load.net") failed: Name or service not known
Tue Oct 23 07:49:07 [Balancer] reconnect conf01f.load.net:27018 failed
Tue Oct 23 07:49:07 [Balancer] scoped connection to conf01g.load.net:27018,conf01d.load.net:27018,conf01f.load.net:27018 not being returned to the pool
Tue Oct 23 07:49:07 [Balancer] caught exception while doing balance: socket exception
Tue Oct 23 07:49:37 [Balancer] SyncClusterConnection connecting to [conf01g.load.net:27018]
Tue Oct 23 07:49:37 [Balancer] SyncClusterConnection connecting to [conf01d.load.net:27018]
Tue Oct 23 07:49:37 [Balancer] SyncClusterConnection connecting to [conf01f.load.net:27018]
Tue Oct 23 07:49:37 [Balancer] SyncClusterConnection connecting to [conf01g.load.net:27018]
Tue Oct 23 07:49:37 [Balancer] SyncClusterConnection connecting to [conf01d.load.net:27018]
Tue Oct 23 07:49:37 [Balancer] SyncClusterConnection connecting to [conf01f.load.net:27018]
Tue Oct 23 07:49:37 [Balancer] getaddrinfo("conf01f.load.net") failed: Name or service not known
Tue Oct 23 07:49:37 [Balancer] SyncClusterConnection connect fail to: conf01f.load.net:27018 errmsg:
Tue Oct 23 07:49:37 [Balancer] trying reconnect to conf01f.load.net:27018
Tue Oct 23 07:49:37 [Balancer] getaddrinfo("conf01f.load.net") failed: Name or service not known
Tue Oct 23 07:49:37 [Balancer] reconnect conf01f.load.net:27018 failed
Tue Oct 23 07:49:37 [Balancer] scoped connection to conf01g.load.net:27018,conf01d.load.net:27018,conf01f.load.net:27018 not being returned to the pool
Tue Oct 23 07:49:37 [Balancer] caught exception while doing balance: socket exception
Tue Oct 23 07:50:07 [Balancer] SyncClusterConnection connecting to [conf01g.load.net:27018]
Tue Oct 23 07:50:07 [Balancer] SyncClusterConnection connecting to [conf01d.load.net:27018]
Tue Oct 23 07:50:07 [Balancer] SyncClusterConnection connecting to [conf01f.load.net:27018]
Tue Oct 23 07:50:07 [Balancer] SyncClusterConnection connecting to [conf01g.load.net:27018]
Tue Oct 23 07:50:07 [Balancer] SyncClusterConnection connecting to [conf01d.load.net:27018]
Tue Oct 23 07:50:07 [Balancer] SyncClusterConnection connecting to [conf01f.load.net:27018]
Tue Oct 23 07:50:07 [Balancer] getaddrinfo("conf01f.load.net") failed: Name or service not known
Tue Oct 23 07:50:07 [Balancer] SyncClusterConnection connect fail to: conf01f.load.net:27018 errmsg:
Tue Oct 23 07:50:07 [Balancer] trying reconnect to conf01f.load.net:27018
Tue Oct 23 07:50:07 [Balancer] getaddrinfo("conf01f.load.net") failed: Name or service not known
Tue Oct 23 07:50:07 [Balancer] reconnect conf01f.load.net:27018 failed
Tue Oct 23 07:50:07 [Balancer] scoped connection to conf01g.load.net:27018,conf01d.load.net:27018,conf01f.load.net:27018 not being returned to the pool
Tue Oct 23 07:50:07 [Balancer] caught exception while doing balance: socket exception
Tue Oct 23 07:50:37 [Balancer] SyncClusterConnection connecting to [conf01g.load.net:27018]
Tue Oct 23 07:50:37 [Balancer] SyncClusterConnection connecting to [conf01d.load.net:27018]
Tue Oct 23 07:50:37 [Balancer] SyncClusterConnection connecting to [conf01f.load.net:27018]
Tue Oct 23 07:50:37 [Balancer] SyncClusterConnection connecting to [conf01g.load.net:27018]
Tue Oct 23 07:50:37 [Balancer] SyncClusterConnection connecting to [conf01d.load.net:27018]
Tue Oct 23 07:50:37 [Balancer] SyncClusterConnection connecting to [conf01f.load.net:27018]
Tue Oct 23 07:50:37 [Balancer] getaddrinfo("conf01f.load.net") failed: Name or service not known
Tue Oct 23 07:50:37 [Balancer] SyncClusterConnection connect fail to: conf01f.load.net:27018 errmsg:
Tue Oct 23 07:50:37 [Balancer] trying reconnect to conf01f.load.net:27018
Tue Oct 23 07:50:37 [Balancer] getaddrinfo("conf01f.load.net") failed: Name or service not known
Tue Oct 23 07:50:37 [Balancer] reconnect conf01f.load.net:27018 failed
Tue Oct 23 07:50:37 [Balancer] scoped connection to conf01g.load.net:27018,conf01d.load.net:27018,conf01f.load.net:27018 not being returned to the pool
Tue Oct 23 07:50:37 [Balancer] caught exception while doing balance: socket exception
Tue Oct 23 07:51:07 [Balancer] SyncClusterConnection connecting to [conf01g.load.net:27018]
Tue Oct 23 07:51:07 [Balancer] SyncClusterConnection connecting to [conf01d.load.net:27018]
Tue Oct 23 07:51:07 [Balancer] SyncClusterConnection connecting to [conf01f.load.net:27018]
Tue Oct 23 07:51:07 [Balancer] SyncClusterConnection connecting to [conf01g.load.net:27018]
Tue Oct 23 07:51:07 [Balancer] SyncClusterConnection connecting to [conf01d.load.net:27018]
Tue Oct 23 07:51:07 [Balancer] SyncClusterConnection connecting to [conf01f.load.net:27018]
Tue Oct 23 07:51:07 [Balancer] getaddrinfo("conf01f.load.net") failed: Name or service not known
Tue Oct 23 07:51:07 [Balancer] SyncClusterConnection connect fail to: conf01f.load.net:27018 errmsg:
Tue Oct 23 07:51:07 [Balancer] trying reconnect to conf01f.load.net:27018
Tue Oct 23 07:51:07 [Balancer] getaddrinfo("conf01f.load.net") failed: Name or service not known
Tue Oct 23 07:51:07 [Balancer] reconnect conf01f.load.net:27018 failed
Tue Oct 23 07:51:07 [Balancer] scoped connection to conf01g.load.net:27018,conf01d.load.net:27018,conf01f.load.net:27018 not being returned to the pool
Tue Oct 23 07:51:07 [Balancer] caught exception while doing balance: socket exception
Tue Oct 23 07:51:37 [Balancer] SyncClusterConnection connecting to [conf01g.load.net:27018]
Tue Oct 23 07:51:37 [Balancer] SyncClusterConnection connecting to [conf01d.load.net:27018]
Tue Oct 23 07:51:37 [Balancer] SyncClusterConnection connecting to [conf01f.load.net:27018]
Tue Oct 23 07:51:37 [Balancer] SyncClusterConnection connecting to [conf01g.load.net:27018]
Tue Oct 23 07:51:37 [Balancer] SyncClusterConnection connecting to [conf01d.load.net:27018]
Tue Oct 23 07:51:37 [Balancer] SyncClusterConnection connecting to [conf01f.load.net:27018]
Tue Oct 23 07:52:30 [conn427] warning: splitChunk failed - cmd: { splitChunk: "musicgroup.entity", keyPattern: { _id: 1.0 }, min: { _id: "http://www.myspace.com/rotten_mind?_escaped_fragment_=&domain-id=286078" }, max: { _id: "http://www.myspace.com/santeriaandthepornhorns?_escaped_fragment_=&domain-id=286078" }, from: "mongodb-sh1/host01d.load.net:27017,host01f.load.net:27017,host01g.load.net:27017", splitKeys: [ { _id: "http://www.myspace.com/s_e_b_y&domain-id=286078" } ], shardId: "vertis-musicgroup.entity_attr-_id_"http://www.myspace.com/rotten_mind?_escaped_fragment_=&domain-id=286078"", configdb: "conf01g.load.net:27018,conf01d.load.net:27018,conf01f.load.net:27018" } result: { who: { _id: "vertis-musicgroup.entity_attr", process: "host01d:27017:1350602021:1390778741", state: 2, ts: ObjectId('50860ba9e23b10d5d497b96d'), when: new Date(1350962089258), who: "host01d:27017:1350602021:1390778741:conn8736:375682389", why: "migrate-{ _id: "http://cf.myspace.com/ollerton&domain-id=286649" }" }, errmsg: "the collection's metadata lock is taken", ok: 0.0 }
Tue Oct 23 07:52:30 [conn427] SyncClusterConnection connecting to [conf01g.load.net:27018]
Tue Oct 23 07:52:30 [conn427] SyncClusterConnection connecting to [conf01d.load.net:27018]
Tue Oct 23 07:52:30 [conn427] getaddrinfo("conf01d.load.net") failed: Name or service not known
Tue Oct 23 07:52:30 [conn427] SyncClusterConnection connect fail to: conf01d.load.net:27018 errmsg:
Tue Oct 23 07:52:30 [conn427] SyncClusterConnection connecting to [conf01f.load.net:27018]
Tue Oct 23 07:52:30 [conn427] getaddrinfo("conf01f.load.net") failed: Name or service not known
Tue Oct 23 07:52:30 [conn427] SyncClusterConnection connect fail to: conf01f.load.net:27018 errmsg:
Tue Oct 23 07:52:30 [conn427] trying reconnect to conf01d.load.net:27018
Tue Oct 23 07:52:30 [conn427] getaddrinfo("conf01d.load.net") failed: Name or service not known
Tue Oct 23 07:52:30 [conn427] reconnect conf01d.load.net:27018 failed
Tue Oct 23 07:52:30 [conn427] warning: could not get last error from a shard conf01g.load.net:27018,conf01d.load.net:27018,conf01f.load.net:27018 :: caused by :: socket exception
Received signal 6
Backtrace: 0x528c54 0x7fe56caa0af0 0x7fe56caa0a75 0x7fe56caa45c0 0x7fe56ca99941 0x730908 0x712fa5 0x700860 0x72ad97 0x73c8d1 0x5aad20 0x7fe56de219ca 0x7fe56cb5370d
mongos(_ZN5mongo17printStackAndExitEi+0x64)[0x528c54]
/lib/libc.so.6(+0x33af0)[0x7fe56caa0af0]
/lib/libc.so.6(gsignal+0x35)[0x7fe56caa0a75]
/lib/libc.so.6(abort+0x180)[0x7fe56caa45c0]
/lib/libc.so.6(__assert_fail+0xf1)[0x7fe56ca99941]
mongos(_ZN5mongo10ClientInfo12getLastErrorERKNS_7BSONObjERNS_14BSONObjBuilderEb+0x3898)[0x730908]
mongos(_ZN5mongo7Command20runAgainstRegisteredEPKcRNS_7BSONObjERNS_14BSONObjBuilderEi+0x805)[0x712fa5]
mongos(_ZN5mongo14SingleStrategy7queryOpERNS_7RequestE+0x4d0)[0x700860]
mongos(_ZN5mongo7Request7processEi+0x157)[0x72ad97]
mongos(_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x71)[0x73c8d1]
mongos(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x260)[0x5aad20]
/lib/libpthread.so.0(+0x69ca)[0x7fe56de219ca]
/lib/libc.so.6(clone+0x6d)[0x7fe56cb5370d]
===
CursorCache at shutdown - sharded: 1 passthrough: 2936
Received signal 11
Backtrace: 0x528c54 0x7fe56caa0af0 0x2784c40
mongos(_ZN5mongo17printStackAndExitEi+0x64)[0x528c54]
/lib/libc.so.6(+0x33af0)[0x7fe56caa0af0]
[0x2784c40]
===

Comment by Kristina Chodorow (Inactive) [ 23/Oct/12 ]

Can you attach a larger chunk of logs leading up to the crash?

Comment by Anton V. Volokhov [ 23/Oct/12 ]

It is just one draining. I restart mongos, it continues draining for some days, and then crashes again.
Now the draining is over, and I hope I won't see this crash any more. But it has caused a lot of trouble.

Comment by Kristina Chodorow (Inactive) [ 18/Oct/12 ]

Are these crashes from multiple drainings, or just one (and you restart the server, it keeps draining, crashes, etc.)?

Comment by Anton V. Volokhov [ 17/Oct/12 ]

Any ideas?

Comment by Anton V. Volokhov [ 11/Oct/12 ]

Still crashing every three days or so.
