[SERVER-14910] mongos crash after temporary connect error to a config server Created: 15/Aug/14  Updated: 15/Jan/15  Resolved: 15/Jan/15

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.4.10
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Taha Jahangir Assignee: Unassigned
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File mongos_logs.gz    
Operating System: ALL
Participants:

 Description   

After a temporary connection error to a config server, mongos crashed with signal 11:

Fri Aug 15 07:15:13.345 warning:  couldn't check dbhash on config server servername:27001 :: caused by :: 11002 socket exception [CONNECT_ERROR] server [servername:27001] mongos connectionpool error: couldn't connect to server servername:27001
Received signal 11
Received signal 11
Backtrace: Backtrace: 0xa8c225 0x7fc3fdabb000 0x8d9209 0x8d9209 ^@x7fc3fdabb000 0x8dab9f 0x8d9209 0x999dbb ^@x8dab9f 0x666817 ^@x999dbb 0xa79d4e ^@x666817 0x7fc3fe87a182 0xa79d4e 0x7fc3fdb7f38d 0x7fc3fe87a182
0x7fc3fdb7f38d
/usr/bin/mongos(_ZN5mongo17printStackAndExitEi+0x75)[0xa8c225]
/usr/bin/mongos(_ZN5mongo17printStackAndExitEi+0x75)[0xa8c225]
/lib/x86_64-linux-gnu/libc.so.6(+0x37000)[0x7fc3fdabb000]
/lib/x86_64-linux-gnu/libc.so.6(+0x37000)[0x7fc3fdabb000]
/usr/bin/mongos(_ZN5mongo10ClientInfo10newRequestEv+0x19)[0x8d9209]
/usr/bin/mongos(_ZN5mongo10ClientInfo10newRequestEv+0x19)[0x8d9209]
/usr/bin/mongos(_ZN5mongo10ClientInfo14newPeerRequestERKNS_11HostAndPortE+0x4f)[0x8dab9f]
/usr/bin/mongos(_ZN5mongo10ClientInfo14newPeerRequestERKNS_11HostAndPortE+0x4f)[0x8dab9f]
/usr/bin/mongos(_ZN5mongo7RequestC1ERNS_7MessageEPNS_21AbstractMessagingPortE+0xfb)[0x999dbb]
/usr/bin/mongos(_ZN5mongo7RequestC1ERNS_7MessageEPNS_21AbstractMessagingPortE+0xfb)[0x999dbb]
/usr/bin/mongos(_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x47)[0x666817]
/usr/bin/mongos(_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x47)[0x666817]
/usr/bin/mongos(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x42e)[0xa79d4e]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8182)[0x7fc3fe87a182]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fc3fdb7f38d]
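
The trace above is hard to read because most frames appear twice in a row and the C++ symbols are mangled. The following is a minimal sketch for untangling it, assuming the doubled lines come from two threads printing the same stack at the same moment (the log shows "Received signal 11" twice, but this is not confirmed). The names `FRAME_RE`, `demangle_nested_name` and `untangle` are hypothetical helpers, not part of mongos or of any MongoDB tool, and the decoder handles only the length-prefixed name components of a `_ZN...` symbol, ignoring argument types.

```python
import re

# One backtrace line of the form: module(symbol+offset)[address]
FRAME_RE = re.compile(
    r"^(?P<module>[^\s(]+)\((?P<symbol>[^+)]*)\+(?P<offset>0x[0-9a-f]+)\)"
    r"\[(?P<addr>0x[0-9a-f]+)\]$"
)


def demangle_nested_name(symbol):
    """Decode the length-prefixed parts of an Itanium `_ZN...` name.

    Simplified on purpose: argument types are dropped, and anything that
    is not a `_ZN` symbol is returned unchanged.
    """
    if not symbol.startswith("_ZN"):
        return symbol
    parts, i = [], 3
    while i < len(symbol) and symbol[i].isdigit():
        j = i
        while j < len(symbol) and symbol[j].isdigit():
            j += 1
        length = int(symbol[i:j])
        parts.append(symbol[j:j + length])
        i = j + length
    # C1/C2/C3 after the class name marks a constructor.
    if parts and symbol[i:i + 2] in ("C1", "C2", "C3"):
        parts.append(parts[-1])
    return "::".join(parts)


def untangle(lines):
    """Collapse consecutive duplicate frames; return (address, name) pairs."""
    frames, previous = [], None
    for raw in lines:
        line = raw.strip()
        if not line or line == previous:
            continue  # assumed to be the same frame printed by a second thread
        previous = line
        match = FRAME_RE.match(line)
        if match:
            name = demangle_nested_name(match["symbol"])
            if not name:  # frame without a symbol: fall back to module+offset
                name = f"{match['module']}+{match['offset']}"
            frames.append((match["addr"], name))
    return frames


# Frame lines copied verbatim from the trace in this ticket.
SAMPLE = """\
/usr/bin/mongos(_ZN5mongo17printStackAndExitEi+0x75)[0xa8c225]
/usr/bin/mongos(_ZN5mongo17printStackAndExitEi+0x75)[0xa8c225]
/lib/x86_64-linux-gnu/libc.so.6(+0x37000)[0x7fc3fdabb000]
/lib/x86_64-linux-gnu/libc.so.6(+0x37000)[0x7fc3fdabb000]
/usr/bin/mongos(_ZN5mongo10ClientInfo10newRequestEv+0x19)[0x8d9209]
/usr/bin/mongos(_ZN5mongo10ClientInfo10newRequestEv+0x19)[0x8d9209]
/usr/bin/mongos(_ZN5mongo10ClientInfo14newPeerRequestERKNS_11HostAndPortE+0x4f)[0x8dab9f]
/usr/bin/mongos(_ZN5mongo10ClientInfo14newPeerRequestERKNS_11HostAndPortE+0x4f)[0x8dab9f]
/usr/bin/mongos(_ZN5mongo7RequestC1ERNS_7MessageEPNS_21AbstractMessagingPortE+0xfb)[0x999dbb]
/usr/bin/mongos(_ZN5mongo7RequestC1ERNS_7MessageEPNS_21AbstractMessagingPortE+0xfb)[0x999dbb]
/usr/bin/mongos(_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x47)[0x666817]
/usr/bin/mongos(_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x47)[0x666817]
/usr/bin/mongos(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x42e)[0xa79d4e]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8182)[0x7fc3fe87a182]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fc3fdb7f38d]
""".splitlines()

if __name__ == "__main__":
    for addr, name in untangle(SAMPLE):
        print(addr, name)
```

Run on the trace above, this leaves nine distinct frames. Below printStackAndExit and the libc frame, the innermost mongos frame is mongo::ClientInfo::newRequest, reached from the mongo::Request constructor via ShardedMessageHandler::process, that is, while handling an incoming client message. For full signatures, the binutils tool `c++filt` does complete demangling.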

Balancing is off (and not running) on the mongos servers.



 Comments   
Comment by Ramon Fernandez Marina [ 15/Jan/15 ]

Thanks for uploading the logs, and apologies for the late reply, taha_jahangir. Unfortunately we have been unable to reproduce the issue and we haven't seen any more cases of it, so I'm resolving this ticket. Feel free to reopen it if this happens again.

Regards,
Ramón.

Comment by Taha Jahangir [ 27/Sep/14 ]

The crash happened only once. Log file of mongos server is attached.

Comment by Ramon Fernandez Marina [ 25/Sep/14 ]

taha_jahangir, is this still an issue for you? We'd like to examine the full logs from the time you start a mongos until it crashes with the errors above. Hopefully this will give us more information to help us understand what's going on. If any of your mongos crashes again, can you please upload the full logs to this ticket?

Thanks,
Ramón.

Comment by Taha Jahangir [ 04/Sep/14 ]

This error occurred on a production server (under relatively high load), and we have not tried to reproduce it (on a production server!). Here is part of the full log file (the first 5 lines are not related to this problem). There is no line like `reason: errno`.

Thu Aug 14 20:29:44.428 [conn187] autosplitted storage.fs.chunks shard: ns:storage.fs.chunksshard: rs2:rs2/servername2:27102,servername3:27102lastmod: 3875|7203||000000000000000000000000min: { files_id: ObjectId('e7d7e95321db72615f348aaa'), n: 15 }max: { files_id: ObjectId('e7de75520f4faf3748e4800c'), n: 10 } on: { files_id: ObjectId('e7dcec5321db724f23348c03'), n: 25 } (splitThreshold 134217728)
Thu Aug 14 23:16:45.717 [conn62] Socket say send() errno:32 Broken pipe servername3:27101
Thu Aug 14 23:16:45.717 [conn62] DBException in process: socket exception [SEND_ERROR] for servername3:27101
Thu Aug 14 23:44:23.940 [conn50] Socket say send() errno:32 Broken pipe servername3:27101
Thu Aug 14 23:44:23.940 [conn50] DBException in process: socket exception [SEND_ERROR] for servername3:27101
Fri Aug 15 07:08:37.291 Socket recv() timeout  192.168.143.101:27001
Fri Aug 15 07:08:37.304 SocketException: remote: 192.168.143.101:27001 error: 9001 socket exception [RECV_TIMEOUT] server [192.168.143.101:27001]
Fri Aug 15 07:08:37.304 DBClientCursor::init call() failed
Fri Aug 15 07:08:37.325 warning:  couldn't check dbhash on config server servername:27001 :: caused by :: 10276 DBClientBase::findN: transport error: servername:27001 ns: config.$cmd query: { dbhash: 1, collections: [ "chunks", "databases" ] }
Fri Aug 15 07:09:43.328 warning:  couldn't check dbhash on config server servername:27001 :: caused by :: 11002 socket exception [CONNECT_ERROR] server [servername:27001] mongos connectionpool error: couldn't connect to server servername:27001
Fri Aug 15 07:10:49.332 warning:  couldn't check dbhash on config server servername:27001 :: caused by :: 11002 socket exception [CONNECT_ERROR] server [servername:27001] mongos connectionpool error: couldn't connect to server servername:27001
Fri Aug 15 07:11:55.335 warning:  couldn't check dbhash on config server servername:27001 :: caused by :: 11002 socket exception [CONNECT_ERROR] server [servername:27001] mongos connectionpool error: couldn't connect to server servername:27001
Fri Aug 15 07:13:01.339 warning:  couldn't check dbhash on config server servername:27001 :: caused by :: 11002 socket exception [CONNECT_ERROR] server [servername:27001] mongos connectionpool error: couldn't connect to server servername:27001
Fri Aug 15 07:14:07.342 warning:  couldn't check dbhash on config server servername:27001 :: caused by :: 11002 socket exception [CONNECT_ERROR] server [servername:27001] mongos connectionpool error: couldn't connect to server servername:27001
Fri Aug 15 07:15:13.345 warning:  couldn't check dbhash on config server servername:27001 :: caused by :: 11002 socket exception [CONNECT_ERROR] server [servername:27001] mongos connectionpool error: couldn't connect to server servername:27001
Received signal 11
Received signal 11
Backtrace: Backtrace: 0xa8c225 0x7fc3fdabb000 0x8d9209 0x8d9209 ^@x7fc3fdabb000 0x8dab9f 0x8d9209 0x999dbb ^@x8dab9f 0x666817 ^@x999dbb 0xa79d4e ^@x666817 0x7fc3fe87a182 0xa79d4e 0x7fc3fdb7f38d 0x7fc3fe87a182
0x7fc3fdb7f38d
/usr/bin/mongos(_ZN5mongo17printStackAndExitEi+0x75)[0xa8c225]
/usr/bin/mongos(_ZN5mongo17printStackAndExitEi+0x75)[0xa8c225]
/lib/x86_64-linux-gnu/libc.so.6(+0x37000)[0x7fc3fdabb000]
/lib/x86_64-linux-gnu/libc.so.6(+0x37000)[0x7fc3fdabb000]
/usr/bin/mongos(_ZN5mongo10ClientInfo10newRequestEv+0x19)[0x8d9209]
/usr/bin/mongos(_ZN5mongo10ClientInfo10newRequestEv+0x19)[0x8d9209]
/usr/bin/mongos(_ZN5mongo10ClientInfo14newPeerRequestERKNS_11HostAndPortE+0x4f)[0x8dab9f]
/usr/bin/mongos(_ZN5mongo10ClientInfo14newPeerRequestERKNS_11HostAndPortE+0x4f)[0x8dab9f]
/usr/bin/mongos(_ZN5mongo7RequestC1ERNS_7MessageEPNS_21AbstractMessagingPortE+0xfb)[0x999dbb]
/usr/bin/mongos(_ZN5mongo7RequestC1ERNS_7MessageEPNS_21AbstractMessagingPortE+0xfb)[0x999dbb]
/usr/bin/mongos(_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x47)[0x666817]
/usr/bin/mongos(_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x47)[0x666817]
/usr/bin/mongos(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x42e)[0xa79d4e]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8182)[0x7fc3fe87a182]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fc3fdb7f38d]
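
To quantify how long the config server checks had been failing before the crash, the spacing of the dbhash warnings in the excerpt above can be measured with a short script. This is only a sketch: `ASSUMED_YEAR`, `parse_stamp` and `warning_gaps` are hypothetical names, the year is an assumption taken from the ticket date because these log lines omit it, and the sample messages are truncated after the server name.

```python
from datetime import datetime

# Assumption: the 2.4-style log lines carry no year; 2014 is the ticket's year.
ASSUMED_YEAR = 2014


def parse_stamp(line):
    """Parse the leading 'Fri Aug 15 07:15:13.345' timestamp of a log line."""
    stamp = " ".join(line.split()[:4])
    return datetime.strptime(f"{ASSUMED_YEAR} {stamp}", "%Y %a %b %d %H:%M:%S.%f")


def warning_gaps(lines, needle="couldn't check dbhash"):
    """Return the seconds elapsed between consecutive lines containing needle."""
    times = [parse_stamp(line) for line in lines if needle in line]
    return [round((b - a).total_seconds(), 3) for a, b in zip(times, times[1:])]


# Timestamps copied from the excerpt above; message text truncated.
SAMPLE_LOG = [
    "Fri Aug 15 07:08:37.325 warning:  couldn't check dbhash on config server servername:27001",
    "Fri Aug 15 07:09:43.328 warning:  couldn't check dbhash on config server servername:27001",
    "Fri Aug 15 07:10:49.332 warning:  couldn't check dbhash on config server servername:27001",
    "Fri Aug 15 07:11:55.335 warning:  couldn't check dbhash on config server servername:27001",
    "Fri Aug 15 07:13:01.339 warning:  couldn't check dbhash on config server servername:27001",
    "Fri Aug 15 07:14:07.342 warning:  couldn't check dbhash on config server servername:27001",
    "Fri Aug 15 07:15:13.345 warning:  couldn't check dbhash on config server servername:27001",
]

if __name__ == "__main__":
    print(warning_gaps(SAMPLE_LOG))
```

On this excerpt the seven warnings are spaced about 66 seconds apart, so the dbhash check against servername:27001 had been failing for roughly six and a half minutes (07:08:37 to 07:15:13) when the signal 11 was logged.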

The MongoDB setup contains 4 shards (each with 2 data nodes and 1 arbiter), 3 config servers and 2 mongos. I wrote details about the config servers in my previous note.

I think the only information available for debugging is the stack trace provided in the logs.

Comment by Ramon Fernandez Marina [ 02/Sep/14 ]

I tried to reproduce this behavior by killing one of my config servers in a test setup, but I was not able to:

2014-09-02T16:09:29.319-0400 [mongosMain] warning: Failed to connect to 127.0.1.1:27020, reason: errno:111 Connection refused
2014-09-02T16:09:29.319-0400 [mongosMain] warning:  couldn't check dbhash on config server tab:27020 :: caused by :: 11002 socket exception [CONNECT_ERROR] server [tab:27020] connection pool error: couldn't connect to server tab:27020 (127.0.1.1), connection attempt failed

I agree that mongos should not go belly up in these circumstances, but in order to track down the problem we'll need more information. Are you able to reliably reproduce this behavior? Can you send us more detailed logs? Perhaps a line with reason: errno, like the one above, can provide a useful hint. Are you able to provide more details about your setup?

Thanks,
Ramón.

Comment by Taha Jahangir [ 27/Aug/14 ]

There are 3 config servers: two in the same DC, and one in another DC (this is the `servername`).

The config server at servername:27001 is always up, but network errors are not unusual (it is in another DC).

I think the problem is not whether a config server is or is not listening on that socket. `mongos` should not die when a config server is not accessible.

Comment by Ramon Fernandez Marina [ 25/Aug/14 ]

How many config servers do you have? Also, can you check whether there's a mongod process listening on servername:27001?

Comment by Taha Jahangir [ 15/Aug/14 ]

mongos version: 2.4.10, on (updated) Ubuntu 14.04
