-
Type: Bug
-
Resolution: Done
-
Priority: Major - P3
-
Affects Version/s: 2.4.5
-
Component/s: Replication
-
Environment:Centos 6.3
libc 2.12
OpenSSL 1.0.0-fips 29 Mar 2010
{{Linux mdb2 2.6.32-279.19.1.el6.x86_64 #1 SMP Wed Dec 19 07:05:20 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux}}
MongoDB 2.4.5 built from source with SSL
-
ALL
-
- Disrupt connection between two replica set members
- Wait for slave to notice it is disconnected
- Restore connectivity
The ReplSetImpl::relinquish method closes all sockets from the RsMgr thread. If the rsHealthPoll thread does a poll during the closing of the socket the file descriptor fd might be -1 causing the fassert to trigger.
Either mutex the Socket or fd or handle the error scenario more gracefully.
Original description:
fatal assertion 16505 during replica set reconnection
After renegotiating a connection after a failure, it appears that the server can attempt to secure the socket before the descriptor has been reassigned to reflect the new socket. This also appears to affect the CXX driver.
https://github.com/mongodb/mongo/blob/r2.4.5/src/mongo/util/net/sock.cpp#L401,L437
Mon Aug 19 05:12:28.642 [rsHealthPoll] couldn't connect to mdb1:27017: couldn't connect to server mdb1:27017 Mon Aug 19 05:12:29.393 [rsMgr] replSet PRIMARY Mon Aug 19 05:12:32.664 [initandlisten] connection accepted from 10.3.246.152:43959 #61171 (3 connections now open) Mon Aug 19 05:12:33.079 [conn61171] authenticate db: local { authenticate: 1, nonce: "7e4b22f8bf015279", user: "__system", key: "06560f09af 6221a471a971baf1ea616f" } Mon Aug 19 05:12:35.667 [rsHealthPoll] couldn't connect to mdb1:27017: couldn't connect to server mdb1:27017 Mon Aug 19 05:12:35.776 [rsHealthPoll] replset info mdb1:27017 heartbeat failed, retrying Mon Aug 19 05:12:35.777 [conn61169] end connection 10.3.246.152:43953 (2 connections now open) Mon Aug 19 05:12:35.914 [initandlisten] connection accepted from 10.3.246.152:43960 #61172 (3 connections now open) Mon Aug 19 05:12:35.971 [rsHealthPoll] replset info mdb1:27017 thinks that we are down Mon Aug 19 05:12:35.971 [rsHealthPoll] replSet member mdb1:27017 is up Mon Aug 19 05:12:35.971 [rsHealthPoll] replSet member mdb1:27017 is now in state SECONDARY Mon Aug 19 05:12:35.971 [rsMgr] stepping down mdb2.firehost.net:27017 (priority 75), mdb1:27017 is priority 100 and 0 seconds be hind Mon Aug 19 05:12:35.971 [rsMgr] replSet relinquishing primary state Mon Aug 19 05:12:35.971 [rsMgr] replSet SECONDARY Mon Aug 19 05:12:35.971 [rsMgr] replSet closing client sockets after relinquishing primary Mon Aug 19 05:12:36.052 [conn61172] ERROR: could not negotiate SSL connection: EOF detected Mon Aug 19 05:12:36.052 [conn61172] SocketException handling request, closing client connection: 9001 socket exception [6] Mon Aug 19 05:12:36.103 [rsHealthPoll] ERROR: disableNagle failed: errno:9 Bad file descriptor Mon Aug 19 05:12:36.103 [rsHealthPoll] ERROR: SO_KEEPALIVE failed: errno:9 Bad file descriptor Mon Aug 19 05:12:36.103 [rsHealthPoll] ERROR: can't get TCP_KEEPIDLE: errno:9 Bad file descriptor Mon Aug 19 05:12:36.103 [rsHealthPoll] ERROR: can't get TCP_KEEPINTVL: errno:9 Bad file descriptor Mon Aug 19 05:12:36.103 [rsHealthPoll] Fatal Assertion 16505 0xc13a66 0xbda6b3 0xc03435 0x74dda4 0x74e0fb 0xa8fa6f 0xaa0f32 0xaa23a6 0xbe21f5 0xbddd4f 0xbdf669 0xc4cf6c 0x3879a07851 0x38792e811d /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x26) [0xc13a66] /usr/bin/mongod(_ZN5mongo13fassertFailedEi+0x63) [0xbda6b3] /usr/bin/mongod(_ZN5mongo6Socket6secureEPNS_10SSLManagerE+0x65) [0xc03435] /usr/bin/mongod(_ZN5mongo18DBClientConnection8_connectERSs+0x334) [0x74dda4] /usr/bin/mongod(_ZN5mongo18DBClientConnection7connectERKNS_11HostAndPortERSs+0x6b) [0x74e0fb] /usr/bin/mongod(_ZN5mongo10ScopedConn7connectEv+0xef) [0xa8fa6f] /usr/bin/mongod(_ZN5mongo21ReplSetHealthPollTask17_requestHeartbeatERNS_13HeartbeatInfoERNS_7BSONObjERi+0x672) [0xaa0f32] /usr/bin/mongod(_ZN5mongo21ReplSetHealthPollTask6doWorkEv+0x96) [0xaa23a6] /usr/bin/mongod(_ZN5mongo4task4Task3runEv+0x25) [0xbe21f5] /usr/bin/mongod(_ZN5mongo13BackgroundJob7jobBodyEN5boost10shared_ptrINS0_9JobStatusEEE+0xcf) [0xbddd4f] /usr/bin/mongod(_ZN5boost6detail11thread_dataINS_3_bi6bind_tIvNS_4_mfi3mf1IvN5mongo13BackgroundJobENS_10shared_ptrINS7_9JobStatusEEEEENS2_5list2INS2_5valueIPS7_EENSD_ISA_EEEEEEE3runEv+0x79) [0xbdf669] /usr/bin/mongod() [0xc4cf6c] /lib64/libpthread.so.0() [0x3879a07851] /lib64/libc.so.6(clone+0x6d) [0x38792e811d] Mon Aug 19 05:12:36.109 [rsHealthPoll] ***aborting after fassert() failure Mon Aug 19 05:12:36.109 Got signal: 6 (Aborted). Mon Aug 19 05:12:36.114 Backtrace: 0xc13a66 0x708f4d 0x3879232920 0x38792328a5 0x3879234085 0xbda6ee 0xc03435 0x74dda4 0x74e0fb 0xa8fa6f 0xaa0f32 0xaa23a6 0xbe21f5 0xbddd4f 0xbdf669 0xc4cf6c 0x3879a07851 0x38792e811d /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x26) [0xc13a66] /usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x39d) [0x708f4d] /lib64/libc.so.6() [0x3879232920] /lib64/libc.so.6(gsignal+0x35) [0x38792328a5] /lib64/libc.so.6(abort+0x175) [0x3879234085] /usr/bin/mongod(_ZN5mongo13fassertFailedEi+0x9e) [0xbda6ee] /usr/bin/mongod(_ZN5mongo6Socket6secureEPNS_10SSLManagerE+0x65) [0xc03435] /usr/bin/mongod(_ZN5mongo18DBClientConnection8_connectERSs+0x334) [0x74dda4] /usr/bin/mongod(_ZN5mongo18DBClientConnection7connectERKNS_11HostAndPortERSs+0x6b) [0x74e0fb] /usr/bin/mongod(_ZN5mongo10ScopedConn7connectEv+0xef) [0xa8fa6f] /usr/bin/mongod(_ZN5mongo21ReplSetHealthPollTask17_requestHeartbeatERNS_13HeartbeatInfoERNS_7BSONObjERi+0x672) [0xaa0f32] /usr/bin/mongod(_ZN5mongo21ReplSetHealthPollTask6doWorkEv+0x96) [0xaa23a6] /usr/bin/mongod(_ZN5mongo4task4Task3runEv+0x25) [0xbe21f5] /usr/bin/mongod(_ZN5mongo13BackgroundJob7jobBodyEN5boost10shared_ptrINS0_9JobStatusEEE+0xcf) [0xbddd4f] /usr/bin/mongod(_ZN5boost6detail11thread_dataINS_3_bi6bind_tIvNS_4_mfi3mf1IvN5mongo13BackgroundJobENS_10shared_ptrINS7_9JobStatusEEEEENS2_5list2INS2_5valueIPS7_EENSD_ISA_EEEEEEE3runEv+0x79) [0xbdf669] /usr/bin/mongod() [0xc4cf6c] /lib64/libpthread.so.0() [0x3879a07851] /lib64/libc.so.6(clone+0x6d) [0x38792e811d]