[SERVER-10567] Socket close race when relinquishing a replica node Created: 19/Aug/13  Updated: 11/Jul/16  Resolved: 29/Sep/13

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.4.5
Fix Version/s: 2.5.3

Type: Bug Priority: Major - P3
Reporter: Travis Mehlinger Assignee: Andreas Nilsson
Resolution: Done Votes: 0
Labels: replicaset
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Centos 6.3
libc 2.12
OpenSSL 1.0.0-fips 29 Mar 2010
Linux mdb2 2.6.32-279.19.1.el6.x86_64 #1 SMP Wed Dec 19 07:05:20 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
MongoDB 2.4.5 built from source with SSL


Issue Links:
Related
Operating System: ALL
Steps To Reproduce:
  1. Disrupt connection between two replica set members
  2. Wait for slave to notice it is disconnected
  3. Restore connectivity
Participants:

 Description   

The ReplSetImpl::relinquish method closes all sockets from the RsMgr thread. If the rsHealthPoll thread does a poll during the closing of the socket the file descriptor fd might be -1 causing the fassert to trigger.

Either mutex the Socket or fd or handle the error scenario more gracefully.

Original description:
fatal assertion 16505 during replica set reconnection
After renegotiating a connection after a failure, it appears that the server can attempt to secure the socket before the descriptor has been reassigned to reflect the new socket. This also appears to affect the CXX driver.

https://github.com/mongodb/mongo/blob/r2.4.5/src/mongo/util/net/sock.cpp#L401,L437

Mon Aug 19 05:12:28.642 [rsHealthPoll] couldn't connect to mdb1:27017: couldn't connect to server mdb1:27017
Mon Aug 19 05:12:29.393 [rsMgr] replSet PRIMARY
Mon Aug 19 05:12:32.664 [initandlisten] connection accepted from 10.3.246.152:43959 #61171 (3 connections now open)
Mon Aug 19 05:12:33.079 [conn61171]  authenticate db: local { authenticate: 1, nonce: "7e4b22f8bf015279", user: "__system", key: "06560f09af
6221a471a971baf1ea616f" }
Mon Aug 19 05:12:35.667 [rsHealthPoll] couldn't connect to mdb1:27017: couldn't connect to server mdb1:27017
Mon Aug 19 05:12:35.776 [rsHealthPoll] replset info mdb1:27017 heartbeat failed, retrying
Mon Aug 19 05:12:35.777 [conn61169] end connection 10.3.246.152:43953 (2 connections now open)
Mon Aug 19 05:12:35.914 [initandlisten] connection accepted from 10.3.246.152:43960 #61172 (3 connections now open)
Mon Aug 19 05:12:35.971 [rsHealthPoll] replset info mdb1:27017 thinks that we are down
Mon Aug 19 05:12:35.971 [rsHealthPoll] replSet member mdb1:27017 is up
Mon Aug 19 05:12:35.971 [rsHealthPoll] replSet member mdb1:27017 is now in state SECONDARY
Mon Aug 19 05:12:35.971 [rsMgr] stepping down mdb2.firehost.net:27017 (priority 75), mdb1:27017 is priority 100 and 0 seconds be
hind
Mon Aug 19 05:12:35.971 [rsMgr] replSet relinquishing primary state
Mon Aug 19 05:12:35.971 [rsMgr] replSet SECONDARY
Mon Aug 19 05:12:35.971 [rsMgr] replSet closing client sockets after relinquishing primary
Mon Aug 19 05:12:36.052 [conn61172] ERROR: could not negotiate SSL connection: EOF detected
Mon Aug 19 05:12:36.052 [conn61172] SocketException handling request, closing client connection: 9001 socket exception [6]
Mon Aug 19 05:12:36.103 [rsHealthPoll] ERROR: disableNagle failed: errno:9 Bad file descriptor
Mon Aug 19 05:12:36.103 [rsHealthPoll] ERROR: SO_KEEPALIVE failed: errno:9 Bad file descriptor
Mon Aug 19 05:12:36.103 [rsHealthPoll] ERROR: can't get TCP_KEEPIDLE: errno:9 Bad file descriptor
Mon Aug 19 05:12:36.103 [rsHealthPoll] ERROR: can't get TCP_KEEPINTVL: errno:9 Bad file descriptor
Mon Aug 19 05:12:36.103 [rsHealthPoll]   Fatal Assertion 16505
0xc13a66 0xbda6b3 0xc03435 0x74dda4 0x74e0fb 0xa8fa6f 0xaa0f32 0xaa23a6 0xbe21f5 0xbddd4f 0xbdf669 0xc4cf6c 0x3879a07851 0x38792e811d
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x26) [0xc13a66]
 /usr/bin/mongod(_ZN5mongo13fassertFailedEi+0x63) [0xbda6b3]
 /usr/bin/mongod(_ZN5mongo6Socket6secureEPNS_10SSLManagerE+0x65) [0xc03435]
 /usr/bin/mongod(_ZN5mongo18DBClientConnection8_connectERSs+0x334) [0x74dda4]
 /usr/bin/mongod(_ZN5mongo18DBClientConnection7connectERKNS_11HostAndPortERSs+0x6b) [0x74e0fb]
 /usr/bin/mongod(_ZN5mongo10ScopedConn7connectEv+0xef) [0xa8fa6f]
 /usr/bin/mongod(_ZN5mongo21ReplSetHealthPollTask17_requestHeartbeatERNS_13HeartbeatInfoERNS_7BSONObjERi+0x672) [0xaa0f32]
 /usr/bin/mongod(_ZN5mongo21ReplSetHealthPollTask6doWorkEv+0x96) [0xaa23a6]
 /usr/bin/mongod(_ZN5mongo4task4Task3runEv+0x25) [0xbe21f5]
 /usr/bin/mongod(_ZN5mongo13BackgroundJob7jobBodyEN5boost10shared_ptrINS0_9JobStatusEEE+0xcf) [0xbddd4f]
 /usr/bin/mongod(_ZN5boost6detail11thread_dataINS_3_bi6bind_tIvNS_4_mfi3mf1IvN5mongo13BackgroundJobENS_10shared_ptrINS7_9JobStatusEEEEENS2_5list2INS2_5valueIPS7_EENSD_ISA_EEEEEEE3runEv+0x79) [0xbdf669]
 /usr/bin/mongod() [0xc4cf6c]
 /lib64/libpthread.so.0() [0x3879a07851]
 /lib64/libc.so.6(clone+0x6d) [0x38792e811d]
Mon Aug 19 05:12:36.109 [rsHealthPoll]
 
***aborting after fassert() failure
 
 
Mon Aug 19 05:12:36.109 Got signal: 6 (Aborted).
 
Mon Aug 19 05:12:36.114 Backtrace:
0xc13a66 0x708f4d 0x3879232920 0x38792328a5 0x3879234085 0xbda6ee 0xc03435 0x74dda4 0x74e0fb 0xa8fa6f 0xaa0f32 0xaa23a6 0xbe21f5 0xbddd4f 0xbdf669 0xc4cf6c 0x3879a07851 0x38792e811d
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x26) [0xc13a66]
 /usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x39d) [0x708f4d]
 /lib64/libc.so.6() [0x3879232920]
 /lib64/libc.so.6(gsignal+0x35) [0x38792328a5]
 /lib64/libc.so.6(abort+0x175) [0x3879234085]
 /usr/bin/mongod(_ZN5mongo13fassertFailedEi+0x9e) [0xbda6ee]
 /usr/bin/mongod(_ZN5mongo6Socket6secureEPNS_10SSLManagerE+0x65) [0xc03435]
 /usr/bin/mongod(_ZN5mongo18DBClientConnection8_connectERSs+0x334) [0x74dda4]
 /usr/bin/mongod(_ZN5mongo18DBClientConnection7connectERKNS_11HostAndPortERSs+0x6b) [0x74e0fb]
 /usr/bin/mongod(_ZN5mongo10ScopedConn7connectEv+0xef) [0xa8fa6f]
 /usr/bin/mongod(_ZN5mongo21ReplSetHealthPollTask17_requestHeartbeatERNS_13HeartbeatInfoERNS_7BSONObjERi+0x672) [0xaa0f32]
 /usr/bin/mongod(_ZN5mongo21ReplSetHealthPollTask6doWorkEv+0x96) [0xaa23a6]
 /usr/bin/mongod(_ZN5mongo4task4Task3runEv+0x25) [0xbe21f5]
 /usr/bin/mongod(_ZN5mongo13BackgroundJob7jobBodyEN5boost10shared_ptrINS0_9JobStatusEEE+0xcf) [0xbddd4f]
 /usr/bin/mongod(_ZN5boost6detail11thread_dataINS_3_bi6bind_tIvNS_4_mfi3mf1IvN5mongo13BackgroundJobENS_10shared_ptrINS7_9JobStatusEEEEENS2_5list2INS2_5valueIPS7_EENSD_ISA_EEEEEEE3runEv+0x79) [0xbdf669]
 /usr/bin/mongod() [0xc4cf6c]
 /lib64/libpthread.so.0() [0x3879a07851]
 /lib64/libc.so.6(clone+0x6d) [0x38792e811d]



 Comments   
Comment by auto [ 29/Sep/13 ]

Author:

{u'username': u'agralius', u'name': u'Andreas Nilsson', u'email': u'andreas.nilsson@10gen.com'}

Message: SERVER-10567 Handling rsHealthPoll/PRIMARY stepdown race
Branch: master
https://github.com/mongodb/mongo/commit/d72362a3ecb69f8f504b2b99a0adf64ae5cf7a12

Generated at Thu Feb 08 03:23:30 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.