Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-10567

Socket close race when relinquishing a replica node

    • Type: Icon: Bug Bug
    • Resolution: Done
    • Priority: Icon: Major - P3 Major - P3
    • 2.5.3
    • Affects Version/s: 2.4.5
    • Component/s: Replication
    • Environment:
      Centos 6.3
      libc 2.12
      OpenSSL 1.0.0-fips 29 Mar 2010
      {{Linux mdb2 2.6.32-279.19.1.el6.x86_64 #1 SMP Wed Dec 19 07:05:20 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux}}
      MongoDB 2.4.5 built from source with SSL
    • ALL
      1. Disrupt connection between two replica set members
      2. Wait for slave to notice it is disconnected
      3. Restore connectivity

      The ReplSetImpl::relinquish method closes all sockets from the RsMgr thread. If the rsHealthPoll thread does a poll during the closing of the socket the file descriptor fd might be -1 causing the fassert to trigger.

      Either mutex the Socket or fd or handle the error scenario more gracefully.

      Original description:
      fatal assertion 16505 during replica set reconnection
      After renegotiating a connection after a failure, it appears that the server can attempt to secure the socket before the descriptor has been reassigned to reflect the new socket. This also appears to affect the CXX driver.

      https://github.com/mongodb/mongo/blob/r2.4.5/src/mongo/util/net/sock.cpp#L401,L437

      Mon Aug 19 05:12:28.642 [rsHealthPoll] couldn't connect to mdb1:27017: couldn't connect to server mdb1:27017
      Mon Aug 19 05:12:29.393 [rsMgr] replSet PRIMARY
      Mon Aug 19 05:12:32.664 [initandlisten] connection accepted from 10.3.246.152:43959 #61171 (3 connections now open)
      Mon Aug 19 05:12:33.079 [conn61171]  authenticate db: local { authenticate: 1, nonce: "7e4b22f8bf015279", user: "__system", key: "06560f09af
      6221a471a971baf1ea616f" }
      Mon Aug 19 05:12:35.667 [rsHealthPoll] couldn't connect to mdb1:27017: couldn't connect to server mdb1:27017
      Mon Aug 19 05:12:35.776 [rsHealthPoll] replset info mdb1:27017 heartbeat failed, retrying
      Mon Aug 19 05:12:35.777 [conn61169] end connection 10.3.246.152:43953 (2 connections now open)
      Mon Aug 19 05:12:35.914 [initandlisten] connection accepted from 10.3.246.152:43960 #61172 (3 connections now open)
      Mon Aug 19 05:12:35.971 [rsHealthPoll] replset info mdb1:27017 thinks that we are down
      Mon Aug 19 05:12:35.971 [rsHealthPoll] replSet member mdb1:27017 is up
      Mon Aug 19 05:12:35.971 [rsHealthPoll] replSet member mdb1:27017 is now in state SECONDARY
      Mon Aug 19 05:12:35.971 [rsMgr] stepping down mdb2.firehost.net:27017 (priority 75), mdb1:27017 is priority 100 and 0 seconds be
      hind
      Mon Aug 19 05:12:35.971 [rsMgr] replSet relinquishing primary state
      Mon Aug 19 05:12:35.971 [rsMgr] replSet SECONDARY
      Mon Aug 19 05:12:35.971 [rsMgr] replSet closing client sockets after relinquishing primary
      Mon Aug 19 05:12:36.052 [conn61172] ERROR: could not negotiate SSL connection: EOF detected
      Mon Aug 19 05:12:36.052 [conn61172] SocketException handling request, closing client connection: 9001 socket exception [6]
      Mon Aug 19 05:12:36.103 [rsHealthPoll] ERROR: disableNagle failed: errno:9 Bad file descriptor
      Mon Aug 19 05:12:36.103 [rsHealthPoll] ERROR: SO_KEEPALIVE failed: errno:9 Bad file descriptor
      Mon Aug 19 05:12:36.103 [rsHealthPoll] ERROR: can't get TCP_KEEPIDLE: errno:9 Bad file descriptor
      Mon Aug 19 05:12:36.103 [rsHealthPoll] ERROR: can't get TCP_KEEPINTVL: errno:9 Bad file descriptor
      Mon Aug 19 05:12:36.103 [rsHealthPoll]   Fatal Assertion 16505
      0xc13a66 0xbda6b3 0xc03435 0x74dda4 0x74e0fb 0xa8fa6f 0xaa0f32 0xaa23a6 0xbe21f5 0xbddd4f 0xbdf669 0xc4cf6c 0x3879a07851 0x38792e811d
       /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x26) [0xc13a66]
       /usr/bin/mongod(_ZN5mongo13fassertFailedEi+0x63) [0xbda6b3]
       /usr/bin/mongod(_ZN5mongo6Socket6secureEPNS_10SSLManagerE+0x65) [0xc03435]
       /usr/bin/mongod(_ZN5mongo18DBClientConnection8_connectERSs+0x334) [0x74dda4]
       /usr/bin/mongod(_ZN5mongo18DBClientConnection7connectERKNS_11HostAndPortERSs+0x6b) [0x74e0fb]
       /usr/bin/mongod(_ZN5mongo10ScopedConn7connectEv+0xef) [0xa8fa6f]
       /usr/bin/mongod(_ZN5mongo21ReplSetHealthPollTask17_requestHeartbeatERNS_13HeartbeatInfoERNS_7BSONObjERi+0x672) [0xaa0f32]
       /usr/bin/mongod(_ZN5mongo21ReplSetHealthPollTask6doWorkEv+0x96) [0xaa23a6]
       /usr/bin/mongod(_ZN5mongo4task4Task3runEv+0x25) [0xbe21f5]
       /usr/bin/mongod(_ZN5mongo13BackgroundJob7jobBodyEN5boost10shared_ptrINS0_9JobStatusEEE+0xcf) [0xbddd4f]
       /usr/bin/mongod(_ZN5boost6detail11thread_dataINS_3_bi6bind_tIvNS_4_mfi3mf1IvN5mongo13BackgroundJobENS_10shared_ptrINS7_9JobStatusEEEEENS2_5list2INS2_5valueIPS7_EENSD_ISA_EEEEEEE3runEv+0x79) [0xbdf669]
       /usr/bin/mongod() [0xc4cf6c]
       /lib64/libpthread.so.0() [0x3879a07851]
       /lib64/libc.so.6(clone+0x6d) [0x38792e811d]
      Mon Aug 19 05:12:36.109 [rsHealthPoll]
      
      ***aborting after fassert() failure
      
      
      Mon Aug 19 05:12:36.109 Got signal: 6 (Aborted).
      
      Mon Aug 19 05:12:36.114 Backtrace:
      0xc13a66 0x708f4d 0x3879232920 0x38792328a5 0x3879234085 0xbda6ee 0xc03435 0x74dda4 0x74e0fb 0xa8fa6f 0xaa0f32 0xaa23a6 0xbe21f5 0xbddd4f 0xbdf669 0xc4cf6c 0x3879a07851 0x38792e811d
       /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x26) [0xc13a66]
       /usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x39d) [0x708f4d]
       /lib64/libc.so.6() [0x3879232920]
       /lib64/libc.so.6(gsignal+0x35) [0x38792328a5]
       /lib64/libc.so.6(abort+0x175) [0x3879234085]
       /usr/bin/mongod(_ZN5mongo13fassertFailedEi+0x9e) [0xbda6ee]
       /usr/bin/mongod(_ZN5mongo6Socket6secureEPNS_10SSLManagerE+0x65) [0xc03435]
       /usr/bin/mongod(_ZN5mongo18DBClientConnection8_connectERSs+0x334) [0x74dda4]
       /usr/bin/mongod(_ZN5mongo18DBClientConnection7connectERKNS_11HostAndPortERSs+0x6b) [0x74e0fb]
       /usr/bin/mongod(_ZN5mongo10ScopedConn7connectEv+0xef) [0xa8fa6f]
       /usr/bin/mongod(_ZN5mongo21ReplSetHealthPollTask17_requestHeartbeatERNS_13HeartbeatInfoERNS_7BSONObjERi+0x672) [0xaa0f32]
       /usr/bin/mongod(_ZN5mongo21ReplSetHealthPollTask6doWorkEv+0x96) [0xaa23a6]
       /usr/bin/mongod(_ZN5mongo4task4Task3runEv+0x25) [0xbe21f5]
       /usr/bin/mongod(_ZN5mongo13BackgroundJob7jobBodyEN5boost10shared_ptrINS0_9JobStatusEEE+0xcf) [0xbddd4f]
       /usr/bin/mongod(_ZN5boost6detail11thread_dataINS_3_bi6bind_tIvNS_4_mfi3mf1IvN5mongo13BackgroundJobENS_10shared_ptrINS7_9JobStatusEEEEENS2_5list2INS2_5valueIPS7_EENSD_ISA_EEEEEEE3runEv+0x79) [0xbdf669]
       /usr/bin/mongod() [0xc4cf6c]
       /lib64/libpthread.so.0() [0x3879a07851]
       /lib64/libc.so.6(clone+0x6d) [0x38792e811d]
      

            Assignee:
            andreas.nilsson Andreas Nilsson
            Reporter:
            tmehlinger Travis Mehlinger
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

              Created:
              Updated:
              Resolved: