[SERVER-21108] connection pool should check if a connection has been closed remotely before using it Created: 22/Oct/15  Updated: 17/Nov/15  Resolved: 12/Nov/15

Status: Closed
Project: Core Server
Component/s: Networking
Affects Version/s: None
Fix Version/s: 3.2.0-rc3

Type: Bug Priority: Major - P3
Reporter: Scott Hernandez (Inactive) Assignee: Adam Midvidy
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
is related to SERVER-21406 rollback2.js Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Platform C (11/20/15)
Participants:
Linked BF Score: 0

 Description   

failures

[js_test:mongos_no_replica_set_refresh] 2015-10-22T21:06:15.598+0000 d20510| 2015-10-22T21:06:15.597+0000 I REPL     [conn1] replSetReconfig admin command received from client
[js_test:mongos_no_replica_set_refresh] 2015-10-22T21:06:15.598+0000 d20511| 2015-10-22T21:06:15.598+0000 I NETWORK  [initandlisten] connection accepted from 10.229.60.131:39471 #21 (5 connections now open)
[js_test:mongos_no_replica_set_refresh] 2015-10-22T21:06:15.599+0000 d20510| 2015-10-22T21:06:15.599+0000 I REPL     [conn1] replSetReconfig config object with 2 members parses ok
[js_test:mongos_no_replica_set_refresh] 2015-10-22T21:06:15.599+0000 d20511| 2015-10-22T21:06:15.599+0000 I NETWORK  [conn21] end connection 10.229.60.131:39471 (4 connections now open)
[js_test:mongos_no_replica_set_refresh] 2015-10-22T21:06:15.599+0000 d20510| 2015-10-22T21:06:15.599+0000 W REPL     [ReplicationExecutor] Failed to complete heartbeat request to ip-10-229-60-131:20511; HostUnreachable End of file
[js_test:mongos_no_replica_set_refresh] 2015-10-22T21:06:15.600+0000 d20510| 2015-10-22T21:06:15.599+0000 E REPL     [conn1] replSetReconfig failed; NodeNotFound Quorum check failed because not enough voting nodes responded; required 2 but only the following 1 voting nodes responded: ip-10-229-60-131:20510; the following nodes did not respond affirmatively: ip-10-229-60-131:20511 failed with End of file
[js_test:mongos_no_replica_set_refresh] 2015-10-22T21:06:15.600+0000 assert: command failed: {
[js_test:mongos_no_replica_set_refresh] 2015-10-22T21:06:15.600+0000 	"ok" : 0,
[js_test:mongos_no_replica_set_refresh] 2015-10-22T21:06:15.601+0000 	"errmsg" : "Quorum check failed because not enough voting nodes responded; required 2 but only the following 1 voting nodes responded: ip-10-229-60-131:20510; the following nodes did not respond affirmatively: ip-10-229-60-131:20511 failed with End of file",
[js_test:mongos_no_replica_set_refresh] 2015-10-22T21:06:15.601+0000 	"code" : 74
[js_test:mongos_no_replica_set_refresh] 2015-10-22T21:06:15.601+0000 } : undefined
[js_test:mongos_no_replica_set_refresh] 2015-10-22T21:06:15.601+0000 _getErrorWithCode@src/mongo/shell/utils.js:23:13
[js_test:mongos_no_replica_set_refresh] 2015-10-22T21:06:15.601+0000 doassert@src/mongo/shell/assert.js:13:14
[js_test:mongos_no_replica_set_refresh] 2015-10-22T21:06:15.601+0000 assert.commandWorked@src/mongo/shell/assert.js:259:5
[js_test:mongos_no_replica_set_refresh] 2015-10-22T21:06:15.602+0000 reconfig@jstests/replsets/rslib.js:109:9
[js_test:mongos_no_replica_set_refresh] 2015-10-22T21:06:15.602+0000 @jstests/sharding/mongos_no_replica_set_refresh.js:34:1
[js_test:mongos_no_replica_set_refresh] 2015-10-22T21:06:15.602+0000 @jstests/sharding/mongos_no_replica_set_refresh.js:3:2
[js_test:mongos_no_replica_set_refresh] 2015-10-22T21:06:15.602+0000 



 Comments   
Comment by Githook User [ 12/Nov/15 ]

Author:

{u'username': u'amidvidy', u'name': u'Adam Midvidy', u'email': u'amidvidy@gmail.com'}

Message: SERVER-21108 don't call cancel on destroyed timers
Branch: master
https://github.com/mongodb/mongo/commit/0892687880f0500a5cd07e71d7c667590607ea17

Comment by Githook User [ 12/Nov/15 ]

Author:

{u'username': u'amidvidy', u'name': u'Adam Midvidy', u'email': u'amidvidy@gmail.com'}

Message: SERVER-21108 check if connection is healthy before returning it from the pool
Branch: master
https://github.com/mongodb/mongo/commit/8dee1e3b48531f843e055d445dd545e75db30096

Comment by Githook User [ 10/Nov/15 ]

Author:

{u'username': u'amidvidy', u'name': u'Adam Midvidy', u'email': u'amidvidy@gmail.com'}

Message: SERVER-21108 use stdx to fix lint
Branch: master
https://github.com/mongodb/mongo/commit/8d9337834608600eaff01b14f490ff5b9e3edcd6

Comment by Githook User [ 10/Nov/15 ]

Author:

{u'username': u'amidvidy', u'name': u'Adam Midvidy', u'email': u'amidvidy@gmail.com'}

Message: SERVER-21108 add method to check if a stream is still open
Branch: master
https://github.com/mongodb/mongo/commit/d4f62a3fa396d0e952ec64b35a8533e69939698c

Comment by Eric Milkie [ 02/Nov/15 ]

There should be no retrying anywhere, as part of the implementation of this ticket. The old pooled connection code did not retry and neither should the new code.

Comment by Adam Midvidy [ 02/Nov/15 ]

We don't believe that this issue should be handled in the network layer, as not all commands we execute are idempotent (and thus safe to retry).

Comment by Benety Goh [ 26/Oct/15 ]

The commit b82f1a2 fixes a symptom of the underlying issue (bad connections) in a sharding test. The underlying issue still needs to be looked at.

Comment by Githook User [ 26/Oct/15 ]

Author:

{u'username': u'benety', u'name': u'Benety Goh', u'email': u'benety@mongodb.com'}

Message: SERVER-21108 eliminate unnecessary reconfig from mongos_no_replica_set_refresh.js
Branch: master
https://github.com/mongodb/mongo/commit/b82f1a20b8c3b44f84e5744fcf9ed062341e7d58

Comment by Scott Hernandez (Inactive) [ 23/Oct/15 ]

We also need to add tests to ensure dead connections do not fail up-level operation run thought the executor/network-interface, like sharding and replication use.

Comment by Benety Goh [ 23/Oct/15 ]

The legacy socket code in Socket::isStillConnected() used to perform a number of checks to determine if the connection in the pool is still usable:

https://github.com/mongodb/mongo/blob/23982c4d0b11b445c54f0ec4ad58edbd837fe7e4/src/mongo/util/net/sock.cpp#L859

Comment by Benety Goh [ 23/Oct/15 ]

The primary seems to be using a bad connection in its pool to send a heartbeat to the node that recently stepped down (and closed incoming connections):

[js_test:mongos_no_replica_set_refresh] 2015-10-22T21:06:15.535+0000 d20511| 2015-10-22T21:06:15.528+0000 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 2
[js_test:mongos_no_replica_set_refresh] 2015-10-22T21:06:15.535+0000 d20511| 2015-10-22T21:06:15.528+0000 I REPL     [ReplicationExecutor] Member ip-10-229-60-131:20510 is now in state PRIMARY
[js_test:mongos_no_replica_set_refresh] 2015-10-22T21:06:15.536+0000 d20511| 2015-10-22T21:06:15.528+0000 I REPL     [replExecDBWorker-2] transition to SECONDARY
[js_test:mongos_no_replica_set_refresh] 2015-10-22T21:06:15.536+0000 d20511| 2015-10-22T21:06:15.528+0000 I NETWORK  [conn6] end connection 10.229.60.131:39145 (10 connections now open)
[js_test:mongos_no_replica_set_refresh] 2015-10-22T21:06:15.536+0000 d20511| 2015-10-22T21:06:15.528+0000 I NETWORK  [conn11] end connection 10.229.60.131:39187 (9 connections now open)
[js_test:mongos_no_replica_set_refresh] 2015-10-22T21:06:15.536+0000 d20511| 2015-10-22T21:06:15.529+0000 I NETWORK  [conn17] end connection 10.229.60.131:39461 (8 connections now open)
[js_test:mongos_no_replica_set_refresh] 2015-10-22T21:06:15.536+0000 d20511| 2015-10-22T21:06:15.529+0000 I NETWORK  [conn14] end connection 10.229.60.131:39363 (8 connections now open)
[js_test:mongos_no_replica_set_refresh] 2015-10-22T21:06:15.537+0000 d20511| 2015-10-22T21:06:15.529+0000 I NETWORK  [conn8] end connection 10.229.60.131:39184 (6 connections now open)
[js_test:mongos_no_replica_set_refresh] 2015-10-22T21:06:15.537+0000 d20511| 2015-10-22T21:06:15.529+0000 I NETWORK  [conn5] end connection 10.229.60.131:39047 (5 connections now open)
[js_test:mongos_no_replica_set_refresh] 2015-10-22T21:06:15.537+0000 d20511| 2015-10-22T21:06:15.529+0000 I NETWORK  [conn18] end connection 10.229.60.131:39462 (4 connections now open)
[js_test:mongos_no_replica_set_refresh] 2015-10-22T21:06:15.537+0000 d20511| 2015-10-22T21:06:15.529+0000 I NETWORK  [conn1] end connection 127.0.0.1:37530 (3 connections now open)
[js_test:mongos_no_replica_set_refresh] 2015-10-22T21:06:15.538+0000 d20511| 2015-10-22T21:06:15.529+0000 I NETWORK  [conn13] end connection 10.229.60.131:39362 (2 connections now open)
...
[js_test:mongos_no_replica_set_refresh] 2015-10-22T21:06:15.599+0000 d20510| 2015-10-22T21:06:15.599+0000 W REPL     [ReplicationExecutor] Failed to complete heartbeat request to ip-10-229-60-131:20511; HostUnreachable End of file

Comment by Spencer Brody (Inactive) [ 23/Oct/15 ]

This looks to be strictly replication related, I don't think sharding is a factor. The failure is due to one node not responding during the reconfig quorum check.

Generated at Thu Feb 08 03:56:19 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.