[SERVER-14076] remove test replset_remove_node.js Created: 06/Mar/14  Updated: 11/Jul/16  Resolved: 28/May/14

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: 2.6.5, 2.7.2

Type: Task Priority: Minor - P4
Reporter: Michael O'Brien Assignee: Eric Milkie
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File replset_remove_node.js_6a10de0_windows-64-debug.log     Text File replset_remove_node.js_7700d67_windows-64-2k8-debug.log     Text File replset_remove_node.js_8d43b5c_windows-64-debug.log    
Issue Links:
Related
Backport Completed:
Participants:
Linked BF Score: 0

 Description   

https://mci.10gen.com/ui/task_history/mongodb-mongo-master/replicasets#replset_remove_node.js=fail&buildVariants=

has failed only a handful of times recently, seems to be isolated to the win64 and win64 2k8 DEBUG variants.

there are a bunch of error messages in the logs, some of which may be intended as part of the test, but i also this in the logs (just prior to the endings) which looks suspicious:

 m31000| 2014-03-03T23:29:43.329+0000 [conn1] replSet replSetReconfig new config saved locally
 m31000| 2014-03-03T23:29:43.331+0000 [rsHealthPoll] warning: Failed to connect to 10.41.143.218:31001, reason: errno:10038 An operation was attempted on something that is not a socket.
 m31000| 2014-03-03T23:29:43.331+0000 [initandlisten] connection accepted from 127.0.0.1:51629 #15 (5 connections now open)
 m31000| 2014-03-03T23:29:43.331+0000 [conn1] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [127.0.0.1:51596] 
 m31000| 2014-03-03T23:29:43.331+0000 [initandlisten] connection accepted from 10.41.143.218:51630 #16 (5 connections now open)
 m31002| 2014-03-03T23:29:43.331+0000 [conn2] end connection 10.41.143.218:51602 (1 connection now open)
 m31000| 2014-03-03T23:29:43.331+0000 [conn13] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [10.41.143.218:51626] 
 m31000| 2014-03-03T23:29:43.335+0000 [rsHealthPoll] replSet info WIN-THU1H72T4UU:31001 is down (or slow to respond): 
 m31002| 2014-03-03T23:29:43.337+0000 [initandlisten] connection accepted from 10.41.143.218:51631 #5 (2 connections now open)
 m31000| 2014-03-03T23:29:43.336+0000 [conn11] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [10.41.143.218:51623] 
 m31002| 2014-03-03T23:29:43.339+0000 [rsBackgroundSync] replSet remoteOldestOp:    Mar 03 23:29:19 5315104f:1
 m31002| 2014-03-03T23:29:43.339+0000 [rsBackgroundSync] replSet lastOpTimeFetched: Mar 03 23:29:19 5315104f:1
 m31000| 2014-03-03T23:29:43.338+0000 [rsHealthPoll] replSet member WIN-THU1H72T4UU:31001 is now in state DOWN
2014-03-03T23:29:43.342+0000 Socket recv() errno:10053 An established connection was aborted by the software in your host machine. 127.0.0.1:31000
2014-03-03T23:29:43.342+0000 SocketException: remote: 127.0.0.1:31000 error: 9001 socket exception [RECV_ERROR] server [127.0.0.1:31000] 
2014-03-03T23:29:43.342+0000 DBClientCursor::init call() failed



 Comments   
Comment by Githook User [ 11/Aug/14 ]

Author:

{u'username': u'dannenberg', u'name': u'Matt Dannenberg', u'email': u'matt.dannenberg@10gen.com'}

Message: SERVER-14076 remove replset_remove_node.js as it is redundant and flakey

(cherry picked from commit 4acaf8a26e37986da49d51d14d26b7639699dde6)
Branch: v2.6
https://github.com/mongodb/mongo/commit/f22b3d5470d3734937bf069a3ef884f4b6064497

Comment by Githook User [ 28/May/14 ]

Author:

{u'username': u'dannenberg', u'name': u'Matt Dannenberg', u'email': u'matt.dannenberg@10gen.com'}

Message: SERVER-14076 remove replset_remove_node.js as it is redundant and flakey
Branch: master
https://github.com/mongodb/mongo/commit/4acaf8a26e37986da49d51d14d26b7639699dde6

Comment by Amalia Hawkins [ 22/May/14 ]

Failed again here.

Comment by Benety Goh [ 08/May/14 ]

ab47b0b217 Windows 64-bit DEBUG replicasets
https://mci.10gen.com/ui/task/mongodb_mongo_master_windows_64_debug_ab47b0b217ab40971a928bbe3d98bd315bbba716_14_05_08_02_24_06_replicasets_windows_64_debug
http://buildlogs.mongodb.org/mci_0.9_windows-64-debug/builds/50664/test/replicasets_0/replset_remove_node.js

 m31000| 2014-05-08T05:49:07.168+0000 [rsMgr] replSet total number of votes is even - add arbiter or give one member an extra vote
 m31000| 2014-05-08T05:49:07.168+0000 [rsMgr] can't see a majority of the set, relinquishing primary
 m31000| 2014-05-08T05:49:07.168+0000 [rsMgr] replSet relinquishing primary state
 m31000| 2014-05-08T05:49:07.168+0000 [rsMgr] replSet SECONDARY

Comment by Matt Dannenberg [ 06/May/14 ]

For some reason, the primary loses track of the remaining secondary and as a result steps down.

I think that may be a bug we should fix. If not, I can have the test wait until the nodes re-obtain a steady/happy state.

Comment by David Storch [ 06/May/14 ]

https://mci.10gen.com/ui/task/mongodb_mongo_master_windows_64_2k8_debug_688b0585cb2cc0cdeffb019f88643b2340c55d17_14_05_06_17_07_26_replicasets_windows_64_2k8_debug

Comment by David Storch [ 02/May/14 ]

https://mci.10gen.com/ui/task/mongodb_mongo_master_windows_64_2k8_debug_1d317f1794044fb640e7591a8b611b4374207772_14_05_01_17_42_24_replicasets_windows_64_2k8_debug

Comment by Matt Kangas [ 30/Apr/14 ]

ed1c2d2db4 Windows 64-bit 2008R2+ DEBUG replicasets
https://mci.10gen.com/ui/task/mongodb_mongo_master_windows_64_2k8_debug_ed1c2d2db431766492892e68702564f5722f15b0_14_04_30_04_17_06_replicasets_windows_64_2k8_debug
http://buildlogs.mongodb.org/mci_0.9_windows-64-2k8-debug/builds/51519/test/replicasets_0/replset_remove_node.js

Comment by Randolph Tan [ 09/Apr/14 ]

failed again:
http://buildlogs.mongodb.org/mci_0.9_windows_64_debug/builds/42634/test/replicasets_0/replset_remove_node.js
https://mci.10gen.com/ui/task/mongodb_mongo_master_windows_64_debug_aa5a4870cc8ab47482ce04fd8491b4c99dc267db_14_04_08_19_29_24_replicasets_windows_64_test

Comment by Matt Dannenberg [ 04/Apr/14 ]

The above change was determined to be too large for 2.6, but will be part of the overhauling of replica sets done for 2.7.

Comment by Matt Dannenberg [ 17/Mar/14 ]

https://github.com/mongodb/mongo/blob/master/src/mongo/db/repl/rs.cpp#L762
This and the line below it seem to cause the problem.
I've been trying to patch test removing those two lines. It is failing some tests, but they seem to be existing failures... I'm going to hold off until I can get a green build.

Comment by Eric Milkie [ 14/Mar/14 ]

Looks like the crux of the problem is:

 m31000| 2014-03-06T22:25:08.304+0000 [conn1] replSet SECONDARY
 m31000| 2014-03-06T22:25:08.304+0000 [conn1] replSet closing client sockets after relinquishing primary
 m31000| 2014-03-06T22:25:08.304+0000 [conn1] replSet PRIMARY

Note that we bounced from SECONDARY back to PRIMARY in <1ms without an election. This is due to the reconfig code trying to guess when we should remain primary. The logic is probably wrong.

The rsHealthPoll for 31001 is a red herring. It lasts longer than the reconfig because each heartbeat task will retry once if it has trouble connecting, and the connect timeout is rather long. The ending of the health poll background jobs works by setting a flag so that no further tasks are scheduled, but it cannot interrupt the currently running task.

The "operation was attempted on something that is not a socket" is because there is apparently a SocketConn pooling problem; you see that error with the first heartbeat to 31002; the second one detects that the connection still has a socket of "-1" and thus reconnects.

Comment by Matt Kangas [ 13/Mar/14 ]

I finally got a bead on this one.

In each of the failure examples above, the test gets the replSetReconfig and reports "replSetReconfig new config saved locally". Successful test runs also get this far.

Then, unlike successful test runs, an "rsHealthPoll" thread is still trying to contact the now-stopped m31001 secondary and reports

 m31000| 2014-03-06T22:25:08.317+0000 [rsHealthPoll] warning: Failed to connect to 10.183.163.165:31001, reason: errno:10038 An operation was attempted on something that is not a socket.
 m31000| 2014-03-06T22:25:08.318+0000 [rsHealthPoll] replSet info WIN-THU1H72T4UU:31001 is down (or slow to respond): 
 m31000| 2014-03-06T22:25:08.318+0000 [rsHealthPoll] replSet member WIN-THU1H72T4UU:31001 is now in state DOWN
 m31000| 2014-03-06T22:25:08.318+0000 [rsMgr] replSet total number of votes is even - add arbiter or give one member an extra vote
 m31000| 2014-03-06T22:25:08.318+0000 [rsMgr] can't see a majority of the set, relinquishing primary
 m31000| 2014-03-06T22:25:08.319+0000 [rsMgr] replSet relinquishing primary state
 m31000| 2014-03-06T22:25:08.319+0000 [rsMgr] replSet SECONDARY
 m31000| 2014-03-06T22:25:08.321+0000 [rsHealthPoll] ERROR: disableNagle failed: errno:10038 An operation was attempted on something that is not a socket.
 m31000| 2014-03-06T22:25:08.321+0000 [rsHealthPoll] ERROR: SO_KEEPALIVE failed: errno:10038 An operation was attempted on something that is not a socket.
 m31000| 2014-03-06T22:25:08.322+0000 [rsHealthPoll] warning: Could not resolve local address for socket with fd -1: errno:10038 An operation was attempted on something that is not a socket.

No primary is re-established and the test fails.

  • Should rsHealthPoll thread for 31001 even still be active after "replSet replSetReconfig new config saved locally"?
  • Why do we see "errno:10038 An operation was attempted on something that is not a socket."?

It appears this is only occurring on Windows. Latent replsets bug on this platform?

milkie - please take a look and choose a priority + assignee.

Comment by Matt Kangas [ 13/Mar/14 ]

Attaching logs from failures on 2014-03-03, 2014-03-06

Comment by Matt Kangas [ 12/Mar/14 ]

pickaxe('jstests/replsets/replset_remove_node.js', {max_days:30}).toJiraTable();

2014-03-06T20:09Z 7700d67 fail windows-64-2k8-debug jstests\replsets\replset_remove_node.js MCI task logfile
2014-03-06T00:10Z 8d43b5c fail windows-64-debug jstests\replsets\replset_remove_node.js MCI task logfile
2014-03-03T21:26Z 6a10de0 fail windows-64-debug jstests\replsets\replset_remove_node.js MCI task logfile
2014-02-17T16:44Z 306dfab fail enterprise-windows-64 jstests\replsets\replset_remove_node.js MCI task logfile
Generated at Thu Feb 08 03:33:47 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.