[SERVER-14076] remove test replset_remove_node.js Created: 06/Mar/14 Updated: 11/Jul/16 Resolved: 28/May/14 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Testing Infrastructure |
| Affects Version/s: | None |
| Fix Version/s: | 2.6.5, 2.7.2 |
| Type: | Task | Priority: | Minor - P4 |
| Reporter: | Michael O'Brien | Assignee: | Eric Milkie |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Backport Completed: | |
| Participants: | |
| Linked BF Score: | 0 |
| Description |
|
This test has failed only a handful of times recently, and the failures seem to be isolated to the win64 and win64 2k8 DEBUG variants. There are a bunch of error messages in the logs, some of which may be intended as part of the test, but I also see this in the logs (just prior to the end), which looks suspicious:
|
| Comments |
| Comment by Githook User [ 11/Aug/14 ] |
|
Author: Matt Dannenberg (username: dannenberg, email: matt.dannenberg@10gen.com)
Message: (cherry picked from commit 4acaf8a26e37986da49d51d14d26b7639699dde6) |
| Comment by Githook User [ 28/May/14 ] |
|
Author: Matt Dannenberg (username: dannenberg, email: matt.dannenberg@10gen.com)
Message: |
| Comment by Amalia Hawkins [ 22/May/14 ] |
|
Failed again here. |
| Comment by Benety Goh [ 08/May/14 ] |
|
ab47b0b217 Windows 64-bit DEBUG replicasets
|
| Comment by Matt Dannenberg [ 06/May/14 ] |
|
For some reason, the primary loses track of the remaining secondary and as a result steps down. I think that may be a bug we should fix. If not, I can have the test wait until the nodes re-obtain a steady/happy state. |
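The "wait until the nodes re-obtain a steady state" workaround mentioned above could look roughly like the following polling loop. This is a minimal sketch, not the actual test code: `wait_until`, `is_steady`, and the simulated state snapshots are hypothetical names invented for illustration, and "steady" is assumed to mean exactly one PRIMARY with every other member SECONDARY.

```python
import time


def wait_until(predicate, timeout_s=30.0, interval_s=0.5):
    """Poll predicate() until it returns True or timeout_s elapses.

    Hypothetical helper; returns True on success, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


def is_steady(member_states):
    """Assumed definition: one PRIMARY, everything else SECONDARY."""
    return (member_states.count("PRIMARY") == 1
            and all(s in ("PRIMARY", "SECONDARY") for s in member_states))


# Simulated status snapshots for the two remaining nodes (illustrative data).
snapshots = iter([
    ["SECONDARY", "SECONDARY"],  # primary has stepped down after the reconfig
    ["SECONDARY", "SECONDARY"],  # still no primary
    ["PRIMARY", "SECONDARY"],    # a primary is back
])

settled = wait_until(lambda: is_steady(next(snapshots)),
                     timeout_s=5.0, interval_s=0.01)
print("steady state reached:", settled)
```

A wait like this only hides the step-down from the test; it does not address whether the primary should have stepped down in the first place.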
| Comment by David Storch [ 06/May/14 ] |
| Comment by David Storch [ 02/May/14 ] |
| Comment by Matt Kangas [ 30/Apr/14 ] |
|
ed1c2d2db4 Windows 64-bit 2008R2+ DEBUG replicasets |
| Comment by Randolph Tan [ 09/Apr/14 ] |
|
failed again: |
| Comment by Matt Dannenberg [ 04/Apr/14 ] |
|
The above change was determined to be too large for 2.6, but will be part of the overhauling of replica sets done for 2.7. |
| Comment by Matt Dannenberg [ 17/Mar/14 ] |
|
https://github.com/mongodb/mongo/blob/master/src/mongo/db/repl/rs.cpp#L762 |
| Comment by Eric Milkie [ 14/Mar/14 ] |
|
Looks like the crux of the problem is:
Note that we bounced from SECONDARY back to PRIMARY in <1ms without an election. This is due to the reconfig code trying to guess when we should remain primary. The logic is probably wrong.
The rsHealthPoll for 31001 is a red herring. It lasts longer than the reconfig because each heartbeat task will retry once if it has trouble connecting, and the connect timeout is rather long. The ending of the health poll background jobs works by setting a flag so that no further tasks are scheduled, but it cannot interrupt the currently running task.
The "operation was attempted on something that is not a socket" is because there is apparently a SocketConn pooling problem; you see that error with the first heartbeat to 31002; the second one detects that the connection still has a socket of "-1" and thus reconnects. |
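To illustrate why a health poll can outlive the reconfig that stopped it, here is a minimal single-threaded sketch of the flag-based shutdown described above. It is not MongoDB's implementation: `HealthPoller`, `unreachable_node`, and the two-attempt limit are hypothetical names and values, chosen only to mirror the "retry once" and "flag stops future tasks only" behaviour from the comment.

```python
class HealthPoller:
    """Hypothetical stand-in for a heartbeat poller; not MongoDB's code."""

    def __init__(self, probe, max_attempts=2):
        self.probe = probe                # callable(round_no, attempt) -> bool
        self.max_attempts = max_attempts  # first try plus one retry
        self.stop_requested = False
        self.log = []                     # (round_no, attempt) per probe made

    def stop(self):
        # Only prevents further rounds from being scheduled.
        self.stop_requested = True

    def _heartbeat(self, round_no):
        # The flag is deliberately not checked here, so a task that has
        # already started runs to completion, including its retry.
        for attempt in range(1, self.max_attempts + 1):
            self.log.append((round_no, attempt))
            if self.probe(round_no, attempt):
                return True
        return False

    def run(self, max_rounds):
        rounds = 0
        while not self.stop_requested and rounds < max_rounds:
            rounds += 1
            self._heartbeat(rounds)
        return rounds


def unreachable_node(round_no, attempt):
    # Simulate a reconfig arriving while round 2 is already in flight.
    if (round_no, attempt) == (2, 1):
        poller.stop()
    return False  # the removed node never answers


poller = HealthPoller(unreachable_node)
rounds_run = poller.run(max_rounds=5)
print(rounds_run, poller.log)
```

In this sketch the stop request lands during the first attempt of round 2, yet the retry for that round is still made; only round 3 onward is suppressed. With a long connect timeout on each attempt, that tail is what shows up in the logs after the reconfig.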
| Comment by Matt Kangas [ 13/Mar/14 ] |
|
I finally got a bead on this one. In each of the failure examples above, the test gets the replSetReconfig and reports "replSetReconfig new config saved locally". Successful test runs also get this far. Then, unlike successful test runs, an "rsHealthPoll" thread is still trying to contact the now-stopped m31001 secondary and reports
No primary is re-established and the test fails.
It appears this is only occurring on Windows. Latent replsets bug on this platform? milkie - please take a look and choose a priority + assignee. |
| Comment by Matt Kangas [ 13/Mar/14 ] |
|
Attaching logs from failures on 2014-03-03, 2014-03-06 |
| Comment by Matt Kangas [ 12/Mar/14 ] |
|