[SERVER-21249] Replica set reports down member is healthy on Windows Created: 02/Nov/15 Updated: 01/Dec/15 Resolved: 24/Nov/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 3.2.0-rc1 |
| Fix Version/s: | 3.2.0-rc4 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | David Golub | Assignee: | Matt Dannenberg |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: |
|
| Issue Links: |
|
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Sprint: | Repl C (11/20/15), Repl D (12/11/15) |
| Participants: | |
| Description |
|
We have an Automation Agent test that fires up a four-node replica set, takes down one secondary and the primary, and waits for the two remaining nodes to report accurate information via rs.status(). This test is failing on Windows: it times out because the two remaining nodes continue to report that the primary is healthy, which can be seen by connecting to them via the Mongo shell and running rs.status(). I have not been able to reproduce this outside the test, and the test passes on Linux and OS X, leading me to believe that there may be some sort of Windows-specific race condition. Memory dumps from the two remaining secondaries are attached. |
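A minimal sketch of the check the test performs, run from the Mongo shell against one of the surviving nodes (the port here is hypothetical):

```javascript
// Connect to one of the two surviving secondaries; the port is hypothetical.
var conn = new Mongo("localhost:27018");
var status = conn.getDB("admin").runCommand({replSetGetStatus: 1});

// A member that has been taken down should eventually report health: 0 and
// state DOWN. The failure described here is that health stays 1 indefinitely.
status.members.forEach(function (m) {
    print(m.name + "  health=" + m.health + "  state=" + m.stateStr);
});
```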
| Comments |
| Comment by Githook User [ 24/Nov/15 ] |
|
Author: matt dannenberg (dannenberg) <matt.dannenberg@10gen.com>
Message: |
| Comment by Scott Hernandez (Inactive) [ 19/Nov/15 ] |
|
It would be good to see if this still happens with RC3, or RC4 (next week). |
| Comment by David Golub [ 02/Nov/15 ] |
|
It gets into the state whenever I run the test. Tomorrow morning is fine. Just stop by my desk when it's good for you. I generally get in by 10:00 AM. |
| Comment by David Golub [ 02/Nov/15 ] |
|
We're talking about minutes, not seconds. As far as I can tell, once it's in that state, it stays there indefinitely. |
| Comment by Scott Hernandez (Inactive) [ 02/Nov/15 ] |
|
How long are you waiting before considering it incorrect to report the node as primary? It takes some time for heartbeats to time out before the member is marked down, so this is expected behavior if we are talking about seconds, but not minutes. |
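For context, the window before a silent member is marked down is governed by the heartbeat timeout in the replica set configuration, which defaults to 10 seconds. A quick shell check, as a sketch:

```javascript
// Print the configured heartbeat timeout; when unset, the server default of
// 10 seconds applies, so a few seconds of stale "healthy" status is expected.
var cfg = rs.conf();
var timeout = (cfg.settings && cfg.settings.heartbeatTimeoutSecs) || 10;
print("heartbeatTimeoutSecs: " + timeout);
```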
| Comment by David Golub [ 02/Nov/15 ] |
|
Unfortunately, no; the only way I've been able to reproduce it is by running the Automation Agent test in question. I can guide you through getting that set up, but it's a bit of a hassle. It can be reproduced reliably by running the test, and like all the Automation Agent tests, it runs on a single server. I'll attach the logs. |
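For anyone trying to reproduce this without the Automation Agent, a rough sketch of what the test does (ports are hypothetical; assumes four mongod processes already running locally with --replSet rs0):

```javascript
// Initiate a four-node replica set across the locally running processes.
rs.initiate({
    _id: "rs0",
    members: [
        {_id: 0, host: "localhost:27017"},
        {_id: 1, host: "localhost:27018"},
        {_id: 2, host: "localhost:27019"},
        {_id: 3, host: "localhost:27020"}
    ]
});
// Once a primary is elected: kill the primary and one secondary, then poll
// rs.status() on the two survivors until the killed members show health: 0.
// On Windows the test times out because health never leaves 1.
```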
| Comment by Daniel Pasette (Inactive) [ 02/Nov/15 ] |
|
A few follow-up questions:
|