[SERVER-21249] Replica set reports down member is healthy on Windows Created: 02/Nov/15  Updated: 01/Dec/15  Resolved: 24/Nov/15

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.2.0-rc1
Fix Version/s: 3.2.0-rc4

Type: Bug Priority: Major - P3
Reporter: David Golub Assignee: Matt Dannenberg
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File mongod1.dmp     File mongod2.dmp     HTML File run9001     HTML File run9002     HTML File run9003     HTML File run9004    
Issue Links:
Related
is related to SERVER-21501 Restart of secondary results in addit... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Repl C (11/20/15), Repl D (12/11/15)
Participants:

 Description   

We have an Automation Agent test that fires up a four-node replica set, takes down one secondary and the primary, and waits for the two remaining nodes to report accurate information via rs.status(). This test is failing on Windows, timing out because the two remaining nodes continue to report that the primary is healthy, which can be seen by connecting to them via the Mongo shell and running rs.status(). I have not been able to reproduce this outside the test, and the test passes on Linux and OS X, leading me to believe that there may be some sort of Windows-specific race condition. Memory dumps from the two remaining secondaries are attached.
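
For reference, a minimal mongo shell sketch of the check the test is waiting on, run against each surviving node (host names and ports here are placeholders, not the actual test configuration):

    // Connect to one of the two remaining secondaries, e.g. mongo --port 27017
    var status = rs.status();
    status.members.forEach(function (m) {
        // health stays 1 until this node's heartbeats to that member time out;
        // once the member is marked down it drops to 0 and stateStr reads
        // "(not reachable/healthy)"
        print(m.name + "  state=" + m.stateStr + "  health=" + m.health);
    });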

CC mark.benvenuto



 Comments   
Comment by Githook User [ 24/Nov/15 ]

Author: matt dannenberg (dannenberg) <matt.dannenberg@10gen.com>

Message: SERVER-21249 only restart heartbeats once when a node cannot find a syncsource
Branch: master
https://github.com/mongodb/mongo/commit/e8f4a2ea35060e97281221f3b1457ab7106e631e

Comment by Scott Hernandez (Inactive) [ 19/Nov/15 ]

It would be good to see if this still happens with RC3 or RC4 (next week).

Comment by David Golub [ 02/Nov/15 ]

It gets into the state whenever I run the test. Tomorrow morning is fine. Just stop by my desk when it's good for you. I generally get in by 10:00 AM.

Comment by David Golub [ 02/Nov/15 ]

We're talking about minutes, not seconds. As far as I can tell, once it's in that state, it stays there indefinitely.

Comment by Scott Hernandez (Inactive) [ 02/Nov/15 ]

How long are you waiting before considering it incorrect to report the node as primary? It will take some time for the heartbeats to time out before the member is marked down; this is expected behavior if we are talking about seconds, but not minutes.
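
For context, that window is governed by the replica set's heartbeat settings, which can be inspected from the shell; the defaults noted in the comments below are the stock 3.2 values and have not been verified against this particular cluster:

    // heartbeatIntervalMillis (default 2000) is how often heartbeats are sent;
    // heartbeatTimeoutSecs (default 10) is how long an unresponsive member can go
    // before it is marked as down in rs.status()
    var cfg = rs.conf();
    printjson(cfg.settings);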

Comment by David Golub [ 02/Nov/15 ]

Unfortunately, no, the only way I've been able to reproduce it is by running the Automation Agent test in question. I can guide you through getting that set up, but it's a little bit of a hassle. It can be reproduced reliably by running the test, and like all the Automation Agent tests, it runs on a single server. I'll attach the logs.

Comment by Daniel Pasette (Inactive) [ 02/Nov/15 ]

A few follow up questions:

  • Do you have a repro script that can be used?
  • Do you have logs from all nodes?
  • Can this be reproduced reliably?
  • Is this test run on a single server?