[SERVER-66654] Failed election leaves recipient in unusable state Created: 23/May/22  Updated: 18/Aug/22  Resolved: 18/Aug/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Matt Broadstone Assignee: Didier Nadeau
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Sprint: Server Serverless 2022-06-13, Server Serverless 2022-06-27
Participants:

 Description   

See build failure logs here

Look for these specific lines:

[j5:rs10:n2] | 2022-05-22T01:21:24.036+00:00 I  COMMAND  21581   [conn44] "Received replSetStepUp request"
[j5:rs10:n2] | 2022-05-22T01:21:24.036+00:00 D4 ELECTION 4615650 [conn44] "Rescheduled election timeout callback","attr":{"when":{"$date":"2022-05-23T03:20:58.441Z"},"requestedWhen":{"$date":"2022-05-23T01:21:24.036Z"},"error":{"code":0,"codeName":"OK"}}
[j5:rs10:n2] | 2022-05-22T01:21:24.036+00:00 I  ELECTION 4615657 [conn44] "Not starting an election for a replSetStepUp request, since we are not electable","attr":{"reason":"Not standing for election because I cannot see a majority (mask 0x1)"}
[j5:rs10:n2] | 2022-05-22T01:21:24.036+00:00 I  COMMAND  21582   [conn44] "replSetStepUp request failed","attr":{"error":" :: caused by :: CommandFailed: Election failed."}

When the logs reach that line, all recipient nodes are stuck in SECONDARY, and subsequent attempts to clear the sync source and find a new one fail. Eventually the test times out waiting for a recipient primary.



 Comments   
Comment by Steven Vannelli [ 18/Aug/22 ]

Team decided to close this out since it is low impact and hasn't happened in a while. Will address via a new BF if it arises again. 

Comment by Didier Nadeau [ 03/Jun/22 ]

Another occurrence happened while running a patch for the shard split passthrough: task. We can see in the annotated log that the test failed due to a timeout while waiting for the recipient primary, and that the election had failed before that because two nodes saw all other nodes as DOWN.

When we look at the log for `j0:rs11:n0`, we see the same issue: an out-of-sync heartbeat marks a node as DOWN even though it has applied the new config (and a more recent heartbeat, received earlier, had already informed us of that). See L4499-4500 and L4533-4534, which confirm that a response to an older heartbeat request (67) is received after a newer one (request 69) and mistakenly sets the node as DOWN.
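The out-of-order-heartbeat hazard described above can be sketched as follows. This is an illustrative model, not the server's actual code; the class and field names are hypothetical. The idea is that tracking the highest heartbeat request id already processed per member, and discarding responses to older requests, prevents a stale DOWN verdict from overwriting a newer UP one.

```python
# Illustrative sketch (hypothetical names, not actual server code): a
# receiver that records the highest heartbeat request id processed per
# member and ignores responses to older requests.

class MemberData:
    def __init__(self):
        self.state = "UNKNOWN"
        self.last_request_id = -1  # highest heartbeat request id seen so far

    def on_heartbeat_response(self, request_id, state):
        # Drop responses that answer an older request than one already
        # processed; without this guard, a response to request 67
        # arriving after the response to request 69 would mark the
        # member DOWN again.
        if request_id < self.last_request_id:
            return False  # stale response, ignored
        self.last_request_id = request_id
        self.state = state
        return True

member = MemberData()
member.on_heartbeat_response(69, "SECONDARY")  # newer response: accepted
member.on_heartbeat_response(67, "DOWN")       # stale response: ignored
print(member.state)  # SECONDARY
```

Without the `request_id` comparison, the second call would flip the member back to DOWN, which is the mis-ordering seen at L4533-4534.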

Comment by Didier Nadeau [ 31/May/22 ]

It turns out the nodes do not get stuck in state DOWN: I was looking for the string "is now in state" to check for MemberData state changes, but the actual string is "Member is in new state" (I confused the two LOGV2 log formats here). The nodes were seeing each other as DOWN during the replSetStepUp, which explains why it failed. The states were then updated, and a subsequent election would have succeeded. However, that election never happened, because the default election timeout for tests run on Evergreen is 86400000 ms (24 hours); the test timed out first while waiting for a primary.
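The hang mechanism above can be made concrete with some arithmetic. This is a sketch under stated assumptions: the 24-hour figure comes from the comment, while the test's own deadline is a hypothetical value for illustration.

```python
# Sketch of the hang: test suites raise the election timeout to 24 hours
# so that only explicit replSetStepUp requests trigger elections. If the
# stepUp fails, the next timeout-driven election lands far beyond any
# realistic test deadline.

ELECTION_TIMEOUT_MS = 24 * 60 * 60 * 1000   # 86400000 ms, from the comment
TEST_TIMEOUT_MS = 2 * 60 * 60 * 1000        # hypothetical 2-hour test limit

step_up_succeeded = False  # "Not standing for election ... cannot see a majority"

# With stepUp failed, the earliest automatic election is at the election
# timeout, which is long past the test's own deadline.
next_election_at_ms = ELECTION_TIMEOUT_MS
test_hangs = (not step_up_succeeded) and (next_election_at_ms > TEST_TIMEOUT_MS)
print(test_hangs)  # True
```

So even though a later election would have succeeded, nothing ever triggers one within the test's lifetime.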

Comment by Didier Nadeau [ 30/May/22 ]

The problem appears to come from recipient nodes marking each other as DOWN while applying the recipient config. The DOWN states are never reset (we don't see the log line that accompanies the reset), nor are they reset during recipient config application. The nodes keep exchanging heartbeats until the end, seemingly without issue, but they never clear the DOWN flag (it looks like it should be reset here, but that never happens), which means none of them can ever be elected.

The next step would, ideally, be to understand why DOWN is never cleared and to confirm the cause by reproducing the issue.

We should be able to fix this by modifying the condition for marking a node as DOWN. It may also require resetting the nodes' states when applying a recipient config.
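The second part of the proposed fix can be sketched as follows. This is a hedged illustration with hypothetical names, not actual server code: on installing a new (recipient) config, stale DOWN verdicts are cleared so liveness is re-decided from fresh heartbeats rather than inherited from the previous config.

```python
# Hypothetical sketch of the proposed fix: reset per-member state when a
# new config is installed, instead of carrying DOWN flags recorded
# against the old config.

class TopologyView:
    def __init__(self, members):
        # State per member id; DOWN entries here may be stale leftovers
        # from heartbeats exchanged under the previous config.
        self.states = {m: "DOWN" for m in members}

    def install_config(self, members):
        # Proposed behavior: on config application, reset every member's
        # state to UNKNOWN so the next heartbeat round decides UP/DOWN.
        self.states = {m: "UNKNOWN" for m in members}

view = TopologyView(["n0", "n1", "n2"])
view.install_config(["n0", "n1", "n2"])
print(all(s == "UNKNOWN" for s in view.states.values()))  # True
```

With this reset in place, a node that was transiently DOWN during config application can be re-elected once healthy heartbeats resume.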

Generated at Thu Feb 08 06:06:03 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.