[SERVER-66654] Failed election leaves recipient in unusable state Created: 23/May/22 Updated: 18/Aug/22 Resolved: 18/Aug/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Matt Broadstone | Assignee: | Didier Nadeau |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Operating System: | ALL |
| Sprint: | Server Serverless 2022-06-13, Server Serverless 2022-06-27 |
| Participants: |
| Description |
|
See build failure logs here Look for these specific lines:
When the logs reach that line, all recipient nodes are marked as SECONDARY, and subsequent attempts to clear the sync source, and find a new one seem to fail. Eventually the test times out waiting for a recipient primary |
| Comments |
| Comment by Steven Vannelli [ 18/Aug/22 ] |
|
Team decided to close this out since it is low impact and hasn't happened in a while. Will address via a new BF if it arises again. |
| Comment by Didier Nadeau [ 03/Jun/22 ] |
|
Another occurence happened while running a patch for the shard split passthrough : task. We can see in the annoted log that it failed due to a timeout to get the recipient primary, and that election had failed before that because two nodes see the all other nodes as DOWN. When we look at log for `j0:rs11:n0`, we see the same issue where an out of sync hearbeat set a node as down even though it has applied the new config (and a more recent hearbeat received before informed us of that). See L4499-4500 and L4533-4534 that confirm a response to an older hearbeat request (67) is received after a newer (request 69), and that mistakenly sets the node as down. |
| Comment by Didier Nadeau [ 31/May/22 ] |
|
It turns out the nodes do not get stuck in state DOWN : I was looking for string "is now in state" to check for MemberData state change, but the actual string is "Member is in new state" (I confused the two logs formats in LOGV2 here). The nodes were seeing each other as DOWN during the replSetStepUp, which explains why it failed. Then the states were updated and an election would have succeeded. However the election never happened as the default timeout for tests run on evergreen is 86400000s. The test timed out as an election never happened. |
| Comment by Didier Nadeau [ 30/May/22 ] |
|
The problem appears to come from recipient nodes marking each other as DOWN during the application of the recipient config. The nodes that are DOWN are never reset (we don't see the log accompanying the reset) and they are not reset during the recipient config application either. The nodes keep exchanging heartbeat until the end, seemingly without issue, but they don't seem to clear the DOWN attribute (it seems like it should be reset here but it never happens), which means none of them can ever be elected. The next step would, ideally, be to understand why DOWN is never cleared and confirm the cause by reproducing the issue. We should be able to fix this by modifying the condition to mark a node as DOWN. It might also require to reset the nodes states when applying a recipient config. |