[SERVER-49297] RSM may not learn about new nodes if failure happens while majority of nodes crash and are repaired. Created: 02/Jul/20  Updated: 29/Oct/23  Resolved: 16/Sep/20

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 4.4.1

Type: New Feature Priority: Minor - P4
Reporter: Lamont Nelson Assignee: Lamont Nelson
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Related
Backwards Compatibility: Fully Compatible
Backport Requested:
v4.4
Participants:

 Description   

In HELP-16677, a failure caused two of the three nodes to crash and left the third node in an unknown failed state. The oplogs of the two crashed replica set members were repaired, and during the repair process those nodes were brought up in standalone mode. The SDAM spec requires that nodes self-reporting as standalone be removed from the TopologyDescription. Once the nodes were restored as replica set members, mongos did not route traffic to them, and a core dump showed that the list of monitored hosts contained only the third node (whose server description had type=Unknown), which could not be contacted. Restarting mongos fixed the problem. This ticket is a placeholder to investigate ways to mitigate this situation without manual intervention.
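The sequence above can be reproduced in a toy model. This is only a sketch of the failure mode, not the server's code: the class ToyTopology, its method names, and the host names are all hypothetical, and the only rule it encodes is the one described in this ticket (a server that self-reports as standalone is dropped from the topology).

```python
from dataclasses import dataclass, field


@dataclass
class ToyTopology:
    """Hypothetical stand-in for a TopologyDescription: host -> server type."""
    servers: dict = field(default_factory=dict)

    def on_response(self, host, server_type):
        """Apply one monitoring response using the pre-fix removal rule."""
        if host not in self.servers:
            return  # hosts that are not monitored are never contacted
        if server_type == "Standalone":
            # Rule described in the ticket: drop self-reported standalones.
            del self.servers[host]
        else:
            self.servers[host] = server_type

    def on_unreachable(self, host):
        """Mark a monitored host that cannot be contacted as Unknown."""
        if host in self.servers:
            self.servers[host] = "Unknown"


topology = ToyTopology({
    "a:27017": "RSPrimary",
    "b:27017": "RSSecondary",
    "c:27017": "RSSecondary",
})

# Node c enters an unknown failed state.
topology.on_unreachable("c:27017")
# Nodes a and b are brought up as standalones while their oplogs are repaired.
topology.on_response("a:27017", "Standalone")
topology.on_response("b:27017", "Standalone")
# Nodes a and b rejoin the replica set, but they are no longer monitored,
# so these responses are never applied.
topology.on_response("a:27017", "RSPrimary")
topology.on_response("b:27017", "RSSecondary")

print(topology.servers)  # {'c:27017': 'Unknown'}
```

In this model nothing short of rebuilding the topology (the equivalent of restarting mongos) brings a and b back, which matches the behavior reported above.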

While there is no guarantee that this would improve liveness, the host lists stored in the config server and/or the initial connection string can be used to instruct the RSM (ReplicaSetMonitor) to monitor nodes that may have been in the replica set in the past, in the case that all current members of the replica set are down for a configured period of time. Adding them to the TopologyDescription as type=Unknown would cause the RSM to contact those nodes at least once without negative effects on the rest of the protocol.

Update: In HELP-16677, we decided to do the following:
1. If all nodes are down for some configurable time period, add in the initial replica set members as type=Unknown.
2. Do not remove type=Standalone servers from the topology description.
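The two decisions can be sketched as follows. This is an illustration under stated assumptions, not the committed change: ToyMonitor, seed_hosts, all_down_timeout, and tick are hypothetical names, time is passed in explicitly as seconds, and "all nodes are down" is interpreted here as "every monitored server has type=Unknown", which the ticket does not spell out.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyMonitor:
    """Hypothetical model of the two mitigations decided in this ticket."""
    seed_hosts: tuple            # initial replica set members (assumed name)
    all_down_timeout: float      # configurable period in seconds (assumed name)
    servers: dict = field(default_factory=dict)  # host -> server type
    all_down_since: Optional[float] = None

    def on_response(self, host, server_type):
        # Decision 2: a standalone response updates the entry; it is not removed.
        if host in self.servers:
            self.servers[host] = server_type

    def on_unreachable(self, host):
        if host in self.servers:
            self.servers[host] = "Unknown"

    def tick(self, now):
        """Periodic check; 'now' is a timestamp in seconds."""
        # Assumption: "all nodes down" means every known server is Unknown.
        all_down = all(t == "Unknown" for t in self.servers.values())
        if not all_down:
            self.all_down_since = None
            return
        if self.all_down_since is None:
            self.all_down_since = now
        elif now - self.all_down_since >= self.all_down_timeout:
            # Decision 1: re-add the initial members as type=Unknown so that
            # each of them is contacted at least once more.
            for host in self.seed_hosts:
                self.servers.setdefault(host, "Unknown")


monitor = ToyMonitor(
    seed_hosts=("a:27017", "b:27017", "c:27017"),
    all_down_timeout=30.0,
    servers={"c:27017": "Unknown"},
)
monitor.tick(now=0.0)    # starts the all-down timer
monitor.tick(now=30.0)   # timeout reached: seed hosts are re-added
print(sorted(monitor.servers))  # ['a:27017', 'b:27017', 'c:27017']

monitor.on_response("a:27017", "Standalone")
print(monitor.servers["a:27017"])  # Standalone
```

Using setdefault keeps any existing entry untouched, so re-adding the seed list never overwrites a server whose type is already known.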



 Comments   
Comment by Githook User [ 20/Aug/20 ]

Author: LaMont Nelson <lamont.nelson@mongodb.com> (lamontnelson)

Message: SERVER-49297: do not stop monitoring standalone node

(cherry picked from commit 0d073ced6c2d652a6543f580390ce3637a280f3c)
Branch: v4.4
https://github.com/mongodb/mongo/commit/0adaca354fe602e8682da3675d7c8ca535c3602d

Comment by Githook User [ 20/Aug/20 ]

Author: LaMont Nelson <lamont.nelson@mongodb.com> (lamontnelson)

Message: SERVER-49297: do not stop monitoring standalone node
Branch: master
https://github.com/mongodb/mongo/commit/0d073ced6c2d652a6543f580390ce3637a280f3c

Comment by Lamont Nelson [ 20/Aug/20 ]

https://mongodbcr.appspot.com/669830011/
