- Type: New Feature
- Resolution: Fixed
- Priority: Minor - P4
- Affects Version/s: None
- Component/s: None
- None
- Fully Compatible
- v4.4
In HELP-16677, two of the three nodes in a replica set crashed, and the third node was left in an unknown failed state. The oplogs of the two crashed replica set members were repaired. During the repair process, those nodes were stood up in standalone mode. The SDAM spec requires that nodes self-reporting as standalone be removed from the TopologyDescription. Once the nodes were restored as replica set members, mongos did not route traffic to them, and a core dump showed that the only host still being monitored was the third node (whose server description had type=Unknown), which could not be contacted. Restarting mongos fixed the problem. This ticket is a placeholder to investigate ways to mitigate this situation without manual intervention.
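The failure mode described above can be illustrated with a minimal Python sketch. This is not the mongos implementation: the `TopologyDescription` class, its method names, and the host addresses are hypothetical, and only the "remove servers that report Standalone" rule is taken from the description.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class TopologyDescription:
    # Hypothetical model: host address -> last known server type.
    servers: Dict[str, str] = field(default_factory=dict)

    def on_check_succeeded(self, host: str, server_type: str) -> None:
        """Apply the result of a successful monitoring check."""
        if host not in self.servers:
            return  # host is not monitored, so nothing to update
        if server_type == "Standalone":
            # SDAM rule for replica set topologies: drop the server.
            del self.servers[host]
        else:
            self.servers[host] = server_type

    def on_check_failed(self, host: str) -> None:
        """A failed check resets the server description to Unknown."""
        if host in self.servers:
            self.servers[host] = "Unknown"


# Hypothetical three-member replica set.
topology = TopologyDescription({
    "a:27017": "RSPrimary",
    "b:27017": "RSSecondary",
    "c:27017": "RSSecondary",
})

# The third node becomes unreachable.
topology.on_check_failed("c:27017")
# The other two nodes are restarted in standalone mode for repair.
topology.on_check_succeeded("a:27017", "Standalone")
topology.on_check_succeeded("b:27017", "Standalone")

# Only the unreachable node is still monitored, so the repaired nodes are
# never rediscovered once they rejoin the replica set.
print(topology.servers)
```

In this model the only way back to a working topology is to rebuild the host list, which corresponds to the mongos restart that was needed in practice.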
While there is no guarantee that this would improve liveness, the host lists stored in the config server and/or the initial connection string could be used to instruct the RSM to monitor nodes that may have been in the replica set in the past, in the case that all current members of the replica set have been down for a configured period of time. Adding them to the TopologyDescription as type=Unknown would cause the RSM to contact those nodes at least once, without negative effects on the rest of the protocol.
Update: In HELP-16677, we decided to do the following:
1. If all nodes are down for some configurable time period, add in the initial replica set members as type=Unknown.
2. Do not remove type=Standalone servers from the topology description.
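The two decisions above can be sketched as follows. This is a simplified Python model under stated assumptions, not the actual RSM code: the `Monitor` class, the `tick` and `record` methods, the `all_down_timeout` parameter name, the 30-second value, and the host addresses are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Monitor:
    seed_hosts: List[str]        # initial replica set members (assumed source: connection string)
    all_down_timeout: float      # hypothetical configurable period, in seconds
    servers: Dict[str, str] = field(default_factory=dict)
    all_down_since: Optional[float] = None

    def record(self, host: str, server_type: str) -> None:
        """Store a check result. Decision 2: Standalone servers are kept."""
        if host in self.servers:
            self.servers[host] = server_type

    def tick(self, now: float) -> None:
        """Decision 1: re-add the seed hosts once every server has been down long enough."""
        if any(t != "Unknown" for t in self.servers.values()):
            self.all_down_since = None  # at least one server is reachable
            return
        if self.all_down_since is None:
            self.all_down_since = now   # start timing the outage
        elif now - self.all_down_since >= self.all_down_timeout:
            for host in self.seed_hosts:
                # Existing entries are left untouched; missing ones are added as Unknown.
                self.servers.setdefault(host, "Unknown")


monitor = Monitor(
    seed_hosts=["a:27017", "b:27017", "c:27017"],
    all_down_timeout=30.0,
    servers={"c:27017": "Unknown"},  # only the unreachable node is left
)
monitor.tick(now=0.0)    # outage timer starts
monitor.tick(now=10.0)   # still within the timeout, nothing changes
before = sorted(monitor.servers)
monitor.tick(now=30.0)   # timeout reached, seed hosts are re-added
after = sorted(monitor.servers)
monitor.record("a:27017", "Standalone")  # kept rather than removed
```

Because the re-added hosts enter as type=Unknown, they are contacted like any other monitored server, and a node that is still in standalone mode stays in the list until it reports as a replica set member again.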