  Core Server / SERVER-57893

Make rsm_horizon_change.js resilient to network failures

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 5.0.0-rc4, 4.4.9, 5.1.0-rc0
    • Affects Version/s: None
    • Component/s: None
    • Labels:
    • Backwards Compatibility: Fully Compatible
    • Backport Requested: v5.0, v4.4
    • Sprint: Service Arch 2021-06-28

      The rsm_horizon_change.js test expects to see a specific log line (https://github.com/mongodb/mongo/blob/e706abcecab992d5b2bf7f1806a90bd92e860c2d/jstests/noPassthrough/rsm_horizon_change.js#L30) after a split-horizon reconfig. Specifically, it expects this log line to report topologyType ReplicaSetNoPrimary and type Unknown, indicating that the split-horizon reconfig initially results in an unknown server description.

      Currently, this log line is emitted because, when the RSM receives an error response from the remote node after the reconfig, TopologyManager::onServerDescription is called, followed by StreamableReplicaSetMonitor::onTopologyDescriptionChangedEvent. But this code path is only reached if, after receiving the error response, the helloOutcome for the response is set.

      However, this hello response is only set if the received error is not a network error. Before SERVER-49435, we erroneously translated network errors into CommandResultSchemaViolation in the NetworkInterfaceTL by calling getStatusFromCommandResult on the response without checking the response's internal status. As a result, the RSM emitted the expected log line via the code path described above. After that fix, network errors such as HostUnreachable are correctly propagated from the NetworkInterface, so the RSM no longer sets the helloResponse after receiving the error (because it is a network error). Instead of emitting the expected log line, it drops connections to the remote it received the error from and monitors the RS in expedited mode until it detects a primary.
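
      The difference between the two behaviors can be modeled with a small sketch. This is not MongoDB source code: the function names, the returned fields, and the set of network error names are hypothetical and serve only to illustrate the branch described above.

```javascript
// Illustrative model only; none of these names exist in the MongoDB codebase.
// Assumption: a small, illustrative subset of error names treated as network errors.
const NETWORK_ERROR_NAMES = new Set(["HostUnreachable", "NetworkTimeout"]);

// Models the pre-SERVER-49435 behavior described above: a network error
// surfaces as CommandResultSchemaViolation instead of its real name.
function translateLikeOldNetworkInterface(errorName) {
  return NETWORK_ERROR_NAMES.has(errorName)
    ? "CommandResultSchemaViolation"
    : errorName;
}

// Models how the monitor reacts to a failed hello, depending on whether
// the error it sees is a network error.
function reactToHelloError(errorName) {
  if (NETWORK_ERROR_NAMES.has(errorName)) {
    // Network error: no hello outcome is set, so the topology-change
    // log line the test waits for is never emitted on this path.
    return {
      helloOutcomeSet: false,
      emitsTopologyChangeLog: false,
      actions: ["dropConnections", "monitorExpedited"],
    };
  }
  // Any other error: the outcome is set and the topology is updated,
  // which leads to the topology-change log line.
  return {
    helloOutcomeSet: true,
    emitsTopologyChangeLog: true,
    actions: ["onServerDescription", "onTopologyDescriptionChangedEvent"],
  };
}

// Same underlying failure, seen before and after the fix.
const beforeFix = reactToHelloError(translateLikeOldNetworkInterface("HostUnreachable"));
const afterFix = reactToHelloError("HostUnreachable");
console.log(beforeFix.emitsTopologyChangeLog, afterFix.emitsTopologyChangeLog);
```

      In this model the same HostUnreachable failure produces the log line only while it is mistranslated, which is why the test started depending on the network behaving.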

      Note that this behavior is still correct: the host that receives the network error simply monitors the RS until it has a new primary, and then continues as usual. We just need to allow the test to accept the alternate log line in this case.
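
      One way to picture "accept either log line" is a check that passes when any captured line satisfies any of a list of matchers. The sketch below is plain JavaScript, not the actual test change: the helper names are hypothetical, the substring matching is a simplification of real log parsing, and the text of the alternate line is left as a placeholder because the ticket does not quote it.

```javascript
// Hypothetical helpers; not taken from rsm_horizon_change.js.

// Matches the line the test originally waited for. Assumption: the line
// mentions both the topology type and the server type as plain text.
function isUnknownTopologyLine(line) {
  return line.includes("ReplicaSetNoPrimary") && line.includes("Unknown");
}

// Builds a matcher for a line containing the given text. The caller
// supplies the text of the alternate line.
function makeSubstringMatcher(text) {
  return (line) => line.includes(text);
}

// True if at least one log line satisfies at least one matcher.
function sawAcceptedLine(logLines, matchers) {
  return logLines.some((line) => matchers.some((matches) => matches(line)));
}

// Placeholder standing in for the alternate line's real text.
const ALTERNATE_LINE_TEXT = "<alternate log line text>";
const matchers = [isUnknownTopologyLine, makeSubstringMatcher(ALTERNATE_LINE_TEXT)];

// Made-up log lines for demonstration only.
const originalPath = ["topologyType: ReplicaSetNoPrimary, server type: Unknown"];
const networkErrorPath = ["monitor saw " + ALTERNATE_LINE_TEXT];
const neitherPath = ["unrelated line"];
```

      With these inputs, `sawAcceptedLine(originalPath, matchers)` and `sawAcceptedLine(networkErrorPath, matchers)` both pass, while `sawAcceptedLine(neitherPath, matchers)` does not, so the check no longer depends on whether the failure surfaced as a network error.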

            Assignee: george.wangensteen@mongodb.com George Wangensteen
            Reporter: george.wangensteen@mongodb.com George Wangensteen