Type: Improvement
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 5.0.0-rc4, 4.4.9, 5.1.0-rc0
Affects Version/s: None
Component/s: None
Labels:
None

Backwards Compatibility:
Fully Compatible
Backport Requested:

v5.0, v4.4
Sprint:
Service Arch 2021-06-28
Confidence Status:
None
Work Order:
3
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

The rsm_horizon_change.js test expects to see a specific log line (https://github.com/mongodb/mongo/blob/e706abcecab992d5b2bf7f1806a90bd92e860c2d/jstests/noPassthrough/rsm_horizon_change.js#L30) after a split-horizon reconfig. Specifically, it expects this log line to carry topologyType ReplicaSetNoPrimary and type Unknown, indicating that the split-horizon reconfig initially results in an unknown server description.

Currently, this log line is emitted because StreamableReplicaSetMonitor::onTopologyDescriptionChangedEvent is called after TopologyManager::onServerDescription, which runs once the RSM receives an error response from the remote node following the reconfig. This code path is reached only if, after the error response is received, the helloOutcome for that response is set.

However, this hello response is only set if the received error is not a network error. Before SERVER-49435, we erroneously translated network errors into CommandResultSchemaViolation in the NetworkInterfaceTL by calling getStatusFromCommandResult on the response without checking the response's internal status. As a result, the RSM emitted the expected log line via the code path described above. After fixing this and correctly propagating network errors like HostUnreachable from the networkInterface, the RSM no longer sets the helloResponse after receiving the error (because it is a network error). Instead of emitting the expected log line, it drops connections to the remote it received the error from and monitors the RS in expedited mode until it detects a primary.
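The difference between the two behaviors can be sketched as a toy model. This is not the server's C++ code: the function names `statusSeenByRsm` and `rsmReaction`, the `NETWORK_ERRORS` set, and the shape of the returned objects are all assumptions made for illustration. Only the error names and the described outcomes come from the description above.

```javascript
// Toy model only: all names and structure here are hypothetical,
// not the actual NetworkInterfaceTL or RSM implementation.
const NETWORK_ERRORS = new Set(["HostUnreachable"]);

// Models which status reaches the RSM. When the internal status is not
// checked (the old behavior), a network error is mistranslated into
// CommandResultSchemaViolation.
function statusSeenByRsm(internalStatus, checksInternalStatus) {
    if (!checksInternalStatus && NETWORK_ERRORS.has(internalStatus)) {
        return "CommandResultSchemaViolation";
    }
    return internalStatus;
}

// Models the RSM's reaction: the hello outcome is set only for errors
// that are not network errors.
function rsmReaction(status) {
    if (NETWORK_ERRORS.has(status)) {
        return {
            setsHelloOutcome: false,
            action: "drop connections, monitor in expedited mode",
        };
    }
    return {
        setsHelloOutcome: true,
        action: "emit topology-changed log line",
    };
}
```

In this model, `rsmReaction(statusSeenByRsm("HostUnreachable", false))` takes the log-emitting path, while passing `true` takes the expedited-monitoring path, which is the change the test needs to tolerate.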

Note that this behavior is still correct: the host that receives the network error will simply monitor the RS until it has a new primary, and then continue as usual. We just need to allow the test to accept the alternate log line in this case.
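One possible shape for the relaxed check is a helper that passes if any one of several matchers finds its line. This is a minimal sketch, not the change that was committed: `sawAnyExpectedLine` and `isUnknownServerDescription` are hypothetical names, log lines are treated as plain strings, and the matcher for the alternate line is left to the caller because its exact text is not given here.

```javascript
// Hypothetical helper: returns true if at least one log line satisfies
// at least one of the supplied matcher functions.
function sawAnyExpectedLine(logLines, matchers) {
    return logLines.some(line => matchers.some(matches => matches(line)));
}

// Matcher for the originally expected line, based only on the two values
// named in the description (topologyType ReplicaSetNoPrimary, type Unknown).
// Matching on substrings is an assumption about the log format.
const isUnknownServerDescription = line =>
    line.includes("ReplicaSetNoPrimary") && line.includes("Unknown");
```

A test would call `sawAnyExpectedLine(lines, [isUnknownServerDescription, isAlternateLine])`, where `isAlternateLine` is whatever predicate identifies the log line emitted on the network-error path.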

is depended on by

SERVER-49435 uassert in NetworkInterfaceTL::setTimer can cause server to crash if connection future not immediately ready

Closed

is related to

SERVER-62881 Make rsm_horizon_change.js unknown server description test resilient to RSM state transition

Closed

Assignee: George Wangensteen (Inactive)
Reporter: George Wangensteen (Inactive)
Participants: George Wangensteen, Githook User, Vivian Ge
Votes: 0
Watchers: 1

Created: Jun 21 2021 07:15:33 PM UTC
Updated: Oct 29 2023 09:51:49 PM UTC
Resolved: Jun 22 2021 06:13:04 PM UTC
