- Type: Task
- Resolution: Unresolved
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Networking & Observability
- 3
- TBD
In HELP-77706, a three-member replica set (instances 0, 1, and 2) got into a state during maintenance in which instances 0 and 2 believed that instance 1 was primary, while instance 1 responded to queries as if it were a secondary. Instance 1's FTDC replication data showed that it considered itself primary, so the problem was a disagreement within instance 1: its replication coordinator was correct, and its replica set monitor was incorrect.
During the investigation, we noticed that the log lines on all instances, but especially on instance 1, were not in chronological order by timestamp (.t.$date): the log regularly jumped backwards in time. In the logs for instances 0 and 2, the jumps were infrequent and almost always exactly one millisecond. On instance 1, the jumps were more frequent and often as large as four minutes.
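For anyone repeating the log analysis, here is a minimal sketch of how backward jumps in .t.$date could be detected. It assumes each log line is a JSON object with an ISO 8601 timestamp at .t.$date; the function names (parse_ts, find_backward_jumps) and the min_jump parameter are hypothetical helpers for illustration, not part of any server tooling.

```python
import json
from datetime import datetime, timedelta


def parse_ts(line):
    """Return the .t.$date value of one JSON log line as an aware datetime."""
    raw = json.loads(line)["t"]["$date"]
    # Older Python versions do not accept a trailing "Z" in fromisoformat().
    if raw.endswith("Z"):
        raw = raw[:-1] + "+00:00"
    return datetime.fromisoformat(raw)


def find_backward_jumps(lines, min_jump=timedelta(0)):
    """Yield (line_number, previous_ts, current_ts) for each backward jump.

    A jump is reported when a line's timestamp is earlier than the
    timestamp of the preceding line by more than min_jump.
    """
    prev = None
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        cur = parse_ts(line)
        if prev is not None and prev - cur > min_jump:
            yield number, prev, cur
        prev = cur
```

Passing min_jump=timedelta(seconds=1) would filter out the one-millisecond jumps and leave only the large ones of the kind seen on instance 1, e.g. `for n, prev, cur in find_backward_jumps(open(path), timedelta(seconds=1)): print(n, prev - cur)`, where path is the log file to inspect.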
I (david.goffredo@mongodb.com) read through the replica set monitor (SDAM) server code to try to find a way that these large backward jumps in time could put the server into a state where the replica set monitor thinks that the instance is secondary and ignores evidence to the contrary from the replication coordinator. I was unable to find a possible cause.
This ticket is to continue investigating that hypothesis and, if possible, to create a test that reproduces the pathological state seen in HELP-77706.
- related to: SERVER-107659 Test RSM when host clock experience major jump back and forth in time (Closed)