Type: Improvement
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 4.2.0-rc3, 4.0.13, 4.3.1
Affects Version/s: None
Component/s: None
Labels: None
Backwards Compatibility: Fully Compatible
Backport Requested: v4.2, v4.0
Sprint: Repl 2019-07-01, Repl 2019-07-15
Case:
Confidence Status: None
Work Order: 3
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

In Atlas you can pause a cluster, effectively shutting the nodes down for a period of time.

Let's assume the pause lasts more than 24 hours and that all nodes are current, having committed all writes. When they are restarted at the same time, we see two nodes run and two branches of history form. Eventually, one node goes into rollback and hits an fassert because the common point is more than 24 hours behind, even though we are only rolling back 1 or 2 very recent oplog entries. In this case the common point is from over 24 hours ago, while the oplog entry immediately after the common point is from less than 5 minutes ago.

While we believe ~~SERVER-40336~~ fixes the problem of two nodes running at the same time, it still makes sense to change this calculation in case a true network partition occurs after unpausing. Resolving this manually is a headache.
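
To make the timing issue concrete, here is a minimal sketch in Python. It is not the server's implementation: the function names, the list-of-timestamps oplog model, and the exact form of the alternative check are assumptions made for illustration. Only the 24-hour threshold and the timings come from the description above. The sketch compares measuring the rollback window from the common point with measuring it from the first oplog entry after the common point.

```python
from datetime import datetime, timedelta

# The description mentions a 24-hour threshold; the constant name is hypothetical.
ROLLBACK_TIME_LIMIT = timedelta(hours=24)


def exceeds_limit_from_common_point(oplog_times, common_index):
    """Measure the window from the common point to the newest entry."""
    return oplog_times[-1] - oplog_times[common_index] > ROLLBACK_TIME_LIMIT


def exceeds_limit_from_first_divergent(oplog_times, common_index):
    """Measure the window from the first entry after the common point."""
    divergent = oplog_times[common_index + 1:]
    if not divergent:
        return False  # nothing to roll back
    return divergent[-1] - divergent[0] > ROLLBACK_TIME_LIMIT


# Scenario from the description: a long pause, then a few recent writes.
now = datetime(2019, 5, 21, 19, 0, 0)
oplog_times = [
    now - timedelta(hours=30),   # common point, written before the pause
    now - timedelta(minutes=4),  # first divergent entry after unpausing
    now - timedelta(minutes=1),  # newest entry on this branch
]

print(exceeds_limit_from_common_point(oplog_times, 0))
print(exceeds_limit_from_first_divergent(oplog_times, 0))
```

Under these assumptions the first check trips the limit because the pause itself is counted, while the second sees only the three minutes of writes that would actually be rolled back.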

Assignee: Jason Chan
Reporter: Alyson Cabral (Inactive)
Participants: Alyson Cabral, Githook User, Jason Chan
Votes: 0
Watchers: 13

Created: May 21 2019 07:21:22 PM UTC
Updated: Oct 29 2023 10:20:48 PM UTC
Resolved: Jul 12 2019 05:16:51 PM UTC
Confidence Status Last Update: 02/Jul/19 8:10 PM
