- Type: Task
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Labels:
Description
SERVER ticket description:
In Atlas you can pause a cluster, effectively shutting the nodes down for a period of time.
Let's assume we pause for more than 24 hours and that all the nodes are current, having committed all the writes. When they are restarted at the same time, we are seeing two nodes run and two branches of history form. Eventually, one goes into rollback and hits an fassert because the common point is more than 24 hours behind, even though we are only rolling back 1 or 2 very recent oplog entries. In this case, the common point is from over 24 hours ago, while the oplog entry immediately after the common point is from less than 5 minutes ago.
While we believe SERVER-40336 fixes the problem of two nodes running at the same time, it still makes sense to change this calculation in case true network partitions occur after unpausing. Resolving this manually is a headache.
Change Description:
The rollback time limit is no longer calculated between the top of the oplog and the common point; it is now calculated between the top of the oplog and the first operation after the common point. The time limit is still 24 hours.
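The difference between the two calculations can be sketched as follows. This is only an illustration of the check described above, not the server's implementation: the function names, the list-of-timestamps oplog model, and the sample times are all hypothetical, and the 24-hour limit is taken from this ticket.

```python
from datetime import datetime, timedelta

# 24-hour limit as stated in the ticket (illustrative constant, not server code).
ROLLBACK_TIME_LIMIT = timedelta(hours=24)


def exceeds_limit_old(oplog_times, common_point_index):
    """Old check: compare the top of the oplog with the common point itself."""
    top = oplog_times[-1]
    return top - oplog_times[common_point_index] > ROLLBACK_TIME_LIMIT


def exceeds_limit_new(oplog_times, common_point_index):
    """New check: compare the top of the oplog with the first entry after the common point."""
    first_after_index = common_point_index + 1
    if first_after_index >= len(oplog_times):
        # The common point is the top of the oplog, so nothing would be rolled back.
        return False
    top = oplog_times[-1]
    return top - oplog_times[first_after_index] > ROLLBACK_TIME_LIMIT


# Hypothetical paused-cluster scenario: the common point was written 30 hours
# ago, and the only entries to roll back were written a few minutes ago.
now = datetime(2019, 5, 1, 12, 0, 0)
oplog_times = [
    now - timedelta(hours=30),   # common point (index 0)
    now - timedelta(minutes=4),  # first entry after the common point
    now - timedelta(minutes=1),  # top of the oplog
]

old_result = exceeds_limit_old(oplog_times, 0)
new_result = exceeds_limit_new(oplog_times, 0)
```

In this scenario the old check measures almost 30 hours and reports the limit as exceeded, while the new check measures 3 minutes and allows the rollback to proceed.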
Scope of changes
Impact to Other Docs
MVP (Work and Date)
Resources (Scope or Design Docs, Invision, etc.)
- documents: SERVER-41261 Use the oplog entry after the common point to calculate rollbackTimeLimitSecs (Closed)