Type: Improvement
Resolution: Won't Do
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: Replication
Labels: None
Sprint: Repl 2020-03-09
Case:
Confidence Status: None
Work Order: 3
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

Sorry if the title is confusing. Let me provide some background.

If a majority of a replica set is down for an extended period, writes can be acknowledged without reaching any other member, and RCM can saturate the cache overflow. A PSA deployment, for example, is exposed to this problem.

This also happens in a PSS where both secondaries are up but in initial sync, in oplog recovery, or lagging, and P has been up for an extended period. In this case P pushes all writes into cache overflow, which can eventually stall the node or cause it to run out of memory.

As a note, this applies primarily to systems where:

1. w:1 writes are done
2. secondaries are "up" but not available (i.e. lagging significantly or in recovery/startup2)

There are some consequences of this:

Recovery for days

If the Primary is restarted for any reason, it will need to apply the oplog from the common point in time (where both secondaries originally became unavailable for majority commit), and it will have a lot of data from cache overflow to apply.

This means that it could take DAYS for the recovery to complete and the Primary to become available. Further, due to SERVER-36495, if the Primary fails while recovering, it will have to go through the entire process again because we don't move the commit point forward. We can also encounter this as a consequence of SERVER-34938.

Data Loss

If a Secondary that is days behind at this point manages to come online and assumes the Primary role, it may begin accepting writes. When the former Primary comes up again, it will be forced into ROLLBACK, and the customer is left with 3 very undesirable options:
1. Perform an initial sync and lose all the data that didn't make it to the Secondary.
2. Let the former Primary complete recovery (days of waiting), force an election, and effectively lose the "new" data.
3. Force the Secondary-that-became-the-Primary into rollback and hopefully get some semblance of data continuity.

At the moment, the most effective "workaround" for this is to set votes:0 for the "down" Secondaries. However, in a scenario where both Secondaries are down and the Primary is in oplog apply mode, this isn't possible.
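
The votes:0 workaround amounts to a replica set reconfiguration. As a minimal sketch, assuming the config document has the shape returned by `replSetGetConfig` (a `version` integer and a `members` list with `host`, `votes` and `priority` fields), the helper below builds the proposed config. The function name `demote_unavailable_members` and the host names are hypothetical. It also sets `priority` to 0, since a non-voting member must have priority 0, and bumps `version`, which a reconfig requires. Sending the result to the primary with `replSetReconfig` is not shown.

```python
import copy


def demote_unavailable_members(config, unavailable_hosts):
    """Return a copy of a replica set config with the given hosts made non-voting.

    `config` is assumed to look like the config document returned by
    replSetGetConfig. The input is not modified.
    """
    new_config = copy.deepcopy(config)
    for member in new_config["members"]:
        if member["host"] in unavailable_hosts:
            member["votes"] = 0
            # Non-voting members must also have priority 0.
            member["priority"] = 0
    # A reconfig must carry a higher version than the current config.
    new_config["version"] = config["version"] + 1
    return new_config


# Hypothetical three-member PSS set where both secondaries are unavailable.
current = {
    "_id": "rs0",
    "version": 4,
    "members": [
        {"_id": 0, "host": "p.example.net:27017", "votes": 1, "priority": 1},
        {"_id": 1, "host": "s1.example.net:27017", "votes": 1, "priority": 1},
        {"_id": 2, "host": "s2.example.net:27017", "votes": 1, "priority": 1},
    ],
}
proposed = demote_unavailable_members(
    current, {"s1.example.net:27017", "s2.example.net:27017"}
)
```

With both secondaries non-voting, the Primary alone forms the voting majority, which is what lets the majority commit point advance again.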

The above breaks some basic expectations that customers have of replica sets, namely: "if it's accepting writes, then I won't lose data." In reality, there could be significant data loss.

We should have a mechanism that kicks in after some predetermined amount of time if P is still writing to cache overflow because the secondaries are unavailable. It should do one of the following: automatically set votes:0 and allow the majority commit point to move forward, error out and stop all writes, or alert the user to the compromised state of their deployment and the actions they can take.
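
To make the proposal concrete, here is a rough sketch of the time-based check, not an existing server feature. It assumes the primary can observe two wall-clock values in seconds: the time of its last applied oplog entry and the time of the majority commit point (values of this kind might be derived from the `optimes` section of `replSetGetStatus`). The names `CommitLagSample` and `check_commit_lag` and both thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CommitLagSample:
    """One observation taken on the primary (both fields are assumptions)."""
    applied_secs: int    # wall-clock seconds of the last applied oplog entry
    committed_secs: int  # wall-clock seconds of the majority commit point


def check_commit_lag(sample, warn_after_secs, act_after_secs):
    """Classify how far the majority commit point trails the primary.

    Returns a (state, lag_secs) tuple where state is "ok", "warn" or "act".
    What "act" means (set votes:0, stop writes, or alert) is left to the caller.
    """
    lag = max(0, sample.applied_secs - sample.committed_secs)
    if lag >= act_after_secs:
        return "act", lag
    if lag >= warn_after_secs:
        return "warn", lag
    return "ok", lag


# Hypothetical thresholds: warn after 5 minutes, act after 1 hour.
healthy = check_commit_lag(CommitLagSample(10_000, 9_990), 300, 3600)
lagging = check_commit_lag(CommitLagSample(10_000, 9_400), 300, 3600)
stalled = check_commit_lag(CommitLagSample(10_000, 1_000), 300, 3600)
```

Separating the classification from the action keeps the policy choice among the three options above open.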

Assignee: Evin Roesle
Reporter: Shakir Sadikali
Participants: Evin Roesle, Judah Schvimer, Shakir Sadikali
Votes: 0
Watchers: 19

Created: Feb 18 2020 09:16:50 PM UTC
Updated: Mar 30 2020 04:47:50 PM UTC
Resolved: Mar 30 2020 04:03:38 PM UTC