[SERVER-46237] Extended majority downtime can cause consequences with majority read concern Created: 18/Feb/20  Updated: 30/Mar/20  Resolved: 30/Mar/20

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Shakir Sadikali Assignee: Evin Roesle
Resolution: Won't Do Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
Sprint: Repl 2020-03-09
Participants:
Case:

 Description   

Sorry if the title is confusing.  Let me provide some background.

If a majority of a replica set is down for an extended period of time, it's possible for writes to be acknowledged without ever reaching any other member, and for majority read concern (RCM) to cause cache overflow saturation.  For example, this is one of the problems in a PSA deployment.

This also happens in a PSS deployment where both secondaries are up but in initial sync, oplog recovery, or lagging, and the Primary has been up for an extended period of time.  In this case the Primary will push all writes into the cache overflow table, which can eventually stall the node or cause it to OOM.

As a note: this applies primarily to systems where

  • w:1 writes are done (see the sketch after this list)
  • secondaries are "up" but not available (i.e. lagging significantly or in RECOVERING/STARTUP2)
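A minimal pymongo sketch of that write pattern, purely for illustration (the hosts, database, and collection names here are assumptions):

    from pymongo import MongoClient, WriteConcern

    # Hypothetical replica set connection; hosts are placeholders.
    client = MongoClient("mongodb://host1,host2,host3/?replicaSet=rs0")

    # w:1 writes are acknowledged by the Primary alone, so they keep
    # succeeding even while both Secondaries are in initial sync,
    # recovery, or far behind.
    events = client["app"].get_collection("events",
                                          write_concern=WriteConcern(w=1))
    events.insert_one({"event": "example"})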

There are some consequences of this

Recovery for days

If, for some reason, the Primary is restarted, it will need to apply oplog from the common point in time (where both Secondaries originally became unavailable for majority commit), and it will have a lot of data from the cache overflow table to apply.

This means that it could take DAYS for the recovery to complete and the Primary to become available.  Further, due to SERVER-36495, if the Primary fails while recovering, it will have to go through the entire process again because we don't move the commit point forward.  We can also encounter this as a consequence of SERVER-34938.

Data Loss

If a Secondary manages to come online (days behind at this point) and assumes the Primary role, it may begin accepting writes.  When the former Primary comes up again it will be forced into ROLLBACK, and the customer is left with three very undesirable options:
1. Initial sync the former Primary and lose all the data that never made it to the Secondary.
2. Let the former Primary complete recovery (days of waiting), then force an election and effectively lose the "new" data.
3. Force the Secondary-that-became-Primary into rollback and hope for some semblance of data continuity.

At the moment, the most effective "workaround" is to set votes:0 on the "down" Secondaries (a sketch of such a reconfig follows).  However, in a scenario where both Secondaries are down and the Primary is in oplog-apply mode, this isn't possible.
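For reference, a hedged pymongo sketch of that reconfig, run against the current Primary (the hostname and the member index of the "down" Secondary are assumptions; a member set to votes:0 must also have priority 0):

    from pymongo import MongoClient

    # Connect directly to the current Primary; the host is a placeholder.
    client = MongoClient("mongodb://primary-host:27017/?directConnection=true")

    # Fetch the current config, bump its version, and strip the vote from
    # the "down" Secondary (member index 1 is only an example).
    cfg = client.admin.command("replSetGetConfig")["config"]
    cfg["version"] += 1
    cfg["members"][1]["votes"] = 0
    cfg["members"][1]["priority"] = 0
    client.admin.command("replSetReconfig", cfg)

With that member no longer counted toward the voting majority, the majority commit point can advance on the remaining voting members.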

The above breaks some basic expectations that customers have of replica sets, namely: "if it's accepting writes, then I won't lose data."  In reality, there could be significant data loss.

We should have a mechanism where, if P is still writing to cache overflow after some pre-determined amount of time (because the S's are unavailable), we either automatically set votes:0 and allow the majority commit point to move forward, or error out and stop all writes, or add logic that alerts the user to the compromised state of their deployment and the actions they can take (a sketch of such an external check follows).
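As one possible shape of the "alert the user" option, an external check could compare the Primary's applied optime with the majority commit point reported by replSetGetStatus; the lag threshold below is an arbitrary assumption, not a recommendation:

    from pymongo import MongoClient

    LAG_THRESHOLD_SECS = 3600  # arbitrary alerting threshold (assumption)

    client = MongoClient("mongodb://primary-host:27017/?directConnection=true")
    status = client.admin.command("replSetGetStatus")

    # Seconds by which the majority commit point trails what the Primary
    # has already applied (and acknowledged to w:1 clients).
    applied = status["optimes"]["appliedOpTime"]["ts"].time
    committed = status["optimes"]["lastCommittedOpTime"]["ts"].time
    lag_secs = applied - committed

    if lag_secs > LAG_THRESHOLD_SECS:
        print(f"majority commit point is {lag_secs}s behind the Primary; "
              "acknowledged writes are not majority-committed")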



 Comments   
Comment by Evin Roesle [ 30/Mar/20 ]

Closing this ticket, as we don't want to automatically change the majority number of a cluster out from under a user. We recommend users set wc:majority in their clusters and, in Atlas, in their connection strings (an illustration follows). If you would like Cloud to consider a possible alert mechanism, please submit a Cloud ticket.
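For illustration only, a majority default write concern can be requested in the connection string; the hosts and the wtimeoutMS value here are assumptions:

    from pymongo import MongoClient

    # w=majority makes majority acknowledgement the default for writes, so
    # they time out (rather than silently piling up as w:1 acknowledgements)
    # when a majority of the set cannot replicate them.
    client = MongoClient(
        "mongodb://host1,host2,host3/?replicaSet=rs0"
        "&w=majority&wtimeoutMS=10000")
    client["app"]["events"].insert_one({"event": "example"})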

Comment by Judah Schvimer [ 24/Feb/20 ]

Hi shakir.sadikali,

Thanks for the suggestion! You mention two different ideas:

automatically set votes:0 and allow the majority commit point to move forward

Up until now, we have not done any automatic reconfigs in the database. We are currently in the middle of doing the first automatic reconfigs with the "Initial Sync Semantics (ISS)" project. This is only possible now that reconfigs are safe with the "Safe Reconfig" project. That said, we generally do not want to do automatic reconfigs in the server itself, especially when they could change the majority number. This could prevent the user from getting the guarantees and safety they expect, and we believe the database should not make decisions around availability/consistency tradeoffs for the user. In ISS, the automatic reconfig is just moving when a user requested reconfig comes into effect, not doing a reconfig the user did not request at all. Durable history in WT should improve the cache pressure in this circumstance, but unfortunately won't help with some of the other problems you identify.

error out and stop all writes

This is effectively what "Replica Set Flow Control" does. If the commit point starts lagging, the primary throttles writes. It never stops them entirely, but it slows them down significantly to prevent lag from accumulating.
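As a rough illustration (not the Flow Control implementation itself), the throttling state is visible in serverStatus and the lag target is a tunable server parameter; the value set below is only an example:

    from pymongo import MongoClient

    client = MongoClient("mongodb://primary-host:27017/?directConnection=true")

    # serverStatus().flowControl reports whether the Primary currently
    # considers itself lagged and the per-second ticket rate it enforces.
    flow = client.admin.command("serverStatus")["flowControl"]
    print(flow.get("isLagged"), flow.get("targetRateLimit"))

    # flowControlTargetLagSeconds (default 10) controls how much majority
    # commit point lag Flow Control tolerates before throttling writes.
    client.admin.command({"setParameter": 1,
                          "flowControlTargetLagSeconds": 10})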

I think that in cases like these, "Replica Set Flow Control" or user-initiated reconfigs to votes:0 are the best path forward. If flow-control is not sufficiently throttling writes, we could make improvements to that feature to address your concerns. Are there improvements there you're interested in? Additionally, it is best practice to always reconfig nodes to have votes:0 before initial syncing them, so doing that will help.

Did I understand and address your concerns/suggestions correctly?
