[SERVER-46237] Extended majority downtime can cause consequences with majority read concern Created: 18/Feb/20 Updated: 30/Mar/20 Resolved: 30/Mar/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Shakir Sadikali | Assignee: | Evin Roesle |
| Resolution: | Won't Do | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Sprint: | Repl 2020-03-09 |
| Participants: | |
| Case: | (copied to CRM) |
| Description |
|
Sorry if the title is confusing; let me provide some background. If a majority of a replica set is down for an extended period of time, it is possible for writes to be acknowledged without reaching any other member, and for majority read concern (RCM) to saturate the cache overflow table. A PSA (primary-secondary-arbiter) deployment is one example of this problem. It also happens in a PSS deployment where both secondaries are up but stuck in initial sync, oplog recovery, or severe lag, while the primary has been up for an extended period. In that case the primary pushes all writes into cache overflow, which can eventually stall the node or cause it to run out of memory. As a note: this applies primarily to systems where majority read concern is enabled.
There are several consequences of this.

Recovery for days: If, for some reason, the primary is restarted, it will need to apply oplog from the common point in time (where both secondaries originally became unavailable for majority commit), and it will have a large amount of cache overflow data to work through. This means recovery could take days before the primary becomes available again.

Data loss: If a secondary manages to come online (now days behind) and assumes the primary role, it may begin accepting writes. When the former primary comes back up it will be forced into ROLLBACK, and the customer is left with three very undesirable options.

At the moment, the most effective "workaround" is to set votes:0 on the "down" secondaries. However, in a scenario where both secondaries are down and the primary is in oplog-apply mode, this isn't possible. The above breaks a basic expectation customers have of replica sets, namely: if it's accepting writes, I won't lose data. In reality, there could be significant data loss. We should have a mechanism where, after some pre-determined amount of time, if the primary is still writing to cache overflow (because the secondaries are unavailable), we either automatically set votes:0 and allow the majority commit point to move forward, or error out and stop all writes, or add logic that alerts the user to the compromised state of their deployment and the actions they can take. |
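A minimal sketch of the votes:0 workaround, assuming a PyMongo client, a hypothetical replica set rs0, and placeholder host names: it reads how far the majority commit point lags the primary's applied optime via replSetGetStatus, then reconfigures the unavailable secondaries to votes:0 / priority:0 so the majority commit point can advance with the primary alone.

```python
# Sketch of the manual votes:0 workaround (host names are placeholders).
from pymongo import MongoClient

client = MongoClient("mongodb://primary.example.net:27017")

# 1. How far has the majority commit point fallen behind the primary's
#    applied optime?
status = client.admin.command("replSetGetStatus")
applied_ts = status["optimes"]["appliedOpTime"]["ts"]
committed_ts = status["optimes"]["lastCommittedOpTime"]["ts"]
print(f"majority commit point lags by ~{applied_ts.time - committed_ts.time}s")

# 2. If the lag comes from unavailable secondaries, remove their votes so the
#    majority commit point can move forward again.
cfg = client.admin.command("replSetGetConfig")["config"]
for member in cfg["members"]:
    if member["host"] in ("secondary1.example.net:27017",
                          "secondary2.example.net:27017"):
        member["votes"] = 0
        member["priority"] = 0   # a non-voting member must have priority 0
cfg["version"] += 1              # replSetReconfig requires a bumped version
client.admin.command("replSetReconfig", cfg)
```

As the description notes, this manual intervention is exactly what cannot be performed while the primary is still in oplog-apply mode, which is why the ticket asks for the server to automate it or at least surface the condition.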
| Comments |
| Comment by Evin Roesle [ 30/Mar/20 ] |
|
Closing this ticket, as we don't want to automatically change a cluster's majority number out from under the user. We recommend users set a write concern of "majority" for their clusters and, in Atlas, in their connection strings. If you would like Cloud to consider a possible alert mechanism, please submit a Cloud ticket. |
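As a small sketch of that recommendation, assuming placeholder hosts and a replica set named rs0, w:"majority" can be requested on the connection string or on the client:

```python
# Request w:"majority" so writes are only acknowledged once they are
# majority-committed and therefore cannot be lost to a rollback.
from pymongo import MongoClient

uri = ("mongodb://host1.example.net:27017,host2.example.net:27017,"
       "host3.example.net:27017/?replicaSet=rs0&w=majority")
client = MongoClient(uri)

# Equivalent client-level setting:
client = MongoClient("mongodb://host1.example.net:27017/?replicaSet=rs0",
                     w="majority")
```

With a majority of the set unavailable, such writes block (or time out if a write-concern timeout is set) instead of being acknowledged and later rolled back, which addresses the "acknowledged but lost" concern in the description.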
| Comment by Judah Schvimer [ 24/Feb/20 ] |
|
Hi shakir.sadikali, Thanks for the suggestion! You mention two different ideas:
1. Automatically set votes:0 on the unavailable secondaries so the majority commit point can move forward.
2. Error out and stop all writes.
Up until now, we have not done any automatic reconfigs in the database. We are currently in the middle of doing the first automatic reconfigs with the "Initial Sync Semantics (ISS)" project. This is only possible now that reconfigs are safe, thanks to the "Safe Reconfig" project. That said, we generally do not want to do automatic reconfigs in the server itself, especially when they could change the majority number. Doing so could prevent the user from getting the guarantees and safety they expect, and we believe the database should not make availability/consistency tradeoffs for the user. In ISS, the automatic reconfig just changes when a user-requested reconfig takes effect; it is not a reconfig the user did not request at all. Durable history in WT should improve the cache pressure in this circumstance, but unfortunately it won't help with some of the other problems you identify.
Erroring out and stopping writes is effectively what "Replica Set Flow Control" approximates: if the commit point starts lagging, the primary throttles writes. It never stops them entirely, but it slows them down significantly to prevent lag from accumulating. I think that in cases like these, "Replica Set Flow Control" or user-initiated reconfigs to votes:0 are the best path forward. If flow control is not sufficiently throttling writes, we could make improvements to that feature to address your concerns. Are there improvements there you're interested in? Additionally, it is best practice to always reconfigure nodes to votes:0 before initial-syncing them, so doing that will help. Did I understand and address your concerns/suggestions correctly? |
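For reference, flow control is governed by the enableFlowControl and flowControlTargetLagSeconds server parameters, and its current behavior is reported in the flowControl section of serverStatus. A short sketch, assuming a placeholder host name:

```python
# Inspect and tune Replica Set Flow Control on the primary (assumed host).
from pymongo import MongoClient

client = MongoClient("mongodb://primary.example.net:27017")

# Current parameters (flow control defaults to enabled with a 10s target lag).
params = client.admin.command("getParameter", 1,
                              enableFlowControl=1,
                              flowControlTargetLagSeconds=1)
print(params["enableFlowControl"], params["flowControlTargetLagSeconds"])

# Runtime metrics: whether the set is considered lagged and how aggressively
# the primary is limiting write tickets.
print(client.admin.command("serverStatus")["flowControl"])

# Example: tighten the target lag so throttling engages sooner.
client.admin.command("setParameter", 1, flowControlTargetLagSeconds=5)
```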