[SERVER-17975] Stale reads with WriteConcern Majority and ReadPreference Primary
Created: 10/Apr/15  Updated: 02/Apr/21  Resolved: 07/Nov/16
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 2.6.7 |
| Fix Version/s: | 3.4.0-rc3 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Kyle Kingsbury | Assignee: | Andy Schwerin |
| Resolution: | Done | Votes: | 15 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Backwards Compatibility: | Minor Change |
| Operating System: | ALL |
| Steps To Reproduce: | Clone jepsen, check out commit 72697c09eff26fdb1afb7491256c873f03404307, cd mongodb, and run `lein test`. Might need to run `lein install` in the jepsen/jepsen directory first. |
| Case: | (copied to CRM) |
| Description |

Hello, everyone! Hope you're having a terrific week. I think I may have found a thing! In Jepsen tests involving a mix of reads, writes, and compare-and-set operations against a single document, MongoDB appears to allow stale reads, even when writes use WriteConcern.MAJORITY, when network partitions cause a leader election. This holds both for plain find-by-id lookups and for queries explicitly passing ReadPreference.primary().

We execute read, write, and compare-and-set operations against a single register (a sketch of these operations appears at the end of this description). The failure schedule is a 60-second-on, 60-second-off pattern of network partitions cutting the network cleanly into a randomly selected 3-node majority component and a 2-node minority component. This particular test is a bit finicky--it's easy to get Knossos locked into a really slow verification cycle, or to have trouble triggering the bug. Wish I had a more reliable test for you!

The attached linearizability.txt shows the linearizability analysis from Knossos for a test run with a relatively simple failure mode. In this test, MongoDB returns the value 0 for the document, even though the only possible values for the document at that time were 1, 2, 3, or 4. The value 0 was the proper state at some time close to the partition's beginning, but successful reads just after the partition was fully established indicated that at least one of the indeterminate (:info) CaS operations changing the value away from 0 had to have executed. You can see this visually in the attached image, where I've drawn the acknowledged (:ok) operations as green bars and the indeterminate (:info) operations as yellow bars, omitting :fail ops, which are known not to have taken place. Time moves from left to right; each process is a numbered horizontal track. The value must be zero just prior to the partition, but in order to read 4 and 3 we must execute process 1's CAS from 0->4; all possible paths from that point on cannot result in a value of 0 in time for process 5's final read.

Since the MongoDB docs for Read Preferences (http://docs.mongodb.org/manual/core/read-preference/) say "reading from the primary guarantees that read operations reflect the latest version of a document", I suspect this conflicts with MongoDB's intended behavior.

There is good news! If you remove all read operations from the mix, performing only CaS and writes, single-register ops with WriteConcern MAJORITY do appear to be linearizable! Or, at least, I haven't devised an aggressive enough test to expose any faults yet. This suggests to me that MongoDB might make the same mistake that etcd and Consul did with respect to consistent reads: assuming that a node which believes it is currently a primary can safely service a read request without confirming with a quorum of secondaries that it is still the primary. If this is so, you might refer to https://github.com/coreos/etcd/issues/741 and https://gist.github.com/armon/11059431 for more context on why this behavior is not consistent.

If this is the case, I think you can recover linearizable reads by computing the return value for the query, then verifying with a majority of nodes that no leadership transitions have happened since the start of the query, and then sending the result back to the client--preventing a logically "old" primary from servicing reads.

Let me know if there's anything else I can help with!
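A minimal sketch of the three client operations, rendered in PyMongo for illustration (the actual test drives MongoDB from the Jepsen Clojure client; the collection and field names below are illustrative, not the test's own):

```python
from pymongo import MongoClient, ReadPreference, WriteConcern

client = MongoClient("mongodb://n1,n2,n3,n4,n5/?replicaSet=rs0")
coll = client.jepsen.get_collection(
    "registers",
    write_concern=WriteConcern(w="majority"),   # all writes use majority
    read_preference=ReadPreference.PRIMARY,     # all reads go to the primary
)

def read(_id):
    # Plain find-by-id read of the register's current value.
    doc = coll.find_one({"_id": _id})
    return doc["value"] if doc else None

def write(_id, value):
    # Blind write of a new value.
    coll.update_one({"_id": _id}, {"$set": {"value": value}}, upsert=True)

def cas(_id, expected, new):
    # Compare-and-set: succeeds only if the register currently holds `expected`.
    res = coll.update_one({"_id": _id, "value": expected}, {"$set": {"value": new}})
    return res.modified_count == 1
```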
| Comments |
| Comment by Andy Schwerin [ 07/Nov/16 ] |

We have completed the implementation of a new "linearizable" read concern. Thanks for your report and follow-up assistance, aphyr.
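A minimal sketch of using the new read concern from a driver (PyMongo shown; assumes MongoDB 3.4+, and the collection and field names are illustrative). Linearizable reads must target the primary and wait for majority acknowledgement, so they cost more than local reads:

```python
from pymongo import MongoClient, WriteConcern
from pymongo.read_concern import ReadConcern

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
coll = client.test.get_collection(
    "registers",
    write_concern=WriteConcern(w="majority"),   # writes wait for a majority
    read_concern=ReadConcern("linearizable"),   # reads confirm the node is still primary
)

coll.update_one({"_id": 1}, {"$set": {"value": 4}}, upsert=True)
doc = coll.find_one({"_id": 1})   # cannot observe data that may later be rolled back
```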
| Comment by Andy Schwerin [ 02/Jun/16 ] |

Marqin, in the meantime, for single-document reads, if you have write privileges on the collection containing the document, you can use a findAndModify that performs a no-op update to avoid stale reads in cases where that is an operational requirement. This documentation suggests one approach, though it's not necessary to do a write that actually changes the document.
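A sketch of that workaround in PyMongo terms (assuming a server recent enough that findAndModify honors write concern; the `_ticket` field name is purely illustrative):

```python
from pymongo import MongoClient, WriteConcern
from pymongo.collection import ReturnDocument

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
coll = client.test.get_collection(
    "registers", write_concern=WriteConcern(w="majority")
)

# Issue the "read" as a findAndModify that bumps an otherwise-ignored field,
# forcing an oplog entry; with w:majority the command returns only after the
# operation has replicated to a majority, so a deposed primary cannot answer.
doc = coll.find_one_and_update(
    {"_id": 1},
    {"$inc": {"_ticket": 1}},
    return_document=ReturnDocument.AFTER,
)
```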
| Comment by Ramon Fernandez Marina [ 02/Jun/16 ] |

Marqin, the "3.3 Desired" fixVersion indicates that we're aiming to address this ticket in the current development cycle. Feel free to watch the ticket for updates.

Regards,
| Comment by Hubert Jarosz [X] [ 02/Jun/16 ] |

What's the current state of this bug?
| Comment by Carsten Klein [ 22/Jun/15 ] |

Andy Schwerin, here you definitely lost me. What I meant was that, prior to reading from or writing to the primary, there should be a third instance that validates that primary before it is used, even if it needed to do multiple RPCs to the list of provable primaries and also to wait a specific amount of time until the data has replicated across all machines, or at least to the one the client is connected to -- ultimately causing the reading or writing client to fail if the primary could not be validated in a timely fashion. Which, I guess, is basically what the proposed option is all about... save for the failing part, of course.
| Comment by Andy Schwerin [ 08/Jun/15 ] |

carstenklein@yahoo.de, if I understand Galera's model correctly, there are some differences, as individual replica sets in MongoDB only elect a single primary (write master) at a time, but I believe the effect is similar.
| Comment by Carsten Klein [ 05/Jun/15 ] |

Hm, looking at MariaDB Galera, it uses both a proxy and an additional arbitrator for handling failover and for making sure that updates, and presumably also reads, are valid. As I see it, each MongoDB replica acts as an arbitrator. The same goes for MariaDB Galera; however, there they also integrate an additional independent arbitrator that does not hold a replica of the data set, just the transaction log.
| Comment by Andy Schwerin [ 22/Apr/15 ] |

You cannot implement this feature with timing tricks. Even if everything else is going great, the OS scheduler can screw you pretty easily on a heavily loaded system, and just fail to schedule the step-down work on the old primary. We see this in our test harnesses sometimes, in tests that wait for failover to complete.
| Comment by Kyle Kingsbury [ 22/Apr/15 ] |

> I'm not sure that can ever happen. Even if it could happen, it would be easy to tweak the step-down and election sequences so that step-down is guaranteed to happen faster than election of a new primary.

The network is not synchronous, clocks drift, nodes pause, etc. Fixing a race condition via a timeout is an easy workaround, but I think you'll find (like Consul) that it's a probabilistic hack at best.
| Comment by Henrik Ingo (Inactive) [ 22/Apr/15 ] |

I was thinking about this today, and I'm still wondering whether stale reads are at all possible in MongoDB. Even today with 2.6/3.0?

The kind of stale read that Kyle describes can happen if there are 2 primaries existing at the same time: the old primary about to step down, and the newly elected primary. Even if it's unlikely, in theory a client process could flip-flop between the primaries so that it reads: new primary, old primary, new primary. However, this can only happen if the old primary steps down later than the new primary is elected. (Using ReadPreference = PRIMARY is of course assumed here.)

I'm not sure that can ever happen. Even if it could happen, it would be easy to tweak the step-down and election sequences so that step-down is guaranteed to happen faster than election of a new primary. This would be a more performant and easier solution than using findAndModify+getLastError or any other solution depending on doing roundtrips via the oplog.

(Note that there have of course been a couple of bugs reported where a replica set had 2 primaries even for long times, but those were bugs, not part of the intended failover protocol.)
| Comment by Mark Callaghan [ 22/Apr/15 ] |

https://jira.mongodb.org/browse/DOCS-5185 is related to this. Commits can be visible on the master before a slave receives the oplog entry. Therefore visible commits can be rolled back regardless of the write concern. I think the manual and online training should be updated to explain that.

I have also been hoping that eventually we will get something like lossless semisync replication in MongoDB, as the Majority write concern is similar to semisync. Maybe this will serve as motivation.
| Comment by Kyle Kingsbury [ 22/Apr/15 ] |

> in short I propose to break off the documentation request into a separate ticket and to use this ticket as the handle for scheduling the feature request.

Sounds good to me! Aligning the docs to the current behavior can be done right away. I tried to make a reasonable survey of the Mongo consistency docs in the Jepsen post here: https://aphyr.com/posts/322-call-me-maybe-mongodb-stale-reads. The post also suggests some example anomalies that might be helpful for users trying to reason about whether they can tolerate dirty/stale reads.

> "coupling reads to oplog acknowledgement" pretty much degrades to converting reads to read-modify-writes in periods of low write volume.

The workaround I describe in the post is to just do a findAndModify from the current state to itself. Experiments suggest this will do the trick, but if Mongo's smart enough to optimize that CaS away this won't help, haha. I say "couple", though, because you don't actually need to write anything to the oplog. You can piggyback the read state onto the existing oplog without inserting any new ops by simply blocking long enough for some other operation to be replicated--thereby verifying the primary is still current. Or you can inject a heartbeat event every so often. Oh, and you can also batch read ops, which should improve performance as well. The Consul and Raft discussions I linked to talk about both tactics.

> If […]

Ah, yeah, you're assuming all of these operations take place against the minority primary. That may be the case for this particular history, but in general, writes can occur on either side of the partition, leading to stale reads--the reads could see 0, then 1, then 0, then 1, or any other pattern, depending on which primary clients are talking to and when they make their request.
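A conceptual sketch of the "couple reads to the oplog" idea, in Python-flavored pseudocode (the `node` object and all of its methods are hypothetical, not MongoDB APIs; this only illustrates confirming primacy through the oplog before answering a read):

```python
import time

def linearizable_read(node, key, timeout=10.0):
    # Conceptual sketch: `node`, is_primary, read_local, append_noop_to_oplog,
    # and majority_acknowledged are hypothetical, not real MongoDB interfaces.
    if not node.is_primary():
        raise RuntimeError("not primary")
    value = node.read_local(key)            # compute the answer locally first
    marker = node.append_noop_to_oplog()    # piggyback on the oplog (or reuse any later op)
    deadline = time.time() + timeout
    while not node.majority_acknowledged(marker):
        # Until a majority acknowledges an op issued after the read, we cannot
        # be sure this node was still the legitimate primary when it read `value`.
        if time.time() > deadline:
            raise RuntimeError("could not confirm primacy; possibly deposed")
        time.sleep(0.01)
    return value
```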
| Comment by Mark Callaghan [ 22/Apr/15 ] |

Still catching up on things, so perhaps my comments are not relevant, but...

In 2.6 (mmapv1 only), changes on the master are visible: (1) before the change has been made durable on the master, and (2) before a slave receives the oplog entry. In 3.0 with WiredTiger and RocksDB, changes on the master are not visible until after their redo log sync has been done. I assume that #1 continues to be true for mmapv1. In 3.0 I assume that #2 is still a problem.

I wrote about this in […]. We also experienced this in MySQL land with semi-sync replication and solved the problem with lossless semisync replication. See the post by Yoshi for more details, but the property we provide is that commits are not visible on the master until the commit log has been archived on at least one other replica or log-only replica.

It will take me a while to get through all of the details in this case, but in the end I hope we can describe the MongoDB behavior in a few sentences.
| Comment by Andy Schwerin [ 22/Apr/15 ] |

Assigning to me for scheduling.
| Comment by Andy Schwerin [ 21/Apr/15 ] |

In my reading of this ticket, there are two actions requested. One is to schedule the ticket's suggestion for a feature to support treating a single document as a linearizable concurrent object without forcing the client to convert reads to read-modify-write operations. The other is to correct the MongoDB documentation about read consistency, emphasizing the conditions in which stale reads may occur with read preference "primary" in current and prior versions of MongoDB. Please read below for details, but in short I propose to break off the documentation request into a separate ticket and to use this ticket as the handle for scheduling the feature request.

Regarding the possible linearizable schedule for the reads, let me try to clarify. Your diagram indicates that by the end of the period in question, none of the writes have finished being confirmed by the replication system. If […]

Anyhow, the behavior you did observe certainly doesn't have a linearizable schedule. As you point out, even with […]

You suggested (approximately) transforming reads into atomic read-modify-writes in order to achieve linearizable reads. You didn't propose it exactly that way, and your description leaves more room for optimization, but "coupling reads to oplog acknowledgement" pretty much degrades to converting reads to read-modify-writes in periods of low write volume. The behavior can be achieved today, albeit somewhat clumsily and only with some client drivers, by using the "findAndModify" command to issue your reads and then issuing a getLastError command to wait for write concern satisfaction. Your findAndModify command will need to make some change to the document being read, such as incrementing an otherwise ignored field, in order to force an entry into the oplog, and you cannot observe the value until the getLastError command returns successfully, indicating that your read-modify-write replicated successfully.

Finally, as you indicated above, there is a clear documentation issue. The documentation you reference needs to be updated. As mentioned, there's an active DOCS ticket for part of that, […]
| Comment by Kyle Kingsbury [ 20/Apr/15 ] |

In what possible sense is this "working as designed"? The MongoDB documentation repeats the terms "immediate consistency" and "latest version" over and over again.

Here's the MongoDB chief architect claiming Mongo provides "Immediate Consistency" in a 2012 talk: http://www.slideshare.net/mongodb/mongodb-basic-concepts-15674838

Here's the Read preference documentation claiming Mongo ReadPreference=primary "guarantees that read operations reflect the latest version of a document": http://docs.mongodb.org/manual/core/read-preference/

The MongoDB FAQ says "MongoDB is consistent by default: reads and writes are issued to the primary member of a replica set": http://www.mongodb.com/faq#consistency

And the Architecture Guide repeats the theme that only non-primary ReadPreferences can see stale data: http://s3.amazonaws.com/info-mongodb-com/MongoDB_Architecture_Guide.pdf.

What Mongo actually does is allow stale reads: it is possible to execute a WriteConcern=MAJORITY write of a new value, wait for it to return successfully, perform a read with ReadPreference=PRIMARY, and not see the value you just wrote.
| Comment by Kyle Kingsbury [ 20/Apr/15 ] |

To elaborate...

> Further, there's a linearizable schedule in that case, I believe. It's been a while since I read Herlihy's paper, but if I have this right, with […]

I don't understand what you mean--it doesn't make sense for a register to read 0, 4, 3, and 0 again without any writes taking place.

> On the other hand, even if an application does that the response might be delayed during transport, during which time a more-current value might appear. Sticking to the single-document case, for the moment, if a thread communicates with other threads only through MongoDB, so long as it never sees an older value of a document after seeing a newer value of a document, and so long as it does only committed reads, what would staleness even mean?

The property you're describing is sequential consistency: all processes see operations in the same order, but do not agree on when they happen. Sequentially consistent systems allow arbitrarily stale reads: it is legal, for instance, for a new process to see no documents, which leads to confusing anomalies like, say, submitting a comment, refreshing the page, and seeing nothing there. I think you would be hard-pressed to find users who have no side-channels between processes, and I also think most of your user base would interpret "latest version" to mean "a state between the invocation and completion times of my read operation", not "some state logically subsequent to my previous operation and temporally prior to the completion of my read operation."

> Now, if the threads communicate directly with each other […]

They can and they will communicate--I have never talked to a Mongo user who did not send data from MongoDB to a human being. This is why linearizability is a useful invariant: you know that if you post a status update, receive an HTTP 200 response, call up your friend, and ask them to look, they'll see your post. You can ask people to embed causality tokens in all their operations, but a.) you have to train users how to propagate and merge causality tokens correctly, b.) this does nothing for fresh processes, and c.) this is not what most people mean when they say "immediate" or "latest version", haha.
| Comment by Eric Milkie [ 20/Apr/15 ] |

EDIT: This ticket was re-opened on April 21.
| Comment by Kyle Kingsbury [ 20/Apr/15 ] |

Maybe I should have been more explicit: this is not a duplicate of the read-committed ticket.
| Comment by Eric Milkie [ 20/Apr/15 ] |

I'm closing this as a duplicate of the read-committed ticket, but please feel free to reopen for further discussion.
| Comment by Andy Schwerin [ 15/Apr/15 ] |

From my interpretation of Kyle's diagram, if […]

As for Kyle's point about needing reads to be coupled to oplog acknowledgement to prevent stale reads, I'm of two minds. On the one hand, an application can convert reads into atomic read-modify-write operations today using the findAndModify and getLastError commands in MongoDB in order to tie the reads into the oplog acknowledgement system (NB: I don't think most drivers support this today). On the other hand, even if an application does that, the response might be delayed during transport, during which time a more-current value might appear. Sticking to the single-document case, for the moment, if a thread communicates with other threads only through MongoDB, so long as it never sees an older value of a document after seeing a newer value of a document, and so long as it does only committed reads, what would staleness even mean? Now, if the threads communicate directly with each other, the story gets more complicated, and […]

That solution depends on resolution of […]

In the meantime, we will work to improve the documentation around this behavior in current versions of MongoDB. As always, please respond if you have questions or comments.
| Comment by Kyle Kingsbury [ 15/Apr/15 ] |

(Perhaps I should also mention, in case anyone comes along and thinks this is subsumed by the read-committed work: it is not; read-committed prevents dirty reads but still allows stale reads.)
| Comment by Kyle Kingsbury [ 14/Apr/15 ] |

The existence of this behavior actually implies both anomalies are present in MongoDB, but I'm phrasing it conservatively. Why? A dirty read from an isolated primary can be trivially converted to a stale read if the write to the isolated primary doesn't affect the outcome of the read (or if the write doesn't take place at all). I think there are two problems to fix here--supporting read-committed isolation will prevent dirty reads, but still allows stale reads. You also have to couple reads to oplog acknowledgement in some way to prevent stale read transactions. I've attached a sketch (journal-84.png) to illustrate--all you have to do is execute the write on the new primary instead of the old to convert a dirty read to a stale one. Either way, you're not reading "the most recent state."

Note that you don't have to go full read-committed to fix this anomaly: you can prevent stale and dirty reads for single documents without supporting RC for multi-doc operations (just a difference in lock granularity), so if you want to support reading the latest version, you can have it in both read-uncommitted and read-committed modes.

The read isolation docs (http://docs.mongodb.org/manual/core/write-concern/#read-isolation) are technically correct, I think, but sorta misleading: "For all inserts and updates, MongoDB modifies each document in isolation: clients never see documents in intermediate states" kinda suggests that the read-uncommitted problem refers to multiple-document updates--which is also true--but it doesn't mention that even read operations on a single document may see invalid states that are not causally connected to the final history.

The read preference docs (http://docs.mongodb.org/manual/core/read-preference/) make some pretty explicit claims that Mongo supports linearizable reads, saying "Reading from the primary guarantees that read operations reflect the latest version of a document", and "All read preference modes except primary may return stale data". With this in mind, it might be a good idea to let users know all read modes may return stale data, and that the difference in ReadPreference just changes the probabilities. For instance, "Ensure that your application can tolerate stale data if you choose to use a non-primary mode," could read "Always ensure that your application can tolerate stale data."
| Comment by Kyle Kingsbury [ 14/Apr/15 ] |

Sketch illustrating that stale reads are a degenerate case of dirty reads.
| Comment by Asya Kamsky [ 14/Apr/15 ] |

Hello Kyle,

As you and Knossos have discovered, it is not possible to do fully linearizable single-document reads with the current version of MongoDB. I believe that what your test framework is not taking into account is that reading from a primary does not guarantee that the read data will survive a network partition. This is because MongoDB read isolation semantics are similar to "read uncommitted" in a traditional database system when you take into account the full replica set.

As the docs mention, data written with majority writeConcern that has been acknowledged will survive any replica set event that allows a new primary to be elected. However, after the write is made on the primary, but before it has successfully replicated to a majority of the cluster, it is visible to any other connection that's reading from the primary. This allows the following sequence of events:

1. Write A is applied on the primary with majority write concern, but the client has not yet been acknowledged.
2. Before write A replicates to a majority of the set, another connection reading from the primary observes it.
3. A network partition (or another replica set event) triggers an election, and a new primary is chosen from nodes that may not have write A.
4. Write A is rolled back on the old primary when it rejoins the set.

When write A has not propagated to the majority of the replica set, it may not be present on the newly elected primary (in fact, if write A has replicated to none of the secondaries, it is guaranteed to be absent from the newly elected primary). I believe such a sequence of events was observed in your case: the majority write concern was not yet satisfied, the unacknowledged data had been written on the primary and were visible to other connections (process 1 in your case), but the value was not present on the newly elected primary (which is the node that process 5 finally successfully read from).

The phenomenon your tests are observing is not stale reads (of value 0) but rather uncommitted reads, and those are the reads "1 read 4" and "1 read 3", as these happen on the "old" primary. Those writes were not acknowledged, nor replicated to the majority partition, and they will be (correctly) rolled back when the partition is removed.

Currently, there is a Documentation task, […], to clarify this behavior. Thanks for the detailed report, and let me know if you have any questions.