[SERVER-17975] Stale reads with WriteConcern Majority and ReadPreference Primary Created: 10/Apr/15  Updated: 02/Apr/21  Resolved: 07/Nov/16

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.6.7
Fix Version/s: 3.4.0-rc3

Type: Bug Priority: Major - P3
Reporter: Kyle Kingsbury Assignee: Andy Schwerin
Resolution: Done Votes: 15
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: JPEG File CCNSOQ6UwAEAvsO.jpg     PNG File Journal - 84.png     File history.edn     Text File linearizability.txt    
Issue Links:
Depends
depends on SERVER-18285 Support linearizable reads on replica... Closed
Related
related to DOCS-5141 can you document more behavior for re... Closed
related to DOCS-5259 Audit read-preference sections Closed
related to DOCS-5324 Document how to use findAndModify wit... Closed
related to DOCS-5325 Clearly document meaning of "you" in ... Closed
related to CXX-597 Use electionId to detect stale primar... Closed
related to DOCS-5326 consistency/isolation/recency trackin... Closed
related to DRIVERS-228 Use electionId to detect stale primar... Closed
Backwards Compatibility: Minor Change
Operating System: ALL
Steps To Reproduce:

Clone jepsen, check out commit 72697c09eff26fdb1afb7491256c873f03404307, cd mongodb, and run `lein test`. Might need to run `lein install` in the jepsen/jepsen directory first.

Participants:
Case:

 Description   

Hello, everyone! Hope you're having a terrific week.

I think I may have found a thing!

In Jepsen tests involving a mix of reads, writes, and compare-and-set against a single document, MongoDB appears to allow stale reads, even when writes use WriteConcern.MAJORITY, when network partitions cause a leader election. This holds for both plain find-by-id lookups and for queries explicitly passing ReadPreference.primary().

Here's how we execute read, write, and compare-and-set operations against a register:

https://github.com/aphyr/jepsen/blob/72697c09eff26fdb1afb7491256c873f03404307/mongodb/src/mongodb/document_cas.clj#L55-L81

And this is the schedule for failures: a 60-second on, 60-second off pattern of network partitions cutting the network cleanly into a randomly selected 3-node majority component and a 2-node minority component.

https://github.com/aphyr/jepsen/blob/72697c09eff26fdb1afb7491256c873f03404307/mongodb/src/mongodb/core.clj#L377-L391

This particular test is a bit finicky--it's easy to get Knossos locked into a really slow verification cycle, or to have trouble triggering the bug. Wish I had a more reliable test for you!

Attached, linearizability.txt shows the linearizability analysis from Knossos for a test run with a relatively simple failure mode. In this test, MongoDB returns the value "0" for the document, even though the only possible values for the document at that time were 1, 2, 3, or 4. The value 0 was the proper state at some time close to the partition's beginning, but successful reads just after the partition was fully established indicated that at least one of the indeterminate (:info) CaS operations changing the value away from 0 had to have executed.

You can see this visually in the attached image, where I've drawn the acknowledged (:ok) operations as green and indeterminate (:info) operations as yellow bars; omitting :fail ops which are known to have not taken place. Time moves from left to right; each process is a numbered horizontal track. The value must be zero just prior to the partition, but in order to read 4 and 3 we must execute process 1's CAS from 0->4; all possible paths from that point on cannot result in a value of 0 in time for process 5's final read.

Since the MongoDB docs for Read Preferences (http://docs.mongodb.org/manual/core/read-preference/) say "reading from the primary guarantees that read operations reflect the latest version of a document", I suspect this behavior conflicts with Mongo's intended behavior.

There is good news! If you remove all read operations from the mix, performing only CaS and writes, single-register ops with WriteConcern MAJORITY do appear to be linearizable! Or, at least, I haven't devised an aggressive enough test to expose any faults yet.

This suggests to me that MongoDB might make the same mistake that Etcd and Consul did with respect to consistent reads: assuming that a node which believes it is currently a primary can safely service a read request without confirming with a quorum of secondaries that it is still the primary. If this is so, you might refer to https://github.com/coreos/etcd/issues/741 and https://gist.github.com/armon/11059431 for more context on why this behavior is not consistent.

If this is the case, I think you can recover linearizable reads by computing the return value for the query, then verifying with a majority of nodes that no leadership transitions have happened since the start of the query, and then sending the result back to the client--preventing a logically "old" primary from servicing reads.
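
(Purely for illustration, here is a minimal sketch of that confirmation step. The names ReplicaSetNode and lastAcknowledgedTerm() are invented, not MongoDB internals; the point is only "compute the result, confirm with a majority that no newer term has been seen, then reply".)

```java
// Hypothetical sketch only -- not MongoDB source code. Models the idea above:
// the primary computes a result, then confirms with a majority of the replica
// set that no newer election term has been observed before replying.
import java.util.List;

interface ReplicaSetNode {
    long lastAcknowledgedTerm(); // highest election term this node has seen (invented name)
}

public class ConfirmedRead {
    public static <T> T confirmAndReturn(T localResult,
                                         long termAtReadStart,
                                         List<ReplicaSetNode> otherNodes) {
        int acks = 1; // count the primary itself
        for (ReplicaSetNode node : otherNodes) {
            if (node.lastAcknowledgedTerm() <= termAtReadStart) {
                acks++; // this node has not seen a newer primary
            }
        }
        int majority = (otherNodes.size() + 1) / 2 + 1;
        if (acks >= majority) {
            return localResult; // still primary as of the read; safe to reply
        }
        throw new IllegalStateException("possible leadership change; retry the read");
    }
}
```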

Let me know if there's anything else I can help with!



 Comments   
Comment by Andy Schwerin [ 07/Nov/16 ]

We have completed implementation of a new "linearizable" read concern under SERVER-18285, and have undertaken some documentation updates under DOCS-8298. As such, I'm resolving this ticket as "fixed" for MongoDB 3.4.0-rc3. The code is actually present and enabled in 3.4.0-rc2, for those interested in further testing. Our own testing included, among other things, integrating Jepsen tests into our continuous integration system. That work was done in SERVER-24509.
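
For anyone who wants to try it, a minimal sketch of what a linearizable read looks like from the Java driver (3.4+); the hosts, replica set name, database, and collection below are placeholders:

```java
// Illustrative only: issue a read with the "linearizable" read concern added in 3.4.
import com.mongodb.MongoClient;
import com.mongodb.MongoClientURI;
import com.mongodb.ReadConcern;
import com.mongodb.ReadPreference;
import com.mongodb.WriteConcern;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import static com.mongodb.client.model.Filters.eq;

public class LinearizableReadExample {
    public static void main(String[] args) {
        // Placeholder hosts and replica set name.
        MongoClient client = new MongoClient(
                new MongoClientURI("mongodb://n1,n2,n3/?replicaSet=rs0"));
        MongoCollection<Document> coll = client.getDatabase("jepsen")
                .getCollection("documents")
                .withWriteConcern(WriteConcern.MAJORITY)       // majority-acknowledged writes
                .withReadConcern(ReadConcern.LINEARIZABLE)     // available starting in 3.4 (SERVER-18285)
                .withReadPreference(ReadPreference.primary()); // linearizable reads must target the primary
        Document doc = coll.find(eq("_id", 0)).first();
        System.out.println(doc);
        client.close();
    }
}
```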

Thanks for your report and follow-up assistance, aphyr.

Comment by Andy Schwerin [ 02/Jun/16 ]

Marqin, in the meantime, for single-document reads, if you have write privileges on the collection containing the document, you can use a findAndModify that performs a no-op update to avoid stale reads in cases where that is an operational requirement.

This documentation suggests one approach, though it's not necessary to do a write that actually changes the document.
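
A minimal sketch of that workaround with the Java driver, assuming a driver and server recent enough (3.2+) that findOneAndUpdate honors the collection's write concern; on older versions the findAndModify + getLastError pattern discussed elsewhere in this ticket applies. The "_touch" field name is a placeholder whose only purpose is to force a majority-acknowledged oplog entry:

```java
// Hedged sketch of the no-op findAndModify workaround described above.
import com.mongodb.WriteConcern;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.FindOneAndUpdateOptions;
import com.mongodb.client.model.ReturnDocument;
import org.bson.Document;

import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Updates.inc;

public class MajorityConfirmedRead {
    /** Returns the document only after the dummy update is majority-acknowledged. */
    public static Document read(MongoCollection<Document> coll, Object id) {
        return coll.withWriteConcern(WriteConcern.MAJORITY)
                   .findOneAndUpdate(
                       eq("_id", id),
                       inc("_touch", 1), // touch an otherwise ignored field to force an oplog entry
                       new FindOneAndUpdateOptions().returnDocument(ReturnDocument.AFTER));
    }
}
```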

Comment by Ramon Fernandez Marina [ 02/Jun/16 ]

Marqin, the "3.3 Desired" fixVersion indicates that we're aiming to address this ticket in the current development cycle. Feel free to watch the ticket for updates.

Regards,
Ramón.

Comment by Hubert Jarosz [X] [ 02/Jun/16 ]

What's the current state of this bug?

Comment by Carsten Klein [ 22/Jun/15 ]

Andy Schwerin, here you definitely lost me.

What I meant was that, prior to reading from or writing to the primary, there should be a third instance that validates that primary before it is used, even if it needed to do multiple RPCs to the list of provable primaries and also wait a specific amount of time for the data to be replicated across all machines, or at least to the one the client is connected to.

Ultimately, this would cause the reading or writing client to fail if the primary could not be validated in a timely fashion.

Which, I guess, is basically what the option is all about... except for the failing part, of course.

Comment by Andy Schwerin [ 08/Jun/15 ]

carstenklein@yahoo.de, if I understand Galera's model correctly, SERVER-18285 should provide the equivalent behavior to setting wsrep_sync_wait = 1, while using it in conjunction with w:majority writes starting in MongoDB 3.2 ought to provide wsrep_sync_wait=3 or possibly 7.

There are some differences, as individual replica sets in MongoDB only elect a single primary (write master) at a time, but I believe the effect is similar.

Comment by Carsten Klein [ 05/Jun/15 ]

Hm, looking at MariaDB Galera, it uses both a proxy and an additional arbitrator for handling failover and for making sure that updates, and presumably also reads, are valid.
Would it not be possible to implement a similar scheme, minus the proxy of course, in MongoDB to get rid of this once and for all?

As I see it, each MongoDB replica acts as an arbitrator. The same goes for MariaDB Galera; however, there they also integrated an additional independent arbitrator that does not hold a replication set, just the transaction log.

Comment by Andy Schwerin [ 22/Apr/15 ]

You cannot implement this feature with timing tricks. Even if everything else is going great, the OS scheduler can screw you pretty easily on a heavily loaded system, and just fail to schedule the step-down work on the old primary. We see this in our test harnesses sometimes, in tests that wait for failover to complete.

Comment by Kyle Kingsbury [ 22/Apr/15 ]

> I'm not sure that can ever happen. Even if it could happen, it would be easy to tweak the step-down and election sequences so that step-down is guaranteed to happen faster than election of a new primary.

The network is not synchronous, clocks drift, nodes pause, etc. Fixing a race condition via a timeout is an easy workaround, but I think you'll find (like Consul) that it's a probabilistic hack at best.

Comment by Henrik Ingo (Inactive) [ 22/Apr/15 ]

I was thinking about this today, and I'm still wondering whether stale reads are at all possible in MongoDB, even today with 2.6/3.0.

The kind of stale read that Kyle describes can happen if there are 2 primaries existing at the same time: the old primary about to step down, and the newly elected primary. Even if it's unlikely, in theory a client process could flip flop between the primaries so that it reads: new primary, old primary, new primary. However, this can only happen if the old primary steps down later than the new primary is elected. (Using ReadPreference = PRIMARY is of course assumed here.)

I'm not sure that can ever happen. Even if it could happen, it would be easy to tweak the step-down and election sequences so that step-down is guaranteed to happen faster than election of a new primary.

This would be a more performant and simpler solution than using findAndModify+getLastError or any other solution that depends on doing round trips via the oplog.

(Note that there have of course been a couple bugs reported where a replica set had 2 primaries even for long times, but those were bugs, not part of the intended failover protocol.)

Comment by Mark Callaghan [ 22/Apr/15 ]

https://jira.mongodb.org/browse/DOCS-5185 is related to this. Commits can be visible on the master before a slave receives the oplog entry. Therefore visible commits can be rolled back regardless of the write concern. I think the manual and online training should be updated to explain that.

I have also been hoping that eventually we will get something like lossless semisync replication in MongoDB as the Majority write concern is similar to semisync. Maybe this will serve as motivation.

Comment by Kyle Kingsbury [ 22/Apr/15 ]

> in short I propose to break off the documentation request into a separate ticket and to use this ticket as the handle for scheduling the feature request.

Sounds good to me! Aligning the docs to the current behavior can be done right away. I tried to make a reasonable survey of the Mongo consistency docs in the Jepsen post here: https://aphyr.com/posts/322-call-me-maybe-mongodb-stale-reads. The post also suggests some example anomalies that might be helpful for users trying to reason about whether they can tolerate dirty/stale reads.

> "coupling reads to oplog acknowledgement" pretty much degrades to converting reads to read-modify-writes in periods of low write volume.

The workaround I describe in the post is to just do a findAndModify from the current state to itself. Experiments suggest this will do the trick, but if Mongo's smart enough to optimize that CaS away this won't help, haha. I say "couple", though, because you don't actually need to write anything to the oplog. You can actually piggyback the read state onto the existing oplog without inserting any new ops by simply blocking long enough for some other operation to be replicated--thereby verifying the primary is still current. Or you can inject a heartbeat event every so often. Oh, and you can also batch read ops, which should improve performance as well. The Consul and Raft discussions I linked to talk about both tactics.

> If SERVER-18022 were resolved all the reads in the diagram would have returned 0, because none of the writes had committed.

Ah, yeah, you're assuming all of these operations take place against the minority primary. That may be the case for this particular history, but in general, writes can occur on either side of the partition, leading to stale reads--the reads could see 0, then 1, then 0, then 1, or any other pattern, depending on which primary clients are talking to and when they make their requests.

Comment by Mark Callaghan [ 22/Apr/15 ]

Still catching up on things so perhaps my comments are not relevant but...

In 2.6 (mmapv1 only) changes on the master are visible:
1) before journal sync is done
2) before replicas might have received or ack'd the change

In 3.0 with WiredTiger and RocksDB changes on the master are not visible until after their redo log sync has been done. I assume that #1 continues to be true for mmapv1.

In 3.0 I assume that #2 is still a problem.

I wrote about this in:
https://jira.mongodb.org/browse/DOCS-2908
http://smalldatum.blogspot.com/2014/03/when-does-mongodb-make-transaction.html

We also experienced this in MySQL land with semi-sync replication and solved the problem with lossless semisync replication. See the post by Yoshi for more details but the property we provide is that commits are not visible on the master until the commit log has been archived on at least one other replica or log-only replica.
http://yoshinorimatsunobu.blogspot.com/2014/04/semi-synchronous-replication-at-facebook.html

It will take me a while to get through all of the details in this case but in the end I hope we can describe the MongoDB behavior in a few sentences.

Comment by Andy Schwerin [ 22/Apr/15 ]

Assigning to me for scheduling.

Comment by Andy Schwerin [ 21/Apr/15 ]

In my reading of this ticket, there are two actions requested. One is to schedule the ticket's suggestion for a feature to support treating a single document as a linearizable concurrent object without forcing the client to convert reads to read-modify-write operations. The other is to correct the MongoDB documentation about read consistency, emphasizing the conditions in which stale reads may occur with read preference "primary" in current and prior versions of MongoDB. Please read below for details, but in short I propose to break off the documentation request into a separate ticket and to use this ticket as the handle for scheduling the feature request.

Regarding the possible linearizable schedule for the reads, let me try to clarify. Your diagram indicates that by the end of the period in question, none of the writes have finished being confirmed by the replication system. If SERVER-18022 were resolved all the reads in the diagram would have returned 0, because none of the writes had committed. As such, there would be a legal linearizable schedule, whose prefix includes the completion of all the read operations, and whose suffix includes the completion of all the write operations. I think in this case that the fact that the writes had in fact started is not relevant. In this case, write operation completion means replication to a majority of voting nodes and confirmation of that fact to the primary that accepted the write from the client.

Anyhow, the behavior you did observe certainly doesn't have a linearizable schedule. As you point out, even with SERVER-18022 you don't get reads into a linear schedule for free. The problem is that there is a period of time during a network partition when two nodes may believe themselves to be primary. As soon as those two nodes communicate, one will step down. If the partition lasts long enough, the node in the minority partition will step down, but there is an unavoidable window for stale reads when a client reads from a primary that will inevitably step down. As an aside, improvements to the consensus protocol can be used to bring that period down to a few network roundtrip periods (hundreds of milliseconds), and that is the subject of the somewhat ill-described SERVER-12385.

You suggested (approximately) transforming reads into atomic read-modify-writes in order to achieve linearizable reads. You didn't propose it exactly that way, and your description leaves more room for optimization, but "coupling reads to oplog acknowledgement" pretty much degrades to converting reads to read-modify-writes in periods of low write volume. The behavior can be achieved today, albeit somewhat clumsily and only with some client drivers, by using the "findAndModify" command to issue your reads and then issuing a getLastError command to wait for write concern satisfaction. Your findAndModify command will need to make some change to the document being read, such as incrementing an otherwise ignored field, in order to force an entry into the oplog, and you cannot observe the value until the getLastError command returns successfully, indicating that your read-modify-write replicated successfully.

Finally, as you indicated above, there is a clear documentation issue. The documentation you reference needs to be updated. As mentioned, there's an active DOCS ticket for part of that, DOCS-5141, and I'm now convinced that we'll need a separate one to review all of our read-preference documentation. Improving the documentation will be tricky because, while linearizable distributed objects are often convenient, they come at a comparatively high cost in terms of communication overhead. Since users' needs can be frequently satisfied with more relaxed consistency models, the updated documentation will need to help developers weigh the probability of a stale read with its impact on their application.

Comment by Kyle Kingsbury [ 20/Apr/15 ]

In what possible sense is this "working as designed"? The MongoDB documentation repeats the terms "immediate consistency" and "latest version" over and over again.

Here's the MongoDB chief architect claiming Mongo provides "Immediate Consistency" in a 2012 talk: http://www.slideshare.net/mongodb/mongodb-basic-concepts-15674838

Here's the Read preference documentation claiming Mongo ReadPreference=primary "guarantees that read operations reflect the latest version of a document": http://docs.mongodb.org/manual/core/read-preference/

The MongoDB FAQ says "MongoDB is consistent by default: reads and writes are issued to the primary member of a replica set": http://www.mongodb.com/faq#consistency

And the Architecture Guide repeats the theme that only non-primary ReadPreferences can see stale data: http://s3.amazonaws.com/info-mongodb-com/MongoDB_Architecture_Guide.pdf.

What Mongo actually does is allow stale reads: it is possible to execute a WriteConcern=MAJORITY write of a new value, wait for it to return successfully, perform a read with ReadPreference=PRIMARY, and not see the value you just wrote.

Comment by Kyle Kingsbury [ 20/Apr/15 ]

To elaborate...

> Further, there's a linearizable schedule in that case, I believe. It's been a while since I read Herlihy's paper, but if I have this right, with SERVER-18022 and committed single-document reads, a legal schedule would have been to process all of the reads in some sequence, and then the writes of threads 0, 3, 4, 1 and 2

I don't understand what you mean--it doesn't make sense for a register to read 0, 4, 3, and 0 again without any writes taking place.

> On the other hand, even if an application does that the response might be delayed during transport, during which time a more-current value might appear. Sticking to the single-document case, for the moment, if a thread communicates with other threads only through MongoDB, so long as it never sees an older value of a document after seeing a newer value of a document, and so long as it does only committed reads, what would staleness even mean?

The property you're describing is sequential consistency: all processes see operations in the same order, but do not agree on when they happen. Sequentially consistent systems allow arbitrarily stale reads: it is legal, for instance, for a new process to see no documents, which leads to confusing anomalies like, say, submitting a comment, refreshing the page, and seeing nothing there. I think you would be hard-pressed to find users who have no side-channels between processes, and I also think most of your user base would interpret "latest version" to mean "a state between the invocation and completion times of my read operation", not "some state logically subsequent to my previous operation and temporally prior to the completion of my read operation."

> Now, if the threads communicate directly with each other

They can and they will communicate--I have never talked to a Mongo user which did not send data from MongoDB to a human being. This is why linearizability is a useful invariant: you know that if you post a status update, receive an HTTP 200 response, call up your friend, and ask them to look, they'll see your post.

You can ask people to embed causality tokens in all their operations, but a.) you have to train users how to propagate and merge causality tokens correctly, b.) this does nothing for fresh processes, and c.) this is not what most people mean when they say "immediate" or "latest version", haha.

Comment by Eric Milkie [ 20/Apr/15 ]

EDIT This ticket was re-opened on April 21.
~~~~~
Kyle, I'm going to switch this to "Works as Designed", as you're correct that there are many more facets to this topic than just a simple duplication of one work ticket.
I'm still uncertain what you mean by "staleness" in this context, as highlighted by Andy in his response above.

Comment by Kyle Kingsbury [ 20/Apr/15 ]

Maybe I should have been more explicit: this is not a duplicate of SERVER-18022. Read-committed does not prevent stale reads.

Comment by Eric Milkie [ 20/Apr/15 ]

I'm closing this as a duplicate of the read-committed ticket, but please feel free to reopen for further discussion.

Comment by Andy Schwerin [ 15/Apr/15 ]

From my interpretation of Kyle's diagram, if SERVER-18022 were resolved and the test threads were doing single-document committed reads, thread 1 would not have observed 4 then 3, but 0 and then 0 again, since neither thread 0 nor thread 2 have completed their writes. Similarly, thread 5 would continue to read 0. Those values aren't stale – they would represent the most recent committed value. Further, there's a linearizable schedule in that case, I believe. It's been a while since I read Herlihy's paper, but if I have this right, with SERVER-18022 and committed single-document reads, a legal schedule would have been to process all of the reads in some sequence, and then the writes of threads 0, 3, 4, 1 and 2 in that order. Note that, unlike Asya, I primarily consulted the diagram, ticket description, and prior comments.

As for Kyle's point about needing reads to be coupled to oplog acknowledgement to prevent stale reads, I'm of two minds. On the one hand, an application can convert reads into atomic read-modify-write operations today using the findAndModify and getLastError commands in MongoDB in order to tie the reads into the oplog acknowledgement system (NB: I don't think most drivers support this today). On the other hand, even if an application does that the response might be delayed during transport, during which time a more-current value might appear. Sticking to the single-document case, for the moment, if a thread communicates with other threads only through MongoDB, so long as it never sees an older value of a document after seeing a newer value of a document, and so long as it does only committed reads, what would staleness even mean?

Now, if the threads communicate directly with each other, the story gets more complicated and SERVER-18022 may not be sufficient by itself. In that case, allowing the client threads to pass some kind of logical clock token when they communicate with each other would suffice to prevent a causal ordering violation during periods when one node erroneously believes itself to still be primary. That token could be a combination of the monotonically increasing election term id and the highest committed oplog timestamp on the node when the read completed. If the causally second observer saw an earlier (term, optime) pair than the causally first observer, it would know to reject the read.
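
A rough illustration of that token comparison, under hypothetical names (this is a sketch, not a MongoDB API):

```java
// Hypothetical sketch of the causal token described above: a (term, optime) pair
// a client hands to another client out-of-band; the reader rejects results whose
// token is older than the one it received.
public final class CausalToken implements Comparable<CausalToken> {
    private final long term;            // monotonically increasing election term id
    private final long committedOpTime; // highest majority-committed oplog timestamp

    public CausalToken(long term, long committedOpTime) {
        this.term = term;
        this.committedOpTime = committedOpTime;
    }

    @Override
    public int compareTo(CausalToken other) {
        if (term != other.term) {
            return Long.compare(term, other.term);
        }
        return Long.compare(committedOpTime, other.committedOpTime);
    }

    /** True when a read tagged with this token is causally behind the token received out-of-band. */
    public boolean isBehind(CausalToken received) {
        return compareTo(received) < 0;
    }
}
```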

That solution depends on resolution of SERVER-12385 (adding term ids to the oplog is part of that work) and SERVER-16570 (involving write concern satisfaction after rollback), which we're planning to do during the development of version 3.2. It is worth noting that even that solution is insufficient for managing some causal relationships when communication is through multiple documents. I don't believe we make promises about those causal relationships, today.

In the meantime, we will work to improve the documentation around this behavior in current versions of MongoDB. As always, please respond if you have questions or comments.

Comment by Kyle Kingsbury [ 15/Apr/15 ]

(perhaps I should also mention, in case anyone comes along and thinks this is subsumed by SERVER-18022, that fixing SERVER-18022 does not necessarily resolve the problem of stale reads)

Comment by Kyle Kingsbury [ 14/Apr/15 ]

The existence of this behavior actually implies both anomalies are present in MongoDB, but I'm phrasing it conservatively.

Why? A dirty read from an isolated primary can be trivially converted to a stale read if the write to the isolated primary doesn't affect the outcome of the read (or if the write doesn't take place at all). I think there are two problems to fix here--supporting read-committed isolation will prevent dirty reads, but still allows stale reads. You also have to couple reads to oplog acknowledgement in some way to prevent stale read transactions.

I've attached a sketch (journal-84.png) to illustrate--all you have to do is execute the write on the new primary instead of the old to convert a dirty read to a stale one. Either way, you're not reading "the most recent state."

Note that you don't have to go full read-committed to fix this anomaly: you can prevent stale and dirty reads for single documents without supporting RC for multi-doc operations (just a difference in lock granularity), so if you want to support reading the latest version, you can have it in both read-uncommitted and read-committed modes.

The read isolation docs (http://docs.mongodb.org/manual/core/write-concern/#read-isolation) are technically correct, I think, but sorta misleading: "For all inserts and updates, MongoDB modifies each document in isolation: clients never see documents in intermediate states" kinda suggests that the read uncommitted problem refers to multiple-document updates—which is also true—but it doesn't mention that even read operations on a single document may see invalid states that are not causally connected to the final history.

The read preference docs (http://docs.mongodb.org/manual/core/read-preference/) make some pretty explicit claims that Mongo supports linearizable reads, saying "Reading from the primary guarantees that read operations reflect the latest version of a document", and "All read preference modes except primary may return stale data".

With this in mind, it might be a good idea to let users know all read modes may return stale data, and that the difference in ReadPreference just changes the probabilities. For instance, "Ensure that your application can tolerate stale data if you choose to use a non-primary mode," could read "Always ensure that your application can tolerate stale data."

Comment by Kyle Kingsbury [ 14/Apr/15 ]

Sketch illustrating that stale reads are a degenerate case of dirty reads.

Comment by Asya Kamsky [ 14/Apr/15 ]

Hello Kyle,

As you and Knossos have discovered, it is not possible to do fully linearized single-document reads with the current version of MongoDB.

I believe that what your test framework is not taking into account is that reading from a primary does not guarantee that the read data will survive a network partition. This is because MongoDB read isolation semantics are similar to "read uncommitted" in a traditional database system when you take into account the full replica set.

As the docs mention, data written with majority writeConcern that has been acknowledged will survive any replica set event that allows a new primary to be elected. However, after the write is made on the primary, but before it has successfully replicated to a majority of the cluster, it is visible to any other connection that's reading from the primary.

This allows the following sequence of events:

T1 network partition happens
T2 write A happens, waits for write concern
T3 read of A happens on the primary
T4 primary steps down due to not seeing majority
T5 new primary is elected

When write A has not propagated to the majority of the replica set, it may not be present on the newly elected primary (in fact, if write A has replicated to none of the secondaries, it is guaranteed to be absent from the newly elected primary).

I believe such a sequence of events was observed in your case: the majority write concern was not yet satisfied, the unacknowledged data had been written on the primary and were visible to other connections (process 1 in your case), but the value was not present on the newly elected primary (which is the node that process 5 finally successfully read from). The phenomena your tests are observing are not stale reads (of value 0) but rather uncommitted reads, namely "1 read 4" and "1 read 3", which happen on the "old" primary. Those writes were not acknowledged, nor replicated to the majority partition, and they will be (correctly) rolled back when the partition is removed.

Currently, there is a Documentation task, DOCS-5141, to clarify read isolation semantics further. In addition, in SERVER-18022 we are working on support for read-committed isolation, which will enable your test to perform linearizable reads correctly – I'll mark this ticket as a duplicate of that work so they will be linked together in JIRA.

Thanks for the detailed report, and let me know if you have any questions.
Asya
