Core Server / SERVER-17975

Stale reads with WriteConcern Majority and ReadPreference Primary

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: 2.6.7
    • Fix Version/s: 3.4.0-rc3
    • Component/s: Replication
    • Labels:
      None
    • Backwards Compatibility:
      Minor Change
    • Operating System:
      ALL
    • Steps To Reproduce:
      Clone jepsen, check out commit 72697c09eff26fdb1afb7491256c873f03404307, cd mongodb, and run `lein test`. Might need to run `lein install` in the jepsen/jepsen directory first.

      Description

      Hello, everyone! Hope you're having a terrific week.

      I think I may have found a thing!

      In Jepsen tests involving a mix of reads, writes, and compare-and-set against a single document, MongoDB appears to allow stale reads, even when writes use WriteConcern.MAJORITY, when network partitions cause a leader election. This holds for both plain find-by-id lookups and for queries explicitly passing ReadPreference.primary().

      Here's how we execute read, write, and compare-and-set operations against a register:

      https://github.com/aphyr/jepsen/blob/72697c09eff26fdb1afb7491256c873f03404307/mongodb/src/mongodb/document_cas.clj#L55-L81
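
      (For concreteness, here is a rough Python/pymongo analogue of what those three operations do; the real test client is the Clojure code linked above, and the database, collection, and field names below are made up purely for illustration.)

          # Rough pymongo analogue of the Jepsen register client (illustrative only).
          from pymongo import MongoClient, ReadPreference
          from pymongo.write_concern import WriteConcern

          client = MongoClient("mongodb://n1,n2,n3,n4,n5/?replicaSet=rs0")
          coll = client.get_database("jepsen").get_collection(
              "registers",
              read_preference=ReadPreference.PRIMARY,    # reads go to the primary
              write_concern=WriteConcern(w="majority"),  # writes wait for a majority
          )

          def read(doc_id):
              # Plain find-by-id; even with ReadPreference.PRIMARY this is the
              # operation that returned stale/uncommitted values in the test.
              doc = coll.find_one({"_id": doc_id})
              return doc["value"] if doc else None

          def write(doc_id, value):
              coll.update_one({"_id": doc_id}, {"$set": {"value": value}}, upsert=True)

          def cas(doc_id, expected, new):
              # Compare-and-set: succeeds only if the current value matches `expected`.
              res = coll.update_one({"_id": doc_id, "value": expected},
                                    {"$set": {"value": new}})
              return res.modified_count == 1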

      And this is the schedule for failures: a 60-second on, 60-second off pattern of network partitions cutting the network cleanly into a randomly selected 3-node majority component and a 2-node minority component.

      https://github.com/aphyr/jepsen/blob/72697c09eff26fdb1afb7491256c873f03404307/mongodb/src/mongodb/core.clj#L377-L391

      This particular test is a bit finicky--it's easy to get knossos locked into a really slow verification cycle, or to have trouble triggering the bug. Wish I had a more reliable test for you!

      Attached, linearizability.txt shows the linearizability analysis from Knossos for a test run with a relatively simple failure mode. In this test, MongoDB returns the value "0" for the document, even though the only possible values for the document at that time were 1, 2, 3, or 4. The value 0 was the proper state at some time close to the partition's beginning, but successful reads just after the partition was fully established indicated that at least one of the indeterminate (:info) CaS operations changing the value away from 0 had to have executed.

      You can see this visually in the attached image, where I've drawn the acknowledged (:ok) operations as green bars and indeterminate (:info) operations as yellow bars, omitting :fail ops, which are known not to have taken place. Time moves from left to right; each process is a numbered horizontal track. The value must be zero just prior to the partition, but in order to read 4 and 3 we must execute process 1's CAS from 0->4; no possible path from that point on can result in a value of 0 in time for process 5's final read.

      Since the MongoDB docs for Read Preferences (http://docs.mongodb.org/manual/core/read-preference/) say "reading from the primary guarantees that read operations reflect the latest version of a document", I suspect this behavior conflicts with Mongo's intended behavior.

      There is good news! If you remove all read operations from the mix, performing only CaS and writes, single-register ops with WriteConcern MAJORITY do appear to be linearizable! Or, at least, I haven't devised an aggressive enough test to expose any faults yet.

      This suggests to me that MongoDB might make the same mistake that Etcd and Consul did with respect to consistent reads: assuming that a node which believes it is currently a primary can safely service a read request without confirming with a quorum of secondaries that it is still the primary. If this is so, you might refer to https://github.com/coreos/etcd/issues/741 and https://gist.github.com/armon/11059431 for more context on why this behavior is not consistent.

      If this is the case, I think you can recover linearizable reads by computing the return value for the query, then verifying with a majority of nodes that no leadership transitions have happened since the start of the query, and then sending the result back to the client--preventing a logically "old" primary from servicing reads.
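
      (As a toy, self-contained sketch of that idea -- everything below is invented for illustration and is not MongoDB code -- a primary would answer a read only after a majority of nodes confirm they still see it as primary in the same term:)

          # Toy simulation: an isolated "old" primary refuses to serve reads once it
          # can no longer gather a majority that agrees it is still primary.
          class Node:
              def __init__(self, name, term, primary):
                  self.name, self.term, self.primary = name, term, primary

              def confirms(self, term, primary_name):
                  return self.term == term and self.primary == primary_name

          def guarded_read(value_on_primary, primary_name, term, peers, quorum):
              votes = 1 + sum(p.confirms(term, primary_name) for p in peers)  # count self
              if votes < quorum:
                  raise RuntimeError("no longer primary; refusing to serve the read")
              return value_on_primary

          # A primary from term 1 fails the check once a majority has moved on to term 2:
          peers = [Node("n2", 2, "n4"), Node("n3", 2, "n4"), Node("n4", 2, "n4"), Node("n5", 1, "n1")]
          try:
              guarded_read(0, "n1", term=1, peers=peers, quorum=3)
          except RuntimeError as err:
              print(err)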

      Let me know if there's anything else I can help with!

      Attachments:
      1. history.edn (307 kB, Kyle Kingsbury)
      2. linearizability.txt (48 kB, Kyle Kingsbury)
      3. CCNSOQ6UwAEAvsO.jpg (25 kB)
      4. Journal - 84.png (764 kB)


          Activity

          Asya Kamsky added a comment - edited

          Hello Kyle,

          As you and Knossos have discovered, it is not possible to do fully linearized single-document reads with the current version of MongoDB.

          I believe your test framework is not taking into account that reading from a primary does not guarantee the data read will survive a network partition. This is because MongoDB's read isolation semantics are similar to "read uncommitted" in a traditional database system when you take the full replica set into account.

          As the docs mention, data written with majority writeConcern that has been acknowledged will survive any replica set event that allows a new primary to be elected. However, after the write is made on the primary, but before it has successfully replicated to a majority of the cluster, it is visible to any other connection that's reading from the primary.

          This allows the following sequence of events:

          T1 network partition happens
          T2 write A happens, waits for write concern
          T3 read of A happens on the primary
          T4 primary steps down due to not seeing majority
          T5 new primary is elected

          When write A has not propagated to the majority of the replica set, it may not be present on the newly elected primary (in fact, if write A has replicated to none of the secondaries, it is guaranteed to be absent from the newly elected primary).

          I believe such a sequence of events was observed in your case: the majority write concern was not yet satisfied, the unacknowledged data had been written on the primary and were visible to other connections (process 1 in your case), but the value was not present on the newly elected primary (which is the node that process 5 finally successfully read from). The phenomenon your tests are observing is not stale reads (of value 0) but rather uncommitted reads, namely the reads "1 read 4" and "1 read 3", which happen on the "old" primary. Those writes were not acknowledged, nor replicated to the majority partition, and they will be (correctly) rolled back when the partition is removed.

          Currently, there is a Documentation task, DOCS-5141, to clarify read isolation semantics further. In addition, in SERVER-18022 we are working on support for read-committed isolation, which will enable your test to perform linearizable reads correctly – I'll mark this ticket as a duplicate of that work so they will be linked together in JIRA.

          Thanks for the detailed report, and let me know if you have any questions.
          Asya

          Kyle Kingsbury added a comment

          Sketch illustrating that stale reads are a degenerate case of dirty reads.

          Kyle Kingsbury added a comment - edited

          The existence of this behavior actually implies both anomalies are present in MongoDB, but I'm phrasing it conservatively.

          Why? A dirty read from an isolated primary can be trivially converted to a stale read if the write to the isolated primary doesn't affect the outcome of the read (or if the write doesn't take place at all). I think there are two problems to fix here--supporting read-committed isolation will prevent dirty reads, but still allows stale reads. You also have to couple reads to oplog acknowledgement in some way to prevent stale read transactions.

          I've attached a sketch (journal-84.png) to illustrate--all you have to do is execute the write on the new primary instead of the old to convert a dirty read to a stale one. Either way, you're not reading "the most recent state."

          Note that you don't have to go full read-committed to fix this anomaly: you can prevent stale and dirty reads for single documents without supporting RC for multi-doc operations (just a difference in lock granularity), so if you want to support reading the latest version, you can have it in both read-uncommitted and read-committed modes.

          The read isolation docs (http://docs.mongodb.org/manual/core/write-concern/#read-isolation) are technically correct, I think, but sorta misleading: "For all inserts and updates, MongoDB modifies each document in isolation: clients never see documents in intermediate states" kinda suggests that the read uncommitted problem refers to multiple-document updates—which is also true—but it doesn't mention that even read operations on a single document may see invalid states that are not causally connected to the final history.

          The read preference docs (http://docs.mongodb.org/manual/core/read-preference/) make some pretty explicit claims that Mongo supports linearizable reads, saying "Reading from the primary guarantees that read operations reflect the latest version of a document", and "All read preference modes except primary may return stale data".

          With this in mind, it might be a good idea to let users know all read modes may return stale data, and that the difference in ReadPreference just changes the probabilities. For instance, "Ensure that your application can tolerate stale data if you choose to use a non-primary mode," could read "Always ensure that your application can tolerate stale data."

          Kyle Kingsbury added a comment

          (perhaps I should also mention, in case anyone comes along and thinks this is subsumed by SERVER-18022, that fixing SERVER-18022 does not necessarily resolve the problem of stale reads)

          Andy Schwerin added a comment - edited

          From my interpretation of Kyle's diagram, if SERVER-18022 were resolved and the test threads were doing single-document committed reads, thread 1 would not have observed 4 then 3, but 0 and then 0 again, since neither thread 0 nor thread 2 has completed its write. Similarly, thread 5 would continue to read 0. Those values aren't stale – they would represent the most recent committed value. Further, there's a linearizable schedule in that case, I believe. It's been a while since I read Herlihy's paper, but if I have this right, with SERVER-18022 and committed single-document reads, a legal schedule would have been to process all of the reads in some sequence, and then the writes of threads 0, 3, 4, 1 and 2 in that order. Note that, unlike Asya Kamsky, I primarily consulted the diagram, ticket description and prior comments.

          As for Kyle's point about needing reads to be coupled to oplog acknowledgement to prevent stale reads, I'm of two minds. On the one hand, an application can convert reads into atomic read-modify-write operations today using the findAndModify and getLastError commands in MongoDB in order to tie the reads into the oplog acknowledgement system (NB: I don't think most drivers support this today). On the other hand, even if an application does that, the response might be delayed during transport, during which time a more-current value might appear. Sticking to the single-document case for the moment: if a thread communicates with other threads only through MongoDB, so long as it never sees an older value of a document after seeing a newer value, and so long as it does only committed reads, what would staleness even mean?

          Now, if the threads communicate directly with each other, the story gets more complicated and SERVER-18022 may not be sufficient by itself. In that case, allowing the client threads to pass some kind of logical clock token when they communicate with each other would suffice to prevent a causal ordering violation during periods when one node erroneously believes itself to still be primary. That token could be a combination of the monotonically increasing election term id and the highest committed oplog timestamp on the node when the read completed. If the causally second observer saw an earlier (term, optime) pair than the causally first observer, it would know to reject the read.
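
          (A tiny illustration of that comparison -- the shapes below are hypothetical and do not correspond to any driver API:)

              # Hypothetical (term, committed optime) causal token; a causally-later
              # reader rejects a result whose token is behind the one it was handed.
              from collections import namedtuple

              ReadToken = namedtuple("ReadToken", ["term", "committed_optime"])

              def read_is_acceptable(observed, causally_prior):
                  # Tuples compare lexicographically: election term first, then optime.
                  return (observed.term, observed.committed_optime) >= \
                         (causally_prior.term, causally_prior.committed_optime)

              # e.g. a read served by an old primary in term 3 is rejected by a client
              # that already saw a token from term 4:
              print(read_is_acceptable(ReadToken(3, 1700), ReadToken(4, 1650)))  # False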

          That solution depends on resolution of SERVER-12385 (adding term ids to the oplog is part of that work) and SERVER-16570 (involving write concern satisfaction after rollback), which we're planning to do during the development of version 3.2. It is worth noting that even that solution is insufficient for managing some causal relationships when communication is through multiple documents. I don't believe we make promises about those causal relationships, today.

          In the meantime, we will work to improve the documentation around this behavior in current versions of MongoDB. As always, please respond if you have questions or comments.

          Eric Milkie added a comment

          I'm closing this as a duplicate of the read-committed ticket, but please feel free to reopen for further discussion.

          Kyle Kingsbury added a comment

          Maybe I should have been more explicit: this is not a duplicate of SERVER-18022. Read-committed does not prevent stale reads.

          Eric Milkie added a comment - edited

          EDIT This ticket was re-opened on April 21.
          ~~~~~
          Kyle, I'm going to switch this to "Works as Designed", as you're correct that there are many more facets to this topic than just a simple duplication of one work ticket.
          I'm still uncertain what you mean by "staleness" in this context, as highlighted by Andy in his response above.

          Kyle Kingsbury added a comment - edited

          To elaborate...

          > Further, there's a linearizable schedule in that case, I believe. It's been a while since I read Herlihy's paper, but if I have this right, with SERVER-18022 and committed single-document reads, a legal schedule would have been to process all of the reads in some sequence, and then the writes of threads 0, 3, 4, 1 and 2

          I don't understand what you mean--it doesn't make sense for a register to read 0, 4, 3, and 0 again without any writes taking place.

          > On the other hand, even if an application does that the response might be delayed during transport, during which time a more-current value might appear. Sticking to the single-document case, for the moment, if a thread communicates with other threads only through MongoDB, so long as it never sees an older value of a document after seeing a newer value of a document, and so long as it does only committed reads, what would staleness even mean?

          The property you're describing is sequential consistency: all processes see operations in the same order, but do not agree on when they happen. Sequentially consistent systems allow arbitrarily stale reads: it is legal, for instance, for a new process to see no documents, which leads to confusing anomalies like, say, submitting a comment, refreshing the page, and seeing nothing there. I think you would be hard-pressed to find users who have no side-channels between processes, and I also think most of your user base would interpret "latest version" to mean "a state between the invocation and completion times of my read operation", not "some state logically subsequent to my previous operation and temporally prior to the completion of my read operation."

          > Now, if the threads communicate directly with each other

          They can and they will communicate--I have never talked to a Mongo user who did not send data from MongoDB to a human being. This is why linearizability is a useful invariant: you know that if you post a status update, receive an HTTP 200 response, call up your friend, and ask them to look, they'll see your post.

          You can ask people to embed causality tokens in all their operations, but a.) you have to train users how to propagate and merge causality tokens correctly, b.) this does nothing for fresh processes, and c.) this is not what most people mean when they say "immediate" or "latest version", haha.

          Kyle Kingsbury added a comment - edited

          In what possible sense is this "working as designed"? The MongoDB documentation repeats the terms "immediate consistency" and "latest version" over and over again.

          Here's the MongoDB chief architect claiming Mongo provides "Immediate Consistency" in a 2012 talk: http://www.slideshare.net/mongodb/mongodb-basic-concepts-15674838

          Here's the Read preference documentation claiming Mongo ReadPreference=primary "guarantees that read operations reflect the latest version of a document": http://docs.mongodb.org/manual/core/read-preference/

          The MongoDB FAQ says "MongoDB is consistent by default: reads and writes are issued to the primary member of a replica set": http://www.mongodb.com/faq#consistency

          And the Architecture Guide repeats the theme that only non-primary ReadPreferences can see stale data: http://s3.amazonaws.com/info-mongodb-com/MongoDB_Architecture_Guide.pdf.

          What Mongo actually does is allow stale reads: it is possible to execute a WriteConcern=MAJORITY write of a new value, wait for it to return successfully, perform a read with ReadPreference=PRIMARY, and not see the value you just wrote.

          Andy Schwerin added a comment - edited

          In my reading of this ticket, there are two actions requested. One is to schedule the ticket's suggestion for a feature to support treating a single document as a linearizable concurrent object without forcing the client to convert reads to read-modify-write operations. The other is to correct the MongoDB documentation about read consistency, emphasizing the conditions in which stale reads may occur with read preference "primary" in current and prior versions of MongoDB. Please read below for details, but in short I propose to break off the documentation request into a separate ticket and to use this ticket as the handle for scheduling the feature request.

          Regarding the possible linearizable schedule for the reads, let me try to clarify. Your diagram indicates that by the end of the period in question, none of the writes have finished being confirmed by the replication system. If SERVER-18022 were resolved, all the reads in the diagram would have returned 0, because none of the writes had committed. As such, there would be a legal linearizable schedule, whose prefix includes the completion of all the read operations, and whose suffix includes the completion of all the write operations. I think in this case that the fact that the writes had in fact started is not relevant. In this case, write operation completion means replication to a majority of voting nodes and confirmation of that fact to the primary that accepted the write from the client.

          Anyhow, the behavior you did observe certainly doesn't have a linearizable schedule. As you point out, even with SERVER-18022 you don't get reads into a linear schedule for free. The problem is that there is a period of time during a network partition when two nodes may believe themselves to be primary. As soon as those two nodes communicate, one will step down. If the partition lasts long enough, the node in the minority partition will step down, but there is an inevitable window for stale reads when a client reads from a primary that will inevitably step down. As an aside, improvements to the consensus protocol can be used to bring that period down to a few network roundtrip periods (hundreds of milliseconds), and that is the subject of the somewhat ill-described SERVER-12385.

          You suggested (approximately) transforming reads into atomic read-modify-writes in order to achieve linearizable reads. You didn't propose it exactly that way, and your description leaves more room for optimization, but "coupling reads to oplog acknowledgement" pretty much degrades to converting reads to read-modify-writes in periods of low write volume. The behavior can be achieved today, albeit somewhat clumsily and only with some client drivers, by using the "findAndModify" command to issue your reads and then issuing a getLastError command to wait for write concern satisfaction. Your findAndModify command will need to make some change to the document being read, such as incrementing an otherwise ignored field, in order to force an entry into the oplog, and you cannot observe the value until the getLastError command returns successfully, indicating that your read-modify-write replicated successfully.
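
          (An illustrative sketch of that read-as-write workaround, using pymongo's find_one_and_update as a stand-in; note that attaching a write concern to findAndModify itself requires a server version that supports it, so on older servers you would still need the separate getLastError step described above. Collection and field names are made up.)

              from pymongo import MongoClient, ReadPreference
              from pymongo.write_concern import WriteConcern
              from pymongo.collection import ReturnDocument

              coll = MongoClient("mongodb://n1,n2,n3/?replicaSet=rs0").get_database("app").get_collection(
                  "registers",
                  read_preference=ReadPreference.PRIMARY,
                  write_concern=WriteConcern(w="majority", wtimeout=5000),
              )

              def read_via_rmw(doc_id):
                  # Bump an otherwise-ignored counter so the "read" forces an oplog
                  # entry, and only trust the returned document once the update has
                  # been acknowledged by a majority.
                  return coll.find_one_and_update(
                      {"_id": doc_id},
                      {"$inc": {"_read_nonce": 1}},   # hypothetical ignored field
                      return_document=ReturnDocument.AFTER,
                  )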

          Finally, as you indicated above, there is a clear documentation issue. The documentation you reference needs to be updated. As mentioned, there's an active DOCS ticket for part of that, DOCS-5141, and I'm now convinced that we'll need a separate one to review all of our read-preference documentation. Improving the documentation will be tricky because, while linearizable distributed objects are often convenient, they come at a comparatively high cost in terms of communication overhead. Since users' needs can be frequently satisfied with more relaxed consistency models, the updated documentation will need to help developers weigh the probability of a stale read with its impact on their application.

          Andy Schwerin added a comment

          Assigning to me for scheduling.

          Mark Callaghan added a comment

          Still catching up on things so perhaps my comments are not relevant but...

          In 2.6 (mmapv1 only) changes on the master are visible:
          1) before journal sync is done
          2) before replicas might have received or ack'd the change

          In 3.0 with WiredTiger and RocksDB changes on the master are not visible until after their redo log sync has been done. I assume that #1 continues to be true for mmapv1.

          In 3.0 I assume that #2 is still a problem.

          I wrote about this in:
          https://jira.mongodb.org/browse/DOCS-2908
          http://smalldatum.blogspot.com/2014/03/when-does-mongodb-make-transaction.html

          We also experienced this in MySQL land with semi-sync replication and solved the problem with lossless semisync replication. See the post by Yoshi for more details but the property we provide is that commits are not visible on the master until the commit log has been archived on at least one other replica or log-only replica.
          http://yoshinorimatsunobu.blogspot.com/2014/04/semi-synchronous-replication-at-facebook.html

          It will take me a while to get through all of the details in this case but in the end I hope we can describe the MongoDB behavior in a few sentences.

          Kyle Kingsbury added a comment - edited

          > in short I propose to break off the documentation request into a separate ticket and to use this ticket as the handle for scheduling the feature request.

          Sounds good to me! Aligning the docs to the current behavior can be done right away. I tried to make a reasonable survey of the Mongo consistency docs in the Jepsen post here: https://aphyr.com/posts/322-call-me-maybe-mongodb-stale-reads. The post also suggests some example anomalies that might be helpful for users trying to reason about whether they can tolerate dirty/stale reads.

          > "coupling reads to oplog acknowledgement" pretty much degrades to converting reads to read-modify-writes in periods of low write volume.

          The workaround I describe in the post is to just do a findAndModify from the current state to itself. Experiments suggest this will do the trick, but if Mongo's smart enough to optimize that CaS away this won't help, haha. I say "couple", though, because you don't actually need to write anything to the oplog. You can actually piggyback the read state onto the existing oplog without inserting any new ops by simply blocking long enough for some other operation to be replicated--thereby verifying the primary is still current. Or you can inject a heartbeat event every so often. Oh, and you can also batch read ops, which should improve performance as well. The Consul and Raft discussions I linked to talk about both tactics.

          > If SERVER-18022 were resolved all the reads in the diagram would have returned 0, because none of the writes had committed.

          Ah, yeah, you're assuming all of these operations take place against the minority primary. That may be the case for this particular history, but in general, writes can occur on either side of the partition, leading to stale reads--the reads could see 0, then 1, then 0, then 1, or any other pattern, depending on which primary clients are talking to and when they make their requests.

          Mark Callaghan added a comment

          https://jira.mongodb.org/browse/DOCS-5185 is related to this. Commits can be visible on the master before a slave receives the oplog entry. Therefore visible commits can be rolled back regardless of the write concern. I think the manual and online training should be updated to explain that.

          I have also been hoping that eventually we will get something like lossless semisync replication in MongoDB as the Majority write concern is similar to semisync. Maybe this will serve as motivation.

          Henrik Ingo added a comment

          I was thinking about this today, and I'm still wondering whether stale reads are at all possible in MongoDB, even today with 2.6/3.0.

          The kind of stale read that Kyle describes can happen if two primaries exist at the same time: the old primary about to step down, and the newly elected primary. Even if it's unlikely, in theory a client process could flip-flop between the primaries so that it reads: new primary, old primary, new primary. However, this can only happen if the old primary steps down after the new primary is elected. (Using ReadPreference = PRIMARY is of course assumed here.)

          I'm not sure that can ever happen. Even if it could happen, it would be easy to tweak the step-down and election sequences so that step-down is guaranteed to happen faster than election of a new primary.

          This would be a more performant and easier solution than using findAndModify+getLastError or any other solution depending on doing roundtrips via the oplog.

          (Note that there have of course been a couple bugs reported where a replica set had 2 primaries even for long times, but those were bugs, not part of the intended failover protocol.)

          Kyle Kingsbury added a comment - edited

          > I'm not sure that can ever happen. Even if it could happen, it would be easy to tweak the step-down and election sequences so that step-down is guaranteed to happen faster than election of a new primary.

          The network is not synchronous, clocks drift, nodes pause, etc. Fixing a race condition via a timeout is an easy workaround, but I think you'll find (like Consul) that it's a probabilistic hack at best.

          Andy Schwerin added a comment

          You cannot implement this feature with timing tricks. Even if everything else is going great, the OS scheduler can screw you pretty easily on a heavily loaded system, and just fail to schedule the step-down work on the old primary. We see this in our test harnesses sometimes, in tests that wait for failover to complete.

          Carsten Klein added a comment - edited

          Hm, looking at MariaDB Galera: it uses both a proxy and an additional arbitrator for handling failover and for making sure that updates, and presumably also reads, are valid.
          Would it not be possible to implement a similar scheme, minus the proxy of course, in MongoDB to get rid of this once and for all?

          As I see it, each MongoDB replica acts as an arbitrator. The same goes for MariaDB Galera; however, there they also integrated an additional independent arbitrator that holds no replica of the data, just the transaction log.

          Andy Schwerin added a comment - edited

          Carsten Klein, if I understand Galera's model correctly, SERVER-18285 should provide the equivalent behavior to setting wsrep_sync_wait = 1, while using it in conjunction with w:majority writes starting in MongoDB 3.2 ought to provide wsrep_sync_wait=3 or possibly 7.

          There are some differences, as individual replica sets in MongoDB only elect a single primary (write master) at a time, but I believe the effect is similar.

          Carsten Klein added a comment - edited

          Andy Schwerin, here you definitely lost me.

          What I meant was that, prior to reading from or writing to the primary, there should be a third instance that validates the primary before it is used, even if it needs to do multiple RPCs to the list of possible primaries and also wait a specific amount of time until the data has replicated across all machines, or at least to the one the client is connected to.

          Ultimately this would cause the reading or writing client to fail if the primary could not be validated in a timely fashion.

          Which, I guess, is basically what the option is all about... except for the failing part, of course.

          Hubert Jarosz added a comment

          What's the current state of this bug?

          Ramon Fernandez added a comment

          Hubert Jarosz, the "3.3 Desired" fixVersion indicates that we're aiming to address this ticket in the current development cycle. Feel free to watch the ticket for updates.

          Regards,
          Ramón.

          Andy Schwerin added a comment

          Hubert Jarosz, in the meantime, for single-document reads, if you have write privileges on the collection containing the document, you can use a findAndModify that performs a no-op update to avoid stale reads in cases where that is an operational requirement.

          This documentation suggests one approach, though it's not necessary to do a write that actually changes the document.

          Andy Schwerin added a comment

          We have completed implementation of a new "linearizable" read concern under SERVER-18285, and have undertaken some documentation updates under DOCS-8298. As such, I'm resolving this ticket as "fixed" for MongoDB 3.4.0-rc3. The code is actually present and enabled in 3.4.0-rc2, for those interested in further testing. Our own testing included, among other things, integrating Jepsen tests into our continuous integration system. That work was done in SERVER-24509.
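
          (For readers landing here later, a minimal pymongo sketch of such a linearizable read against 3.4+ -- this assumes a driver that supports read concern, e.g. pymongo 3.4+, and the connection, collection, and document names are hypothetical:)

              from pymongo import MongoClient, ReadPreference
              from pymongo.read_concern import ReadConcern

              coll = MongoClient("mongodb://n1,n2,n3/?replicaSet=rs0").get_database("app").get_collection(
                  "registers",
                  read_preference=ReadPreference.PRIMARY,   # linearizable reads must target the primary
                  read_concern=ReadConcern("linearizable"),
              )

              # The read returns only after the primary has confirmed, through the
              # replication machinery, that it is still the primary; a maxTimeMS makes
              # an isolated primary fail fast instead of blocking indefinitely.
              doc = coll.find_one({"_id": "register-1"}, max_time_ms=10000)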

          Thanks for your report and follow-up assistance, Kyle Kingsbury.


            People

            • Votes: 15
            • Watchers: 119
