Core Server / SERVER-17975

Stale reads with WriteConcern Majority and ReadPreference Primary

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: 2.6.7
    • Fix Version/s: 3.4.0-rc3
    • Component/s: Replication
    • Labels:
      None
    • Backwards Compatibility:
      Minor Change
    • Operating System:
      ALL
    • Steps To Reproduce:

      Clone jepsen, check out commit 72697c09eff26fdb1afb7491256c873f03404307, cd mongodb, and run `lein test`. Might need to run `lein install` in the jepsen/jepsen directory first.


      Description

      Hello, everyone! Hope you're having a terrific week.

      I think I may have found a thing!

      In Jepsen tests involving a mix of reads, writes, and compare-and-set against a single document, MongoDB appears to allow stale reads, even when writes use WriteConcern.MAJORITY, when network partitions cause a leader election. This holds for both plain find-by-id lookups and for queries explicitly passing ReadPreference.primary().

      Here's how we execute read, write, and compare-and-set operations against a register:

      https://github.com/aphyr/jepsen/blob/72697c09eff26fdb1afb7491256c873f03404307/mongodb/src/mongodb/document_cas.clj#L55-L81

      And this is the schedule for failures: a 60-second on, 60-second off pattern of network partitions cutting the network cleanly into a randomly selected 3-node majority component and a 2-node minority component.

      https://github.com/aphyr/jepsen/blob/72697c09eff26fdb1afb7491256c873f03404307/mongodb/src/mongodb/core.clj#L377-L391

      This particular test is a bit finicky--it's easy to get knossos locked into a really slow verification cycle, or to have trouble triggering the bug. Wish I had a more reliable test for you!

      Attached, linearizability.txt shows the linearizability analysis from Knossos for a test run with a relatively simple failure mode. In this test, MongoDB returns the value "0" for the document, even though the only possible values for the document at that time were 1, 2, 3, or 4. The value 0 was the proper state at some time close to the partition's beginning, but successful reads just after the partition was fully established indicated that at least one of the indeterminate (:info) CaS operations changing the value away from 0 had to have executed.

      You can see this visually in the attached image, where I've drawn the acknowledged (:ok) operations as green bars and indeterminate (:info) operations as yellow bars, omitting :fail ops, which are known not to have taken place. Time moves from left to right, and each process is a numbered horizontal track. The value must be zero just prior to the partition, but in order to read 4 and 3 we must execute process 1's CAS from 0->4; from that point on, no possible path can return the value to 0 in time for process 5's final read.
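      The argument above--that no ordering of the acknowledged and indeterminate operations can produce that final read of 0--is exactly the check Knossos performs. A brute-force toy version of that check for a single register, using a simplified three-operation history rather than the attached history.edn, might look like:

```python
from itertools import permutations

def respects_realtime(order, ops):
    # If op `a` completed before op `b` was invoked, `a` must precede `b`.
    pos = {id(op): i for i, op in enumerate(order)}
    return all(pos[id(a)] < pos[id(b)]
               for a in ops for b in ops
               if a is not b and a["end"] < b["start"])

def legal(order, value=0):
    # Replay the ops sequentially against a register starting at `value`.
    for op in order:
        if op["f"] == "read" and op["v"] != value:
            return False
        elif op["f"] == "cas":
            old, new = op["v"]
            if value != old:
                return False  # an acknowledged cas must have matched
            value = new
        elif op["f"] == "write":
            value = op["v"]
    return True

def linearizable(ops):
    # Linearizable iff some real-time-respecting order is sequentially legal.
    return any(respects_realtime(list(p), ops) and legal(list(p))
               for p in permutations(ops))

# Simplified history: an acknowledged cas 0->4 completes, a later read
# sees 4, and a still later read sees 0 -- no legal order exists.
history = [
    {"f": "cas",  "v": (0, 4), "start": 0, "end": 2},
    {"f": "read", "v": 4,      "start": 3, "end": 4},
    {"f": "read", "v": 0,      "start": 5, "end": 6},
]
```

Dropping the final stale read makes the same history linearizable again; the real checker explores indeterminate (:info) ops as both present and absent, which this sketch omits.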

      Since the MongoDB docs for Read Preferences (http://docs.mongodb.org/manual/core/read-preference/) say "reading from the primary guarantees that read operations reflect the latest version of a document", I suspect this behavior conflicts with Mongo's intended behavior.

      There is good news! If you remove all read operations from the mix, performing only CaS and writes, single-register ops with WriteConcern MAJORITY do appear to be linearizable! Or, at least, I haven't devised an aggressive enough test to expose any faults yet.

      This suggests to me that MongoDB might make the same mistake that Etcd and Consul did with respect to consistent reads: assuming that a node which believes it is currently a primary can safely service a read request without confirming with a quorum of secondaries that it is still the primary. If this is so, you might refer to https://github.com/coreos/etcd/issues/741 and https://gist.github.com/armon/11059431 for more context on why this behavior is not consistent.

      If this is the case, I think you can recover linearizable reads by computing the return value for the query, then verifying with a majority of nodes that no leadership transitions have happened since the start of the query, and then sending the result back to the client--preventing a logically "old" primary from servicing reads.
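      The quorum-confirmation idea can be sketched as a toy simulation. Everything here (node objects, numeric leadership terms, the `confirmed_read` name) is illustrative, not MongoDB's actual protocol:

```python
class Node:
    """A replica that remembers the newest leadership term it has seen."""
    def __init__(self, term):
        self.term = term

def confirmed_read(value, primary_term, reachable, cluster_size):
    # After computing `value`, the primary checks with a majority of the
    # replica set that no newer leadership term exists before replying.
    majority = cluster_size // 2 + 1
    acks = sum(1 for n in reachable if n.term <= primary_term)
    if acks >= majority:
        return value   # still the real primary: safe to answer
    return None        # possibly deposed: refuse the read

# Five-node set; a partition isolates the old primary (term 1) with one
# follower, while the other three nodes elect a new primary at term 2.
old_side = [Node(1), Node(1)]            # old primary + its follower
new_side = [Node(2), Node(2), Node(2)]   # majority holding the new term
```

In this model the old primary can reach only two term-1 nodes, short of the majority of three, so it refuses the read instead of serving stale data, while the new primary's quorum check succeeds.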

      Let me know if there's anything else I can help with!

      Attachments

      1. history.edn (307 kB, Kyle Kingsbury)
      2. linearizability.txt (48 kB, Kyle Kingsbury)
      3. CCNSOQ6UwAEAvsO.jpg (25 kB)
      4. Journal - 84.png (764 kB)


          Activity

          carstenklein@yahoo.de Carsten Klein added a comment (edited)

          Andy Schwerin, here you definitely lost me.

          What I meant was that, prior to reading from or writing to the primary, there should be a third instance that validates that primary before it is used, even if it needed to do multiple RPCs to the list of provable primaries, and also to wait a specific amount of time for the data to replicate across all machines, or at least to the one the client is connected to.

          Ultimately, the reading or writing client would fail if the primary could not be validated in a timely fashion.

          Which, I guess, is basically what the option is all about... except for the failing part, of course.

          Marqin Hubert Jarosz added a comment -

          What's the current state of this bug?

          ramon.fernandez Ramon Fernandez added a comment -

          Hubert Jarosz, the "3.3 Desired" fixVersion indicates that we're aiming to address this ticket in the current development cycle. Feel free to watch the ticket for updates.

          Regards,
          Ramón.

          schwerin Andy Schwerin added a comment -

          Hubert Jarosz, in the meantime, for single-document reads, if you have write privileges on the collection containing the document, you can use a findAndModify that performs a no-op update to avoid stale reads in cases where that is an operational requirement.

          This documentation suggests one approach, though it's not necessary to do a write that actually changes the document.
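          The findAndModify workaround can be sketched as follows, assuming a pymongo-style collection handle already configured with write concern w:"majority"; the `_noop` field name is illustrative, not from the ticket:

```python
def read_through_majority_write(coll, doc_id):
    # `coll` is assumed to be a pymongo-style Collection configured with
    # write concern w:"majority". find_one_and_update issues a single
    # findAndModify: its no-op update must commit at majority before the
    # primary may answer, so a deposed primary cannot return stale data.
    return coll.find_one_and_update(
        {"_id": doc_id},
        # $inc by 0 leaves an existing counter unchanged (though it does
        # create the field, set to 0, the first time it runs).
        {"$inc": {"_noop": 0}},
    )
```

The key point is that the write, not its content, provides the guarantee: the read result is only returned once the primary has proven, by committing at majority, that it still leads the replica set.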

          schwerin Andy Schwerin added a comment -

          We have completed implementation of a new "linearizable" read concern under SERVER-18285, and have undertaken documentation updates under DOCS-8298. As such, I'm resolving this ticket as "fixed" for MongoDB 3.4.0-rc3. The code is actually present and enabled in 3.4.0-rc2, for those interested in further testing. Our own testing included, among other things, integrating Jepsen tests into our continuous integration system; that work was done in SERVER-24509.

          Thanks for your report and follow-up assistance, Kyle Kingsbury.


            People

            • Votes:
              15 Vote for this issue
              Watchers:
              119 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: