Core Server / SERVER-17975

Stale reads with WriteConcern Majority and ReadPreference Primary

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major - P3
    • Resolution: Unresolved
    • Affects Version/s: 2.6.7
    • Fix Version/s: 3.1 Desired
    • Component/s: Replication
    • Labels:
      None
    • Operating System:
      ALL
    • Steps To Reproduce:

      Clone jepsen, check out commit 72697c09eff26fdb1afb7491256c873f03404307, cd mongodb, and run `lein test`. Might need to run `lein install` in the jepsen/jepsen directory first.


      Description

      Hello, everyone! Hope you're having a terrific week.

      I think I may have found a thing!

      In Jepsen tests involving a mix of reads, writes, and compare-and-set against a single document, MongoDB appears to allow stale reads, even when writes use WriteConcern.MAJORITY, when network partitions cause a leader election. This holds for both plain find-by-id lookups and for queries explicitly passing ReadPreference.primary().

      Here's how we execute read, write, and compare-and-set operations against a register:

      https://github.com/aphyr/jepsen/blob/72697c09eff26fdb1afb7491256c873f03404307/mongodb/src/mongodb/document_cas.clj#L55-L81
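
      For readers who would rather not parse the Clojure, the operations are roughly equivalent to the MongoDB Java driver calls sketched below. This is only an illustration, not the actual test code; the database, collection, and field names ("jepsen", "cas", "value") are assumptions.

```java
import com.mongodb.MongoClient;
import com.mongodb.ReadPreference;
import com.mongodb.WriteConcern;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.UpdateOptions;
import com.mongodb.client.model.Updates;
import com.mongodb.client.result.UpdateResult;
import org.bson.Document;

public class RegisterOps {
    private final MongoCollection<Document> coll;

    public RegisterOps(MongoClient client) {
        // All reads are directed at the primary; all writes wait for majority acknowledgment.
        this.coll = client.getDatabase("jepsen")      // hypothetical database name
                .getCollection("cas")                 // hypothetical collection name
                .withReadPreference(ReadPreference.primary())
                .withWriteConcern(WriteConcern.MAJORITY);
    }

    // Read: a plain find-by-id against the primary.
    public Integer read(Object id) {
        Document doc = coll.find(Filters.eq("_id", id)).first();
        return doc == null ? null : doc.getInteger("value");
    }

    // Write: blind overwrite of the register's value, upserting if the document is absent.
    public void write(Object id, int value) {
        coll.updateOne(Filters.eq("_id", id),
                Updates.set("value", value),
                new UpdateOptions().upsert(true));
    }

    // Compare-and-set: succeeds only if the current value matches `expected`.
    public boolean cas(Object id, int expected, int newValue) {
        UpdateResult r = coll.updateOne(
                Filters.and(Filters.eq("_id", id), Filters.eq("value", expected)),
                Updates.set("value", newValue));
        return r.getMatchedCount() == 1;
    }
}
```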

      And this is the schedule for failures: a 60-second on, 60-second off pattern of network partitions cutting the network cleanly into a randomly selected 3-node majority component and a 2-node minority component.

      https://github.com/aphyr/jepsen/blob/72697c09eff26fdb1afb7491256c873f03404307/mongodb/src/mongodb/core.clj#L377-L391
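
      In pseudocode, that nemesis schedule amounts to the loop below. Again, this is a sketch only; `NetworkFault` and its `partition`/`heal` methods are hypothetical stand-ins for whatever mechanism the test actually uses to cut and restore links.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class PartitionSchedule {

    // Hypothetical interface standing in for the real fault-injection mechanism.
    public interface NetworkFault {
        void partition(List<String> sideA, List<String> sideB);
        void heal();
    }

    // Repeatedly: pick a random 3/2 split of the 5 nodes, hold the partition
    // for 60 seconds, heal it, wait 60 seconds, and start over.
    public static void run(List<String> fiveNodes, NetworkFault net) throws InterruptedException {
        List<String> nodes = new ArrayList<>(fiveNodes);
        while (true) {
            Collections.shuffle(nodes);
            List<String> majority = nodes.subList(0, 3);  // randomly chosen 3-node component
            List<String> minority = nodes.subList(3, 5);  // remaining 2-node component
            net.partition(majority, minority);            // cut the network cleanly in two
            TimeUnit.SECONDS.sleep(60);                   // 60 seconds partitioned
            net.heal();                                   // restore full connectivity
            TimeUnit.SECONDS.sleep(60);                   // 60 seconds healthy
        }
    }
}
```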

      This particular test is a bit finicky--it's easy to get Knossos locked into a really slow verification cycle, or to have trouble triggering the bug. Wish I had a more reliable test for you!

      Attached, linearizability.txt shows the linearizability analysis from Knossos for a test run with a relatively simple failure mode. In this test, MongoDB returns the value "0" for the document, even though the only possible values for the document at that time were 1, 2, 3, or 4. The value 0 was the proper state at some time close to the partition's beginning, but successful reads just after the partition was fully established indicated that at least one of the indeterminate (:info) CaS operations changing the value away from 0 had to have executed.

      You can see this visually in the attached image, where I've drawn the acknowledged (:ok) operations as green bars and the indeterminate (:info) operations as yellow bars, omitting :fail ops, which are known not to have taken place. Time moves from left to right; each process is a numbered horizontal track. The value must be zero just prior to the partition, but in order to read 4 and 3 we must execute process 1's CaS from 0->4; all possible paths from that point on cannot result in a value of 0 in time for process 5's final read.

      Since the MongoDB docs for Read Preferences (http://docs.mongodb.org/manual/core/read-preference/) say "reading from the primary guarantees that read operations reflect the latest version of a document", I suspect this behavior conflicts with Mongo's intended behavior.

      There is good news! If you remove all read operations from the mix, performing only CaS and writes, single-register ops with WriteConcern MAJORITY do appear to be linearizable! Or, at least, I haven't devised an aggressive enough test to expose any faults yet.

      This suggests to me that MongoDB might make the same mistake that Etcd and Consul did with respect to consistent reads: assuming that a node which believes it is currently a primary can safely service a read request without confirming with a quorum of secondaries that it is still the primary. If this is so, you might refer to https://github.com/coreos/etcd/issues/741 and https://gist.github.com/armon/11059431 for more context on why this behavior is not consistent.

      If this is the case, I think you can recover linearizable reads by computing the return value for the query, then verifying with a majority of nodes that no leadership transitions have happened since the start of the query, and then sending the result back to the client--preventing a logically "old" primary from servicing reads.
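
      In rough pseudocode (the names and types below are hypothetical illustrations, not MongoDB internals), that read path would look something like this:

```java
import java.util.function.Supplier;

class LinearizableReadPath {

    // Hypothetical view of the replica set from the node serving the read.
    interface ReplicaState {
        long currentTerm();                      // the term this node believes it is primary for
        boolean majorityConfirmsTerm(long term); // round-trip to a quorum: is this still the latest term?
    }

    // 1. Compute the query result locally.
    // 2. Confirm with a majority that no leadership transition has happened
    //    since the query started.
    // 3. Only then return the result to the client; otherwise fail the read.
    static <T> T linearizableRead(ReplicaState rs, Supplier<T> query) {
        long termAtStart = rs.currentTerm();
        T result = query.get();
        if (!rs.majorityConfirmsTerm(termAtStart)) {
            throw new IllegalStateException("no longer primary: a newer term was observed");
        }
        return result;
    }
}
```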

      Let me know if there's anything else I can help with!

      Attachments

      1. history.edn (307 kB) - Kyle Kingsbury
      2. linearizability.txt (48 kB) - Kyle Kingsbury
      3. CCNSOQ6UwAEAvsO.jpg (25 kB)
      4. Journal - 84.png (764 kB)

        Issue Links

          Activity

          Kyle Kingsbury (aphyr) added a comment - edited

          > I'm not sure that can ever happen. Even if it could happen, it would be easy to tweak the step-down and election sequences so that step-down is guaranteed to happen faster than election of a new primary.

          The network is not synchronous, clocks drift, nodes pause, etc. Fixing a race condition via a timeout is an easy workaround, but I think you'll find (like Consul) that it's a probabilistic hack at best.

          Andy Schwerin (schwerin) added a comment

          You cannot implement this feature with timing tricks. Even if everything else is going great, the OS scheduler can screw you pretty easily on a heavily loaded system, and just fail to schedule the step-down work on the old primary. We see this in our test harnesses sometimes, in tests that wait for failover to complete.

          Carsten Klein (carstenklein@yahoo.de) added a comment - edited

          Hm, looking at MariaDB Galera, it uses both a proxy and an additional arbitrator to handle failover and to make sure that updates, and presumably also reads, are valid.
          Would it not be possible to implement a similar scheme in MongoDB (minus the proxy, of course) to get rid of this once and for all?

          As I see it, each MongoDB replica acts as an arbitrator. The same goes for MariaDB Galera; however, there they also integrated an additional independent arbitrator that does not hold a replicated data set, just the transaction log.

          Andy Schwerin (schwerin) added a comment - edited

          Carsten Klein, if I understand Galera's model correctly, SERVER-18285 should provide behavior equivalent to setting wsrep_sync_wait = 1, and using it in conjunction with w:majority writes, starting in MongoDB 3.2, ought to provide the equivalent of wsrep_sync_wait = 3 or possibly 7.

          There are some differences, as individual replica sets in MongoDB only elect a single primary (write master) at a time, but I believe the effect is similar.

          Carsten Klein (carstenklein@yahoo.de) added a comment - edited

          Andy Schwerin, here you definitely lost me.

          What I meant was that, prior to reading from or writing to the primary, there should be a third instance that validates that primary before it is used, even if that requires multiple RPCs to the list of probable primaries, and waiting a certain amount of time until the data has been replicated across all machines, or at least to the one the client is connected to.

          Ultimately, this would cause the reading or writing client to fail if the primary could not be validated in a timely fashion.

          Which, I guess, is basically what the proposed option is all about... except for the failing part, of course.


            People

            • Votes: 6
            • Watchers: 98

            Dates

            • Days since reply: 13 weeks, 3 days ago