[SERVER-35316] Causal Consistency Violation for local read/write concerns in Jepsen Sharded Cluster Test Created: 31/May/18 Updated: 27/Oct/23 Resolved: 20/Sep/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Cristopher Stauffer | Assignee: | Alyson Cabral (Inactive) |
| Resolution: | Works as Designed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: | | ||
| Issue Links: | | ||
| Operating System: | ALL | ||
| Participants: | | ||
| Linked BF Score: | 0 | ||
| Description |
|
Original Problem:
When running a test against 2 replica set shards, with rc “local” and wc “w1”, we get reads returning the base value of the document, nil, despite occurring after acknowledged writes in the session. Each single-threaded client writes to one key at a time, using one session, against a single mongos router, and does not have writes to secondaries enabled. The nemesis partitions the network into random halves for 10 seconds, with a 10 second wait in between. (This failure has not appeared with partitions disabled.) The expected pattern of operations in this test is read nil/0, write 1, read 1, write 2, read 2.

In the test histories I’ve attached below, the :position field is the op’s optime value, read from the session after acknowledgement. Its :link field is the previous optime value the client has seen in that key’s session.

In the first set of results (rwr-initial-read-1), this occurs in 3 keys over a 40 second test. In each failing key (15, 23, and 16), we see a read of 0 (representing the initial empty document’s nil read for the checker) and a successful write of 1. Then the read following write 2 returns an empty document.

The second result set (rwr-initial-read-2) is a longer test over 300 seconds, where we observe this anomaly 6 times. Keys 50, 51, 82, 125, and 143 appear to drop the value on write 2; however, key 116 is missing the value for write 1. Also of note: in the history for key 116 (under independent/116), write 2 succeeds and appears in the final read for the key. See op `{:type :ok, :f :read, :value 2, :process 5, :time 205602064887, :position 6559250071353819138, :link 6559250071353819137, :index 1206}`
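For illustration, a minimal sketch of the per-key pattern each client performs, written here in Python with PyMongo rather than the Jepsen/Clojure harness actually used; the connection string, database, and document shape are assumptions, not taken from the test:

```python
from pymongo import MongoClient
from pymongo.read_concern import ReadConcern
from pymongo.write_concern import WriteConcern

# Connect through a single mongos router (address is illustrative).
client = MongoClient("mongodb://mongos.example:27017")
coll = client["jepsen"]["registers"].with_options(
    read_concern=ReadConcern("local"),   # rc "local"
    write_concern=WriteConcern(w=1),     # wc "w1"
)

# One causally consistent session per client, one key at a time.
with client.start_session(causal_consistency=True) as s:
    key = 15
    # Expected pattern: read nil/0, write 1, read 1, write 2, read 2.
    print(coll.find_one({"_id": key}, session=s))                # expect nil/0
    coll.update_one({"_id": key}, {"$set": {"v": 1}}, upsert=True, session=s)
    print(s.operation_time)          # optime after acknowledgement (the :position field)
    print(coll.find_one({"_id": key}, session=s))                # expect v == 1
    coll.update_one({"_id": key}, {"$set": {"v": 2}}, upsert=True, session=s)
    print(coll.find_one({"_id": key}, session=s))                # anomaly: may come back empty
```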
|
| Comments |
| Comment by Alyson Cabral (Inactive) [ 19/Sep/18 ] |
|
This ticket specifically points out an anomaly when using write concern values less than 'majority'. Write concern, or write acknowledgment, tells the server how long the client is willing to wait to know the definitive result of that write (i.e. whether the write committed or not). For example, write concern 1 means the write returns success once it has been applied to the primary. Only a successful write with write concern majority is committed and guarantees durability across system failures.

Let’s consider the behavior of write concern (WC) during a network partition, when write concern is 1 (WC:1). In this example, the causal sequence of operations is a write W1 followed by a causally dependent read R1. If the write W1 uses write concern 1 and the read R1 uses read concern majority, the diagram below represents which timeline each operation can successfully execute on. The write W1 with write concern 1 may return successfully when applied to the timeline of either P1 (the primary partitioned away from a majority of nodes) or P2 (the primary on the majority side). If we consider the case where the write was applied on P1, the client gets a success message even though the write may ultimately be rolled back.

The causal read R1 with read concern majority will wait until the time T1 (or later) becomes majority committed before returning success. P1 is unable to advance its notion of the majority commit point since it is partitioned from a majority of nodes, so any successful read R1 must have executed on the true primary’s timeline and will see the definitive result of the write. In this case, the definitive result may be that the write did not commit. If R1 sees that the write W1 did not commit, the write will never be committed. Even with write concerns less than majority, the causal ordering of committed writes is maintained; however, durability is not guaranteed.

To learn more about the effects of different combinations of read and write concern on causal guarantees, we've created a detailed explanation of this behavior in our documentation. No anomalies exist when using the combination of read concern majority and write concern majority. |
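To make the distinction concrete, here is a minimal sketch, assuming a PyMongo client (driver, connection string, and names are illustrative and not part of this ticket): with write concern majority and read concern majority in a causally consistent session, an acknowledged write cannot be rolled back, and a subsequent read in the session observes it.

```python
from pymongo import MongoClient
from pymongo.read_concern import ReadConcern
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://mongos.example:27017")
coll = client["test"]["registers"].with_options(
    read_concern=ReadConcern("majority"),
    write_concern=WriteConcern(w="majority"),
)

with client.start_session(causal_consistency=True) as s:
    # W1: acknowledged only once replicated to a majority, so a successful
    # return means the write cannot later be rolled back.
    coll.update_one({"_id": 1}, {"$set": {"v": 1}}, upsert=True, session=s)

    # R1: waits until the session's afterClusterTime is majority committed;
    # a stale, minority-partitioned primary cannot serve this read.
    doc = coll.find_one({"_id": 1}, session=s)
    assert doc and doc["v"] == 1   # read-your-writes holds even under partitions
```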
| Comment by Kit Patella [ 06/Jun/18 ] |
|
I've run the causal register test on 4.0-rc1 with various write and read levels, and there's no apparent change in behavior between 3.6 and 4.0. That is, operations with sub-majority writes may not be observed in future dependent reads. In tests with w: majority, I have not found any anomalous histories. If such anomalies are possible, as suggested, then the current causal register test may not be strong enough to find them, and it may be worth testing under future work.

As things stand now, sub-majority writes can violate the base case of causality in the presence of network partitions. It seems insufficient to document in https://docs.mongodb.com/manual/core/read-isolation-consistency-recency/ that any acknowledged write in a CC-enabled session will be causally consistent. Rather, a write must be acknowledged by a majority of nodes for it to be safe. Right now we have dependent writes failing to appear in time for the read, so we're serving up stale data despite depending on a successful write in that session. If a read's dependent writes are not available, you have to block until they are. Alternatively, you have the option to cache written values in the client so they can be served up immediately for dependent reads. |
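A minimal sketch of the client-side caching idea mentioned above, assuming a PyMongo collection and a causally consistent session; the CachingSession wrapper and its methods are hypothetical and not part of any driver API:

```python
# Hypothetical illustration only: remember values written in this session so
# dependent reads can be served immediately, instead of waiting on the server.
class CachingSession:
    def __init__(self, coll, session):
        self.coll = coll          # a pymongo Collection
        self.session = session    # a causally consistent ClientSession
        self._cache = {}          # key -> last value written in this session

    def write(self, key, value):
        self.coll.update_one({"_id": key}, {"$set": {"v": value}},
                             upsert=True, session=self.session)
        self._cache[key] = value  # cache only after the write is acknowledged

    def read(self, key):
        if key in self._cache:    # serve our own dependent writes immediately
            return self._cache[key]
        doc = self.coll.find_one({"_id": key}, session=self.session)
        return doc["v"] if doc else None
```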
| Comment by Misha Tyulenev [ 31/May/18 ] |
|
cristopher.stauffer yes to both questions. The results look consistent with the design.
w:majority and level:majority will give all four causal consistency properties (read your writes, monotonic reads, monotonic writes, and writes follow reads). |
| Comment by Cristopher Stauffer [ 31/May/18 ] |
|
On behalf of user: |
| Comment by Cristopher Stauffer [ 31/May/18 ] |
|
Two questions:

1. A local read can return data that was rolled back during the partition, since it was a w:1 write. As such, there can be missing values that still satisfy the afterClusterTime. Can you confirm this is the case, and that to make this behavior disappear the writes have to use w:majority or read concern level: majority?

2. Can you send the test design? Specifically, to be sure how the operationTime is connected to the afterClusterTime in a session. |
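For context on the second question, a minimal sketch (PyMongo, illustrative only, not the Jepsen harness) of how operationTime relates to afterClusterTime: in a causally consistent session the driver records the operationTime of each acknowledged operation and attaches it as afterClusterTime on subsequent operations; it can also be carried into another session explicitly with advance_operation_time.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://mongos.example:27017")
coll = client["test"]["registers"]

with client.start_session(causal_consistency=True) as s1:
    coll.update_one({"_id": 1}, {"$set": {"v": 1}}, upsert=True, session=s1)
    op_time = s1.operation_time   # operationTime of the acknowledged write
                                  # (roughly what the test records as :position)

    # Within s1, the driver automatically sends this operationTime as
    # afterClusterTime on the next operation in the session.
    coll.find_one({"_id": 1}, session=s1)

# The same time can be carried into another session explicitly, so its reads
# also wait for (at least) that cluster time.
with client.start_session(causal_consistency=True) as s2:
    s2.advance_operation_time(op_time)
    coll.find_one({"_id": 1}, session=s2)
```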