[SERVER-35316] Causal Consistency Violation for local read/write concerns in Jepsen Sharded Cluster Test Created: 31/May/18  Updated: 27/Oct/23  Resolved: 20/Sep/18

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Cristopher Stauffer Assignee: Alyson Cabral (Inactive)
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: diagram1.png (PNG), diagram2.png (PNG)
Issue Links:
Depends
is depended on by DOCS-11866 Document Causal Consistency behavior ... Closed
Operating System: ALL
Participants:
Linked BF Score: 0

 Description   

Original Problem:

 

When running a test against 2 replica set shards, with rc “local” and wc “w1”, we get reads returning the base value of the document, nil, despite occurring after acknowledged writes in the session. Each single-threaded client writes to one key at a time, using one session, against a single mongos router, and does not have writes to secondaries enabled. The nemesis partitions the network into random halves for 10 seconds, with a 10-second wait in between (this failure has not appeared with partitions disabled).

The expected pattern of operations in this test is read nil/0, write 1, read 1, write 2, read 2. In the test histories I’ve attached below, the :position field is the op’s optime value, read from the session after acknowledgement. Its :link field is the previous optime value the client has seen in that key’s session.

In the first set of results (rwr-initial-read-1), this occurs in 3 keys over a 40-second test. In each failing key (15, 23, 16), we see a read of 0 (representing the initial empty document, which the checker records as a nil read) and a successful write of 1. Then the read following write 2 returns an empty document.

The second result set (rwr-initial-read-2) is a longer test over 300 seconds, in which we observe this anomaly 6 times. Keys 50, 51, 82, 125, and 143 appear to drop the value on write 2; key 116, however, is missing the value for write 1. Also of note: in the history for key 116 (under independent/116), write 2 succeeds and appears in the final read for the key. See op `{:type :ok, :f :read, :value 2, :process 5, :time 205602064887, :position 6559250071353819138, :link 6559250071353819137, :index 1206}`.
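For reference, here is a minimal sketch of the per-key client loop the test approximates, written against the MongoDB Java driver. The connection string, database, collection, and key are illustrative, not the actual Jepsen client code.

```java
import com.mongodb.ClientSessionOptions;
import com.mongodb.ReadConcern;
import com.mongodb.WriteConcern;
import com.mongodb.client.ClientSession;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.UpdateOptions;
import org.bson.Document;

import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Updates.set;

public class CausalRegisterSketch {
    public static void main(String[] args) {
        // A single mongos router, as in the test setup (host name is illustrative).
        try (MongoClient client = MongoClients.create("mongodb://mongos-host:27017")) {
            MongoCollection<Document> registers = client.getDatabase("jepsen")
                    .getCollection("registers")
                    .withWriteConcern(WriteConcern.W1)     // wc "w1"
                    .withReadConcern(ReadConcern.LOCAL);   // rc "local"

            // One causally consistent session per key.
            ClientSessionOptions opts = ClientSessionOptions.builder()
                    .causallyConsistent(true)
                    .build();
            try (ClientSession session = client.startSession(opts)) {
                int key = 15;
                UpdateOptions upsert = new UpdateOptions().upsert(true);

                // Expected pattern: read nil/0, write 1, read 1, write 2, read 2.
                Document r0 = registers.find(session, eq("_id", key)).first();
                registers.updateOne(session, eq("_id", key), set("value", 1), upsert);
                Document r1 = registers.find(session, eq("_id", key)).first();
                registers.updateOne(session, eq("_id", key), set("value", 2), upsert);
                Document r2 = registers.find(session, eq("_id", key)).first();
                System.out.println(r0 + " -> " + r1 + " -> " + r2);
            }
        }
    }
}
```

Under rc local and wc w1, the anomaly described above would show up here as r1 or r2 coming back empty even though the preceding write in the same session was acknowledged.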

 

 



 Comments   
Comment by Alyson Cabral (Inactive) [ 19/Sep/18 ]

We've created DOCS-11866 and updated the documentation to recommend read and write concern configurations that are always safe, even during network partitions. When used with causal consistency, the combination of read concern majority and write concern majority always safely delivers causal guarantees.

This ticket specifically points out an anomaly that arises when using write concern values weaker than 'majority'.

Write concern, or write acknowledgment, tells the server how long the client is willing to wait to know the definitive result of that write (i.e., whether the write committed or not). Write concern options are:

1 – the write returns success once it has been applied on the primary
N – the write returns success once it has been applied on N nodes
majority – the write returns success once it has been applied on a majority of nodes

Only a successful write with write concern majority is committed and guaranteed to be durable across system failures.
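In the Java driver, for example, these options map onto `WriteConcern` values; a hedged sketch (the collection handle and names are illustrative):

```java
import com.mongodb.WriteConcern;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

class WriteConcernOptions {
    static void configure(MongoCollection<Document> coll) {
        // 1: acknowledged once applied on the primary.
        MongoCollection<Document> w1 = coll.withWriteConcern(WriteConcern.W1);

        // N: acknowledged once applied on N nodes (here N = 2).
        MongoCollection<Document> w2 = coll.withWriteConcern(new WriteConcern(2));

        // majority: acknowledged once applied on a majority of nodes;
        // only this level survives failover and rollback.
        MongoCollection<Document> wMajority = coll.withWriteConcern(WriteConcern.MAJORITY);
    }
}
```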

Let’s consider the behavior of write concern (WC) during a network partition, when write concern is 1 (WC:1).

In this example, the causal sequence of operations is as follows:
At Time T1 perform a write W1
At Time T2 perform a read R1

In this diagram, P1 has been partitioned from a majority of nodes and P2 has been elected as the new primary. However, P1 does not yet know it isn’t primary and may continue to accept writes. Once P1 is reconnected to a majority of nodes, all of its writes made since the timelines diverged will be rolled back.

If the write W1 uses write concern 1, and the read R1 uses read concern majority, the diagram below shows which timeline each operation can successfully execute on.

The write W1 with write concern 1 may return successfully when applied to the timeline of either P1 or P2. If we consider the case where the write was applied on P1, this means the client will get a success message even though the write may ultimately be rolled back.

The causal read R1 with read concern majority will wait until the time T1 (or later) becomes majority committed before returning success. P1 is unable to advance its notion of the majority commit point since it is partitioned from a majority of nodes, so any successful read R1 must have executed on the true primary’s timeline and will see the definitive result of the write. In this case, the definitive result of the write may be that the write did not commit. If R1 sees that the write W1 did not commit, this means that the write will never be committed.
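As a concrete sketch of this sequence using the Java driver (identifiers are illustrative), the scenario corresponds to a w:1 write followed by a majority read in the same causally consistent session:

```java
import com.mongodb.ReadConcern;
import com.mongodb.WriteConcern;
import com.mongodb.client.ClientSession;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Updates.set;

class PartitionScenarioSketch {
    static void run(ClientSession session, MongoCollection<Document> coll) {
        // T1: write W1 with write concern 1. A stale primary (P1) may
        // acknowledge it even though the write is later rolled back.
        coll.withWriteConcern(WriteConcern.W1)
            .updateOne(session, eq("_id", 1), set("value", 1));

        // T2: causal read R1 with read concern majority. It can only succeed
        // once the session's time T1 (or later) is majority committed, i.e.
        // on the true primary's timeline, where W1 may never have committed.
        Document r1 = coll.withReadConcern(ReadConcern.MAJORITY)
                          .find(session, eq("_id", 1))
                          .first();
    }
}
```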

Even with write concerns less than majority, the causal ordering of the committed writes is maintained. However, durability is not guaranteed.

To learn more about the effects of different combinations of read and write concern on causal guarantees, we've created a detailed explanation of behavior in our documentation. No anomalies exist when using the combination of read concern majority and write concern majority.

Comment by Kit Patella [ 06/Jun/18 ]

I've run the causal register test on 4.0-rc1 with various write and read levels, and there's no apparent change in behavior between 3.6 and 4.0. That is, operations with sub-majority writes may not be observed in future dependent reads.

In tests with w: majority, I have not found any anomalous histories. If such histories are possible, as suggested, then the current causal register test may not be strong enough to find them, and this may be worth testing in future work.

As things stand now, sub-majority writes can violate the base case of causality in the presence of network partitions. It seems insufficient to document in https://docs.mongodb.com/manual/core/read-isolation-consistency-recency/ that any acknowledged write in a CC-enabled session will be causally consistent; rather, a write must be acknowledged by a majority of nodes for it to be safe. As it is, dependent writes fail to appear in time for the read, so we serve up stale data despite depending on a successful write in that session. If a read's dependent writes are not available, the read has to block until they are. Alternatively, written values can be cached in the client so they can be served immediately for dependent reads, as sketched below.
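A minimal sketch of that last option (purely illustrative, not part of any driver): the client remembers the most recent value it wrote for each key and serves it for dependent reads instead of returning a stale server value.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical client-side cache of written values, keyed by register id.
class WrittenValueCache {
    private final Map<Integer, Integer> lastWritten = new ConcurrentHashMap<>();

    // Record the value after the write is acknowledged.
    void recordWrite(int key, int value) {
        lastWritten.put(key, value);
    }

    // Prefer the cached write when the server-side read is missing or older
    // than what this session already wrote.
    Integer read(int key, Integer serverValue) {
        Integer cached = lastWritten.get(key);
        if (cached != null && (serverValue == null || serverValue < cached)) {
            return cached;
        }
        return serverValue;
    }
}
```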

Comment by Misha Tyulenev [ 31/May/18 ]

cristopher.stauffer yes to both questions. The results look consistent with the design.
Here are some ideas on how it can be investigated further:
Using w:1 or level:local will result in different anomalies, as some data can disappear.
It will be interesting to learn, from the client perspective, which of the causal properties each anomaly breaks:

w:majority and level:majority will give all four properties.
Relaxing the read or write concern introduces a compromise for the user, e.g. w:majority with level:local will likely preserve monotonic writes and read-your-writes but will break the other two properties.

Comment by Cristopher Stauffer [ 31/May/18 ]

On behalf of user:
 
I've tested with w: w1, and level: majority, and while the behavior is distinct, we still appear to be missing writes. Rather than reading out an empty document, we observe the latest acknowledged write missing (See results.edn in the attached test case).
 
The behavior has indeed disappeared when running with w: majority, r: local. I'll keep digging here.
 
Regarding opTime and afterClusterTime in the session, I am depending on the Java driver's behavior. I start a session at the beginning of each key, pass the session to each invocation for that key, and write the session's optime value to the test's history after each successful op.
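For illustration, a sketch (with hypothetical identifiers) of reading the session's operationTime through the Java driver after each acknowledged op; this is the value recorded as :position in the histories:

```java
import com.mongodb.client.ClientSession;
import com.mongodb.client.MongoCollection;
import org.bson.BsonTimestamp;
import org.bson.Document;

import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Updates.set;

class OpTimeLogging {
    static void writeAndLog(ClientSession session, MongoCollection<Document> coll,
                            int key, int value) {
        coll.updateOne(session, eq("_id", key), set("value", value));

        // After an acknowledged op the driver has advanced the session's
        // operationTime; its raw long is what the history stores as :position
        // (the previously observed value becomes :link).
        BsonTimestamp opTime = session.getOperationTime();
        System.out.println("key=" + key + " position=" + opTime.getValue());
    }
}
```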

Comment by Cristopher Stauffer [ 31/May/18 ]

Two Questions:

The local read can return data that was rolled back on partition, since it was a w1 write. As such, there should be missing values that still satisfy the afterClusterTime. Please confirm this is the case? To make this behavior disappear, the data has to be w:majority or level:majority.

Can you send the test design? Specifically, to be sure how the operationTime is connected to the afterClusterTime in a session.
