[SERVER-41769] The committedSnapshot should not be greater than allCommitted when EMRC=false Created: 14/Jun/19 Updated: 29/Oct/23 Resolved: 16/Jul/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 4.3.1 |
| Fix Version/s: | 4.2.0-rc3, 4.3.1 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Jason Chan | Assignee: | Jason Chan |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||
| Operating System: | ALL | ||||||||||||||||
| Backport Requested: |
v4.2
|
||||||||||||||||
| Sprint: | Repl 2019-07-01, Repl 2019-07-15, Repl 2019-07-29 | ||||||||||||||||
| Participants: | |||||||||||||||||
| Linked BF Score: | 20 | ||||||||||||||||
| Description |
|
Currently we start a transaction that performs modifications to documents with snapshot read concern. With snapshot read concern, we set the readSource to be the kAllCommittedSnapshot. This can cause build failures in the case where in our evergreen runs, an HMAC key insertion right before we prepopulate our test collection. In this case, since we prepopulate the collection while the WT transaction for the HMAC key insertion is not yet committed, the documents we insert into the test collection never make it into the all committed snapshot. This will cause inconsistencies in our test results. We should pin the committedSnapshot to always be <= allCommitted and also audit all uses of transactions with snapshot read concern to make sure any previous writes outside the transaction that are depended on by the transaction should be using majority write concern. |
| Comments |
| Comment by Githook User [ 17/Jul/19 ] |
|
Author: {'name': 'Jason Chan', 'username': 'jasonjhchan', 'email': 'jason.chan@10gen.com'}Message: (cherry picked from commit 5d376f806460dee5a121483e33c0eeeb8c3791d7) |
| Comment by Githook User [ 17/Jul/19 ] |
|
Author: {'name': 'Jason Chan', 'email': 'jason.chan@10gen.com', 'username': 'jasonjhchan'}Message: (cherry picked from commit 795b99ef36ea9448c862db6fd56ff6321bf66c9e) |
| Comment by Githook User [ 16/Jul/19 ] |
|
Author: {'name': 'Jason Chan', 'username': 'jasonjhchan', 'email': 'jason.chan@10gen.com'}Message: |
| Comment by Githook User [ 16/Jul/19 ] |
|
Author: {'name': 'Jason Chan', 'username': 'jasonjhchan', 'email': 'jason.chan@10gen.com'}Message: |
| Comment by Jason Chan [ 27/Jun/19 ] |
|
Yes, I've changed the ticket title and description. I will audit the usages of transactions with snapshot read concern to make sure any writes that are depended on by the transaction will be using majority write concern. |
| Comment by Judah Schvimer [ 27/Jun/19 ] |
|
That makes sense. Should we also check that we are using w:majority everywhere we need to be? |
| Comment by Jason Chan [ 27/Jun/19 ] |
|
judah.schvimer Some alternative solutions that I didn't think were as good: |
| Comment by Judah Schvimer [ 27/Jun/19 ] |
|
jason.chan, what do you propose to do to fix this ticket? |
| Comment by Jason Chan [ 27/Jun/19 ] |
|
I created a quick repro of the bug here (should only be run when enableMajorityReadConcern is false). The issue seems to be as described. We set the committed snapshot to be the lastCommittedOpTime. The lastCommittedOpTime gets set to the durableOpTime which ends up being the same as the lastApplied. This allows the insert with majority write concern to complete successfully but the transaction will continue to read at the allCommittedTimestamp with snapshot read concern. |
| Comment by William Schultz (Inactive) [ 27/Jun/19 ] |
|
Yes, forcing all tests that want RYOW behavior to contain their ops within a single session might be additionally tricky to enforce across our base of tests. |
| Comment by Judah Schvimer [ 27/Jun/19 ] |
|
Causal consistency would not be sufficient since many tests expect sessions to read writes done outside of their session. Causal is a session property. |
| Comment by William Schultz (Inactive) [ 27/Jun/19 ] |
|
Also, to Judah's above point, using causal consistency in our tests certainly would be a valid alternative to guarantee read-your-writes behavior for transactions, but we feared that this would be harder for people to remember when writing new tests. Arguably, any test that truly wants read-your-writes behavior should be using causal consistency, but we don't strictly adhere to that in practice. |
| Comment by William Schultz (Inactive) [ 27/Jun/19 ] |
|
jason.chan is going to investigate this a bit more to come up with a concrete repro of the theory above, but after talking with him a bit more we came to a few conclusions on this issue:
|
| Comment by William Schultz (Inactive) [ 24/Jun/19 ] |
|
There may be an issue here that is specific to enableMajorityReadConcern=false. When EMRC=false, we set the committed snapshot to the lastCommittedOpTime, instead of the stable timestamp. This would mean that, on 1 node replica sets, it is not restricted to being behind the allCommitted, and so if we advance the commit point immediately to lastApplied on 1 node replica sets, then a w:majority write would not wait for the allCommitted to move beyond the optime of the write. With EMRC=true, though, if the committed snapshot is restrained to be behind the allCommitted, then waiting for majority write concern would also wait for the allCommitted to reach the optime of the write. Presumably we could write a simple test to demonstrate this behavior. If that is the case, then we may care more about making the semantics of transactions consistent between both settings of the enableMajorityReadConcern parameter. |
| Comment by Judah Schvimer [ 21/Jun/19 ] |
|
We currently have a bit of a problem that our transaction testing of different read concerns is fairly arbitrary. I don't want to accidentally lose snapshot read concern coverage by removing all occurrences of it. Using causal consistency on the given session is a more canonical way of getting read your writes. |
| Comment by Jason Chan [ 20/Jun/19 ] |
|
The same issue happened in BF-13486 with a different jstest. We should look into a solution to the general problem rather than fixing these tests on a case-by-case basis. |