[SERVER-44638] j:false is ignored in startTransaction write concern Created: 14/Nov/19 Updated: 06/Dec/22 Resolved: 07/May/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication, WiredTiger |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Mark Callaghan | Assignee: | Backlog - Storage Execution Team |
| Resolution: | Done | Votes: | 0 |
| Labels: | perfteam-wishlist | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
|||||||||||||||||||||||||||||||||||||||||||||||
| Assigned Teams: |
Storage Execution
|
|||||||||||||||||||||||||||||||||||||||||||||||
| Operating System: | ALL | |||||||||||||||||||||||||||||||||||||||||||||||
| Backport Requested: |
v4.2, v4.0
|
|||||||||||||||||||||||||||||||||||||||||||||||
| Steps To Reproduce: | To observe fdatasync per commit, run "strace -c -f -p $( pidof mongod )" in background and then run...
To compare insert performance with/without transactions for both values of j run...
|
|||||||||||||||||||||||||||||||||||||||||||||||
| Sprint: | Repl 2019-12-02 | |||||||||||||||||||||||||||||||||||||||||||||||
| Participants: | ||||||||||||||||||||||||||||||||||||||||||||||||
| Description |
|
Single replica set transactions should respect the writeConcern option j:false. Sharded transactions will be handled in Original description: There is a call to fdatasync per transaction with j:false in the startTransaction write concern. I prefer to have an option to do transactions without fsync/fdatasync per commit. The current behavior isn't documented, and it doesn't seem right to accept j:false in the write concern but then ignore it. I have been testing this with a single-node replica set and w:1 and what I see is:
Docs for startTransaction don't explain why fsync on commit would always be done – https://docs.mongodb.com/manual/reference/method/Session.startTransaction/ Docs for writeConcern also don't explain this - https://docs.mongodb.com/manual/reference/write-concern/ |
| Comments |
| Comment by Connie Chen [ 07/May/20 ] | |||||||||||||||||||||
|
This was fixed in 4.4, we decided not to backport to 4.0 and 4.2 | |||||||||||||||||||||
| Comment by Daniel Gottlieb (Inactive) [ 06/May/20 ] | |||||||||||||||||||||
|
Moving this to the execution team to reconsider backporting replicate before journaling. | |||||||||||||||||||||
| Comment by Mark Callaghan (Inactive) [ 29/Apr/20 ] | |||||||||||||||||||||
|
Thank you for debugging and explaining this. I repeated one of my benchmarks using two configurations:
The benchmark does batch writes with 100 docs inserted per write. I confirmed that journalCommitInterval is 100ms for both versions of MongoDB.
So setting it does what I want, even if there are more fsyncs/s with 4.2 than 4.4 (16.0 vs 9.6)
| |||||||||||||||||||||
| Comment by Daniel Gottlieb (Inactive) [ 29/Apr/20 ] | |||||||||||||||||||||
|
My look at 4.4 shows that the replicate before journaling project has corrected the undesirable behavior of a session using causalConsistency: true causing unnecessary fsyncs. | |||||||||||||||||||||
| Comment by Daniel Gottlieb (Inactive) [ 29/Apr/20 ] | |||||||||||||||||||||
|
mark.callaghan, regarding 4.2, if you change the call to startTransaction to include causalConsistency: false:
do you still see `j: false` causing fsyncs? From an investigation on a related ticket, I was able to reproduce your findings about WT log_flush calls and fsync calls reaching the OS, but my interpretation is that the `j: false` write concern is obeyed. However, what I found is the following sequence is taking place:
The fsync is a subsequent request's startTransaction is effectively making the previous transaction durable. Using causalConsistency: false removes the call to waitForAllEarlierOplogWritesToBeVisible, and in my observation, removes log_flush and subsequent fsync. In 4.2, while it's necessary for a primary to journal oplog data before making it available to secondaries (for reasons), the journaling piece is not necessary in this use of waitForAllEarlierOplogWritesToBeVisible. In my understanding here, the code re-use here buys into a stronger guarantee than necessary. However in 4.4 with the replicate before journaling project, we removed the need to journal oplog data before making it available to secondaries. This ought to reflect in waitForAllEarlierOplogWritesToBeVisible never needing to fsync data. I will rerun the related reproducer on v4.4 with causalConsistency: true (the default) to verify that we see similar journal flushes/fsyncs independent of the causalConsistency setting. | |||||||||||||||||||||
| Comment by Tess Avitabile (Inactive) [ 06/Mar/20 ] | |||||||||||||||||||||
|
That's correct. If there is no two-phase commit required, we should respect j:false. | |||||||||||||||||||||
| Comment by Mark Callaghan (Inactive) [ 06/Mar/20 ] | |||||||||||||||||||||
|
tess.avitabile - does this mean that a single-replica transaction (reads & writes done to only one shard) in a sharded cluster will respect j:false? | |||||||||||||||||||||
| Comment by Tess Avitabile (Inactive) [ 11/Dec/19 ] | |||||||||||||||||||||
I do not intend to document that we should respect j:false for single replica set transactions, since this is a bug fix. I marked
No, there is no feedback from MongoDB. We do ban writeConcernMajorityJournalDefault:false with sharded transactions, and we document this here. However, I think the documentation should also make clear that regardless of the user's requested writeConcern, a sharded transaction must perform some component writes with writeConcern {w: "majority", j: true}. I filed a documentation ticket for this change. | |||||||||||||||||||||
| Comment by Mark Callaghan [ 11/Dec/19 ] | |||||||||||||||||||||
|
That will work for me as my primary use case is storage engine performance testing and sharding isn't needed for that. Two questions:
| |||||||||||||||||||||
| Comment by Tess Avitabile (Inactive) [ 09/Dec/19 ] | |||||||||||||||||||||
|
mdcallag@gmail.com, I wanted to check in about my above response. If you have no objections, I'll change this ticket to specify that we should respect j:false for single replica set transactions. I'll change | |||||||||||||||||||||
| Comment by Tess Avitabile (Inactive) [ 27/Nov/19 ] | |||||||||||||||||||||
|
mdcallag@gmail.com, I've discussed this with judah.schvimer and the Replication team, and we agree that it's a bug to ignore j:false. This is our proposed path forward: For a single replica set transaction, respect j:false when writing the data changes and oplog entry. For a sharded transaction, we require that all writes are performed with w:majority and j:true to ensure atomicity. Our view is that the client's writeConcern applies to when the coordinator can return after writing the decision. Even though the decision must be written with w:majority and j:true before it can be sent to the participants, it's possible to return to the client as soon as the decision is written with the client's requested writeConcern. The savings will be limited because the client still has to wait for earlier writes to be written with w:majority and j:true, such as the prepareTransaction writes. This is consistent with the approach we laid out for handling weaker writeConcern values as described in How does this approach sound to you? | |||||||||||||||||||||
| Comment by Carl Champain (Inactive) [ 20/Nov/19 ] | |||||||||||||||||||||
|
Thanks for the report. Kind regards, | |||||||||||||||||||||
| Comment by Mark Callaghan [ 14/Nov/19 ] | |||||||||||||||||||||
|
I was using Mongo 4.2.1 with this mongo.conf
And then ran rs.initiate() to get single-node replica set
|