[SERVER-44638] j:false is ignored in startTransaction write concern Created: 14/Nov/19  Updated: 06/Dec/22  Resolved: 07/May/20

Status: Closed
Project: Core Server
Component/s: Replication, WiredTiger
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Mark Callaghan Assignee: Backlog - Storage Execution Team
Resolution: Done Votes: 0
Labels: perfteam-wishlist
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Related
related to DOCS-13288 Document that sharded transactions al... Closed
Assigned Teams:
Storage Execution
Operating System: ALL
Backport Requested:
v4.2, v4.0
Steps To Reproduce:

To observe fdatasync per commit, run "strace -c -f -p $( pidof mongod )" in background and then run...

 

session = db.getMongo().startSession( { readPreference: { mode: "primary" } } );
col = session.getDatabase("test").foo;
// Confirm collection exists to avoid error from transactional DDL below
col.insertOne( { abc: 1 } );
var i;
var ninsert=10000;
var jvals = [false];
function time_it(jval) {
 var start_msecs = Date.now()
 for (i = 0; i < ninsert; i++) {
 session.startTransaction( { writeConcern: { w: 1, j: jval} } );
 col.insertOne( { abc: 1 } );
 session.commitTransaction();
 }
 var end_msecs = Date.now()
 var tot_secs = (end_msecs - start_msecs) / 1000.0;
 print("With trx:", jval, tot_secs, ninsert, ninsert/tot_secs);
}
jvals.forEach(time_it);
session.endSession();

 

To compare insert performance with/without transactions for both values of j run...

 

session = db.getMongo().startSession( { readPreference: { mode: "primary" } } );
col = session.getDatabase("test").foo;
// Confirm collection exists to avoid error from transactional DDL below
col.insertOne( { abc: 1 } );
var i;
var ninsert=5000;
var jvals = [true, false];
function time_it(jval) {
 var start_msecs = Date.now()
 for (i = 0; i < ninsert; i++) {
 session.startTransaction( { writeConcern: { w: 1, j: jval} } );
 col.insertOne( { abc: 1 } );
 session.commitTransaction();
 }
 var end_msecs = Date.now()
 var tot_secs = (end_msecs - start_msecs) / 1000.0;
 print("With trx:", jval, tot_secs, ninsert, ninsert/tot_secs);
start_msecs = Date.now()
 for (i = 0; i < ninsert; i++) {
 col.insertOne( { abc: 1 }, { w: 1, j: jval} );
 }
 end_msecs = Date.now()
 tot_secs = (end_msecs - start_msecs) / 1000.0;
 print("Without trx:", jval, tot_secs, ninsert, ninsert/tot_secs);
}
jvals.forEach(time_it);
session.endSession();

 

Sprint: Repl 2019-12-02
Participants:

 Description   

Single replica set transactions should respect the writeConcern option j:false

Sharded transactions will be handled in SERVER-37364, where we specify that commitTransaction can return as soon as the decision is written with the client's requested writeConcern. Some component writes must still be performed with {w:"majority", j: true}.

Original description:

There is a call to fdatasync per transaction with j:false in the startTransaction write concern. I prefer to have an option to do transactions without fsync/fdatasync per commit. The current behavior isn't documented, and it doesn't seem right to accept j:false in the write concern but then ignore it.

I have been testing this with a single-node replica set and w:1 and what I see is:

  • insert rate without transactions is better with j:false than j:true as expected
  • insert rate with transactions is same for j:false vs j:true, this isn't expected
  • fdatasync/commit is done with transactions regardless of value for j

Docs for startTransaction don't explain why fsync on commit would always be done – https://docs.mongodb.com/manual/reference/method/Session.startTransaction/

Docs for writeConcern also don't explain this - https://docs.mongodb.com/manual/reference/write-concern/



 Comments   
Comment by Connie Chen [ 07/May/20 ]

This was fixed in 4.4, we decided not to backport to 4.0 and 4.2

Comment by Daniel Gottlieb (Inactive) [ 06/May/20 ]

Moving this to the execution team to reconsider backporting replicate before journaling.

Comment by Mark Callaghan (Inactive) [ 29/Apr/20 ]

Thank you for debugging and explaining this. I repeated one of my benchmarks using two configurations:

  • transactions enabled, start_session(causal_consistency=False)
  • transactions enabled, start_session()

The benchmark does batch writes with 100 docs inserted per write. I confirmed that journalCommitInterval is 100ms for both versions of MongoDB.

  • 4.4.0-rc3 does 9.6 fsyncs/s which is expected given journalCommitInterval=100ms
  • 4.2.5 does 16.0 fsyncs/s with causal_consistency=false and 73.6/s when the causal arg is not set

So setting it does what I want, even if there are more fsyncs/s with 4.2 than 4.4 (16.0 vs 9.6)

425.f = 4.2.5 with causal_consistency=false
425.def = 4.2.5 without setting causal_consistency in start_session() call
440.f = 4.4.0-rc3 with causal_consistency=false
444.def = 4.4.0-rc3 without setting causal_consistency in start_session() call
 
                425.f   425.def 440.f   440.def
write/s         144.9   78.7    148.2   128.2
insert/s        14493   7874    14815   12821
fsync/s         16.0    73.6    9.6     9.6

Comment by Daniel Gottlieb (Inactive) [ 29/Apr/20 ]

My look at 4.4 shows that the replicate before journaling project has corrected the undesirable behavior of a session using causalConsistency: true causing unnecessary fsyncs.

Comment by Daniel Gottlieb (Inactive) [ 29/Apr/20 ]

mark.callaghan, regarding 4.2, if you change the call to startTransaction to include causalConsistency: false:

 session.startTransaction( { writeConcern: { w: 1, j: jval}, causalConsistency: false } );

do you still see `j: false` causing fsyncs?

From an investigation on a related ticket, I was able to reproduce your findings about WT log_flush calls and fsync calls reaching the OS, but my interpretation is that the `j: false` write concern is obeyed. However, what I found is the following sequence is taking place:

| Transaction Writer                                                  | OplogManager                                    |
|---------------------------------------------------------------------+-------------------------------------------------|
|                                                                     | Top of oplog = visibility = 0                   |
| StartTxn                                                            |                                                 |
| waitForAllEarlierOplogWritesToBeVisible (no-op)                     |                                                 |
| Write A                                                             |                                                 |
| CommitTimestamp 10                                                  |                                                 |
| Commit                                                              | Top of oplog = 10, visibility = 0               |
| Trigger journal flush                                               |                                                 |
| (sends response)                                                    |                                                 |
|                                                                     | Wakes up, waits for 100ms to allow piggybackers |
| (new causal request)                                                |                                                 |
| waitForAllEarlierOplogWritesToBeVisible (visibility < top of oplog) |                                                 |
|                                                                     | breaks loop to accommodate refresh request      |
|                                                                     | log_flush + fsync                               |
|                                                                     | top of oplog = visibility = 10                  |
| StartTxn                                                            |                                                 |
| Write B                                                             |                                                 |
| Commit Timestamp 20                                                 |                                                 |
| Commit                                                              |                                                 |

The fsync is a subsequent request's startTransaction is effectively making the previous transaction durable.

Using causalConsistency: false removes the call to waitForAllEarlierOplogWritesToBeVisible, and in my observation, removes log_flush and subsequent fsync.

In 4.2, while it's necessary for a primary to journal oplog data before making it available to secondaries (for reasons), the journaling piece is not necessary in this use of waitForAllEarlierOplogWritesToBeVisible. In my understanding here, the code re-use here buys into a stronger guarantee than necessary.

However in 4.4 with the replicate before journaling project, we removed the need to journal oplog data before making it available to secondaries. This ought to reflect in waitForAllEarlierOplogWritesToBeVisible never needing to fsync data. I will rerun the related reproducer on v4.4 with causalConsistency: true (the default) to verify that we see similar journal flushes/fsyncs independent of the causalConsistency setting.

Comment by Tess Avitabile (Inactive) [ 06/Mar/20 ]

That's correct. If there is no two-phase commit required, we should respect j:false.

Comment by Mark Callaghan (Inactive) [ 06/Mar/20 ]

tess.avitabile - does this mean that a single-replica transaction (reads & writes done to only one shard) in a sharded cluster will respect j:false?

Comment by Tess Avitabile (Inactive) [ 11/Dec/19 ]
  • will new behavior be documented?

I do not intend to document that we should respect j:false for single replica set transactions, since this is a bug fix. I marked SERVER-37364 as needing documentation changes.

  • is there any feedback from MongoDB when it enforces stronger write concern than requested by a client operation? In this case I used strace to figure out what was going on, but others might not have access to do that.

No, there is no feedback from MongoDB. We do ban writeConcernMajorityJournalDefault:false with sharded transactions, and we document this here. However, I think the documentation should also make clear that regardless of the user's requested writeConcern, a sharded transaction must perform some component writes with writeConcern {w: "majority", j: true}. I filed a documentation ticket for this change.

Comment by Mark Callaghan [ 11/Dec/19 ]

That will work for me as my primary use case is storage engine performance testing and sharding isn't needed for that. Two questions:

  • will new behavior be documented?
  • is there any feedback from MongoDB when it enforces stronger write concern than requested by a client operation? In this case I used strace to figure out what was going on, but others might not have access to do that.
Comment by Tess Avitabile (Inactive) [ 09/Dec/19 ]

mdcallag@gmail.com, I wanted to check in about my above response. If you have no objections, I'll change this ticket to specify that we should respect j:false for single replica set transactions. I'll change SERVER-37364 to cover j:false in the sharded case.

Comment by Tess Avitabile (Inactive) [ 27/Nov/19 ]

mdcallag@gmail.com, I've discussed this with judah.schvimer and the Replication team, and we agree that it's a bug to ignore j:false. This is our proposed path forward:

For a single replica set transaction, respect j:false when writing the data changes and oplog entry.

For a sharded transaction, we require that all writes are performed with w:majority and j:true to ensure atomicity. Our view is that the client's writeConcern applies to when the coordinator can return after writing the decision. Even though the decision must be written with w:majority and j:true before it can be sent to the participants, it's possible to return to the client as soon as the decision is written with the client's requested writeConcern. The savings will be limited because the client still has to wait for earlier writes to be written with w:majority and j:true, such as the prepareTransaction writes. This is consistent with the approach we laid out for handling weaker writeConcern values as described in SERVER-37364.

How does this approach sound to you?

Comment by Carl Champain (Inactive) [ 20/Nov/19 ]

Hi mdcallag@gmail.com,

Thanks for the report.
I'm passing this ticket along to the Execution team for additional investigation.

Kind regards,
Carl

Comment by Mark Callaghan [ 14/Nov/19 ]

I was using Mongo 4.2.1 with this mongo.conf

 

processManagement:
 fork: true
systemLog:
 destination: file
 path: /home/mdcallag/d/mo421/mongo.log
 logAppend: true
storage:
 syncPeriodSecs: 60
 dbPath: /data/m/mo
 journal:
 enabled: true
operationProfiling:
 slowOpThresholdMs: 2000
replication:
 oplogSizeMB: 2000
 replSetName: foo
security:
 authorization: enabled
storage.wiredTiger.collectionConfig.blockCompressor: none
storage.wiredTiger.engineConfig.journalCompressor: none
storage.wiredTiger.engineConfig.cacheSizeGB: 4

 

And then ran rs.initiate() to get single-node replica set

 

 

Generated at Thu Feb 08 05:06:32 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.