[SERVER-36294] Sum of currentActive and currentInactive not equal to currentOpen transactions Created: 26/Jul/18  Updated: 29/Oct/23  Resolved: 11/Aug/18

Status: Closed
Project: Core Server
Component/s: Diagnostics
Affects Version/s: None
Fix Version/s: 4.0.2, 4.1.2

Type: Bug Priority: Major - P3
Reporter: Bruce Lucas (Inactive) Assignee: Nathan Louie
Resolution: Fixed Votes: 0
Labels: SWDI, global-txn-metrics
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File txn-open-active-inactive.png    
Issue Links:
Backports
Related
related to SERVER-36295 Transaction metrics not updated on Tr... Closed
related to SERVER-35793 Write concurrency test to verify serv... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.0
Sprint: Repl 2018-08-13
Participants:

 Description   

Small inaccuracies due to skew might be expected because the counter updates aren't atomic, but this discrepancy looks much larger than that, as if the transaction is spending significant time between corresponding updates to the counters.

This was a simple test with 10 threads, each doing transactional updates to its own document:

function repro() {
    // 10 client threads, each committing single-document update transactions
    // in a loop against its own document in test.c.
    var nthreads = 10

    db.createCollection("c", {writeConcern: {w: "majority"}})

    var threads = []
    for (var t = 0; t < nthreads; t++) {
        var thread = new ScopedThread((t) => {
            var ses = db.getMongo().startSession()
            ses.getDatabase("test").c.insert({_id: t, i: 0})
            // Run transactions until the shell is interrupted.
            for (;;) {
                try {
                    ses.startTransaction()
                    ses.getDatabase("test").c.update({_id: t}, {$inc: {i: 1}})
                    //if (t==0) sleep(1000)
                    ses.commitTransaction()
                } catch (e) {
                    print(e)
                }
            }
        }, t)
        threads.push(thread)
        thread.start()
    }
    for (var t = 0; t < nthreads; t++)
        threads[t].join()
}
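
The counters being compared are the currentOpen, currentActive, and currentInactive fields of the serverStatus transactions section. A minimal shell sampler along these lines can be run alongside the repro to watch the discrepancy; this is a sketch only (sampleTxnMetrics is a hypothetical helper, not the script that produced the attached chart):

function sampleTxnMetrics(seconds) {
    // Poll serverStatus once per second and compare currentOpen against
    // currentActive + currentInactive (the counters are NumberLongs and
    // coerce to doubles in this arithmetic).
    for (var i = 0; i < seconds; i++) {
        var t = db.serverStatus().transactions
        print("open=" + t.currentOpen +
              " active=" + t.currentActive +
              " inactive=" + t.currentInactive +
              " active+inactive=" + (t.currentActive + t.currentInactive))
        sleep(1000)
    }
}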



 Comments   
Comment by Githook User [ 16/Aug/18 ]

Author:

{'username': 'will62794', 'email': 'william.schultz@mongodb.com', 'name': 'William Schultz'}

Message: SERVER-36294 Mark transaction as inactive when it is first marked as open

(cherry picked from commit 95bcc2148cc7ec3ca4f54e16c4a75bbc0f806ace)
Branch: v4.0
https://github.com/mongodb/mongo/commit/15b6b95877b09d9a2c5d2f63a981692545875c3a

Comment by William Schultz (Inactive) [ 11/Aug/18 ]

Closing this ticket since the functional change was pushed. We will add additional testing for this fix as part of SERVER-35793.

Comment by Githook User [ 11/Aug/18 ]

Author:

{'username': 'will62794', 'name': 'William Schultz', 'email': 'william.schultz@mongodb.com'}

Message: SERVER-36294 Mark transaction as inactive when it is first marked as open
Branch: master
https://github.com/mongodb/mongo/commit/95bcc2148cc7ec3ca4f54e16c4a75bbc0f806ace

Comment by William Schultz (Inactive) [ 26/Jul/18 ]

As per offline discussion, we will start by moving the initial call to "open" the transaction into the first unstash event, instead of making it in the initial call to beginOrContinueTxn. This will make it impossible for a transaction to appear "open" while being neither active nor inactive.
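
A minimal shell-level sketch of the invariant this change should establish: once "open" is bumped at the first unstash, a sampled currentOpen should equal currentActive + currentInactive up to a small race window, since the counters are still not updated atomically. (checkOpenInvariant is a hypothetical helper; the real concurrency test is tracked in SERVER-35793.)

function checkOpenInvariant(samples) {
    var tolerance = 2  // allow for non-atomic counter updates
    for (var i = 0; i < samples; i++) {
        var t = db.serverStatus().transactions
        var diff = t.currentOpen - (t.currentActive + t.currentInactive)
        assert.lte(Math.abs(diff), tolerance,
                   "open/active/inactive skew too large: " + tojson(t))
        sleep(100)
    }
}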

Comment by Bruce Lucas (Inactive) [ 26/Jul/18 ]

Using pmp on the repro above I find a lot of threads here:

pthread_cond_wait@@GLIBC_2.3.2:185;std::condition_variable::wait;mongo::OperationContext::waitForConditionOrInterruptNoAssertUntil;mongo::OperationContext::waitForConditionOrInterruptNoAssert;mongo::OperationContext::waitForConditionOrInterrupt;mongo::WiredTigerOplogManager::waitForAllEarlierOplogWritesToBeVisible;mongo::repl::ReplicationCoordinatorExternalStateImpl::waitForAllEarlierOplogWritesToBeVisible;mongo::repl::ReplicationCoordinatorImpl::_waitUntilOpTime;mongo::repl::ReplicationCoordinatorImpl::_waitUntilClusterTimeForRead;mongo::repl::ReplicationCoordinatorImpl::waitUntilOpTimeForReadUntil;mongo::repl::ReplicationCoordinatorImpl::waitUntilOpTimeForRead;mongo::waitForReadConcern;mongo::ServiceEntryPointMongod::Hooks::waitForReadConcern;mongo::(anonymous namespace)::execCommandDatabase;mongo::(anonymous namespace)::receivedCommands;mongo::ServiceEntryPointCommon::handleRequest;mongo::ServiceEntryPointMongod::handleRequest;mongo::ServiceStateMachine::_processMessage;mongo::ServiceStateMachine::_runNextInGuard;std::_Function_handler<...>;mongo::transport::ServiceExecutorSynchronous::schedule;mongo::ServiceStateMachine::_scheduleNextWithGuard;mongo::ServiceStateMachine::_sourceCallback;mongo::ServiceStateMachine::_sourceMessage;mongo::ServiceStateMachine::_runNextInGuard;std::_Function_handler<...>;std::_Function_handler<...>;mongo::(anonymous namespace)::runFunc;start_thread:333;clone:109

If at this point in the code the transaction is "open" but neither "active" nor "inactive" that (rather than time between bumping open and bumping active) could account for the discrepancy.

Comment by Bruce Lucas (Inactive) [ 26/Jul/18 ]

So, given that active+inactive is a relatively small fraction of open (see chart above), the time spent outside of both the active and inactive periods is a large fraction of the open period (the time between incrementing open and decrementing it). If most of that time falls between bumping open and bumping active (an assumption, maybe not well founded), then the time between incrementing open and incrementing active is a relatively large fraction of the time between incrementing open and decrementing it. The latter includes a round trip to the client (assuming we don't decrement open until we receive the commit from the client - correct?), so that seems like a rather long time indeed.

(To put some numbers to it, from the chart, (avg active + avg inactive) / (avg open) = (0.637 + 0.976) / 7.488 = 0.215, i.e. about 22%, so by the current metrics we are neither "active" nor "inactive" for about 78% of the "open" time.)

Comment by William Schultz (Inactive) [ 26/Jul/18 ]

Yes, as I understand it, the snapshot allocation could be considered one of the earliest "resource acquisition" events of a transaction. This may or may not, however, eagerly start a storage engine transaction. milkie would be the better person to answer that question.

In regards to the delay between the "opening" of the transaction and the first "unstashing" (snapshot allocation, etc.), I guess it depends on what you consider a "large" delay. The call to beginOrContinueTxn, which is executed inside here, basically just sets up some metadata related to the transaction; no actual "work" of the transaction has been done yet. It is not until this point that we actually execute the first "unstash" event, and then subsequently execute the first command of the transaction. I haven't really measured how long it generally takes between these two events (the "begin" and the "unstash"), but the gap seems to be mostly occupied by command argument parsing logic. It's probably a little weird that we set up the transaction metadata indicating the start of a transaction before we parse all the necessary command arguments, but that's the way things have been structured.

Comment by Bruce Lucas (Inactive) [ 26/Jul/18 ]

I think it would be reasonable to consider it started when the first resource is taken, and I think your comment implies that that is the snapshot? If so that seems like a reasonable place.

However, if I'm interpreting this all correctly, it seems like there is a large delay between when we first get word of the new transaction and currently bump open, and when we actually start to do some work related to the transaction (create the snapshot and WUOW) and bump active. If so, do we know why? Is there a performance problem here? If this is expected, maybe it is something that needs some kind of metric?

Comment by William Schultz (Inactive) [ 26/Jul/18 ]

Yes, the initial "unstash" operation is where, for example, we create the transaction's WriteUnitOfWork and pre-allocate a snapshot for the transaction, so I think this is a reasonable marker for the transaction start. It would be a bit more difficult to "start" the transaction when it first takes collection locks, since the taking of locks is deferred to the execution of the first command run on that transaction. So, in order to track that event we would probably have to inject some metrics tracking into specific command codepaths, which I think we would rather not do.

Comment by Bruce Lucas (Inactive) [ 26/Jul/18 ]

Maybe the point at which the transaction first begins to hold resources, e.g. locks and cache content, is a reasonable marker for the start of the transaction? This will be the point at which the transaction begins to have a potentially significant impact on other operations, right?

Comment by William Schultz (Inactive) [ 26/Jul/18 ]

bruce.lucas Yes, you are right. We currently treat Session::beginOrContinueTxn as the "opening" of the transaction, even though we don't make the initial transition to active until the first time it calls Session::unstashTransactionResources. I think we intuitively thought to "open" the transaction inside beginOrContinueTxn, but your point is definitely valid, in that it makes the active/inactive counters inconsistent with open. I think we should consider moving the initial call to "open" the transaction into "unstash", so that it aligns with the initial call to make the transaction "active". There is arguably some ambiguity about what constitutes the true "beginning" of a multi-document transaction internally, so I think it is reasonable to consider the first "unstash" event as the beginning of the transaction, especially since I don't see any reason why there should be a significant gap of time between when we call beginOrContinueTxn and when we first call unstashTransactionResources.

Comment by Bruce Lucas (Inactive) [ 26/Jul/18 ]

I think the problem here is skew between the open counter and the active or inactive counters, not between active and inactive. For example, when we increment open, we don't at that point in the code also immediately increment active or inactive (whichever is appropriate). Maybe a significant amount of time elapses before we finally increment active or inactive?

Comment by Eric Milkie [ 26/Jul/18 ]

Currently the active and inactive counters are always updated in concert in the code, one immediately after the other, so the only benefit of the check is confirming that we haven't forgotten to update both together somewhere in the code. It doesn't catch, however, a place where the counters should have been updated but weren't.

Comment by Bruce Lucas (Inactive) [ 26/Jul/18 ]

I think it's not a bad thing that there are three separate counters. If there were only two, one of them could be inaccurate and we wouldn't know it. The redundancy provides a check on the accuracy.

Comment by Eric Milkie [ 26/Jul/18 ]

I'm not sure why the code keeps three separate counters for active, inactive, and open, and updates them all independently. We could reduce the number of counters and calculate one from the other two.
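
For illustration, a hypothetical reading-side derivation using the serverStatus fields discussed above, which is roughly what dropping one of the three counters would amount to:

// Keep currentOpen and currentActive as independent counters and
// derive the inactive count instead of maintaining it separately.
var t = db.serverStatus().transactions
var derivedInactive = t.currentOpen - t.currentActive
print("derived currentInactive = " + derivedInactive)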
