[SERVER-47649] Possible race between session expiration and transaction creation Created: 18/Apr/20  Updated: 29/Oct/23  Resolved: 31/May/20

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 4.2.0

Type: Improvement Priority: Major - P3
Reporter: David Bartley Assignee: Dmitry Agranat
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Backwards Compatibility: Fully Compatible
Participants:

 Description   

We've been observing odd transaction failures since ramping up an application that uses MongoDB 4.0 transactions. We observe the following:
1) update, with lsid=X, txnNumber=1, startTransaction=true, autocommit=false
2) commitTransaction, with lsid=X, txnNumber=1, autocommit=false
3) wait for at least 30 minutes
4) update, with lsid=X, txnNumber=2, startTransaction=true, autocommit=false
5) 2nd update, with lsid=X, txnNumber=2, autocommit=false

Despite the update in 4 succeeding, the update in 5 fails. We've added some debugging info to our mongod build and observe that when 5 fails, either _activeTxnNumber=1 and _txnState=3, or _activeTxnNumber=-1 and _txnState=0, which would indicate that mongod had seemingly lost the transaction state from 4.

The actual operation/command doesn't seem matter, as we've observed it with "aggregate" as well.



 Comments   
Comment by Dmitry Agranat [ 31/May/20 ]

Hi bartle, I will go ahead and close this ticket given the reported issue is fixed in 4.2. If this is still an issue for you after upgrading to 4.2, please reopen this ticket and we'll have a look.

Regards,
Dima

Comment by Dmitry Agranat [ 05/May/20 ]

bartle The race you described makes sense. In 4.0, SessionCatalog::_reapInMemorySessionsOlderThan acquired a mutex, then dropped and re-acquired it, leaving a gap where a new transaction could be started but then still be treated as expired.

This code was heavily refactored in 4.2 and we believe this check in the new code prevents the race.

Comment by Dmitry Agranat [ 23/Apr/20 ]

Thanks bartle for a detailed explanation of what might have happened.

The reason I've asked before about your driver and if you saw any client-side errors is because we've seen (and fixed) at least one case with the same error which was related to client-side serialization error. Given the error was spotted only on the server-side, at first, it was not clear whether it's a driver, server, or another issue.

Getting this information would help us to speed up this case progress together with the attempts to reproduce the reported issue.

Thanks,
Dima

Comment by Jeffrey Yemin [ 23/Apr/20 ]

This does not look like SPEC-1195 to me. SPEC-1195 is about client-side errors (e.g. a serialization error) that happens while in a transaction.

Comment by A. Jesse Jiryu Davis [ 23/Apr/20 ]

I've moved teams since we wrote the Drivers Transactions Spec, so I'm going to abstain from commenting.

Comment by David Bartley [ 23/Apr/20 ]

Reading through the code in session.cpp and session_catalog.cpp, I have a potential explanation for what happend:
1) Existing session X is left idle for 30 mins
2) _reapInMemorySessionsOlderThan runs and observes that session X is expired, and adds it to possiblyExpiredLsids
3) X is used in a new transaction (i.e. update with startTransaction:true), and _activeTxnNumber is updated
4) _reapInMemorySessionsOlderThan continues running and calls _invalidateSession on X, which removes it from _txnTable
5) X is used in an additional transaction, but the session can't be found in _txnTable
6) refreshFromStorageIfNeeded is called to populate the in-memory session from disk, which loads details from the "config.transactions" collection, which would reflect state as of 1 (i.e. there would be no record of the new transaction from 3)

In step 3 above, you might expect that this should also update the "config.transactions" collection (and hence 6 should have fetched state reflecting the new transaction), but my read of the code is that updates to "config.transactions" only happen in response to oplog writes (onWriteOpCompletedOnPrimary calls updateSessionEntry), and in 4.0 individual operations in a transaction don't generate an oplog write; instead, a single oplog entry is generated when the transaction is committed.

Assuming this analysis is correct, I'd expect the fix would be to have the "Remove the session ids from the in-memory catalog" loop verify that "session->lastCheckout < possiblyExpired" is still true (in this case it wouldn't be, because the new session in 3) would have updated lastCheckout).

Comment by David Bartley [ 23/Apr/20 ]

We're actually using a ~custom driver, though it should be compliant with the driver specs. I have no reason to believe you couldn't reproduce this with a standard driver, though I don't know of a way to reliably reproduce this (other than perhaps trying to reuse a session in a transaction after ~exactly 30 min have past since the last transaction on the session).

The error comes from https://github.com/mongodb/mongo/blob/r4.0.14/src/mongo/db/session.cpp#L567-L570, so the full error would be of the form "Given transaction number 2 does not match any in-progress transactions.", with code NoSuchTransaction.

There aren't any logs covering the time of the event, other than "connection accepted" and "end connection" logs.

Comment by Dmitry Agranat [ 19/Apr/20 ]

Hi bartle, thank you for the report.

In order for us to investigate this issue, could you please provide:

  • driver type and the exact driver version
  • compressed mongod log covering the time of this event (or at least all the lines covering all the relevant isid and txnNumber)
  • full client-side error

I've created a secure uploader for you. Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time.

Thanks,
Dima

Comment by David Bartley [ 18/Apr/20 ]

Oh, the actual error is "Given transaction number 2 does not match any in-progress transactions.".

Generated at Thu Feb 08 05:14:49 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.