[SERVER-41062] Always return TransactionTooLarge rather than BSONObjectTooLarge when transaction is too large to fit in a single applyOps Created: 08/May/19  Updated: 29/Oct/23  Resolved: 02/Jul/19

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.0.9
Fix Version/s: 4.0.11

Type: Improvement Priority: Major - P3
Reporter: Jason Chan Assignee: Siyuan Zhou
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-40790 Test downgrading 4.2 to 4.0 and then ... Closed
Backwards Compatibility: Fully Compatible
Sprint: Repl 2019-07-01, Repl 2019-07-15
Participants:

 Description   

We should backport the moving of the try clause in onTransactionCommit up to also cover the opsArray.done() call since it is possible for that call to throw a BSONObjectTooLarge exception, as we do in SERVER-40790.

The call in question on the 4.0 branch is here: https://github.com/mongodb/mongo/blob/ed33d2f8e66cd907ef13ee162f6a1dac3ff25672/src/mongo/db/op_observer_impl.cpp#L918



 Comments   
Comment by Githook User [ 02/Jul/19 ]

Author:

{'name': 'Siyuan Zhou', 'email': 'visualzhou@gmail.com', 'username': 'visualzhou'}

Message: SERVER-41062 Always return TransactionTooLarge rather than BSONObjectTooLarge when transaction is too large to fit in a single applyOps
Branch: v4.0
https://github.com/mongodb/mongo/commit/de8b171fbd730be5f82d0c4b6a3c378000167246

Comment by Siyuan Zhou [ 13/May/19 ]

Thanks, william.schultz for the clarification. Even if getDurableReplOperationSize() overestimates the size, it'll still be a subtle assumption to rely on. To code defensively and to make the code easier to reason about, it's still better to make the change in 4.0. I think it's fine to just extend try-catch to cover more lines without a targeted test. That's easier than guaranteeing the overestimation. However as you mentioned offline, its priority is low.

Comment by William Schultz (Inactive) [ 13/May/19 ]

To clarify, my hypothesis was that the size calculation made by repl::OplogEntry::getDurableReplOperationSize overestimates by enough bytes so that by the time we place the operation into a BSONArray inside an applyOps object, we would have already failed inside TransactionParticipant. I think the thing to investigate would be to compare the size estimate returned by repl::OplogEntry::getDurableReplOperationSize for a given operation with the actual size needed to place it inside an applyOps object subarray.

Comment by Siyuan Zhou [ 13/May/19 ]

According to the comment, we don't assumeĀ repl::OplogEntry::getReplOperationSize() gives an overestimate of the actual size, so we need to have try-catch to cover array.done(), which is what the try-catch was supposed to cover.

    // _transactionOperationBytes is based on the in-memory size of the operation.  With overhead,
    // we expect the BSON size of the operation to be larger, so it's possible to make a transaction
    // just a bit too large and have it fail only in the commit.  It's still useful to fail early
    // when possible (e.g. to avoid exhausting server memory).

Comment by Jason Chan [ 09/May/19 ]

siyuan.zhou

After discussing with william.schultz, this backport might not be necessary since we check the transaction size here https://github.com/mongodb/mongo/blob/76912f3714ddcd82a3b53e1ecebba0044ac13d07/src/mongo/db/session.cpp#L1071 before entering the Op Observer onCommit call.

The bug existed in 4.2 because of the specific edge case where we start a 4.2 style transaction with no size limit, and then perform a downgrade and try to commit that large transaction. At the point of commit, opsArray.done() will end up trying to fit the large transaction into a single applyOps and throw BSONObjectTooLarge. We don't expect this to happen in 4.0 since we would be unable to create a large transaction in the first place. We should instead be hitting the error in addTransactionOperation before ever reaching the onTransactionCommit.

Generated at Thu Feb 08 04:56:43 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.