[SERVER-60823] runCommandWithRetries in JS test framework exceeds JS interpreter recursion limit Created: 19/Oct/21  Updated: 23/Jan/24

Status: Backlog
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor - P4
Reporter: Benety Goh Assignee: Backlog - Replication Team
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Related
related to SERVER-66991 add incompatible_with_gcov to geo_nea... Closed
is related to SERVER-32522 set_read_and_write_concerns.js treats... Closed
is related to SERVER-38937 Unify txn_override.js and auto_retry_... Closed
Assigned Teams:
Replication
Operating System: ALL
Backport Requested:
v7.3
Sprint: Execution Team 2021-11-29, Execution Team 2021-12-13, Execution Team 2022-01-24
Participants:
Linked BF Score: 173

 Description   

In some of our CI suites, the command invocation in tests is overridden to support retries. For some multi-document passthroughs, when there is a need to retry a transaction, the runCommandWithRetries() logic in implicitly_retry_on_background_op_in_progress.js and network_error_and_txn_override.js implicitly recurses into itself with the following repeating stack of function calls. This has the potential to exceed the internal recursion limit in the JS interpreter, leading to the test terminating early.

[multi_stmt_txn_passthrough:orp] retryEntireTransaction@jstests/libs/override_methods/network_error_and_txn_override.js:690:15
[multi_stmt_txn_passthrough:orp] retryWithTxnOverride@jstests/libs/override_methods/network_error_and_txn_override.js:753:15
[multi_stmt_txn_passthrough:orp] runCommandOverrideBody@jstests/libs/override_methods/network_error_and_txn_override.js:1032:23
[multi_stmt_txn_passthrough:orp] runCommandOverride@jstests/libs/override_methods/network_error_and_txn_override.js:1101:21
[multi_stmt_txn_passthrough:orp] overrideRunCommand/Mongo.prototype.runCommand@jstests/libs/override_methods/override_helpers.js:81:20
[multi_stmt_txn_passthrough:orp] runCommandWithRetries/<@jstests/libs/override_methods/implicitly_retry_on_background_op_in_progress.js:58:19
[multi_stmt_txn_passthrough:orp] assert.soon@src/mongo/shell/assert.js:366:21
[multi_stmt_txn_passthrough:orp] runCommandWithRetries@jstests/libs/override_methods/implicitly_retry_on_background_op_in_progress.js:54:5
[multi_stmt_txn_passthrough:orp] overrideRunCommand/Mongo.prototype.runCommand@jstests/libs/override_methods/override_helpers.js:81:20
[multi_stmt_txn_passthrough:orp] runClientFunctionWithRetries@src/mongo/shell/session.js:371:27
[multi_stmt_txn_passthrough:orp] runCommand@src/mongo/shell/session.js:466:25
[multi_stmt_txn_passthrough:orp] DB.prototype._runCommandImpl@src/mongo/shell/db.js:155:12
[multi_stmt_txn_passthrough:orp] DB.prototype.runCommand@src/mongo/shell/db.js:170:16
[multi_stmt_txn_passthrough:orp] retryEntireTransaction@jstests/libs/override_methods/network_error_and_txn_override.js:690:15
[multi_stmt_txn_passthrough:orp] retryWithTxnOverride@jstests/libs/override_methods/network_error_and_txn_override.js:753:15
[multi_stmt_txn_passthrough:orp] runCommandOverrideBody@jstests/libs/override_methods/network_error_and_txn_override.js:1032:23
[multi_stmt_txn_passthrough:orp] runCommandOverride@jstests/libs/override_methods/network_error_and_txn_override.js:1101:21
[multi_stmt_txn_passthrough:orp] overrideRunCommand/Mongo.prototype.runCommand@jstests/libs/override_methods/override_helpers.js:81:20
[multi_stmt_txn_passthrough:orp] runCommandWithRetries/<@jstests/libs/override_methods/implicitly_retry_on_background_op_in_progress.js:58:19
[multi_stmt_txn_passthrough:orp] assert.soon@src/mongo/shell/assert.js:366:21
[multi_stmt_txn_passthrough:orp] runCommandWithRetries@jstests/libs/override_methods/implicitly_retry_on_background_op_in_progress.js:54:5



 Comments   
Comment by Sviatlana Zuiko [ 02/Jun/22 ]

benety.goh@mongodb.com and colleagues,
Just a heads up that there have been two recent spikes of failures with the "InternalError: too much recursion" error on the .multi_stmt_txn._jscore_passthrough suites:

  • tenant_migration_multi_stmt_txn_jscore_passthrough / insert2.js - BF-25458
  • replica_sets_multi_stmt_txn_stepdown_jscore_passthrough / geo_s2sparse.js, geo_near_random2.js - BF-22812

Taking into account that the issue is being hit constantly and is also "Hot due to frequency", can we prioritize the work on SERVER-60823? Thank you!

Comment by Steven Vannelli [ 10/May/22 ]

Moving this ticket to the Backlog and removing the "Backlog" fixVersion as per our latest policy for using fixVersions.

Comment by Benety Goh [ 14/Apr/22 ]

A more sustainable approach to solving the recursion issue may be to address the reentrant nature of the runCommandOverride() function in network_error_and_txn_override.js. This would involve revisiting and refining the solution in SERVER-38937, which itself was an improvement over an earlier attempt.

Comment by Benety Goh [ 14/Apr/22 ]

One alternative solution is to address the indirection in overrideRunFunction added in SERVER-32522. This reduces the depth of the call stack by a factor determined by the number of overrides. However, this will not prevent the JavaScript interpreter from hitting the recursion limit if any overridden function invokes the top-level DB.runCommand().

Comment by Benety Goh [ 07/Apr/22 ]

When testing with the multi-doc transaction test overrides and a modified server that repeatedly fails with a WriteConflictException, the part of the stack that is currently most interesting is:

[multi_stmt_txn_passthrough:single_insert] DB.prototype.runCommand@src/mongo/shell/db.js:182:21
[multi_stmt_txn_passthrough:single_insert] retryEntireTransaction@jstests/libs/override_methods/network_error_and_txn_override.js:687:37

In retryEntireTransaction, we invoke the top-level DB.runCommand() on each embedded operation in a multi-doc transaction. This has the effect of adding a duplicate stack of test override calls to the recursion on each retry attempt.
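
The stack growth described above can be illustrated with a small, self-contained model. This is only a sketch: the layer names (topLevel, retryOverride, txnOverride, server), the tracked() helper, and the "WriteConflict" result shape are hypothetical stand-ins, not the code in network_error_and_txn_override.js. The one property it shares with the real stack is that the retry path calls back into the top-level entry point.

```javascript
// Hypothetical model of nested command overrides; not the real test framework code.
let depth = 0;
let maxDepth = 0;

// Wraps a function so that every call is counted as one stack frame.
function tracked(fn) {
    return function(...args) {
        depth++;
        maxDepth = Math.max(maxDepth, depth);
        try {
            return fn(...args);
        } finally {
            depth--;
        }
    };
}

// Builds a fake client whose "server" fails `failures` times before succeeding.
function makeClient(failures) {
    let remaining = failures;

    const server = tracked(function(cmd) {
        if (remaining > 0) {
            remaining--;
            return {ok: 0, code: "WriteConflict"};
        }
        return {ok: 1};
    });

    // Stand-in for the transaction override: on failure it retries the whole
    // transaction while its own frame is still on the stack.
    const txnOverride = tracked(function(cmd) {
        const res = server(cmd);
        return res.ok === 1 ? res : retryEntireTransaction([cmd]);
    });

    // Stand-in for any other override layered above the transaction override.
    const retryOverride = tracked(function(cmd) {
        return txnOverride(cmd);
    });

    // Stand-in for the top-level DB.runCommand() entry point.
    const topLevel = tracked(function(cmd) {
        return retryOverride(cmd);
    });

    // Re-enters the top-level entry point for every operation in the transaction.
    const retryEntireTransaction = tracked(function(ops) {
        let res;
        for (const op of ops) {
            res = topLevel(op);
        }
        return res;
    });

    return topLevel;
}

// Returns the deepest stack reached while running one command.
function maxDepthFor(failures) {
    depth = 0;
    maxDepth = 0;
    makeClient(failures)({insert: "c"});
    return maxDepth;
}
```

In this model every failed attempt leaves four frames on the stack (the three layers plus the retry function), so maxDepthFor(n) grows linearly with the number of consecutive failures instead of staying constant.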

Comment by Githook User [ 10/Mar/22 ]

Author:

{'name': 'Tommaso Tocci', 'email': 'tommaso.tocci@mongodb.com', 'username': 'toto-dev'}

Message: SERVER-60823 temporarly disable test from txn suites due to recursion issue
Branch: master
https://github.com/mongodb/mongo/commit/c0652d91db0fb5cc54f12811e100ec8108130112

Comment by Dianna Hohensee (Inactive) [ 02/Nov/21 ]

At the least, we need the BF to stop happening.

Comment by Benety Goh [ 19/Oct/21 ]

One possible solution is to avoid the use of recursion to retry a command - this might be accomplished by having one of the lower level functions in the JS stack return some signal to retry, rather than performing the command invocation itself.
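
A minimal sketch of that idea, assuming a sentinel value as the retry signal. All names here (RETRY, attemptOnce, runCommandWithRetriesIterative, the `transient` field, fakeRunCommand) are hypothetical and chosen for illustration; none of them exist in the override files.

```javascript
// Hypothetical sentinel that a lower-level function returns to request a retry.
const RETRY = Symbol("retry");

// Lower-level function: runs the command once and reports that a retry is
// needed, rather than invoking the command again itself.
function attemptOnce(runCommand, cmdObj) {
    const res = runCommand(cmdObj);
    // Assumption: a retryable failure is marked by a made-up `transient` field.
    if (res.ok !== 1 && res.transient) {
        return RETRY;
    }
    return res;
}

// Top-level driver: retries in a loop, so the stack depth is the same on the
// first attempt and on the last one.
function runCommandWithRetriesIterative(runCommand, cmdObj, maxAttempts) {
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        const res = attemptOnce(runCommand, cmdObj);
        if (res !== RETRY) {
            return res;
        }
    }
    throw new Error("gave up after " + maxAttempts + " attempts: " + JSON.stringify(cmdObj));
}

// Usage: a fake command that fails transiently 100000 times before succeeding.
let remainingFailures = 100000;
function fakeRunCommand(cmdObj) {
    if (remainingFailures > 0) {
        remainingFailures--;
        return {ok: 0, transient: true};
    }
    return {ok: 1};
}

const result = runCommandWithRetriesIterative(fakeRunCommand, {insert: "c"}, 200000);
```

Because the retry decision is made by the loop in the driver, the number of retries no longer affects stack depth. The trade-off is that every layer between the driver and the failing call has to pass the signal upward unchanged.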

Generated at Thu Feb 08 05:50:49 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.