[SERVER-45769] FSM workloads that run commands and expect them to fail cause infinite retry loops Created: 24/Jan/20  Updated: 29/Oct/23  Resolved: 03/Nov/20

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: 4.9.0

Type: Bug Priority: Major - P3
Reporter: Maria van Keulen Assignee: Ali Mir
Resolution: Fixed Votes: 0
Labels: former-quick-wins, undo-candidate
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-45767 Blacklist create_database.js from con... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Repl 2020-10-19, Repl 2020-11-02, Repl 2020-11-16
Participants:

 Description   

SERVER-45767 is an example of a situation where a workload can lead to an infinite transaction retry loop inside suites that use withTxnAndAutoRetry. To avoid similarly mysterious test timeout scenarios in the future, this retry loop should only execute for a finite number of retries.
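A minimal sketch of the kind of bound being proposed, with hypothetical names, values, and error handling (this is not the actual withTxnAndAutoRetry implementation):

```javascript
// Hypothetical sketch only -- names, structure, and the retry cap are
// illustrative, not the real withTxnAndAutoRetry helper.
const kMaxTxnRetries = 100;  // assumed cap; the real value would need tuning

function withTxnAndBoundedRetry(session, txnFunc) {
    let attempts = 0;
    let committed = false;
    do {
        attempts++;
        try {
            session.startTransaction();
            txnFunc();
            session.commitTransaction();
            committed = true;
        } catch (e) {
            session.abortTransaction_forTesting();
            // Only retry errors labeled transient; anything else (e.g. a
            // command the workload expects to fail) should surface immediately.
            // (The error-label check below is illustrative.)
            if (!e.errorLabels || !e.errorLabels.includes("TransientTransactionError")) {
                throw e;
            }
        }
    } while (!committed && attempts < kMaxTxnRetries);

    assert(committed,
           "transaction did not commit after " + kMaxTxnRetries + " attempts");
}
```

With a cap like this, a workload that keeps hitting a non-transient failure produces a distinct assertion instead of spinning until the task-level timeout fires.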



 Comments   
Comment by Githook User [ 03/Nov/20 ]

Author:

{'name': 'Ali Mir', 'email': 'ali.mir@mongodb.com', 'username': 'ali-mir'}

Message: SERVER-45769 Add additional logging about iterations in auto_retry_transaction.js
Branch: master
https://github.com/mongodb/mongo/commit/4308e341038a3a0fbdbe7a278c38b395ceb83936

Comment by Siyuan Zhou [ 10/Apr/20 ]

The original issue was that some tests expect commands to fail, so they cannot run in the transaction passthroughs; otherwise they will retry infinitely. One solution is to blacklist the test, as in SERVER-45767, since it only gives us extra test coverage on failed commands. Another, more complex, solution is to express that the command is expected to fail, so that it runs in its own transaction and is allowed to fail.

I think it's fine to just blacklist them from transaction passthrough tests. Instead, we should improve the debugging experience when this scenario happens. Since the system is busy looping, the core dump isn't helpful and could be misleading. The JS stacktrace (if available) isn't useful either. We can probably print logs when a statement or a transaction has been retried many times to help debug this kind of issue.

Comment by Ryan Timmons [ 27/Jan/20 ]

We had an idea during triage to simply print a message after the do/while has iterated more than N times. This would help gather data about how many iterations we actually want, and the path to using N as an exit condition (whether via do/while or assert.soon) becomes more obvious. Adding this print logic would hopefully be easy to do as part of any other work.
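A rough sketch of that idea, assuming a hypothetical threshold and a stand-in tryRunTransactionOnce() for the real loop body:

```javascript
// Hypothetical sketch of the triage idea: log once the retry loop has
// iterated suspiciously many times. Threshold and loop-body helper are
// illustrative placeholders.
const kWarnAfterIterations = 50;  // illustrative value of N

let iterations = 0;
let done = false;
do {
    iterations++;
    if (iterations > kWarnAfterIterations) {
        jsTestLog("Transaction retry loop has run " + iterations +
                  " times without committing; possible non-transient failure");
    }
    done = tryRunTransactionOnce();  // stand-in for the real loop body
} while (!done);
```

The resulting log lines would give data on how many iterations workloads actually need in practice before picking a hard cap.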

Comment by Maria van Keulen [ 27/Jan/20 ]

judah.schvimer As we introduce more features into multi-document transactions, suites like concurrency_replication_multi_stmt_txn, which are primarily suited to idempotent operations, will continue to uncover false-positive situations like the BF that resulted in SERVER-45767. I believe it is important to better future-proof these suites somehow.
I maintain that false-positive timeouts caused by the way this suite is run are less desirable than false-positive BFs that can at least be tied to the retries in this suite.
Whichever implementation path we decide on, I believe it is important to consider, and communicate to developers, how many retries occur as a result of this retry loop.

Comment by Judah Schvimer [ 27/Jan/20 ]

If there is no "safe" number, I would advocate for an assert.soon timeout of 10 minutes, like we do elsewhere, so the test doesn't time out but we still allow an unbounded number of attempts. To me, any extra BFs seem more costly than BFs that are harder to diagnose.

Comment by Maria van Keulen [ 27/Jan/20 ]

judah.schvimer I think the exact number will need some trial and error, depending on how many times in practice the loop generally needs to run to accommodate all of the TransientTransactionErrors.

FWIW, I would argue that the BFs arising from this change would be much quicker to diagnose than the potential, seemingly malignant (i.e., timeout) BFs that could occur with the existing structure.

Comment by Judah Schvimer [ 27/Jan/20 ]

After how many failed attempts can we guarantee there is a bug? I would only want to lower the number of attempts if it wouldn't lead to unnecessary BFs.

Comment by Maria van Keulen [ 27/Jan/20 ]

Got it. In that case, perhaps we can have both the finite retry loop and the assert.soon (see the sketch below). The suite itself relies on a retry loop to handle true transient transaction errors; perhaps the assert.soon can be added where we actually perform the contents of the transaction. This way, we reduce the number of cases where we have to wait for the test to time out before the hang analyzer is called, while still distinguishing a test that fails by exceeding a set number of retries from one that fails due to a true hang.
My reservation with replacing the entire retry framework with an assert.soon is that it does not distinguish between an infinite number of retries and one retry that takes an infinite amount of time.
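A minimal sketch of this combined approach, assuming hypothetical values and a stand-in runTransactionBodyOnce(); it is not the suite's actual implementation:

```javascript
// Hypothetical sketch: bound the retry count and the wall-clock time
// separately so the two failure modes are reported differently.
const kMaxRetries = 100;             // assumed cap on retry attempts
const kTimeoutMS = 10 * 60 * 1000;   // assumed 10-minute overall limit

let attempts = 0;
assert.soon(() => {
    attempts++;
    // Fails with its own message if we retry too many times, even long
    // before the wall-clock limit is reached.
    assert.lte(attempts, kMaxRetries,
               "exceeded " + kMaxRetries + " transaction retry attempts");
    return runTransactionBodyOnce();  // stand-in: returns true once committed
}, "transaction did not commit within the time limit", kTimeoutMS);
```

A retries-exceeded failure then surfaces with its own assertion message, while a run that never commits in time fails via the assert.soon timeout, which is the distinction described above.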

Comment by Ryan Timmons [ 27/Jan/20 ]

> What is the disadvantage of having the retry loop be finite?

There is no real disadvantage; it's just "yet another" place that is doing retries.

> In the event of a true hang, the timeout would trigger regardless of the number of retries.

The Evergreen hang-analyzer kicks in as a last resort. Relying on the assert.soon integration lets the failure happen sooner and can provide additional context in the logs, since the invocation of assert.soon shows up in log backtraces.

Comment by Maria van Keulen [ 27/Jan/20 ]

I don't think using assert.soon here would address the issue of keeping benign test infrastructure-related timeouts separate from genuine timeouts. What is the disadvantage of having the retry loop be finite? In the event of a true hang, the timeout would trigger regardless of the number of retries.

Comment by Ryan Timmons [ 27/Jan/20 ]

If it used assert.soon, it would automatically trigger the hang-analyzer.

That is in fact my recommendation here: change the do/while to use assert.soon. I'm not 100% sure the hang-analyzer output would be all that useful, to be honest, but consolidating flow control through a small number of helpers (such as assert.soon) lets us add more post-test logic in one place.

Evergreen should also run the hang-analyzer after the task exceeds its timeout; it looks like that is indeed what happened in the underlying BF (BF-15948). IMHO that seems like a "last resort"; doing things explicitly inside flow control lets us be more careful about where/when/why the hang-analyzer/* and friends are called.

Comment by Maria van Keulen [ 27/Jan/20 ]

judah.schvimer FWIW, the hang analyzer output was not particularly helpful in this case, since the timeout was due to testing infrastructure rather than something going wrong in the code. In future cases like these, it would be more useful to have something this suite specifically outputs, such as a log message stating that the suite's maximum number of retries was exceeded.
Timeouts are inherently some of the most difficult cases to debug, so separating out this comparatively benign case would be helpful.

Comment by Judah Schvimer [ 27/Jan/20 ]

maria.vankeulen, was the test timeout helpful for getting hang analyzer output? ryan.timmons, would this be able to get hang analyzer output on failure if it didn't time out the task?
