[SERVER-45769] FSM workloads that run commands and expect them to fail cause infinite retry loops Created: 24/Jan/20 Updated: 29/Oct/23 Resolved: 03/Nov/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Testing Infrastructure |
| Affects Version/s: | None |
| Fix Version/s: | 4.9.0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Maria van Keulen | Assignee: | Ali Mir |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | former-quick-wins, undo-candidate | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Sprint: | Repl 2020-10-19, Repl 2020-11-02, Repl 2020-11-16 |
| Participants: | |
| Description |
|
|
| Comments |
| Comment by Githook User [ 03/Nov/20 ] |
|
Author: {'name': 'Ali Mir', 'email': 'ali.mir@mongodb.com', 'username': 'ali-mir'}
Message: |
| Comment by Siyuan Zhou [ 10/Apr/20 ] |
|
The original issue was that some tests expect commands to fail, so they cannot run in the transaction passthroughs; otherwise they retry infinitely. One solution is to blacklist those tests, and I think it's fine to just blacklist them from the transaction passthrough suites. Instead, we need to improve the debugging experience when this scenario happens. Since the system is busy looping, the core dump isn't helpful and could be misleading, and the JS stacktrace (if available) isn't useful either. We could print a log message when a statement or a transaction has been retried many times, to help debug this kind of issue. |
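A minimal sketch of what such a retry-count warning could look like inside the suite's transaction retry loop. The threshold, the `runTransactionBody` helper, and the error-label check are illustrative assumptions, not the actual retry helper used by the FSM suites:

```javascript
// Illustrative sketch only: warn once the do/while retry loop has spun far
// more times than genuine TransientTransactionErrors should ever require.
const kRetryWarningThreshold = 100;  // assumed value, to be tuned in practice

let hasCommitted = false;
let attempts = 0;
do {
    attempts++;
    if (attempts > kRetryWarningThreshold) {
        // A busy-looping workload (e.g. one retrying a command that is expected
        // to fail) now leaves a clear trace in the logs instead of silently
        // spinning until the task times out.
        print("Transaction retried " + attempts + " times without committing; " +
              "the workload may be running a command that always fails.");
    }
    try {
        runTransactionBody();  // hypothetical stand-in for the transaction contents
        hasCommitted = true;
    } catch (e) {
        // Retry only on transient transaction errors; anything else is a real failure.
        if (!(e.hasOwnProperty("errorLabels") &&
              e.errorLabels.includes("TransientTransactionError"))) {
            throw e;
        }
    }
} while (!hasCommitted);
```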
| Comment by Ryan Timmons [ 27/Jan/20 ] |
|
We had an idea during triage to simply print a message after the do/while has iterated more than N times. This would help gather data about how many iterations we actually need, and the path to using N as an exit condition (whether via do/while or assert.soon) would become more obvious. Adding this print logic would hopefully be easy to fold into any other work here. |
| Comment by Maria van Keulen [ 27/Jan/20 ] |
|
judah.schvimer As we introduce more features into multi-document transactions, suites like concurrency_replication_multi_stmt_txn, which are primarily suited to idempotent operations, will continue to uncover false-positive situations like the BF that resulted in |
| Comment by Judah Schvimer [ 27/Jan/20 ] |
|
If there is no "safe" number, I would advocate for an assert.soon timeout of 10 minutes like we do elsewhere so the test doesn't time out, but we still have an unbounded number of attempts. Any extra BFs to me seem more costly than BFs that are harder to diagnose. |
| Comment by Maria van Keulen [ 27/Jan/20 ] |
|
judah.schvimer I think the exact number will need some trial and error, depending on how many times the loop generally needs to run in practice to accommodate all of the TransientTransactionErrors. FWIW, I would argue that the BFs arising from this change would be much quicker to diagnose than the seemingly malignant (i.e., timeout) BFs that can occur with the existing structure. |
| Comment by Judah Schvimer [ 27/Jan/20 ] |
|
After how many failed attempts can we guarantee there is a bug? I would only want to lower the number of attempts if it wouldn't lead to unnecessary BFs. |
| Comment by Maria van Keulen [ 27/Jan/20 ] |
|
Got it. In that case, perhaps we can have both the finite retry loop and the assert.soon. The suite itself relies on a retry loop to handle true transient transaction errors. Perhaps the assert.soon can be added where we actually perform the contents of the transaction. This way, we reduce the number of cases where we have to wait for the test to time out before the hang analyzer is called, but we can still distinguish a test failing because it exceeded a set number of retries from a test failing due to a true hang. |
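A rough sketch of that combination, assuming the shell's assert.soon(func, msg, timeout) helper; the retry budget, timeout, and `runTransactionBody` name below are placeholders rather than the suite's real values:

```javascript
// Illustrative sketch: a bounded retry budget plus an overall assert.soon
// timeout, so the two failure modes produce distinct messages.
const kMaxTransientRetries = 50;        // assumed retry budget, to be tuned in practice
const kOverallTimeoutMS = 10 * 60 * 1000;

let attempts = 0;
assert.soon(() => {
    // Exhausting the retry budget fails with its own message, which is easy to
    // tell apart from a slow run that exhausts the assert.soon timeout below.
    assert.lte(++attempts, kMaxTransientRetries,
               "exceeded the suite's transient-error retry budget");
    try {
        runTransactionBody();  // hypothetical stand-in for the transaction contents
        return true;           // committed; stop retrying
    } catch (e) {
        if (e.hasOwnProperty("errorLabels") &&
            e.errorLabels.includes("TransientTransactionError")) {
            return false;      // transient; let assert.soon try again
        }
        throw e;               // non-transient errors are genuine test failures
    }
}, "transaction did not commit before the assert.soon timeout", kOverallTimeoutMS);
```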
| Comment by Ryan Timmons [ 27/Jan/20 ] |
There is no real disadvantage, just that it's "yet another" place that is doing retries.
The Evergreen hang analyzer kicks in as a last resort. Relying on the assert.soon integration lets the failure happen sooner and can provide additional context in the logs, since the invocation of assert.soon shows up in log backtraces. |
| Comment by Maria van Keulen [ 27/Jan/20 ] |
|
I don't think using assert.soon here would address the issue of keeping benign, test-infrastructure-related timeouts separate from genuine timeouts. What is the disadvantage of making the retry loop finite? In the event of a true hang, the timeout would trigger regardless of the number of retries. |
| Comment by Ryan Timmons [ 27/Jan/20 ] |
|
If it used assert.soon, it would automatically trigger the hang analyzer. That is in fact my recommendation here: change the do/while to use assert.soon. To be honest, I'm not 100% sure the hang-analyzer output would be all that useful, but consolidating flow control through a finite number of helpers (such as assert.soon) lets us add more post-test logic in one place. Evergreen should also run the hang analyzer after a task exceeds its timeout, and it looks like that's indeed what happened in the underlying BF (BF-15948). IMHO that seems like a last resort; doing things explicitly inside flow control lets us be more careful about where, when, and why the hang-analyzer scripts and friends are called. |
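As a sketch of that recommendation, the do/while could be replaced by assert.soon so a workload that never commits fails through the shell's assertion machinery and timeout instead of looping forever. The `runTransactionBody` name and the 10-minute timeout are illustrative (the timeout echoes the suggestion elsewhere in this thread):

```javascript
// Illustrative sketch: the same retry semantics expressed via assert.soon,
// so a workload that never commits fails via the assertion helper rather
// than spinning until the task-level timeout.
assert.soon(() => {
    try {
        runTransactionBody();  // hypothetical stand-in for the transaction contents
        return true;           // committed successfully
    } catch (e) {
        if (e.hasOwnProperty("errorLabels") &&
            e.errorLabels.includes("TransientTransactionError")) {
            return false;      // transient; assert.soon will retry until its timeout
        }
        throw e;
    }
}, "transaction never committed; a command in the workload may be failing every attempt",
             10 * 60 * 1000);  // assumed 10-minute ceiling
```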
| Comment by Maria van Keulen [ 27/Jan/20 ] |
|
judah.schvimer FWIW, the hang analyzer output was not particularly helpful in this case, since the timeout was due to testing infrastructure rather than something going wrong in the code. In future cases like these, it would be more useful to have something this suite specifically outputs, such as a log message stating that the suite's maximum number of retries was exceeded. |
| Comment by Judah Schvimer [ 27/Jan/20 ] |
|
maria.vankeulen, was the test timeout helpful for getting hang analyzer output? ryan.timmons, would this be able to get hang analyzer output on failure if it didn't time out the task? |