[SERVER-47950] Continuous stepdown thread should fail resmoke job loudly on exceptions Created: 05/May/20 Updated: 29/Oct/23 Resolved: 28/May/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication, Testing Infrastructure |
| Affects Version/s: | None |
| Fix Version/s: | 4.4.0-rc11, 4.7.0 |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Siyuan Zhou | Assignee: | Siyuan Zhou |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||
| Backport Requested: |
v4.4
|
||||||||
| Sprint: | Repl 2020-06-01 | ||||||||
| Participants: | |||||||||
| Linked BF Score: | 28 | ||||||||
| Description |
|
Stepdown thread exits silently on exceptions. It should fail loudly and fail the evergreen task. For example, if self._await_primaries times out and throws an exception, that should fail the task immediately rather than hang resmoke on its after_test hook. |
| Comments |
| Comment by Githook User [ 19/Jun/20 ] |
|
Author: {'name': 'Siyuan Zhou', 'email': 'siyuan.zhou@mongodb.com', 'username': 'visualzhou'}Message: (cherry picked from commit 116d80c8267425fb71043f5a1200993355917707) |
| Comment by Githook User [ 28/May/20 ] |
|
Author: {'name': 'Siyuan Zhou', 'email': 'siyuan.zhou@mongodb.com', 'username': 'visualzhou'}Message: |
| Comment by Robert Guo (Inactive) [ 07/May/20 ] |
|
I think Repl would be in a better position to make the change. While the mechanics of Max's suggestion is mostly about coordinating events in resmoke, the core problem is that a failure in _step_down_all() caused the replica set to be in a state such that subsequent tests can't proceed. I think repl would be able to better determine which error events can be reported later and which require a "stop the world now" solution. I'd be happy to add a diff of the change Max suggested above; the code change itself should be fairly small. |
| Comment by Max Hirschhorn [ 06/May/20 ] |
|
siyuan.zhou, the ContinuousStepdown._check_thread() method is how the parent (job) thread reports that the _StepdownThread has exited. It is called at the beginning of ContinuousStepdown.after_test(). As you've pointed out, checking at the beginning is insufficient if the in-progress iteration in _StepdownThread.run() raises an exception after _StepdownThread.pause() has been called. I think the smallest change for addressing this would look like (1) having self._is_idle_evt.set() in the except Exception block and (2) moving the if not self._stepdown_thread.is_alive() check inside of _StepdownThread.pause() immediately after the call to self._is_idle_evt.wait(). The contract would be the stepdown thread is done running when self._is_idle_evt.wait() returns; however, _is_idle_evt being set no longer means the cluster is expected to eventually be in a healthy state where _await_primaries() would succeed. Doing (2) is actually probably optional because the parent thread would raise an exception either in _await_primaries() from the cluster not being in a healthy state, or from detecting the stepdown thread having exited already in ContinuousStepdown.before_test() when the next is being run. |
| Comment by Siyuan Zhou [ 05/May/20 ] |
|
I think we need a way to let the parent thread know the error. They communicate via several ways, notably self.is_idle_evt and self._lifecycle. We can also add a new shared variable. max.hirschhorn, which do you recommend? |