[SERVER-47950] Continuous stepdown thread should fail resmoke job loudly on exceptions Created: 05/May/20  Updated: 29/Oct/23  Resolved: 28/May/20

Status: Closed
Project: Core Server
Component/s: Replication, Testing Infrastructure
Affects Version/s: None
Fix Version/s: 4.4.0-rc11, 4.7.0

Type: Improvement Priority: Major - P3
Reporter: Siyuan Zhou Assignee: Siyuan Zhou
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Related
Backwards Compatibility: Fully Compatible
Backport Requested:
v4.4
Sprint: Repl 2020-06-01
Participants:
Linked BF Score: 28

 Description   

Stepdown thread exits silently on exceptions. It should fail loudly and fail the evergreen task.

For example, if self._await_primaries times out and throws an exception, that should fail the task immediately rather than hang resmoke on its after_test hook.



 Comments   
Comment by Githook User [ 19/Jun/20 ]

Author:

{'name': 'Siyuan Zhou', 'email': 'siyuan.zhou@mongodb.com', 'username': 'visualzhou'}

Message: SERVER-47950 Continuous stepdown thread should fail resmoke job loudly on exceptions

(cherry picked from commit 116d80c8267425fb71043f5a1200993355917707)
Branch: v4.4
https://github.com/mongodb/mongo/commit/c9454a38845a3755e6ec638dea570122cc68ae61

Comment by Githook User [ 28/May/20 ]

Author:

{'name': 'Siyuan Zhou', 'email': 'siyuan.zhou@mongodb.com', 'username': 'visualzhou'}

Message: SERVER-47950 Continuous stepdown thread should fail resmoke job loudly on exceptions
Branch: master
https://github.com/mongodb/mongo/commit/116d80c8267425fb71043f5a1200993355917707

Comment by Robert Guo (Inactive) [ 07/May/20 ]

I think Repl would be in a better position to make the change. While the mechanics of Max's suggestion is mostly about coordinating events in resmoke, the core problem is that a failure in _step_down_all() caused the replica set to be in a state such that subsequent tests can't proceed. I think repl would be able to better determine which error events can be reported later and which require a "stop the world now" solution.

I'd be happy to add a diff of the change Max suggested above; the code change itself should be fairly small.

Comment by Max Hirschhorn [ 06/May/20 ]

siyuan.zhou, the ContinuousStepdown._check_thread() method is how the parent (job) thread reports that the _StepdownThread has exited. It is called at the beginning of ContinuousStepdown.after_test(). As you've pointed out, checking at the beginning is insufficient if the in-progress iteration in _StepdownThread.run() raises an exception after _StepdownThread.pause() has been called. I think the smallest change for addressing this would look like (1) having self._is_idle_evt.set() in the except Exception block and (2) moving the if not self._stepdown_thread.is_alive() check inside of _StepdownThread.pause() immediately after the call to self._is_idle_evt.wait().

The contract would be the stepdown thread is done running when self._is_idle_evt.wait() returns; however, _is_idle_evt being set no longer means the cluster is expected to eventually be in a healthy state where _await_primaries() would succeed. Doing (2) is actually probably optional because the parent thread would raise an exception either in _await_primaries() from the cluster not being in a healthy state, or from detecting the stepdown thread having exited already in ContinuousStepdown.before_test() when the next is being run.

Comment by Siyuan Zhou [ 05/May/20 ]

I think we need a way to let the parent thread know the error. They communicate via several ways, notably self.is_idle_evt and self._lifecycle. We can also add a new shared variable. max.hirschhorn, which do you recommend?

Generated at Thu Feb 08 05:15:42 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.