[SERVER-56675] Data files being archived twice for a single test run Created: 05/May/21 Updated: 27/Oct/23 Resolved: 30/Jul/21
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Testing Infrastructure |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Gregory Noma | Assignee: | Robert Guo (Inactive) |
| Resolution: | Works as Designed | Votes: | 0 |
| Labels: | tig-data-archival |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Operating System: | ALL |
| Sprint: | STM 2021-07-12, STM 2021-07-26, STM 2021-08-09 |
| Participants: | |
| Story Points: | 2 |
| Description |
It appears that data files are being archived twice for a single test run, once under "Repetition 0" and again under "Repetition 1", with no apparent reason for the duplication. Here is an example task and its logs showing this (seemingly) extraneous archival.
| Comments |
| Comment by Robert Guo (Inactive) [ 30/Jul/21 ] |
Ok, I think the issue here is that both the stepdown thread and the foreground test noticed the crash and independently decided to archive. Given that there's no synchronization between the background events and the foreground events internal to the test, the following two distinct sequences of events are possible:

A: the stepdown thread notices the crash and archives the data files first; the test then notices the same crash and archives them a second time.

B: the test notices the crash and archives the data files first; the stepdown thread then notices the same crash and archives them a second time.
As a result, I think we need both the stepdown thread and the test to check for crashes. More synchronization between the stepdown thread and the test is possible, but it would introduce waiting that delays when we get the data files, which doesn't seem like a worthwhile tradeoff compared with getting duplicate data files in rare circumstances. If this issue comes up more often, or if there are other pain points, please do reopen this ticket and we'd be happy to look into it further.
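To make the race concrete, here is a minimal, self-contained sketch. All of the names here (crash_detected, archive_data_files, the observer threads) are hypothetical stand-ins for illustration, not resmoke's actual API:

```python
import threading

# Hypothetical stand-ins for resmoke internals; nothing here is the real API.
crash_detected = threading.Event()
repetition = 0
repetition_lock = threading.Lock()

def archive_data_files(observer):
    """Archive under the next "Repetition N" directory, as seen in the task logs."""
    global repetition
    with repetition_lock:
        print(f"{observer}: archiving data files under Repetition {repetition}")
        repetition += 1

def stepdown_thread():
    # The background stepdown thread notices the crash on its own...
    crash_detected.wait()
    archive_data_files("stepdown thread")

def foreground_test():
    # ...and so does the foreground test; neither knows the other archived.
    crash_detected.wait()
    archive_data_files("foreground test")

threads = [threading.Thread(target=stepdown_thread),
           threading.Thread(target=foreground_test)]
for t in threads:
    t.start()
crash_detected.set()  # the crash happens; both observers react independently
for t in threads:
    t.join()
```

Whichever observer wins the race archives under Repetition 0 and the other archives under Repetition 1, matching sequences A and B above.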
| Comment by Robert Guo (Inactive) [ 29/Jul/21 ] |
Copying the full logs from the above snippet:
The second fixture abort (job3_fixture_abort_1) shouldn't have happened, given that the setup immediately preceding it (job3_fixture_setup_1) was successful. We can print the call stack after this line and see what's calling the second abort. It's not obvious to me after staring at the code for a few minutes; will dig into it a bit more.
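For reference, a minimal sketch of that diagnostic using only the standard library (the logger name and wrapper function are placeholders; in resmoke the call would go directly after the line referenced above):

```python
import logging
import traceback

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("resmoke")

def log_abort_call_stack():
    # traceback.format_stack() returns the current call stack as a list of
    # strings, so the task log shows exactly who requested the second abort.
    logger.info("fixture abort called from:\n%s", "".join(traceback.format_stack()))

log_abort_call_stack()
```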
| Comment by Brooke Miller [ 27/Jul/21 ] |
Rob will add some more context to the logs provided above and then hand it back off to Richard.
| Comment by Robert Guo (Inactive) [ 11/May/21 ] |
Looked into it with Richard over Zoom and narrowed the problem down to this line:
There shouldn't be anything in resmoke that calls setup and abort on a fixture consecutively. As a next step, we can add some logging to FixtureAbortTestCase and see where the caller is. |
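A minimal sketch of the suggested logging, with hypothetical function names (the real change would go inside resmoke's FixtureAbortTestCase, whose actual structure may differ):

```python
import inspect
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("resmoke")

def log_caller():
    # inspect.stack()[1] describes whoever called this function: its file,
    # line number, and function name, which is what we want in the logs.
    caller = inspect.stack()[1]
    logger.info("FixtureAbortTestCase triggered from %s:%d (%s)",
                caller.filename, caller.lineno, caller.function)

def abort_fixture():
    # Stand-in for the code path that schedules a fixture abort.
    log_caller()

abort_fixture()
```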