[SERVER-56675] Data files being archived twice for a single test run Created: 05/May/21  Updated: 27/Oct/23  Resolved: 30/Jul/21

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Gregory Noma Assignee: Robert Guo (Inactive)
Resolution: Works as Designed Votes: 0
Labels: tig-data-archival
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Sprint: STM 2021-07-12, STM 2021-07-26, STM 2021-08-09
Participants:
Story Points: 2

 Description   

Data files appear to be archived twice for a single test run, once under "Repetition 0" and once under "Repetition 1", with no apparent reason for the duplication. Here is an example task and its logs showing this (seemingly) extraneous archival.



 Comments   
Comment by Robert Guo (Inactive) [ 30/Jul/21 ]

OK, I think the issue here is that both the stepdown thread and the foreground test noticed the crash and independently decided to archive. Given that there's no synchronization between the background events and the foreground events internal to the test, both of the following sequences of events are possible:

A:

  • test starts
  • stepdown thread starts
  • stepdown thread ends
  • server crashes
  • test ends
  • test notices crash

B:

  • test starts
  • stepdown starts
  • test ends
  • server crashes
  • stepdown thread ends
  • stepdown thread notices crash


As a result, I think we need both the stepdown thread and the test to check for crashes. We could add more synchronization between the stepdown thread and the test, but that would introduce waiting, which delays when we get the data files; that doesn't seem like a worthwhile tradeoff compared with occasionally producing duplicate data files.
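
A minimal sketch of the race (not resmoke code; the names and the crash flag are placeholders): once the crash has happened, the background stepdown thread and the foreground test each check for it independently and each triggers an archive, because nothing coordinates the two checks.

 import threading

 server_crashed = threading.Event()    # set when the fixture crashes
 archive_count = 0

 def archive_data_files(source):
     # In resmoke this would copy the data files; here we just count calls.
     global archive_count
     archive_count += 1
     print(f"archiving data files (triggered by {source})")

 def stepdown_thread():
     # Background hook: checks fixture health when its work finishes.
     if server_crashed.is_set():
         archive_data_files("stepdown thread")

 def foreground_test():
     # Foreground test: also checks fixture health when it finishes.
     if server_crashed.is_set():
         archive_data_files("test")

 server_crashed.set()                  # the crash happens first...
 t = threading.Thread(target=stepdown_thread)
 t.start()
 foreground_test()
 t.join()
 print(f"archived {archive_count} time(s)")   # both observers fired: 2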

If this issue comes up more often or if there are other pain points, please do reopen this ticket and we'd be happy to look into it further.

Comment by Robert Guo (Inactive) [ 29/Jul/21 ]

Copying the full logs from the above snippet:

 [2021/05/02 13:53:20.661] [executor:js_test:job3] 13:53:20.660Z job3_fixture_setup_1 ran in 15.86 seconds: no failures detected.
 [2021/05/02 13:53:20.661] [executor:js_test:job3] 13:53:20.661Z Running job3_fixture_abort_1...
 [2021/05/02 13:53:20.698] [executor:js_test:job3] 13:53:20.697Z Writing output of job3_fixture_abort_1 to https://logkeeper.mongodb.org/build/24e997f6076aeb93bce56c4c6bebd776/test/608eaed0c2ab6816e02ceeb5.

The second fixture abort (job3_fixture_abort_1) shouldn't have happened, given that the setup immediately preceding it (job3_fixture_setup_1) was successful. We can print the call stack after this line to see what's calling the second abort; it's not obvious to me after staring at the code for a few minutes.
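
A sketch of that debugging step, assuming we can edit the resmoke call site that triggers the abort (the function name below is a hypothetical stand-in, not actual resmoke code); traceback.print_stack() dumps the current call chain to stderr, which would show what scheduled the second abort:

 import traceback

 def _trigger_fixture_abort():
     # Temporary instrumentation: print the chain of callers that led here.
     traceback.print_stack()
     # ... existing abort logic would follow ...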

Will dig into it a bit more.

Comment by Brooke Miller [ 27/Jul/21 ]

Rob will add some more context to the logs provided above and then hand it back off to Richard.

Comment by Robert Guo (Inactive) [ 11/May/21 ]

Looked into it with Richard over Zoom and narrowed the problem down to this line:

 [2021/05/04 17:21:41.816] [executor:js_test:job2] 17:21:41.816Z job2_fixture_setup_1 ran in 13.80 seconds: no failures detected.
 [2021/05/04 17:21:41.825] [executor:js_test:job2] 17:21:41.817Z Running job2_fixture_abort_1...

There shouldn't be anything in resmoke that calls setup and abort on a fixture consecutively. As a next step, we can add some logging to FixtureAbortTestCase and see where the caller is.
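
A sketch of the proposed logging, assuming a run_test-style entry point on FixtureAbortTestCase (the method name and logger usage are placeholders, not the actual resmoke code); traceback.format_stack() captures the caller chain so the log shows what invoked the abort:

 import logging
 import traceback

 class FixtureAbortTestCase:
     def run_test(self):
         # Temporary instrumentation: record the full caller chain so we can
         # see what scheduled this abort.
         logging.getLogger(__name__).info(
             "FixtureAbortTestCase invoked from:\n%s",
             "".join(traceback.format_stack()))
         # ... existing abort logic would follow ...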
