[SERVER-43049] Test failure file archiving can miss files. Created: 27/Aug/19  Updated: 06/Dec/22  Resolved: 09/Jan/20

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: 4.2.0, 4.3.1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Daniel Gottlieb (Inactive) Assignee: Backlog - Server Tooling and Methods (STM) (Inactive)
Resolution: Done Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Related
is related to SERVER-42315 Don't copy data files from a running ... Closed
Assigned Teams:
Server Tooling & Methods
Operating System: ALL
Backport Requested:
v4.2
Participants:
Linked BF Score: 15

 Description   

I suspect the failure to add a file to the archive (because it was deleted after showing up in the list of files, presumably benign) aborts the remainder of the process.

[2019/08/14 20:11:41.659] [executor:js_test:job0] 2019-08-14T19:11:41.644+0000 Archive succeeded for jstestfuzz-94f4-ent_fb4b-qa_a6ce-1565809840034-08:CheckReplDBHash: Tar/gzip Data files jstestfuzz-94f4-ent_fb4b-qa_a6ce-1565809840034-08:CheckReplDBHash - Execution 0 Repetition 0 files: ['/data/db/job0/resmoke']; Unable to add /data/db/job0/resmoke to archive file: [Errno 2] No such file or directory: '/data/db/job0/resmoke/node0/collection-188--1419834514067471100.wt'
[2019/08/14 20:11:41.659] [executor:js_test:job0] 2019-08-14T19:11:41.644+0000 JSTest jstestfuzz/out/jstestfuzz-94f4-ent_fb4b-qa_a6ce-1565809840034-08.js marked as a failure by a hook's after_test.
[2019/08/14 20:11:41.659] Traceback (most recent call last):
[2019/08/14 20:11:41.659]   File "/data/mci/e4693cd76d49b3f1588ee74afd52ddac/src/buildscripts/resmokelib/testing/hooks/interface.py", line 72, in run_dynamic_test
[2019/08/14 20:11:41.659]     self.run_test()
[2019/08/14 20:11:41.659]   File "/data/mci/e4693cd76d49b3f1588ee74afd52ddac/src/buildscripts/resmokelib/testing/hooks/jsfile.py", line 86, in run_test
[2019/08/14 20:11:41.659]     self._js_test.run_test()
[2019/08/14 20:11:41.659]   File "/data/mci/e4693cd76d49b3f1588ee74afd52ddac/src/buildscripts/resmokelib/testing/testcases/jstest.py", line 237, in run_test
[2019/08/14 20:11:41.659]     self._run_single_copy()
[2019/08/14 20:11:41.659]   File "/data/mci/e4693cd76d49b3f1588ee74afd52ddac/src/buildscripts/resmokelib/testing/testcases/jstest.py", line 192, in _run_single_copy
[2019/08/14 20:11:41.659]     test_case.run_test()
[2019/08/14 20:11:41.659]   File "/data/mci/e4693cd76d49b3f1588ee74afd52ddac/src/buildscripts/resmokelib/testing/testcases/interface.py", line 119, in run_test
[2019/08/14 20:11:41.659]     self._execute(shell)
[2019/08/14 20:11:41.659]   File "/data/mci/e4693cd76d49b3f1588ee74afd52ddac/src/buildscripts/resmokelib/testing/testcases/interface.py", line 140, in _execute
[2019/08/14 20:11:41.659]     raise self.failureException("%s failed" % (self.short_description()))
[2019/08/14 20:11:41.659] AssertionError: JSTest jstests/hooks/run_check_repl_dbhash.js failed
[2019/08/14 20:11:41.659] During handling of the above exception, another exception occurred:
[2019/08/14 20:11:41.659] Traceback (most recent call last):
[2019/08/14 20:11:41.659]   File "/data/mci/e4693cd76d49b3f1588ee74afd52ddac/src/buildscripts/resmokelib/testing/hooks/jsfile.py", line 53, in after_test
[2019/08/14 20:11:41.659]     JSHook.after_test(self, test, test_report)
[2019/08/14 20:11:41.659]   File "/data/mci/e4693cd76d49b3f1588ee74afd52ddac/src/buildscripts/resmokelib/testing/hooks/jsfile.py", line 37, in after_test
[2019/08/14 20:11:41.659]     hook_test_case.run_dynamic_test(test_report)
[2019/08/14 20:11:41.659]   File "/data/mci/e4693cd76d49b3f1588ee74afd52ddac/src/buildscripts/resmokelib/testing/hooks/interface.py", line 77, in run_dynamic_test
[2019/08/14 20:11:41.659]     raise errors.TestFailure(err.args[0])
[2019/08/14 20:11:41.659] buildscripts.resmokelib.errors.TestFailure: JSTest jstests/hooks/run_check_repl_dbhash.js failed
[2019/08/14 20:11:41.659] During handling of the above exception, another exception occurred:
[2019/08/14 20:11:41.659] Traceback (most recent call last):
[2019/08/14 20:11:41.659]   File "/data/mci/e4693cd76d49b3f1588ee74afd52ddac/src/buildscripts/resmokelib/testing/job.py", line 274, in _run_hooks_after_tests
[2019/08/14 20:11:41.659]     self._run_hook(hook, hook.after_test, test)
[2019/08/14 20:11:41.659]   File "/data/mci/e4693cd76d49b3f1588ee74afd52ddac/src/buildscripts/resmokelib/testing/job.py", line 227, in _run_hook
[2019/08/14 20:11:41.659]     hook_function(test, self.report)
[2019/08/14 20:11:41.659]   File "/data/mci/e4693cd76d49b3f1588ee74afd52ddac/src/buildscripts/resmokelib/testing/hooks/jsfile.py", line 55, in after_test
[2019/08/14 20:11:41.659]     raise errors.ServerFailure(err.args[0])
[2019/08/14 20:11:41.659] buildscripts.resmokelib.errors.ServerFailure: JSTest jstests/hooks/run_check_repl_dbhash.js failed
[2019/08/14 20:11:41.659] [executor] 2019-08-14T19:11:41.644+0000 Uploading to S3 /data/mci/e4693cd76d49b3f1588ee74afd52ddac/tmp/tmp23fdeabw.tgz to bucket mongodatafiles path mongodb-mongo-v4.2/enterprise-rhel-62-64-bit-flow-control-off/5706e608e514786deaae3f4fa7916fc7e6371dfe/datafiles/mongo-data-mongodb_mongo_v4.2_enterprise_rhel_62_64_bit_flow_control_off_jstestfuzz_replication_2_enterprise_rhel_62_64_bit_flow_control_off_5706e608e514786deaae3f4fa7916fc7e6371dfe_19_08_14_17_50_41-jstestfuzz-94f4-ent_fb4b-qa_a6ce-1565809840034-08_CheckReplDBHash-0-0.tgz



 Comments   
Comment by Daniel Gottlieb (Inactive) [ 09/Sep/19 ]

FWIW, I tried a hacky better-effort solution. The (unverified) premise for an improvement is that TarFile.add will first search the directory for files to add. Then once the files are gathered, it will add them one by one to the archive. The assumption is that adding each individual file here requires actually copying the contents of the file into the archive. Thus the window for the last file in the archive disappearing before being copied (and failing the archive) is much larger than the previous files. My attempt was to shrink the window for failure.

I kept the archival code that adds one directory and has TarFile.add recursively discover individual files to add to the archive. Instead, I tried using the exclude predicate. That function takes a filename string as input and the method would "succeed" if the file could be successfully opened, letting that file be archived. If the predicate failed to open the file, it returns a value telling TarFile.add to skip it.

However the exclude predicate was removed somewhere in python3 and fully replaced with the filter predicate. This predicate takes a TarInfo as input. Unfortunately, the tarring library first opens the file in order to create the TarInfo object.

Comment by Max Hirschhorn [ 27/Aug/19 ]

It turns out we always continue on error (which I believe is correct), so I really don't know what happened here.

The try-except block within the for-loop is misleading because there's actually only one entry in the list. We call TarFile.add() with the name of the /data/db/jobN/ directory and rely on the tarfile module to do the recursive traversal through the directory for us. If an exception occurs due to a file being removed either during this traversal or when attempting to add it to the tarball, then we'll stop adding more files to the tarball.

This issue should be addressed as a result of the proposal for SERVER-42315 to only archive the data files after the processes have exited because the data directory contents will be stable.

Comment by Daniel Gottlieb (Inactive) [ 27/Aug/19 ]

It turns out we always continue on error (which I believe is correct), so I really don't know what happened here.

Reassigning to TIG.

Generated at Thu Feb 08 05:02:07 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.