[SERVER-31562] dump replica set oplogs at the end of every failed test Created: 13/Oct/17  Updated: 30/Oct/23  Resolved: 14/Feb/18

Status: Closed
Project: Core Server
Component/s: Replication, Testing Infrastructure
Affects Version/s: None
Fix Version/s: 3.4.16, 3.6.6, 3.7.3

Type: Improvement
Priority: Major - P3
Reporter: Judah Schvimer
Assignee: Jonathan Abrahams
Resolution: Fixed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
is depended on by SERVER-32852 Capture FTDC data on failures of the ... Closed
Related
is related to SERVER-26884 Support archiving data files in Everg... Closed
Backwards Compatibility: Fully Compatible
Backport Requested: v3.6, v3.4
Sprint: TIG 2018-02-26
Participants:

 Description   

Before shutting down nodes, we can connect to each one and dump its oplog. Alternatively, we could wrap our tests in try...catch blocks that dump the oplogs in the catch block. The latter would require getting hold of the ShardingTest and ReplSetTest instances in an override, which may not be possible.
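
The try...catch alternative could be sketched roughly as below. This is a minimal Python illustration under stated assumptions, not the shell-side JavaScript the tests actually use: `run_with_oplog_dump`, `fetch_oplog`, `sink`, and the node names are all hypothetical. A real `fetch_oplog` would read the node's `local.oplog.rs` collection.

```python
from typing import Any, Callable, Dict, Iterable, List

OplogEntry = Dict[str, Any]


def run_with_oplog_dump(
    test_fn: Callable[[], None],
    nodes: Iterable[str],
    fetch_oplog: Callable[[str], List[OplogEntry]],
    sink: Callable[[str], None] = print,
) -> None:
    """Run test_fn; if it raises, dump every node's oplog, then re-raise.

    fetch_oplog is a hypothetical callable supplied by the caller; a real
    one would query the node's local.oplog.rs collection.
    """
    try:
        test_fn()
    except Exception:
        for node in nodes:
            try:
                for entry in fetch_oplog(node):
                    sink(f"[{node}] {entry}")
            except Exception as dump_error:
                # A node may already be down; keep dumping the others.
                sink(f"[{node}] could not dump oplog: {dump_error}")
        # Re-raise so the test is still reported as failed.
        raise
```

The original failure is re-raised after the dump, so the wrapper only adds diagnostics and never changes the test outcome.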



 Comments   
Comment by Githook User [ 24/May/18 ]

Author: Jonathan Abrahams (hptabster, jonathan@mongodb.com)

Message: SERVER-31562 Archival for test failures from suites not using a resmoke fixture

(cherry picked from commit 9fd34c78b7471a3cec40e7cdc221d10b1a100ad3)
Branch: v3.6
https://github.com/mongodb/mongo/commit/d0523a71a34efa1fdb7f10cd2888986c04740010

Comment by Githook User [ 24/May/18 ]

Author: Jonathan Abrahams (hptabster, jonathan@mongodb.com)

Message: SERVER-31562 Archival for test failures from suites not using a resmoke fixture

(cherry picked from commit 9fd34c78b7471a3cec40e7cdc221d10b1a100ad3)
Branch: v3.4
https://github.com/mongodb/mongo/commit/fd5fc15bbc8021899b12e83bca11e2ec9c2a1163

Comment by Githook User [ 14/Feb/18 ]

Author: Jonathan Abrahams (hptabster, jonathan@mongodb.com)

Message: SERVER-31562 Archival for test failures from suites not using a resmoke fixture
Branch: master
https://github.com/mongodb/mongo/commit/9fd34c78b7471a3cec40e7cdc221d10b1a100ad3

Comment by Judah Schvimer [ 13/Feb/18 ]

If resmoke would notice a primary crash without it, then I don't think we need it.

Comment by Max Hirschhorn [ 13/Feb/18 ]

Quoting Judah Schvimer: "I am interested in the RollbackFuzzer, any other fuzzer tests that use replication (like generational_fuzzer_replication), fsm suites, and adding CheckPrimary to all suites that use it (since that means there was a crash that we might want to investigate)."

judah.schvimer, is the CheckPrimary hook as useful to add to all test suites given the changes that were made in SERVER-31670? We aren't expecting failovers to occur in the vast majority of cases and resmoke.py would already detect if one of the secondaries crashed.

Comment by Jonathan Abrahams [ 12/Feb/18 ]

Quoting Judah Schvimer: "I am interested in the RollbackFuzzer, any other fuzzer tests that use replication (like generational_fuzzer_replication), fsm suites, and adding CheckPrimary to all suites that use it (since that means there was a crash that we might want to investigate)."

We are planning to handle FSM (concurrency) suite failures in SERVER-32852.

For tests that start and stop their own mongod cluster (not using a resmoke fixture), like rollback_fuzzer, the current mechanism archives a failed test on any failure within the test. Tests that use resmoke to launch the fixture, like jstestfuzz_replication*, can specify which test or hook to archive, e.g. the CheckPrimary hook.
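
The distinction above can be illustrated with a small decision function. This is only a sketch of the rule as described in this comment, not resmoke's actual code or configuration format; `should_archive`, `uses_resmoke_fixture`, and `archive_targets` are hypothetical names.

```python
from typing import Iterable, Optional


def should_archive(
    uses_resmoke_fixture: bool,
    failed_name: str,
    archive_targets: Optional[Iterable[str]] = None,
) -> bool:
    """Decide whether a failure should trigger archival (hypothetical rule).

    Call this only after a test or hook named failed_name has failed.
    """
    if not uses_resmoke_fixture:
        # The test manages its own cluster, so any failure is archived.
        return True
    # With a resmoke fixture, archive only the tests or hooks listed.
    return failed_name in set(archive_targets or ())
```

For example, `should_archive(True, "CheckPrimary", ["CheckPrimary"])` is true, while a failure in a test that is not listed would not be archived.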

Comment by Judah Schvimer [ 12/Feb/18 ]

I am interested in the RollbackFuzzer, any other fuzzer tests that use replication (like generational_fuzzer_replication), fsm suites, and adding CheckPrimary to all suites that use it (since that means there was a crash that we might want to investigate).

Comment by Max Hirschhorn [ 12/Feb/18 ]

The jstests/noPassthrough/wt_unclean_shutdown.js and jstests/noPassthrough/backup_restore.js tests were the original motivation for SERVER-26884. The latter has since been split into backup_restore_fsync_lock.js, backup_restore_rolling.js, and backup_restore_stop_start.js. We should solicit feedback from the Storage and Replication teams as to which other tests and/or test suites it would be appropriate to gather data files for upon failure.

The tests in the linked comment (https://jira.mongodb.org/browse/SERVER-33193?focusedCommentId=1800396&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-1800396) are some of the tests I know the Storage and Replication teams would benefit from archiving data files for.

Comment by Kevin Duong [ 12/Feb/18 ]

jonathan.abrahams to follow up with the Storage and Replication teams on this.

Comment by Max Hirschhorn [ 17/Oct/17 ]

SERVER-26884 is intended to be done as part of debuggability improvements during the MongoDB 3.8 release cycle. I can only convey this through "3.7 Desired" at the moment, but we'll see to it that your use-case for tests using ReplSetTest and ShardingTest is covered by that project.

Comment by Judah Schvimer [ 17/Oct/17 ]

I think that would be sufficient; however, if this were easier, it could be useful while SERVER-26884 is on the backlog.

Comment by Max Hirschhorn [ 17/Oct/17 ]

judah.schvimer, would it be sufficient to upload the data files of the mongod processes to S3 on test failure? I'm wondering if collecting these diagnostics would be better handled by SERVER-26884.
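
As a rough illustration of what archiving data files on failure might involve, the sketch below packs a node's data directory into a gzipped tarball and hands it to an uploader. It is an assumption-laden example, not the SERVER-26884 implementation: `archive_data_files` and the `upload` callable are hypothetical, and the S3 transfer itself is deliberately left to the caller.

```python
import tarfile
from pathlib import Path
from typing import Callable


def archive_data_files(
    dbpath: Path,
    archive_path: Path,
    upload: Callable[[Path], None],
) -> Path:
    """Pack dbpath into a .tgz at archive_path, then pass it to upload.

    upload is a hypothetical callable (for example, one that sends the
    file to S3). archive_path must live outside dbpath so the archive
    does not try to include itself.
    """
    with tarfile.open(archive_path, "w:gz") as tar:
        # Store entries under the directory's own name, not its full path.
        tar.add(dbpath, arcname=dbpath.name)
    upload(archive_path)
    return archive_path
```

The mongod process should be stopped before its data directory is archived, so the files on disk are no longer changing while they are copied.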

Generated at Thu Feb 08 04:27:27 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.