[SERVER-27227] Disable fail points on test failures before shutdown Created: 30/Nov/16  Updated: 08/Dec/16  Resolved: 08/Dec/16

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Judah Schvimer Assignee: Judah Schvimer
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-20773 core/fsync.js can leave test server f... Closed
is related to SERVER-27280 repl failpoints that block until they... Closed
Sprint: Repl 2016-12-12, Repl 2017-01-23
Participants:
Linked BF Score: 0

 Description   

Fail points do not get turned off automatically when a test fails. Some fail points like rsSyncApplyStop can lead to fasserts, if they remain on during shutdown. This leads to extra confusion when diagnosing failures. We should automatically turn off all fail points before shutdown.



 Comments   
Comment by Judah Schvimer [ 08/Dec/16 ]

I agree that SERVER-27280 is probably sufficient. I can't think of any other failpoints that would cause shutdown hangs, and making them shutdown aware is probably a better solution.

Comment by Max Hirschhorn [ 08/Dec/16 ]

If we expect tests to disable their own failpoints, then any test using a failpoint that could lead to a shutdown hang will have to surround the entire test with a try-finally statement.

It seems like SERVER-27280 was proposing that failpoints should automatically check if the server is shutting down. I'm not sure how that proposal will turn out.

I think a better design, if possible, would be to automatically wrap tests in try-finally blocks and disable any problematic failpoints on failure.

It does tend to be very complicated to write a robust try/catch/finally block. Inlined below is an example of how we handle {fsync: 1, lock} and {fsyncUnlock: 1} in ReplSetTest. I'd be interested in only needing write such a structure once and having other tests able to take advantage of it. A Promise-like construct would potentially make the try/catch/finally easier to express in the test, but I haven't thought deeply enough about it.

var activeException = false;
 
try {
    // Lock the primary to prevent the TTL monitor from deleting expired documents in
    // the background while we are getting the dbhashes of the replica set members.
    assert.commandWorked(primary.adminCommand({fsync: 1, lock: 1}),
                            'failed to lock the primary');
    this.awaitReplication(60 * 1000 * 5);
    checkerFunction.apply(this, checkerFunctionArgs);
} catch (e) {
    activeException = true;
    throw e;
} finally {
    // Allow writes on the primary.
    var res = primary.adminCommand({fsyncUnlock: 1});
 
    if (!res.ok) {
        var msg = 'failed to unlock the primary, which may cause this' +
            ' test to hang: ' + tojson(res);
        if (activeException) {
            print(msg);
        } else {
            throw new Error(msg);
        }
    }
}

Comment by Judah Schvimer [ 08/Dec/16 ]

If we expect tests to disable their own failpoints, then any test using a failpoint that could lead to a shutdown hang will have to surround the entire test with a try-finally statement. Is that a design pattern we want in our tests? I think a better design, if possible, would be to automatically wrap tests in try-finally blocks and disable any problematic failpoints on failure.

Comment by Max Hirschhorn [ 08/Dec/16 ]

judah.schvimer, why shouldn't it be the test's responsibility to disable the failpoint if it enabled the failpoint? If you wanted the failpoint to be disabled before the mongo shell terminated the spawned processes, then the portion of the test after the failpoint was enabled should be wrapped in a try/finally block.

The mongo shell currently doesn't expose a way to inject additional logic into its "atexit" handler, and I could see situations where attempting to communicate with a MongoDB process when the test has failed could lead to the mongo shell hanging rather than exiting cleanly with a non-zero return code (as it would have done before attempting to disable failpoints automatically). If we decide to pursue this idea, I'd like to ensure that it also handles the fsync+lock scenario described in SERVER-20773.

Generated at Thu Feb 08 04:14:32 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.