[SERVER-49416] rollback_fuzzer_clean_shutdowns suite may perform unclean shutdowns Created: 10/Jul/20  Updated: 29/Oct/23  Resolved: 24/Jul/20

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.7.0

Type: Bug Priority: Major - P3
Reporter: William Schultz (Inactive) Assignee: William Schultz (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-49862 Re-enable fast count validation in te... Closed
related to SERVER-48370 Make creating/extending backup cursor... Backlog
related to SERVER-44650 New test mode for foreground validate... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

// To produce an unclean shutdown when one shouldn't be allowed, run this under resmoke with: --suites=rollback_fuzzer_clean_shutdowns jstests/rollback_restart.js
(function() {
    "use strict";
 
    load("jstests/replsets/libs/rollback_test.js");
 
    TestData.rollbackShutdowns = true;
    let dbName = "test";
    let sourceCollName = "coll";
 
    let CommonOps = (node) => {
        // Insert a document that will exist on all nodes.
        assert.commandWorked(node.getDB(dbName)[sourceCollName].insert({}));
    };
 
    let RollbackOps = (node) => {
        // Delete the document on the rollback node so it will be refetched from sync source.
        assert.commandWorked(node.getDB(dbName)[sourceCollName].insert({}));
    };
 
    let rollbackTest = new RollbackTest("restart");
    CommonOps(rollbackTest.getPrimary());
    let rollbackNode = rollbackTest.transitionToRollbackOperations();
    RollbackOps(rollbackNode);
 
   // Neither of these should be allowed to shut down the node uncleanly.
    rollbackTest.restartNode(1, 9 /* SIGKILL */);
    rollbackTest.restartNode(1, 9 /* SIGKILL */);
 
    rollbackTest.transitionToSyncSourceOperationsBeforeRollback();
    rollbackTest.transitionToSyncSourceOperationsDuringRollback();
    rollbackTest.transitionToSteadyStateOperations();
    rollbackTest.stop();
}());

Sprint: Repl 2020-08-10
Participants:

 Description   

The rollback_fuzzer_clean_shutdowns suite is supposed to only shut down nodes cleanly i.e. with a SIGTERM signal. This is ensured in the rollback test fixture by converting any signal to a SIGTERM if the TestData.allowUncleanShutdowns parameter is not set, which is the default. However, in stopMongoProgram we set the allowUncleanShutdown setting to true if the given signal is not equal to MongoRunner.EXIT_CLEAN. So, after we do an initial clean shutdown in a generated fuzzer test, subsequent shutdowns are allowed to be unclean i.e. use SIGKILL.



 Comments   
Comment by Githook User [ 24/Jul/20 ]

Author:

{'name': 'William Schultz', 'email': 'william.schultz@mongodb.com', 'username': 'will62794'}

Message: SERVER-49416 Don't override the TestData.allowUncleanShutdowns parameter when doing an unclean shutdown in the shell
Branch: master
https://github.com/mongodb/mongo/commit/8e2e2c70d741ec2fded145ccf455d6924b7fe289

Comment by William Schultz (Inactive) [ 24/Jul/20 ]

In a patch build that includes the fix for this bug, a number of validation errors appeared in tests due to fast count mismatches. I expect these were masked by the existing code since we skipped fast count validation whenever a test did any shutdown, regardless of whether it was clean or unclean. There are a number of failures in noPassthrough related to backup restore, and a couple in replica_sets related to prepared transactions.

Comment by William Schultz (Inactive) [ 10/Jul/20 ]

I believe the confusion is MongoRunner.EXIT_CLEAN is an exit code and SIGTERM is a signal for shutting down. A mongod which receives a SIGTERM and cleanly shuts down should exit with code 0 (= MongoRunner.EXIT_CLEAN).

Agreed. I think we just need to change MongoRunner.EXIT_CLEAN to SIGTERM in that condition. I expect that's what the original ticket was trying to achieve, but mixed up the signal and exit codes.

Comment by Max Hirschhorn [ 10/Jul/20 ]

I believe the confusion is MongoRunner.EXIT_CLEAN is an exit code and SIGTERM is a signal for shutting down. A mongod which receives a SIGTERM and cleanly shuts down should exit with code 0 (= MongoRunner.EXIT_CLEAN).

Comment by Gregory Noma [ 10/Jul/20 ]

william.schultz The itent of that condition was so that we don't enforce fast counts during validation if there was an unclean shutdown during the test (here and here), since in that case we expect fast counts to be inaccurate. Perhaps in that case, we should set TestData.skipEnforceFastCountOnValidate instead of TestData.allowUncleanShutdowns? When do we expect SIGTERM (15) versus EXIT_CLEAN (0)?

Comment by William Schultz (Inactive) [ 10/Jul/20 ]

Note that this isn't a critical problem as far as test coverage goes, since we are still getting coverage of both clean and unclean shutdowns. It just makes debugging more difficult if your assumption is that only clean shutdowns are occurring in this suite.

Comment by William Schultz (Inactive) [ 10/Jul/20 ]

We can see this occurring in the logs of any recent rollback_fuzzer_clean_shutdowns Evergreen run as well. A regex search for Stopping node.*with signal 9 turns up the issue.

Comment by William Schultz (Inactive) [ 10/Jul/20 ]

The problematic behavior in stopMongoProgram was introduced in this commit. It seems that this condition may be checking for the wrong signal code, since MongoRunner.EXIT_CLEAN is 0. Perhaps the correct condition may be to check if signal !== 15 i.e. check for SIGTERM shutdown signal. I'm not sure exactly what the original intention of that logic was, though. gregory.noma

Generated at Thu Feb 08 05:19:47 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.