[SERVER-80901] Audit gFeatureFlagServerlessChangeStreams Created: 08/Sep/23  Updated: 15/Nov/23  Resolved: 10/Nov/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 7.0.5

Type: Task Priority: Major - P3
Reporter: Kyle Suarez Assignee: Romans Kasperovics
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-79274 FCV checks can be racy if FCV is unin... Closed
is related to SERVER-70202 Investigate calling serverGlobalParam... Closed
is related to SERVER-79317 Provide more documentation and helper... Closed
is related to SERVER-82246 Change isEnabled to invariant when FC... Closed
Backwards Compatibility: Fully Compatible
Sprint: QE 2023-10-02, QE 2023-10-16, QE 2023-10-30, QE 2023-11-13
Participants:

 Description   

This ticket has been split from an audit of all Query 7.0 feature flags. This ticket is a request to audit gFeatureFlagServerlessChangeStreams.

Intial sync can temporarily reset the fcv value to uninitialized and sets the new value afterwards. This can cause call sites trying to inspect the fcv value to hit this invariant. We need to audit feature flag usage and determine if the feature flag check can be run during initial sync:

If it can never be called when initial sync is running, do nothing. Note that this can be tricky to prove as we once thought the catalog cache loader can never be run while initial sync is happening but it can.

If it might get run during initial sync, this could be the case if the feature is run during initial sync itself, if the feature is in a background thread that runs during initial sync, or if the feature is run in a command that is allowed during initial sync, such as hello, serverStatus, etc. In this case, use one of these options:

  • Use isEnabledUseLastLTSFCVWhenUninitialized. It checks against the last LTS FCV version if the FCV version is unset, but note that this could result in the feature not being turned on even though the FCV will be set to latest once initial sync is complete.
  • Use isEnabledUseLatestFCVWhenUninitialized. This instead checks against the latest FCV version if the FCV version is unset, but note that this could result in the feature being turned on even though the FCV has not been upgraded yet and will be set to lastLTS once initial sync is complete.
  • Write your own special logic to avoid the invariant (for example, waiting for the FCV to become initialized before checking isEnabled, or uasserting instead of invariant-ing)

See this section of the README



 Comments   
Comment by Githook User [ 10/Nov/23 ]

Author:

{'name': 'Romans Kasperovics', 'email': 'romans.kasperovics@mongodb.com', 'username': 'romanskas'}

Message: SERVER-80901 Default FCV for featureFlagServerlessChangeStreams

In mongod v7.0 'FeatureFlag::isEnabled()' crashes on invariant if
'serverGlobalParams.featureCompatibility' is not initialized, which is the
case during the initial sync.

Since 'featureFlagServerlessChangeStreams' is not set by default in v7.0, we
can use any safe default FCV to exclude the chance of hitting the invariant.
Branch: v7.0
https://github.com/mongodb/mongo/commit/4cdac1e55c22b2aa66ad5ba3d47181f7c80d1e29

Comment by Romans Kasperovics [ 06/Nov/23 ]

Attaching a test script to this ticket (because I am not going to commit it):

(function() {
"use strict";
 
load("jstests/libs/collection_drop_recreate.js");  // For assertDropAndRecreateCollection.
load("jstests/libs/log.js");                       // For findMatchingLogLine.
 
// Create a replica set with 2 nodes and stable role assignment.
const rst = new ReplSetTest({nodes: [{}, {rsConfig: {priority: 0, votes: 0}}], serverless: true});
rst.startSet();
rst.initiate();
 
let primary = rst.getPrimary();
let secondary = rst.getSecondary();
let testDB = primary.getDB(jsTestName());
let coll = assertDropAndRecreateCollection(
    testDB, "test", {changeStreamPreAndPostImages: {enabled: true}});
 
assert.commandWorked(coll.insert({message: "hello"}, {writeConcern: {w: "majority"}}));
 
const restartOptions = {
    startClean: true,
    setParameter: {
        // Speed-up pre-image removal job runs.
        expiredChangeStreamPreImageRemovalJobSleepSecs: 1,
        // Activate the 'serverless' mode.
        featureFlagServerlessChangeStreams: true,
        multitenancySupport: true,
        internalChangeStreamUseTenantIdForTesting: true,
        // Make sure the log contains the necessary lines.
        logComponentVerbosity: tojsononeline({replication: {verbosity: 5}, query: {verbosity: 5}})
    }
};
 
const failPoint = "initialSyncFuzzerSynchronizationPoint1";
// Hang during the initial sync.
restartOptions.setParameter["failpoint." + failPoint] = tojsononeline({mode: "alwaysOn"});
 
// Restart the secondary and clean the data to trigger the initial sync.
secondary = rst.restart(secondary, restartOptions);
let testDBOnSecondary = secondary.getDB(jsTestName());
 
function getLogLineCount(db, pattern) {
    return [
        ...findMatchingLogLines(assert.commandWorked(db.adminCommand({getLog: "global"})).log,
                                pattern)
    ].length;
}
 
const failPointLogId = 21158;
const jobRunLogId = 6278517;
 
// Count removal job runs until now.
const removalJobRuns = getLogLineCount(testDBOnSecondary, {id: jobRunLogId});
 
// Assert the fail point was hit.
assert.soon(() => getLogLineCount(testDBOnSecondary, {id: failPointLogId}) === 1);
 
// Assert that the pre-image removal job runs at least two times.
assert.soon(() => getLogLineCount(testDBOnSecondary, {id: jobRunLogId}) >= removalJobRuns + 2);
 
// Resume the initial sync process to finish the test gracefully.
assert.commandWorked(testDBOnSecondary.adminCommand({configureFailPoint: failPoint, mode: "off"}));
 
rst.stopSet();
}());

Comment by Huayu Ouyang [ 06/Nov/23 ]

romans.kasperovics@mongodb.com Yeah, I also see that gFeatureFlagServerlessChangeStreams isn't enabled by default on 7.0 anyways so it seems safe to replace isEnabled() with isEnabledUseLastLTSFCVWhenUninitialized on 7.0. And yes, this ticket is only for 7.0.

Comment by Romans Kasperovics [ 06/Nov/23 ]

Thanks samy.lanka@mongodb.com! We (my team) think that replacing isEnabled() with isEnabledUseLastLTSFCVWhenUninitialized() is a valid possibility in 7.0, because it is unlikely that serverless change streams ever enabled for productive use in this version. I am not sure though that this is a safe choice apply for the same code in the subsequent versions. If this ticket is only about v7.0, I can replace isEnabled() with isEnabledUseLastLTSFCVWhenUninitialized(). Should I do that?

Comment by Samyukta Lanka [ 06/Nov/23 ]

huayu.ouyang@mongodb.com I just wanted to point out that the audit of if the feature flag check is run during initial sync still needs to be completed and that checking isInitialized first isn't enough to avoid needing special logic for feature flag checks that can run during initial sync.

romans.kasperovics@mongodb.com We are going to solve the race in a different way as part of SERVER-79274, but I think the ask for this ticket is to see if any feature flag checks can be run during initial sync, and if they can, to use these helper functions to deal with that possibility: isEnabledUseLastLTSFCVWhenUninitialized or isEnabledUseLatestFCVWhenUninitialized

Comment by Romans Kasperovics [ 06/Nov/23 ]

samy.lanka@mongodb.com, this is a valid point. Do you think I should add a global lock acquisition to this code?
https://github.com/10gen/mongo/blob/728a7335d455a881b3a3b0a208269fad3518489a/src/mongo/db/change_stream_serverless_helpers.cpp#L48

Comment by Huayu Ouyang [ 06/Nov/23 ]

samy.lanka@mongodb.com I think that race is less specific to initial sync/this feature flag and we can fix it in SERVER-79274 when we backport to 7.0?

Comment by Samyukta Lanka [ 06/Nov/23 ]

romans.kasperovics@mongodb.com Isn't there still the possibility of a race condition if the FCV is reset after the call to serverGlobalParams.featureCompatibility.isVersionInitialized() but before the call to gFeatureFlagServerlessChangeStreams.isEnabled()?

Comment by Huayu Ouyang [ 02/Nov/23 ]

romans.kasperovics@mongodb.com Yes, that's true, and also to clarify, this ticket (SERVER-80901) is about auditing the feature flag in 7.0, where the change from Jiawei doesn't exist, so it is still necessary to audit it.

Generated at Thu Feb 08 06:44:53 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.