[SERVER-25248] minvalid should be set while appropriate locks are held Created: 25/Jul/16  Updated: 07/Dec/16  Resolved: 08/Nov/16

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.2.0, 3.3.0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Kevin Pulo Assignee: Mathias Stearn
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-26034 fsync+lock should never see intermedi... Closed
Related
related to SERVER-24933 Clean shutdown of secondaries should ... Closed
related to SERVER-25071 Ensure replication batch finishes bef... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:
  • Start a replica set environment, e.g.:

    mlaunch init --replicaset --nodes 2 --arbiter --port 31976 --oplogSize 1000 --wiredTigerCacheSizeGB 2
    

  • Wait for primary
  • Load up the primary:

    mongo --port 31976 --quiet --eval '
        while(1) {
            db.test.update( { _id: Math.floor(Math.random()*10000) },
                            { $inc: { a: 1 } },
                            { upsert: true, multi: false } );
        }'
    

  • On the secondary, check (repeatedly) if there is a begin field in minvalid while fsyncLocked:

    mongo --port 31977 --eval '
        while(1) {
            assert.commandWorked(db.fsyncLock());
            var minvalid = db.getSiblingDB("local").replset.minvalid.find().next();
            printjson(minvalid);
            try {
                assert( ! minvalid.begin);
            } finally {
                assert.commandWorked(db.fsyncUnlock());
            }
        }'
    

    Always fails for me within a few attempts/seconds.

Sprint: Repl 2016-11-21
Participants:

 Description   

The main impact of not setting minvalid while these locks are held is that fsyncLocked secondaries can appear to be midway through a batch (i.e. in an inconsistent state).

multiApply() takes the fsyncLock mutex while the batch is applied. This means that if an fsyncLock command comes in while a batch is being applied, it will wait for the batch to finish. Similarly, if the secondary is fsyncLocked, a new batch will not start being applied.

However, minvalid is currently updated outside multiApply(), and therefore outside this lock. As a result, there is a race where minvalid can be updated even though the secondary has been fsyncLocked, leaving the secondary marked as inconsistent (minvalid has a begin field) even though, in this particular case, it isn't.
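
The pattern can be reproduced outside MongoDB with a minimal toy model (a sketch only; fsyncMutex and minvalidHasBegin below are illustrative stand-ins for the filesLockedFsync mutex and the begin field of minvalid, not real server code). The "applier" thread mirrors the current ordering: it marks the begin state before taking the mutex and clears it after releasing it, so the "fsyncLock" thread can observe the marker while holding the mutex even though no batch is mid-apply:

    // Toy model of the race; compile with e.g. g++ -std=c++11 -pthread race.cpp
    #include <atomic>
    #include <iostream>
    #include <mutex>
    #include <thread>

    std::mutex fsyncMutex;                      // stand-in for filesLockedFsync
    std::atomic<bool> minvalidHasBegin{false};  // stand-in for minvalid's begin field

    void applierLoop() {
        for (int batch = 0; batch < 1000000; ++batch) {
            minvalidHasBegin = true;            // set BEFORE the mutex is taken (current behaviour)
            {
                std::lock_guard<std::mutex> lk(fsyncMutex);  // multiApply() holds the lock here
                // ... apply the batch ...
            }
            minvalidHasBegin = false;           // cleared AFTER the mutex is released
        }
    }

    void fsyncLockLoop() {
        for (int i = 0; i < 1000000; ++i) {
            std::lock_guard<std::mutex> lk(fsyncMutex);      // "fsyncLocked": no batch is being applied
            if (minvalidHasBegin) {
                std::cout << "saw begin while locked: snapshot would look inconsistent\n";
                return;
            }
        }
    }

    int main() {
        std::thread applier(applierLoop);
        std::thread fsync(fsyncLockLoop);
        applier.join();
        fsync.join();
        return 0;
    }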

This is a problem because the user expectation is that fsyncLocking a secondary leaves it in a consistent state (replication writes have been stopped and reads are possible, so the secondary must be between batches).

If the user takes an atomic filesystem snapshot of the dbpath of an fsyncLocked secondary, and then tries to start it up in a context that doesn't have access to the original replica set (e.g. to seed a QA environment), then the node will go into RECOVERING and not be able to come out. Similarly, given a dbpath snapshot that is marked as inconsistent, it is not possible to tell whether the snapshot was actually taken within a batch, or is merely suffering from the problem described here.

The solution is that minvalid should only have the begin field added if the batch is actually about to be applied (as opposed to possibly held up by fsyncLock). Similarly, minvalid should have the begin field removed as soon as the batch has finished being applied (and not potentially delayed by fsyncLock). Together, these mean that minvalid should only be updated inside multiApply() while the fsyncLock mutex (and PBWM) is held.
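
Continuing the toy model above (again just a sketch of the proposed ordering, not an actual patch; in the server the change would have to live inside multiApply() in sync_tail.cpp, after fsynclk and the PBWM are acquired), the applier side would become:

    void fixedApplierLoop() {
        for (int batch = 0; batch < 1000000; ++batch) {
            // Stand-in for multiApply() holding filesLockedFsync and the PBWM.
            std::lock_guard<std::mutex> lk(fsyncMutex);
            minvalidHasBegin = true;    // begin field added only once fsyncLock is excluded
            // ... apply the batch ...
            minvalidHasBegin = false;   // begin field removed before the lock is released
        }
    }

With this ordering the fsyncLock thread in the sketch can never observe the begin marker, matching the expectation that an fsyncLocked secondary is always between batches.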

Workarounds:

  • Take filesystem snapshots of the primary instead. This is not always feasible, since it may add extra load on the primary (e.g. LVM snapshots need to be mounted and copied elsewhere).
  • For 3.2.9 and later in the 3.2.x series, shut down the secondary before snapshotting it (see SERVER-24933).
  • After SERVER-25071 is fixed, shutting down the secondary will also be viable for master/3.3+.

db/repl/sync_tail.cpp

        // Set minValid to the last OpTime that needs to be applied, in this batch or from the
        // (last) failed batch, whichever is larger.
        // This will cause this node to go into RECOVERING state
        // if we should crash and restart before updating finishing.
        const auto& start = lastWriteOpTime;

        // Take the max of the first endOptime (if we recovered) and the end of our batch.

        // Setting end to the max of originalEndOpTime and lastOpTime (the end of the batch)
        // ensures that we keep pushing out the point where we can become consistent
        // and allow reads. If we recover and end up doing smaller batches we must pass the
        // originalEndOpTime before we are good.
        //
        // For example:
        // batch apply, 20-40, end = 40
        // batch failure,
        // restart
        // batch apply, 20-25, end = max(25, 40) = 40
        // batch apply, 25-45, end = 45
        const OpTime end(std::max(originalEndOpTime, lastOpTime));

        // This write will not journal/checkpoint.
        StorageInterface::get(&txn)->setMinValid(&txn, {start, end});       <<<<-------------------,
                                                                                                   |
        const size_t opsInBatch = ops.getCount();                                                  |  Race "on the way in"
        lastWriteOpTime = multiApply(&txn, ops.releaseBatch());             <<<<-------------------+
        if (lastWriteOpTime.isNull()) {                                                            |
            // fassert if oplog application failed for any reasons other than shutdown.            |
            error() << "Failed to apply " << opsInBatch << " operations - batch start:" << start   |
                    << " end:" << end;                                                             |
            fassert(34360, inShutdownStrict());                                                    |
            // Return without setting minvalid in the case of shutdown.                            |
            return;                                                                                |  Race "on the way out"
        }                                                                                          |
                                                                                                   |
        setNewTimestamp(lastWriteOpTime.getTimestamp());                                           |
        StorageInterface::get(&txn)->setMinValid(&txn, end, DurableRequirement::None);   <<<<------'
        minValidBoundaries.start = {};
        minValidBoundaries.end = end;
        finalizer->record(lastWriteOpTime);

db/repl/sync_tail.cpp

StatusWith<OpTime> multiApply(OperationContext* txn,
                              OldThreadPool* workerPool,
                              MultiApplier::Operations ops,
                              MultiApplier::ApplyOperationFn applyOperation) {
    if (!txn) {
        return {ErrorCodes::BadValue, "invalid operation context"};
    }

    if (!workerPool) {
        return {ErrorCodes::BadValue, "invalid worker pool"};
    }

    if (ops.empty()) {
        return {ErrorCodes::EmptyArrayOperation, "no operations provided to multiApply"};
    }

    if (!applyOperation) {
        return {ErrorCodes::BadValue, "invalid apply operation function"};
    }

    if (getGlobalServiceContext()->getGlobalStorageEngine()->isMmapV1()) {
        // Use a ThreadPool to prefetch all the operations in a batch.
        prefetchOps(ops, workerPool);
    }

    LOG(2) << "replication batch size is " << ops.size();
    // We must grab this because we're going to grab write locks later.
    // We hold this mutex the entire time we're writing; it doesn't matter
    // because all readers are blocked anyway.
    stdx::lock_guard<SimpleMutex> fsynclk(filesLockedFsync);

    // Stop all readers until we're done. This also prevents doc-locking engines from deleting old
    // entries from the oplog until we finish writing.
    Lock::ParallelBatchWriterMode pbwm(txn->lockState());

    auto replCoord = ReplicationCoordinator::get(txn);
    if (replCoord->getMemberState().primary() && !replCoord->isWaitingForApplierToDrain()) {
        severe() << "attempting to replicate ops while primary";
        return {ErrorCodes::CannotApplyOplogWhilePrimary,
                "attempting to replicate ops while primary"};
    }

    ...

