Type: Bug
Resolution: Duplicate
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 3.2.0, 3.3.0
Component/s: Replication
Labels:
None

Backwards Compatibility:
Fully Compatible
Operating System:
ALL
Steps To Reproduce:

Start a replica set environment, e.g.:

mlaunch init --replicaset --nodes 2 --arbiter --port 31976 --oplogSize 1000 --wiredTigerCacheSizeGB 2

Wait for primary

Load up the primary:

mongo --port 31976 --quiet --eval ' while(1) { db.test.update( { _id: Math.floor(Math.random()*10000) }, { $inc: { a: 1 } }, { upsert: true, multi: false } ); }'

On the secondary, check (repeatedly) if there is a begin field in minvalid while fsyncLocked:

mongo --port 31977 --eval ' while(1) { assert.commandWorked(db.fsyncLock()); var minvalid = db.getSiblingDB("local").replset.minvalid.find().next(); printjson(minvalid); try { assert( ! minvalid.begin); } finally { assert.commandWorked(db.fsyncUnlock()); } }'

Always fails for me within a few attempts/seconds.
Sprint:
Repl 2016-11-21
Confidence Status:
None
Work Order:
3
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

The main impact of not doing this is that an fsyncLocked secondary can appear to be midway through a batch (i.e. in an inconsistent state).

multiApply() takes the fsyncLock mutex while the batch is applied. This means that if an fsyncLock command comes in while a batch is being applied, it will wait for the batch to finish. Similarly, if the secondary is fsyncLocked, a new batch will not start being applied.

However, minvalid is currently updated outside multiApply(), and therefore outside this lock. This creates a race in which minvalid can be updated even though the secondary has been fsyncLocked. The secondary is then marked as inconsistent (minvalid has a begin field), even though (in this particular case) it isn't.
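
The race can be modeled in a few lines. This is a minimal single-threaded sketch, not the server's code: `MinValidDoc`, `NodeModel`, and `applyBatchCurrentOrdering` are hypothetical names, integers stand in for OpTimes, and a boolean flag stands in for the fsyncLock mutex (an applier that would block is modeled as an early return).

```cpp
#include <cassert>

// Hypothetical stand-in for the local.replset.minvalid document.
struct MinValidDoc {
    bool hasBegin = false;  // true => node is marked inconsistent
    int begin = 0;
    int end = 0;
};

// Hypothetical model of a secondary; ints stand in for OpTimes.
struct NodeModel {
    bool fsyncLocked = false;  // models the fsyncLock mutex being held
    MinValidDoc minValid;
    int lastApplied = 0;
};

// Models the ordering described above: minvalid is written before the
// fsyncLock check. Returns false when the batch is held up by fsyncLock
// (the real applier would block instead of returning).
inline bool applyBatchCurrentOrdering(NodeModel& node, int batchEnd) {
    assert(batchEnd > node.lastApplied);
    node.minValid.hasBegin = true;  // written outside the lock
    node.minValid.begin = node.lastApplied;
    node.minValid.end = batchEnd;
    if (node.fsyncLocked) {
        return false;  // batch not applied, yet "begin" is already recorded
    }
    node.lastApplied = batchEnd;     // apply the batch
    node.minValid.hasBegin = false;  // mark consistent again
    return true;
}
```

In this model, a batch that arrives while `fsyncLocked` is set leaves `hasBegin` true although nothing from that batch was applied, which is the false "inconsistent" marking the ticket describes.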

This is a problem because users expect that fsyncLocking a secondary will leave it in a consistent state (since repl writes have been stopped, and reads are possible, meaning that the secondary must be between batches).

If the user takes an atomic filesystem snapshot of the dbpath of an fsyncLocked secondary, and then tries to start it up in a context that doesn't have access to the original replica set (e.g. to seed a QA environment), then the node will go into RECOVERING and not be able to come out. Similarly, if a user has a dbpath snapshot that is marked as inconsistent, it is not possible to tell whether the snapshot was actually taken within a batch or is suffering from the problem described here.

The solution is that minvalid should only have the begin field added if the batch is actually about to be applied (as opposed to possibly held up by fsyncLock). Similarly, minvalid should have the begin field removed as soon as the batch has finished being applied (and not potentially delayed by fsyncLock). Together, these mean that minvalid should only be updated inside multiApply() while the fsyncLock mutex (and PBWM) is held.
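
A minimal sketch of the proposed ordering, under the same modeling assumptions as the description: integers stand in for OpTimes, a boolean flag stands in for the fsyncLock mutex, and `MinValidState`, `SecondaryModel`, and `applyBatchProposedOrdering` are hypothetical names rather than server code.

```cpp
#include <cassert>

// Hypothetical stand-in for the local.replset.minvalid document.
struct MinValidState {
    bool hasBegin = false;  // true => node is marked inconsistent
    int begin = 0;
    int end = 0;
};

// Hypothetical model of a secondary; ints stand in for OpTimes.
struct SecondaryModel {
    bool fsyncLocked = false;  // models the fsyncLock mutex being held
    MinValidState minValid;
    int lastApplied = 0;
};

// Models the proposed ordering: minvalid is only touched once the batch is
// known not to be held up by fsyncLock. Returns false when the batch is held
// up (the real applier would block instead of returning).
inline bool applyBatchProposedOrdering(SecondaryModel& node, int batchEnd) {
    assert(batchEnd > node.lastApplied);
    if (node.fsyncLocked) {
        return false;  // held up before minvalid is touched
    }
    // Everything below models work done while the mutex (and PBWM) is held.
    node.minValid.hasBegin = true;
    node.minValid.begin = node.lastApplied;
    node.minValid.end = batchEnd;
    node.lastApplied = batchEnd;     // apply the batch
    node.minValid.hasBegin = false;  // cleared before the lock is released
    return true;
}
```

With this ordering, anything that observes the model while `fsyncLocked` is set sees no begin field, because both minvalid writes happen inside the region the lock protects.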

Workarounds:

Take filesystem snapshots of the primary instead. This is not always possible, since it may add extra load on the primary (e.g. LVM snapshots need to be mounted and copied elsewhere).
For 3.2.x where x >= 9, shut down the secondary before snapshotting it (see SERVER-24933).
After SERVER-25071 is fixed, shutting down the secondary will also be viable for master/3.3+.

db/repl/sync_tail.cpp

        // Set minValid to the last OpTime that needs to be applied, in this batch or from the
        // (last) failed batch, whichever is larger.
        // This will cause this node to go into RECOVERING state
        // if we should crash and restart before updating finishing.
        const auto& start = lastWriteOpTime;


        // Take the max of the first endOptime (if we recovered) and the end of our batch.

        // Setting end to the max of originalEndOpTime and lastOpTime (the end of the batch)
        // ensures that we keep pushing out the point where we can become consistent
        // and allow reads. If we recover and end up doing smaller batches we must pass the
        // originalEndOpTime before we are good.
        //
        // For example:
        // batch apply, 20-40, end = 40
        // batch failure,
        // restart
        // batch apply, 20-25, end = max(25, 40) = 40
        // batch apply, 25-45, end = 45
        const OpTime end(std::max(originalEndOpTime, lastOpTime));

        // This write will not journal/checkpoint.
        StorageInterface::get(&txn)->setMinValid(&txn, {start, end});       <<<<-------------------,
                                                                                                   |
        const size_t opsInBatch = ops.getCount();                                                  |  Race "on the way in"
        lastWriteOpTime = multiApply(&txn, ops.releaseBatch());             <<<<-------------------+
        if (lastWriteOpTime.isNull()) {                                                            |
            // fassert if oplog application failed for any reasons other than shutdown.            |
            error() << "Failed to apply " << opsInBatch << " operations - batch start:" << start   |
                    << " end:" << end;                                                             |
            fassert(34360, inShutdownStrict());                                                    |
            // Return without setting minvalid in the case of shutdown.                            |
            return;                                                                                |  Race "on the way out"
        }                                                                                          |
                                                                                                   |
        setNewTimestamp(lastWriteOpTime.getTimestamp());                                           |
        StorageInterface::get(&txn)->setMinValid(&txn, end, DurableRequirement::None);   <<<<------'
        minValidBoundaries.start = {};
        minValidBoundaries.end = end;
        finalizer->record(lastWriteOpTime);
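
The worked example in the comment above (end = max(originalEndOpTime, lastOpTime)) can be replayed with plain integers. This is only an illustrative sketch: `minValidEnd` and `replayCommentExample` are hypothetical helpers, ints stand in for OpTimes, and 0 is assumed to mean "no end recovered from minvalid".

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper mirroring: end = max(originalEndOpTime, lastOpTime).
inline int minValidEnd(int originalEndOpTime, int lastOpTime) {
    return std::max(originalEndOpTime, lastOpTime);
}

// Replays the sequence from the source comment and returns the final end.
inline int replayCommentExample() {
    int end = minValidEnd(0, 40);        // batch apply 20-40, nothing recovered
    assert(end == 40);
    const int originalEnd = end;         // batch failure, restart: 40 is recovered
    end = minValidEnd(originalEnd, 25);  // batch apply 20-25, end stays pinned
    assert(end == 40);
    end = minValidEnd(originalEnd, 45);  // batch apply 25-45 passes the old end
    return end;
}
```

The smaller post-restart batch cannot pull the consistency point back to 25; it only moves forward once a batch passes the recovered end.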

db/repl/sync_tail.cpp

StatusWith<OpTime> multiApply(OperationContext* txn,
                              OldThreadPool* workerPool,
                              MultiApplier::Operations ops,
                              MultiApplier::ApplyOperationFn applyOperation) {
    if (!txn) {
        return {ErrorCodes::BadValue, "invalid operation context"};
    }

    if (!workerPool) {
        return {ErrorCodes::BadValue, "invalid worker pool"};
    }

    if (ops.empty()) {
        return {ErrorCodes::EmptyArrayOperation, "no operations provided to multiApply"};
    }

    if (!applyOperation) {
        return {ErrorCodes::BadValue, "invalid apply operation function"};
    }

    if (getGlobalServiceContext()->getGlobalStorageEngine()->isMmapV1()) {
        // Use a ThreadPool to prefetch all the operations in a batch.
        prefetchOps(ops, workerPool);
    }

    LOG(2) << "replication batch size is " << ops.size();
    // We must grab this because we're going to grab write locks later.
    // We hold this mutex the entire time we're writing; it doesn't matter
    // because all readers are blocked anyway.
    stdx::lock_guard<SimpleMutex> fsynclk(filesLockedFsync);

    // Stop all readers until we're done. This also prevents doc-locking engines from deleting old
    // entries from the oplog until we finish writing.
    Lock::ParallelBatchWriterMode pbwm(txn->lockState());

    auto replCoord = ReplicationCoordinator::get(txn);
    if (replCoord->getMemberState().primary() && !replCoord->isWaitingForApplierToDrain()) {
        severe() << "attempting to replicate ops while primary";
        return {ErrorCodes::CannotApplyOplogWhilePrimary,
                "attempting to replicate ops while primary"};
    }

    ...

duplicates

SERVER-26034 fsync+lock should never see intermediate states of secondary batch application

Closed

related to

SERVER-24933 Clean shutdown of secondaries should occur in between oplog batches, not during

Closed

SERVER-25071 Ensure replication batch finishes before shutdown

Closed

Assignee: Mathias Stearn
Reporter: Kevin Pulo
Participants: Kevin Pulo, Mathias Stearn
Votes: 0
Watchers: 8

Created: Jul 25 2016 07:42:53 AM UTC
Updated: Dec 07 2016 10:20:42 PM UTC
Resolved: Nov 08 2016 10:00:01 PM UTC
