Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-25248

minvalid should be set while appropriate locks are held

    • Type: Icon: Bug Bug
    • Resolution: Duplicate
    • Priority: Icon: Major - P3 Major - P3
    • None
    • Affects Version/s: 3.2.0, 3.3.0
    • Component/s: Replication
    • Labels:
      None
    • Fully Compatible
    • ALL
    • Hide
      • Start replset environment, eg:
        mlaunch init --replicaset --nodes 2 --arbiter --port 31976 --oplogSize 1000 --wiredTigerCacheSizeGB 2
        
      • Wait for primary
      • Load up the primary:
        mongo --port 31976 --quiet --eval '
            while(1) {
                db.test.update( { _id: Math.floor(Math.random()*10000) },
                                { $inc: { a: 1 } },
                                { upsert: true, multi: false } );
            }'
        
      • On the secondary, check (repeatedly) if there is a begin field in minvalid while fsyncLocked:
        mongo --port 31977 --eval '
            while(1) {
                assert.commandWorked(db.fsyncLock());
                var minvalid = db.getSiblingDB("local").replset.minvalid.find().next();
                printjson(minvalid);
                try {
                    assert( ! minvalid.begin);
                } finally {
                    assert.commandWorked(db.fsyncUnlock());
                }
            }'
        

        Always fails for me within a few attempts/seconds.

      Show
      Start replset environment, eg: mlaunch init --replicaset --nodes 2 --arbiter --port 31976 --oplogSize 1000 --wiredTigerCacheSizeGB 2 Wait for primary Load up the primary: mongo --port 31976 --quiet --eval ' while (1) { db.test.update( { _id: Math .floor( Math .random()*10000) }, { $inc: { a: 1 } }, { upsert: true , multi: false } ); }' On the secondary, check (repeatedly) if there is a begin field in minvalid while fsyncLocked: mongo --port 31977 --eval ' while (1) { assert .commandWorked(db.fsyncLock()); var minvalid = db.getSiblingDB( "local" ).replset.minvalid.find().next(); printjson(minvalid); try { assert ( ! minvalid.begin); } finally { assert .commandWorked(db.fsyncUnlock()); } }' Always fails for me within a few attempts/seconds.
    • Repl 2016-11-21

      The main impact of not doing this is that fsyncLocked secondaries can appear to be midway through a batch (ie. in an inconsistent state).

      multiApply() takes the fsyncLock mutex while the batch is applied. This means that if an fsyncLock command comes in while a batch is being applied, it will wait for the batch to finish. Similarly, if the secondary is fsyncLocked, a new batch will not start being applied.

      However, minvalid is currently updated outside multiApply, and therefore outside this lock. This means that there is a race where minvalid can be updated even though the secondary has been fsyncLocked. This means that the secondary has been marked as inconsistent (minvalid has a begin field), even though (in this particular case) it isn't.

      This is a problem because user expectation is that fsyncLocking a secondary will leave it in a consistent state (since repl writes have been stopped, and reads are possible, meaning that the secondary must be between batches).

      If the user takes an atomic filesystem snapshot of the dbpath of an fsyncLocked secondary, and then tries to start it up in a context that doesn't have access to the original replica set (eg. to seed a QA environment), then the node will go into RECOVERING and not be able to come out. Similarly, if a user has a dbpath snapshot that is marked as inconsistent, it is not possible to be able to tell if the snapshot is actually from within a batch, or suffering from the problem described here.

      The solution is that minvalid should only have the begin field added if the batch is actually about to be applied (as opposed to possibly held up by fsyncLock). Similarly, minvalid should have the begin field removed as soon as the batch has finished being applied (and not potentially delayed by fsyncLock). Together, these mean that minvalid should only be updated inside multiApply() while the fsyncLock mutex (and PBWM) is held.

      Workarounds:

      • Take filesystem snapshots of the primary instead. This is not always possible, since it may add extra load on the primary (eg. LVM snapshots need to be mounted and copied elsewhere).
      • For 3.2.x where x is >= 9, shutdown the secondary before snapshotting it (see SERVER-24933).
      • After SERVER-25071 is fixed, shutting down the secondary will also be viable for master/3.3+.

      db/repl/sync_tail.cpp
              // Set minValid to the last OpTime that needs to be applied, in this batch or from the
              // (last) failed batch, whichever is larger.
              // This will cause this node to go into RECOVERING state
              // if we should crash and restart before updating finishing.
              const auto& start = lastWriteOpTime;
      
      
              // Take the max of the first endOptime (if we recovered) and the end of our batch.
      
              // Setting end to the max of originalEndOpTime and lastOpTime (the end of the batch)
              // ensures that we keep pushing out the point where we can become consistent
              // and allow reads. If we recover and end up doing smaller batches we must pass the
              // originalEndOpTime before we are good.
              //
              // For example:
              // batch apply, 20-40, end = 40
              // batch failure,
              // restart
              // batch apply, 20-25, end = max(25, 40) = 40
              // batch apply, 25-45, end = 45
              const OpTime end(std::max(originalEndOpTime, lastOpTime));
      
              // This write will not journal/checkpoint.
              StorageInterface::get(&txn)->setMinValid(&txn, {start, end});       <<<<-------------------,
                                                                                                         |
              const size_t opsInBatch = ops.getCount();                                                  |  Race "on the way in"
              lastWriteOpTime = multiApply(&txn, ops.releaseBatch());             <<<<-------------------+
              if (lastWriteOpTime.isNull()) {                                                            |
                  // fassert if oplog application failed for any reasons other than shutdown.            |
                  error() << "Failed to apply " << opsInBatch << " operations - batch start:" << start   |
                          << " end:" << end;                                                             |
                  fassert(34360, inShutdownStrict());                                                    |
                  // Return without setting minvalid in the case of shutdown.                            |
                  return;                                                                                |  Race "out the way out"
              }                                                                                          |
                                                                                                         |
              setNewTimestamp(lastWriteOpTime.getTimestamp());                                           |
              StorageInterface::get(&txn)->setMinValid(&txn, end, DurableRequirement::None);   <<<<------'
              minValidBoundaries.start = {};
              minValidBoundaries.end = end;
              finalizer->record(lastWriteOpTime);
      
      db/repl/sync_tail.cpp
      StatusWith<OpTime> multiApply(OperationContext* txn,
                                    OldThreadPool* workerPool,
                                    MultiApplier::Operations ops,
                                    MultiApplier::ApplyOperationFn applyOperation) {
          if (!txn) {
              return {ErrorCodes::BadValue, "invalid operation context"};
          }
      
          if (!workerPool) {
              return {ErrorCodes::BadValue, "invalid worker pool"};
          }
      
          if (ops.empty()) {
              return {ErrorCodes::EmptyArrayOperation, "no operations provided to multiApply"};
          }
      
          if (!applyOperation) {
              return {ErrorCodes::BadValue, "invalid apply operation function"};
          }
      
          if (getGlobalServiceContext()->getGlobalStorageEngine()->isMmapV1()) {
              // Use a ThreadPool to prefetch all the operations in a batch.
              prefetchOps(ops, workerPool);
          }
      
          LOG(2) << "replication batch size is " << ops.size();
          // We must grab this because we're going to grab write locks later.
          // We hold this mutex the entire time we're writing; it doesn't matter
          // because all readers are blocked anyway.
          stdx::lock_guard<SimpleMutex> fsynclk(filesLockedFsync);
      
          // Stop all readers until we're done. This also prevents doc-locking engines from deleting old
          // entries from the oplog until we finish writing.
          Lock::ParallelBatchWriterMode pbwm(txn->lockState());
      
          auto replCoord = ReplicationCoordinator::get(txn);
          if (replCoord->getMemberState().primary() && !replCoord->isWaitingForApplierToDrain()) {
              severe() << "attempting to replicate ops while primary";
              return {ErrorCodes::CannotApplyOplogWhilePrimary,
                      "attempting to replicate ops while primary"};
          }
      
          ...
      

            Assignee:
            mathias@mongodb.com Mathias Stearn
            Reporter:
            kevin.pulo@mongodb.com Kevin Pulo
            Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

              Created:
              Updated:
              Resolved: