Core Server / SERVER-25248

minvalid should be set while appropriate locks are held


Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major - P3
    • Affects Version/s: 3.2.0, 3.3.0
    • Component/s: Replication
    • Backwards Compatibility: Fully Compatible
    • Operating System: ALL
    • Steps To Reproduce:
      • Start a replset environment, e.g.:

        mlaunch init --replicaset --nodes 2 --arbiter --port 31976 --oplogSize 1000 --wiredTigerCacheSizeGB 2
        

      • Wait for primary (see the sketch after these steps)
      • Load up the primary:

        mongo --port 31976 --quiet --eval '
            while(1) {
                db.test.update( { _id: Math.floor(Math.random()*10000) },
                                { $inc: { a: 1 } },
                                { upsert: true, multi: false } );
            }'
        

      • On the secondary, check (repeatedly) if there is a begin field in minvalid while fsyncLocked:

        mongo --port 31977 --eval '
            while(1) {
                assert.commandWorked(db.fsyncLock());
                var minvalid = db.getSiblingDB("local").replset.minvalid.find().next();
                printjson(minvalid);
                try {
                    assert( ! minvalid.begin);
                } finally {
                    assert.commandWorked(db.fsyncUnlock());
                }
            }'
        

        Always fails for me within a few attempts/seconds.
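
        For the "wait for primary" step above, a minimal sketch (assuming the mlaunch ports used here) is to poll isMaster on the expected primary until it reports itself as primary:

        mongo --port 31976 --quiet --eval '
            // Poll until this node reports itself primary (give up after 60s).
            assert.soon(function() { return db.isMaster().ismaster; },
                        "node did not become primary", 60 * 1000);
            print("primary is up");
        '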

    • Sprint: Repl 2016-11-21

    Description

      The main impact of not doing this is that fsyncLocked secondaries can appear to be midway through a batch (i.e. in an inconsistent state).

      multiApply() takes the fsyncLock mutex while the batch is applied. This means that if an fsyncLock command comes in while a batch is being applied, it will wait for the batch to finish. Similarly, if the secondary is fsyncLocked, a new batch will not start being applied.
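
      One way to observe this from the shell (a rough sketch only, assuming the same mlaunch setup as in the steps to reproduce, with the primary still under load): once fsyncLock() has returned on the secondary, its newest oplog entry should not advance until fsyncUnlock():

        mongo --port 31977 --quiet --eval '
            var lastTs = function() {
                // Newest oplog entry that has been applied on this node.
                return db.getSiblingDB("local").oplog.rs.find().sort({ $natural: -1 }).limit(1).next().ts;
            };
            assert.commandWorked(db.fsyncLock());
            var before = lastTs();
            sleep(5000);    // primary is still receiving the upsert load
            var after = lastTs();
            assert.commandWorked(db.fsyncUnlock());
            print("before: " + tojson(before) + ", after: " + tojson(after));
            assert.eq(tojson(before), tojson(after), "oplog advanced while fsyncLocked");
        '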

      However, minvalid is currently updated outside multiApply(), and therefore outside this lock. This leaves a race where minvalid can be updated even though the secondary has been fsyncLocked, so the secondary gets marked as inconsistent (minvalid has a begin field) even though (in this particular case) it isn't.

      This is a problem because the user expectation is that fsyncLocking a secondary leaves it in a consistent state (since replication writes have been stopped and reads are possible, the secondary must be between batches).

      If the user takes an atomic filesystem snapshot of the dbpath of an fsyncLocked secondary, and then tries to start it up in a context that doesn't have access to the original replica set (e.g. to seed a QA environment), then the node will go into RECOVERING and not be able to come out. Similarly, if a user has a dbpath snapshot that is marked as inconsistent, it is not possible to tell whether the snapshot is actually from within a batch, or suffering from the problem described here.
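
      For reference, whether a given dbpath snapshot has been flagged this way can be checked directly once it has been started up as a standalone mongod (a sketch; port 30000 is just a placeholder): a begin field in local.replset.minvalid means the node considers itself inconsistent, whether or not it actually is.

        mongo --port 30000 --quiet --eval '
            var minvalid = db.getSiblingDB("local").replset.minvalid.findOne();
            printjson(minvalid);
            // A "begin" field marks the data as mid-batch; because of this bug it can
            // be present even for a snapshot that is actually consistent.
            print(minvalid && minvalid.begin ? "flagged inconsistent" : "not flagged");
        '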

      The solution is that minvalid should only have the begin field added if the batch is actually about to be applied (as opposed to possibly held up by fsyncLock). Similarly, minvalid should have the begin field removed as soon as the batch has finished being applied (and not potentially delayed by fsyncLock). Together, these mean that minvalid should only be updated inside multiApply() while the fsyncLock mutex (and PBWM) is held.

      Workarounds:

      • Take filesystem snapshots of the primary instead. This is not always possible, since it may add extra load on the primary (e.g. LVM snapshots need to be mounted and copied elsewhere).
      • For 3.2.x where x >= 9, shut down the secondary before snapshotting it (see SERVER-24933).
      • After SERVER-25071 is fixed, shutting down the secondary will also be viable for master/3.3+.
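
      A further partial mitigation (a sketch only, using the same secondary port as the reproduction above) is to check minvalid after fsyncLocking and only snapshot when no begin field is present, unlocking and retrying otherwise (because of the race described here, begin can be set spuriously, so this may take several attempts); if begin is present there is no way to tell whether the snapshot would actually be consistent, so that attempt has to be skipped:

        mongo --port 31977 --quiet --eval '
            for (var attempt = 0; attempt < 100; attempt++) {
                assert.commandWorked(db.fsyncLock());
                var minvalid = db.getSiblingDB("local").replset.minvalid.findOne();
                if (!minvalid || !minvalid.begin) {
                    // Not marked mid-batch: safe to take the filesystem snapshot now,
                    // while the node is still fsyncLocked.
                    print("safe to snapshot (attempt " + attempt + ")");
                    // ... take the snapshot here, then unlock:
                    assert.commandWorked(db.fsyncUnlock());
                    quit(0);
                }
                // Marked mid-batch (possibly spuriously, per this ticket): unlock and retry.
                assert.commandWorked(db.fsyncUnlock());
                sleep(100);
            }
            print("gave up: minvalid.begin was always set");
            quit(1);
        '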

      db/repl/sync_tail.cpp (lines 827-867):

              // Set minValid to the last OpTime that needs to be applied, in this batch or from the
              // (last) failed batch, whichever is larger.
              // This will cause this node to go into RECOVERING state
              // if we should crash and restart before updating finishing.
              const auto& start = lastWriteOpTime;


              // Take the max of the first endOptime (if we recovered) and the end of our batch.

              // Setting end to the max of originalEndOpTime and lastOpTime (the end of the batch)
              // ensures that we keep pushing out the point where we can become consistent
              // and allow reads. If we recover and end up doing smaller batches we must pass the
              // originalEndOpTime before we are good.
              //
              // For example:
              // batch apply, 20-40, end = 40
              // batch failure,
              // restart
              // batch apply, 20-25, end = max(25, 40) = 40
              // batch apply, 25-45, end = 45
              const OpTime end(std::max(originalEndOpTime, lastOpTime));

              // This write will not journal/checkpoint.
              StorageInterface::get(&txn)->setMinValid(&txn, {start, end});       <<<<-------------------,
                                                                                                         |
              const size_t opsInBatch = ops.getCount();                                                  |  Race "on the way in"
              lastWriteOpTime = multiApply(&txn, ops.releaseBatch());             <<<<-------------------+
              if (lastWriteOpTime.isNull()) {                                                            |
                  // fassert if oplog application failed for any reasons other than shutdown.            |
                  error() << "Failed to apply " << opsInBatch << " operations - batch start:" << start   |
                          << " end:" << end;                                                             |
                  fassert(34360, inShutdownStrict());                                                    |
                  // Return without setting minvalid in the case of shutdown.                            |
                  return;                                                                                |  Race "on the way out"
              }                                                                                          |
                                                                                                         |
              setNewTimestamp(lastWriteOpTime.getTimestamp());                                           |
              StorageInterface::get(&txn)->setMinValid(&txn, end, DurableRequirement::None);   <<<<------'
              minValidBoundaries.start = {};
              minValidBoundaries.end = end;
              finalizer->record(lastWriteOpTime);
      

      db/repl/sync_tail.cpp (from line 1234):

      StatusWith<OpTime> multiApply(OperationContext* txn,
                                    OldThreadPool* workerPool,
                                    MultiApplier::Operations ops,
                                    MultiApplier::ApplyOperationFn applyOperation) {
          if (!txn) {
              return {ErrorCodes::BadValue, "invalid operation context"};
          }

          if (!workerPool) {
              return {ErrorCodes::BadValue, "invalid worker pool"};
          }

          if (ops.empty()) {
              return {ErrorCodes::EmptyArrayOperation, "no operations provided to multiApply"};
          }

          if (!applyOperation) {
              return {ErrorCodes::BadValue, "invalid apply operation function"};
          }

          if (getGlobalServiceContext()->getGlobalStorageEngine()->isMmapV1()) {
              // Use a ThreadPool to prefetch all the operations in a batch.
              prefetchOps(ops, workerPool);
          }

          LOG(2) << "replication batch size is " << ops.size();
          // We must grab this because we're going to grab write locks later.
          // We hold this mutex the entire time we're writing; it doesn't matter
          // because all readers are blocked anyway.
          stdx::lock_guard<SimpleMutex> fsynclk(filesLockedFsync);

          // Stop all readers until we're done. This also prevents doc-locking engines from deleting old
          // entries from the oplog until we finish writing.
          Lock::ParallelBatchWriterMode pbwm(txn->lockState());

          auto replCoord = ReplicationCoordinator::get(txn);
          if (replCoord->getMemberState().primary() && !replCoord->isWaitingForApplierToDrain()) {
              severe() << "attempting to replicate ops while primary";
              return {ErrorCodes::CannotApplyOplogWhilePrimary,
                      "attempting to replicate ops while primary"};
          }

          ...
      

    People

      Assignee: Mathias Stearn (mathias@mongodb.com)
      Reporter: Kevin Pulo (kevin.pulo@mongodb.com)
      Votes: 0
      Watchers: 8