Core Server / SERVER-21868

Shutdown may not be handled correctly on secondary nodes


      Issue Status as of Dec 16, 2015

      In a replica set, if a secondary node is shut down while replicating writes, the node may mark certain replicated operations as successfully applied even though they have not been.

      This problem only applies to a “clean shutdown”, which occurs when the node is shut down via one of the following means:

      • The shutdown command
      • The Ctrl-C handler on Windows
      • The following POSIX signals: TERM, HUP, INT, USR1, XCPU

      Notably, this error does not apply to nodes that shut down abnormally. If a mongod process is terminated abruptly, for example via a KILL signal, it will not be subject to this bug.

      If a secondary node is shut down while replicating writes, the node may end up in an inconsistent state with respect to the primary and other secondaries.

      There are two workarounds for safely shutting down a secondary node running 3.2.0. They are described below.

      Use a non-clean shutdown method

      By inducing a non-clean shutdown, the bug can be avoided. This approach is safe on all deployments using WiredTiger, and all MMAP deployments with journaling enabled (the default).

      On a system that supports POSIX signals, send a KILL (9) or QUIT (3) signal to the mongod process to shut it down. On Windows, use “tskill”. The storage engine and replication recovery code will bring the node back into a consistent state upon server restart.
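      The workaround relies on the fact that SIGKILL cannot be caught, so the clean-shutdown code path containing the bug never runs. A minimal sketch on a POSIX system, using a forked child as a stand-in for the mongod process (the child and the terminateWith helper are illustrative, not part of MongoDB):

      ```cpp
      // Sketch of the unclean-shutdown workaround. A forked child stands in
      // for mongod; in practice you would look up mongod's pid and run
      // `kill -9 <pid>`.
      #include <cassert>
      #include <csignal>
      #include <sys/types.h>
      #include <sys/wait.h>
      #include <unistd.h>

      // Kill a child process with the given signal and return the signal
      // that actually terminated it (0 if it exited normally).
      int terminateWith(int sig) {
          pid_t pid = fork();
          if (pid == 0) {
              pause();  // child: block indefinitely, like a running server
              _exit(0);
          }
          kill(pid, sig);
          int status = 0;
          waitpid(pid, &status, 0);
          return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
      }

      int main() {
          // SIGKILL cannot be caught, so no clean-shutdown handler runs;
          // journal recovery repairs the node's state on restart.
          assert(terminateWith(SIGKILL) == SIGKILL);
          return 0;
      }
      ```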

      This is a temporary workaround for 3.2.0 users. Do not use after upgrading to 3.2.1 or newer.

      Remove the node from the replica set

      Removing the node from its replica set configuration before shutting it down ensures that the node is not processing replicated writes at shutdown time.

      Remove the node from the replica set configuration via the replSetReconfig command or rs.reconfig shell helper. Then, wait for the node to enter the REMOVED state before shutting it down.
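      The reconfig step amounts to submitting a new config document with the target member removed and the version bumped. A minimal sketch of that transformation (the struct names and host names here are illustrative; in the shell you would fetch the config with rs.conf() and apply it with rs.reconfig()):

      ```cpp
      // Sketch of building the new replica set config with one member
      // removed and the version incremented, as a reconfig requires.
      #include <algorithm>
      #include <cassert>
      #include <string>
      #include <vector>

      struct Member {
          int id;
          std::string host;
      };

      struct ReplSetConfig {
          std::string rsName;
          int version;
          std::vector<Member> members;
      };

      // Return a copy of the config without the given host, version + 1.
      ReplSetConfig removeMember(ReplSetConfig cfg, const std::string& host) {
          cfg.members.erase(
              std::remove_if(cfg.members.begin(), cfg.members.end(),
                             [&](const Member& m) { return m.host == host; }),
              cfg.members.end());
          ++cfg.version;
          return cfg;
      }

      int main() {
          ReplSetConfig cfg{"rs0", 3,
                            {{0, "node0:27017"},
                             {1, "node1:27017"},
                             {2, "node2:27017"}}};
          ReplSetConfig next = removeMember(cfg, "node2:27017");
          assert(next.members.size() == 2);
          assert(next.version == 4);
          // Apply `next` via replSetReconfig, then wait for the node to
          // report the REMOVED state before shutting it down.
          return 0;
      }
      ```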

      Only MongoDB 3.2.0 is affected by this issue.

      The fix is included in the 3.2.1 production release.

      Original description

      In sync_tail.cc, multiApply() assumes the application always succeeds, then sets minValid to acknowledge that.

              // This write will not journal/checkpoint.
              setMinValid(&txn, {start, end});
              lastWriteOpTime = multiApply(&txn, ops);
              setMinValid(&txn, end, DurableRequirement::None);
              minValidBoundaries.start = {};
              minValidBoundaries.end = end;

      multiApply() delegates the work to applyOps(), which simply schedules it onto the writer pool's worker threads:

      // Doles out all the work to the writer pool threads and waits for them to complete
      void applyOps(const std::vector<std::vector<BSONObj>>& writerVectors,
                    OldThreadPool* writerPool,
                    SyncTail::MultiSyncApplyFunc func,
                    SyncTail* sync) {
          TimerHolder timer(&applyBatchStats);
          for (std::vector<std::vector<BSONObj>>::const_iterator it = writerVectors.begin();
               it != writerVectors.end();
               ++it) {
              if (!it->empty()) {
            writerPool->schedule(func, stdx::cref(*it), sync);
        }
    }
}

      However, schedule() may return an error to indicate that shutdown is already in progress. sync_tail.cpp ignores this error and continues as if the operations had been applied.
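      The failure mode can be sketched with a toy pool whose schedule() refuses work once shutdown has begun (ToyPool and applyBatch are invented names for illustration; the real code uses OldThreadPool and applyOps):

      ```cpp
      // Toy illustration of the bug: dropping schedule()'s return value
      // makes a batch look applied even though no worker ever ran it.
      // Checking the result, as done here, is the direction of the fix.
      #include <cassert>
      #include <functional>
      #include <vector>

      class ToyPool {
      public:
          void beginShutdown() { _shuttingDown = true; }
          // Returns false (an "error status") once shutdown is in progress.
          bool schedule(std::function<void()> task) {
              if (_shuttingDown)
                  return false;
              _pending.push_back(std::move(task));
              return true;
          }
          void join() {
              for (auto& t : _pending) t();
              _pending.clear();
          }
      private:
          bool _shuttingDown = false;
          std::vector<std::function<void()>> _pending;
      };

      // Returns false when any batch failed to schedule, so the caller
      // knows not to mark the batch as applied.
      bool applyBatch(ToyPool& pool,
                      const std::vector<std::vector<int>>& writerVectors,
                      std::vector<int>& applied) {
          bool ok = true;
          for (const auto& ops : writerVectors) {
              if (ops.empty())
                  continue;
              bool scheduled = pool.schedule([&applied, &ops] {
                  applied.insert(applied.end(), ops.begin(), ops.end());
              });
              if (!scheduled)
                  ok = false;  // shutdown raced with this batch
          }
          pool.join();
          return ok;
      }

      int main() {
          ToyPool pool;
          std::vector<int> applied;
          assert(applyBatch(pool, {{1, 2}, {3}}, applied));
          assert(applied.size() == 3);

          pool.beginShutdown();
          // With shutdown in progress, the batch must not be reported as
          // applied.
          assert(!applyBatch(pool, {{4}}, applied));
          assert(applied.size() == 3);
          return 0;
      }
      ```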

      If the shutdown happens after the operations have been scheduled, the secondary runs into another fassert, which is also unexpected. A restart cannot repair the inconsistent state either. This has also been observed in repeated runs of backup_restore.js.

      As a result, any kind of operation may be mistakenly marked as executed when shutting down the secondary, including commands and database operations, leading to an inconsistent state relative to the primary and potentially missing or stale documents on secondaries.

      To fix this issue, after the on_block_exit that joins the worker threads, we need to check whether shutdown has happened and, if so, return an empty optime to indicate that the batch is not complete.
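      A minimal sketch of that fix direction (not MongoDB's actual code; the OpTime struct, inShutdown flag, and multiApply stand-in below are simplified for illustration):

      ```cpp
      // Sketch: when shutdown interrupts a batch, return a null OpTime so
      // the caller does not advance minValid past an incomplete batch.
      #include <atomic>
      #include <cassert>
      #include <vector>

      struct OpTime {
          long long ts = 0;
          bool isNull() const { return ts == 0; }
      };

      std::atomic<bool> inShutdown{false};

      // Stand-in for multiApply(): returns the last applied optime, or a
      // null OpTime when shutdown interrupted the batch.
      OpTime multiApply(const std::vector<long long>& ops) {
          OpTime last;
          for (long long ts : ops) {
              if (inShutdown.load())
                  return OpTime{};  // incomplete: do not advance minValid
              last.ts = ts;  // "apply" the operation
          }
          return last;
      }

      int main() {
          std::vector<long long> batch{10, 11, 12};
          assert(multiApply(batch).ts == 12);

          inShutdown.store(true);
          // The caller must leave minValid's end untouched for this batch.
          assert(multiApply(batch).isNull());
          return 0;
      }
      ```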

            siyuan.zhou@mongodb.com Siyuan Zhou