[SERVER-21868] Shutdown may not be handled correctly on secondary nodes Created: 11/Dec/15  Updated: 25/Jan/17  Resolved: 17/Dec/15

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.2.0
Fix Version/s: 3.2.1, 3.3.0

Type: Bug Priority: Major - P3
Reporter: Siyuan Zhou Assignee: Siyuan Zhou
Resolution: Done Votes: 0
Labels: code-only
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Completed:
Sprint: Repl E (01/08/16)
Participants:
Linked BF Score: 0

 Description   
Issue Status as of Dec 16, 2015

ISSUE SUMMARY
In a replica set, if a secondary node is shut down while replicating writes, the node may mark certain replicated operations as successfully applied even though they have not been applied.

This problem only applies to a “clean shutdown”, which occurs when the node is shut down via one of the following means:

  • The shutdown command
  • The Ctrl-C handler on Windows
  • The following POSIX signals: TERM, HUP, INT, USR1, XCPU

Notably, this issue does not apply to nodes that shut down abnormally. If a mongod process is terminated abruptly, such as via a KILL signal, it is not subject to this bug.

USER IMPACT
If a secondary node is shut down while replicating writes, the node may end up in an inconsistent state with respect to the primary and other secondaries.

WORKAROUNDS
There are two workarounds for safely shutting down a secondary node running 3.2.0. They are described below.

Use a non-clean shutdown method

By inducing a non-clean shutdown, the bug can be avoided. This approach is safe on all deployments using WiredTiger, and on all MMAPv1 deployments with journaling enabled (the default).

On a system that supports POSIX signals, send a KILL (9) or QUIT (3) signal to the mongod process to shut it down. On Windows, use “tskill”. The storage engine and replication recovery code will bring the node back into a consistent state upon server restart.
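For illustration, here is a minimal sketch of this approach using the POSIX kill(2) API; a plain "kill -QUIT <pid>" from a shell is equivalent, and the PID argument is a placeholder for the actual mongod process ID:

// Minimal sketch (not MongoDB code): send SIGQUIT to a mongod process
// by PID, equivalent to running "kill -QUIT <pid>" from a shell.
// SIGKILL (9) works the same way for this workaround.
#include <signal.h>
#include <sys/types.h>

#include <cstdio>
#include <cstdlib>

int main(int argc, char** argv) {
    if (argc != 2) {
        std::fprintf(stderr, "usage: %s <mongod-pid>\n", argv[0]);
        return 1;
    }
    pid_t pid = static_cast<pid_t>(std::atol(argv[1]));
    if (kill(pid, SIGQUIT) != 0) {
        std::perror("kill");
        return 1;
    }
    return 0;
}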

This is a temporary workaround for 3.2.0 users. Do not use after upgrading to 3.2.1 or newer.

Remove the node from the replica set

Removing the node from its replica set configuration before shutting it down ensures that the node is not processing replicated writes at shutdown time.

Remove the node from the replica set configuration via the replSetReconfig command or the rs.reconfig shell helper. Then, wait for the node to enter the REMOVED state before shutting it down.

AFFECTED VERSIONS
Only MongoDB 3.2.0 is affected by this issue.

FIX VERSION
The fix is included in the 3.2.1 production release.

Original description

In sync_tail.cpp, multiApply() assumes the application always succeeds, then sets minValid to acknowledge that:

        // This write will not journal/checkpoint.
        setMinValid(&txn, {start, end});

        lastWriteOpTime = multiApply(&txn, ops);
        setNewTimestamp(lastWriteOpTime.getTimestamp());

        // Note: minValid is advanced unconditionally here, even if
        // multiApply() could not apply the whole batch during shutdown.
        setMinValid(&txn, end, DurableRequirement::None);
        minValidBoundaries.start = {};
        minValidBoundaries.end = end;
        finalizer.record(lastWriteOpTime);

multiApply() delegates the work to applyOps(), which simply schedules the work to worker threads:

// Doles out all the work to the writer pool threads and waits for them to complete
void applyOps(const std::vector<std::vector<BSONObj>>& writerVectors,
              OldThreadPool* writerPool,
              SyncTail::MultiSyncApplyFunc func,
              SyncTail* sync) {
    TimerHolder timer(&applyBatchStats);
    for (std::vector<std::vector<BSONObj>>::const_iterator it = writerVectors.begin();
         it != writerVectors.end();
         ++it) {
        if (!it->empty()) {
            // Note: the error that schedule() may return during shutdown
            // is silently discarded here.
            writerPool->schedule(func, stdx::cref(*it), sync);
        }
    }
}

However, schedule() may return an error to indicate that shutdown is already in progress. sync_tail.cpp ignores the error and continues to mark the operation as finished.
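To make the failure mode concrete, here is a standalone sketch with toy types (none of these names come from the server source): a pool that rejects work once shutdown has begun, and a caller that discards the rejection and counts the batch as applied anyway.

#include <functional>
#include <iostream>
#include <vector>

// Toy stand-in for the writer pool: refuses work during shutdown.
class ToyWriterPool {
public:
    // Returns false to signal that shutdown is in progress, mirroring
    // the error that the real schedule() reports.
    bool schedule(const std::function<void()>& task) {
        if (_inShutdown)
            return false;
        task();  // the toy runs inline; the real pool uses worker threads
        return true;
    }
    void beginShutdown() {
        _inShutdown = true;
    }

private:
    bool _inShutdown = false;
};

int main() {
    ToyWriterPool pool;
    std::vector<int> batch{1, 2, 3};
    int applied = 0;

    pool.beginShutdown();  // shutdown races with oplog application
    for (int op : batch) {
        (void)op;
        // Bug pattern: the return value is discarded, so the op is
        // treated as applied even though it never ran.
        pool.schedule([&applied] { ++applied; });
    }
    std::cout << "applied " << applied << " of " << batch.size() << " ops\n";
    // Prints "applied 0 of 3 ops", yet code following this pattern would
    // still advance minValid as if the whole batch had been applied.
}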

If the shutdown happens after the operations have been scheduled, the secondary runs into another fassert, which is also unexpected. Restarting cannot repair the inconsistent state either. This has also been observed in repeated runs of backup_restore.js.

As a result, any kind of operation, including commands and database writes, may be mistakenly marked as executed when the secondary shuts down, leading to an inconsistent state with the primary and potentially missing or stale documents on secondaries.

To fix this issue, after the on_block_exit that joins the writer pool, we need to check whether shutdown has happened and, if so, return an empty optime to indicate that the batch is not complete.
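A standalone sketch of the shape of that fix follows, again with toy stand-ins rather than the actual patch (g_inShutdown, OpTime, and multiApplySketch() are illustrative): an empty optime is returned when shutdown is detected after the join, so the caller never acknowledges an incomplete batch.

#include <atomic>
#include <cstdint>
#include <iostream>

// Toy stand-in for the server's shutdown flag.
std::atomic<bool> g_inShutdown{false};

// Toy optime: ts == 0 models the "empty" optime.
struct OpTime {
    std::int64_t ts = 0;
    bool isNull() const {
        return ts == 0;
    }
};

OpTime multiApplySketch(OpTime lastOpInBatch) {
    // ... schedule the ops to the writer pool, then join it (elided) ...
    if (g_inShutdown.load()) {
        return OpTime{};  // empty optime: the batch must not be acknowledged
    }
    return lastOpInBatch;  // the batch was fully applied
}

int main() {
    g_inShutdown = true;  // simulate shutdown racing with a batch
    OpTime result = multiApplySketch(OpTime{42});
    if (result.isNull()) {
        std::cout << "batch incomplete; minValid left untouched\n";
    } else {
        std::cout << "batch applied through ts " << result.ts << "\n";
    }
}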



 Comments   
Comment by Githook User [ 17/Dec/15 ]

Author: Siyuan Zhou (visualzhou) <siyuan.zhou@mongodb.com>

Message: SERVER-21868 Shutdown may not be handled correctly in oplog application.

(cherry picked from commit ac70c5eb4d987702535ad6c00ab980de5873cdf4)
Branch: v3.2
https://github.com/mongodb/mongo/commit/f785174734fcf309c6be9cbc5f8a3ae591ce4dfd

Comment by Githook User [ 17/Dec/15 ]

Author: Siyuan Zhou (visualzhou) <siyuan.zhou@mongodb.com>

Message: SERVER-21868 Shutdown may not be handled correctly in oplog application.
Branch: master
https://github.com/mongodb/mongo/commit/ac70c5eb4d987702535ad6c00ab980de5873cdf4
