[SERVER-30940] rollback can truncate the oplog behind the commit point Created: 04/Sep/17  Updated: 30/Oct/23  Resolved: 10/Oct/17

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.5.12
Fix Version/s: 3.6.0-rc0

Type: Bug Priority: Major - P3
Reporter: James O'Leary Assignee: Judah Schvimer
Resolution: Fixed Votes: 0
Labels: sysperf-36
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   

During one of the sys perf aggregation tests, the primary of the first shard failed with the following backtrace:

 /data/mci/2244ff45fe11239a32115d5ad616b9f0/src/src/mongo/util/stacktrace_posix.cpp:172:0: mongo::printStackTrace(std::ostream&)
 /data/mci/2244ff45fe11239a32115d5ad616b9f0/src/src/mongo/util/signal_handlers_synchronous.cpp:180:0: mongo::(anonymous namespace)::printSignalAndBacktrace(int)
 /data/mci/2244ff45fe11239a32115d5ad616b9f0/src/src/mongo/util/signal_handlers_synchronous.cpp:237:0: mongo::(anonymous namespace)::abruptQuit(int)
 ??:0:0: ??
 ??:0:0: ??
 ??:0:0: ??
 /data/mci/2244ff45fe11239a32115d5ad616b9f0/src/src/mongo/util/assert_util.cpp:141:0: mongo::invariantOKFailed(char const*, mongo::Status const&, char const*, unsigned int)
 /data/mci/2244ff45fe11239a32115d5ad616b9f0/src/src/mongo/db/storage/wiredtiger/wiredtiger_record_store.cpp:1676:0: mongo::WiredTigerRecordStore::cappedTruncateAfter(mongo::OperationContext*, mongo::RecordId, bool)
 /data/mci/2244ff45fe11239a32115d5ad616b9f0/src/src/mongo/db/catalog/collection_impl.cpp:885:0: mongo::CollectionImpl::cappedTruncateAfter(mongo::OperationContext*, mongo::RecordId, bool)
 /data/mci/2244ff45fe11239a32115d5ad616b9f0/src/src/mongo/db/catalog/collection.h:637:0: mongo::Collection::cappedTruncateAfter(mongo::OperationContext*, mongo::RecordId, bool)
 /data/mci/2244ff45fe11239a32115d5ad616b9f0/src/src/mongo/db/repl/rs_rollback.cpp:1381:0: mongo::repl::rollback_internal::syncFixUp(mongo::OperationContext*, mongo::repl::rollback_internal::FixUpInfo const&, mongo::repl::RollbackSource const&, mongo::repl::ReplicationCoordinator*, mongo::repl::ReplicationProcess*)
 /data/mci/2244ff45fe11239a32115d5ad616b9f0/src/src/mongo/db/repl/rs_rollback.cpp:929:0: _syncRollback
 /data/mci/2244ff45fe11239a32115d5ad616b9f0/src/src/mongo/db/repl/rs_rollback.cpp:1412:0: mongo::repl::syncRollback(mongo::OperationContext*, mongo::repl::OplogInterface const&, mongo::repl::RollbackSource const&, int, mongo::repl::ReplicationCoordinator*, mongo::repl::ReplicationProcess*)
 /data/mci/2244ff45fe11239a32115d5ad616b9f0/src/src/mongo/db/repl/rs_rollback.cpp:1451:0: mongo::repl::rollback(mongo::OperationContext*, mongo::repl::OplogInterface const&, mongo::repl::RollbackSource const&, int, mongo::repl::ReplicationCoordinator*, mongo::repl::ReplicationProcess*, std::function<void (int)>)
 /data/mci/2244ff45fe11239a32115d5ad616b9f0/src/src/mongo/db/repl/bgsync.cpp:763:0: mongo::repl::BackgroundSync::_fallBackOnRollbackViaRefetch(mongo::OperationContext*, mongo::HostAndPort const&, int, mongo::repl::OplogInterface*, bool, std::function<mongo::DBClientBase* ()>)
 /data/mci/2244ff45fe11239a32115d5ad616b9f0/src/src/mongo/db/repl/bgsync.cpp:694:0: mongo::repl::BackgroundSync::_runRollback(mongo::OperationContext*, mongo::Status const&, mongo::HostAndPort const&, int, mongo::repl::StorageInterface*)
 /data/mci/2244ff45fe11239a32115d5ad616b9f0/src/src/mongo/db/repl/bgsync.cpp:516:0: mongo::repl::BackgroundSync::_produce(mongo::OperationContext*)
 /data/mci/2244ff45fe11239a32115d5ad616b9f0/src/src/mongo/db/repl/bgsync.cpp:274:0: mongo::repl::BackgroundSync::_runProducer()
 /data/mci/2244ff45fe11239a32115d5ad616b9f0/src/src/mongo/db/repl/bgsync.cpp:231:0: mongo::repl::BackgroundSync::_run()
 /data/mci/7c12b94560113a33bce052faea90f386/toolchain-builder/build-gcc-v2.sh-Cox/x86_64-mongodb-linux/libstdc++-v3/src/c++11/../../../../../gcc-5.4.0/libstdc++-v3/src/c++11/thread.cc:84:0: execute_native_thread_routine
 ??:0:0: ??
 ??:0:0: ??



 Comments   
Comment by Judah Schvimer [ 10/Oct/17 ]

The commit above ensured that replication will fail an invariant before rolling back behind the commit point. I'm closing this as fixed and will reopen it if it reproduces.

Comment by Kevin Duong [ 06/Oct/17 ]

Changing this from "Debugging With Submitter" to "3.6 Required".

Comment by Githook User [ 19/Sep/17 ]

Author:

{'username': 'judahschvimer', 'name': 'Judah Schvimer', 'email': 'judah@mongodb.com'}

Message: SERVER-30940 make sure we never roll back behind commit point
Branch: master
https://github.com/mongodb/mongo/commit/a14ac80417995193d0e713dc3d0c2edd489d123a

Comment by Judah Schvimer [ 13/Sep/17 ]

After talking to milkie, it appears that we are truncating behind the stable timestamp, which is itself behind the replication commit point. We pass false for the "inclusive" flag to cappedTruncateAfter, so we are not removing the common point itself.

The first log line below is strange: the two oplog entries we compare going into rollback have the same timestamp and term but different hashes, which should be impossible.

[2017/09/04 04:12:55.821] 2017-09-04T03:35:07.064+0000 I REPL     [rsBackgroundSync] Starting rollback due to OplogStartMissing: our last op time fetched: { ts: Timestamp 1504495939000|1, t: 4 }. source's GTE: { ts: Timestamp 1504495939000|1, t: 4 } hashes: (957221172679042315/9217139146572485651)
[2017/09/04 04:12:55.822] 2017-09-04T03:35:30.291+0000 I REPL     [rsBackgroundSync] Rollback using the 'rollbackViaRefetch' method due to startup server parameter.
[2017/09/04 04:12:55.823] 2017-09-04T03:35:47.746+0000 I REPL     [rsBackgroundSync] transition to ROLLBACK
[2017/09/04 04:12:55.823] 2017-09-04T03:35:54.791+0000 I ROLLBACK [rsBackgroundSync] Starting rollback. Sync source: 10.2.0.202:27017
[2017/09/04 04:12:55.823] 2017-09-04T03:36:04.683+0000 I ROLLBACK [rsBackgroundSync] Finding the Common Point
[2017/09/04 04:12:55.824] 2017-09-04T03:36:09.370+0000 I ROLLBACK [rsBackgroundSync] our last optime:   Sep  4 03:32:19:1
[2017/09/04 04:12:55.824] 2017-09-04T03:36:13.233+0000 I ROLLBACK [rsBackgroundSync] their last optime: Sep  4 03:35:59:1
[2017/09/04 04:12:55.825] 2017-09-04T03:36:17.503+0000 I ROLLBACK [rsBackgroundSync] diff in end of log times: -220 seconds
[2017/09/04 04:12:55.826] 2017-09-04T03:36:43.815+0000 I ROLLBACK [rsBackgroundSync] Rollback common point is { ts: Timestamp 1504495573000|1, t: 1 }
[2017/09/04 04:12:55.826] 2017-09-04T03:36:45.258+0000 I ROLLBACK [rsBackgroundSync] Starting refetching documents
[2017/09/04 04:12:55.826] 2017-09-04T03:36:48.701+0000 I ROLLBACK [rsBackgroundSync] Finished refetching documents. Total size of documents refetched: 0
[2017/09/04 04:12:55.826] 2017-09-04T03:36:51.054+0000 I ROLLBACK [rsBackgroundSync] Checking the RollbackID and updating the MinValid if necessary
[2017/09/04 04:12:55.826] 2017-09-04T03:36:57.630+0000 I ROLLBACK [rsBackgroundSync] Setting minvalid to { ts: Timestamp 1504496209000|1, t: 4 }
[2017/09/04 04:12:55.827] 2017-09-04T03:37:15.118+0000 I ROLLBACK [rsBackgroundSync] Dropping collections to roll back create operations
[2017/09/04 04:12:55.827] 2017-09-04T03:37:17.149+0000 I ROLLBACK [rsBackgroundSync] Rolling back renameCollection commands and collection drop commands.
[2017/09/04 04:12:55.827] 2017-09-04T03:37:18.669+0000 I ROLLBACK [rsBackgroundSync] Rolling back collections pending being dropped: Removing them from the list of drop-pending collections in the DropPendingCollectionReaper.
[2017/09/04 04:12:55.827] 2017-09-04T03:37:21.587+0000 I ROLLBACK [rsBackgroundSync] Rolling back createIndexes commands.
[2017/09/04 04:12:55.827] 2017-09-04T03:37:22.168+0000 I ROLLBACK [rsBackgroundSync] Rolling back dropIndexes commands.
[2017/09/04 04:12:55.827] 2017-09-04T03:37:23.189+0000 I ROLLBACK [rsBackgroundSync] Deleting and updating documents to roll back insert, update and remove operations
[2017/09/04 04:12:55.827] 2017-09-04T03:37:24.420+0000 I ROLLBACK [rsBackgroundSync] Rollback deleted 0 documents and updated 0 documents.
[2017/09/04 04:12:55.827] 2017-09-04T03:37:25.142+0000 I ROLLBACK [rsBackgroundSync] Truncating the oplog at { ts: Timestamp 1504495573000|1, t: 1 }

Comment by Eric Milkie [ 12/Sep/17 ]

Agreed; I removed the link to SERVER-30783.

Comment by Judah Schvimer [ 12/Sep/17 ]

This doesn't look related to SERVER-30783. That ticket is about when to call cappedTruncateAfter in ReplicationRecovery, while this fassert occurs in rs_rollback.cpp. I will continue to investigate where we call cappedTruncateAfter in rs_rollback.cpp here.

Comment by Eric Milkie [ 04/Sep/17 ]

This isn't a segmentation fault (no segfault appears in the log), so I removed the links to the other tickets and changed the title to reflect this.

Comment by Eric Milkie [ 04/Sep/17 ]

The error that triggered the crash:

[2017/09/04 00:12:55.828] 2017-09-04T03:37:33.086+0000 E STORAGE  [rsBackgroundSync] WiredTiger error (22) [1504496252:378359][16314:0x7fe9acc43700], WT_CONNECTION.set_timestamp: set_timestamp: oldest timestamp must not be later than commit timestamp: Invalid argument

I think this is a manifestation of a problem that SERVER-30873 will fix.

Comment by Eric Milkie [ 04/Sep/17 ]

jim.oleary do you have a link to the test? I'd like to see the full log.

Generated at Thu Feb 08 04:25:30 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.