[SERVER-41163] During initial sync, failing to apply an update thats in a prepared txn hits an invariant Created: 15/May/19  Updated: 29/Oct/23  Resolved: 19/Jun/19

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 4.2.0-rc3

Type: Bug Priority: Major - P3
Reporter: Vlad Rachev (Inactive) Assignee: Pavithra Vetriselvan
Resolution: Fixed Votes: 0
Labels: isfz, prepare_initial_sync
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File initsync_fuzzer-cacf2-1557881965671-05.js    
Issue Links:
Backports
Depends
is depended on by SERVER-41437 Unify transaction oplog traversal Closed
Related
related to SERVER-41113 Ignore NamespaceNotFound exceptions w... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.2
Steps To Reproduce:

I've attached the test. Here are the logs.
Here's the stacktrace:

[js_test:initsync_fuzzer-cacf2-1557881965671-05] 2019-05-15T16:43:12.171-0400 d34588| 2019-05-15T16:43:12.171-0400 F  -        [repl-writer-worker-2] Invariant failure isCrudOpType() src/mongo/db/repl/oplog_entry.cpp 292
[js_test:initsync_fuzzer-cacf2-1557881965671-05] 2019-05-15T16:43:12.171-0400 d34588| 2019-05-15T16:43:12.171-0400 F  -        [repl-writer-worker-2]
[js_test:initsync_fuzzer-cacf2-1557881965671-05] 2019-05-15T16:43:12.171-0400 d34588|
[js_test:initsync_fuzzer-cacf2-1557881965671-05] 2019-05-15T16:43:12.172-0400 d34588| ***aborting after invariant() failure
[js_test:initsync_fuzzer-cacf2-1557881965671-05] 2019-05-15T16:43:12.172-0400 d34588|
[js_test:initsync_fuzzer-cacf2-1557881965671-05] 2019-05-15T16:43:12.172-0400 d34588|
[js_test:initsync_fuzzer-cacf2-1557881965671-05] 2019-05-15T16:43:12.172-0400 d34588| 2019-05-15T16:43:12.172-0400 F  -        [repl-writer-worker-2] Got signal: 6 (Aborted).
[js_test:initsync_fuzzer-cacf2-1557881965671-05] 2019-05-15T16:43:12.173-0400 d34588|  0x7faebee287a1 0x7faebee2802e 0x7faebee280cf 0x7faebcc37890 0x7faebc872e97 0x7faebc874801 0x7faebeda8bc6 0x7faebfd07e60 0x7faec1b5e492 0x7faec1b6013a 0x7faec1b6255d 0x7faec1b54de8 0x7faec1b5bffa 0x7faec10cfb16 0x7faec10d0d10 0x7faec10d2755 0x7faebd4b957f 0x7faebcc2c6db 0x7faebc95588f
[js_test:initsync_fuzzer-cacf2-1557881965671-05] 2019-05-15T16:43:12.173-0400 d34588| ----- BEGIN BACKTRACE -----
 /mnt/home/vlad/mongodb/github/mongo/src/mongo/util/stacktrace_posix.cpp:174:39: mongo::printStackTrace(std::ostream&)
 /mnt/home/vlad/mongodb/github/mongo/src/mongo/util/signal_handlers_synchronous.cpp:182:20: mongo::(anonymous namespace)::printSignalAndBacktrace(int)
 /mnt/home/vlad/mongodb/github/mongo/src/mongo/util/signal_handlers_synchronous.cpp:239:28: mongo::(anonymous namespace)::abruptQuit(int)
 ??:0:0: ??
 /build/glibc-OTsEL5/glibc-2.27/signal/../sysdeps/unix/sysv/linux/nptl-signals.h:80:0: __libc_signal_restore_set
 /build/glibc-OTsEL5/glibc-2.27/signal/../sysdeps/unix/sysv/linux/raise.c:48:0: gsignal
 /build/glibc-OTsEL5/glibc-2.27/stdlib/abort.c:79:0: abort
 /mnt/home/vlad/mongodb/github/mongo/src/mongo/util/assert_util.cpp:104:15: mongo::invariantFailed(char const*, char const*, unsigned int)
 /mnt/home/vlad/mongodb/github/mongo/src/mongo/util/invariant.h:69:33: void mongo::invariantWithLocation<bool>(bool const&, char const*, char const*, unsigned int)
 /mnt/home/vlad/mongodb/github/mongo/src/mongo/db/repl/oplog_entry.cpp:292:5: mongo::repl::OplogEntry::getIdElement() const [clone .cold.94]
 /mnt/home/vlad/mongodb/github/mongo/src/mongo/db/repl/sync_tail.cpp:895:53: mongo::repl::SyncTail::getMissingDoc(mongo::OperationContext*, mongo::repl::OplogEntry const&)
 /mnt/home/vlad/mongodb/github/mongo/src/mongo/db/repl/sync_tail.cpp:951:57: mongo::repl::SyncTail::fetchAndInsertMissingDocument(mongo::OperationContext*, mongo::repl::OplogEntry const&)
 /mnt/home/vlad/mongodb/github/mongo/src/mongo/db/repl/sync_tail.cpp:1071:58: mongo::repl::multiSyncApply(mongo::OperationContext*, std::vector<mongo::repl::OplogEntry const*, std::allocator<mongo::repl::OplogEntry const*> >*, mongo::repl::SyncTail*, std::vector<mongo::MultikeyPathInfo, std::allocator<mongo::MultikeyPathInfo> >*)
 /opt/mongodbtoolchain/revisions/e84eb3fd219668197589e62dba14b9914712642d/stow/gcc-v3.ipd/include/c++/8.2.0/bits/std_function.h:283:40: std::_Function_handler<mongo::Status (mongo::OperationContext*, std::vector<mongo::repl::OplogEntry const*, std::allocator<mongo::repl::OplogEntry const*> >*, mongo::repl::SyncTail*, std::vector<mongo::MultikeyPathInfo, std::allocator<mongo::MultikeyPathInfo> >*), mongo::Status (*)(mongo::OperationContext*, std::vector<mongo::repl::OplogEntry const*, std::allocator<mongo::repl::OplogEntry const*> >*, mongo::repl::SyncTail*, std::vector<mongo::MultikeyPathInfo, std::allocator<mongo::MultikeyPathInfo> >*)>::_M_invoke(std::_Any_data const&, mongo::OperationContext*&&, std::vector<mongo::repl::OplogEntry const*, std::allocator<mongo::repl::OplogEntry const*> >*&&, mongo::repl::SyncTail*&&, std::vector<mongo::MultikeyPathInfo, std::allocator<mongo::MultikeyPathInfo> >*&&)
 /opt/mongodbtoolchain/revisions/e84eb3fd219668197589e62dba14b9914712642d/stow/gcc-v3.ipd/include/c++/8.2.0/bits/std_function.h:687:71: std::function<mongo::Status (mongo::OperationContext*, std::vector<mongo::repl::OplogEntry const*, std::allocator<mongo::repl::OplogEntry const*> >*, mongo::repl::SyncTail*, std::vector<mongo::MultikeyPathInfo, std::allocator<mongo::MultikeyPathInfo> >*)>::operator()(mongo::OperationContext*, std::vector<mongo::repl::OplogEntry const*, std::allocator<mongo::repl::OplogEntry const*> >*, mongo::repl::SyncTail*, std::vector<mongo::MultikeyPathInfo, std::allocator<mongo::MultikeyPathInfo> >*) const
 /mnt/home/vlad/mongodb/github/mongo/src/mongo/db/repl/sync_tail.cpp:1278:92: operator()
 /mnt/home/vlad/mongodb/github/mongo/src/mongo/util/interruptible.h:179:47: runWithoutInterruptionExceptAtGlobalShutdown<mongo::repl::SyncTail::_applyOps(std::vector<std::vector<const mongo::repl::OplogEntry*> >&, std::vector<mongo::Status>*, std::vector<std::vector<mongo::MultikeyPathInfo> >*)::<lambda(auto:20)> [with auto:20 = mongo::Status]::<lambda()> >
 /mnt/home/vlad/mongodb/github/mongo/src/mongo/db/repl/sync_tail.cpp:1277:20: operator()<mongo::Status>
 /mnt/home/vlad/mongodb/github/mongo/src/mongo/util/functional.h:144:10: callRegularVoid<mongo::repl::SyncTail::_applyOps(std::vector<std::vector<const mongo::repl::OplogEntry*> >&, std::vector<mongo::Status>*, std::vector<std::vector<mongo::MultikeyPathInfo> >*)::<lambda(auto:20)> >
 /mnt/home/vlad/mongodb/github/mongo/src/mongo/util/functional.h:158:39: _ZZN5mongo15unique_functionIFvNS_6StatusEEE8makeImplIZNS_4repl8SyncTail9_applyOpsERSt6vectorIS7_IPKNS5_10OplogEntryESaISA_EESaISC_EEPS7_IS1_SaIS1_EEPS7_IS7_INS_16MultikeyPathInfoESaISJ_EESaISL_EEEUlT_E_EEDaOSP_EN12SpecificImpl4callEOS1_
 /mnt/home/vlad/mongodb/github/mongo/src/mongo/util/functional.h:108:54: mongo::unique_function<void (mongo::Status)>::operator()(mongo::Status) const
 /mnt/home/vlad/mongodb/github/mongo/src/mongo/util/concurrency/thread_pool.cpp:341:9: mongo::ThreadPool::_doOneTask(std::unique_lock<std::mutex>*)
 /mnt/home/vlad/mongodb/github/mongo/src/mongo/util/concurrency/thread_pool.cpp:294:19: mongo::ThreadPool::_consumeTasks()
 /mnt/home/vlad/mongodb/github/mongo/src/mongo/util/concurrency/thread_pool.cpp:247:24: mongo::ThreadPool::_workerThreadBody(mongo::ThreadPool*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
 ??:0:0: ??
 ??:0:0: start_thread
 /build/glibc-OTsEL5/glibc-2.27/misc/../sysdeps/unix/sysv/linux/x86_64/clone.S:95:0: clone
[js_test:initsync_fuzzer-cacf2-1557881965671-05] 2019-05-15T16:43:12.176-0400 d34588| -----  END BACKTRACE  -----

Sprint: Repl 2019-06-03, Repl 2019-06-17, Repl 2019-07-01
Participants:
Linked BF Score: 8

 Description   

When an update fails to get applied due to missing documents, we fetch the missing document using the oplog entry to figure out what document to retrieve.

However, the oplog entry for the update in this case is the commitTransaction command. Currently, we assert that the oplog entry can only be an insert, delete or update. We thus hit this invariant when we try and retrieve the missing doc.



 Comments   
Comment by Githook User [ 27/Jun/19 ]

Author:

{'name': 'Pavi Vetriselvan', 'email': 'pvselvan@umich.edu', 'username': 'pvselvan'}

Message: SERVER-41163 apply operations separately during initial sync commit transaction oplog application

(cherry picked from commit d3c0e4ad46fcba5aac61ecec1409e9df6e11f66e)
Branch: v4.2
https://github.com/mongodb/mongo/commit/fc489b8a343793ec50133af5f0f46942094acdd8

Comment by Githook User [ 19/Jun/19 ]

Author:

{'name': 'Pavi Vetriselvan', 'email': 'pvselvan@umich.edu', 'username': 'pvselvan'}

Message: SERVER-41163 apply operations separately during initial sync commit transaction oplog application
Branch: master
https://github.com/mongodb/mongo/commit/d3c0e4ad46fcba5aac61ecec1409e9df6e11f66e

Comment by Judah Schvimer [ 17/May/19 ]

prepareTransaction oplog entries need to be applied in their own batch in a single write unit of work so it doesn't make sense to extract and parallelize them. commitTransaction oplog entries for prepared transactions during steady state replication don't need to apply the operations since they're already in "prepare". In initial sync when we apply commitTransaction without going through prepare, we can do "extract and parallelize" safely though. In steady state replication it is also important to timestamp the transaction writes with the commitTimestamp, and not the commitTransaction oplog entry timestamp, which could be easy to mess up in the "extract and parallelize" approach. While initial sync doesn't really care about timestamping, we should still make a minimal effort to get it right (or document that we're not timestamping correctly during initial sync) in case we do ever care. We could also consider doing "extract and parallelize" for commitTransaction oplog entries in replication recovery, but it would be complicated by SERVER-41178, so I think we should restrict this change to initial sync.

I think in creating the implementation plan it was seen as a short-cut to not "extract and parallelize" commitTransaction oplog entries during initial sync and replication recovery, and then we never got around to optimizing it. Little did we know it wasn't just an optimization.

This functionality was implemented in SERVER-36491 and SERVER-35879, so we may be able to remove some code there (though maybe not if we still need it for replication recovery).

Comment by Pavithra Vetriselvan [ 16/May/19 ]

judah.schvimer, I agree! If I understand correctly, the functionality to extract applyOps operations from commit oplog entries exists here. It seems like we can make this work for commit oplog entries as part of a prepared transactions by modifying this function.

Is there a reason we chose not to do this for prepare?

Comment by Judah Schvimer [ 15/May/19 ]

I think we should be unpacking these commitTransaction oplog entries during initial sync into their own writer threads. In that case the writer threads will see the entry as an update and not as a commitTransaction. This would also be more performant. We currently apply prepared transaction commitTransaction oplog entries in their own batch.

Generated at Thu Feb 08 04:56:59 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.