Type: Task
Resolution: Duplicate
Priority: Major - P3
Affects Version/s: None
Component/s: None
(copied to CRM)
Links
This is an on-prem deployment, so no direct links are available. Attaching data from the affected node.
Status
MongoDB 4.2.7, 2-shard cluster with 9 members per shard, managed by Ops Manager. Only 2 of the 9 shard members are affected; the other 7 are unaffected. The same crash has been seen on different nodes each time.
Issue Description
What is the user seeing?
Periodic crashes on mongod processes. Elapsed time between crashes varies from a few minutes to a few hours.
Where is it happening?
Secondary members are crashing frequently.
When is it happening (timeline of events)
It is happening regularly. One or two crashes seem to align but most are independent of each other.
How much/many/long
Repeated crashes, but not necessarily a crash loop. Nodes run for a few minutes before crashing again.
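The "one or two crashes seem to align but most are independent" observation above can be made quantitative. A minimal sketch (not part of the ticket's tooling; node names and the 5-minute window are assumptions) that takes crash timestamps collected from each node's mongod log and reports which crashes on different nodes fall inside a shared time window:

```python
from datetime import datetime, timedelta

def correlated_crashes(crashes_by_node, window=timedelta(minutes=5)):
    """crashes_by_node: dict of node name -> list of datetime crash times.

    Returns (node_a, time_a, node_b, time_b) tuples for crashes on
    *different* nodes that occurred within `window` of each other.
    """
    # Flatten to a single time-ordered event stream.
    events = sorted(
        (t, node) for node, times in crashes_by_node.items() for t in times
    )
    pairs = []
    for i, (ta, na) in enumerate(events):
        for tb, nb in events[i + 1:]:
            if tb - ta > window:
                break  # events are sorted; later ones are even further apart
            if na != nb:
                pairs.append((na, ta, nb, tb))
    return pairs

# Hypothetical example data: two crashes that align, one that stands alone.
crashes = {
    "node-a": [datetime(2021, 6, 1, 10, 0), datetime(2021, 6, 1, 14, 30)],
    "node-b": [datetime(2021, 6, 1, 10, 2)],
}
pairs = correlated_crashes(crashes)
# → one correlated pair (node-a 10:00, node-b 10:02); the 14:30 crash is independent
```

If most crashes produce no pairs, that supports the "independent of each other" reading; a burst of pairs would instead point at a shared external trigger.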
Diagnostics and Hypotheses
Diagnostics
There’s not much activity in the logs. From FTDC, I noticed that the transaction reaper and session collection jobs run for a relatively long time (~4 seconds) right after the restarts, but this could be a side effect of the restart itself rather than related to the cause. Checked and confirmed that the hosts are not running Imperva or Guardian agents.
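One way to back up the "no Imperva/Guardian" check: such agents typically work by injecting a shared library into the target process, so on Linux the libraries actually mapped into a running mongod can be listed from /proc/&lt;pid&gt;/maps, which is more reliable than checking whether an agent service is installed. A hedged sketch (the system-path allowlist is an assumption, not an exhaustive list):

```python
def mapped_libraries(pid="self"):
    """Return the sorted shared-object paths mapped into process `pid`.

    Returns an empty list on platforms without /proc (e.g. macOS) or if
    the maps file cannot be read.
    """
    libs = set()
    try:
        with open(f"/proc/{pid}/maps") as f:
            for line in f:
                fields = line.split()
                # A maps line has a 6th field only when backed by a file.
                path = fields[-1] if len(fields) >= 6 else ""
                if path.endswith(".so") or ".so." in path:
                    libs.add(path)
    except OSError:
        pass
    return sorted(libs)

def suspicious(libs, expected_prefixes=("/lib", "/lib64", "/usr/lib", "/usr/lib64")):
    """Anything mapped from outside the system library paths deserves a look."""
    return [lib for lib in libs if not lib.startswith(expected_prefixes)]
```

Running `suspicious(mapped_libraries(mongod_pid))` on an affected node would surface any injected agent library by its non-system path.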
As for the crash itself, it appears to happen inside tcmalloc while memory is being freed during deletes. Symbolized stack trace:
INFO:symbolize:found symbols file in local cache: /Users/veerareddyk/.mongosymb.cache/ed9725c6fd810c1429deded7305a981b98ade3b5.debug
INFO:symbolize:found symbolizer in local cache: /Users/veerareddyk/.mongosymb.cache/mongosymb.py
INFO:symbolize:detected symbolizer interpreter: python3
INFO:symbolize:running symbolizer: /usr/bin/env python3 /Users/veerareddyk/.mongosymb.cache/mongosymb.py --symbolizer-path=/Users/veerareddyk/Backtrace/triage-scripts-master/mongosymb/llvm-symbolizer/llvm-symbolizer-macos /Users/veerareddyk/.mongosymb.cache/ed9725c6fd810c1429deded7305a981b98ade3b5.debug
/data/mci/cfacd41feaf002ff024d2dae9624113a/src/src/mongo/util/stacktrace_posix.cpp:174:39: mongo::printStackTrace(std::ostream&)
/data/mci/cfacd41feaf002ff024d2dae9624113a/src/src/mongo/util/signal_handlers_synchronous.cpp:184:20: mongo::(anonymous namespace)::printSignalAndBacktrace(int)
/data/mci/cfacd41feaf002ff024d2dae9624113a/src/src/mongo/util/signal_handlers_synchronous.cpp:287:28: mongo::(anonymous namespace)::abruptQuitWithAddrSignal(int, siginfo_t*, void*)
??:0:0: ??
Latest crash, symbolized stack trace:
INFO:symbolize:found symbols file in local cache: /Users/veerareddyk/.mongosymb.cache/ed9725c6fd810c1429deded7305a981b98ade3b5.debug
INFO:symbolize:found symbolizer in local cache: /Users/veerareddyk/.mongosymb.cache/mongosymb.py
INFO:symbolize:detected symbolizer interpreter: python3
INFO:symbolize:running symbolizer: /usr/bin/env python3 /Users/veerareddyk/.mongosymb.cache/mongosymb.py --symbolizer-path=/Users/veerareddyk/Backtrace/triage-scripts-master/mongosymb/llvm-symbolizer/llvm-symbolizer-macos /Users/veerareddyk/.mongosymb.cache/ed9725c6fd810c1429deded7305a981b98ade3b5.debug
/data/mci/cfacd41feaf002ff024d2dae9624113a/src/src/mongo/util/stacktrace_posix.cpp:174:39: mongo::printStackTrace(std::ostream&)
/data/mci/cfacd41feaf002ff024d2dae9624113a/src/src/mongo/util/signal_handlers_synchronous.cpp:184:20: mongo::(anonymous namespace)::printSignalAndBacktrace(int)
/data/mci/cfacd41feaf002ff024d2dae9624113a/src/src/mongo/util/signal_handlers_synchronous.cpp:287:28: mongo::(anonymous namespace)::abruptQuitWithAddrSignal(int, siginfo_t*, void*)
??:0:0: ??
/data/mci/cfacd41feaf002ff024d2dae9624113a/src/src/third_party/gperftools-2.7/dist/src/linked_list.h:87:3: tcmalloc::SLL_PopRange(void**, int, void**, void**)
/data/mci/cfacd41feaf002ff024d2dae9624113a/src/src/third_party/gperftools-2.7/dist/src/thread_cache.h:238:19: tcmalloc::ThreadCache::FreeList::PopRange(int, void**, void**)
/data/mci/cfacd41feaf002ff024d2dae9624113a/src/src/third_party/gperftools-2.7/dist/src/thread_cache.cc:206:16: tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned int, int)
/data/mci/cfacd41feaf002ff024d2dae9624113a/src/src/third_party/gperftools-2.7/dist/src/thread_cache.cc:164:24: tcmalloc::ThreadCache::ListTooLong(tcmalloc::ThreadCache::FreeList*, unsigned int)
/data/mci/cfacd41feaf002ff024d2dae9624113a/src/src/third_party/wiredtiger/src/block/block_ext.c:1298:5: __wt_block_extlist_free
/data/mci/cfacd41feaf002ff024d2dae9624113a/src/src/third_party/wiredtiger/src/block/block_ckpt.c:936:5: __wt_block_checkpoint_resolve
/data/mci/cfacd41feaf002ff024d2dae9624113a/src/src/third_party/wiredtiger/src/meta/meta_track.c:144:9: __meta_track_apply
/data/mci/cfacd41feaf002ff024d2dae9624113a/src/src/third_party/wiredtiger/src/meta/meta_track.c:310:13: __wt_meta_track_off
/data/mci/cfacd41feaf002ff024d2dae9624113a/src/src/third_party/wiredtiger/src/txn/txn_ckpt.c:979:9: __txn_checkpoint
/data/mci/cfacd41feaf002ff024d2dae9624113a/src/src/third_party/wiredtiger/src/txn/txn_ckpt.c:1041:11: __txn_checkpoint_wrapper
/data/mci/cfacd41feaf002ff024d2dae9624113a/src/src/third_party/wiredtiger/src/txn/txn_ckpt.c:1097:9: __wt_txn_checkpoint
/data/mci/cfacd41feaf002ff024d2dae9624113a/src/src/third_party/wiredtiger/src/session/session_api.c:1956:11: __session_checkpoint.cold.50
/data/mci/cfacd41feaf002ff024d2dae9624113a/src/src/mongo/db/storage/wiredtiger/wiredtiger_kv_engine.cpp:375:21: mongo::WiredTigerKVEngine::WiredTigerCheckpointThread::run()
/data/mci/cfacd41feaf002ff024d2dae9624113a/src/src/mongo/util/background.cpp:151:8: mongo::BackgroundJob::jobBody()
/data/mci/cfacd41feaf002ff024d2dae9624113a/src/src/mongo/util/background.cpp:177:38: operator()
/opt/mongodbtoolchain/revisions/e5348beb43e147b74a40f4ca5fb05a330ea646cf/stow/gcc-v3.wVo/include/c++/8.2.0/bits/invoke.h:60:36: __invoke_impl<void, mongo::BackgroundJob::go()::<lambda()> >
/opt/mongodbtoolchain/revisions/e5348beb43e147b74a40f4ca5fb05a330ea646cf/stow/gcc-v3.wVo/include/c++/8.2.0/bits/invoke.h:95:40: __invoke<mongo::BackgroundJob::go()::<lambda()> >
/opt/mongodbtoolchain/revisions/e5348beb43e147b74a40f4ca5fb05a330ea646cf/stow/gcc-v3.wVo/include/c++/8.2.0/tuple:1678:27: __apply_impl<mongo::BackgroundJob::go()::<lambda()>, std::tuple<> >
/opt/mongodbtoolchain/revisions/e5348beb43e147b74a40f4ca5fb05a330ea646cf/stow/gcc-v3.wVo/include/c++/8.2.0/tuple:1687:31: apply<mongo::BackgroundJob::go()::<lambda()>, std::tuple<> >
/data/mci/cfacd41feaf002ff024d2dae9624113a/src/src/mongo/stdx/thread.h:172:36: operator()
/opt/mongodbtoolchain/revisions/e5348beb43e147b74a40f4ca5fb05a330ea646cf/stow/gcc-v3.wVo/include/c++/8.2.0/bits/invoke.h:60:36: __invoke_impl<void, mongo::stdx::thread::thread(Function&&, Args&& ...) [with Function = mongo::BackgroundJob::go()::<lambda()>; Args = {}; typename std::enable_if<(! std::is_same<mongo::stdx::thread, typename std::decay<_Tp>::type>::value), int>::type <anonymous> = 0]::<lambda()> >
/opt/mongodbtoolchain/revisions/e5348beb43e147b74a40f4ca5fb05a330ea646cf/stow/gcc-v3.wVo/include/c++/8.2.0/bits/invoke.h:95:40: __invoke<mongo::stdx::thread::thread(Function&&, Args&& ...) [with Function = mongo::BackgroundJob::go()::<lambda()>; Args = {}; typename std::enable_if<(! std::is_same<mongo::stdx::thread, typename std::decay<_Tp>::type>::value), int>::type <anonymous> = 0]::<lambda()> >
/opt/mongodbtoolchain/revisions/e5348beb43e147b74a40f4ca5fb05a330ea646cf/stow/gcc-v3.wVo/include/c++/8.2.0/thread:234:26: _M_invoke<0>
/opt/mongodbtoolchain/revisions/e5348beb43e147b74a40f4ca5fb05a330ea646cf/stow/gcc-v3.wVo/include/c++/8.2.0/thread:243:31: operator()
/opt/mongodbtoolchain/revisions/e5348beb43e147b74a40f4ca5fb05a330ea646cf/stow/gcc-v3.wVo/include/c++/8.2.0/thread:186:13: std::thread::_State_impl<std::thread::_Invoker<std::tuple<mongo::stdx::thread::thread<mongo::BackgroundJob::go()::'lambda0'(), 0>(mongo::BackgroundJob::go()::'lambda0'()&&)::'lambda'()> > >::_M_run()
/data/mci/ed40764efa33e3521ac4762d5c74c991/toolchain-builder/tmp/build-gcc-v3.sh-TRH/build/x86_64-mongodb-linux/libstdc++-v3/src/c++11/../../../../../src/combined/libstdc++-v3/src/c++11/thread.cc:80:18: execute_native_thread_routine
??:0:0: ??
??:0:0: ??
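For context on why the fault surfaces inside tcmalloc: in tcmalloc's singly-linked free lists, the first word of every free object stores the pointer to the next free object, so SLL_PopRange dereferences freed memory once per object it pops. A single stale write into a freed block (e.g. a use-after-free elsewhere) sends that walk to a garbage address, and the SIGSEGV then appears in the allocator frames above (here, while the WiredTiger checkpoint thread frees extent lists) even though the corruption happened earlier in unrelated code. A toy Python model of that mechanism (addresses as dict keys; this is an illustration, not tcmalloc's actual code):

```python
# "memory" maps an address to the value stored in a free object's first
# word, which is the address of the next free object (0 ends the list).
def sll_pop_range(memory, head, n):
    """Pop n objects starting at `head`; return (popped addresses, new head).

    Mirrors the shape of tcmalloc's SLL_PopRange: each step reads the
    next pointer *out of the freed object itself*.
    """
    popped = []
    cur = head
    for _ in range(n):
        popped.append(cur)
        cur = memory[cur]  # in real life this is where the SIGSEGV lands
    return popped, cur

memory = {0x1000: 0x2000, 0x2000: 0x3000, 0x3000: 0}
popped, head = sll_pop_range(memory, 0x1000, 2)
# → popped == [0x1000, 0x2000], head == 0x3000 (a healthy walk)

# Simulate corruption: another thread scribbles over a freed block,
# so the stored next pointer now aims at unmapped "memory".
memory[0x2000] = 0xDEADBEEF
```

After the corrupting write, `sll_pop_range(memory, 0x1000, 3)` raises a KeyError at the bogus address, the Python analogue of the crash inside SLL_PopRange: the faulting frame identifies the victim, not the culprit.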
Hypotheses
Could this be related to https://github.com/gperftools/gperftools/issues/1036, SERVER-57306, and/or WT-6366?
Questions for the Server Team
- Why are the crashes happening on different nodes, with a different error each time?
- Why only on secondary nodes?
- Do you believe there is a chance that the other nodes, including the primary, might enter a similar state?
- How can the affected members be recovered from this state? Are there proposed workarounds?