  WiredTiger / WT-7913

Secondary sharded members in a crash loop


    Details

    • Type: Task
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Case:

      Description

      Links

      This is an on-prem deployment, so no direct links are available. Attaching data from the affected node.

      Status

      MongoDB 4.2.7, a 2-shard cluster with 9 members per shard, managed by Ops Manager. Only 2 of the 9 members are affected; the other 7 shard members are unaffected. The same crash has been seen on different nodes each time.

      Issue Description

      What is the user seeing?

      Periodic crashes on mongod processes. Elapsed time between crashes varies from a few minutes to a few hours.

      Where is it happening?

      Secondary members are crashing frequently.

      When is it happening (timeline of events)

      It is happening regularly. One or two crashes appear to coincide, but most are independent of each other.

      How much/many/long

      Repeated crashes, but not necessarily a crash loop. Nodes run for a few minutes before crashing again.

      Diagnostics and Hypotheses

      Diagnostics

      There’s not much activity in the logs. From FTDC, I noticed that the transaction reaper and session collection jobs run for a relatively long time (~4 seconds) right after the restarts, but this could be an effect of the restart itself rather than related to the cause. Checked and confirmed that the hosts are not running Imperva or Guardian.

      As for the crash itself, it seems to be happening in tcmalloc while freeing memory during deletes. Symbolized stack trace:

      INFO:symbolize:found symbols file in local cache: /Users/veerareddyk/.mongosymb.cache/ed9725c6fd810c1429deded7305a981b98ade3b5.debug
      INFO:symbolize:found symbolizer in local cache: /Users/veerareddyk/.mongosymb.cache/mongosymb.py
      INFO:symbolize:detected symbolizer interpreter: python3
      INFO:symbolize:running symbolizer: /usr/bin/env python3 /Users/veerareddyk/.mongosymb.cache/mongosymb.py --symbolizer-path=/Users/veerareddyk/Backtrace/triage-scripts-master/mongosymb/llvm-symbolizer/llvm-symbolizer-macos /Users/veerareddyk/.mongosymb.cache/ed9725c6fd810c1429deded7305a981b98ade3b5.debug
       /data/mci/cfacd41feaf002ff024d2dae9624113a/src/src/mongo/util/stacktrace_posix.cpp:174:39: mongo::printStackTrace(std::ostream&)
       /data/mci/cfacd41feaf002ff024d2dae9624113a/src/src/mongo/util/signal_handlers_synchronous.cpp:184:20: mongo::(anonymous namespace)::printSignalAndBacktrace(int)
       /data/mci/cfacd41feaf002ff024d2dae9624113a/src/src/mongo/util/signal_handlers_synchronous.cpp:287:28: mongo::(anonymous namespace)::abruptQuitWithAddrSignal(int, siginfo_t*, void*)
       ??:0:0: ??
       

      Symbolized stack trace from the latest crash:

      INFO:symbolize:found symbols file in local cache: /Users/veerareddyk/.mongosymb.cache/ed9725c6fd810c1429deded7305a981b98ade3b5.debug
      INFO:symbolize:found symbolizer in local cache: /Users/veerareddyk/.mongosymb.cache/mongosymb.py
      INFO:symbolize:detected symbolizer interpreter: python3
      INFO:symbolize:running symbolizer: /usr/bin/env python3 /Users/veerareddyk/.mongosymb.cache/mongosymb.py --symbolizer-path=/Users/veerareddyk/Backtrace/triage-scripts-master/mongosymb/llvm-symbolizer/llvm-symbolizer-macos /Users/veerareddyk/.mongosymb.cache/ed9725c6fd810c1429deded7305a981b98ade3b5.debug
       /data/mci/cfacd41feaf002ff024d2dae9624113a/src/src/mongo/util/stacktrace_posix.cpp:174:39: mongo::printStackTrace(std::ostream&)
       /data/mci/cfacd41feaf002ff024d2dae9624113a/src/src/mongo/util/signal_handlers_synchronous.cpp:184:20: mongo::(anonymous namespace)::printSignalAndBacktrace(int)
       /data/mci/cfacd41feaf002ff024d2dae9624113a/src/src/mongo/util/signal_handlers_synchronous.cpp:287:28: mongo::(anonymous namespace)::abruptQuitWithAddrSignal(int, siginfo_t*, void*)
       ??:0:0: ??
       /data/mci/cfacd41feaf002ff024d2dae9624113a/src/src/third_party/gperftools-2.7/dist/src/linked_list.h:87:3: tcmalloc::SLL_PopRange(void**, int, void**, void**)
       /data/mci/cfacd41feaf002ff024d2dae9624113a/src/src/third_party/gperftools-2.7/dist/src/thread_cache.h:238:19: tcmalloc::ThreadCache::FreeList::PopRange(int, void**, void**)
       /data/mci/cfacd41feaf002ff024d2dae9624113a/src/src/third_party/gperftools-2.7/dist/src/thread_cache.cc:206:16: tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned int, int)
       /data/mci/cfacd41feaf002ff024d2dae9624113a/src/src/third_party/gperftools-2.7/dist/src/thread_cache.cc:164:24: tcmalloc::ThreadCache::ListTooLong(tcmalloc::ThreadCache::FreeList*, unsigned int)
       /data/mci/cfacd41feaf002ff024d2dae9624113a/src/src/third_party/wiredtiger/src/block/block_ext.c:1298:5: __wt_block_extlist_free
       /data/mci/cfacd41feaf002ff024d2dae9624113a/src/src/third_party/wiredtiger/src/block/block_ckpt.c:936:5: __wt_block_checkpoint_resolve
       /data/mci/cfacd41feaf002ff024d2dae9624113a/src/src/third_party/wiredtiger/src/meta/meta_track.c:144:9: __meta_track_apply
       /data/mci/cfacd41feaf002ff024d2dae9624113a/src/src/third_party/wiredtiger/src/meta/meta_track.c:310:13: __wt_meta_track_off
       /data/mci/cfacd41feaf002ff024d2dae9624113a/src/src/third_party/wiredtiger/src/txn/txn_ckpt.c:979:9: __txn_checkpoint
       /data/mci/cfacd41feaf002ff024d2dae9624113a/src/src/third_party/wiredtiger/src/txn/txn_ckpt.c:1041:11: __txn_checkpoint_wrapper
       /data/mci/cfacd41feaf002ff024d2dae9624113a/src/src/third_party/wiredtiger/src/txn/txn_ckpt.c:1097:9: __wt_txn_checkpoint
       /data/mci/cfacd41feaf002ff024d2dae9624113a/src/src/third_party/wiredtiger/src/session/session_api.c:1956:11: __session_checkpoint.cold.50
       /data/mci/cfacd41feaf002ff024d2dae9624113a/src/src/mongo/db/storage/wiredtiger/wiredtiger_kv_engine.cpp:375:21: mongo::WiredTigerKVEngine::WiredTigerCheckpointThread::run()
       /data/mci/cfacd41feaf002ff024d2dae9624113a/src/src/mongo/util/background.cpp:151:8: mongo::BackgroundJob::jobBody()
       /data/mci/cfacd41feaf002ff024d2dae9624113a/src/src/mongo/util/background.cpp:177:38: operator()
       /opt/mongodbtoolchain/revisions/e5348beb43e147b74a40f4ca5fb05a330ea646cf/stow/gcc-v3.wVo/include/c++/8.2.0/bits/invoke.h:60:36: __invoke_impl<void, mongo::BackgroundJob::go()::<lambda()> >
       /opt/mongodbtoolchain/revisions/e5348beb43e147b74a40f4ca5fb05a330ea646cf/stow/gcc-v3.wVo/include/c++/8.2.0/bits/invoke.h:95:40: __invoke<mongo::BackgroundJob::go()::<lambda()> >
       /opt/mongodbtoolchain/revisions/e5348beb43e147b74a40f4ca5fb05a330ea646cf/stow/gcc-v3.wVo/include/c++/8.2.0/tuple:1678:27: __apply_impl<mongo::BackgroundJob::go()::<lambda()>, std::tuple<> >
       /opt/mongodbtoolchain/revisions/e5348beb43e147b74a40f4ca5fb05a330ea646cf/stow/gcc-v3.wVo/include/c++/8.2.0/tuple:1687:31: apply<mongo::BackgroundJob::go()::<lambda()>, std::tuple<> >
       /data/mci/cfacd41feaf002ff024d2dae9624113a/src/src/mongo/stdx/thread.h:172:36: operator()
       /opt/mongodbtoolchain/revisions/e5348beb43e147b74a40f4ca5fb05a330ea646cf/stow/gcc-v3.wVo/include/c++/8.2.0/bits/invoke.h:60:36: __invoke_impl<void, mongo::stdx::thread::thread(Function&&, Args&& ...) [with Function = mongo::BackgroundJob::go()::<lambda()>; Args = {}; typename std::enable_if<(! std::is_same<mongo::stdx::thread, typename std::decay<_Tp>::type>::value), int>::type <anonymous> = 0]::<lambda()> >
       /opt/mongodbtoolchain/revisions/e5348beb43e147b74a40f4ca5fb05a330ea646cf/stow/gcc-v3.wVo/include/c++/8.2.0/bits/invoke.h:95:40: __invoke<mongo::stdx::thread::thread(Function&&, Args&& ...) [with Function = mongo::BackgroundJob::go()::<lambda()>; Args = {}; typename std::enable_if<(! std::is_same<mongo::stdx::thread, typename std::decay<_Tp>::type>::value), int>::type <anonymous> = 0]::<lambda()> >
       /opt/mongodbtoolchain/revisions/e5348beb43e147b74a40f4ca5fb05a330ea646cf/stow/gcc-v3.wVo/include/c++/8.2.0/thread:234:26: _M_invoke<0>
       /opt/mongodbtoolchain/revisions/e5348beb43e147b74a40f4ca5fb05a330ea646cf/stow/gcc-v3.wVo/include/c++/8.2.0/thread:243:31: operator()
       /opt/mongodbtoolchain/revisions/e5348beb43e147b74a40f4ca5fb05a330ea646cf/stow/gcc-v3.wVo/include/c++/8.2.0/thread:186:13: std::thread::_State_impl<std::thread::_Invoker<std::tuple<mongo::stdx::thread::thread<mongo::BackgroundJob::go()::'lambda0'(), 0>(mongo::BackgroundJob::go()::'lambda0'()&&)::'lambda'()> > >::_M_run()
       /data/mci/ed40764efa33e3521ac4762d5c74c991/toolchain-builder/tmp/build-gcc-v3.sh-TRH/build/x86_64-mongodb-linux/libstdc++-v3/src/c++11/../../../../../src/combined/libstdc++-v3/src/c++11/thread.cc:80:18: execute_native_thread_routine
       ??:0:0: ??
       ??:0:0: ??
       

      Hypotheses

      Could this be related to https://github.com/gperftools/gperftools/issues/1036 and/or SERVER-57306 or WT-6366?

      Questions for the Server Team

      • Why are the crashes happening on different nodes, with different errors each time?
      • Why only on secondary nodes?
      • Do you believe the other nodes, including the primary, are in danger of entering a similar state?
      • How can the affected members be recovered from this state? (Proposed workarounds?)

        Attachments

          Activity

            People

            Assignee:
             Unassigned
             Reporter:
             VeeraReddy Konatham (veerareddy.konatham)
             Votes:
             0
             Watchers:
             2
