Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-20922

Double call to stepdown can result in invariant failure

    • Type: Icon: Bug Bug
    • Resolution: Done
    • Priority: Icon: Major - P3 Major - P3
    • 3.2.0-rc1
    • Affects Version/s: None
    • Component/s: Replication
    • Labels:
      None
    • Fully Compatible
    • ALL
    • 0

      I attached logs of a run of sharding/auth.js that demonstrates this problem.

      The test fails after triggering this invariant:

      [js_test:auth] 2015-10-13T18:48:09.579-0400 d20265| 2015-10-13T18:48:09.576-0400 I -        [replExecDBWorker-0] Invariant failure !_isSignaled src/mongo/db/repl/replication_executor.cpp 528
      [js_test:auth] 2015-10-13T18:48:09.579-0400 d20265| 2015-10-13T18:48:09.576-0400 I SHARDING [conn21] MigrateFromStatus::done About to acquire global lock to exit critical section
      [js_test:auth] 2015-10-13T18:48:09.579-0400 d20265| 2015-10-13T18:48:09.576-0400 I -        [replExecDBWorker-0]
      [js_test:auth] 2015-10-13T18:48:09.579-0400 d20265|
      [js_test:auth] 2015-10-13T18:48:09.579-0400 d20265| ***aborting after invariant() failure
      [js_test:auth] 2015-10-13T18:48:09.579-0400 d20265|
      [js_test:auth] 2015-10-13T18:48:09.579-0400 d20265|
      [js_test:auth] 2015-10-13T18:48:09.580-0400 d20265| 2015-10-13T18:48:09.576-0400 I NETWORK  [conn23] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [127.0.0.1:55080]
      [js_test:auth] 2015-10-13T18:48:09.586-0400 d20266| 2015-10-13T18:48:09.586-0400 I ACCESS   [conn14] Successfully authenticated as principal __system on local
      [js_test:auth] 2015-10-13T18:48:09.591-0400 d20265| 2015-10-13T18:48:09.591-0400 F -        [replExecDBWorker-0] Got signal: 6 (Aborted).
      [js_test:auth] 2015-10-13T18:48:09.591-0400 d20265|
      [js_test:auth] 2015-10-13T18:48:09.591-0400 d20265|  0x12f6b82 0x12f5cd9 0x12f64f2 0x7f59ab51d340 0x7f59ab17ecc9 0x7f59ab1820d8 0x1283c2b 0xf0eed8 0xf0fff9 0xf10378 0xf0a88b 0xf10e46 0xf0f322 0xf1458f 0xe75cae 0xf4278a 0xf434bc 0x128e1d0 0x128eb59 0x128f6b0 0x7f59ab9f2a40 0x7f59ab515182 0x7f59ab24247d
      [js_test:auth] 2015-10-13T18:48:09.591-0400 d20265| ----- BEGIN BACKTRACE -----
      [js_test:auth] 2015-10-13T18:48:09.592-0400 d20265| {"backtrace":[{"b":"400000","o":"EF6B82"},{"b":"400000","o":"EF5CD9"},{"b":"400000","o":"EF64F2"},{"b":"7F59AB50D000","o":"10340"},{"b":"7F59AB148000","o":"36CC9"},{"b":"7F59AB148000","o":"3A0D8"},{"b":"400000","o":"E83C2B"},{"b":"400000","o":"B0EED8"},{"b":"400000","o":"B0FFF9"},{"b":"400000","o":"B10378"},{"b":"400000","o":"B0A88B"},{"b":"400000","o":"B10E46"},{"b":"400000","o":"B0F322"},{"b":"400000","o":"B1458F"},{"b":"400000","o":"A75CAE"},{"b":"400000","o":"B4278A"},{"b":"400000","o":"B434BC"},{"b":"400000","o":"E8E1D0"},{"b":"400000","o":"E8EB59"},{"b":"400000","o":"E8F6B0"},{"b":"7F59AB941000","o":"B1A40"},{"b":"7F59AB50D000","o":"8182"},{"b":"7F59AB148000","o":"FA47D"}],"processInfo":{ "mongodbVersion" : "3.1.10-pre-", "gitVersion" : "9c9100212f7f8f3afb5f240d405f853894c376f1", "compiledModules" : [ "subscription" ], "uname" : { "sysname" : "Linux", "release" : "3.13.0-58-generic", "version" : "#97-Ubuntu SMP Wed Jul 8 02:56:15 UTC 2015", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "94E34C403A93DC7FF58BE81A5D2678B6A7DCDB0C" }, { "b" : "7FFFF7F70000", "elfType" : 3, "buildId" : "083C85C3A0476C2B9FEDD2C9D02100E02ABCA8EB" }, { "b" : "7F59AD131000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "C1AE4CB7195D337A77A3C689051DABAA3980CA0C" }, { "b" : "7F59ACEC8000", "path" : "/usr/lib/x86_64-linux-gnu/libnetsnmpagent.so.30", "elfType" : 3, "buildId" : "96C16FDBBA28C6635657AFDBAF0F5A1090072474" }, { "b" : "7F59ACBEE000", "path" : "/usr/lib/x86_64-linux-gnu/libnetsnmp.so.30", "elfType" : 3, "buildId" : "61AE85EF50A072D671D55B4776383F8365A3FAA7" }, { "b" : "7F59AC813000", "path" : "/lib/x86_64-linux-gnu/libcrypto.so.1.0.0", "elfType" : 3, "buildId" : "F000D29917E9B6E94A35A8F02E5C62846E5916BC" }, { "b" : "7F59AC5F8000", "path" : "/usr/lib/x86_64-linux-gnu/libsasl2.so.2", "elfType" : 3, "buildId" : "666B276BD134B0E9579B67D4EE333F2D0FB813CD" }, { "b" : "7F59AC3B2000", "path" : "/usr/lib/x86_64-linux-gnu/libgssapi_krb5.so.2", "elfType" : 3, "buildId" : "EEF6871CB80C0344DF907DD8B0D8C90A0B57D4F0" }, { "b" : "7F59AC0AC000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3, "buildId" : "1D76B71E905CB867B27CEF230FCB20F01A3178F5" }, { "b" : "7F59ABE4D000", "path" : "/lib/x86_64-linux-gnu/libssl.so.1.0.0", "elfType" : 3, "buildId" : "A20EFFEC993A8441FA17F2079F923CBD04079E19" }, { "b" : "7F59ABC45000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "92FCF41EFE012D6186E31A59AD05BDBB487769AB" }, { "b" : "7F59AB941000", "path" : "/usr/lib/x86_64-linux-gnu/libstdc++.so.6", "elfType" : 3, "buildId" : "4BF6F7ADD8244AD86008E6BF40D90F8873892197" }, { "b" : "7F59AB72B000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3, "buildId" : "8D0AA71411580EE6C08809695C3984769F25725B" }, { "b" : "7F59AB50D000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3, "buildId" : "9318E8AF0BFBE444731BB0461202EF57F7C39542" }, { "b" : "7F59AB148000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3, "buildId" : "30C94DC66A1FE95180C3D68D2B89E576D5AE213C" }, { "b" : "7F59AD335000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "9F00581AB3C73E3AEA35995A0C50D24D59A01D47" }, { "b" : "7F59AAF3E000", "path" : "/lib/x86_64-linux-gnu/libwrap.so.0", "elfType" : 3, "buildId" : "54FCBC5B0F994A13A9B0EAD46F23E7DA7F7FE75B" }, { "b" : "7F59AABB5000", "path" : "/usr/lib/libperl.so.5.18", "elfType" : 3, "buildId" : "FA1AD13C56D51B69C9558106D8DAF14730B4C14F" }, { "b" : "7F59AA8EA000", "path" : "/usr/lib/x86_64-linux-gnu/libkrb5.so.3", "elfType" : 3, "buildId" : "00EB4BEB543C549A411AF26E8C2B4BBA806F57AE" }, { "b" : "7F59AA6BB000", "path" : "/usr/lib/x86_64-linux-gnu/libk5crypto.so.3", "elfType" : 3, "buildId" : "3057CF4B96D55B1CD2C3681B2A1F75279F66F225" }, { "b" : "7F59AA4B7000", "path" : "/lib/x86_64-linux-gnu/libcom_err.so.2", "elfType" : 3, "buildId" : "04BF7D9BE17AC2A5F7121B246488932718870207" }, { "b" : "7F59AA2AC000", "path" : "/usr/lib/x86_64-linux-gnu/libkrb5support.so.0", "elfType" : 3, "buildId" : "86E28C0FCF8D7DAEB9AE77A7C7930F4C2B78A64E" }, { "b" : "7F59AA092000", "path" : "/lib/x86_64-linux-gnu/libnsl.so.1", "elfType" : 3, "buildId" : "497315006FCA1547A16E644FB7FEBA8BD2FAB054" }, { "b" : "7F59A9E59000", "path" : "/lib/x86_64-linux-gnu/libcrypt.so.1", "elfType" : 3, "buildId" : "1B0F2710E989E9A581C257DFFDC90085D0E1348A" }, { "b" : "7F59A9C55000", "path" : "/lib/x86_64-linux-gnu/libkeyutils.so.1", "elfType" : 3, "buildId" : "0F03635F97B93D3DACD84F0ED363C56BD266044F" }, { "b" : "7F59A9A3A000", "path" : "/lib/x86_64-linux-gnu/libresolv.so.2", "elfType" : 3, "buildId" : "616683BCFD8626F176EDA99B6A5D4D2C57996590" } ] }}
      [js_test:auth] 2015-10-13T18:48:09.593-0400 d20265|  mongod(_ZN5mongo15printStackTraceERSo+0x32) [0x12f6b82]
      [js_test:auth] 2015-10-13T18:48:09.593-0400 d20265|  mongod(+0xEF5CD9) [0x12f5cd9]
      [js_test:auth] 2015-10-13T18:48:09.593-0400 d20265|  mongod(+0xEF64F2) [0x12f64f2]
      [js_test:auth] 2015-10-13T18:48:09.593-0400 d20265|  libpthread.so.0(+0x10340) [0x7f59ab51d340]
      [js_test:auth] 2015-10-13T18:48:09.593-0400 d20265|  libc.so.6(gsignal+0x39) [0x7f59ab17ecc9]
      [js_test:auth] 2015-10-13T18:48:09.593-0400 d20265|  libc.so.6(abort+0x148) [0x7f59ab1820d8]
      [js_test:auth] 2015-10-13T18:48:09.593-0400 d20265|  mongod(_ZN5mongo15invariantFailedEPKcS1_j+0xCB) [0x1283c2b]
      [js_test:auth] 2015-10-13T18:48:09.593-0400 d20265|  mongod(+0xB0EED8) [0xf0eed8]
      [js_test:auth] 2015-10-13T18:48:09.593-0400 d20265|  mongod(_ZN5mongo4repl19ReplicationExecutor18signalEvent_inlockERKNS_8executor12TaskExecutor11EventHandleE+0x19) [0xf0fff9]
      [js_test:auth] 2015-10-13T18:48:09.593-0400 d20265|  mongod(_ZN5mongo4repl19ReplicationExecutor11signalEventERKNS_8executor12TaskExecutor11EventHandleE+0x38) [0xf10378]
      [js_test:auth] 2015-10-13T18:48:09.593-0400 d20265|  mongod(_ZN5mongo4repl26ReplicationCoordinatorImpl15_stepDownFinishERKNS_8executor12TaskExecutor12CallbackArgsE+0xCB) [0xf0a88b]
      [js_test:auth] 2015-10-13T18:48:09.594-0400 d20265|  mongod(_ZN5mongo4repl19ReplicationExecutor12_doOperationEPNS_16OperationContextERKNS_6StatusERKNS_8executor12TaskExecutor14CallbackHandleEPSt4listINS1_8WorkItemESaISD_EEPSt5mutex+0x206) [0xf10e46]
      [js_test:auth] 2015-10-13T18:48:09.594-0400 d20265|  mongod(+0xB0F322) [0xf0f322]
      [js_test:auth] 2015-10-13T18:48:09.594-0400 d20265|  mongod(+0xB1458F) [0xf1458f]
      [js_test:auth] 2015-10-13T18:48:09.594-0400 d20265|  mongod(+0xA75CAE) [0xe75cae]
      [js_test:auth] 2015-10-13T18:48:09.594-0400 d20265|  mongod(+0xB4278A) [0xf4278a]
      [js_test:auth] 2015-10-13T18:48:09.594-0400 d20265|  mongod(_ZN5mongo4repl10TaskRunner9_runTasksEv+0x9C) [0xf434bc]
      [js_test:auth] 2015-10-13T18:48:09.594-0400 d20265|  mongod(_ZN5mongo10ThreadPool10_doOneTaskEPSt11unique_lockISt5mutexE+0x130) [0x128e1d0]
      [js_test:auth] 2015-10-13T18:48:09.594-0400 d20265|  mongod(_ZN5mongo10ThreadPool13_consumeTasksEv+0xA9) [0x128eb59]
      [js_test:auth] 2015-10-13T18:48:09.594-0400 d20265|  mongod(_ZN5mongo10ThreadPool17_workerThreadBodyEPS0_RKSs+0x100) [0x128f6b0]
      [js_test:auth] 2015-10-13T18:48:09.594-0400 d20265|  libstdc++.so.6(+0xB1A40) [0x7f59ab9f2a40]
      [js_test:auth] 2015-10-13T18:48:09.594-0400 d20265|  libpthread.so.0(+0x8182) [0x7f59ab515182]
      [js_test:auth] 2015-10-13T18:48:09.595-0400 d20265|  libc.so.6(clone+0x6D) [0x7f59ab24247d]
      [js_test:auth] 2015-10-13T18:48:09.595-0400 d20265| -----  END BACKTRACE  -----
      

      Earlier in the log you see these lines:

      [js_test:auth] 2015-10-13T18:48:06.601-0400 d20265| 2015-10-13T18:48:06.601-0400 I REPL     [ReplicationExecutor] Stepping down from primary in response to heartbeat
      [js_test:auth] 2015-10-13T18:48:07.563-0400 d20265| 2015-10-13T18:48:07.563-0400 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun
      

      The problem seems to be that a new primary is elected, then the previous primary gets a heartbeat response from it. The heartbeat response tells it to step itself down, and also includes a new term, so we now call _stepDownStart() twice, which schedules two different stepdown tasks, causing the second one to signal the already-signaled _stepDownFinishEvent

            Assignee:
            benety.goh@mongodb.com Benety Goh
            Reporter:
            spencer@mongodb.com Spencer Brody (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

              Created:
              Updated:
              Resolved: