-
Type:
Bug
-
Resolution: Done
-
Priority:
Major - P3
-
Affects Version/s: 3.0.0-rc6, 3.0.0-rc7
-
Component/s: Replication
-
None
-
Fully Compatible
-
ALL
-
-
None
-
None
-
None
-
None
-
None
-
None
-
None
Sometimes a 3.0.0-rc7 mongod that is part of a replica set which is part of a sharded cluster will crash when it is restarted in order to disable auth.
I discovered this while running Automation's test suite. Although the mongod that crashes is part of a replica set, the crash does not appear to happen when the replica set is not part of a sharded cluster.
The crash also happens with 3.0.0-rc6.
The test that triggers it sets up a sharded cluster with auth enabled. Users are added. Then all processes are restarted with auth disabled. There is a partial order to the stopping and starting. Replica set members can stop and start immediately. Mongoses can stop immediately and can each start only after all other mongoses and config servers are either stopped or have restarted. Config servers can stop only after all mongoses have stopped. Config servers can each start after all other mongoses and config servers are either stopped or have restarted.
Here is the config file of a replica set member before it was restarted (with auth on):
net:
port: 9003
processManagement:
fork: "true"
replication:
oplogSizeMB: 64
replSetName: a
security:
authorization: enabled
keyFile: /tmp/mms-automation/test/output/keyfile.keyfile
storage:
dbPath: /tmp/mms-automation/test/output/data/process9003
journal:
enabled: false
preallocDataFiles: true
smallFiles: true
systemLog:
destination: file
path: /tmp/mms-automation/test/logs/run9003
Here is the new config file (with auth off) for the same replica set member:
net:
port: 9003
processManagement:
fork: "true"
replication:
oplogSizeMB: 64
replSetName: a
storage:
dbPath: /tmp/mms-automation/test/output/data/process9003
journal:
enabled: false
preallocDataFiles: true
smallFiles: true
systemLog:
destination: file
path: /tmp/mms-automation/test/logs/run9003
And here is the log after the process attempts to start with the new config file:
2015-02-03T12:19:18.945-0500 I CONTROL [initandlisten] MongoDB starting : pid=85055 port=9003 dbpath=/tmp/mms-automation/test/output/data/process9003 64-bit host=neurofunk.local
2015-02-03T12:19:18.945-0500 I CONTROL [initandlisten] db version v3.0.0-rc7
2015-02-03T12:19:18.945-0500 I CONTROL [initandlisten] git version: e4c60053b2967e16f765fa25d16aa6d629faa196
2015-02-03T12:19:18.945-0500 I CONTROL [initandlisten] build info: Darwin bs-osx108-2 12.5.0 Darwin Kernel Version 12.5.0: Sun Sep 29 13:33:47 PDT 2013; root:xnu-2050.48.12~1/RELEASE_X86_64 x86_64 BOOST_LIB_VERSION=1_49
2015-02-03T12:19:18.945-0500 I CONTROL [initandlisten] allocator: system
2015-02-03T12:19:18.945-0500 I CONTROL [initandlisten] options: { config: "/tmp/mms-automation/test/output/data/process9003/automation-mongod.conf", net: { port: 9003 }, processManagement: { fork: true }, replication: { oplogSizeMB: 64, replSetName: "a" }, storage: { dbPath: "/tmp/mms-automation/test/output/data/process9003", journal: { enabled: false }, mmapv1: { preallocDataFiles: true, smallFiles: true } }, systemLog: { destination: "file", path: "/tmp/mms-automation/test/logs/run9003" } }
2015-02-03T12:19:18.975-0500 I NETWORK [initandlisten] waiting for connections on port 9003
2015-02-03T12:19:18.977-0500 I NETWORK [initandlisten] connection accepted from 10.4.102.189:52938 #1 (1 connection now open)
2015-02-03T12:19:18.982-0500 I NETWORK [initandlisten] connection accepted from 10.4.102.189:52939 #2 (2 connections now open)
2015-02-03T12:19:19.008-0500 I REPL [ReplicationExecutor] new replica set config in use: { _id: "a", version: 1, members: [ { _id: 0, host: "neurofunk.local:9001", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 1, host: "neurofunk.local:9002", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 2, host: "neurofunk.local:9003", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true, heartbeatTimeoutSecs: 10, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 } } }
2015-02-03T12:19:19.008-0500 I REPL [ReplicationExecutor] transition to STARTUP2
2015-02-03T12:19:19.009-0500 I REPL [ReplicationExecutor] Starting replication applier threads
2015-02-03T12:19:19.009-0500 I REPL [ReplicationExecutor] transition to RECOVERING
2015-02-03T12:19:19.011-0500 I REPL [ReplicationExecutor] transition to SECONDARY
2015-02-03T12:19:19.040-0500 I REPL [ReplicationExecutor] Error in heartbeat request to neurofunk.local:9001; Unauthorized not authorized on admin to execute command { replSetHeartbeat: "a", pv: 1, v: 1, from: "neurofunk.local:9003", fromId: 2, checkEmpty: false }
2015-02-03T12:19:19.040-0500 I REPL [ReplicationExecutor] transition to RECOVERING
2015-02-03T12:19:19.040-0500 I REPL [ReplicationExecutor] Member neurofunk.local:9002 is now in state RECOVERING
2015-02-03T12:19:19.040-0500 I REPL [ReplicationExecutor] Standing for election
2015-02-03T12:19:19.041-0500 I REPL [ReplicationExecutor] not electing self, neurofunk.local:9002 would veto with 'I don't think neurofunk.local:9003 is electable because the member is not currently a secondary; member is more than 10 seconds behind the most up-to-date member (mask 0xA)'
2015-02-03T12:19:19.042-0500 I REPL [ReplicationExecutor] not electing self, we are not freshest
2015-02-03T12:19:19.065-0500 I ACCESS [conn1] Successfully authenticated as principal testuser on admin
2015-02-03T12:19:19.066-0500 I ACCESS [conn2] Successfully authenticated as principal testuser on admin
2015-02-03T12:19:19.450-0500 I NETWORK [initandlisten] connection accepted from 10.4.102.189:53033 #3 (3 connections now open)
2015-02-03T12:19:20.669-0500 I NETWORK [initandlisten] connection accepted from 10.4.102.189:52767 #4 (4 connections now open)
2015-02-03T12:19:20.713-0500 I NETWORK [initandlisten] connection accepted from 10.4.102.189:53143 #5 (5 connections now open)
2015-02-03T12:19:20.717-0500 I ACCESS [conn5] SCRAM-SHA-1 authentication failed for __system on local ; AuthenticationFailed It is not possible to authenticate as the __system user on servers started without a --keyFile parameter
2015-02-03T12:19:20.718-0500 I NETWORK [conn5] end connection 10.4.102.189:53143 (4 connections now open)
2015-02-03T12:19:20.853-0500 I NETWORK [initandlisten] connection accepted from 10.4.102.189:53175 #6 (5 connections now open)
2015-02-03T12:19:20.921-0500 I NETWORK [initandlisten] connection accepted from 10.4.102.189:53197 #7 (6 connections now open)
2015-02-03T12:19:21.045-0500 I REPL [ReplicationExecutor] Member neurofunk.local:9002 is now in state SECONDARY
2015-02-03T12:19:21.045-0500 I REPL [ReplicationExecutor] Standing for election
2015-02-03T12:19:21.045-0500 I REPL [ReplicationExecutor] Error in heartbeat request to neurofunk.local:9001; Unauthorized not authorized on admin to execute command { replSetHeartbeat: "a", pv: 1, v: 1, from: "neurofunk.local:9003", fromId: 2, checkEmpty: false }
2015-02-03T12:19:21.046-0500 I REPL [ReplicationExecutor] transition to SECONDARY
2015-02-03T12:19:21.046-0500 I REPL [ReplicationExecutor] transition to PRIMARY
2015-02-03T12:19:21.047-0500 I REPL [ReplicationExecutor] replSet possible election tie; sleeping 212ms until 2015-02-03T12:19:21.258-0500
2015-02-03T12:19:21.047-0500 I - [ReplicationExecutor] Invariant failure _role == Role::candidate src/mongo/db/repl/topology_coordinator_impl.cpp 1897
2015-02-03T12:19:21.049-0500 I CONTROL [ReplicationExecutor]
0x1068bb459 0x10686f7c0 0x10685ca93 0x106628f08 0x1065ea2e5 0x106611e84 0x1065f526d 0x1065f80f6 0x1065f2ad7 0x1068eec51 0x7fff9694e2fc 0x7fff9694e279 0x7fff9694c4b1
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"106160000","o":"75B459"},{"b":"106160000","o":"70F7C0"},{"b":"106160000","o":"6FCA93"},{"b":"106160000","o":"4C8F08"},{"b":"106160000","o":"48A2E5"},{"b":"106160000","o":"4B1E84"},{"b":"106160000","o":"49526D"},{"b":"106160000","o":"4980F6"},{"b":"106160000","o":"492AD7"},{"b":"106160000","o":"78EC51"},{"b":"7FFF9694B000","o":"32FC"},{"b":"7FFF9694B000","o":"3279"},{"b":"7FFF9694B000","o":"14B1"}],"processInfo":{ "mongodbVersion" : "3.0.0-rc7", "gitVersion" : "e4c60053b2967e16f765fa25d16aa6d629faa196", "uname" : { "sysname" : "Darwin", "release" : "14.0.0", "version" : "Darwin Kernel Version 14.0.0: Fri Sep 19 00:26:44 PDT 2014; root:xnu-2782.1.97~2/RELEASE_X86_64", "machine" : "x86_64" }, "somap" : [ { "path" : "/tmp/mms-automation/test/versions/mongodb-osx-x86_64-3.0.0-rc7/bin/mongod", "machType" : 2, "b" : "106160000", "buildId" : "0A95B1834CD531D7A030ADBA1D96FB0A" }, { "path" : "/usr/lib/libSystem.B.dylib", "machType" : 6, "b" : "7FFF98912000", "buildId" : "DA954461EC6A3DF085516FC810627627" }, { "path" : "/usr/lib/libc++.1.dylib", "machType" : 6, "b" : "7FFF94BDB000", "buildId" : "1B9530FD989B3174BB1CBDC159501710" }, { "path" : "/usr/lib/system/libcache.dylib", "machType" : 6, "b" : "7FFF946AA000", "buildId" : "45E9A2E799C436B2BEE30C4E11614AD1" }, { "path" : "/usr/lib/system/libcommonCrypto.dylib", "machType" : 6, "b" : "7FFF9B867000", "buildId" : "D381EBC669D831D380845A80A32CB748" }, { "path" : "/usr/lib/system/libcompiler_rt.dylib", "machType" : 6, "b" : "7FFF98914000", "buildId" : "BF8FC133EE103DA69B9092039E28678F" }, { "path" : "/usr/lib/system/libcopyfile.dylib", "machType" : 6, "b" : "7FFF8D368000", "buildId" : "0C68D3A6ACDD3EF3991ACC82C32AB836" }, { "path" : "/usr/lib/system/libcorecrypto.dylib", "machType" : 6, "b" : "7FFF92659000", "buildId" : "E178980139853949B7366B3378873301" }, { "path" : "/usr/lib/system/libdispatch.dylib", "machType" : 6, "b" : "7FFF976D5000", "buildId" : "502CF32B669B3709886208188225E4F0" }, { "path" : "/usr/lib/system/libdyld.dylib", "machType" : 6, "b" : "7FFF91CFA000", "buildId" : "19FAF435C16533749DEFD7BBA7D61DB6" }, { "path" : "/usr/lib/system/libkeymgr.dylib", "machType" : 6, "b" : "7FFF8FBBF000", "buildId" : "77845842DE703CC5BD01C3D14227CED5" }, { "path" : "/usr/lib/system/liblaunch.dylib", "machType" : 6, "b" : "7FFF956C8000", "buildId" : "8A9889248BE735FEBF7D322E90EFE49E" }, { "path" : "/usr/lib/system/libmacho.dylib", "machType" : 6, "b" : "7FFF92DBD000", "buildId" : "126CA2EDDE91308F8881B9DAEC3C63B6" }, { "path" : "/usr/lib/system/libquarantine.dylib", "machType" : 6, "b" : "7FFF912FE000", "buildId" : "DC0416272D92361CBABFA869A5C72293" }, { "path" : "/usr/lib/system/libremovefile.dylib", "machType" : 6, "b" : "7FFF99AD8000", "buildId" : "3485B5F46CE83C628DFD8736ED6E8531" }, { "path" : "/usr/lib/system/libsystem_asl.dylib", "machType" : 6, "b" : "7FFF926D0000", "buildId" : "F153AC5B0542356E88C820A62CA704E2" }, { "path" : "/usr/lib/system/libsystem_blocks.dylib", "machType" : 6, "b" : "7FFF9978E000", "buildId" : "9615D10AFCA73BE4AA1A1B195DACE1A1" }, { "path" : "/usr/lib/system/libsystem_c.dylib", "machType" : 6, "b" : "7FFF9479D000", "buildId" : "C185E86274243210B5286B822577A4B8" }, { "path" : "/usr/lib/system/libsystem_configuration.dylib", "machType" : 6, "b" : "7FFF93738000", "buildId" : "9FBA1CE497D0347EA44393ED94512E92" }, { "path" : "/usr/lib/system/libsystem_coreservices.dylib", "machType" : 6, "b" : "7FFF93907000", "buildId" : "41B7C5785A5331C8A96FC73E030B0938" }, { "path" : "/usr/lib/system/libsystem_coretls.dylib", "machType" : 6, "b" : "7FFF8E6FA000", "buildId" : "EBBF7EF680D83F8F825CB412BD6D22C0" }, { "path" : "/usr/lib/system/libsystem_dnssd.dylib", "machType" : 6, "b" : "7FFF923CA000", "buildId" : "62B70ECAE40D3C63896E7F00EC386DDB" }, { "path" : "/usr/lib/system/libsystem_info.dylib", "machType" : 6, "b" : "7FFF9373B000", "buildId" : "B85A85D585303A93B0C34DEC41F79478" }, { "path" : "/usr/lib/system/libsystem_kernel.dylib", "machType" : 6, "b" : "7FFF95BB1000", "buildId" : "93E0E0A975B63904BB4E4BC7C05F4B6B" }, { "path" : "/usr/lib/system/libsystem_m.dylib", "machType" : 6, "b" : "7FFF978A7000", "buildId" : "1E12AB456D9636D0A226F24D9FB0D9D6" }, { "path" : "/usr/lib/system/libsystem_malloc.dylib", "machType" : 6, "b" : "7FFF99ADA000", "buildId" : "19BCC25757173502A71F95D65AFA861B" }, { "path" : "/usr/lib/system/libsystem_network.dylib", "machType" : 6, "b" : "7FFF8E897000", "buildId" : "C0B2313D47BE38A9BEE62634A4F5E14B" }, { "path" : "/usr/lib/system/libsystem_networkextension.dylib", "machType" : 6, "b" : "7FFF900D0000", "buildId" : "29AB225BD7FB30ED960065D44B9A9442" }, { "path" : "/usr/lib/system/libsystem_notify.dylib", "machType" : 6, "b" : "7FFF9B873000", "buildId" : "61147800F3203DAA850CBADF33855F29" }, { "path" : "/usr/lib/system/libsystem_platform.dylib", "machType" : 6, "b" : "7FFF96A66000", "buildId" : "64E34079D7123D669CE2418624A5C040" }, { "path" : "/usr/lib/system/libsystem_pthread.dylib", "machType" : 6, "b" : "7FFF9694B000", "buildId" : "26B1897F0CD330F3B55A37CB45062D73" }, { "path" : "/usr/lib/system/libsystem_sandbox.dylib", "machType" : 6, "b" : "7FFF8FEEB000", "buildId" : "DB9962EF889831CC9B87E01F8CE74C9D" }, { "path" : "/usr/lib/system/libsystem_secinit.dylib", "machType" : 6, "b" : "7FFF93CB6000", "buildId" : "581DAD0F6B633A48B63B917AF799ABAA" }, { "path" : "/usr/lib/system/libsystem_stats.dylib", "machType" : 6, "b" : "7FFF98D7A000", "buildId" : "1DB0443659743F1686CC5FF5F390339C" }, { "path" : "/usr/lib/system/libsystem_trace.dylib", "machType" : 6, "b" : "7FFF92506000", "buildId" : "A9E6B7D8C3273742AC5486C94218B1DF" }, { "path" : "/usr/lib/system/libunc.dylib", "machType" : 6, "b" : "7FFF9211B000", "buildId" : "5676F7EAC1DF329FB006D2C3022B7D70" }, { "path" : "/usr/lib/system/libunwind.dylib", "machType" : 6, "b" : "7FFF93CB0000", "buildId" : "BE7E51A0B6EA3A549CCA9D88F683A6D6" }, { "path" : "/usr/lib/system/libxpc.dylib", "machType" : 6, "b" : "7FFF9B6C5000", "buildId" : "9437C02EA07B38C891CB299FAA63083D" }, { "path" : "/usr/lib/libobjc.A.dylib", "machType" : 6, "b" : "7FFF8D451000", "buildId" : "3B60CD9074A23A5D9686B0772159792A" }, { "path" : "/usr/lib/libauto.dylib", "machType" : 6, "b" : "7FFF92D76000", "buildId" : "A260789BD4D8316A9490254767B8A5F1" }, { "path" : "/usr/lib/libc++abi.dylib", "machType" : 6, "b" : "7FFF91CCE000", "buildId" : "88A22A0F87C63002BFBAAC0F2808B8B9" }, { "path" : "/usr/lib/libDiagnosticMessagesClient.dylib", "machType" : 6, "b" : "7FFF8E17E000", "buildId" : "2EE8E4365CDC34C599595BA218D507FB" } ] }}
mongod(_ZN5mongo15printStackTraceERNSt3__113basic_ostreamIcNS0_11char_traitsIcEEEE+0x39) [0x1068bb459]
mongod(_ZN5mongo10logContextEPKc+0x100) [0x10686f7c0]
mongod(_ZN5mongo15invariantFailedEPKcS1_j+0x183) [0x10685ca93]
mongod(_ZN5mongo4repl23TopologyCoordinatorImpl19processLoseElectionEv+0xD8) [0x106628f08]
mongod(_ZN5mongo4repl26ReplicationCoordinatorImpl25_onFreshnessCheckCompleteEv+0x1F5) [0x1065ea2e5]
mongod(_ZN5mongo4repl19ScatterGatherRunner16_processResponseERKNS0_19ReplicationExecutor25RemoteCommandCallbackDataEPS1_+0xE4) [0x106611e84]
mongod(_ZN5mongo4replL21remoteCommandFinishedERKNS0_19ReplicationExecutor12CallbackDataERKNSt3__18functionIFvRKNS1_25RemoteCommandCallbackDataEEEERKNS1_20RemoteCommandRequestERKNS_10StatusWithINS1_21RemoteCommandResponseEEE+0x2BD) [0x1065f526d]
mongod(_ZN5mongo4repl12_GLOBAL__N_112callNoExceptERKNSt3__18functionIFvvEEE+0x16) [0x1065f80f6]
mongod(_ZN5mongo4repl19ReplicationExecutor3runEv+0x637) [0x1065f2ad7]
mongod(_ZN5boost12_GLOBAL__N_112thread_proxyEPv+0xB1) [0x1068eec51]
libsystem_pthread.dylib(_pthread_body+0x83) [0x7fff9694e2fc]
libsystem_pthread.dylib(_pthread_body+0x0) [0x7fff9694e279]
libsystem_pthread.dylib(thread_start+0xD) [0x7fff9694c4b1]
----- END BACKTRACE -----
2015-02-03T12:19:21.049-0500 I - [ReplicationExecutor]
***aborting after invariant() failure