[SERVER-29015] TopologyCoordinator should not transition to candidate role in a single node replica set if we are in maintenance mode
Created: 28/Apr/17  Updated: 30/Oct/23  Resolved: 21/Jun/17

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.4.1
Fix Version/s: 3.4.7, 3.5.10

Type: Bug
Priority: Major - P3
Reporter: Vick Mena (Inactive)
Assignee: Benety Goh
Resolution: Fixed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File 436558_psin11p184_ftdc.tar.gz     File 436558_psin11p184_logs.tar.gz     File 436558_psin11p189_logs.tar.gz     File 436558_psin11p192_ftdc.tar.gz     File SERVER_29015.js    
Issue Links:
Backports
Related
is related to SERVER-29037 Log new replica set config when replS... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Completed:
Sprint: Repl 2017-05-29, Repl 2017-06-19, Repl 2017-07-10
Participants:
Case:

 Description   

SIGABRT while attempting to reconfigure replica set

2017-04-28T09:25:38.335-0400 I ACCESS [conn806] Successfully authenticated as principal superUser on admin
2017-04-28T09:28:16.861-0400 I COMMAND [conn806] Attempting to step down in response to replSetStepDown command
2017-04-28T09:28:31.677-0400 I REPL [conn806] cannot freeze node when primary or running for election. state: Running-Election
2017-04-28T09:29:02.335-0400 I REPL [conn806] replSetReconfig admin command received from client
2017-04-28T09:29:02.339-0400 I REPL [conn806] replSetReconfig config object with 1 members parses ok
2017-04-28T09:29:02.340-0400 I - [replExecDBWorker-1] Invariant failure _voteRequester src/mongo/db/repl/replication_coordinator_impl.cpp 2382
2017-04-28T09:29:02.340-0400 I - [replExecDBWorker-1]
 
***aborting after invariant() failure
 
2017-04-28T09:29:02.354-0400 F - [replExecDBWorker-1] Got signal: 6 (Aborted).
 
0x7f4c327754f1 0x7f4c327745e9 0x7f4c32774acd 0x7f4c2e6a57e0 0x7f4c2e334625 0x7f4c2e335e05 0x7f4c3194e360 0x7f4c321e6c08 0x7f4c322015d0 0x7f4c321ffe49 0x7f4c322077c0 0x7f4c321353a6 0x7f4c3225ee77 0x7f4c3226005f 0x7f4c326ec6c5 0x7f4c326ed1f0 0x7f4c326edd99 0x7f4c331ea240 0x7f4c2e69daa1 0x7f4c2e3ea93d
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"7F4C310F1000","o":"16844F1","s":"_ZN5mongo15printStackTraceERSo"},{"b":"7F4C310F1000","o":"16835E9"},{"b":"7F4C310F1000","o":"1683ACD"},{"b":"7F4C2E696000","o":"F7E0"},{"b":"7F4C2E302000","o":"32625","s":"gsignal"},{"b":"7F4C2E302000","o":"33E05","s":"abort"},{"b":"7F4C310F1000","o":"85D360","s":"_ZN5mongo17invariantOKFailedEPKcRKNS_6StatusES1_j"},{"b":"7F4C310F1000","o":"10F5C08","s":"_ZN5mongo4repl26ReplicationCoordinatorImpl22_finishReplSetReconfigERKNS_8executor12TaskExecutor12CallbackArgsERKNS0_16ReplicaSetConfigEi"},{"b":"7F4C310F1000","o":"11105D0","s":"_ZN5mongo4repl19ReplicationExecutor12_doOperationEPNS_16OperationContextERKNS_6StatusERKNS_8executor12TaskExecutor14CallbackHandleEPNSt7__cxx114listINS1_8WorkItemESaISE_EEEPSt5mutex"},{"b":"7F4C310F1000","o":"110EE49"},{"b":"7F4C310F1000","o":"11167C0"},{"b":"7F4C310F1000","o":"10443A6"},{"b":"7F4C310F1000","o":"116DE77"},{"b":"7F4C310F1000","o":"116F05F","s":"_ZN5mongo4repl10TaskRunner9_runTasksEv"},{"b":"7F4C310F1000","o":"15FB6C5","s":"_ZN5mongo10ThreadPool10_doOneTaskEPSt11unique_lockISt5mutexE"},{"b":"7F4C310F1000","o":"15FC1F0","s":"_ZN5mongo10ThreadPool13_consumeTasksEv"},{"b":"7F4C310F1000","o":"15FCD99","s":"_ZN5mongo10ThreadPool17_workerThreadBodyEPS0_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE"},{"b":"7F4C310F1000","o":"20F9240"},{"b":"7F4C2E696000","o":"7AA1"},{"b":"7F4C2E302000","o":"E893D","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.4.1", "gitVersion" : "5e103c4f5583e2566a45d740225dc250baacfbd7", "compiledModules" : [ "enterprise" ], "uname" : { "sysname" : "Linux", "release" : "2.6.32-573.35.2.el6.x86_64", "version" : "#1 SMP Mon Oct 24 14:14:01 EDT 2016", "machine" : "x86_64" }, "somap" : [ { "b" : "7F4C310F1000", "elfType" : 3, "buildId" : "6592541D75783832B60FD247170C476E08067A03" }, { "b" : "7FFCBBE2B000", "elfType" : 3, "buildId" : "2166F24469D1DCF52164CD85E54A5958F14DC9F0" }, { "b" : "7F4C30C7F000", "path" : "/lib64/libldap-2.4.so.2", "elfType" : 3, 
"buildId" : "32801AFA6E4B7372E0FB47284BCC41E75FA16F1C" }, { "b" : "7F4C30A70000", "path" : "/lib64/liblber-2.4.so.2", "elfType" : 3, "buildId" : "A5F759C53828926B21000F83968669B7DA7E334F" }, { "b" : "7F4C3081B000", "path" : "/usr/lib64/libcurl.so.4", "elfType" : 3, "buildId" : "77F40FA35472D61709BB51543AA4859BA1E6B7AC" }, { "b" : "7F4C30601000", "path" : "/usr/lib64/libsasl2.so.2", "elfType" : 3, "buildId" : "E0AEE889D5BF1373F2F9EE0D448DBF3F5B5113F0" }, { "b" : "7F4C303BD000", "path" : "/lib64/libgssapi_krb5.so.2", "elfType" : 3, "buildId" : "441FA45097A11508E50D55A3D1FF169BF2BE7C62" }, { "b" : "7F4C30171000", "path" : "/usr/lib64/libnetsnmpagent.so.20", "elfType" : 3, "buildId" : "E4E49DE2554F02ACF2728D1748874101B0709B3A" }, { "b" : "7F4C2FF4B000", "path" : "/usr/lib64/libnetsnmphelpers.so.20", "elfType" : 3, "buildId" : "17A35AEE324676929C7A5C8B4CE54443ED10AC07" }, { "b" : "7F4C2FA83000", "path" : "/usr/lib64/libnetsnmpmibs.so.20", "elfType" : 3, "buildId" : "78A49421FA60389F8C774BE68F5EF17DF2BD9CE3" }, { "b" : "7F4C2F7A9000", "path" : "/usr/lib64/libnetsnmp.so.20", "elfType" : 3, "buildId" : "4CB6272BCAC2270393F559F67E8ED321690F79D5" }, { "b" : "7F4C2F53D000", "path" : "/usr/lib64/libssl.so.10", "elfType" : 3, "buildId" : "B84C31B86733DE212F6886FE6F55630FE56180A9" }, { "b" : "7F4C2F159000", "path" : "/usr/lib64/libcrypto.so.10", "elfType" : 3, "buildId" : "E05F34F58683FC48552C1D5163E2BD4E9DFB1F3D" }, { "b" : "7F4C2EF51000", "path" : "/lib64/librt.so.1", "elfType" : 3, "buildId" : "95159178F1A4A3DBDC7819FBEA2C80E5FCDD6BAC" }, { "b" : "7F4C2ED4D000", "path" : "/lib64/libdl.so.2", "elfType" : 3, "buildId" : "29B61382141595ECBA6576232E44F2310C3AAB72" }, { "b" : "7F4C2EAC9000", "path" : "/lib64/libm.so.6", "elfType" : 3, "buildId" : "989FE3A42CA8CEBDCC185A743896F23A0CF537ED" }, { "b" : "7F4C2E8B3000", "path" : "/lib64/libgcc_s.so.1", "elfType" : 3, "buildId" : "2AC15B051D1B8B53937E3341EA931D0E96F745D9" }, { "b" : "7F4C2E696000", "path" : "/lib64/libpthread.so.0", 
"elfType" : 3, "buildId" : "C56DD1B811FC0D9263248EBB308C73FCBCD80FC1" }, { "b" : "7F4C2E302000", "path" : "/lib64/libc.so.6", "elfType" : 3, "buildId" : "A1DB9754D1F523A6F16ADA929D6764A133DC6FA2" }, { "b" : "7F4C30ECF000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "959C5E10A47EE8A633E7681B64B4B9F74E242ED5" }, { "b" : "7F4C2E0E8000", "path" : "/lib64/libresolv.so.2", "elfType" : 3, "buildId" : "C39D7FFB49DFB1B55AD09D1D711AD802123F6623" }, { "b" : "7F4C2DEA8000", "path" : "/usr/lib64/libssl3.so", "elfType" : 3, "buildId" : "D0BC7E14B61557018F0D3DE086F7F547CBD96A49" }, { "b" : "7F4C2DC7C000", "path" : "/usr/lib64/libsmime3.so", "elfType" : 3, "buildId" : "170E7F73BA1C20E8E254380A98C6A083EEA35F68" }, { "b" : "7F4C2D93D000", "path" : "/usr/lib64/libnss3.so", "elfType" : 3, "buildId" : "2AA334714B9242998869C67DB01BF20A119B0AB7" }, { "b" : "7F4C2D711000", "path" : "/usr/lib64/libnssutil3.so", "elfType" : 3, "buildId" : "A9A05587133A7F8634C040CA8013A68EBCF9E2E0" }, { "b" : "7F4C2D50D000", "path" : "/lib64/libplds4.so", "elfType" : 3, "buildId" : "97F07716D324E086D43CC4D05873E1A16E020468" }, { "b" : "7F4C2D308000", "path" : "/lib64/libplc4.so", "elfType" : 3, "buildId" : "C53F8B39797A277F40F582D8D11D3C2FFF7E5D1E" }, { "b" : "7F4C2D0CA000", "path" : "/lib64/libnspr4.so", "elfType" : 3, "buildId" : "7CD7DD1B6C294C61F494519CE3E0D7E114DFB36D" }, { "b" : "7F4C2CE98000", "path" : "/lib64/libidn.so.11", "elfType" : 3, "buildId" : "5659EB985475B586E3BBCB95BA21F4A30BE5EBF4" }, { "b" : "7F4C2CBB1000", "path" : "/lib64/libkrb5.so.3", "elfType" : 3, "buildId" : "F62622218875795666E08B92D176A50791183EEC" }, { "b" : "7F4C2C985000", "path" : "/lib64/libk5crypto.so.3", "elfType" : 3, "buildId" : "B8DEDADC140347276164C729418C7A37B7224135" }, { "b" : "7F4C2C781000", "path" : "/lib64/libcom_err.so.2", "elfType" : 3, "buildId" : "13FFCD68952B7715DDF34C9321D82E3041EA9006" }, { "b" : "7F4C2C56B000", "path" : "/lib64/libz.so.1", "elfType" : 3, "buildId" : 
"D053BB4FF0C2FC983842F81598813B9B931AD0D1" }, { "b" : "7F4C2C343000", "path" : "/usr/lib64/libssh2.so.1", "elfType" : 3, "buildId" : "8727EC925D6D91DAC74A99BDE8B3C6EE96AF13EA" }, { "b" : "7F4C2C10C000", "path" : "/lib64/libcrypt.so.1", "elfType" : 3, "buildId" : "128802B73016BE233837EA9F2DCBC2153ACC2D6A" }, { "b" : "7F4C2BF01000", "path" : "/lib64/libkrb5support.so.0", "elfType" : 3, "buildId" : "4BDFC7A19C1F328EB4FCFBCE7A1E27606928610D" }, { "b" : "7F4C2BCFE000", "path" : "/lib64/libkeyutils.so.1", "elfType" : 3, "buildId" : "3BCCABE75DC61BBA81AAE45D164E26EF4F9F55DB" }, { "b" : "7F4C2BAF3000", "path" : "/lib64/libwrap.so.0", "elfType" : 3, "buildId" : "8C0C7CAB7F028E4592A8581EB2122FBECAB26B97" }, { "b" : "7F4C2B788000", "path" : "/usr/lib64/perl5/CORE/libperl.so", "elfType" : 3, "buildId" : "545478030DF991A635CC5E3258A3F5D8A7E94561" }, { "b" : "7F4C2B56F000", "path" : "/lib64/libnsl.so.1", "elfType" : 3, "buildId" : "CAD1498B2AA3531958C579F5CB39D8D6BFB5675B" }, { "b" : "7F4C2B36C000", "path" : "/lib64/libutil.so.1", "elfType" : 3, "buildId" : "565D9CDC6BD59EFE0156BAFE21033BE070F014DA" }, { "b" : "7F4C2B101000", "path" : "/usr/lib64/librpm.so.1", "elfType" : 3, "buildId" : "0B73153AA2E650B19153B7E8A57F9C7A965072CD" }, { "b" : "7F4C2AED2000", "path" : "/usr/lib64/librpmio.so.1", "elfType" : 3, "buildId" : "7D821C87BEF03F9D7BBFE7FEE591EC5929D1C22C" }, { "b" : "7F4C2ACC9000", "path" : "/lib64/libpopt.so.0", "elfType" : 3, "buildId" : "E7B49911F1136073DD7DC58E8118CD9A4FBE2A19" }, { "b" : "7F4C2AAB9000", "path" : "/usr/lib64/libsensors.so.4", "elfType" : 3, "buildId" : "6855E5BF5B3634C15F01B1043BD892D727EE3C08" }, { "b" : "7F4C2A8B6000", "path" : "/lib64/libfreebl3.so", "elfType" : 3, "buildId" : "58BAC04A1DB3964A8F594EFFBE4838AD01214EDC" }, { "b" : "7F4C2A697000", "path" : "/lib64/libselinux.so.1", "elfType" : 3, "buildId" : "2D0F26E648D9661ABD83ED8B4BBE8F2CFA50393B" }, { "b" : "7F4C2A486000", "path" : "/lib64/libbz2.so.1", "elfType" : 3, "buildId" : 
"1250B1D041DD7552F0C870BB188DC3A34DF2651D" }, { "b" : "7F4C2A270000", "path" : "/usr/lib64/libelf.so.1", "elfType" : 3, "buildId" : "1C2B39A5003E9DA8FD9C55972C06245E731E6546" }, { "b" : "7F4C2A04F000", "path" : "/usr/lib64/liblzma.so.0", "elfType" : 3, "buildId" : "2F1F98636D83908F9157858BCC7B44A6A6784385" }, { "b" : "7F4C29E22000", "path" : "/usr/lib64/liblua-5.1.so", "elfType" : 3, "buildId" : "6BDB4E1990D6EBA12A5C8D39A7650DB8798BF568" }, { "b" : "7F4C29C1E000", "path" : "/lib64/libcap.so.2", "elfType" : 3, "buildId" : "A436538388F1F25113FDA834CA2EED524EFA17D6" }, { "b" : "7F4C29A16000", "path" : "/lib64/libacl.so.1", "elfType" : 3, "buildId" : "26CC708AC7C0FC1797A2340C024F0ADD0CE054D8" }, { "b" : "7F4C296A2000", "path" : "/lib64/libdb-4.7.so", "elfType" : 3, "buildId" : "54DB4E3C4EC743FE95DD31C9D312E2898724577E" }, { "b" : "7F4C2949D000", "path" : "/lib64/libattr.so.1", "elfType" : 3, "buildId" : "8EF0683858704EF173AB11B1E27076F37F82B7B6" } ] }}
mongod(_ZN5mongo15printStackTraceERSo+0x41) [0x7f4c327754f1]
mongod(+0x16835E9) [0x7f4c327745e9]
mongod(+0x1683ACD) [0x7f4c32774acd]
libpthread.so.0(+0xF7E0) [0x7f4c2e6a57e0]
libc.so.6(gsignal+0x35) [0x7f4c2e334625]
libc.so.6(abort+0x175) [0x7f4c2e335e05]
mongod(_ZN5mongo17invariantOKFailedEPKcRKNS_6StatusES1_j+0x0) [0x7f4c3194e360]
mongod(_ZN5mongo4repl26ReplicationCoordinatorImpl22_finishReplSetReconfigERKNS_8executor12TaskExecutor12CallbackArgsERKNS0_16ReplicaSetConfigEi+0x388) [0x7f4c321e6c08]
mongod(_ZN5mongo4repl19ReplicationExecutor12_doOperationEPNS_16OperationContextERKNS_6StatusERKNS_8executor12TaskExecutor14CallbackHandleEPNSt7__cxx114listINS1_8WorkItemESaISE_EEEPSt5mutex+0x220) [0x7f4c322015d0]
mongod(+0x110EE49) [0x7f4c321ffe49]
mongod(+0x11167C0) [0x7f4c322077c0]
mongod(+0x10443A6) [0x7f4c321353a6]
mongod(+0x116DE77) [0x7f4c3225ee77]
mongod(_ZN5mongo4repl10TaskRunner9_runTasksEv+0xAF) [0x7f4c3226005f]
mongod(_ZN5mongo10ThreadPool10_doOneTaskEPSt11unique_lockISt5mutexE+0x135) [0x7f4c326ec6c5]
mongod(_ZN5mongo10ThreadPool13_consumeTasksEv+0xC0) [0x7f4c326ed1f0]
mongod(_ZN5mongo10ThreadPool17_workerThreadBodyEPS0_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x149) [0x7f4c326edd99]
mongod(+0x20F9240) [0x7f4c331ea240]
libpthread.so.0(+0x7AA1) [0x7f4c2e69daa1]
libc.so.6(clone+0x6D) [0x7f4c2e3ea93d]
----- END BACKTRACE -----
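
The mangled names in the backtrace above can be made readable by demangling them. A minimal sketch, assuming a GCC or Clang toolchain that provides abi::__cxa_demangle in <cxxabi.h>; the helper name demangle is hypothetical and not part of the server code:

```cpp
#include <cxxabi.h>

#include <cstdlib>
#include <string>

// Returns the demangled form of an Itanium-ABI symbol name.
// If demangling fails, the input is returned unchanged.
std::string demangle(const std::string& mangled) {
    int status = 0;
    char* out = abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
    if (status != 0 || out == nullptr) {
        return mangled;
    }
    std::string result(out);
    std::free(out);  // the buffer returned by __cxa_demangle is malloc-allocated
    return result;
}
```

For example, demangle("_ZN5mongo4repl10TaskRunner9_runTasksEv") yields mongo::repl::TaskRunner::_runTasks(). The frame just below invariantOKFailed in the trace is ReplicationCoordinatorImpl::_finishReplSetReconfig, which is where the invariant fired.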

More data forthcoming



 Comments   
Comment by Githook User [ 13/Jul/17 ]

Author: Benety Goh (benety) <benety@mongodb.com>

Message: SERVER-29015 TopologyCoordinator should not transition to candidate role in a single node replica set if we are in maintenance mode

(cherry picked from commit 5dd64f88d2b66078c957eea5a7889076ee5956b6)
Branch: v3.4
https://github.com/mongodb/mongo/commit/ee6d550e81773fafd2a981b100ab520b73970c5e

Comment by Githook User [ 21/Jun/17 ]

Author: Benety Goh (benety) <benety@mongodb.com>

Message: SERVER-29015 TopologyCoordinator should not transition to candidate role in a single node replica set if we are in maintenance mode
Branch: master
https://github.com/mongodb/mongo/commit/5dd64f88d2b66078c957eea5a7889076ee5956b6

Comment by Benety Goh [ 16/Jun/17 ]

The issue seems to be that TopologyCoordinatorImpl::updateConfig() is erroneously allowing the node to transition to a "candidate" role when it's in maintenance mode:

https://github.com/mongodb/mongo/blob/73390210633a157f87221d561ce6cad1497225f9/src/mongo/db/repl/topology_coordinator_impl.cpp#L2231

topology_coordinator_impl.cpp, lines 2189-2238:

// This function installs a new config object and recreates MemberData objects
// that reflect the new config.
void TopologyCoordinatorImpl::updateConfig(const ReplSetConfig& newConfig,
                                           int selfIndex,
                                           Date_t now) {
    invariant(_role != Role::candidate);
    invariant(selfIndex < newConfig.getNumMembers());

    // Reset term on startup and upgrade/downgrade of protocol version.
    if (!_rsConfig.isInitialized() ||
        _rsConfig.getProtocolVersion() != newConfig.getProtocolVersion()) {
        if (newConfig.getProtocolVersion() == 1) {
            _term = OpTime::kInitialTerm;
        } else {
            invariant(newConfig.getProtocolVersion() == 0);
            _term = OpTime::kUninitializedTerm;
        }
        LOG(1) << "Updated term in topology coordinator to " << _term << " due to new config";
    }

    _updateHeartbeatDataForReconfig(newConfig, selfIndex, now);
    _stepDownPending = false;
    _rsConfig = newConfig;
    _selfIndex = selfIndex;
    _forceSyncSourceIndex = -1;

    if (_role == Role::leader) {
        if (_selfIndex == -1) {
            log() << "Could not remain primary because no longer a member of the replica set";
        } else if (!_selfConfig().isElectable()) {
            log() << " Could not remain primary because no longer electable";
        } else {
            // Don't stepdown if you don't have to.
            _currentPrimaryIndex = _selfIndex;
            return;
        }
        _role = Role::follower;
    }

    // By this point we know we are in Role::follower
    _currentPrimaryIndex = -1;  // force secondaries to re-detect who the primary is

    if (_followerMode == MemberState::RS_SECONDARY && _rsConfig.getNumMembers() == 1 &&
        _selfIndex == 0 && _rsConfig.getMemberAt(_selfIndex).isElectable()) {
        // If the new config describes a one-node replica set, we're the one member,
        // we're electable, and we are currently in followerMode SECONDARY,
        // we must transition to candidate, in leiu of heartbeats.
        _role = Role::candidate;
    }
}
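
The missing condition can be illustrated with a small standalone model. This is a sketch of the kind of guard the issue title implies, not the committed patch: the NodeState struct, the roleAfterReconfig function, and the maintenanceModeCalls counter are all hypothetical names chosen for illustration.

```cpp
#include <cstddef>

// Simplified, hypothetical stand-ins for the coordinator's state.
enum class Role { leader, follower, candidate };
enum class MemberState { RS_SECONDARY, RS_RECOVERING };

struct NodeState {
    MemberState followerMode;
    std::size_t numMembers;    // members in the newly installed config
    int selfIndex;             // our index in the new config, -1 if absent
    bool selfElectable;        // whether our member entry is electable
    int maintenanceModeCalls;  // assumed counter: > 0 means maintenance mode
};

// Returns the role a follower should hold after installing a new config.
Role roleAfterReconfig(const NodeState& s) {
    const bool singleElectableNode = s.followerMode == MemberState::RS_SECONDARY &&
        s.numMembers == 1 && s.selfIndex == 0 && s.selfElectable;
    // Extra condition: never self-elect while in maintenance mode.
    if (singleElectableNode && s.maintenanceModeCalls == 0) {
        return Role::candidate;
    }
    return Role::follower;
}
```

With this guard, a single electable node still becomes a candidate in the normal case, but stays a follower if the maintenance counter is non-zero or the config has more than one member.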

Comment by Judah Schvimer [ 02/May/17 ]

I've attached a repro script as SERVER_29015.js. The failure reproduces at https://github.com/mongodb/mongo/commit/435d43b66f04fc12fdb4f1e115d1fe9558571334.

Comment by Judah Schvimer [ 02/May/17 ]

The invariant occurs when attempting to postpone finishing the reconfig until after the current election finishes.

Outside of the lock, we reset the _voteRequester both on an election win and on an election loss.

We appear to change the TopologyCoordinator role to "Candidate" in many places without creating a VoteRequester. If we have just reset the VoteRequester and then become a candidate, we hit this invariant.
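
The sequence described above can be modeled in a few lines. This is a toy sketch, not the server's code: the Coordinator struct and its method names are hypothetical, and a thrown exception stands in for the invariant() abort.

```cpp
#include <memory>
#include <stdexcept>

enum class Role { leader, follower, candidate };

struct VoteRequester {};  // placeholder for the real vote-collection object

struct Coordinator {
    Role role = Role::follower;
    std::unique_ptr<VoteRequester> voteRequester;

    // Normal election start: role and requester are set together.
    void startElection() {
        role = Role::candidate;
        voteRequester = std::make_unique<VoteRequester>();
    }

    // Election end (win or loss): the requester is discarded.
    void finishElection(bool won) {
        voteRequester.reset();
        role = won ? Role::leader : Role::follower;
    }

    // Models a config update that sets the candidate role without an election.
    void becomeCandidateWithoutElection() { role = Role::candidate; }

    // Models the reconfig path, which assumes a candidate has a requester to wait on.
    void finishReconfig() const {
        if (role == Role::candidate && !voteRequester) {
            throw std::logic_error("Invariant failure _voteRequester");
        }
    }
};
```

Calling finishElection(false), then becomeCandidateWithoutElection(), then finishReconfig() throws in this model, while a reconfig that arrives during a properly started election does not.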

Comment by Eric Milkie [ 28/Apr/17 ]

More specifically, can you include the existing replica set configuration, and the proposed new configuration from the replSetReconfig command? Also, was the "force" flag to the reconfig command used?

Comment by Daniel Pasette (Inactive) [ 28/Apr/17 ]

Can you include the replica set configuration?

Generated at Thu Feb 08 04:19:41 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.