[SERVER-17677] Replica Set member backtraces sometimes when removed from replica set Created: 20/Mar/15  Updated: 14/Apr/15  Resolved: 26/Mar/15

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.0.1
Fix Version/s: 3.0.2

Type: Bug Priority: Major - P3
Reporter: Timothy Olsen (Inactive) Assignee: Matt Dannenberg
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File primary.log     Text File removed.log    
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

Set up a replica set with 4 nodes.

Reconfigure replica set using "replSetReconfig" command to have only 3 of those nodes.

Watch the now excluded node backtrace (sometimes).

Participants:

 Description   

Relevant portions from the log:

2015-03-18T16:38:15.867-0400 I REPL     [ReplicationExecutor] syncing from: jenkins-slave-osx109-agent1.mms-build.10gen.cc:9102
2015-03-18T16:38:15.868-0400 I REPL     [SyncSourceFeedback] replset setting syncSourceFeedback to jenkins-slave-osx109-agent1.mms-build.10gen.cc:9102
2015-03-18T16:38:15.868-0400 I REPL     [ReplicationExecutor] Cannot find self in new replica set configuration; I must be removed; NodeNotFound No host described in new configuration 3 for replica set a maps to this node
2015-03-18T16:38:15.869-0400 I REPL     [ReplicationExecutor] New replica set config in use: { _id: "a", version: 3, members: [ { _id: 0, host: "rs-9101-alias.lvh.me:9101", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 1, host: "jenkins-slave-osx109-agent1.mms-build.10gen.cc:9102", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 2, host: "jenkins-slave-osx109-agent1.mms-build.10gen.cc:9103", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true, heartbeatTimeoutSecs: 10, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 } } }
2015-03-18T16:38:15.869-0400 I REPL     [ReplicationExecutor] This node is not a member of the config
2015-03-18T16:38:15.869-0400 I REPL     [ReplicationExecutor] transition to REMOVED
2015-03-18T16:38:15.870-0400 I -        [SyncSourceFeedback] Fatal Assertion 18650
2015-03-18T16:38:15.870-0400 I NETWORK  [conn32] end connection 208.52.190.99:54406 (5 connections now open)
2015-03-18T16:38:15.870-0400 I NETWORK  [conn34] end connection 208.52.190.99:54420 (5 connections now open)
2015-03-18T16:38:15.870-0400 I NETWORK  [conn35] end connection 208.52.190.99:54423 (5 connections now open)
2015-03-18T16:38:15.871-0400 I NETWORK  [initandlisten] connection accepted from 208.52.190.99:54437 #36 (7 connections now open)
2015-03-18T16:38:15.872-0400 I NETWORK  [initandlisten] connection accepted from 208.52.190.99:54439 #37 (5 connections now open)
2015-03-18T16:38:15.872-0400 I NETWORK  [initandlisten] connection accepted from 208.52.190.99:54438 #38 (6 connections now open)
2015-03-18T16:38:15.872-0400 I CONTROL  [SyncSourceFeedback] 
 0x10656cd79 0x1065205a0 0x10650dc36 0x106287b98 0x1062c00ba 0x1062c1b8c 0x1065a07a1 0x7fff86839899 0x7fff8683972a 0x7fff8683dfc9
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"105DFB000","o":"771D79"},{"b":"105DFB000","o":"7255A0"},{"b":"105DFB000","o":"712C36"},{"b":"105DFB000","o":"48CB98"},{"b":"105DFB000","o":"4C50BA"},{"b":"105DFB000","o":"4C6B8C"},{"b":"105DFB000","o":"7A57A1"},{"b":"7FFF86838000","o":"1899"},{"b":"7FFF86838000","o":"172A"},{"b":"7FFF86838000","o":"5FC9"}],"processInfo":{ "mongodbVersion" : "3.0.1", "gitVersion" : "534b5a3f9d10f00cd27737fbcd951032248b5952", "uname" : { "sysname" : "Darwin", "release" : "13.3.0", "version" : "Darwin Kernel Version 13.3.0: Tue Jun  3 21:27:35 PDT 2014; root:xnu-2422.110.17~1/RELEASE_X86_64", "machine" : "x86_64" }, "somap" : [ { "path" : "/Users/jenkins/workspace/Automation-Agent-OSX/go_planner/tmp/versions/mongodb-osx-x86_64-3.0.1/bin/mongod", "machType" : 2, "b" : "105DFB000", "buildId" : "BCCE8C3AD8BB31D490CB01C153674428" }, { "path" : "/usr/lib/libSystem.B.dylib", "machType" : 6, "b" : "7FFF8E415000", "buildId" : "E6BCC0314342318CA1ADA8E9AEAE716F" }, { "path" : "/usr/lib/libc++.1.dylib", "machType" : 6, "b" : "7FFF82A0F000", "buildId" : "4F68DFC5207739A8A449CAC5FDEE7BDE" }, { "path" : "/usr/lib/system/libcache.dylib", "machType" : 6, "b" : "7FFF8A9BC000", "buildId" : "BDC1E65B72A13DA3A57CB23159CAAD0B" }, { "path" : "/usr/lib/system/libcommonCrypto.dylib", "machType" : 6, "b" : "7FFF8396E000", "buildId" : "8C4F0CA0389C3EDCB155E62DD2187E1D" }, { "path" : "/usr/lib/system/libcompiler_rt.dylib", "machType" : 6, "b" : "7FFF838A1000", "buildId" : "4CD916B21B17362AB403EF24A1DAC141" }, { "path" : "/usr/lib/system/libcopyfile.dylib", "machType" : 6, "b" : "7FFF8DBA7000", "buildId" : "CF29DFF605893590834C82E2316612E8" }, { "path" : "/usr/lib/system/libcorecrypto.dylib", "machType" : 6, "b" : "7FFF8A788000", "buildId" : "F3973C2814B63006BB2B00DD7F09ABC7" }, { "path" : "/usr/lib/system/libdispatch.dylib", "machType" : 6, "b" : "7FFF8E319000", "buildId" : "C4E4A18D3C3B3C9C8709A4270D998DE7" }, { "path" : "/usr/lib/system/libdyld.dylib", "machType" : 6, "b" : "7FFF87D31000", "buildId" : "A1273A0249283BACA8466C069DC1A0A6" }, { "path" : "/usr/lib/system/libkeymgr.dylib", "machType" : 6, "b" : "7FFF839CE000", "buildId" : "3AA8D85DCF003BD3A5A0E28E1A32A6D8" }, { "path" : "/usr/lib/system/liblaunch.dylib", "machType" : 6, "b" : "7FFF8D866000", "buildId" : "A40A0C7B321639B48AE0B5D3BAF1DA8A" }, { "path" : "/usr/lib/system/libmacho.dylib", "machType" : 6, "b" : "7FFF8A6BF000", "buildId" : "1D2910DFC0363A82A3FD44FF73B5FF9B" }, { "path" : "/usr/lib/system/libquarantine.dylib", "machType" : 6, "b" : "7FFF84942000", "buildId" : "7A1A2BCBC03D3A25BFA43E569B2D2C38" }, { "path" : "/usr/lib/system/libremovefile.dylib", "machType" : 6, "b" : "7FFF8AB11000", "buildId" : "3543F917928E3DB2A2F47AB73B4970EF" }, { "path" : "/usr/lib/system/libsystem_asl.dylib", "machType" : 6, "b" : "7FFF84EAD000", "buildId" : "655FB34352CF3E2FB14DBEBF5AAEF94D" }, { "path" : "/usr/lib/system/libsystem_blocks.dylib", "machType" : 6, "b" : "7FFF8FA50000", "buildId" : "FB856CD12AEA39078E9B1E54B6827F82" }, { "path" : "/usr/lib/system/libsystem_c.dylib", "machType" : 6, "b" : "7FFF839CF000", "buildId" : "6FD3A4004BB23B95B90CBE6E9D0D78FA" }, { "path" : "/usr/lib/system/libsystem_configuration.dylib", "machType" : 6, "b" : "7FFF8A1D6000", "buildId" : "4998CB6A9D54390A9F575D1AC53C135C" }, { "path" : "/usr/lib/system/libsystem_dnssd.dylib", "machType" : 6, "b" : "7FFF88046000", "buildId" : "17B03FFD92C532829981EBB28B456207" }, { "path" : "/usr/lib/system/libsystem_info.dylib", "machType" : 6, "b" : "7FFF83126000", "buildId" : "7D41A156D2853849A2C3C04ADE797D98" }, { "path" : "/usr/lib/system/libsystem_kernel.dylib", "machType" : 6, "b" : "7FFF86E68000", "buildId" : "873931CED1AF3596AADBD2E63C9AB29F" }, { "path" : "/usr/lib/system/libsystem_m.dylib", "machType" : 6, "b" : "7FFF8B0BA000", "buildId" : "B7F0E2E4277733FCA787D6430B630D54" }, { "path" : "/usr/lib/system/libsystem_malloc.dylib", "machType" : 6, "b" : "7FFF87CE8000", "buildId" : "A695B4E438E9332EA77229D31E3F1385" }, { "path" : "/usr/lib/system/libsystem_network.dylib", "machType" : 6, "b" : "7FFF881BD000", "buildId" : "8B1E1F1DA5CC3BAE8B1EABC84337A364" }, { "path" : "/usr/lib/system/libsystem_notify.dylib", "machType" : 6, "b" : "7FFF83964000", "buildId" : "52571EC3689437E4946E064B021ED44E" }, { "path" : "/usr/lib/system/libsystem_platform.dylib", "machType" : 6, "b" : "7FFF84892000", "buildId" : "3C3D3DA832B9324398ECD89B9A1670B3" }, { "path" : "/usr/lib/system/libsystem_pthread.dylib", "machType" : 6, "b" : "7FFF86838000", "buildId" : "AB498556B555310E9041F67EC9E00E2C" }, { "path" : "/usr/lib/system/libsystem_sandbox.dylib", "machType" : 6, "b" : "7FFF8803E000", "buildId" : "0D0B13EA6B7A3AC8BE60B548543BEB77" }, { "path" : "/usr/lib/system/libsystem_stats.dylib", "machType" : 6, "b" : "7FFF86808000", "buildId" : "337946FCB2E13DFAA8B530DA8D584D75" }, { "path" : "/usr/lib/system/libunc.dylib", "machType" : 6, "b" : "7FFF8804F000", "buildId" : "62682455186236FE8A047A6B91256438" }, { "path" : "/usr/lib/system/libunwind.dylib", "machType" : 6, "b" : "7FFF880ED000", "buildId" : "78DCC3582FC1302EB3950155B47CB547" }, { "path" : "/usr/lib/system/libxpc.dylib", "machType" : 6, "b" : "7FFF8EEB1000", "buildId" : "AB40CD57F4543FD4B41563B3C0D5C624" }, { "path" : "/usr/lib/libobjc.A.dylib", "machType" : 6, "b" : "7FFF83174000", "buildId" : "AD7FD984271E30F4A3616B20319EC73B" }, { "path" : "/usr/lib/libauto.dylib", "machType" : 6, "b" : "7FFF8C762000", "buildId" : "F45C36E8B6063886B5B1B6745E757CA8" }, { "path" : "/usr/lib/libc++abi.dylib", "machType" : 6, "b" : "7FFF84E83000", "buildId" : "21A807D367323455B77F743E9F916DF0" }, { "path" : "/usr/lib/libDiagnosticMessagesClient.dylib", "machType" : 6, "b" : "7FFF83979000", "buildId" : "4CDB0F7BC0AF3424BC39495696F0DB1E" } ] }}
 mongod(_ZN5mongo15printStackTraceERNSt3__113basic_ostreamIcNS0_11char_traitsIcEEEE+0x39) [0x10656cd79]
 mongod(_ZN5mongo10logContextEPKc+0x100) [0x1065205a0]
 mongod(_ZN5mongo13fassertFailedEi+0xD6) [0x10650dc36]
 mongod(_ZN5mongo4repl26ReplicationCoordinatorImpl45prepareReplSetUpdatePositionCommandHandshakesEPNSt3__16vectorINS_7BSONObjENS2_9allocatorIS4_EEEE+0x3A8) [0x106287b98]
 mongod(_ZN5mongo4repl18SyncSourceFeedback13replHandshakeEPNS_16OperationContextE+0x5A) [0x1062c00ba]
 mongod(_ZN5mongo4repl18SyncSourceFeedback3runEv+0x26C) [0x1062c1b8c]
 mongod(_ZN5boost12_GLOBAL__N_112thread_proxyEPv+0xB1) [0x1065a07a1]
 libsystem_pthread.dylib(_pthread_body+0x8A) [0x7fff86839899]
 libsystem_pthread.dylib(_pthread_struct_init+0x0) [0x7fff8683972a]
 libsystem_pthread.dylib(thread_start+0xD) [0x7fff8683dfc9]
-----  END BACKTRACE  -----
2015-03-18T16:38:15.873-0400 I -        [SyncSourceFeedback] 
 
***aborting after fassert() failure



 Comments   
Comment by Githook User [ 26/Mar/15 ]

Author:

{u'username': u'dannenberg', u'name': u'matt dannenberg', u'email': u'matt.dannenberg@10gen.com'}

Message: SERVER-17677 do not send ReplSetUpdatePosition handshakes when REMOVED
Branch: v3.0
https://github.com/mongodb/mongo/commit/c9f935b1b6eea0ce43d01a82600842b0884e1725

Comment by Timothy Olsen (Inactive) [ 20/Mar/15 ]

There was no load on the replset during the reconfiguration.

Comment by Timothy Olsen (Inactive) [ 20/Mar/15 ]

I've only hit this backtrace once.

Attached are the requested log files.

Comment by Andy Schwerin [ 20/Mar/15 ]

There's some faulty logic in ReplicationCoordinatorImpl::prepareReplSetUpdatePositionCommand and ReplicationCoordinatorImpl::prepareReplSetUpdatePositionCommandHandshakes. Specifically, in the following code, the comment about having ensured the existence of "member" is irrelevant, because the replication coordinator mutex has been unlocked since the time when we ensured said existence. In the intervening period, a replSetReconfig may have come in, removing the member with the given member id (as happened here).

const MemberConfig* member = _rsConfig.findMemberByID(itr->memberId);
fassert(18650, member); // We ensured the member existed in processHandshake.

Comment by J Rassi [ 20/Mar/15 ]

Tim, could you please attach to this ticket the full log for the node that hit the assertion failure, and the full log for the primary?

And, are you able to reproduce this reliably?

Generated at Thu Feb 08 03:45:15 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.