[SERVER-28079] Secondary mongod crashes when removed from replicaset using rs.remove() Created: 23/Feb/17  Updated: 06/Dec/22  Resolved: 19/Dec/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.2.11, 3.4.2, 3.5.3
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Manoj Vivek Assignee: Backlog - Replication Team
Resolution: Incomplete Votes: 1
Labels: former-quick-wins, neweng
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File server-28079.diff    
Issue Links:
Backports
Related
related to SERVER-27166 secondary crashes after being removed... Closed
is related to SERVER-30089 Arbiter crash with invariant failure ... Closed
Assigned Teams:
Replication
Operating System: ALL
Backport Requested:
v3.6, v3.4, v3.2
Steps To Reproduce:

Just running rs.remove() crashed the removed node.
This happened three times in the past two days, on two different secondary nodes.


 Description   

Hi,
When I remove a replica from the replica set using the rs.remove() command, the secondary that was removed crashes.
Say I run rs.remove("10.0.1.211") on the primary node.
The mongod running at 10.0.1.211 then crashes with the trace below:

2017-02-23T02:46:40.516-0800 I -        [ReplicationExecutor] Invariant failure i < _members.size() src/mongo/db/repl/replica_set_config.cpp 560
2017-02-23T02:46:40.516-0800 I -        [ReplicationExecutor] 
 
***aborting after invariant() failure
 
 
2017-02-23T02:46:40.522-0800 F -        [ReplicationExecutor] Got signal: 6 (Aborted).
 
 0x132fa32 0x132eb89 0x132f392 0x7f001f5375b0 0x7f001f1b6f49 0x7f001f1b8348 0x12b581b 0xee879a 0xf60936 0xf16939 0xf2ae32 0xf2fdc5 0x1b5f300 0x7f001f52ff18 0x7f001f265e9d
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"F2FA32","s":"_ZN5mongo15printStackTraceERSo"},{"b":"400000","o":"F2EB89"},{"b":"400000","o":"F2F392"},{"b":"7F001F528000","o":"F5B0"},{"b":"7F001F183000","o":"33F49","s":"gsignal"},{"b":"7F001F183000","o":"35348","s":"abort"},{"b":"400000","o":"EB581B","s":"_ZN5mongo15invariantFailedEPKcS1_j"},{"b":"400000","o":"AE879A"},{"b":"400000","o":"B60936","s":"_ZNK5mongo4repl23TopologyCoordinatorImpl22shouldChangeSyncSourceERKNS_11HostAndPortERKNS0_6OpTimeES7_bNS_6Date_tE"},{"b":"400000","o":"B16939","s":"_ZN5mongo4repl26ReplicationCoordinatorImpl23_shouldChangeSyncSourceERKNS_8executor12TaskExecutor12CallbackArgsERKNS_11HostAndPortERKNS0_6OpTimeEbPb"},{"b":"400000","o":"B2AE32"},{"b":"400000","o":"B2FDC5","s":"_ZN5mongo4repl19ReplicationExecutor3runEv"},{"b":"400000","o":"175F300","s":"execute_native_thread_routine"},{"b":"7F001F528000","o":"7F18"},{"b":"7F001F183000","o":"E2E9D","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.2.11", "gitVersion" : "009580ad490190ba33d1c6253ebd8d91808923e4", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "4.4.23-31.54.amzn1.x86_64", "version" : "#1 SMP Tue Oct 18 22:02:09 UTC 2016", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "88BEAACF5E9DF7E467AF167F9F4F31D20015315F" }, { "b" : "7FFC6E5F6000", "elfType" : 3, "buildId" : "3C621354FA6866C1B7DBFFDA88CE59560C001BE1" }, { "b" : "7F0020447000", "path" : "/usr/lib64/libssl.so.10", "elfType" : 3, "buildId" : "FA7CA2477D0B7E4D5D3D875501F1EEC5C2D883A2" }, { "b" : "7F0020064000", "path" : "/lib64/libcrypto.so.10", "elfType" : 3, "buildId" : "AEC6000432F3A98B6F7F3BD3793F12B13D0A0FC2" }, { "b" : "7F001FE5C000", "path" : "/lib64/librt.so.1", "elfType" : 3, "buildId" : "92B6FB6A7CF87B575FE6043F95639C1A081E2E2A" }, { "b" : "7F001FC58000", "path" : "/lib64/libdl.so.2", "elfType" : 3, "buildId" : "F08BBD07F4042BC8D0D314F5E1F7F5D26F028CFB" }, { "b" : "7F001F95A000", "path" : "/lib64/libm.so.6", "elfType" : 3, 
"buildId" : "DFB15C9F2E7C575E1954C19CEFC2842DE2C265DB" }, { "b" : "7F001F744000", "path" : "/lib64/libgcc_s.so.1", "elfType" : 3, "buildId" : "DB655E06F0F4F7B4EC561BB7E620F5D5BC4F1C54" }, { "b" : "7F001F528000", "path" : "/lib64/libpthread.so.0", "elfType" : 3, "buildId" : "3C51D8CB39ED16242013CB77B0125707C6F34406" }, { "b" : "7F001F183000", "path" : "/lib64/libc.so.6", "elfType" : 3, "buildId" : "3B97A7F435805785DFD11096836F4E904FFF2599" }, { "b" : "7F00206B3000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "CAD953A4B324B3E3AA1449742558362857E826F2" }, { "b" : "7F001EF40000", "path" : "/lib64/libgssapi_krb5.so.2", "elfType" : 3, "buildId" : "9DF61878D8918F25CC74AD01F417FDB051DFE3DA" }, { "b" : "7F001EC5B000", "path" : "/lib64/libkrb5.so.3", "elfType" : 3, "buildId" : "6F1DB0F811D1B210520443442D4437BC43BF9A80" }, { "b" : "7F001EA58000", "path" : "/usr/lib64/libcom_err.so.2", "elfType" : 3, "buildId" : "E52249AE6C9865B5C3B9697A57FC92200DA51CF3" }, { "b" : "7F001E82D000", "path" : "/lib64/libk5crypto.so.3", "elfType" : 3, "buildId" : "F7DF34078FD7BFD684FE46D5F677EEDA1D9B9DC9" }, { "b" : "7F001E617000", "path" : "/lib64/libz.so.1", "elfType" : 3, "buildId" : "87B4EBF2183C8EA4AB657212203EFFE6340E2F4F" }, { "b" : "7F001E40C000", "path" : "/lib64/libkrb5support.so.0", "elfType" : 3, "buildId" : "381960ACAB9C39461D58BDE7B272C4F61BB3582F" }, { "b" : "7F001E209000", "path" : "/lib64/libkeyutils.so.1", "elfType" : 3, "buildId" : "BF48CD5658DE95CE058C4B828E81C97E2AE19643" }, { "b" : "7F001DFF2000", "path" : "/lib64/libresolv.so.2", "elfType" : 3, "buildId" : "E25410C0CBAA4D33369EDD8086A0CF24F5AFE4E7" }, { "b" : "7F001DDD1000", "path" : "/usr/lib64/libselinux.so.1", "elfType" : 3, "buildId" : "803D7EF21A989677D056E52BAEB9AB5B154FB9D9" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x32) [0x132fa32]
 mongod(+0xF2EB89) [0x132eb89]
 mongod(+0xF2F392) [0x132f392]
 libpthread.so.0(+0xF5B0) [0x7f001f5375b0]
 libc.so.6(gsignal+0x39) [0x7f001f1b6f49]
 libc.so.6(abort+0x148) [0x7f001f1b8348]
 mongod(_ZN5mongo15invariantFailedEPKcS1_j+0xCB) [0x12b581b]
 mongod(+0xAE879A) [0xee879a]
 mongod(_ZNK5mongo4repl23TopologyCoordinatorImpl22shouldChangeSyncSourceERKNS_11HostAndPortERKNS0_6OpTimeES7_bNS_6Date_tE+0x1B6) [0xf60936]
 mongod(_ZN5mongo4repl26ReplicationCoordinatorImpl23_shouldChangeSyncSourceERKNS_8executor12TaskExecutor12CallbackArgsERKNS_11HostAndPortERKNS0_6OpTimeEbPb+0x89) [0xf16939]
 mongod(+0xB2AE32) [0xf2ae32]
 mongod(_ZN5mongo4repl19ReplicationExecutor3runEv+0x275) [0xf2fdc5]
 mongod(execute_native_thread_routine+0x20) [0x1b5f300]
 libpthread.so.0(+0x7F18) [0x7f001f52ff18]
 libc.so.6(clone+0x6D) [0x7f001f265e9d]
-----  END BACKTRACE  -----



 Comments   
Comment by Steven Vannelli [ 19/Dec/19 ]

The known crash has been fixed and the team did not have enough details to identify a separate bug.

Comment by He Lei [ 27/Apr/18 ]

Thank you for your prompt response. Since the crash caused by rs.remove() has no effect on the remaining nodes in our production environment, we decided not to pursue this issue further.

Comment by Judah Schvimer [ 24/Apr/18 ]

I think this ticket was actually fixed as part of https://github.com/mongodb/mongo/commit/c88c4809c2440d286ed0fc29e1e8d684f015e563#diff-cbf616dcf2a6cf9c877fbdd057c8a1c4R2769, per the comment above. Someone can run the repro to see whether it has been fixed.

dbapower, since it appears your failure has a different cause, and thus likely a different fix, can you please file a new ticket with a script or steps we can follow to reproduce the issue? Can you please include the config before and after the rs.remove call, and the exact rs.remove call you ran?

Thanks!
Judah

Comment by He Lei [ 23/Apr/18 ]

Hi Judah Schvimer, we found that the problem may not be caused by the non-voting member. When mongod crashes, the log shows:
Invariant failure i < _members.size() src/mongo/db/repl/repl_set_config.cpp 620

Line 620 of repl_set_config.cpp is in:

const MemberConfig& ReplSetConfig::getMemberAt(size_t i) const {
    invariant(i < _members.size());
    return _members[i];
}

Based on this, we think the "_id" values may be causing the problem.
For example, with a 5-member replica set whose _ids are 0, 1, 2, 3, 4, _members.size() is 5, so index 4 passes the check (4 < 5).
If rs.remove() removes more than one member at a time, say member 3 and member 4, the set is left with 3 members and _members.size() drops to 3.
But if i is still 4 (for some reason), 4 < 3 is false, and the invariant fails. Could this be the cause?

Comment by Judah Schvimer [ 19/Apr/18 ]

Hi!
I'm sorry to hear that you've hit this problem. Please see the reproduction script attached as server-28079.diff. It applies on top of https://github.com/mongodb/mongo/commit/16cb9f02cb69427a6ebbe67bfa76d566000804e8. Let me know if you have any further questions.
Best,
Judah

Comment by He Lei [ 19/Apr/18 ]

Hi @Judah Schvimer, could you please show how to reproduce this issue? Our mongod instance hit the same problem, but when I tried to reproduce it, nothing happened.
I built an 8-member replica set with 2 non-voting members and either 1 arbiter or no arbiter, on versions 3.2 and 3.4, but running rs.remove() on the non-voting nodes did not crash anything. Our production mongod instance does hit the problem, and I have confirmed it has 2 non-voting members, yet I cannot reproduce the crash in my test environment.

Comment by Agam Dua [ 21/Mar/18 ]

I can confirm this happened when I ran `rs.remove()` on a non-voting node. The removed member's configuration was:

		{
			"_id" : 11,
			"host" : "<host:port>",
			"arbiterOnly" : false,
			"buildIndexes" : true,
			"hidden" : true,
			"priority" : 0,
			"tags" : {
 
			},
			"slaveDelay" : NumberLong(0),
			"votes" : 0
		}

$ mongo --version
MongoDB shell version v3.4.4
git version: 888390515874a9debd1b6c5d36559ca86b44babd
allocator: tcmalloc
modules: none
build environment:
    distarch: x86_64
    target_arch: x86_64

Comment by Judah Schvimer [ 03/Mar/17 ]

Also, I have confirmed that this has not been fixed in more recent versions.

Comment by Judah Schvimer [ 03/Mar/17 ]

The easiest fix for this is probably to check whether a node's _selfIndex is -1 at the beginning of TopologyCoordinator::shouldChangeSyncSource; however, I'm not sure whether any other uses of _selfConfig() are also a problem.

Comment by Manoj Vivek [ 03/Mar/17 ]

Hi judah.schvimer, thanks so much for checking this.
The rs.conf() output I posted was taken after removing the replica member, and I can confirm we had two non-voting members in the replica set when the crash happened.
Let me know if you need anything else.

PS: Sorry for the misleading rs.conf() output I posted earlier.

Comment by Judah Schvimer [ 02/Mar/17 ]

Thank you vivek_jonam for providing the replica set configuration. I was able to reproduce the issue, but only using non-voting nodes, and your configuration does not include any non-voting nodes. What was the configuration of the nodes that were removed?

Comment by Judah Schvimer [ 02/Mar/17 ]

Reproduction script attached in server-28079.diff. This applies on top of https://github.com/mongodb/mongo/commit/16cb9f02cb69427a6ebbe67bfa76d566000804e8.

Comment by Manoj Vivek [ 28/Feb/17 ]

Judah, below is the output of the rs.conf() command:

{
	"_id" : "rs_name",
	"version" : 211560,
	"members" : [
		{
			"_id" : 6,
			"host" : "10.0.0.141:27017",
			"arbiterOnly" : false,
			"buildIndexes" : true,
			"hidden" : true,
			"priority" : 0,
			"tags" : {
				
			},
			"slaveDelay" : NumberLong(0),
			"votes" : 1
		},
		{
			"_id" : 8,
			"host" : "10.0.1.213:27017",
			"arbiterOnly" : true,
			"buildIndexes" : true,
			"hidden" : false,
			"priority" : 1,
			"tags" : {
				
			},
			"slaveDelay" : NumberLong(0),
			"votes" : 1
		},
		{
			"_id" : 17,
			"host" : "10.0.1.212:27017",
			"arbiterOnly" : false,
			"buildIndexes" : true,
			"hidden" : false,
			"priority" : 1,
			"tags" : {
				
			},
			"slaveDelay" : NumberLong(0),
			"votes" : 1
		},
		{
			"_id" : 23,
			"host" : "10.0.1.215:27017",
			"arbiterOnly" : false,
			"buildIndexes" : true,
			"hidden" : false,
			"priority" : 1,
			"tags" : {
				
			},
			"slaveDelay" : NumberLong(0),
			"votes" : 1
		},
		{
			"_id" : 24,
			"host" : "10.0.1.211:27017",
			"arbiterOnly" : false,
			"buildIndexes" : true,
			"hidden" : false,
			"priority" : 1,
			"tags" : {
				
			},
			"slaveDelay" : NumberLong(0),
			"votes" : 1
		}
	],
	"settings" : {
		"chainingAllowed" : true,
		"heartbeatIntervalMillis" : 2000,
		"heartbeatTimeoutSecs" : 10,
		"electionTimeoutMillis" : 10000,
		"getLastErrorModes" : {
			
		},
		"getLastErrorDefaults" : {
			"w" : 1,
			"wtimeout" : 0
		}
	}
}

Interestingly, rs.conf() didn't print the field protocolVersion.

Comment by Judah Schvimer [ 27/Feb/17 ]

Hi vivek_jonam,

Thank you for bringing this to our attention. Can you please provide information on the configuration of your replica set? Specifically how many nodes you have, the replication protocolVersion, and any arbiters or priorities? Additionally, do you have any more specific steps I can try to reproduce this issue? I have been unable to reproduce it myself.

Thank you,
Judah

Generated at Thu Feb 08 04:17:04 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.