[SERVER-31693] I cannot remove an unreachable shard server in a mongo shard cluster Created: 24/Oct/17  Updated: 06/Dec/22  Resolved: 15/Dec/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Eric Lee Assignee: Backlog - Triage Team
Resolution: Incomplete Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by SERVER-31692 How to remove a damage shard in mongo... Closed
Assigned Teams:
Server Triage
Operating System: ALL
Participants:

 Description   

The data on that shard server has been lost and the server is down; how can I remove it directly from my cluster?

I have tried

db.runCommand({ removeshard: "shard0002" })

However, it has no effect; the response is always:

{
	"msg" : "draining ongoing",
	"state" : "ongoing",
	"remaining" : { "chunks" : NumberLong(612), "dbs" : NumberLong(0) },
	"note" : "you need to drop or movePrimary these databases",
	"dbsToMove" : [ ],
	"ok" : 1
}

I just want to remove the damaged (unreachable) shard and let the cluster run normally.
However, I couldn't find a way to do it...
Can anyone help?
Thanks in advance!
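
For context on why the response never stops saying "draining ongoing": removeShard only marks the shard as draining, and the balancer must then migrate every chunk off it, which requires the shard to be reachable. A minimal sketch for watching whether draining is actually progressing (mongo shell against a mongos; the shard name shard0002 is taken from this ticket, and the 3.4-era config.chunks schema is assumed):

// removeShard is idempotent; re-issuing it reports how many chunks
// still live on the draining shard.
var status = db.getSiblingDB("admin").runCommand({ removeShard: "shard0002" });
printjson(status.remaining); // if "chunks" never decreases, no migration is succeeding

// Cross-check against the cluster metadata.
var stuck = db.getSiblingDB("config").chunks.count({ shard: "shard0002" });
print("chunks still assigned to shard0002: " + stuck);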



 Comments   
Comment by Kelsey Schubert [ 15/Dec/21 ]

Hi Eric Lee,

Sorry this issue fell through the cracks. I'm going to resolve it, since I assume the issue has been resolved or is no longer relevant, but let me know if I'm mistaken.

Thanks,
Kelsey

Comment by Eric Lee [ 25/Oct/17 ]

I got a new device and used the same IP to rebuild the shard server, but the data was gone.
The output of db.UserGps.count():

5775812

However, it doesn't include the damaged shard, so the result is wrong.

The output of db.UserGps.stats():

{
	"sharded" : true,
	"capped" : false,
	"ok" : 0,
	"errmsg" : "failed on shard: { ns: \"gps_db.UserGps\", ok: 0.0, errmsg: \"Database [gps_db] not found.\" }"
}
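
The undercount is expected here: count() is scattered to every shard, and the rebuilt server at 10.112.18.13 answers but holds no data, so the total is short by whatever lived on shard0002. A quick way to see how many chunks the metadata still attributes to that shard (a sketch, again assuming the 3.4-era config.chunks schema with an ns field):

// Chunk distribution per shard for the collection, straight from the
// config database; ranges listed under shard0002 are effectively empty.
db.getSiblingDB("config").chunks.aggregate([
	{ $match: { ns: "gps_db.UserGps" } },
	{ $group: { _id: "$shard", chunks: { $sum: 1 } } },
	{ $sort: { chunks: -1 } }
])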

Comment by Eric Lee [ 24/Oct/17 ]

Output of sh.status(true):

--- Sharding Status ---
  sharding version: {
	"_id" : 1,
	"minCompatibleVersion" : 5,
	"currentVersion" : 6,
	"clusterId" : ObjectId("58bd58246056bd6afc62d9de")
}
  shards:
	{  "_id" : "shard0000",  "host" : "xxx:3031",  "state" : 1 }
	{  "_id" : "shard0001",  "host" : "xxx:3031",  "state" : 1 }
	{  "_id" : "shard0002",  "host" : "10.112.18.13:3031",  "state" : 1,  "draining" : true }
	{  "_id" : "shard0003",  "host" : "xxx:3031",  "state" : 1 }
	{  "_id" : "shard0004",  "host" : "xxx:3031",  "state" : 1 }
	{  "_id" : "shard0005",  "host" : "xxx:3031",  "state" : 1 }
	{  "_id" : "shard0006",  "host" : "xxx:3031",  "state" : 1 }
	{  "_id" : "shard0007",  "host" : "xxx:3031",  "state" : 1 }
	{  "_id" : "shard0008",  "host" : "xxx:3031",  "state" : 1 }
  active mongoses:
	{  "_id" : "xxxx",  "ping" : ISODate("2017-10-24T17:51:06.743Z"),  "up" : NumberLong(3480),  "waiting" : true,  "mongoVersion" : "3.4.2" }
	{  "_id" : "xxxx",  "ping" : ISODate("2017-10-24T17:50:56.846Z"),  "up" : NumberLong(18447),  "waiting" : true,  "mongoVersion" : "3.4.9" }
  autosplit:
	Currently enabled: yes
  balancer:
	Currently enabled:  yes
	Currently running:  yes
		Balancer lock taken at Wed Oct 25 2017 00:53:04 GMT+0800 (CST) by ConfigServer:Balancer
		Balancer active window is set between 00:00 and 07:00 server local time
	Failed balancer rounds in last 5 attempts:  5
	Last reported error:  Connection timed out
	Time of Reported error:  Wed Oct 25 2017 01:49:34 GMT+0800 (CST)
	Migration Results for the last 24 hours:
		16 : Success
		5 : Failed with error 'aborted', from shard0005 to shard0007
		10 : Failed with error 'aborted', from shard0002 to shard0008
		10 : Failed with error 'aborted', from shard0001 to shard0004
		49 : Failed with error 'aborted', from shard0005 to shard0006
		131 : Failed with error 'aborted', from shard0005 to shard0004
		10 : Failed with error 'aborted', from shard0000 to shard0006
		1799 : Failed with error 'aborted', from shard0001 to shard0006
		1808 : Failed with error 'aborted', from shard0002 to shard0007
		1608 : Failed with error 'aborted', from shard0005 to shard0003
		1799 : Failed with error 'aborted', from shard0000 to shard0008
  databases:
	{  "_id" : "xxx",  "primary" : "shard0000",  "partitioned" : true }
		xxx.c1
			shard key: { "shd" : 1 }
			unique: false
			balancing: true
			chunks:
				shard0000	191
				shard0001	173
				shard0002	167
				shard0003	167
				shard0004	166
				shard0005	171
				shard0006	164
				shard0007	166
				shard0008	161
			{ "shd" : { "$minKey" : 1 } } -->> { "shd" : 14681 } on : shard0006 Timestamp(679, 0)
			{ "shd" : 14681 } -->> { "shd" : 163145 } on : shard0008 Timestamp(671, 5)
			{ "shd" : 163145 } -->> { "shd" : 207836 } on : shard0008 Timestamp(671, 6)
			{ "shd" : 207836 } -->> { "shd" : 207860 } on : shard0008 Timestamp(671, 7)
			{ "shd" : 207860 } -->> { "shd" : 229145 } on : shard0008 Timestamp(639, 23) jumbo
			{ "shd" : 229145 } -->> { "shd" : 377572 } on : shard0008 Timestamp(639, 24) jumbo
                        { "shd" : 377572 } -->> { "shd" : 377712 } on : shard0005 Timestamp(668, 0)
			{ "shd" : 377712 } -->> { "shd" : 378619 } on : shard0006 Timestamp(670, 0)
			{ "shd" : 378619 } -->> { "shd" : 404468 } on : shard0002 Timestamp(678, 1) jumbo
			{ "shd" : 404468 } -->> { "shd" : 423559 } on : shard0002 Timestamp(676, 12) jumbo
			{ "shd" : 423559 } -->> { "shd" : 423589 } on : shard0004 Timestamp(678, 0)
			{ "shd" : 423589 } -->> { "shd" : 424421 } on : shard0002 Timestamp(672, 0)
			{ "shd" : 424421 } -->> { "shd" : 424499 } on : shard0000 Timestamp(674, 1) jumbo
			{ "shd" : 424499 } -->> { "shd" : 424723 } on : shard0003 Timestamp(675, 0)
			{ "shd" : 424723 } -->> { "shd" : 424729 } on : shard0002 Timestamp(644, 0)
....................

Comment by Eric Lee [ 24/Oct/17 ]

Here are some logs from the mongos server:

2017-10-25T01:42:35.446+0800 I ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Failed to connect to 10.112.18.13:3031 - HostUnreachable: Connection timed out

In my opinion, db.runCommand({ removeshard: "shard0002" }) can never complete, because the migrations it triggers need the shard to be reachable.
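
That is consistent with how removeShard works: it only schedules migrations, and a migration needs the source shard alive. When the shard's data is unrecoverable, the only way out on this version was unsupported metadata surgery on the config database. A heavily hedged sketch of that approach follows; the collection names (config.shards, config.chunks) are the real config schema, but the steps themselves are an assumption rather than a documented procedure, so back up the config servers before attempting anything like this:

// 1. Stop the balancer so nothing races the metadata edits.
sh.stopBalancer();

// 2. Reassign the dead shard's chunk metadata to a healthy shard. The
//    data itself is lost, so this only restores consistency; the
//    reassigned ranges will read back as empty.
db.getSiblingDB("config").chunks.updateMany(
	{ shard: "shard0002" },
	{ $set: { shard: "shard0000" } }
);

// 3. Remove the shard document itself.
db.getSiblingDB("config").shards.deleteOne({ _id: "shard0002" });

// 4. Make the mongos processes reload the metadata (restarting them
//    may be needed anyway, since chunk versions were not bumped).
db.adminCommand({ flushRouterConfig: 1 });

// 5. Re-enable the balancer.
sh.startBalancer();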

Comment by Eric Lee [ 24/Oct/17 ]

Hi Mark,
It's because my server is down (socket exception [CONNECT_ERROR] for 10.112.18.13:3031).
I want to remove it from the shard cluster; what should I do?
I used this command:

db.runCommand({ removeshard: "shard0002" })

and it always returns:

{
	"msg" : "draining ongoing",
	"state" : "ongoing",
	"remaining" : {
		"chunks" : NumberLong(612),
		"dbs" : NumberLong(0)
	},
	"note" : "you need to drop or movePrimary these databases",
	"dbsToMove" : [ ],
	"ok" : 1
}

I know the draining can never succeed, because the shard server 10.112.18.13 is unreachable.
That server (10.112.18.13) suffered an SSD failure and the data is gone because the disks were in RAID0, so I have to remove it from the cluster and add another new shard server.
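
For the replacement step, adding the new shard back is the documented part (a sketch; the replica set name and host below are placeholders, not values from this ticket):

// Run against a mongos; on 3.4 a shard is normally a replica set.
sh.addShard("rs9/newhost.example.net:3031");
sh.status(); // the new shard should appear and start receiving chunks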

Regards,
Eric

Comment by Mark Agarunov [ 24/Oct/17 ]

Hello Eric Lee,

Thank you for the report. To get a better idea of why this might be happening, could you please provide the following:

  • The complete output of sh.status(true)
  • The complete logs from all affected mongod nodes
  • The complete logs from all affected mongos nodes

This should give some insight into why you are seeing this error when trying to remove the shard.

Thanks,
Mark
