[SERVER-7857] moveChunk fails after removing an unrelated shard (replset shards) Created: 06/Dec/12  Updated: 06/Dec/22  Resolved: 19/Apr/18

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.2.1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Thomas Rueckstiess Assignee: [DO NOT USE] Backlog - Sharding Team
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File bug_repro.js     Zip Archive logfiles.zip    
Assigned Teams:
Sharding
Operating System: ALL
Participants:

 Description   

I've created a 4-shard cluster in which each shard is a 3-node replica set.
After enableSharding and shardCollection, I remove shard02. I then try to moveChunk the only existing chunk from shard01 to shard04, neither of which is the removed shard. The move fails with an error:

{
	"errmsg" : "exception: No replica set monitor active and no cached seed found for set: shard02",
	"code" : 16340,
	"ok" : 0
}

After quitting the mongo shell and restarting it, the same moveChunk command succeeds. Note that nothing else (mongod, mongos) was restarted; only the shell reconnected.
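As a possible alternative to reconnecting the shell, it may be worth trying mongos's flushRouterConfig command, which drops the router's cached sharding metadata and reloads it from the config servers. This is a hypothetical workaround that was not verified against this bug:

```javascript
// Untested against this bug: ask the mongos to discard its cached
// routing metadata and reload it from the config servers.
db.adminCommand({ flushRouterConfig: 1 })

// Then retry the failing move.
sh.moveChunk('test.docs', { shardkey: 0 }, 'shard04')
```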

Below is the shell history that reproduces the issue. I tested on 2.2.0 and 2.2.1; both versions are affected.

tr@capslock:~/Documents/code/mtools$ mongo --port 27030
MongoDB shell version: 2.2.1
connecting to: 127.0.0.1:27030/test
mongos> sh.status()
--- Sharding Status --- 
  sharding version: { "_id" : 1, "version" : 3 }
  shards:
	{  "_id" : "shard01",  "host" : "shard01/capslock.local:27017,capslock.local:27018,capslock.local:27019" }
	{  "_id" : "shard02",  "host" : "shard02/capslock.local:27020,capslock.local:27021,capslock.local:27022" }
	{  "_id" : "shard03",  "host" : "shard03/capslock.local:27023,capslock.local:27024,capslock.local:27025" }
	{  "_id" : "shard04",  "host" : "shard04/capslock.local:27026,capslock.local:27027,capslock.local:27028" }
  databases:
	{  "_id" : "admin",  "partitioned" : false,  "primary" : "config" }
 
mongos> sh.enableSharding('test')
{ "ok" : 1 }
mongos> sh.shardCollection('test.docs', {shardkey: 1})
{ "collectionsharded" : "test.docs", "ok" : 1 }
mongos> sh.status()
--- Sharding Status --- 
  sharding version: { "_id" : 1, "version" : 3 }
  shards:
	{  "_id" : "shard01",  "host" : "shard01/capslock.local:27017,capslock.local:27018,capslock.local:27019" }
	{  "_id" : "shard02",  "host" : "shard02/capslock.local:27020,capslock.local:27021,capslock.local:27022" }
	{  "_id" : "shard03",  "host" : "shard03/capslock.local:27023,capslock.local:27024,capslock.local:27025" }
	{  "_id" : "shard04",  "host" : "shard04/capslock.local:27026,capslock.local:27027,capslock.local:27028" }
  databases:
	{  "_id" : "admin",  "partitioned" : false,  "primary" : "config" }
	{  "_id" : "test",  "partitioned" : true,  "primary" : "shard01" }
		test.docs chunks:
				shard01	1
			{ "shardkey" : { $minKey : 1 } } -->> { "shardkey" : { $maxKey : 1 } } on : shard01 Timestamp(1000, 0) 
 
mongos> db.adminCommand({removeShard: 'shard02'})
{
	"msg" : "draining started successfully",
	"state" : "started",
	"shard" : "shard02",
	"ok" : 1
}
mongos> db.adminCommand({removeShard: 'shard02'})
{
	"msg" : "removeshard completed successfully",
	"state" : "completed",
	"shard" : "shard02",
	"ok" : 1
}
mongos> db.adminCommand({removeShard: 'shard02'})
{
	"errmsg" : "exception: can't find shard for: shard02",
	"code" : 13129,
	"ok" : 0
}
mongos> sh.status()
--- Sharding Status --- 
  sharding version: { "_id" : 1, "version" : 3 }
  shards:
	{  "_id" : "shard01",  "host" : "shard01/capslock.local:27017,capslock.local:27018,capslock.local:27019" }
	{  "_id" : "shard03",  "host" : "shard03/capslock.local:27023,capslock.local:27024,capslock.local:27025" }
	{  "_id" : "shard04",  "host" : "shard04/capslock.local:27026,capslock.local:27027,capslock.local:27028" }
  databases:
	{  "_id" : "admin",  "partitioned" : false,  "primary" : "config" }
	{  "_id" : "test",  "partitioned" : true,  "primary" : "shard01" }
		test.docs chunks:
				shard01	1
			{ "shardkey" : { $minKey : 1 } } -->> { "shardkey" : { $maxKey : 1 } } on : shard01 Timestamp(1000, 0) 
 
mongos> sh.moveChunk('test.docs', {shardkey: 0}, 'shard04')
{
	"errmsg" : "exception: No replica set monitor active and no cached seed found for set: shard02",
	"code" : 16340,
	"ok" : 0
}
mongos> ^C
bye
tr@capslock:~/Documents/code/mtools$ mongo --port 27030
MongoDB shell version: 2.2.1
connecting to: 127.0.0.1:27030/test
mongos> sh.moveChunk('test.docs', {shardkey: 0}, 'shard04')
{ "millis" : 2236, "ok" : 1 }
mongos> sh.status()
--- Sharding Status --- 
  sharding version: { "_id" : 1, "version" : 3 }
  shards:
	{  "_id" : "shard01",  "host" : "shard01/capslock.local:27017,capslock.local:27018,capslock.local:27019" }
	{  "_id" : "shard03",  "host" : "shard03/capslock.local:27023,capslock.local:27024,capslock.local:27025" }
	{  "_id" : "shard04",  "host" : "shard04/capslock.local:27026,capslock.local:27027,capslock.local:27028" }
  databases:
	{  "_id" : "admin",  "partitioned" : false,  "primary" : "config" }
	{  "_id" : "test",  "partitioned" : true,  "primary" : "shard01" }
		test.docs chunks:
				shard04	1
			{ "shardkey" : { $minKey : 1 } } -->> { "shardkey" : { $maxKey : 1 } } on : shard04 Timestamp(2000, 0) 
 
mongos> 



 Comments   
Comment by Thomas Rueckstiess [ 06/Dec/12 ]

The assertion and mongos stack trace below are a red herring: they are caused by calling removeShard one more time than necessary, and they are not the cause of the issue. I retried calling removeShard only twice, and the moveChunk error still appears, without the stack trace.

Thu Dec  6 15:48:16 [conn3] resetting shard version of test.docs on capslock.local:27026, version is zero
Thu Dec  6 15:48:16 [conn3] going to start draining shard: shard02
primaryLocalDoc: { _id: "local", primary: "shard02" }
Thu Dec  6 15:48:17 [conn3] going to remove shard: shard02
Thu Dec  6 15:48:17 [conn3] deleting replica set monitor for: shard02/capslock.local:27020,capslock.local:27021,capslock.local:27022
Thu Dec  6 15:48:18 [conn3] Assertion: 13129:can't find shard for: shard02
0x10dd25c9b 0x10dd054de 0x10dd055dd 0x10dcb8aa2 0x10dcb8bbb 0x10dcb58ca 0x10dc680e5 0x10dc6fd5b 0x10dcd0fce 0x10dcc5c9f 0x10dcb188c 0x10db2b92f 0x10dd1bd5d 0x10dd57335 0x7fff8aa65742 0x7fff8aa52181 
 0   mongos                              0x000000010dd25c9b _ZN5mongo15printStackTraceERSo + 43
 1   mongos                              0x000000010dd054de _ZN5mongo11msgassertedEiPKc + 174
 2   mongos                              0x000000010dd055dd _ZN5mongo11msgassertedEiRKSs + 29
 3   mongos                              0x000000010dcb8aa2 _ZN5mongo15StaticShardInfo4findERKSs + 434
 4   mongos                              0x000000010dcb8bbb _ZN5mongo15StaticShardInfo8findCopyERKSs + 37
 5   mongos                              0x000000010dcb58ca _ZN5mongo5Shard5resetERKSs + 42
 6   mongos                              0x000000010dc680e5 _ZN5mongo11dbgrid_cmds14RemoveShardCmd3runERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb + 313
 7   mongos                              0x000000010dc6fd5b _ZN5mongo7Command20runAgainstRegisteredEPKcRNS_7BSONObjERNS_14BSONObjBuilderEi + 2073
 8   mongos                              0x000000010dcd0fce _ZN5mongo14SingleStrategy7queryOpERNS_7RequestE + 744
 9   mongos                              0x000000010dcc5c9f _ZN5mongo13ShardStrategy7queryOpERNS_7RequestE + 61
 10  mongos                              0x000000010dcb188c _ZN5mongo7Request7processEi + 432
 11  mongos                              0x000000010db2b92f _ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE + 141
 12  mongos                              0x000000010dd1bd5d _ZN5mongo3pms9threadRunEPNS_13MessagingPortE + 1645
 13  mongos                              0x000000010dd57335 thread_proxy + 229
 14  libsystem_c.dylib                   0x00007fff8aa65742 _pthread_start + 327
 15  libsystem_c.dylib                   0x00007fff8aa52181 thread_start + 13
Thu Dec  6 15:48:19 [conn3] end connection 127.0.0.1:56785 (0 connections now open)

Comment by Thomas Rueckstiess [ 06/Dec/12 ]

Log files and a JavaScript file to reproduce the issue are attached.

Start up a cluster with 4 shards (shard01, shard02, shard03, shard04) each consisting of a replica set.

(For quick setup you can use my mlaunch script, available at https://github.com/rueckstiess/mtools: mlaunch --sharded 4 --replicaset)

Then run

mongo --port <mongos-port> bug_repro.js
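For reference, a repro script along the following lines captures the same steps. It is reconstructed from the shell session in the description; the actual bug_repro.js attachment may differ:

```javascript
// Reconstruction of the repro steps from the shell session above; the
// attached bug_repro.js may differ. Run against the mongos of a 4-shard
// cluster (shard01..shard04, each a replica set):
//   mongo --port <mongos-port> bug_repro.js

sh.enableSharding('test');
sh.shardCollection('test.docs', { shardkey: 1 });

// removeShard is a two-step command: the first call starts draining,
// the second confirms completion (shard02 holds no chunks or databases).
db.adminCommand({ removeShard: 'shard02' });
db.adminCommand({ removeShard: 'shard02' });

// This move involves neither shard02 nor its replica set, yet it fails with
// "No replica set monitor active and no cached seed found for set: shard02".
var res = sh.moveChunk('test.docs', { shardkey: 0 }, 'shard04');
printjson(res);
```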

Comment by Thomas Rueckstiess [ 06/Dec/12 ]

Repeated calls to moveChunk (prior to restarting the shell) produced the same error. It was only resolved once I restarted the mongo shell.

Generated at Thu Feb 08 03:15:46 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.