[SERVER-33386] Chunk Migration stopped Created: 18/Feb/18  Updated: 21/Mar/18  Resolved: 20/Feb/18

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.0.12
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: Kaviyarasan Ramalingam Assignee: Kelsey Schubert
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Participants:

 Description   

Team,

We are facing issues while removing shards from our production cluster.

We have three shards in production, and we are working on removing them.

We followed the procedure documented at https://docs.mongodb.com/v3.0/tutorial/remove-shards-from-cluster/index.html, but the chunk migration has stopped. Relevant log excerpts:
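The documented procedure boils down to roughly the following shell commands (the shard name matches the sh.status() output further below; the database name is only a placeholder):

// run against a mongos, with the balancer enabled so the draining migrations can run
use admin
db.runCommand( { removeShard: "shard2" } )          // starts draining the shard

// re-run the same command periodically to check progress ("remaining.chunks")
db.runCommand( { removeShard: "shard2" } )

// if the draining shard is the primary for any database, move the primary off it first, e.g.
// db.runCommand( { movePrimary: "someDatabase", to: "shard1" } )

// a final removeShard call reports "state" : "completed" once draining is done
db.runCommand( { removeShard: "shard2" } )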

2018-02-18T05:33:27.635+0000 I SHARDING [conn4421103] waiting till out of critical section
2018-02-18T05:33:27.641+0000 I SHARDING [conn4421196] waiting till out of critical section
2018-02-18T05:33:27.671+0000 I SHARDING [conn4421085] waiting till out of critical section
2018-02-18T05:33:27.677+0000 I SHARDING [conn4420862] waiting till out of critical section
2018-02-18T05:33:27.728+0000 I NETWORK  [conn4420978] Socket recv() timeout  xx.xx.xx.xx:27047
2018-02-18T05:33:27.728+0000 I NETWORK  [conn4420978] SocketException: remote: xx.xx.xx.xx:27047 error: 9001 socket exception [RECV_TIMEOUT] server [xx.xx.xx.xx:27047]
2018-02-18T05:33:27.728+0000 I NETWORK  [conn4420978] DBClientCursor::init call() failed
2018-02-18T05:33:27.728+0000 I NETWORK  [conn4420978] scoped connection to xx.xx.xx.xx:27047,xx.xx.xx.xx:27047,xx.xx.xx.xx:27047 not being returned to the pool
2018-02-18T05:33:27.729+0000 W SHARDING [conn4420978] 10276 DBClientBase::findN: transport error: xx.xx.xx.xx:27047 ns: config.$cmd query: { applyOps: [ { op: "u", b: false, ns: "config.chunks", o: { _id: "database_name.measurement_header-measHeaderId_"72eeba27-1d72-49f9-8847-1433dc88ec9b"", lastmod: Timestamp 1295000|0, lastmodEpoch: ObjectId('56307d6cf424c605b37f5d99'), ns: "database_name.measurement_header", min: { measHeaderId: "72eeba27-1d72-49f9-8847-1433dc88ec9b" }, max: { measHeaderId: "730f20d9-e3bd-4076-8a57-92aed9be8574" }, shard: "fc1" }, o2: { _id: "database_name.measurement_header-measHeaderId_"72eeba27-1d72-49f9-8847-1433dc88ec9b"" } }, { op: "u", b: false, ns: "config.chunks", o: { _id: "database_name.measurement_header-measHeaderId_"732db8b5-70d9-4eb0-8f00-a5b1c58399df"", lastmod: Timestamp 1295000|1, lastmodEpoch: ObjectId('56307d6cf424c605b37f5d99'), ns: "database_name.measurement_header", min: { measHeaderId: "732db8b5-70d9-4eb0-8f00-a5b1c58399df" }, max: { measHeaderId: "734a6a77-e006-48e8-88e9-e9f422c2bf84" }, shard: "fc2" }, o2: { _id: "database_name.measurement_header-measHeaderId_"732db8b5-70d9-4eb0-8f00-a5b1c58399df"" } } ], preCondition: [ { ns: "config.chunks", q: { query: { ns: "database_name.measurement_header" }, orderby: { lastmod: -1 } }, res: { lastmod: Timestamp 1294000|1 } } ] }
2018-02-18T05:33:27.729+0000 W SHARDING [conn4420978] moveChunk commit outcome ongoing: { applyOps: [ { op: "u", b: false, ns: "config.chunks", o: { _id: "database_name.measurement_header-measHeaderId_"72eeba27-1d72-49f9-8847-1433dc88ec9b"", lastmod: Timestamp 1295000|0, lastmodEpoch: ObjectId('56307d6cf424c605b37f5d99'), ns: "database_name.measurement_header", min: { measHeaderId: "72eeba27-1d72-49f9-8847-1433dc88ec9b" }, max: { measHeaderId: "730f20d9-e3bd-4076-8a57-92aed9be8574" }, shard: "fc1" }, o2: { _id: "database_name.measurement_header-measHeaderId_"72eeba27-1d72-49f9-8847-1433dc88ec9b"" } }, { op: "u", b: false, ns: "config.chunks", o: { _id: "database_name.measurement_header-measHeaderId_"732db8b5-70d9-4eb0-8f00-a5b1c58399df"", lastmod: Timestamp 1295000|1, lastmodEpoch: ObjectId('56307d6cf424c605b37f5d99'), ns: "database_name.measurement_header", min: { measHeaderId: "732db8b5-70d9-4eb0-8f00-a5b1c58399df" }, max: { measHeaderId: "734a6a77-e006-48e8-88e9-e9f422c2bf84" }, shard: "fc2" }, o2: { _id: "database_name.measurement_header-measHeaderId_"732db8b5-70d9-4eb0-8f00-a5b1c58399df"" } } ], preCondition: [ { ns: "config.chunks", q: { query: { ns: "database_name.measurement_header" }, orderby: { lastmod: -1 } }, res: { lastmod: Timestamp 1294000|1 } } ] } for command :{ $err: "DBClientBase::findN: transport error: xx.xx.xx.xx:27047 ns: config.$cmd query: { applyOps: [ { op: "u", b: false, ns: "config.chunks", o: { _id: "fluke_...", code: 10276 }
2018-02-18T05:33:27.757+0000 I SHARDING [conn4421129] waiting till out of critical section
2018-02-18T05:33:27.792+0000 I SHARDING [conn4422432] waiting till out of critical section
2018-02-18T05:33:28.054+0000 I SHARDING [conn4421102] waiting till out of critical section
2018-02-18T05:33:28.336+0000 I SHARDING [conn4421107] waiting till out of critical section
2018-02-18T05:33:28.653+0000 I SHARDING [conn4421080] waiting till out of critical section
2018-02-18T05:33:28.733+0000 I SHARDING [conn4421134] waiting till out of critical section
2018-02-18T05:33:28.942+0000 I SHARDING [conn4421486] waiting till out of critical section
2018-02-18T05:33:29.118+0000 I SHARDING [conn4421094] waiting till out of critical section
 
 
 
 
2018-02-18T05:33:37.671+0000 I SHARDING [conn4421085] waiting till out of critical section
2018-02-18T05:33:37.677+0000 I SHARDING [conn4420862] waiting till out of critical section
2018-02-18T05:33:37.729+0000 I NETWORK  [conn4420978] SyncClusterConnection connecting to [xx.xx.xx.xx:27047]
2018-02-18T05:33:37.729+0000 I NETWORK  [conn4420978] SyncClusterConnection connecting to [xx.xx.xx.xx:27047]
2018-02-18T05:33:37.731+0000 I NETWORK  [conn4420978] SyncClusterConnection connecting to [xx.xx.xx.xx:27047]
2018-02-18T05:33:37.731+0000 I SHARDING [conn4420978] moveChunk commit confirmed
2018-02-18T05:33:37.731+0000 I SHARDING [conn4420978] about to log metadata event: { _id: "ec2-54-187-83-95.us-west-2.compute.amazonaws.com-2018-02-18T05:33:37-5a8910312f7cc8f8d9ce3e53", server: "ec2-54-187-83-95.us-west-2.compute.amazonaws.com", clientAddr: "10.0.1.44:33948", time: new Date(1518932017731), what: "moveChunk.commit", ns: "database_name.measurement_header", details: { min: { measHeaderId: "72eeba27-1d72-49f9-8847-1433dc88ec9b" }, max: { measHeaderId: "730f20d9-e3bd-4076-8a57-92aed9be8574" }, from: "fc2", to: "fc1", cloned: 1853, clonedBytes: 23067346, catchup: 0, steady: 0 } }
2018-02-18T05:33:37.731+0000 I COMMAND  [conn4422432] command admin.$cmd command: setShardVersion { setShardVersion: "database_name.measurement_header", configdb: "xx.xx.xx.xx:27047,xx.xx.xx.xx:27047,xx.xx.xx.xx:27047", shard: "fc2", shardHost: "fc2/xx.xx.xx.xx:27027,xx.xx.xx.xx:27027", version: Timestamp 1294000|1, versionEpoch: ObjectId('56307d6cf424c605b37f5d99') } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:411 locks:{} 29939ms
2018-02-18T05:33:37.731+0000 I COMMAND  [conn4421102] command admin.$cmd command: setShardVersion { setShardVersion: "database_name.measurement_header", configdb: "xx.xx.xx.xx:27047,xx.xx.xx.xx:27047,xx.xx.xx.xx:27047", shard: "fc2", shardHost: "fc2/xx.xx.xx.xx:27027,xx.xx.xx.xx:27027", version: Timestamp 1294000|1, versionEpoch: ObjectId('56307d6cf424c605b37f5d99') } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:411 locks:{} 29677ms
2018-02-18T05:33:37.732+0000 I COMMAND

sh.status() output from mongos:

mongos> sh.status()
--- Sharding Status ---
  sharding version: {
	"_id" : 1,
	"minCompatibleVersion" : 5,
	"currentVersion" : 6,
	"clusterId" : ObjectId("56307d0df424c605b37f5d7f")
}
  shards:
	{  "_id" : "shard1",  "host" : "shard1/xx.xx.xx.xx:27017,xx.xx.xx.xx:27017,xx.xx.xx.xx:27017" }
	{  "_id" : "shard2",  "host" : "shard2/xx.xx.xx.xx:27027,xx.xx.xx.xx:27027",  "draining" : true }
  balancer:
	Currently enabled:  yes
	Currently running:  yes
		Balancer lock taken at Sun Feb 18 2018 06:24:33 GMT+0000 (UTC) by ec2-34-208-161-222.us-west-2.compute.amazonaws.com:27017:1490617694:1804289383:Balancer:846930886
	Failed balancer rounds in last 5 attempts:  0
	Migration Results for the last 24 hours:
		1045 : Success
		29 : Failed with error '_recvChunkCommit failed!', from shard2 to shard1
		2 : Failed with error 'Failed to send migrate commit to configs because { $err: "SyncClusterConnection::findOne prepare failed:  xx.xx.xx.xx:27047 (xx.xx.xx.xx) failed:10276 DBClientBase::findN: transport error: xx.xx.xx.xx:27047 ns: adm...", code: 13104 }', from shard2 to shard1
		3 : Failed with error 'moveChunk failed to engage TO-shard in the data transfer: migrate already in progress', from shard3 to shard1
		2 : Failed with error '_recvChunkCommit failed!', from shard3 to shard2
		2 : Failed with error 'Failed to send migrate commit to configs because { $err: "SyncClusterConnection::findOne prepare failed:  xx.xx.xx.xx:27047 (xx.xx.xx.xx) failed:10276 DBClientBase::findN: transport error: xx.xx.xx.xx:27047 ns: adm...", code: 13104 }', from shard3 to shard2
		30 : Failed with error 'chunk too big to move', from shard2 to shard1
		1 : Failed with error '_recvChunkCommit failed!', from shard3 to shard1
		36 : Failed with error 'chunk too big to move', from shard3 to shard1
		39 : Failed with error 'chunk too big to move', from shard3 to shard2
  databases:
	{  "_id" : "admin",  "partitioned" : false,  "primary" : "config" }
	{  "_id" : "fluke_intel",  "partitioned" : true,  "primary" : "shard1" }
		fluke_intel.measurement_header
			shard key: { "measHeaderId" : 1 }
			chunks:
				shard1	1139
				shard2	256
			too many chunks to print, use verbose if you want to force print
	{  "_id" : "test",  "partitioned" : false,  "primary" : "shard1" }
	{  "_id" : "mystique",  "partitioned" : false,  "primary" : "shard1" }
 
mongos>
mongos>
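In case it is useful, the commands below show how we can inspect the drain progress and the chunks behind the "chunk too big to move" failures listed above. This is only an illustrative sketch based on the sh.status() output (we have not run the split yet):

// check drain progress; "remaining.chunks" should keep decreasing
use admin
db.runCommand( { removeShard: "shard2" } )

// list chunks on the draining shard that the balancer has flagged as jumbo
// ("chunk too big to move" failures usually end up marked this way)
use config
db.chunks.find( { ns: "fluke_intel.measurement_header", shard: "shard2", jumbo: true } )

// a jumbo chunk can be split manually so the balancer can move the pieces, e.g.
// sh.splitFind( "fluke_intel.measurement_header", { measHeaderId: "<value inside the chunk>" } )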

Please help us resume the chunk migration.



 Comments   
Comment by Kelsey Schubert [ 20/Feb/18 ]

Hi kaviyarasan.ramalingam@fluke.com,

Thanks for the report. It appears that a network problem contributed to this issue. Unfortunately, MongoDB 3.0 is very old at this point and support for it ends at the end of the month. I would therefore strongly recommend upgrading to a more recent version of MongoDB, which contains numerous improvements to the sharding system.

Please note that the SERVER project is for reporting bugs or feature suggestions for the MongoDB server. For MongoDB-related support discussion, please post on the mongodb-user group or on Stack Overflow with the mongodb tag. A question like this, which involves further discussion, is best posted on the mongodb-user group.

See also our Technical Support page for additional support resources.

Kind regards,
Kelsey

Comment by Kaviyarasan Ramalingam [ 18/Feb/18 ]

Adding the db.currentOp() output:

mongos> db.currentOp()
{
	"inprog" : [
		{
			"desc" : "conn2498363",
			"threadId" : "0x281df400",
			"connectionId" : 2498363,
			"opid" : "shard1:1382985123",
			"active" : true,
			"secs_running" : 0,
			"microsecs_running" : NumberLong(116),
			"op" : "query",
			"ns" : "mystique.report_request",
			"query" : {
				"$orderby" : {
					"requestedTime" : -1
				},
				"$maxTimeMS" : NumberLong(10000),
				"$query" : {
					"requestedTime" : {
						"$lte" : NumberLong("1518964196064")
					},
					"isProcessingError" : {
						"$exists" : false
					}
				}
			},
			"planSummary" : "COLLSCAN, COLLSCAN",
			"client_s" : "xx.xx.xx.xx:56684",
			"numYields" : 0,
			"locks" : {
				"Global" : "r",
				"MMAPV1Journal" : "r",
				"Database" : "r",
				"Collection" : "R"
			},
			"waitingForLock" : false,
			"lockStats" : {
				"Global" : {
					"acquireCount" : {
						"r" : NumberLong(2)
					}
				},
				"MMAPV1Journal" : {
					"acquireCount" : {
						"r" : NumberLong(1)
					}
				},
				"Database" : {
					"acquireCount" : {
						"r" : NumberLong(1)
					}
				},
				"Collection" : {
					"acquireCount" : {
						"R" : NumberLong(1)
					}
				}
			}
		},
		{
			"desc" : "conn2498342",
			"threadId" : "0x31057400",
			"connectionId" : 2498342,
			"opid" : "shard1:1382984270",
			"active" : true,
			"secs_running" : 0,
			"microsecs_running" : NumberLong(904391),
			"op" : "getmore",
			"ns" : "local.oplog.rs",
			"query" : {
				"ts" : {
					"$gte" : Timestamp(1518902492, 4)
				}
			},
			"client_s" : "xx.xx.xx.xx:43583",
			"numYields" : 0,
			"locks" : {
 
			},
			"waitingForLock" : false,
			"lockStats" : {
				"Global" : {
					"acquireCount" : {
						"r" : NumberLong(2)
					}
				},
				"MMAPV1Journal" : {
					"acquireCount" : {
						"r" : NumberLong(1)
					}
				},
				"Database" : {
					"acquireCount" : {
						"r" : NumberLong(1)
					}
				},
				"oplog" : {
					"acquireCount" : {
						"R" : NumberLong(1)
					}
				}
			}
		},
		{
			"desc" : "conn2497716",
			"threadId" : "0x7f1cbe0",
			"connectionId" : 2497716,
			"opid" : "shard1:1382984269",
			"active" : true,
			"secs_running" : 0,
			"microsecs_running" : NumberLong(904448),
			"op" : "getmore",
			"ns" : "local.oplog.rs",
			"query" : {
				"ts" : {
					"$gte" : Timestamp(1518898178, 9579)
				}
			},
			"client_s" : "xx.xx.xx.xx:53771",
			"numYields" : 0,
			"locks" : {
 
			},
			"waitingForLock" : false,
			"lockStats" : {
				"Global" : {
					"acquireCount" : {
						"r" : NumberLong(2)
					}
				},
				"MMAPV1Journal" : {
					"acquireCount" : {
						"r" : NumberLong(1)
					}
				},
				"Database" : {
					"acquireCount" : {
						"r" : NumberLong(1)
					}
				},
				"oplog" : {
					"acquireCount" : {
						"R" : NumberLong(1)
					}
				}
			}
		},
		{
			"desc" : "conn4124056",
			"threadId" : "0x49ce1a0",
			"connectionId" : 4124056,
			"opid" : "shard2:526317292",
			"active" : true,
			"secs_running" : 4,
			"microsecs_running" : NumberLong(4581783),
			"op" : "getmore",
			"ns" : "local.oplog.rs",
			"query" : {
				"ts" : {
					"$gte" : Timestamp(1514448744, 1)
				}
			},
			"client_s" : "xx.xx.xx.xx:57810",
			"numYields" : 0,
			"locks" : {
 
			},
			"waitingForLock" : false,
			"lockStats" : {
				"Global" : {
					"acquireCount" : {
						"r" : NumberLong(10)
					}
				},
				"MMAPV1Journal" : {
					"acquireCount" : {
						"r" : NumberLong(5)
					}
				},
				"Database" : {
					"acquireCount" : {
						"r" : NumberLong(5)
					}
				},
				"oplog" : {
					"acquireCount" : {
						"R" : NumberLong(5)
					}
				}
			}
		}
	]
}
mongos>
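If it helps to narrow this output down, the inprog array can be filtered from the shell, for example by running time or by namespace (an illustrative sketch, not part of the original output):

// show only operations that have been running longer than 10 seconds
db.currentOp().inprog.filter( function (op) { return op.secs_running > 10; } )

// or only operations touching the collection being drained
db.currentOp().inprog.filter( function (op) { return op.ns === "fluke_intel.measurement_header"; } )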
