Core Server / SERVER-33386

Chunk Migration stopped

    • Type: Question
    • Resolution: Done
    • Priority: Major - P3
    • Fix Version/s: None
    • Affects Version/s: 3.0.12
    • Component/s: Sharding
    • Labels: None

      Team,

      We are facing issues while removing sharding from our production instance.

      We have three shards in production, and we are working on removing them.

      We followed the procedure documented at https://docs.mongodb.com/v3.0/tutorial/remove-shards-from-cluster/index.html; a sketch of the commands we ran is below.
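
      For reference, a minimal sketch of what we ran from the mongo shell connected to a mongos, per the tutorial above ("shard2" here is the shard shown as draining in the sh.status output further down):

      // First call starts draining the shard; the balancer then migrates its
      // chunks to the remaining shards.
      db.adminCommand({ removeShard: "shard2" })
      // Re-running the same command reports progress: "remaining.chunks" should
      // shrink, and the final call returns state: "completed".
      db.adminCommand({ removeShard: "shard2" })

      Once draining started, the log below shows migrations stalling with socket timeouts against the config servers (port 27047) while trying to commit a moveChunk: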

      2018-02-18T05:33:27.635+0000 I SHARDING [conn4421103] waiting till out of critical section
      2018-02-18T05:33:27.641+0000 I SHARDING [conn4421196] waiting till out of critical section
      2018-02-18T05:33:27.671+0000 I SHARDING [conn4421085] waiting till out of critical section
      2018-02-18T05:33:27.677+0000 I SHARDING [conn4420862] waiting till out of critical section
      2018-02-18T05:33:27.728+0000 I NETWORK  [conn4420978] Socket recv() timeout  xx.xx.xx.xx:27047
      2018-02-18T05:33:27.728+0000 I NETWORK  [conn4420978] SocketException: remote: xx.xx.xx.xx:27047 error: 9001 socket exception [RECV_TIMEOUT] server [xx.xx.xx.xx:27047]
      2018-02-18T05:33:27.728+0000 I NETWORK  [conn4420978] DBClientCursor::init call() failed
      2018-02-18T05:33:27.728+0000 I NETWORK  [conn4420978] scoped connection to xx.xx.xx.xx:27047,xx.xx.xx.xx:27047,xx.xx.xx.xx:27047 not being returned to the pool
      2018-02-18T05:33:27.729+0000 W SHARDING [conn4420978] 10276 DBClientBase::findN: transport error: xx.xx.xx.xx:27047 ns: config.$cmd query: { applyOps: [ { op: "u", b: false, ns: "config.chunks", o: { _id: "database_name.measurement_header-measHeaderId_"72eeba27-1d72-49f9-8847-1433dc88ec9b"", lastmod: Timestamp 1295000|0, lastmodEpoch: ObjectId('56307d6cf424c605b37f5d99'), ns: "database_name.measurement_header", min: { measHeaderId: "72eeba27-1d72-49f9-8847-1433dc88ec9b" }, max: { measHeaderId: "730f20d9-e3bd-4076-8a57-92aed9be8574" }, shard: "fc1" }, o2: { _id: "database_name.measurement_header-measHeaderId_"72eeba27-1d72-49f9-8847-1433dc88ec9b"" } }, { op: "u", b: false, ns: "config.chunks", o: { _id: "database_name.measurement_header-measHeaderId_"732db8b5-70d9-4eb0-8f00-a5b1c58399df"", lastmod: Timestamp 1295000|1, lastmodEpoch: ObjectId('56307d6cf424c605b37f5d99'), ns: "database_name.measurement_header", min: { measHeaderId: "732db8b5-70d9-4eb0-8f00-a5b1c58399df" }, max: { measHeaderId: "734a6a77-e006-48e8-88e9-e9f422c2bf84" }, shard: "fc2" }, o2: { _id: "database_name.measurement_header-measHeaderId_"732db8b5-70d9-4eb0-8f00-a5b1c58399df"" } } ], preCondition: [ { ns: "config.chunks", q: { query: { ns: "database_name.measurement_header" }, orderby: { lastmod: -1 } }, res: { lastmod: Timestamp 1294000|1 } } ] }
      2018-02-18T05:33:27.729+0000 W SHARDING [conn4420978] moveChunk commit outcome ongoing: { applyOps: [ { op: "u", b: false, ns: "config.chunks", o: { _id: "database_name.measurement_header-measHeaderId_"72eeba27-1d72-49f9-8847-1433dc88ec9b"", lastmod: Timestamp 1295000|0, lastmodEpoch: ObjectId('56307d6cf424c605b37f5d99'), ns: "database_name.measurement_header", min: { measHeaderId: "72eeba27-1d72-49f9-8847-1433dc88ec9b" }, max: { measHeaderId: "730f20d9-e3bd-4076-8a57-92aed9be8574" }, shard: "fc1" }, o2: { _id: "database_name.measurement_header-measHeaderId_"72eeba27-1d72-49f9-8847-1433dc88ec9b"" } }, { op: "u", b: false, ns: "config.chunks", o: { _id: "database_name.measurement_header-measHeaderId_"732db8b5-70d9-4eb0-8f00-a5b1c58399df"", lastmod: Timestamp 1295000|1, lastmodEpoch: ObjectId('56307d6cf424c605b37f5d99'), ns: "database_name.measurement_header", min: { measHeaderId: "732db8b5-70d9-4eb0-8f00-a5b1c58399df" }, max: { measHeaderId: "734a6a77-e006-48e8-88e9-e9f422c2bf84" }, shard: "fc2" }, o2: { _id: "database_name.measurement_header-measHeaderId_"732db8b5-70d9-4eb0-8f00-a5b1c58399df"" } } ], preCondition: [ { ns: "config.chunks", q: { query: { ns: "database_name.measurement_header" }, orderby: { lastmod: -1 } }, res: { lastmod: Timestamp 1294000|1 } } ] } for command :{ $err: "DBClientBase::findN: transport error: xx.xx.xx.xx:27047 ns: config.$cmd query: { applyOps: [ { op: "u", b: false, ns: "config.chunks", o: { _id: "fluke_...", code: 10276 }
      2018-02-18T05:33:27.757+0000 I SHARDING [conn4421129] waiting till out of critical section
      2018-02-18T05:33:27.792+0000 I SHARDING [conn4422432] waiting till out of critical section
      2018-02-18T05:33:28.054+0000 I SHARDING [conn4421102] waiting till out of critical section
      2018-02-18T05:33:28.336+0000 I SHARDING [conn4421107] waiting till out of critical section
      2018-02-18T05:33:28.653+0000 I SHARDING [conn4421080] waiting till out of critical section
      2018-02-18T05:33:28.733+0000 I SHARDING [conn4421134] waiting till out of critical section
      2018-02-18T05:33:28.942+0000 I SHARDING [conn4421486] waiting till out of critical section
      2018-02-18T05:33:29.118+0000 I SHARDING [conn4421094] waiting till out of critical section
      
      
      
      
      2018-02-18T05:33:37.671+0000 I SHARDING [conn4421085] waiting till out of critical section
      2018-02-18T05:33:37.677+0000 I SHARDING [conn4420862] waiting till out of critical section
      2018-02-18T05:33:37.729+0000 I NETWORK  [conn4420978] SyncClusterConnection connecting to [xx.xx.xx.xx:27047]
      2018-02-18T05:33:37.729+0000 I NETWORK  [conn4420978] SyncClusterConnection connecting to [xx.xx.xx.xx:27047]
      2018-02-18T05:33:37.731+0000 I NETWORK  [conn4420978] SyncClusterConnection connecting to [xx.xx.xx.xx:27047]
      2018-02-18T05:33:37.731+0000 I SHARDING [conn4420978] moveChunk commit confirmed
      2018-02-18T05:33:37.731+0000 I SHARDING [conn4420978] about to log metadata event: { _id: "ec2-54-187-83-95.us-west-2.compute.amazonaws.com-2018-02-18T05:33:37-5a8910312f7cc8f8d9ce3e53", server: "ec2-54-187-83-95.us-west-2.compute.amazonaws.com", clientAddr: "10.0.1.44:33948", time: new Date(1518932017731), what: "moveChunk.commit", ns: "database_name.measurement_header", details: { min: { measHeaderId: "72eeba27-1d72-49f9-8847-1433dc88ec9b" }, max: { measHeaderId: "730f20d9-e3bd-4076-8a57-92aed9be8574" }, from: "fc2", to: "fc1", cloned: 1853, clonedBytes: 23067346, catchup: 0, steady: 0 } }
      2018-02-18T05:33:37.731+0000 I COMMAND  [conn4422432] command admin.$cmd command: setShardVersion { setShardVersion: "database_name.measurement_header", configdb: "xx.xx.xx.xx:27047,xx.xx.xx.xx:27047,xx.xx.xx.xx:27047", shard: "fc2", shardHost: "fc2/xx.xx.xx.xx:27027,xx.xx.xx.xx:27027", version: Timestamp 1294000|1, versionEpoch: ObjectId('56307d6cf424c605b37f5d99') } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:411 locks:{} 29939ms
      2018-02-18T05:33:37.731+0000 I COMMAND  [conn4421102] command admin.$cmd command: setShardVersion { setShardVersion: "database_name.measurement_header", configdb: "xx.xx.xx.xx:27047,xx.xx.xx.xx:27047,xx.xx.xx.xx:27047", shard: "fc2", shardHost: "fc2/xx.xx.xx.xx:27027,xx.xx.xx.xx:27027", version: Timestamp 1294000|1, versionEpoch: ObjectId('56307d6cf424c605b37f5d99') } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:411 locks:{} 29677ms
      2018-02-18T05:33:37.732+0000 I COMMAND
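
      To see which distributed locks are still held at a moment like this, a query like the following can be run from a mongos (a sketch against the standard 3.0 config.locks collection; state 2 means the lock is held, state 1 means it is being acquired):

      // The config database holds the cluster's sharding metadata.
      var cfg = db.getSiblingDB("config")
      // Any lock currently held or contended: the balancer lock has _id "balancer",
      // and moveChunk takes a lock whose _id is the collection namespace.
      cfg.locks.find({ state: { $gt: 0 } }).pretty()
      // The balancer lock specifically (matches the "Balancer lock taken at ..."
      // line in the sh.status output below).
      cfg.locks.find({ _id: "balancer" }).pretty()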
      
      

      sh.status() command output from mongos:

      mongos> sh.status()
      --- Sharding Status ---
        sharding version: {
      	"_id" : 1,
      	"minCompatibleVersion" : 5,
      	"currentVersion" : 6,
      	"clusterId" : ObjectId("56307d0df424c605b37f5d7f")
      }
        shards:
      	{  "_id" : "shard1",  "host" : "shard1/xx.xx.xx.xx:27017,xx.xx.xx.xx:27017,xx.xx.xx.xx:27017" }
      	{  "_id" : "shard2",  "host" : "shard2/xx.xx.xx.xx:27027,xx.xx.xx.xx:27027",  "draining" : true }
        balancer:
      	Currently enabled:  yes
      	Currently running:  yes
      		Balancer lock taken at Sun Feb 18 2018 06:24:33 GMT+0000 (UTC) by ec2-34-208-161-222.us-west-2.compute.amazonaws.com:27017:1490617694:1804289383:Balancer:846930886
      	Failed balancer rounds in last 5 attempts:  0
      	Migration Results for the last 24 hours:
      		1045 : Success
      		29 : Failed with error '_recvChunkCommit failed!', from shard2 to shard1
      		2 : Failed with error 'Failed to send migrate commit to configs because { $err: "SyncClusterConnection::findOne prepare failed:  xx.xx.xx.xx:27047 (xx.xx.xx.xx) failed:10276 DBClientBase::findN: transport error: xx.xx.xx.xx:27047 ns: adm...", code: 13104 }', from shard2 to shard1
      		3 : Failed with error 'moveChunk failed to engage TO-shard in the data transfer: migrate already in progress', from shard3 to shard1
      		2 : Failed with error '_recvChunkCommit failed!', from shard3 to shard2
      		2 : Failed with error 'Failed to send migrate commit to configs because { $err: "SyncClusterConnection::findOne prepare failed:  xx.xx.xx.xx:27047 (xx.xx.xx.xx) failed:10276 DBClientBase::findN: transport error: xx.xx.xx.xx:27047 ns: adm...", code: 13104 }', from shard3 to shard2
      		30 : Failed with error 'chunk too big to move', from shard2 to shard1
      		1 : Failed with error '_recvChunkCommit failed!', from shard3 to shard1
      		36 : Failed with error 'chunk too big to move', from shard3 to shard1
      		39 : Failed with error 'chunk too big to move', from shard3 to shard2
        databases:
      	{  "_id" : "admin",  "partitioned" : false,  "primary" : "config" }
      	{  "_id" : "fluke_intel",  "partitioned" : true,  "primary" : "shard1" }
      		fluke_intel.measurement_header
      			shard key: { "measHeaderId" : 1 }
      			chunks:
      				shard1	1139
      				shard2	256
      			too many chunks to print, use verbose if you want to force print
      	{  "_id" : "test",  "partitioned" : false,  "primary" : "shard1" }
      	{  "_id" : "mystique",  "partitioned" : false,  "primary" : "shard1" }
      
      mongos>
      mongos>
      

      Please help us resume the chunk migration.
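
      If it helps, the failed migration attempts summarized above can be cross-checked against the migration history kept by the config servers (a sketch; config.changelog is the collection the "moveChunk.commit" event in the log above is written to, and the namespace below is the one shown in sh.status):

      var cfg = db.getSiblingDB("config")
      // Most recent migration events (moveChunk.start / .commit / .from / .to),
      // including any error details recorded by the balancer.
      cfg.changelog.find({ what: /^moveChunk/ }).sort({ time: -1 }).limit(10).pretty()
      // Chunks flagged as jumbo, if any: one possible cause of the
      // "chunk too big to move" failures above.
      cfg.chunks.find({ ns: "fluke_intel.measurement_header", jumbo: true }).count()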

            Assignee:
            kelsey.schubert@mongodb.com Kelsey Schubert
            Reporter:
            kaviyarasan.ramalingam@fluke.com Kaviyarasan Ramalingam
            Votes:
            0
            Watchers:
            4

              Created:
              Updated:
              Resolved: