[SERVER-16475] can't move chunk Created: 09/Dec/14  Updated: 24/Jan/15  Resolved: 17/Dec/14

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.6.4
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Kay Agahd Assignee: Randolph Tan
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-15849 Secondaries should not forward replic... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   

We are running 7 shards, each consisting of 3 replica set members. We do pre-splitting. One of the shards no longer accepts new chunks, even when the chunk is empty.
The logs say the migration is waiting for replication, but all members are perfectly in sync. We've read that this can be caused by the local.slaves collection in versions 2.2 and 2.4, but we are already running v2.6.4. We dropped the local.slaves collection nevertheless, but it did not help. We also stepped down the primary, with no success. Finally, we stopped one replica set member, removed its data, brought it back up, waited for it to resync, and elected it as primary, but the chunk move still never succeeded.

What can we do to get this shard to accept new chunks again?
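For reference, the checks and workarounds above were run from the mongo shell roughly as follows (just a sketch; they were run against the primary of the destination shard):

// per-member replication lag; every member reported 0 seconds behind
rs.printSlaveReplicationInfo()

// drop the legacy local.slaves collection (the documented 2.2/2.4 workaround)
db.getSiblingDB("local").slaves.drop()

// step down the primary to force an election
rs.stepDown(60)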
Here are the logs of the primary of the destination shard, grepped for "migrateThread":

2014-12-09T16:53:02.345+0100 [migrateThread] warning: migrate commit waiting for 2 slaves for 'offerStore.offer' { _id: 3739440290 } -> { _id: 3739940290 } waiting for: 54870f52:a9
2014-12-09T16:53:03.345+0100 [migrateThread] Waiting for replication to catch up before entering critical section
2014-12-09T16:53:04.345+0100 [migrateThread] Waiting for replication to catch up before entering critical section
2014-12-09T16:53:05.345+0100 [migrateThread] Waiting for replication to catch up before entering critical section
2014-12-09T16:53:06.345+0100 [migrateThread] Waiting for replication to catch up before entering critical section

This is the replication status of the replSet:

offerStoreDE2:SECONDARY> rs.status()
{
	"set" : "offerStoreDE2",
	"date" : ISODate("2014-12-09T15:58:26Z"),
	"myState" : 2,
	"syncingTo" : "s131:27017",
	"members" : [
		{
			"_id" : 3,
			"name" : "s136:27017",
			"health" : 1,
			"state" : 2,
			"stateStr" : "SECONDARY",
			"uptime" : 6458100,
			"optime" : Timestamp(1418140706, 503),
			"optimeDate" : ISODate("2014-12-09T15:58:26Z"),
			"self" : true
		},
		{
			"_id" : 4,
			"name" : "s131:27017",
			"health" : 1,
			"state" : 2,
			"stateStr" : "SECONDARY",
			"uptime" : 1919333,
			"optime" : Timestamp(1418140706, 437),
			"optimeDate" : ISODate("2014-12-09T15:58:26Z"),
			"lastHeartbeat" : ISODate("2014-12-09T15:58:26Z"),
			"lastHeartbeatRecv" : ISODate("2014-12-09T15:58:25Z"),
			"pingMs" : 0,
			"syncingTo" : "s568:27017"
		},
		{
			"_id" : 6,
			"name" : "s568:27017",
			"health" : 1,
			"state" : 1,
			"stateStr" : "PRIMARY",
			"uptime" : 8893,
			"optime" : Timestamp(1418140706, 51),
			"optimeDate" : ISODate("2014-12-09T15:58:26Z"),
			"lastHeartbeat" : ISODate("2014-12-09T15:58:26Z"),
			"lastHeartbeatRecv" : ISODate("2014-12-09T15:58:26Z"),
			"pingMs" : 0,
			"electionTime" : Timestamp(1418137258, 1),
			"electionDate" : ISODate("2014-12-09T15:00:58Z")
		}
	],
	"ok" : 1
}



 Comments   
Comment by Randolph Tan [ 17/Dec/14 ]

You're welcome. I am closing this ticket as a duplicate.

Comment by Kay Agahd [ 17/Dec/14 ]

Thanks so much, renctan! Upgrading the affected shard from 2.6.4 to 2.6.6 fixed the problem. Moving chunks is possible again.

Comment by Randolph Tan [ 16/Dec/14 ]

After looking through the logs, I believe you are running into SERVER-15849 (fixed in 2.6.6). The side effect of this bug is that some secondaries stop being able to report their oplog progress upstream, which causes certain write concerns to never be satisfied even though they should be satisfiable.
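One quick way to confirm that symptom (just a sketch, assuming it is run against the primary of the affected shard; the test collection name is arbitrary): a write that requires acknowledgement from all three members should time out even though rs.status() shows every member at the same optime.

// on an affected 2.6.4 primary this times out despite all members being in sync
db.getSiblingDB("test").wcProbe.insert(
    { probe: 1 },
    { writeConcern: { w: 3, wtimeout: 10000 } }
)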

Comment by Kay Agahd [ 16/Dec/14 ]

Is there anybody who could help?

Comment by Kay Agahd [ 11/Dec/14 ]

renctan, please tell me if you need any supplementary info to sort out the issue. Since the shard is not accepting new chunks, our cluster will become unbalanced soon. Thanks for your help.

Comment by Kay Agahd [ 10/Dec/14 ]

Randolph, here is the error message of the moveChunk command (which, by the way, took 10 hours); a sketch of the command I ran follows the output:

{
        "cause" : {
                "cause" : {
                        "active" : true,
                        "ns" : "offerStore.offer",
                        "from" : "offerStoreDE1/s479:27017,s480:27017,s483:27017",
                        "min" : {
                                "_id" : NumberLong("3739440290")
                        },
                        "max" : {
                                "_id" : NumberLong("3739940290")
                        },
                        "shardKeyPattern" : {
                                "_id" : 1
                        },
                        "state" : "fail",
                        "errmsg" : "",
                        "counts" : {
                                "cloned" : NumberLong(0),
                                "clonedBytes" : NumberLong(0),
                                "catchup" : NumberLong(0),
                                "steady" : NumberLong(0)
                        },
                        "ok" : 0
                },
                "ok" : 0,
                "errmsg" : "_recvChunkCommit failed!"
        },
        "ok" : 0,
        "errmsg" : "move failed"
}
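For context, a sketch of the command as issued through a mongos (assuming the destination shard is named offerStoreDE2, matching its replica set name):

// move the chunk starting at _id 3739440290 to the destination shard
db.adminCommand({
    moveChunk: "offerStore.offer",
    find: { _id: NumberLong("3739440290") },
    to: "offerStoreDE2"
})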

Comment by Kay Agahd [ 09/Dec/14 ]

Randolph, yes, the cluster is using auth.
I've uploaded today's and yesterday's logs from the 3 replica set members s568, s131 and s136.
Today's logs:
scp -P 722 [server].log.tgz SERVER-16475@www.mongodb.com:
Yesterday's logs:
scp -P 722 [server].log.1.tgz SERVER-16475@www.mongodb.com:

The moveChunk command took so long that I almost always killed it. If I remember correctly, it failed with a socket exception or a cursor-not-found exception. I'll run it again so I can tell you more precisely.

Comment by Randolph Tan [ 09/Dec/14 ]

Follow-up question: are you using auth? Is it possible to upload the logs for the primary and secondaries?

Thanks!

Comment by Randolph Tan [ 09/Dec/14 ]

Hi,

If you run the moveChunk command manually, what error message do you get?

Thanks!
