[SERVER-15199] moveChunk fails to complete after replSetReconfig Created: 10/Sep/14  Updated: 10/Dec/14  Resolved: 15/Sep/14

Status: Closed
Project: Core Server
Component/s: Replication, Sharding
Affects Version/s: 2.7.6
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Jonathan Abrahams Assignee: Randolph Tan
Resolution: Done Votes: 0
Labels: 28qa
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File move-ren2.log     File secondaryThrottle_moveChunk.js    
Issue Links:
Related
is related to SERVER-15257 awaitReplicationOfLastOpForClient ret... Closed
Operating System: ALL
Steps To Reproduce:

The test is attached along with the log output.

Basic steps are as follows:

  1. Configure a 3-shard cluster with 3-node replica sets
  2. Create docs on a sharded collection
  3. replSetReconfig on the toShard
  4. awaitReplication
  5. connPoolSync on mongos
  6. connPoolSync on primary toShard
  7. setShardVersion on primary toShard
  8. moveChunk

Note: the value of secondaryThrottle (true or false) does not affect the behavior.
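
The steps above can be sketched as an ordered list of admin command documents. This is only an illustrative outline, not the attached secondaryThrottle_moveChunk.js. The helper buildReproSteps, the namespace test.foo, the shard key _id, and the replica set config passed in are all hypothetical placeholders; test-rs0 is taken from the shard name in the log output and is assumed to be the toShard. The setShardVersion arguments are left out because the ticket does not state them, and the cluster setup, document creation, and awaitReplication steps are shown as descriptions rather than commands.

```javascript
// Hypothetical helper: lists the reproduction steps as plain objects.
// ns, toShard, and newConfig are placeholders, not values from the attached test.
function buildReproSteps(ns, toShard, newConfig, secondaryThrottle) {
  return [
    { target: "cluster", action: "configure 3 shards, each a 3-node replica set" },
    { target: "mongos", action: "insert documents into " + ns },
    { target: "toShard primary", command: { replSetReconfig: newConfig } },
    { target: "toShard", action: "awaitReplication" },
    { target: "mongos", command: { connPoolSync: 1 } },
    { target: "toShard primary", command: { connPoolSync: 1 } },
    // setShardVersion arguments are not given in the ticket, so none are shown.
    { target: "toShard primary", action: "setShardVersion" },
    {
      target: "mongos",
      command: {
        moveChunk: ns,
        find: { _id: 0 }, // assumed shard key, for illustration only
        to: toShard,
        _secondaryThrottle: secondaryThrottle,
      },
    },
  ];
}

// Placeholder config document; a real reconfig needs the full member list.
const placeholderConfig = { _id: "test-rs0", version: 2, members: [] };
const steps = buildReproSteps("test.foo", "test-rs0", placeholderConfig, false);

steps.forEach((s, i) => {
  console.log((i + 1) + ". " + s.target + ": " + (s.action || JSON.stringify(s.command)));
});
```

Keeping the steps as data makes it easy to run the same sequence with secondaryThrottle set to true and to false, which is the comparison the note above refers to.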

Participants:

 Description   

The moveChunk command fails to complete after a replSetReconfig is issued on the toShard.



 Comments   
Comment by Randolph Tan [ 15/Sep/14 ]

Just talked with Jonathan offline. It looks like the issue being reported is that moveChunk still waits for the slave-delayed secondaries even when secondaryThrottle is false. This is expected behavior, since the shard waits for a majority of nodes to catch up at the end of the migration. The purpose of secondaryThrottle is to deliberately slow the migration by waiting for replication after each individual write, instead of waiting after a bulk of the writes has been executed. I also edited the comment on the original ticket (SERVER-14041) to make this clearer.
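
As a rough illustration of the distinction described in this comment, the toy model below counts how often a migration would pause for replication. It is not server code. The function names are hypothetical, and the model assumes one replication wait per write when secondaryThrottle is true, plus a single final majority wait in both cases.

```javascript
// Toy model only: counts replication waits during a migration.

// Number of nodes that form a majority of a replica set.
function majority(numNodes) {
  return Math.floor(numNodes / 2) + 1;
}

// With secondaryThrottle, each write waits for replication individually.
// Either way, the migration ends with one wait for a majority to catch up.
function replicationWaits(numWrites, secondaryThrottle) {
  const perWriteWaits = secondaryThrottle ? numWrites : 0;
  const finalMajorityWait = 1;
  return perWriteWaits + finalMajorityWait;
}

console.log(majority(3));                  // 2
console.log(replicationWaits(100, true));  // 101
console.log(replicationWaits(100, false)); // 1
```

Under this model the final wait is never zero. That is why moveChunk can still block when the secondaries needed for a majority lag behind, even with secondaryThrottle set to false.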

Comment by Jonathan Abrahams [ 15/Sep/14 ]

renctan How long did your test run? When I ran it under 2.7.6, the moveChunk command did not complete (within 5 minutes). What OS are you running it on? I am using OS X.

Comment by Randolph Tan [ 15/Sep/14 ]

jonathan.abrahams I am having a hard time finding the error in the attached log. I was, however, able to make the attached test fail on 2.7.6. I want to confirm whether this is the same issue you reported:

Error moving chunk:  {
	"cause" : {
		"ok" : 0,
		"errmsg" : "moveChunk could not contact to: shard test-rs0 to monitor transfer :: caused by :: 10276 DBClientBase::findN: transport error: ren-desktop:31100 ns: admin.$cmd query: { _recvChunkStatus: 1 }"
	},
	"ok" : 0,
	"errmsg" : "move failed"
}
2014-09-15T13:17:36.283-0400 I QUERY    {
	"cause" : {
		"ok" : 0,
		"errmsg" : "moveChunk could not contact to: shard test-rs0 to monitor transfer :: caused by :: 10276 DBClientBase::findN: transport error: ren-desktop:31100 ns: admin.$cmd query: { _recvChunkStatus: 1 }"
	},
	"ok" : 0,
	"errmsg" : "move failed"
} at /home/ren/mongo-copy/secondaryThrottle_moveChunk.js:242
failed to load: /home/ren/mongo-copy/secondaryThrottle_moveChunk.js

Thanks!
