[SERVER-12178] cleanupOrphan can fail if shard is moving chunks Created: 20/Dec/13  Updated: 06/Feb/14  Resolved: 07/Jan/14

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 2.5.5

Type: Bug Priority: Major - P3
Reporter: Kamran K. Assignee: Randolph Tan
Resolution: Done Votes: 0
Labels: 26qa
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File transport_error_with_movechunk.js    
Issue Links:
Related
is related to SERVER-11277 cleanupOrphaned does nothing on empty... Closed
Operating System: ALL
Participants:

 Description   

Note: commit message is wrong. Forgot to change it before pushing.

The fix for SERVER-11277 (99dff054c8b8) seems to cause moveChunk commands to fail with a transport error under certain circumstances. The attached JS file reproduces the problem.

 m30999| 2013-12-20T14:12:56.315-0500 [WriteBackListener-localhost:30000] DBClientCursor::init call() failed
 m30000| 2013-12-20T14:12:56.315-0500 [conn1] end connection 127.0.0.1:50886 (4 connections now open)
 m30000| 2013-12-20T14:12:56.315-0500 [conn3] end connection 127.0.0.1:50899 (4 connections now open)
 m30000| 2013-12-20T14:12:56.315-0500 [conn5] end connection 127.0.0.1:50910 (3 connections now open)
 m30001| 2013-12-20T14:12:56.315-0500 [conn5] end connection 127.0.0.1:50908 (4 connections now open)
 m30999| 2013-12-20T14:12:56.315-0500 [WriteBackListener-localhost:30000] Detected bad connection created at 1387566773596630 microSec, clearing pool for localhost:30000 of 0 connections
 m30999| 2013-12-20T14:12:56.315-0500 [conn2] DBClientCursor::init call() failed
 m30999| 2013-12-20T14:12:56.315-0500 [WriteBackListener-localhost:30000] WriteBackListener exception : DBClientBase::findN: transport error: localhost:30000 ns: admin.$cmd query: { writebacklisten: ObjectId('52b496b5ebc1242f136c7597') }
 m30999| 2013-12-20T14:12:56.315-0500 [conn2] Detected bad connection created at 1387566773604578 microSec, clearing pool for localhost:30000 of 0 connections
sh81742| {
sh81742| 	"code" : 10276,
sh81742| 	"ok" : 0,
sh81742| 	"errmsg" : "exception: DBClientBase::findN: transport error: localhost:30000 ns: admin.$cmd query: { moveChunk: \"foo.bar\", from: \"localhost:30000\", to: \"localhost:30001\", fromShard: \"shard0000\", toShard: \"shard0001\", min: { _id: 0.0 }, max: { _id: 20.0 }, maxChunkSizeBytes: 52428800, shardId: \"foo.bar-_id_0.0\", configdb: \"localhost:29000\", secondaryThrottle: false, waitForDelete: true, maxTimeMS: 0 }"
sh81742| }
sh81742| assert failed
 m30001| 2013-12-20T14:12:56.317-0500 [migrateThread] DBClientCursor::init call() failed


Versions tested (chronological order):

6902c6b643f64 (not reproducible)
99dff054c8b8 (when the behavior change was introduced)
77384d0a36a2 (recent commit from 12-20-2013)



 Comments   
Comment by Randolph Tan [ 07/Jan/14 ]

The documentation should note that the command can return the same start range when the server detects that the shard version is still in transition, so that the client can retry. The server also prints a warning:

"orphaned cleanup needs to be retried, collection metadata at shard version <version> changed during reload"

Comment by Randolph Tan [ 07/Jan/14 ]

Note: commit message is wrong. Forgot to change it before pushing.

The QA script was failing because the migrate thread bumps the internal major version, and the chunk differ got confused when it tried to fetch new chunks based on that version, which does not exist on the config server yet.

The fix was not to fail the command, but to return the same start range so the user can retry. The command should succeed once the shard version reaches a steady state.
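
The retry behavior described above could be handled on the client roughly as follows. This is only a sketch, not the server's or the shell's implementation: `cleanupAllOrphans`, `runCommand`, and `maxRetriesPerRange` are hypothetical names, the `startingFromKey` and `stoppedAtKey` fields come from the cleanupOrphaned command documentation rather than from this ticket, and it assumes that "the same start range" shows up as a `stoppedAtKey` equal to the key that was passed in.

```javascript
// Sketch: walk a shard's key space with cleanupOrphaned, retrying a range
// when the command hands back the same start key (assumed to mean the shard
// version was still in transition).
// `runCommand` is injected so the loop can be exercised without a server;
// in the mongo shell it could be (cmd) => db.adminCommand(cmd).
function cleanupAllOrphans(runCommand, ns, maxRetriesPerRange) {
  let nextKey = {};      // an empty document starts from the lowest key
  let retries = 0;
  let rangesCleaned = 0;

  while (nextKey !== undefined && nextKey !== null) {
    const res = runCommand({ cleanupOrphaned: ns, startingFromKey: nextKey });
    if (res.ok !== 1) {
      throw new Error("cleanupOrphaned failed: " + JSON.stringify(res));
    }

    const stoppedAt = res.stoppedAtKey;
    // Naive equality check; it is sensitive to field order, which is
    // acceptable for a sketch but not for compound shard keys in general.
    const sameRange =
      stoppedAt != null &&
      JSON.stringify(stoppedAt) === JSON.stringify(nextKey);

    if (sameRange) {
      retries += 1;
      if (retries > maxRetriesPerRange) {
        throw new Error("giving up after " + maxRetriesPerRange + " retries");
      }
      continue; // retry the same range
    }

    retries = 0;
    rangesCleaned += 1;
    nextKey = stoppedAt; // a missing stoppedAtKey means the last range is done
  }
  return rangesCleaned;
}
```

In a real script a short sleep between retries would probably be appropriate, since the range only becomes cleanable once the in-flight migration has committed.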

Comment by Githook User [ 07/Jan/14 ]

Author: Randolph Tan (renctan) <randolph@10gen.com>

Message: SERVER-12178 Make sure to take distributed ns lock when performing orphan cleanup
Branch: master
https://github.com/mongodb/mongo/commit/735a759908a3a8e792909eb23160ec29eba83e9c
