Type: Bug
Resolution: Done
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 2.2.6
Component/s: MapReduce, Sharding
Labels: None
Operating System: Linux
Confidence Status: None
Work Order: 3
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

We removed a 2-node replica set shard from a sharded cluster yesterday. The shard fully drained, and we then ran the "final" removeShard command, which returned the following:

mongos> db.runCommand({removeShard : "rsgewrset40"})
{
"errmsg" : "exception: can't find shard for: rsgewrset40",
"code" : 13129,
"ok" : 0
}
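
One way to see whether the cluster metadata still mentions the removed shard is to count matching documents in the config database. The following is only a sketch under assumptions: the helper name `findShardReferences` is hypothetical, and it assumes the usual config collections (`shards` keyed by `_id`, `chunks` with a `shard` field, `databases` with a `primary` field) and a database handle obtained from a shell connected to a mongos.

```javascript
// Sketch: count config-database documents that still mention a shard name.
// Assumes `configDb` is the "config" database handle from a shell connected
// to mongos, e.g. db.getSiblingDB("config"). Helper name is hypothetical.
function findShardReferences(configDb, shardName) {
  return {
    // Entry in the shard registry itself (config.shards, keyed by _id).
    shards: configDb.getCollection("shards").find({ _id: shardName }).count(),
    // Chunks still assigned to the shard (config.chunks, field "shard").
    chunks: configDb.getCollection("chunks").find({ shard: shardName }).count(),
    // Databases whose primary shard is still the removed shard.
    databases: configDb.getCollection("databases").find({ primary: shardName }).count()
  };
}

// Usage in the mongo shell (not run here):
//   printjson(findShardReferences(db.getSiblingDB("config"), "rsgewrset40"));
```

If all three counts are zero, the config metadata itself probably no longer refers to the shard, and any remaining reference is more likely cached elsewhere.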

We then shut down the machines in the replica set and the arbiter for this shard.

All systems except for our map/reduce jobs are running fine. Our MR job is getting the following exception:

MongoDB shell version: 2.2.6
connecting to: REWRWEB1P:27017/crew_feuds_prod
Fri Mar 14 14:42:04 uncaught exception: map reduce failed:{
'ok' : 0,
'errmsg' : 'MR post processing failed: { result: \'rivals.mp3.pcros\', errmsg: \'exception: could not initialize cursor across all shards because : socket exception [CONNECT_ERROR] for rsgewrset40/rsgewrmng79.taketwo.online:27017,r...\', code: 14827, ok: 0.0 }'
}

We've restarted all of our mongos processes, run flushRouterConfig, and run connPoolSync.
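
For reference, the router-refresh steps mentioned above could be scripted roughly as follows. This is a sketch, not the exact commands that were run: the helper name `refreshRouter` is hypothetical, and it assumes an "admin" database handle on a mongos connection. It would need to be repeated against every mongos.

```javascript
// Sketch: issue the two refresh commands against one mongos.
// Assumes `adminDb` is the "admin" database handle of a mongos connection,
// e.g. db.getSiblingDB("admin"). Helper name is hypothetical.
function refreshRouter(adminDb) {
  // Ask mongos to reload its cached cluster metadata from the config servers.
  var flush = adminDb.runCommand({ flushRouterConfig: 1 });
  // Ask mongos to synchronise its connection pools.
  var sync = adminDb.runCommand({ connPoolSync: 1 });
  // Report whether each command returned ok: 1.
  return { flushRouterConfig: flush.ok === 1, connPoolSync: sync.ok === 1 };
}

// Usage in the mongo shell (not run here):
//   printjson(refreshRouter(db.getSiblingDB("admin")));
```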

We've had to restart the replica set that was drained and just leave it running even though it's not part of the cluster.

What do we need to do to get the MR job to forget about this node?

Assignee: Siyuan Zhou
Reporter: Al Gehrig
Participants: Al Gehrig, Daniel Pasette, Lars Jacob, Remon van Vliet, Siyuan Zhou
Votes: 3
Watchers: 8

Created: Mar 14 2014 11:04:06 PM UTC
Updated: Dec 10 2014 11:18:31 PM UTC
Resolved: Oct 30 2014 07:52:37 PM UTC
