[SERVER-13217] Socket Exception on MapReduce from Removed Shard Created: 14/Mar/14  Updated: 10/Dec/14  Resolved: 30/Oct/14

Status: Closed
Project: Core Server
Component/s: MapReduce, Sharding
Affects Version/s: 2.2.6
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Al Gehrig Assignee: Siyuan Zhou
Resolution: Done Votes: 3
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: Linux
Participants:

 Description   

We removed a 2-node replica set shard from our sharded cluster yesterday. The shard fully drained, and we ran the "final" removeShard command, which returned the following:

mongos> db.runCommand({ removeShard : "rsgewrset40" })
{
	"errmsg" : "exception: can't find shard for: rsgewrset40",
	"code" : 13129,
	"ok" : 0
}
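For reference, a healthy drain normally progresses through three states; the transcript below is a sketch of what we expected to see (the chunk counts are illustrative):

mongos> db.runCommand({ removeShard : "rsgewrset40" })
{ "msg" : "draining started successfully", "state" : "started", "shard" : "rsgewrset40", "ok" : 1 }

mongos> db.runCommand({ removeShard : "rsgewrset40" })   // repeated while draining
{ "msg" : "draining ongoing", "state" : "ongoing", "remaining" : { "chunks" : NumberLong(12), "dbs" : NumberLong(0) }, "ok" : 1 }

mongos> db.runCommand({ removeShard : "rsgewrset40" })   // final call once migration finishes
{ "msg" : "removeshard completed successfully", "state" : "completed", "shard" : "rsgewrset40", "ok" : 1 }

Instead of the "completed" response, our final call produced the error shown above.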

We then shut down the machines in the replica set and the arbiter for this shard.

All systems except our map/reduce jobs are running fine. Our MR job fails with the following exception:

MongoDB shell version: 2.2.6
connecting to: REWRWEB1P:27017/crew_feuds_prod
Fri Mar 14 14:42:04 uncaught exception: map reduce failed: {
	"ok" : 0,
	"errmsg" : "MR post processing failed: { result: \"rivals.mp3.pcros\", errmsg: \"exception: could not initialize cursor across all shards because : socket exception [CONNECT_ERROR] for rsgewrset40/rsgewrmng79.taketwo.online:27017,r...\", code: 14827, ok: 0.0 }"
}
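For context, the job writes its output to a sharded collection, so the "post processing" phase is the step where mongos merges per-shard results; per the error message, that step still tries to open cursors across all shards, including the removed one. A hypothetical invocation of this shape (the map/reduce functions and collection names here are placeholders, not our actual job):

mongos> db.crews.mapReduce(
...     function() { emit(this.crewId, 1); },                  // placeholder map
...     function(key, values) { return Array.sum(values); },   // placeholder reduce
...     { out: { reduce: "mp3.pcros", sharded: true } }        // reduce into a sharded output collection
... )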

We've restarted all of our mongos processes, flushed the router config (flushRouterConfig), and run connPoolSync.
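Concretely, against each mongos that was roughly the following (a sketch of the commands and their usual replies):

mongos> db.adminCommand({ flushRouterConfig : 1 })
{ "flushed" : true, "ok" : 1 }

mongos> db.adminCommand({ connPoolSync : 1 })
{ "ok" : 1 }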

We've had to restart the replica set that was drained and just leave it running even though it's not part of the cluster.

What do we need to do to get the MR job to forget about this node?



 Comments   
Comment by Siyuan Zhou [ 30/Oct/14 ]

Hi al.gehrig@rockstarsandiego.com and lars@iamat.com, we haven't heard back from you for some time, so I'm going to mark this ticket as resolved. If this is still an issue for you, feel free to re-open the ticket and provide the information Dan asked for in the previous comment.

Regards, Siyuan

Comment by Daniel Pasette (Inactive) [ 29/May/14 ]

Can you tell me your cluster configuration? Can you post the exact map/reduce command you are running?

Comment by Remon van Vliet [ 27/May/14 ]

We would like an update on this as well.

Comment by Lars Jacob [ 25/Apr/14 ]

OK, after a bit of digging we found that forcing a failover of the primary of the remaining replica set in the shard solved the problem. It seems to be related to a cache that didn't get invalidated after the drained shard was removed.
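For anyone else hitting this, forcing that failover from the shell looks roughly like this (a sketch; 60 is the number of seconds the old primary stays ineligible for re-election):

// run on the current primary of the remaining shard's replica set
rs.stepDown(60)
// the shell connection drops as the primary steps down;
// reconnect and confirm that a new primary has been elected:
rs.status()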

Comment by Lars Jacob [ 24/Apr/14 ]

Having the same issue here with MongoDB 2.4.9.
