[SERVER-5286] DBClientBase::findN: transport error for query: { mapreduce.shardedfinish: {....} } Created: 12/Mar/12 Updated: 15/Feb/13 Resolved: 29/Aug/12
| Status: | Closed |
| Project: | Core Server |
| Component/s: | MapReduce, Sharding |
| Affects Version/s: | 2.0.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Y. Wayne Huang | Assignee: | Randolph Tan |
| Resolution: | Done | Votes: | 2 |
| Labels: | 212push |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | Ubuntu 10.04 x86_64 |
| Issue Links: | |
| Operating System: | Linux |
| Participants: | |
| Description |
We are running an unauthenticated sharded setup with 4 shards, each a replica set of 2 members and an arbiter. All nodes are running 2.0.3. Periodically, mapReduce jobs fail with:

    Array , {}), reduce: CodeWScope( , {}), query: { date: { $gt: 1330944522 }, content_type: { $in: [ "text/html; charset=utf-8", "text/html" ] }}, out: "mr.PageView.1331545722.6649.700620" }, shardedOutputCollection: "tmp.mrs.PageView_1331545723_8", shards: { shard2/mongo-s2-02:27022,mongo-s2-01:27021: { result: "tmp.mrs.PageView_1331545723_8", timeMillis: 7056, counts: { input: 146527, emit: 35517, reduce: 4562, output: 1391 }, ok: 1.0 }, shard3/mongo-s3-01:27017,mongo-s3-02:27017: { result: "tmp.mrs.PageView_1331545723_8", timeMillis: 2127, counts: { input: 37213, emit: 34834, reduce: 4460, output: 1351 }, ok: 1.0 }, shard4/mongo-s4-02:27017,mongo-s4-03:27017,mongo-s4-01:27017: { result: "tmp.mrs.PageView_1331545723_8", timeMillis: 3031, counts: { input: 58947, emit: 51909, reduce: 5721, output: 2096 }, ok: 1.0 } }, shardCounts: { shard2/mongo-s2-02:27022,mongo-s2-01:27021: { input: 146527, emit: 35517, reduce: 4562, output: 1391 }, shard3/mongo-s3-01:27017,mongo-s3-02:27017: { input: 37213, emit: 34834, reduce: 4460, output: 1351 }, shard4/mongo-s4-02:27017,mongo-s4-03:27017,mongo-s4-01:27017: { input: 58947, emit: 51909, reduce: 5721, output: 2096 }}, counts: { emit: 122260, input: 242687, output: 4838, reduce: 14743 } }

This happens periodically. Running flushRouterConfig ahead of the MR job does not resolve the issue, and bouncing mongos does not resolve it either. In back-to-back runs, it fails about 4 out of every 5 tries with 1 success, despite no changes on our end.
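For context, a minimal sketch (not taken from the ticket) of the kind of sharded mapReduce job and the flushRouterConfig attempt described above. The PageView collection name and query fields mirror the command output, while the map/reduce bodies and the output collection name are placeholders.

```javascript
// Hedged sketch run against mongos; map/reduce bodies and the output
// collection name are placeholders, not the reporter's actual functions.
var mapFn = function () {
    emit(this.url, 1);              // placeholder key/value
};

var reduceFn = function (key, values) {
    return Array.sum(values);       // shell helper for summing an array
};

db.PageView.mapReduce(mapFn, reduceFn, {
    query: {
        date: { $gt: 1330944522 },
        content_type: { $in: ["text/html; charset=utf-8", "text/html"] }
    },
    out: "mr.PageView.output"       // placeholder output collection
});

// Workaround attempted in the description (did not help):
db.adminCommand({ flushRouterConfig: 1 });
```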
| Comments |
| Comment by Randolph Tan [ 29/Aug/12 ] |
Based on the given steps, I believe this is by design. A reconfiguration on the primary can cause it to close all connections, and it will take some time to recover: http://docs.mongodb.org/manual/reference/replica-configuration/#usage There is also a related ticket for this.
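To illustrate the recovery behavior described here (this is not an official workaround), a hedged sketch of retrying the job a few times after a replica set reconfiguration, reusing the placeholder mapFn/reduceFn from the sketch above; the retry count and delay are arbitrary.

```javascript
// Sketch only: after rs.reconfig() on a shard, mongos can hold stale
// replica set connections for a while, so the sharded mapReduce may fail
// with a transport error until they are re-established.
function runMapReduceWithRetry(maxAttempts) {
    for (var attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            return db.PageView.mapReduce(mapFn, reduceFn, {
                query: { date: { $gt: 1330944522 } },
                out: "mr.PageView.output"   // placeholder
            });
        } catch (e) {
            print("attempt " + attempt + " failed: " + e);
            sleep(5000);                    // shell helper, waits 5 seconds
        }
    }
    throw new Error("mapReduce still failing after " + maxAttempts + " attempts");
}

runMapReduceWithRetry(5);
```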
| Comment by Grégoire Seux [ 06/Jul/12 ] |
I get the exact same issue with 2.0.6.
| Comment by Y. Wayne Huang [ 25/Apr/12 ] |
What's the status of this? I've verified the same behavior exists in 2.0.4.
| Comment by Y. Wayne Huang [ 05/Apr/12 ] |
Hi Antoine, thanks for narrowing down the issue. Any idea when we might see a fix for this? It's a bit of a pain to have to pause all activity and wait for existing in-progress jobs to finish just to reconfigure a replica set. Thanks!
| Comment by Antoine Girbal [ 22/Mar/12 ] |
There is a SocketException thrown in the server and caught at:
The error is within the call to:
Seems like there are some RS connections in a bad state; after about 5-10 MR runs they are cleaned up and MR works again.
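For anyone trying to observe the bad connections, a hedged sketch of dumping the mongos connection pool state with the connPoolStats admin command; the exact output fields vary by server version, so this just prints whatever is reported per host.

```javascript
// Run against mongos: print per-host connection pool information, which
// is where stale replica set connections would show up after a reconfig.
var stats = db.adminCommand({ connPoolStats: 1 });
var hosts = stats.hosts || {};          // field layout varies by version
for (var host in hosts) {
    printjson({ host: host, pool: hosts[host] });
}
```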
| Comment by Antoine Girbal [ 22/Mar/12 ] |
The bug is also present in 2.0.4 and trunk (2.1.1). To reproduce:
The first error returned is:
Subsequent errors returned are:
| Comment by Antoine Girbal [ 21/Mar/12 ] |
I am able to reproduce.
| Comment by Antoine Girbal [ 20/Mar/12 ] |
Trying to reproduce; will confirm.
| Comment by Y. Wayne Huang [ 19/Mar/12 ] |
Anything?
| Comment by Y. Wayne Huang [ 13/Mar/12 ] |
One of our replica set members was under maintenance by our provider, so I added a new one (first reconfig) as a backup. This is when I first noticed the issue and filed the bug. Then the provider told us they could not recover the failed node, so I removed it (second reconfig) last night and noticed this behavior again.
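For reference, a hedged sketch of the two membership changes described above, run on the affected shard's primary; the host names are placeholders patterned after the ones in the description, and each helper triggers a replica set reconfiguration.

```javascript
// First reconfig: add a backup member while the original is under
// maintenance (host names are placeholders).
rs.add("mongo-s2-03:27022");

// Second reconfig: remove the member the provider could not recover.
rs.remove("mongo-s2-02:27022");
```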
| Comment by Y. Wayne Huang [ 13/Mar/12 ] |
Yes, I reproduced the issue last night. I cannot find a manual way to fix it (such as flushRouterConfig), but it does seem to go away by itself, or possibly the manual fixes I tried take a while to propagate; of that, I'm unsure.
| Comment by Antoine Girbal [ 13/Mar/12 ] |
So are you positive that this issue only happened when you reconfigured a replica set?
| Comment by Y. Wayne Huang [ 13/Mar/12 ] |
It appears to happen after a change to an underlying replica set, such as the addition or removal of members. It remains in this state for some time, during which most MR jobs will fail. Eventually, though, it seems to sort itself out.