[SERVER-5286] DBClientBase::findN: transport error for query: { mapreduce.shardedfinish: {....} } Created: 12/Mar/12  Updated: 15/Feb/13  Resolved: 29/Aug/12

Status: Closed
Project: Core Server
Component/s: MapReduce, Sharding
Affects Version/s: 2.0.3
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Y. Wayne Huang Assignee: Randolph Tan
Resolution: Done Votes: 2
Labels: 212push
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

ubuntu 10.04 x86_64


Issue Links:
Related
is related to SERVER-6752 Do not close all connections on repli... Closed
Operating System: Linux
Participants:

 Description   

We are running an unauthenticated sharded setup with 4 shards, each a replica set of 2 members plus an arbiter. All nodes are running 2.0.3. Periodically, mapReduce jobs fail with:

Array
(
    [assertion] => DBClientBase::findN: transport error: mongo-s1-01:27011 query: { mapreduce.shardedfinish: { mapreduce: "PageView", map: CodeWScope( function() { ... }, {}), reduce: CodeWScope( function(k, vals) { ... }, {}), query: { date: { $gt: 1330944522 }, content_type: { $in: [ "text/html; charset=utf-8", "text/html" ] } }, out: "mr.PageView.1331545722.6649.700620" }, shardedOutputCollection: "tmp.mrs.PageView_1331545723_8", shards: { shard2/mongo-s2-02:27022,mongo-s2-01:27021: { result: "tmp.mrs.PageView_1331545723_8", timeMillis: 7056, counts: { input: 146527, emit: 35517, reduce: 4562, output: 1391 }, ok: 1.0 }, shard3/mongo-s3-01:27017,mongo-s3-02:27017: { result: "tmp.mrs.PageView_1331545723_8", timeMillis: 2127, counts: { input: 37213, emit: 34834, reduce: 4460, output: 1351 }, ok: 1.0 }, shard4/mongo-s4-02:27017,mongo-s4-03:27017,mongo-s4-01:27017: { result: "tmp.mrs.PageView_1331545723_8", timeMillis: 3031, counts: { input: 58947, emit: 51909, reduce: 5721, output: 2096 }, ok: 1.0 } }, shardCounts: { shard2/mongo-s2-02:27022,mongo-s2-01:27021: { input: 146527, emit: 35517, reduce: 4562, output: 1391 }, shard3/mongo-s3-01:27017,mongo-s3-02:27017: { input: 37213, emit: 34834, reduce: 4460, output: 1351 }, shard4/mongo-s4-02:27017,mongo-s4-03:27017,mongo-s4-01:27017: { input: 58947, emit: 51909, reduce: 5721, output: 2096 } }, counts: { emit: 122260, input: 242687, output: 4838, reduce: 14743 } }
    [assertionCode] => 10276
    [errmsg] => db assertion failure
    [ok] => 0
)

This happens periodically. Running flushRouterConfig ahead of the MR job does not resolve the issue, and neither does bouncing mongos. In back-to-back runs it fails on about 4 out of every 5 tries, with 1 success, despite no changes on our end.
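
Since the job does intermittently succeed (and, per the comments below, the bad connections appear to clear after a handful of attempts), a blunt client-side workaround is simply to retry on the transport error. A minimal sketch, assuming the legacy 2.x Java driver; the mongos address, database/collection names, and map/reduce bodies are placeholders, not the reporter's actual setup:

import com.mongodb.*;

public class RetryShardedMapReduce {
    public static void main(String[] args) throws Exception {
        Mongo mongos = new Mongo("mongo-router", 27017);                 // hypothetical mongos address
        DBCollection coll = mongos.getDB("analytics").getCollection("PageView");

        // placeholder map/reduce and query, loosely modeled on the failing job above
        String map = "function() { emit(this.content_type, 1); }";
        String reduce = "function(k, vals) { var n = 0; vals.forEach(function(v) { n += v; }); return n; }";
        DBObject query = new BasicDBObject("date", new BasicDBObject("$gt", 1330944522));
        MapReduceCommand cmd = new MapReduceCommand(coll, map, reduce,
                "mr_pageview", MapReduceCommand.OutputType.REPLACE, query);

        int maxAttempts = 6;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                coll.mapReduce(cmd);                                     // throws on command failure
                break;                                                   // succeeded
            } catch (MongoException e) {
                // the failure surfaces as a CommandFailure whose message carries the
                // "transport error" / assertionCode 10276 text shown above
                boolean transportError = e.getMessage() != null
                        && e.getMessage().contains("transport error");
                if (!transportError || attempt == maxAttempts) throw e;
                Thread.sleep(30 * 1000L);                                // give stale connections time to clear
            }
        }
    }
}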



 Comments   
Comment by Randolph Tan [ 29/Aug/12 ]

Based on the given steps, I believe this is by design. A reconfiguration can cause the primary to close all connections, and it will take some time to recover:

http://docs.mongodb.org/manual/reference/replica-configuration/#usage

There is also a ticket (SERVER-6752) to make the reconfiguration less sensitive.
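
In that light, one option is to wait for the reconfigured set to report healthy before resuming sharded map/reduce jobs. A rough sketch, assuming the legacy 2.x Java driver; the member host and port are placeholders:

import com.mongodb.*;

public class WaitForReplicaSet {
    public static void main(String[] args) throws Exception {
        // connect directly to a surviving member of the reconfigured shard (placeholder host/port)
        Mongo member = new Mongo("mongo-s2-01", 27021);
        DB admin = member.getDB("admin");

        while (true) {
            CommandResult status = admin.command(new BasicDBObject("replSetGetStatus", 1));
            boolean allHealthy = status.ok();
            if (allHealthy) {
                for (Object o : (BasicDBList) status.get("members")) {
                    DBObject m = (DBObject) o;
                    Object health = m.get("health");                 // 0 or 1; may be absent for the self member
                    if (health != null && ((Number) health).intValue() != 1) {
                        allHealthy = false;
                        break;
                    }
                }
            }
            if (allHealthy) break;
            Thread.sleep(5 * 1000L);                                 // poll until every member reports healthy
        }
        // reasonable point to resume sharded map/reduce traffic
    }
}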

Comment by Grégoire Seux [ 06/Jul/12 ]

I get the exact same issue with 2.0.6.

Comment by Y. Wayne Huang [ 25/Apr/12 ]

What's the status of this? I've verified that the same behavior exists in 2.0.4.

Comment by Y. Wayne Huang [ 05/Apr/12 ]

Hi Antoine, thanks for narrowing down the issue. Any idea when we might see a fix for this? It's a bit of a pain to have to pause all activity and wait for existing in-progress jobs to finish just to reconfigure a replica set. Thanks!

Comment by Antoine Girbal [ 22/Mar/12 ]

There is a SocketException thrown in the server and caught at:

	mongo::ParallelSortClusteredCursor::_oldInit() at /Users/antoine/workspace/mongo/src/mongo/client/parallel.cpp:1,279 0x1002c9a55	
	mongo::ParallelSortClusteredCursor::_init() at /Users/antoine/workspace/mongo/src/mongo/client/parallel.cpp:1,169 0x1002c2b78	
	mongo::ClusteredCursor::init() at /Users/antoine/workspace/mongo/src/mongo/client/parallel.cpp:82 0x1002af8c3	
	mongo::mr::MapReduceFinishCommand::run() at /Users/antoine/workspace/mongo/src/mongo/db/commands/mr.cpp:1,296 0x1005ccfca	
	mongo::_execCommand() at /Users/antoine/workspace/mongo/src/mongo/db/dbcommands.cpp:1,786 0x1002f723e	
	mongo::execCommand() at /Users/antoine/workspace/mongo/src/mongo/db/dbcommands.cpp:1,892 0x1002f8c84	
	mongo::_runCommands() at /Users/antoine/workspace/mongo/src/mongo/db/dbcommands.cpp:1,974 0x1002f9f90	
	mongo::runCommands() at /Users/antoine/workspace/mongo/src/mongo/db/ops/query.cpp:42 0x100112c05	
	mongo::runQuery() at /Users/antoine/workspace/mongo/src/mongo/db/ops/query.cpp:810 0x100113470	
	receivedQuery() at /Users/antoine/workspace/mongo/src/mongo/db/instance.cpp:242 0x10070f0cd	

The error occurs within the call to:

                    _cursors[i].raw()->initLazy( ! firstPass );
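                    // ^ the SocketException above surfaces inside this call, which lazily
                    //   initializes one shard cursor's connection (see the stack trace into _oldInit above)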

It seems that some replica-set connections are in a bad state; after about 5-10 MR runs they are cleaned up and MR works again.

Comment by Antoine Girbal [ 22/Mar/12 ]

The bug is also present in 2.0.4 and trunk (2.1.1).

To reproduce (a driver-level sketch of these steps appears at the end of this comment):

  • add a replica to a shard
  • do a MR that uses a sharded input
  • stop and remove that new replica
  • find() still works but MR fails (and I believe some other commands too)

The first error returned is:

com.mongodb.CommandResult$CommandFailure: command failed [command failed [mapreduce] { "serverUsed" : "localhost:27001" , "ok" : 0.0 , "errmsg" : "final reduce failed: { result: \"mrout\", errmsg: \"exception: could not initialize cursor across all shards because : socket exception @ foo/agmac.local:27017,agmac.local:27018\", code: 14827, ok: 0.0 }"}
	at com.mongodb.CommandResult.getException(CommandResult.java:75)
	at com.mongodb.CommandResult.throwOnError(CommandResult.java:121)

Subsequent errors returned are:

com.mongodb.CommandResult$CommandFailure: command failed [command failed [mapreduce] { "serverUsed" : "localhost:27001" , "assertion" : "DBClientBase::findN: transport error: agmac.local:27018 query: { setShardVersion: \"\", init: true, configdb: \"localhost:27000\", serverID: ObjectId('4f6a2d0fc7e45c8360ef9a03'), authoritative: true }" , "assertionCode" : 10276 , "errmsg" : "db assertion failure" , "ok" : 0}
	at com.mongodb.CommandResult.getException(CommandResult.java:75)
	at com.mongodb.CommandResult.throwOnError(CommandResult.java:121)
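
A hypothetical driver-level sketch of the reproduction steps above, assuming the legacy 2.x Java driver; the hosts, ports, and collection are placeholders loosely taken from the errors shown in this comment:

import com.mongodb.*;

public class ShardedMrRepro {
    public static void main(String[] args) throws Exception {
        // 1. Add a replica to one shard: fetch the current config from the primary,
        //    append a member, bump the version, and run replSetReconfig.
        Mongo shardPrimary = new Mongo("agmac.local", 27017);
        DBObject conf = shardPrimary.getDB("local").getCollection("system.replset").findOne();
        BasicDBList members = (BasicDBList) conf.get("members");
        members.add(new BasicDBObject("_id", members.size())        // assumes member _ids are 0..n-1
                .append("host", "agmac.local:27018"));
        conf.put("version", ((Number) conf.get("version")).intValue() + 1);
        shardPrimary.getDB("admin").command(new BasicDBObject("replSetReconfig", conf));

        // 2. Run a map/reduce over a sharded collection through mongos: this succeeds.
        Mongo mongos = new Mongo("localhost", 27001);
        DBCollection coll = mongos.getDB("test").getCollection("PageView");
        String map = "function() { emit(this.content_type, 1); }";
        String reduce = "function(k, vals) { var n = 0; vals.forEach(function(v) { n += v; }); return n; }";
        MapReduceCommand cmd = new MapReduceCommand(coll, map, reduce,
                "mrout", MapReduceCommand.OutputType.REPLACE, null);
        coll.mapReduce(cmd);

        // 3. Stop mongod on agmac.local:27018 and remove it from the set
        //    (the reverse of step 1: drop the member, bump the version, replSetReconfig again).

        // 4. A plain find() through mongos still works ...
        coll.findOne();
        // ... but the same map/reduce now fails with the "transport error" /
        //     assertionCode 10276 shown above, until the stale connections are cleaned up.
        coll.mapReduce(cmd);
    }
}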

Comment by Antoine Girbal [ 21/Mar/12 ]

I am able to reproduce.
The error doesn't lie in mongos, but in the mongod that does the final MR post-processing.
That server gathers documents from all the other shards, but it appears to get an error when opening a cursor to do so, after one of the shards' replica sets was reconfigured to remove a server.
This is odd, since the removed server is a secondary and should not be used when slaveOk=false.
This was verified with mongod 2.0.2; trying with 2.0.4.

Comment by Antoine Girbal [ 20/Mar/12 ]

Trying to reproduce, will confirm

Comment by Y. Wayne Huang [ 19/Mar/12 ]

anything?

Comment by Y. Wayne Huang [ 13/Mar/12 ]

One of our replica set members was under maintenance by our provider, so I added a new one (first reconfig) as a backup. This is when I first noticed the issue and filed the bug. Then the provider told us they could not recover the failed node, so I removed it (second reconfig) last night and noticed this behavior again.

Comment by Y. Wayne Huang [ 13/Mar/12 ]

Yes, I reproduced the issue last night. I cannot find a manual way to fix it (such as flushRouterConfig), but it does seem to go away by itself, or possibly the manual fixes I tried take a while to propagate; of that I'm unsure.

Comment by Antoine Girbal [ 13/Mar/12 ]

So are you positive that this issue only happened when you reconfigured a replica set?

Comment by Y. Wayne Huang [ 13/Mar/12 ]

It appears to happen after a change to an underlying replica set, such as the addition or removal of members. It remains in this state for some time, during which most MR jobs will fail. Eventually, though, it seems to sort itself out.
