[SERVER-3446] Seeing lots of "transport error" Created: 19/Jul/11  Updated: 12/Jul/16  Resolved: 02/Sep/11

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 1.8.1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Theo Hultberg Assignee: Mathias Stearn
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

mongod 1.8.1, mongos 1.8.2-rc2. Six-server cluster with four shards, on EC2 instances with 68 GB RAM. JRuby with the latest mongo gem.


Operating System: ALL
Participants:

 Description   

I see a lot of "transport error" messages like the ones below in my application, and I can't figure out why. I've had to guard every call to mongo in my application with a lot of error-handling logic just to avoid keeling over every time it throws an OperationFailure.

Here's a grep for "transport error" from a mongos.log. All of these happened yesterday (there are tons more in the logs for previous days):

Mon Jul 18 08:55:40 [WriteBackListener] WriteBackListener exception : DBClientBase::findOne: transport error: rfmcolldb05:27017 query:

{ writebacklisten: ObjectId('4e16ba0635330bc016259dc4') }

Mon Jul 18 16:25:27 [conn659] warning: Could not get last error.DBClientBase::findOne: transport error: rfmcolldb06:27117 query:

{ getlasterror: 1 }

Mon Jul 18 16:25:27 [WriteBackListener] WriteBackListener exception : DBClientBase::findOne: transport error: rfmcolldb06:27117 query:

{ writebacklisten: ObjectId('4e16ba0635330bc016259dc4') }

Mon Jul 18 16:25:27 [Balancer] caught exception while doing balance: DBClientBase::findOne: transport error: rfmcolldb06:27117 query:

{ features: 1 }

Mon Jul 18 16:25:27 [conn664] warning: Could not get last error.DBClientBase::findOne: transport error: rfmcolldb06:27117 query:

{ getlasterror: 1 }

Mon Jul 18 16:35:14 [WriteBackListener] WriteBackListener exception : DBClientBase::findOne: transport error: rfmcolldb01:27017 query:

{ writebacklisten: ObjectId('4e16ba0635330bc016259dc4') }

Mon Jul 18 16:43:53 [WriteBackListener] WriteBackListener exception : DBClientBase::findOne: transport error: rfmcolldb01:27117 query:

{ writebacklisten: ObjectId('4e16ba0635330bc016259dc4') }

Mon Jul 18 16:43:53 [conn685] warning: Could not get last error.DBClientBase::findOne: transport error: rfmcolldb01:27117 query:

{ getlasterror: 1 }

Mon Jul 18 16:43:53 [Balancer] caught exception while doing balance: DBClientBase::findOne: transport error: rfmcolldb01:27117 query:

{ features: 1 }

Mon Jul 18 17:08:48 [WriteBackListener] WriteBackListener exception : DBClientBase::findOne: transport error: rfmcolldb04:27117 query:

{ writebacklisten: ObjectId('4e16ba0635330bc016259dc4') }

Mon Jul 18 17:09:02 [WriteBackListener] WriteBackListener exception : DBClientBase::findOne: transport error: rfmcolldb04:27017 query:

{ writebacklisten: ObjectId('4e16ba0635330bc016259dc4') }

Mon Jul 18 17:09:02 [Balancer] caught exception while doing balance: DBClientBase::findOne: transport error: rfmcolldb04:27017 query:

{ features: 1 }

Mon Jul 18 17:09:08 [WriteBackListener] WriteBackListener exception : DBClientBase::findOne: transport error: rfmcolldb05:27117 query:

{ writebacklisten: ObjectId('4e16ba0635330bc016259dc4') }

Mon Jul 18 17:15:08 ERROR: couldn't unset sharding : DBClientBase::findOne: transport error: rfmcolldb05:27117 query:

{ unsetSharding: 1 }

Mon Jul 18 17:19:44 [WriteBackListener] ERROR: error processing writeback: 10276 DBClientBase::findOne: transport error: rfmcolldb04:27017 query: { setShardVersion: "fragments.pageview_fragments", configdb: "rfmcolldb01:28100,rfmcolldb0
Mon Jul 18 17:20:17 [WriteBackListener] ERROR: error processing writeback: 10276 DBClientBase::findOne: transport error: rfmcolldb05:27117 query: { setShardVersion: "complete.exposures", configdb: "rfmcolldb01:28100,rfmcolldb02:28100,rf
Mon Jul 18 17:20:56 [WriteBackListener] ERROR: error processing writeback: 10276 DBClientBase::findOne: transport error: rfmcolldb05:27117 query:

{ getlasterror: 1 }

Mon Jul 18 18:20:05 [WriteBackListener] ERROR: error processing writeback: 10276 DBClientBase::findOne: transport error: rfmcolldb04:27017 query:

{ getlasterror: 1 }

Mon Jul 18 22:10:17 [WriteBackListener] WriteBackListener exception : DBClientBase::findOne: transport error: rfmcolldb03:27117 query:

{ writebacklisten: ObjectId('4e16ba0635330bc016259dc4') }

Mon Jul 18 22:10:30 [WriteBackListener] WriteBackListener exception : DBClientBase::findOne: transport error: rfmcolldb03:27017 query:

{ writebacklisten: ObjectId('4e16ba0635330bc016259dc4') }

Mon Jul 18 22:10:30 [conn1374] warning: Could not get last error.DBClientBase::findOne: transport error: rfmcolldb03:27017 query:

{ getlasterror: 1 }

Mon Jul 18 22:10:31 [WriteBackListener] WriteBackListener exception : DBClientBase::findOne: transport error: rfmcolldb03:27017 query:

{ writebacklisten: ObjectId('4e16ba0635330bc016259dc4') }

Mon Jul 18 22:10:31 ERROR: couldn't unset sharding : DBClientBase::findOne: transport error: rfmcolldb03:27017 query:

{ unsetSharding: 1 }

Mon Jul 18 22:10:31 ERROR: couldn't unset sharding : DBClientBase::findOne: transport error: rfmcolldb03:27017 query:

{ unsetSharding: 1 }

Mon Jul 18 22:10:31 ERROR: couldn't unset sharding : DBClientBase::findOne: transport error: rfmcolldb03:27017 query:

{ unsetSharding: 1 }

Mon Jul 18 22:10:31 ERROR: couldn't unset sharding : DBClientBase::findOne: transport error: rfmcolldb03:27017 query:

{ unsetSharding: 1 }

Mon Jul 18 22:10:33 [WriteBackListener] WriteBackListener exception : DBClientBase::findOne: transport error: rfmcolldb03:27017 query:

{ writebacklisten: ObjectId('4e16ba0635330bc016259dc4') }

Mon Jul 18 22:10:43 [WriteBackListener] ERROR: error processing writeback: 10276 DBClientBase::findOne: transport error: rfmcolldb02:27017 query: { setShardVersion: "fragments.exposure_fragments", configdb: "rfmcolldb01:28100,rfmcolldb0
Mon Jul 18 22:10:43 [WriteBackListener] ERROR: error processing writeback: 10276 DBClientBase::findOne: transport error: rfmcolldb02:27017 query: { setShardVersion: "fragments.exposure_fragments", configdb: "rfmcolldb01:28100,rfmcolldb0
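
To see whether the errors cluster on particular shard hosts or times, the grep output above can be tallied per host. The following is only a minimal sketch: it assumes log lines shaped like the ones quoted in this ticket, and the function name `count_transport_errors` and the regular expression are illustrative, not part of any MongoDB tool.

```python
import re
from collections import Counter

# Captures the host:port that follows "transport error:" in lines such as
# "... DBClientBase::findOne: transport error: rfmcolldb05:27017 query:"
# (pattern is an assumption based on the log excerpt in this ticket).
TRANSPORT_ERROR = re.compile(r"transport error: (\S+:\d+)")


def count_transport_errors(lines):
    """Return a Counter mapping host:port to the number of transport errors."""
    counts = Counter()
    for line in lines:
        match = TRANSPORT_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Small demonstration using two lines copied from the log above.
sample = [
    "Mon Jul 18 08:55:40 [WriteBackListener] WriteBackListener exception : "
    "DBClientBase::findOne: transport error: rfmcolldb05:27017 query:",
    "Mon Jul 18 16:25:27 [conn659] warning: Could not get last error."
    "DBClientBase::findOne: transport error: rfmcolldb06:27117 query:",
]
for host, n in count_transport_errors(sample).most_common():
    print(host, n)
```

Passing an open mongos.log file object instead of `sample` would tally the whole log, since the function only iterates over lines.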



 Comments   
Comment by Eliot Horowitz (Inactive) [ 02/Sep/11 ]

We've lowered the keepalive time in 2.0.

Comment by Kyle Banker [ 27/Jul/11 ]

I've seen this issue on EC2 when tcp_keepalive_time is too high. 300 seconds is a good setting, but the default is sometimes 7200.

sysctl net.ipv4.tcp_keepalive_time=300

To make this permanent, add the line net.ipv4.tcp_keepalive_time = 300 to /etc/sysctl.conf.
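
Before changing anything, it may help to confirm what each host is currently using. The sketch below is an illustration under assumptions, not an official check: it reads the Linux file /proc/sys/net/ipv4/tcp_keepalive_time, and the helper names and the 300-second threshold (taken from the suggestion above) are chosen for this example only.

```python
from pathlib import Path

# Linux exposes the current setting here; the file does not exist elsewhere.
KEEPALIVE_PATH = Path("/proc/sys/net/ipv4/tcp_keepalive_time")
SUGGESTED_SECONDS = 300  # threshold assumed from the suggestion in this comment


def parse_keepalive(text):
    """Convert the file content (e.g. '7200\\n') to an integer number of seconds."""
    return int(text.strip())


def keepalive_too_high(seconds, limit=SUGGESTED_SECONDS):
    """Return True when the keepalive time exceeds the suggested limit."""
    return seconds > limit


if KEEPALIVE_PATH.exists():
    current = parse_keepalive(KEEPALIVE_PATH.read_text())
    if keepalive_too_high(current):
        print(f"tcp_keepalive_time is {current}s; consider lowering it to {SUGGESTED_SECONDS}s")
    else:
        print(f"tcp_keepalive_time is {current}s; no change needed")
else:
    print("tcp_keepalive_time not found; this check only applies to Linux")
```

Running it on every mongod and mongos host would show which machines still carry the 7200-second default.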

Comment by Theo Hultberg [ 20/Jul/11 ]

Yes, it looks like they correspond to times when the mongo logs complain about network connectivity errors.

Comment by Eliot Horowitz (Inactive) [ 19/Jul/11 ]

That usually indicates either a server down or network trouble.
Can you check for those?

Generated at Thu Feb 08 03:03:05 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.