[SERVER-3446] Seeing lots of "transport error" Created: 19/Jul/11 Updated: 12/Jul/16 Resolved: 02/Sep/11 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 1.8.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Theo Hultberg | Assignee: | Mathias Stearn |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
mongod 1.8.1, mongos 1.8.2-rc2. Six server cluster with four shards, 68 gig RAM EC2 instances. JRuby with latest mongo gem. |
||
| Operating System: | ALL |
| Participants: |
| Description |
|
I see a lot of "transport errors" like the ones below in my application, and I can't figure out why. I've had to guard every call to mongo in my application with a lot of error handling logic just to avoid keeling over every time it throws an OperationFailure. Here's a grep for "transport error" from a mongos.log; all of these happened yesterday (there are tons more in the logs for previous days):
Mon Jul 18 08:55:40 [WriteBackListener] WriteBackListener exception : DBClientBase::findOne: transport error: rfmcolldb05:27017 query: { writebacklisten: ObjectId('4e16ba0635330bc016259dc4') }
Mon Jul 18 16:25:27 [conn659] warning: Could not get last error.DBClientBase::findOne: transport error: rfmcolldb06:27117 query: { getlasterror: 1 }
Mon Jul 18 16:25:27 [WriteBackListener] WriteBackListener exception : DBClientBase::findOne: transport error: rfmcolldb06:27117 query: { writebacklisten: ObjectId('4e16ba0635330bc016259dc4') }
Mon Jul 18 16:25:27 [Balancer] caught exception while doing balance: DBClientBase::findOne: transport error: rfmcolldb06:27117 query: { features: 1 }
Mon Jul 18 16:25:27 [conn664] warning: Could not get last error.DBClientBase::findOne: transport error: rfmcolldb06:27117 query: { getlasterror: 1 }
Mon Jul 18 16:35:14 [WriteBackListener] WriteBackListener exception : DBClientBase::findOne: transport error: rfmcolldb01:27017 query: { writebacklisten: ObjectId('4e16ba0635330bc016259dc4') }
Mon Jul 18 16:43:53 [WriteBackListener] WriteBackListener exception : DBClientBase::findOne: transport error: rfmcolldb01:27117 query: { writebacklisten: ObjectId('4e16ba0635330bc016259dc4') }
Mon Jul 18 16:43:53 [conn685] warning: Could not get last error.DBClientBase::findOne: transport error: rfmcolldb01:27117 query: { getlasterror: 1 }
Mon Jul 18 16:43:53 [Balancer] caught exception while doing balance: DBClientBase::findOne: transport error: rfmcolldb01:27117 query: { features: 1 }
Mon Jul 18 17:08:48 [WriteBackListener] WriteBackListener exception : DBClientBase::findOne: transport error: rfmcolldb04:27117 query: { writebacklisten: ObjectId('4e16ba0635330bc016259dc4') }
Mon Jul 18 17:09:02 [WriteBackListener] WriteBackListener exception : DBClientBase::findOne: transport error: rfmcolldb04:27017 query: { writebacklisten: ObjectId('4e16ba0635330bc016259dc4') }
Mon Jul 18 17:09:02 [Balancer] caught exception while doing balance: DBClientBase::findOne: transport error: rfmcolldb04:27017 query: { features: 1 }
Mon Jul 18 17:09:08 [WriteBackListener] WriteBackListener exception : DBClientBase::findOne: transport error: rfmcolldb05:27117 query: { writebacklisten: ObjectId('4e16ba0635330bc016259dc4') }
Mon Jul 18 17:15:08 ERROR: couldn't unset sharding : DBClientBase::findOne: transport error: rfmcolldb05:27117 query: { unsetSharding: 1 }
Mon Jul 18 17:19:44 [WriteBackListener] ERROR: error processing writeback: 10276 DBClientBase::findOne: transport error: rfmcolldb04:27017 query: { setShardVersion: "fragments.pageview_fragments", configdb: "rfmcolldb01:28100,rfmcolldb0
Mon Jul 18 18:20:05 [WriteBackListener] ERROR: error processing writeback: 10276 DBClientBase::findOne: transport error: rfmcolldb04:27017 query: { getlasterror: 1 }
Mon Jul 18 22:10:17 [WriteBackListener] WriteBackListener exception : DBClientBase::findOne: transport error: rfmcolldb03:27117 query: { writebacklisten: ObjectId('4e16ba0635330bc016259dc4') }
Mon Jul 18 22:10:30 [WriteBackListener] WriteBackListener exception : DBClientBase::findOne: transport error: rfmcolldb03:27017 query: { writebacklisten: ObjectId('4e16ba0635330bc016259dc4') }
Mon Jul 18 22:10:30 [conn1374] warning: Could not get last error.DBClientBase::findOne: transport error: rfmcolldb03:27017 query: { getlasterror: 1 }
Mon Jul 18 22:10:31 [WriteBackListener] WriteBackListener exception : DBClientBase::findOne: transport error: rfmcolldb03:27017 query: { writebacklisten: ObjectId('4e16ba0635330bc016259dc4') }
Mon Jul 18 22:10:31 ERROR: couldn't unset sharding : DBClientBase::findOne: transport error: rfmcolldb03:27017 query: { unsetSharding: 1 }
Mon Jul 18 22:10:31 ERROR: couldn't unset sharding : DBClientBase::findOne: transport error: rfmcolldb03:27017 query: { unsetSharding: 1 }
Mon Jul 18 22:10:31 ERROR: couldn't unset sharding : DBClientBase::findOne: transport error: rfmcolldb03:27017 query: { unsetSharding: 1 }
Mon Jul 18 22:10:31 ERROR: couldn't unset sharding : DBClientBase::findOne: transport error: rfmcolldb03:27017 query: { unsetSharding: 1 }
Mon Jul 18 22:10:33 [WriteBackListener] WriteBackListener exception : DBClientBase::findOne: transport error: rfmcolldb03:27017 query: { writebacklisten: ObjectId('4e16ba0635330bc016259dc4') }
Mon Jul 18 22:10:43 [WriteBackListener] ERROR: error processing writeback: 10276 DBClientBase::findOne: transport error: rfmcolldb02:27017 query: { setShardVersion: "fragments.exposure_fragments", configdb: "rfmcolldb01:28100,rfmcolldb0 |
| Comments |
| Comment by Eliot Horowitz (Inactive) [ 02/Sep/11 ] |
|
We've lowered the keep alive time in 2.0 |
| Comment by Kyle Banker [ 27/Jul/11 ] |
|
I've seen this issue on EC2 when tcp_keepalive_time is too high. 300 seconds is a good setting, but the default is sometimes 7200. To lower it on a running system:
sysctl net.ipv4.tcp_keepalive_time=300
Edit /etc/sysctl.conf to make this permanent. |
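For the "make this permanent" step, a minimal sketch of one way to update a sysctl-style config file without duplicating the entry. The function name `persist_keepalive` is hypothetical and not part of any tool; it assumes a POSIX shell with `grep -E`, `sed -E`, and `mktemp` available, and it only edits the file passed to it.

```shell
#!/bin/sh
# Hypothetical helper (not a standard tool): set net.ipv4.tcp_keepalive_time
# in a sysctl-style config file, replacing an existing entry if present.
# Usage: persist_keepalive FILE SECONDS
persist_keepalive() {
    conf_file=$1
    seconds=$2
    key='net.ipv4.tcp_keepalive_time'
    key_re='net\.ipv4\.tcp_keepalive_time'   # same key with dots escaped for regex use

    # Create the file if it does not exist yet.
    [ -f "$conf_file" ] || : > "$conf_file"

    if grep -Eq "^[[:space:]]*${key_re}[[:space:]]*=" "$conf_file"; then
        # An assignment already exists: rewrite it in place via a temp file.
        tmp_file=$(mktemp) || return 1
        sed -E "s|^[[:space:]]*${key_re}[[:space:]]*=.*|${key} = ${seconds}|" \
            "$conf_file" > "$tmp_file" && cat "$tmp_file" > "$conf_file"
        rm -f "$tmp_file"
    else
        # No assignment yet: append one.
        printf '%s = %s\n' "$key" "$seconds" >> "$conf_file"
    fi
}
```

On a real host this would be run as root against /etc/sysctl.conf, for example `persist_keepalive /etc/sysctl.conf 300`, followed by `sysctl -p` to load the file. Running `sysctl net.ipv4.tcp_keepalive_time` afterwards shows the value currently in effect.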
| Comment by Theo Hultberg [ 20/Jul/11 ] |
|
Yes, it looks like they correspond to times when the mongo logs complain about network connectivity errors. |
| Comment by Eliot Horowitz (Inactive) [ 19/Jul/11 ] |
|
That usually indicates either a server down or network trouble. |