[SERVER-2492] Assertion: 13633:error querying server Created: 08/Feb/11 Updated: 17/Mar/11 Resolved: 08/Feb/11 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 1.7.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | David Mytton | Assignee: | Unassigned |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Operating System: | Linux |
| Participants: |
| Description |
|
On one of our test servers, I am seeing exceptions when trying to query mongo. The server was idle for over 12 hours before I attempted these queries. The PHP driver is reporting these exceptions:

1) DBClientBase::findOne: transport error: rs1a:27018 query: { setShardVersion: "sd.sessions", configdb: "config1:27019", version: Timestamp 4000|1, serverID: ObjectId('4d4d074513e250e07d6e3a4c'), shard: "shard1", shardHost: "set1/rs1a:27018,rs1b:27018" }
2) error querying server: set1/rs1a:27018,rs1b:27018

Set 1 has changed since the last access with a new member being added for debugging CS-303. mongos log shows:

Tue Feb 8 12:14:50 checking replica set: set3
Tue Feb 8 12:15:45 [conn164] MessagingPort recv() errno:104 Connection reset by peer 10.121.14.3:27018
Tue Feb 8 12:15:48 [conn338] MessagingPort recv() errno:104 Connection reset by peer 10.121.14.3:27018
Tue Feb 8 12:15:50 checking replica set: set1
Tue Feb 8 12:16:30 checking replica set: set1 |
| Comments |
| Comment by Eliot Horowitz (Inactive) [ 08/Feb/11 ] |
|
This is to be expected, then. By default, TCP keepalive only kicks in after 2 hours, so no router should be configured to close idle sockets sooner than that. You can change the Linux keepalive setting to 30 minutes so the router can't kill the connection. You can see what it is via: by default it's 7200 (seconds); I would recommend trying 1800. |
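The reasoning in this comment comes down to comparing two numbers: the keepalive idle time and the idle timeout of whatever device sits between mongos and mongod. The following is only a minimal sketch of that comparison, not the command referred to above. It assumes a Linux host where the kernel exposes the system-wide value at `/proc/sys/net/ipv4/tcp_keepalive_time`; the function names and the 3600-second firewall timeout (the one-hour figure reported in this ticket) are illustrative.

```python
from pathlib import Path
from typing import Optional

# Standard Linux location of the system-wide TCP keepalive idle time, in seconds.
KEEPALIVE_PATH = Path("/proc/sys/net/ipv4/tcp_keepalive_time")


def read_keepalive_seconds(path: Path = KEEPALIVE_PATH) -> Optional[int]:
    """Return the configured keepalive idle time, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        # Not Linux, no permission, or unexpected file contents.
        return None


def idle_connection_at_risk(keepalive_s: int, idle_timeout_s: int) -> bool:
    """True if a middlebox can drop an idle socket before the first keepalive probe."""
    return keepalive_s >= idle_timeout_s


if __name__ == "__main__":
    firewall_idle_timeout_s = 3600  # hypothetical: the 1-hour firewall timeout from this ticket
    current = read_keepalive_seconds()
    if current is None:
        print("keepalive time not readable on this host")
    elif idle_connection_at_risk(current, firewall_idle_timeout_s):
        print(f"keepalive {current}s >= idle timeout {firewall_idle_timeout_s}s: idle sockets may be reset")
    else:
        print(f"keepalive {current}s < idle timeout {firewall_idle_timeout_s}s: probes should keep sockets alive")
```

With the default of 7200 seconds and a one-hour firewall timeout, the check reports the connection as at risk; with the suggested 1800 seconds it does not.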
| Comment by David Mytton [ 08/Feb/11 ] |
|
There is a firewall between mongos and mongod. That's configured to kill idle connections after 1 hour. This is a test web node so it was totally idle overnight, compared to our live web nodes which are always active. |
| Comment by Eliot Horowitz (Inactive) [ 08/Feb/11 ] |
|
Is it possible a router in the middle killed the connection? What's in between the mongos and mongod? |
| Comment by David Mytton [ 08/Feb/11 ] |
|
Yes, we're completely on 1.7.5 now. |
| Comment by Eliot Horowitz (Inactive) [ 08/Feb/11 ] |
|
Looks like some router or something killed a socket. |
| Comment by David Mytton [ 08/Feb/11 ] |
|
10.121.14.3 is rs1a which is the master. It hadn't been rebooted. |
| Comment by Eliot Horowitz (Inactive) [ 08/Feb/11 ] |
|
Is 10.121.14.3 the master or a slave? |
| Comment by David Mytton [ 08/Feb/11 ] |
|
Looks like it might be related to adding the new replica set member. On bouncing mongos it shows connecting to the new member:

Tue Feb 8 12:25:43 [Balancer] about to contact config servers and shards