[SERVER-2492] Assertion: 13633:error querying server Created: 08/Feb/11  Updated: 17/Mar/11  Resolved: 08/Feb/11

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 1.7.5
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: David Mytton Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File mongos.tar.gz    
Operating System: Linux
Participants:

 Description   

On one of our test servers, I am seeing exceptions when trying to query mongo. The server was idle for over 12 hours before I attempted these queries.

The PHP driver is reporting these exceptions:

1) DBClientBase::findOne: transport error: rs1a:27018 query:

{ setShardVersion: "sd.sessions", configdb: "config1:27019", version: Timestamp 4000|1, serverID: ObjectId('4d4d074513e250e07d6e3a4c'), shard: "shard1", shardHost: "set1/rs1a:27018,rs1b:27018" }

2) error querying server: set1/rs1a:27018,rs1b:27018

Set 1 has changed since the last access with a new member being added for debugging CS-303.

mongos log shows:

Tue Feb 8 12:14:50 checking replica set: set3
Tue Feb 8 12:15:10 checking replica set: set1
Tue Feb 8 12:15:10 checking replica set: set2
Tue Feb 8 12:15:10 checking replica set: set3
Tue Feb 8 12:15:30 checking replica set: set1
Tue Feb 8 12:15:30 checking replica set: set2
Tue Feb 8 12:15:30 checking replica set: set3
Tue Feb 8 12:15:37 [conn238] creating WriteBackListener for: set1/rs1a:27018,rs1b:27018,rs1e:27018
Tue Feb 8 12:15:37 [conn238] MessagingPort recv() errno:104 Connection reset by peer 10.121.14.3:27018
Tue Feb 8 12:15:37 [conn238] SocketException: remote: 10.121.14.3:27018 error: 9001 socket exception [1]
Tue Feb 8 12:15:37 [conn238] DBClientCursor::init call() failed
Tue Feb 8 12:15:38 [conn238] update failed b/c of StaleConfigException, retrying left:4 ns: sd.sessions query:

{ accId: 1, sessId: "45df4rdeifaqq74ukrvgrm1np5" }

Tue Feb 8 12:15:45 [conn164] MessagingPort recv() errno:104 Connection reset by peer 10.121.14.3:27018
Tue Feb 8 12:15:45 [conn164] SocketException: remote: 10.121.14.3:27018 error: 9001 socket exception [1]
Tue Feb 8 12:15:45 [conn164] DBClientCursor::init call() failed
Tue Feb 8 12:15:45 [conn164] Assertion: 13633:error querying server: set1/rs1a:27018,rs1b:27018
0x521149 0x589818 0x58a324 0x66bb6a 0x635ef5 0x66213c 0x673674 0x57f090 0x69b480 0x302860673d 0x3027ed3f6d
/home/david/mongodb/latest/bin/mongos(_ZN5mongo11msgassertedEiPKc+0x129) [0x521149]
/home/david/mongodb/latest/bin/mongos(_ZN5mongo15ClusteredCursor5queryERKSsiNS_7BSONObjEi+0xcb8) [0x589818]
/home/david/mongodb/latest/bin/mongos(_ZN5mongo27SerialServerClusteredCursor4moreEv+0x134) [0x58a324]
/home/david/mongodb/latest/bin/mongos(_ZN5mongo19ShardedClientCursor13sendNextBatchERNS_7RequestEi+0x8a) [0x66bb6a]
/home/david/mongodb/latest/bin/mongos(_ZN5mongo13ShardStrategy7queryOpERNS_7RequestE+0xe45) [0x635ef5]
/home/david/mongodb/latest/bin/mongos(_ZN5mongo7Request7processEi+0x29c) [0x66213c]
/home/david/mongodb/latest/bin/mongos(_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortE+0x144) [0x673674]
/home/david/mongodb/latest/bin/mongos(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x260) [0x57f090]
/home/david/mongodb/latest/bin/mongos(thread_proxy+0x80) [0x69b480]
/lib64/libpthread.so.0 [0x302860673d]
/lib64/libc.so.6(clone+0x6d) [0x3027ed3f6d]
Tue Feb 8 12:15:45 [conn164] AssertionException in process: error querying server: set1/rs1a:27018,rs1b:27018
Tue Feb 8 12:15:45 [conn164] update failed b/c of StaleConfigException, retrying left:4 ns: sd.sessions query:

{ accId: 1, sessId: "45df4rdeifaqq74ukrvgrm1np5" }

Tue Feb 8 12:15:48 [conn338] MessagingPort recv() errno:104 Connection reset by peer 10.121.14.3:27018
Tue Feb 8 12:15:48 [conn338] SocketException: remote: 10.121.14.3:27018 error: 9001 socket exception [1]
Tue Feb 8 12:15:48 [conn338] DBClientCursor::init call() failed
Tue Feb 8 12:15:48 [conn338] Assertion: 13633:error querying server: set1/rs1a:27018,rs1b:27018
0x521149 0x589818 0x58a324 0x66bb6a 0x635ef5 0x66213c 0x673674 0x57f090 0x69b480 0x302860673d 0x3027ed3f6d
/home/david/mongodb/latest/bin/mongos(_ZN5mongo11msgassertedEiPKc+0x129) [0x521149]
/home/david/mongodb/latest/bin/mongos(_ZN5mongo15ClusteredCursor5queryERKSsiNS_7BSONObjEi+0xcb8) [0x589818]
/home/david/mongodb/latest/bin/mongos(_ZN5mongo27SerialServerClusteredCursor4moreEv+0x134) [0x58a324]
/home/david/mongodb/latest/bin/mongos(_ZN5mongo19ShardedClientCursor13sendNextBatchERNS_7RequestEi+0x8a) [0x66bb6a]
/home/david/mongodb/latest/bin/mongos(_ZN5mongo13ShardStrategy7queryOpERNS_7RequestE+0xe45) [0x635ef5]
/home/david/mongodb/latest/bin/mongos(_ZN5mongo7Request7processEi+0x29c) [0x66213c]
/home/david/mongodb/latest/bin/mongos(_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortE+0x144) [0x673674]
/home/david/mongodb/latest/bin/mongos(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x260) [0x57f090]
/home/david/mongodb/latest/bin/mongos(thread_proxy+0x80) [0x69b480]
/lib64/libpthread.so.0 [0x302860673d]
/lib64/libc.so.6(clone+0x6d) [0x3027ed3f6d]
Tue Feb 8 12:15:48 [conn338] AssertionException in process: error querying server: set1/rs1a:27018,rs1b:27018
Tue Feb 8 12:15:48 [conn338] update failed b/c of StaleConfigException, retrying left:4 ns: sd.sessions query:

{ accId: 1, sessId: "45df4rdeifaqq74ukrvgrm1np5" }

Tue Feb 8 12:15:50 checking replica set: set1
Tue Feb 8 12:15:50 checking replica set: set2
Tue Feb 8 12:15:50 checking replica set: set3
Tue Feb 8 12:16:10 checking replica set: set1
Tue Feb 8 12:16:10 checking replica set: set2
Tue Feb 8 12:16:10 checking replica set: set3
Tue Feb 8 12:16:27 [conn302] MessagingPort recv() errno:104 Connection reset by peer 10.121.14.3:27018
Tue Feb 8 12:16:27 [conn302] SocketException: remote: 10.121.14.3:27018 error: 9001 socket exception [1]
Tue Feb 8 12:16:27 [conn302] DBClientCursor::init call() failed
Tue Feb 8 12:16:27 [conn302] update failed b/c of StaleConfigException, retrying left:4 ns: sd.sessions query:

{ accId: 1, sessId: "45df4rdeifaqq74ukrvgrm1np5" }

Tue Feb 8 12:16:30 checking replica set: set1
Tue Feb 8 12:16:30 checking replica set: set2
Tue Feb 8 12:16:30 checking replica set: set3



 Comments   
Comment by Eliot Horowitz (Inactive) [ 08/Feb/11 ]

This is to be expected then.

Default values for tcp keep alive kick in after 2 hours, so no router should be configured to close sockets for at least 2 hours.

You can change the linux keep alive setting to 30 minutes so the router can't kill the connection.

You can see what it is via:
cat /proc/sys/net/ipv4/tcp_keepalive_time

by default its 7200

I would recommend trying 1800.

Comment by David Mytton [ 08/Feb/11 ]

There is a firewall between mongos and mongod. That's configured to kill idle connections after 1 hour. This is a test web node so it was totally idle overnight, compared to our live web nodes which are always active.

Comment by Eliot Horowitz (Inactive) [ 08/Feb/11 ]

Is it possible a router in the middle killed the connection?

What's inbetween the mongos and mongod?

Comment by David Mytton [ 08/Feb/11 ]

Yes, we're completely on 1.7.5 now.

Comment by Eliot Horowitz (Inactive) [ 08/Feb/11 ]

Looks like some router or something killed a socket.
Are these all on 1.7.5?

Comment by David Mytton [ 08/Feb/11 ]

10.121.14.3 is rs1a which is the master. It hadn't been rebooted.

Comment by Eliot Horowitz (Inactive) [ 08/Feb/11 ]

Is 10.121.14.3 the master or a slave?
Had it been rebooted?

Comment by David Mytton [ 08/Feb/11 ]

Looks like it might be related to adding the new replica set member. On bouncing mongos it shows connecting to the new member:

Tue Feb 8 12:25:43 [Balancer] about to contact config servers and shards
Tue Feb 8 12:25:48 [Balancer] updated set (set1) to: set1/rs1a:27018,rs1e:27018
Tue Feb 8 12:25:48 [Balancer] updated set (set1) to: set1/rs1a:27018,rs1e:27018,rs1b:27018
Tue Feb 8 12:25:48 [Balancer] updated set (set2) to: set2/rs2b:27018,rs2a:27018
Tue Feb 8 12:25:48 [Balancer] updated set (set3) to: set3/rs3a:27018,rs3b:27018
Tue Feb 8 12:25:48 [Balancer] config servers and shards contacted successfully
Tue Feb 8 12:25:48 [Balancer] balancer id: test1:27017 started at Feb 8 12:25:48
Tue Feb 8 12:26:02 [conn1] creating WriteBackListener for: set1/rs1a:27018,rs1e:27018,rs1b:27018
Tue Feb 8 12:26:02 [conn1] creating WriteBackListener for: set2/rs2b:27018,rs2a:27018
Tue Feb 8 12:26:02 [conn1] creating WriteBackListener for: set3/rs3a:27018,rs3b:27018

Generated at Thu Feb 08 03:00:08 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.