[SERVER-16935] Socket Exception Created: 20/Jan/15  Updated: 14/Apr/16  Resolved: 02/Mar/15

Status: Closed
Project: Core Server
Component/s: Networking, Sharding
Affects Version/s: 2.4.4
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: guoliang Assignee: Ramon Fernandez Marina
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: Linux
Participants:

 Description   

There are usually SocketException happens, and when this error occurs, we may encount a poor performance of query and other options. The error log is as follow:

Mon Jan 19 09:33:29.699 [LockPinger] cluster 10.11.200.9:9090,10.11.200.7:9090,10.11.200.8:9090 pinged successfully at Mon Jan 19 09:33:29 2015 by distributed lock pinger '10.11.200.9:9090,10.11.200.7:9090,10.11.200.8:9090/localhost.localdomain:6080:1420004034:1804289383', sleeping for 30000ms
Mon Jan 19 09:33:30.754 [ReplicaSetMonitorWatcher] Socket recv() timeout  10.11.200.13:9009
Mon Jan 19 09:33:30.754 [ReplicaSetMonitorWatcher] SocketException: remote: 10.11.200.13:9009 error: 9001 socket exception [3] server [10.11.200.13:9009] 
Mon Jan 19 09:33:30.754 [ReplicaSetMonitorWatcher] DBClientCursor::init call() failed
Mon Jan 19 09:33:40.776 [ReplicaSetMonitorWatcher] trying reconnect to 10.11.200.13:9009
Mon Jan 19 09:33:40.777 [ReplicaSetMonitorWatcher] reconnect 10.11.200.13:9009 ok
Mon Jan 19 09:33:41.375 [conn100558] end connection 10.11.220.47:45888 (500 connections now open)
Mon Jan 19 09:33:45.776 [ReplicaSetMonitorWatcher] Socket recv() timeout  10.11.200.13:9009
Mon Jan 19 09:33:45.776 [ReplicaSetMonitorWatcher] SocketException: remote: 10.11.200.13:9009 error: 9001 socket exception [3] server [10.11.200.13:9009] 
Mon Jan 19 09:33:45.776 [ReplicaSetMonitorWatcher] DBClientCursor::init call() failed
Mon Jan 19 09:33:56.767 [WriteBackListener-10.11.200.11:9009] DBClientCursor::init call() failed
Mon Jan 19 09:33:56.767 [WriteBackListener-10.11.200.11:9009] Detected bad connection created at 1421607926129005 microSec, clearing pool for 10.11.200.11:9009
Mon Jan 19 09:33:56.767 [WriteBackListener-10.11.200.11:9009] WriteBackListener exception : DBClientBase::findN: transport error: 10.11.200.11:9009 ns: admin.$cmd query: { writebacklisten: ObjectId('54a38a7b9304f1764a938034') }
Mon Jan 19 09:34:00.394 [conn100563] got not master for: 10.11.200.11:9009
Mon Jan 19 09:34:00.798 [ReplicaSetMonitorWatcher] Socket recv() timeout  10.11.200.12:9009
Mon Jan 19 09:34:00.798 [ReplicaSetMonitorWatcher] SocketException: remote: 10.11.200.12:9009 error: 9001 socket exception [3] server [10.11.200.12:9009] 
Mon Jan 19 09:34:00.798 [ReplicaSetMonitorWatcher] DBClientCursor::init call() failed
Mon Jan 19 09:34:00.798 [ReplicaSetMonitorWatcher] Detected bad connection created at 1421596473823061 microSec, clearing pool for 10.11.200.12:9009
Mon Jan 19 09:34:00.799 [ReplicaSetMonitorWatcher] trying reconnect to 10.11.200.13:9009
Mon Jan 19 09:34:00.799 [ReplicaSetMonitorWatcher] reconnect 10.11.200.13:9009 ok
Mon Jan 19 09:34:02.254 [conn100457] DBClientCursor::init lazy say() failed
Mon Jan 19 09:34:02.254 [conn100343] DBClientCursor::init lazy say() failed
Mon Jan 19 09:34:02.254 [conn86557] DBClientCursor::init call() failed
Mon Jan 19 09:34:02.254 [conn100571] DBClientCursor::init lazy say() failed
Mon Jan 19 09:34:02.254 [conn100569] DBClientCursor::init lazy say() failed
Mon Jan 19 09:34:02.254 [conn100343] DBClientCursor::init message from say() was empty
Mon Jan 19 09:34:02.255 [conn100569] DBClientCursor::init message from say() was empty
Mon Jan 19 09:34:02.254 [conn100457] DBClientCursor::init message from say() was empty
Mon Jan 19 09:34:02.254 [ReplicaSetMonitorWatcher] DBClientCursor::init call() failed
Mon Jan 19 09:34:02.255 [conn100568] Detected bad connection created at 1421607555118835 microSec, clearing pool for 10.11.200.13:9009
Mon Jan 19 09:34:02.255 [conn100569] got not master for: 10.11.200.13:9009
Mon Jan 19 09:34:02.255 [conn86557] warning: could not autosplit collection single_msg.single_msg :: caused by :: 10276 DBClientBase::findN: transport error: 10.11.200.13:9009 ns: admin.$cmd query: { splitVector: "single_msg.single_msg", keyPattern: { _id: 1.0 }, min: { _id: 338721191280379031 }, max: { _id: 338721263261672676 }, maxChunkSizeBytes: 67108864, maxSplitPoints: 2, maxChunkObjects: 250000 }
Mon Jan 19 09:34:02.255 [conn100342] DBClientCursor::init lazy say() failed
Mon Jan 19 09:34:02.255 [conn100342] DBClientCursor::init message from say() was empty
Mon Jan 19 09:34:02.255 [conn100342] got not master for: 10.11.200.13:9009
Mon Jan 19 09:34:02.255 [conn100569] warning: exception when receiving data from set9:set9/10.11.200.11:9009,10.11.200.12:9009,10.11.200.13:9009, current connection state is { state: { conn: "set9/10.11.200.11:9009,10.11.200.12:9009,10.11.200.13:9009", vinfo: "mass_msg.mass_msg @ 50|9619||5453417ef0be311c3f688143", cursor: "(empty)", count: 0, done: false }, retryNext: false, init: true, finish: true, errored: false } :: caused by :: 15988 error querying server
Mon Jan 19 09:34:02.255 [conn100457] got not master for: 10.11.200.13:9009
Mon Jan 19 09:34:02.255 [conn86541] Socket say send() errno:104 Connection reset by peer 10.11.200.13:9009
Mon Jan 19 09:34:02.255 [conn100342] warning: exception when receiving data from set9:set9/10.11.200.11:9009,10.11.200.12:9009,10.11.200.13:9009, current connection state is { state: { conn: "set9/10.11.200.11:9009,10.11.200.12:9009,10.11.200.13:9009", vinfo: "single_msg.single_msg @ 51|9499||545742f8383232076b9e2582", cursor: "(empty)", count: 0, done: false }, retryNext: false, init: true, finish: true, errored: false } :: caused by :: 15988 error querying server
Mon Jan 19 09:34:02.255 [conn100571] DBClientCursor::init message from say() was empty
Mon Jan 19 09:34:02.255 [conn100343] got not master for: 10.11.200.13:9009
Mon Jan 19 09:34:02.255 [WriteBackListener-10.11.200.13:9009] DBClientCursor::init call() failed
Mon Jan 19 09:34:02.255 [conn100563] Socket recv() errno:104 Connection reset by peer 10.11.200.11:9009
Mon Jan 19 09:34:02.256 [conn86541] warning: exception during insert :: caused by :: 16460 error inserting 1 documents to shard set9:set9/10.11.200.11:9009,10.11.200.12:9009,10.11.200.13:9009 at version 50|9619||5453417ef0be311c3f688143 :: caused by :: socket exception [SEND_ERROR] for 10.11.200.13:9009
Mon Jan 19 09:34:02.256 [conn100563] SocketException: remote: 10.11.200.11:9009 error: 9001 socket exception [1] server [10.11.200.11:9009] 
Mon Jan 19 09:34:02.256 [conn100343] warning: exception when receiving data from set9:set9/10.11.200.11:9009,10.11.200.12:9009,10.11.200.13:9009, current connection state is { state: { conn: "set9/10.11.200.11:9009,10.11.200.12:9009,10.11.200.13:9009", vinfo: "mass_msg.mass_msg @ 50|9619||5453417ef0be311c3f688143", cursor: "(empty)", count: 0, done: false }, retryNext: false, init: true, finish: true, errored: false } :: caused by :: 15988 error querying server
Mon Jan 19 09:34:02.255 [conn100457] warning: exception when receiving data from set9:set9/10.11.200.11:9009,10.11.200.12:9009,10.11.200.13:9009, current connection state is { state: { conn: "set9/10.11.200.11:9009,10.11.200.12:9009,10.11.200.13:9009", vinfo: "media_doc.media_doc @ 51|367||545341bbf0be311c3f688144", cursor: "(empty)", count: 0, done: false }, retryNext: false, init: true, finish: true, errored: false } :: caused by :: 15988 error querying server
Mon Jan 19 09:34:02.256 [WriteBackListener-10.11.200.13:9009] WriteBackListener exception : DBClientBase::findN: transport error: 10.11.200.13:9009 ns: admin.$cmd query: { writebacklisten: ObjectId('54a38a7b9304f1764a938034') }
Mon Jan 19 09:34:02.256 [conn100563] DBClientCursor::init call() failed
Mon Jan 19 09:34:02.255 [conn100571] got not master for: 10.11.200.13:9009
Mon Jan 19 09:34:02.256 [conn100571] warning: exception when receiving data from set9:set9/10.11.200.11:9009,10.11.200.12:9009,10.11.200.13:9009, current connection state is { state: { conn: "set9/10.11.200.11:9009,10.11.200.12:9009,10.11.200.13:9009", vinfo: "mass_msg.mass_msg @ 50|9619||5453417ef0be311c3f688143", cursor: "(empty)", count: 0, done: false }, retryNext: false, init: true, finish: true, errored: false } :: caused by :: 15988 error querying server
Mon Jan 19 09:34:02.257 [conn100568] end connection 10.11.220.29:54661 (499 connections now open)
Mon Jan 19 09:34:02.264 [mongosMain] connection accepted from 10.11.220.31:57357 #100572 (500 connections now open)
Mon Jan 19 09:34:02.265 [conn100563] trying reconnect to 10.11.200.13:9009
Mon Jan 19 09:34:02.265 [conn100563] reconnect 10.11.200.13:9009 ok
Mon Jan 19 09:34:02.266 [conn100563] Socket recv() errno:104 Connection reset by peer 10.11.200.13:9009
Mon Jan 19 09:34:02.266 [conn100563] SocketException: remote: 10.11.200.13:9009 error: 9001 socket exception [1] server [10.11.200.13:9009] 
Mon Jan 19 09:34:02.266 [conn100563] DBClientCursor::init call() failed
Mon Jan 19 09:34:02.266 [conn100563] Detected bad connection created at 1421607794297937 microSec, clearing pool for 10.11.200.13:9009
Mon Jan 19 09:34:02.266 [conn100563] trying reconnect to 10.11.200.11:9009
Mon Jan 19 09:34:02.267 [conn100563] reconnect 10.11.200.11:9009 ok
Mon Jan 19 09:34:02.267 [conn100563] Socket recv() errno:104 Connection reset by peer 10.11.200.11:9009
Mon Jan 19 09:34:02.267 [conn100563] SocketException: remote: 10.11.200.11:9009 error: 9001 socket exception [1] server [10.11.200.11:9009] 
Mon Jan 19 09:34:02.268 [conn100563] DBClientCursor::init call() failed
Mon Jan 19 09:34:02.268 [conn100563] Detected bad connection created at 1421607925189948 microSec, clearing pool for 10.11.200.11:9009
Mon Jan 19 09:34:03.268 [conn100563] ReplicaSetMonitor::_checkConnection: 10.11.200.11:9009 { setName: "set9", ismaster: false, secondary: true, hosts: [ "10.11.200.11:9009", "10.11.200.13:9009", "10.11.200.12:9009" ], me: "10.11.200.11:9009", maxBsonObjectSize: 16777216, maxMessageSizeBytes: 48000000, localTime: new Date(1421631243269), ok: 1.0 }
Mon Jan 19 09:34:03.272 [mongosMain] connection accepted from 10.11.220.29:56287 #100573 (501 connections now open)
Mon Jan 19 09:34:08.268 [conn100563] Socket recv() timeout  10.11.200.12:9009
Mon Jan 19 09:34:08.268 [conn100563] SocketException: remote: 10.11.200.12:9009 error: 9001 socket exception [3] server [10.11.200.12:9009] 
Mon Jan 19 09:34:08.268 [conn100563] DBClientCursor::init call() failed
Mon Jan 19 09:34:08.268 [conn100563] ReplicaSetMonitor::_checkConnection: caught exception 10.11.200.12:9009 DBClientBase::findN: transport error: 10.11.200.12:9009 ns: admin.$cmd query: { ismaster: 1 }
Mon Jan 19 09:34:13.268 [conn100563] Socket recv() timeout  10.11.200.13:9009
Mon Jan 19 09:34:13.268 [conn100563] SocketException: remote: 10.11.200.13:9009 error: 9001 socket exception [3] server [10.11.200.13:9009] 
Mon Jan 19 09:34:13.268 [conn100563] DBClientCursor::init call() failed
Mon Jan 19 09:34:13.268 [conn100563] ReplicaSetMonitor::_checkConnection: caught exception 10.11.200.13:9009 DBClientBase::findN: transport error: 10.11.200.13:9009 ns: admin.$cmd query: { ismaster: 1 }
Mon Jan 19 09:34:13.269 [conn100573] Primary for replica set set9 changed to 10.11.200.11:9009
Mon Jan 19 09:34:13.270 [conn100572] ReplicaSetMonitor::_checkConnection: 10.11.200.11:9009 { setName: "set9", ismaster: true, secondary: false, hosts: [ "10.11.200.11:9009", "10.11.200.13:9009", "10.11.200.12:9009" ], primary: "10.11.200.11:9009", me: "10.11.200.11:9009", maxBsonObjectSize: 16777216, maxMessageSizeBytes: 48000000, localTime: new Date(1421631253270), ok: 1.0 }
Mon Jan 19 09:34:13.270 [conn86557] ReplicaSetMonitor::_checkConnection: 10.11.200.11:9009 { setName: "set9", ismaster: true, secondary: false, hosts: [ "10.11.200.11:9009", "10.11.200.13:9009", "10.11.200.12:9009" ], primary: "10.11.200.11:9009", me: "10.11.200.11:9009", maxBsonObjectSize: 16777216, maxMessageSizeBytes: 48000000, localTime: new Date(1421631253271), ok: 1.0 }
Mon Jan 19 09:34:13.271 [conn100457] ReplicaSetMonitor::_checkConnection: 10.11.200.11:9009 { setName: "set9", ismaster: true, secondary: false, hosts: [ "10.11.200.11:9009", "10.11.200.13:9009", "10.11.200.12:9009" ], primary: "10.11.200.11:9009", me: "10.11.200.11:9009", maxBsonObjectSize: 16777216, maxMessageSizeBytes: 48000000, localTime: new Date(1421631253272), ok: 1.0 }
Mon Jan 19 09:34:13.272 [conn100342] ReplicaSetMonitor::_checkConnection: 10.11.200.11:9009 { setName: "set9", ismaster: true, secondary: false, hosts: [ "10.11.200.11:9009", "10.11.200.13:9009", "10.11.200.12:9009" ], primary: "10.11.200.11:9009", me: "10.11.200.11:9009", maxBsonObjectSize: 16777216, maxMessageSizeBytes: 48000000, localTime: new Date(1421631253272), ok: 1.0 }
Mon Jan 19 09:34:13.273 [conn86541] ReplicaSetMonitor::_checkConnection: 10.11.200.11:9009 { setName: "set9", ismaster: true, secondary: false, hosts: [ "10.11.200.11:9009", "10.11.200.13:9009", "10.11.200.12:9009" ], primary: "10.11.200.11:9009", me: "10.11.200.11:9009", maxBsonObjectSize: 16777216, maxMessageSizeBytes: 48000000, localTime: new Date(1421631253273), ok: 1.0 }
Mon Jan 19 09:34:14.269 [conn100563] warning: No primary detected for set set9
Mon Jan 19 09:34:14.269 [conn100563] warning: db exception when initializing on set9:set9/10.11.200.11:9009,10.11.200.12:9009,10.11.200.13:9009, current connection state is { state: { conn: "set9/10.11.200.11:9009,10.11.200.12:9009,10.11.200.13:9009", vinfo: "media_doc.media_doc @ 51|367||545341bbf0be311c3f688144", cursor: "(none)", count: 0, done: false }, retryNext: false, init: false, finish: false, errored: false } :: caused by :: 10009 ReplicaSetMonitor no master found for set: set9
Mon Jan 19 09:34:14.279 [mongosMain] connection accepted from 10.11.220.24:52139 #100574 (502 connections now open)
Mon Jan 19 09:34:14.300 [conn100574] Primary for replica set set9 changed to 10.11.200.11:9009
Mon Jan 19 09:34:17.258 [ReplicaSetMonitorWatcher] Socket recv() timeout  10.11.200.15:9010
Mon Jan 19 09:34:17.258 [ReplicaSetMonitorWatcher] SocketException: remote: 10.11.200.15:9010 error: 9001 socket exception [3] server [10.11.200.15:9010] 
Mon Jan 19 09:34:17.258 [ReplicaSetMonitorWatcher] DBClientCursor::init call() failed
Mon Jan 19 09:34:17.956 [conn100574] end connection 10.11.220.24:52139 (501 connections now open)
Mon Jan 19 09:34:22.258 [ReplicaSetMonitorWatcher] Socket recv() timeout  10.11.200.16:9010
Mon Jan 19 09:34:22.258 [ReplicaSetMonitorWatcher] SocketException: remote: 10.11.200.16:9010 error: 9001 socket exception [3] server [10.11.200.16:9010] 
Mon Jan 19 09:34:22.258 [ReplicaSetMonitorWatcher] DBClientCursor::init call() failed
Mon Jan 19 09:34:22.276 [ReplicaSetMonitorWatcher] trying reconnect to 10.11.200.12:9009
Mon Jan 19 09:34:22.276 [ReplicaSetMonitorWatcher] reconnect 10.11.200.12:9009 ok
Mon Jan 19 09:34:22.277 [ReplicaSetMonitorWatcher] trying reconnect to 10.11.200.13:9009
Mon Jan 19 09:34:22.277 [ReplicaSetMonitorWatcher] reconnect 10.11.200.13:9009 ok
Mon Jan 19 09:34:37.279 [ReplicaSetMonitorWatcher] Socket recv() timeout  10.11.200.14:9010
Mon Jan 19 09:34:37.279 [ReplicaSetMonitorWatcher] SocketException: remote: 10.11.200.14:9010 error: 9001 socket exception [3] server [10.11.200.14:9010] 
Mon Jan 19 09:34:37.279 [ReplicaSetMonitorWatcher] DBClientCursor::init call() failed
Mon Jan 19 09:34:37.280 [ReplicaSetMonitorWatcher] trying reconnect to 10.11.200.14:9010
Mon Jan 19 09:34:37.280 [ReplicaSetMonitorWatcher] reconnect 10.11.200.14:9010 ok
Mon Jan 19 09:34:41.374 [mongosMain] connection accepted from 10.11.220.47:46474 #100575 (502 connections now open)
Mon Jan 19 09:34:42.280 [ReplicaSetMonitorWatcher] Socket recv() timeout  10.11.200.14:9010
Mon Jan 19 09:34:42.280 [ReplicaSetMonitorWatcher] SocketException: remote: 10.11.200.14:9010 error: 9001 socket exception [3] server [10.11.200.14:9010] 
Mon Jan 19 09:34:42.280 [ReplicaSetMonitorWatcher] DBClientCursor::init call() failed
Mon Jan 19 09:34:42.280 [ReplicaSetMonitorWatcher] trying reconnect to 10.11.200.15:9010
Mon Jan 19 09:34:42.281 [ReplicaSetMonitorWatcher] reconnect 10.11.200.15:9010 ok
Mon Jan 19 09:34:42.282 [ReplicaSetMonitorWatcher] trying reconnect to 10.11.200.14:9010
Mon Jan 19 09:34:42.282 [ReplicaSetMonitorWatcher] reconnect 10.11.200.14:9010 ok
Mon Jan 19 09:34:47.282 [ReplicaSetMonitorWatcher] Socket recv() timeout  10.11.200.14:9010
Mon Jan 19 09:34:47.282 [ReplicaSetMonitorWatcher] SocketException: remote: 10.11.200.14:9010 error: 9001 socket exception [3] server [10.11.200.14:9010] 
Mon Jan 19 09:34:47.283 [ReplicaSetMonitorWatcher] DBClientCursor::init call() failed
Mon Jan 19 09:34:47.283 [ReplicaSetMonitorWatcher] trying reconnect to 10.11.200.16:9010
Mon Jan 19 09:34:47.283 [ReplicaSetMonitorWatcher] reconnect 10.11.200.16:9010 ok
Mon Jan 19 09:34:47.284 [conn140] trying reconnect to 10.11.200.14:9010
Mon Jan 19 09:34:47.285 [conn140] reconnect 10.11.200.14:9010 ok
Mon Jan 19 09:34:47.824 [conn100563] DBClientCursor::init lazy say() failed
Mon Jan 19 09:34:47.824 [conn100563] DBClientCursor::init message from say() was empty
Mon Jan 19 09:34:47.824 [conn100563] got not master for: 10.11.200.14:9010
Mon Jan 19 09:34:47.824 [WriteBackListener-10.11.200.14:9010] DBClientCursor::init call() failed
Mon Jan 19 09:34:47.824 [conn100573] DBClientCursor::init lazy say() failed
Mon Jan 19 09:34:47.824 [conn100561] DBClientCursor::init lazy say() failed
Mon Jan 19 09:34:47.825 [conn100573] DBClientCursor::init message from say() was empty
Mon Jan 19 09:34:47.825 [conn100564] DBClientCursor::init lazy say() failed
Mon Jan 19 09:34:47.825 [conn100342] DBClientCursor::init lazy say() failed
Mon Jan 19 09:34:47.825 [conn100564] DBClientCursor::init message from say() was empty
Mon Jan 19 09:34:47.825 [conn100457] DBClientCursor::init lazy say() failed
Mon Jan 19 09:34:47.825 [conn100573] got not master for: 10.11.200.14:9010
Mon Jan 19 09:34:47.825 [conn100561] DBClientCursor::init message from say() was empty
Mon Jan 19 09:34:47.825 [conn140] DBClientCursor::init call() failed
Mon Jan 19 09:34:47.825 [conn100342] DBClientCursor::init message from say() was empty
Mon Jan 19 09:34:47.825 [conn100565] DBClientCursor::init lazy say() failed
Mon Jan 19 09:34:47.825 [conn100564] got not master for: 10.11.200.14:9010
Mon Jan 19 09:34:47.825 [conn100457] DBClientCursor::init message from say() was empty
Mon Jan 19 09:34:47.825 [conn100457] got not master for: 10.11.200.14:9010
Mon Jan 19 09:34:47.825 [conn100572] DBClientCursor::init lazy say() failed
Mon Jan 19 09:34:47.825 [WriteBackListener-10.11.200.14:9010] Detected bad connection created at 1421605308573619 microSec, clearing pool for 10.11.200.14:9010
Mon Jan 19 09:34:47.825 [conn100342] got not master for: 10.11.200.14:9010



 Comments   
Comment by Ramon Fernandez Marina [ 21/Jan/15 ]

What I can gather from the logs and your setup is that one of you shards may be having trouble, specifically set9:

Mon Jan 19 09:34:02.255 [conn100342] warning: exception when receiving data from set9:set9/10.11.200.11:9009,10.11.200.12:9009,10.11.200.13:9009, current connection state is { state: { conn: "set9/10.11.200.11:9009,10.11.200.12:9009,10.11.200.13:9009", vinfo: "single_msg.single_msg @ 51|9499||545742f8383232076b9e2582", cursor: "(empty)", count: 0, done: false }, retryNext: false, init: true, finish: true, errored: false } :: caused by :: 15988 error querying server

Please upload full logs for servers in set9 to see if a bug may be triggering this situation, as so far the behavior you're seeing from mongos is expected.

As for your question about deployment, as well as for other MongoDB-related support discussions, please post on the mongodb-user group or Stack Overflow with the mongodb tag – the SERVER project is for reporting bugs or feature suggestions for the MongoDB server.

Comment by guoliang [ 21/Jan/15 ]

If we have 14 machines, and we plan to upgrade mongodb version to latest version 2.6.7. Then we will deploy each shard with 2 replica set and one arbiter. Now our problem is which shard number is the best, 14? or 7? 14 means each machine has a primary ,a secondary and an arbiter. eg: machine 1 deploy primary set 1, secondary set 2 and arbiter set 3, etc.

Comment by guoliang [ 21/Jan/15 ]

Fernandez, thanks for your answer, we have 14 machines, the full deployment is simply like follows, set 1 means a replica set of shard 1, we have 10 shards, each shard has 3 replica sets, except shard 2 and shard 7, because of the machine with IP 10.11.200.6 is down. So we add one arbiter node to shard2 and shard7. Is this deployment is reasonable?
10.11.200.2 set1/6 mongos
10.11.200.3 set1/6
10.11.200.4 set1/6
10.11.200.5 set2/7/arb_7

10.11.200.7 set2/7/arb_2 config
10.11.200.8 set3/8 config
10.11.200.9 set3/8 config
10.11.200.10 set3/8
10.11.200.11 set4/9
10.11.200.12 set4/9
10.11.200.13 set4/9
10.11.200.14 set5/10
10.11.200.15 set5/10
10.11.200.16 set5/10

Comment by Ramon Fernandez Marina [ 20/Jan/15 ]

xgl001, can you please provide full details of your setup? What kind of deployment is this? What's running on which IP/machine? On which machine do you see the socket exception error? Can you please provide full logs for it?

Note also that there have been many improvements since version 2.4.4, so I'd recommend you upgrade at least to the latest version in the 2.4 series, which as of this writing is 2.4.12.

Generated at Thu Feb 08 03:42:46 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.