[SERVER-12137] Socket recv() timeout problem Created: 17/Dec/13  Updated: 11/Jul/16  Resolved: 20/Dec/13

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.4.1
Fix Version/s: None

Type: Task
Priority: Major - P3
Reporter: dejun teng
Assignee: Unassigned
Resolution: Done
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Red Hat Linux

Attachments: Text File dicom1.log    
Issue Links:
Related
is related to SERVER-10458 Sanity check on "from" side that all ... Closed
Participants:

 Description   

We set up a 30-node MongoDB system on our cluster across two physical nodes, node40 and node41, with 15 instances on each. I tried to upload as much as 500 GB of data into this database. However, one node shut down with this error report:

Sun Nov 24 14:58:15.910 [conn9] command admin.$cmd command: { writebacklisten: ObjectId('528f8628c868398bb45de20e') } ntoreturn:1 keyUpdates:0 reslen:262523 50686ms
Sun Nov 24 14:58:15.973 [conn3642] waiting till out of critical section
Sun Nov 24 14:58:25.540 [conn15] Socket recv() timeout 192.168.1.159:27020
Sun Nov 24 14:58:25.540 [conn15] SocketException: remote: 192.168.1.159:27020 error: 9001 socket exception [3] server [192.168.1.159:27020]
Sun Nov 24 14:58:25.540 [conn15] DBClientCursor::init call() failed
Sun Nov 24 14:58:25.977 [conn3642] waiting till out of critical section
Sun Nov 24 14:58:29.726 [conn15] scoped connection to node40.clus.cci.emory.edu:27020,node40.clus.cci.emory.edu:27021,node41.clus.cci.emory.edu:27020 not being returned to the pool
Sun Nov 24 14:58:29.727 [conn15] warning: 13104 SyncClusterConnection::findOne prepare failed: 10276 DBClientBase::findN: transport error: node40.clus.cci.emory.edu:27020 ns: admin.$cmd query: { fsync: 1 } node40.clus.cci.emory.edu:27020:{}
Sun Nov 24 14:58:29.727 [conn15] warning: moveChunk commit outcome ongoing: { applyOps: [ { op: "u", b: false, ns: "config.chunks", o: { _id: "dicomdb.fs.chunks-files_id_ObjectId('5292592d0cf23a90f681c5e0')", lastmod: Timestamp 12000|0, lastmodEpoch: ObjectId('529258b4c868398bb45e3755'), ns: "dicomdb.fs.chunks", min: { files_id: ObjectId('5292592d0cf23a90f681c5e0') }, max: { files_id: ObjectId('529259620cf23a90f681d1ec') }, shard: "dicom2" }, o2: { _id: "dicomdb.fs.chunks-files_id_ObjectId('5292592d0cf23a90f681c5e0')" } }, { op: "u", b: false, ns: "config.chunks", o: { _id: "dicomdb.fs.chunks-files_id_ObjectId('529259620cf23a90f681d1ec')", lastmod: Timestamp 12000|1, lastmodEpoch: ObjectId('529258b4c868398bb45e3755'), ns: "dicomdb.fs.chunks", min: { files_id: ObjectId('529259620cf23a90f681d1ec') }, max: { files_id: ObjectId('5292597d0cf23a90f681de3f') }, shard: "dicom1" }, o2: { _id: "dicomdb.fs.chunks-files_id_ObjectId('529259620cf23a90f681d1ec')" } } ], preCondition: [ { ns: "config.chunks", q: { query: { ns: "dicomdb.fs.chunks" }, orderby: { lastmod: -1 } }, res: { lastmod: Timestamp 11000|3 } } ] } for command :{ $err: "SyncClusterConnection::findOne prepare failed: 10276 DBClientBase::findN: transport error: node40.clus.cci.emory.edu:27020 ns: admin.$cmd query: { fsy...", code: 13104 }

Sun Nov 24 14:58:35.980 [conn3642] waiting till out of critical section
Sun Nov 24 14:58:38.672 [DataFileSync] flushing mmaps took 48713ms for 11 files
Sun Nov 24 14:58:39.729 [conn15] SyncClusterConnection connecting to [node40.clus.cci.emory.edu:27020]
Sun Nov 24 14:58:39.730 [conn15] SyncClusterConnection connecting to [node40.clus.cci.emory.edu:27021]
Sun Nov 24 14:58:39.730 [conn15] SyncClusterConnection connecting to [node41.clus.cci.emory.edu:27020]
Sun Nov 24 14:58:39.731 [conn15] ERROR: moveChunk commit failed: version is at11|3||000000000000000000000000 instead of 12|1||529258b4c868398bb45e3755
Sun Nov 24 14:58:39.731 [conn15] ERROR: TERMINATING
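
The final "moveChunk commit failed" line reports two chunk versions in what looks like a major|minor||epoch layout. The following is a minimal Python sketch for extracting both versions from such a line so they can be compared. The names ChunkVersion and parse_version_mismatch are hypothetical helpers, not part of MongoDB, and the line format is inferred only from the log above.

```python
import re
from typing import NamedTuple, Optional, Tuple


class ChunkVersion(NamedTuple):
    """Hypothetical container for a version printed as major|minor||epoch."""
    major: int
    minor: int
    epoch: str


# The log above prints "at11|3||..." with no space after "at",
# so whitespace at that position is treated as optional.
_MISMATCH = re.compile(
    r"version is at\s*(\d+)\|(\d+)\|\|([0-9a-f]{24})"
    r"\s+instead of\s+(\d+)\|(\d+)\|\|([0-9a-f]{24})"
)


def parse_version_mismatch(
    line: str,
) -> Optional[Tuple[ChunkVersion, ChunkVersion]]:
    """Return (actual, expected) versions, or None if the line does not match."""
    m = _MISMATCH.search(line)
    if m is None:
        return None
    actual = ChunkVersion(int(m.group(1)), int(m.group(2)), m.group(3))
    expected = ChunkVersion(int(m.group(4)), int(m.group(5)), m.group(6))
    return actual, expected


line = (
    "Sun Nov 24 14:58:39.731 [conn15] ERROR: moveChunk commit failed: "
    "version is at11|3||000000000000000000000000 "
    "instead of 12|1||529258b4c868398bb45e3755"
)
actual, expected = parse_version_mismatch(line)
print("actual:  ", actual)
print("expected:", expected)
```

For the line in this report, the actual version 11|3 matches the `lastmod: Timestamp 11000|3` in the preCondition, while the expected version 12|1 matches the `lastmod: Timestamp 12000|1` in the applyOps document.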

At the beginning, all data is inserted into the primary node, and then the balancer tries to move chunks to the other nodes. It can move chunks for a short period, about 10 minutes, and then it always raises this error.
I thought it was a network issue, but the nodes can actually ping each other. Do you know how I can solve this problem, or how I can find out what causes it?

I have also attached the log file; you can locate the error by searching for “failed”.
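
Searching the attached log for the relevant lines can be scripted. The following is a minimal Python sketch, assuming the log is a plain text file with one entry per line. The function names and the default pattern list are illustrative choices; the patterns are taken from the excerpt in this report.

```python
from typing import Iterable, List, Sequence, Tuple

# Substrings taken from the log excerpt in this report; extend as needed.
DEFAULT_PATTERNS = ("failed", "Socket recv() timeout", "TERMINATING")


def find_error_lines(
    lines: Iterable[str],
    patterns: Sequence[str] = DEFAULT_PATTERNS,
) -> List[Tuple[int, str]]:
    """Return (1-based line number, text) for lines containing any pattern."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        if any(p in line for p in patterns):
            hits.append((lineno, line.rstrip("\n")))
    return hits


def scan_log(
    path: str,
    patterns: Sequence[str] = DEFAULT_PATTERNS,
) -> List[Tuple[int, str]]:
    """Scan a log file on disk; undecodable bytes are replaced, not fatal."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        return find_error_lines(fh, patterns)


# A few lines copied from the excerpt above, used as in-memory sample input.
sample = [
    "Sun Nov 24 14:58:15.973 [conn3642] waiting till out of critical section\n",
    "Sun Nov 24 14:58:25.540 [conn15] Socket recv() timeout 192.168.1.159:27020\n",
    "Sun Nov 24 14:58:25.540 [conn15] DBClientCursor::init call() failed\n",
    "Sun Nov 24 14:58:38.672 [DataFileSync] flushing mmaps took 48713ms for 11 files\n",
    "Sun Nov 24 14:58:39.731 [conn15] ERROR: TERMINATING\n",
]
hits = find_error_lines(sample)
for lineno, text in hits:
    print(f"{lineno}: {text}")
```

Against the attachment, this would be called as `scan_log("dicom1.log")`, assuming the file has been downloaded to the working directory.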



 Comments   
Comment by dejun teng [ 17/Dec/13 ]

Thank you! I never thought it was a software problem.
I upgraded to 2.4.8 and it appears to work properly. The earlier error has not occurred again so far.
Thank you!

Comment by Eliot Horowitz (Inactive) [ 17/Dec/13 ]

You are probably running into SERVER-10458.
If this is a new cluster, why are you running MongoDB 2.4.1?
It would be very good to upgrade to 2.4.8.
