Core Server / SERVER-12137

Socket recv() timeout problem

    • Type: Task
    • Resolution: Done
    • Priority: Major - P3
    • Affects Version/s: 2.4.1
    • Component/s: Sharding
    • Labels: None
    • Environment: Red Hat Linux

      We set up a 30-node MongoDB system on our cluster, spread across two physical machines (node40 and node41) with 15 nodes on each. I tried to upload as much as 500 GB of data into this database; however, one node shut down with this error report:

      Sun Nov 24 14:58:15.910 [conn9] command admin.$cmd command: { writebacklisten: ObjectId('528f8628c868398bb45de20e') } ntoreturn:1 keyUpdates:0 reslen:262523 50686ms
      Sun Nov 24 14:58:15.973 [conn3642] waiting till out of critical section
      Sun Nov 24 14:58:25.540 [conn15] Socket recv() timeout 192.168.1.159:27020
      Sun Nov 24 14:58:25.540 [conn15] SocketException: remote: 192.168.1.159:27020 error: 9001 socket exception [3] server [192.168.1.159:27020]
      Sun Nov 24 14:58:25.540 [conn15] DBClientCursor::init call() failed
      Sun Nov 24 14:58:25.977 [conn3642] waiting till out of critical section
      Sun Nov 24 14:58:29.726 [conn15] scoped connection to node40.clus.cci.emory.edu:27020,node40.clus.cci.emory.edu:27021,node41.clus.cci.emory.edu:27020 not being returned to the pool
      Sun Nov 24 14:58:29.727 [conn15] warning: 13104 SyncClusterConnection::findOne prepare failed: 10276 DBClientBase::findN: transport error: node40.clus.cci.emory.edu:27020 ns: admin.$cmd query: { fsync: 1 } node40.clus.cci.emory.edu:27020:{}
      Sun Nov 24 14:58:29.727 [conn15] warning: moveChunk commit outcome ongoing: { applyOps: [ { op: "u", b: false, ns: "config.chunks", o: { _id: "dicomdb.fs.chunks-files_id_ObjectId('5292592d0cf23a90f681c5e0')", lastmod: Timestamp 12000|0, lastmodEpoch: ObjectId('529258b4c868398bb45e3755'), ns: "dicomdb.fs.chunks", min: { files_id: ObjectId('5292592d0cf23a90f681c5e0') }, max: { files_id: ObjectId('529259620cf23a90f681d1ec') }, shard: "dicom2" }, o2: { _id: "dicomdb.fs.chunks-files_id_ObjectId('5292592d0cf23a90f681c5e0')" } }, { op: "u", b: false, ns: "config.chunks", o: { _id: "dicomdb.fs.chunks-files_id_ObjectId('529259620cf23a90f681d1ec')", lastmod: Timestamp 12000|1, lastmodEpoch: ObjectId('529258b4c868398bb45e3755'), ns: "dicomdb.fs.chunks", min: { files_id: ObjectId('529259620cf23a90f681d1ec') }, max: { files_id: ObjectId('5292597d0cf23a90f681de3f') }, shard: "dicom1" }, o2: { _id: "dicomdb.fs.chunks-files_id_ObjectId('529259620cf23a90f681d1ec')" } } ], preCondition: [ { ns: "config.chunks", q: { query: { ns: "dicomdb.fs.chunks" }, orderby: { lastmod: -1 } }, res: { lastmod: Timestamp 11000|3 } } ] } for command :{ $err: "SyncClusterConnection::findOne prepare failed: 10276 DBClientBase::findN: transport error: node40.clus.cci.emory.edu:27020 ns: admin.$cmd query: { fsy...", code: 13104 }
      Sun Nov 24 14:58:35.980 [conn3642] waiting till out of critical section
      Sun Nov 24 14:58:38.672 [DataFileSync] flushing mmaps took 48713ms for 11 files
      Sun Nov 24 14:58:39.729 [conn15] SyncClusterConnection connecting to [node40.clus.cci.emory.edu:27020]
      Sun Nov 24 14:58:39.730 [conn15] SyncClusterConnection connecting to [node40.clus.cci.emory.edu:27021]
      Sun Nov 24 14:58:39.730 [conn15] SyncClusterConnection connecting to [node41.clus.cci.emory.edu:27020]
      Sun Nov 24 14:58:39.731 [conn15] ERROR: moveChunk commit failed: version is at11|3||000000000000000000000000 instead of 12|1||529258b4c868398bb45e3755
      Sun Nov 24 14:58:39.731 [conn15] ERROR: TERMINATING
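
      The step that fails in this log is the shard flushing { fsync: 1 } to the three config servers through SyncClusterConnection, and the [DataFileSync] line shows a data-file flush taking 48713ms on this host. A minimal mongo-shell sketch (not from the original report; the three config server addresses are taken from the log above) that times the same command directly against each config server:

      // Hedged diagnostic sketch: time { fsync: 1 } against each config server
      // listed in the SyncClusterConnection log lines above.
      var configServers = [
          "node40.clus.cci.emory.edu:27020",
          "node40.clus.cci.emory.edu:27021",
          "node41.clus.cci.emory.edu:27020"
      ];
      configServers.forEach(function (host) {
          var conn = new Mongo(host);               // direct connection, not through mongos
          var admin = conn.getDB("admin");
          var start = new Date();
          var res = admin.runCommand({ fsync: 1 }); // the same command that timed out
          print(host + ": fsync ok=" + res.ok + ", took " + (new Date() - start) + " ms");
      });

      If one of those flushes comes anywhere near the 48-second figure reported by [DataFileSync], the recv() timeout would point at disk I/O pressure on that machine rather than at the network.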

      At the beginning, all the data is inserted into the primary node, and then the balancer tries to move chunks to the other nodes. It can move chunks for a short period, around 10 minutes, and then it always raises this error.
      I thought it was a network issue, but the nodes can ping each other. Do you know how I can solve this problem, or how I can find out what is causing it?
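
      One way to see whether the same migration keeps failing, sketched below for the mongo shell connected to a mongos (the collection name dicomdb.fs.chunks comes from the log; the rest is standard config metadata), is to check the balancer state, the recent moveChunk history, and how the chunks are currently distributed:

      // Hedged sketch: inspect balancer state and recent chunk-migration history.
      var config = db.getSiblingDB("config");
      print("balancer enabled: " + sh.getBalancerState());
      // most recent moveChunk events recorded by the cluster
      config.changelog.find({ what: /moveChunk/ }).sort({ time: -1 }).limit(5).forEach(printjson);
      // current chunk counts per shard for the collection being migrated
      config.chunks.distinct("shard", { ns: "dicomdb.fs.chunks" }).forEach(function (s) {
          print(s + ": " + config.chunks.count({ ns: "dicomdb.fs.chunks", shard: s }) + " chunks");
      });

      If the failures only happen while the balancer runs during the bulk load, temporarily disabling it with sh.setBalancerState(false) until the load finishes is one way to narrow the problem down.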

      I have also attached the log file; you can locate the error section by searching for “failed”.

            Assignee: Unassigned
            Reporter: dejun teng (terrytdj)
            Votes: 0
            Watchers: 3
