Core Server / SERVER-26878

Mongod servers become very slow; appears to be a replication thread issue

    • Type: Bug
    • Resolution: Incomplete
    • Priority: Major - P3
    • Fix Version/s: None
    • Affects Version/s: 3.0.12
    • Environment: mongo 3.0.12 replica set on AWS EC2 instances. OS: Ubuntu 14.04
    • Operating System: ALL
    • Steps to Reproduce: n/a; it happens after months of running mongod.

      Hi,

      We are using a mongod 3.0.12 replica-set cluster that had been working fine for months, but suddenly it became very slow.
      This lasts for roughly 30-35 minutes, after which it returns to normal on its own.

      When we checked the logs, we found that replication (oplog reader) threads had started at exactly the time the issue began, and once those threads were killed the issue resolved automatically.
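
      For reference, a minimal mongo shell sketch of how such long-running oplog reader operations can be located and terminated. The namespace filter and the 60-second threshold are assumptions for illustration, not values taken from this incident:

      // Sketch only: list operations reading local.oplog.rs that have been
      // running for more than 60 seconds, then kill each one by its opid.
      var ops = db.currentOp({ ns: "local.oplog.rs", secs_running: { $gte: 60 } }).inprog;
      ops.forEach(function (op) {
          printjson({ opid: op.opid, client: op.client, secs_running: op.secs_running });
          db.killOp(op.opid);
      });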

      Further, we have checked with AWS: there is no network issue on their end, and there was no slave lag during that time.

      mongod log lines showing the error:

      Primary:

      2016-10-27T21:24:25.567+0530 I COMMAND  [conn1271549] query local.oplog.rs query: { ts: { $gte: Timestamp 1477581440000|15 } } planSummary: COLLSCAN cursorid:15062548466 ntoreturn:0 ntoskip:0 nscanned:0 nscannedObjects:34674009 keyUpdates:0 writeConflicts:0 numYields:272282 nreturned:101 reslen:35279 locks:{ Global: { acquireCount: { r: 544566 } }, Database: { acquireCount: { r: 272283 } }, oplog: { acquireCount: { r: 272283 } } } 2168613ms
      2016-10-27T21:24:25.567+0530 I NETWORK  [conn1271549] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [10.14.2.48:47400] 
      2016-10-27T21:24:25.570+0530 I COMMAND  [conn1271553] query local.oplog.rs query: { ts: { $gte: Timestamp 1477581440000|15 } } planSummary: COLLSCAN cursorid:14898865683 ntoreturn:0 ntoskip:0 nscanned:0 nscannedObjects:34674214 keyUpdates:0 writeConflicts:0 numYields:272332 nreturned:101 reslen:35279 locks:{ Global: { acquireCount: { r: 544666 } }, Database: { acquireCount: { r: 272333 } }, oplog: { acquireCount: { r: 272333 } } } 2176869ms
      2016-10-27T21:24:25.570+0530 I NETWORK  [conn1271553] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [10.14.17.163:38129] 
      2016-10-27T21:24:25.580+0530 I COMMAND  [conn1271910] query local.oplog.rs query: { ts: { $gte: Timestamp 1477581557000|5 } } planSummary: COLLSCAN cursorid:15114361808 ntoreturn:0 ntoskip:0 nscanned:0 nscannedObjects:34676032 keyUpdates:0 writeConflicts:0 numYields:272242 nreturned:101 reslen:45350 locks:{ Global: { acquireCount: { r: 544486 } }, Database: { acquireCount: { r: 272243 } }, oplog: { acquireCount: { r: 272243 } } } 1988208ms
      2016-10-27T21:24:25.580+0530 I NETWORK  [conn1271910] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [10.14.2.48:47461] 
      2016-10-27T21:24:25.581+0530 I COMMAND  [conn1271921] query local.oplog.rs query: { ts: { $gte: Timestamp 1477581557000|5 } } planSummary: COLLSCAN cursorid:16726213770 ntoreturn:0 ntoskip:0 nscanned:0 nscannedObjects:34675967 keyUpdates:0 writeConflicts:0 numYields:272287 nreturned:101 reslen:45350 locks:{ Global: { acquireCount: { r: 544576 } }, Database: { acquireCount: { r: 272288 } }, oplog: { acquireCount: { r: 272288 } } } 1917728ms
      2016-10-27T21:24:25.581+0530 I NETWORK  [conn1271921] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [10.14.2.48:47485] 
      2016-10-27T21:24:25.583+0530 I COMMAND  [conn1271914] query local.oplog.rs query: { ts: { $gte: Timestamp 1477581643000|3 } } planSummary: COLLSCAN cursorid:12954518601 ntoreturn:0 ntoskip:0 nscanned:0 nscannedObjects:34677296 keyUpdates:0 writeConflicts:0 numYields:272297 nreturned:101 reslen:57683 locks:{ Global: { acquireCount: { r: 544596 } }, Database: { acquireCount: { r: 272298 } }, oplog: { acquireCount: { r: 272298 } } } 1949781ms
      2016-10-27T21:24:25.583+0530 I COMMAND  [conn1271919] query local.oplog.rs query: { ts: { $gte: Timestamp 1477581643000|3 } } planSummary: COLLSCAN cursorid:13575995616 ntoreturn:0 ntoskip:0 nscanned:0 nscannedObjects:34677259 keyUpdates:0 writeConflicts:0 numYields:272281 nreturned:101 reslen:57683 locks:{ Global: { acquireCount: { r: 544564 } }, Database: { acquireCount: { r: 272282 } }, oplog: { acquireCount: { r: 272282 } } } 1919829ms
      2016-10-27T21:24:25.583+0530 I NETWORK  [conn1271919] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [10.14.17.163:38206] 
      2016-10-27T21:24:25.583+0530 I NETWORK  [conn1271914] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [10.14.17.163:38196] 
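
      Each COMMAND line above is an oplog tailing query ({ ts: { $gte: ... } }) that collection-scanned roughly 34.7 million oplog entries and ran for over 30 minutes (e.g. 2168613ms) before the client connection failed. A minimal mongo shell sketch, run on the primary, for checking how large the oplog is and how wide a window it covers (generic commands, no values from this cluster):

      // Prints the configured oplog size and the time span between the first
      // and last oplog entries (the replication window).
      rs.printReplicationInfo()

      // Number of documents currently in the oplog; a COLLSCAN like the ones
      // logged above may have to walk a large fraction of these.
      db.getSiblingDB("local").oplog.rs.stats().count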
      

      Secondary:

      2016-10-27T21:23:00.896+0530 I NETWORK  [initandlisten] connection accepted from 10.14.4.176:57444 #130402 (208 connections now open)
      2016-10-27T21:23:12.314+0530 I NETWORK  [rsBackgroundSync] Socket recv() timeout  10.14.4.176:27012
      2016-10-27T21:23:12.314+0530 I NETWORK  [rsBackgroundSync] SocketException: remote: 10.14.4.176:27012 error: 9001 socket exception [RECV_TIMEOUT] server [10.14.4.176:27012] 
      2016-10-27T21:23:12.314+0530 E REPL     [rsBackgroundSync] sync producer problem: 10278 dbclient error communicating with server: 10.14.4.176:27012
      2016-10-27T21:23:12.314+0530 I -        [rsBackgroundSync] caught exception (socket exception [FAILED_STATE] for 10.14.4.176:27012 (10.14.4.176) failed) in destructor (kill)
      2016-10-27T21:23:12.314+0530 I REPL     [ReplicationExecutor] syncing from: 10.14.4.176:27012
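
      On the secondary side, the rsBackgroundSync thread hit a receive timeout against its sync source (10.14.4.176:27012) and then re-selected it. A minimal mongo shell sketch for checking replication lag and member state while this is happening (helper names as of MongoDB 3.0):

      // Per-secondary lag behind the primary (renamed to
      // rs.printSecondaryReplicationInfo() in later server versions).
      rs.printSlaveReplicationInfo()

      // Each member's state and last applied optime as seen by this node.
      rs.status().members.forEach(function (m) {
          print(m.name + "  " + m.stateStr + "  optime: " + m.optimeDate);
      });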
      

      Please let me know if you need anything else.

            Assignee: Kelsey Schubert (kelsey.schubert@mongodb.com)
            Reporter: Rakesh Kumar (rakesh.mib.j@gmail.com)
            Votes: 1
            Watchers: 5
