[SERVER-26878] Mongod servers become very slow, seems replication thread issue Created: 02/Nov/16  Updated: 06/Jan/17  Resolved: 06/Jan/17

Status: Closed
Project: Core Server
Component/s: Performance, Replication, WiredTiger
Affects Version/s: 3.0.12
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Rakesh Kumar Assignee: Kelsey Schubert
Resolution: Incomplete Votes: 1
Labels: query, replicaset
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

MongoDB 3.0.12 replica set on AWS EC2 instances. OS: Ubuntu 14.04


Operating System: ALL
Steps To Reproduce:

n/a. The issue only appears after mongod has been running for months.

Participants:

Description

Hi,

We are running a MongoDB 3.0.12 replica-set cluster that worked fine for months but has suddenly become very slow.
The slowdown lasts roughly 30-35 minutes, after which performance returns to normal on its own.

When we checked the logs, we found that replication (oplog) threads had started at exactly the time the issue began, and when those threads were killed the issue resolved automatically.
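
For reference, long-running operations like these can be located and terminated with the currentOp and killOp commands. The pymongo sketch below is for illustration only: the primary's address is taken from the secondary's log excerpt further down, the 600-second threshold is an assumption, and on the 3.0 series the usual interface is the equivalent mongo shell helpers db.currentOp() and db.killOp().

from pymongo import MongoClient

# Connect directly to the affected primary (address taken from the log excerpts below).
client = MongoClient("10.14.4.176", 27012)

# List in-progress operations and pick out long-running readers of the oplog.
ops = client.admin.command("currentOp")["inprog"]
oplog_readers = [op for op in ops
                 if op.get("ns") == "local.oplog.rs"
                 and op.get("secs_running", 0) > 600]   # assumed threshold

# Kill each one; the secondaries reconnect and resume tailing on their own.
for op in oplog_readers:
    client.admin.command("killOp", op=op["opid"])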

We have also checked with AWS: there is no network issue on their end, and there is no slave (replication) lag during that window.

mongod log entries showing the errors:

Primary:

2016-10-27T21:24:25.567+0530 I COMMAND  [conn1271549] query local.oplog.rs query: { ts: { $gte: Timestamp 1477581440000|15 } } planSummary: COLLSCAN cursorid:15062548466 ntoreturn:0 ntoskip:0 nscanned:0 nscannedObjects:34674009 keyUpdates:0 writeConflicts:0 numYields:272282 nreturned:101 reslen:35279 locks:{ Global: { acquireCount: { r: 544566 } }, Database: { acquireCount: { r: 272283 } }, oplog: { acquireCount: { r: 272283 } } } 2168613ms
2016-10-27T21:24:25.567+0530 I NETWORK  [conn1271549] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [10.14.2.48:47400] 
2016-10-27T21:24:25.570+0530 I COMMAND  [conn1271553] query local.oplog.rs query: { ts: { $gte: Timestamp 1477581440000|15 } } planSummary: COLLSCAN cursorid:14898865683 ntoreturn:0 ntoskip:0 nscanned:0 nscannedObjects:34674214 keyUpdates:0 writeConflicts:0 numYields:272332 nreturned:101 reslen:35279 locks:{ Global: { acquireCount: { r: 544666 } }, Database: { acquireCount: { r: 272333 } }, oplog: { acquireCount: { r: 272333 } } } 2176869ms
2016-10-27T21:24:25.570+0530 I NETWORK  [conn1271553] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [10.14.17.163:38129] 
2016-10-27T21:24:25.580+0530 I COMMAND  [conn1271910] query local.oplog.rs query: { ts: { $gte: Timestamp 1477581557000|5 } } planSummary: COLLSCAN cursorid:15114361808 ntoreturn:0 ntoskip:0 nscanned:0 nscannedObjects:34676032 keyUpdates:0 writeConflicts:0 numYields:272242 nreturned:101 reslen:45350 locks:{ Global: { acquireCount: { r: 544486 } }, Database: { acquireCount: { r: 272243 } }, oplog: { acquireCount: { r: 272243 } } } 1988208ms
2016-10-27T21:24:25.580+0530 I NETWORK  [conn1271910] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [10.14.2.48:47461] 
2016-10-27T21:24:25.581+0530 I COMMAND  [conn1271921] query local.oplog.rs query: { ts: { $gte: Timestamp 1477581557000|5 } } planSummary: COLLSCAN cursorid:16726213770 ntoreturn:0 ntoskip:0 nscanned:0 nscannedObjects:34675967 keyUpdates:0 writeConflicts:0 numYields:272287 nreturned:101 reslen:45350 locks:{ Global: { acquireCount: { r: 544576 } }, Database: { acquireCount: { r: 272288 } }, oplog: { acquireCount: { r: 272288 } } } 1917728ms
2016-10-27T21:24:25.581+0530 I NETWORK  [conn1271921] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [10.14.2.48:47485] 
2016-10-27T21:24:25.583+0530 I COMMAND  [conn1271914] query local.oplog.rs query: { ts: { $gte: Timestamp 1477581643000|3 } } planSummary: COLLSCAN cursorid:12954518601 ntoreturn:0 ntoskip:0 nscanned:0 nscannedObjects:34677296 keyUpdates:0 writeConflicts:0 numYields:272297 nreturned:101 reslen:57683 locks:{ Global: { acquireCount: { r: 544596 } }, Database: { acquireCount: { r: 272298 } }, oplog: { acquireCount: { r: 272298 } } } 1949781ms
2016-10-27T21:24:25.583+0530 I COMMAND  [conn1271919] query local.oplog.rs query: { ts: { $gte: Timestamp 1477581643000|3 } } planSummary: COLLSCAN cursorid:13575995616 ntoreturn:0 ntoskip:0 nscanned:0 nscannedObjects:34677259 keyUpdates:0 writeConflicts:0 numYields:272281 nreturned:101 reslen:57683 locks:{ Global: { acquireCount: { r: 544564 } }, Database: { acquireCount: { r: 272282 } }, oplog: { acquireCount: { r: 272282 } } } 1919829ms
2016-10-27T21:24:25.583+0530 I NETWORK  [conn1271919] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [10.14.17.163:38206] 
2016-10-27T21:24:25.583+0530 I NETWORK  [conn1271914] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [10.14.17.163:38196] 
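
These COMMAND entries are the oplog tailing queries issued by the secondaries: each one collection-scanned roughly 34.7 million oplog documents (planSummary: COLLSCAN) and ran for roughly 32-36 minutes (e.g. 2168613ms), which lines up with the reported 30-35 minute slowdown window. For illustration, here is a minimal pymongo sketch of the equivalent query; the address comes from the secondary's log below, and the logged value Timestamp 1477581440000|15 appears to correspond to Timestamp(1477581440, 15):

from pymongo import MongoClient, CursorType
from bson import Timestamp

client = MongoClient("10.14.4.176", 27012)          # primary address, taken from the secondary's log
oplog = client.local["oplog.rs"]

# Equivalent of the logged query: { ts: { $gte: Timestamp 1477581440000|15 } }
cursor = oplog.find(
    {"ts": {"$gte": Timestamp(1477581440, 15)}},
    cursor_type=CursorType.TAILABLE_AWAIT,          # tail the capped oplog collection
)

# Each batch returned here corresponds to one of the query/getmore entries logged above.
for entry in cursor:
    print(entry["ts"], entry["op"], entry.get("ns"))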

Secondary:

2016-10-27T21:23:00.896+0530 I NETWORK  [initandlisten] connection accepted from 10.14.4.176:57444 #130402 (208 connections now open)
2016-10-27T21:23:12.314+0530 I NETWORK  [rsBackgroundSync] Socket recv() timeout  10.14.4.176:27012
2016-10-27T21:23:12.314+0530 I NETWORK  [rsBackgroundSync] SocketException: remote: 10.14.4.176:27012 error: 9001 socket exception [RECV_TIMEOUT] server [10.14.4.176:27012] 
2016-10-27T21:23:12.314+0530 E REPL     [rsBackgroundSync] sync producer problem: 10278 dbclient error communicating with server: 10.14.4.176:27012
2016-10-27T21:23:12.314+0530 I -        [rsBackgroundSync] caught exception (socket exception [FAILED_STATE] for 10.14.4.176:27012 (10.14.4.176) failed) in destructor (kill)
2016-10-27T21:23:12.314+0530 I REPL     [ReplicationExecutor] syncing from: 10.14.4.176:27012

Please let me know if you need anything else.



Comments
Comment by Kelsey Schubert [ 06/Jan/17 ]

Hi rakesh.kumar,

We haven’t heard back from you for some time, so I’m going to mark this ticket as resolved. If this is still an issue for you, please provide additional information and we will reopen the ticket.

Regards,
Thomas

Comment by Kelsey Schubert [ 04/Dec/16 ]

Hi rakesh.mib.j@gmail.com,

We still need additional information to diagnose the problem. If this is still an issue for you, would you please answer my previous questions?

Thank you,
Thomas

Comment by Kelsey Schubert [ 18/Nov/16 ]

Hi rakesh.mib.j@gmail.com,

Sorry for the delay in responding. To help us continue to investigate this issue, would you please upload the surrounding logs for both the primary and secondary?

In addition, I have a few questions to better understand what is going on:

  1. Would you please clarify how the threads were killed?
  2. Would you clarify how the primary is affected during the slowdown?
  3. Would you clarify how the secondary is affected during the slowdown?
  4. How frequently have you experienced this issue? Does it correlate with any particular time or events?

Thank you,
Thomas
