[SERVER-21806] Chained Replication problem when middle secondary under heavy read load Created: 09/Dec/15  Updated: 11/Jan/16  Resolved: 11/Jan/16

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.0.5
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: ma6174 Assignee: Kelsey Schubert
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Participants:

 Description   

we have a replset with three data node: A(master) B(slave) C(slave) in LAN, no network problem, the replset chain is: A -> B -> C. when B is under heavy read load, C sync oplog from B failed with error "socket exception RECV_TIMEOUT", but B sync from A is normal, so node B's data is update to date. after some time we find that C lag from other a lot. then we use "rs.syncFrom(A)" to let C sync oplog from A, after a few second C is update with master.

I expect that when C sync oplog from B failed, or failed some times, will switch to master node automaticly even if B is update to date. we will disable chained replication in replset config before this problem is solved.

some info:

1. db version: 3.0.5
2. chainingAllowed: true(default)
3. SlaveReplicationInfo

repl1:SECONDARY> db.printSlaveReplicationInfo()
source: NODE_B:2706
    syncedTo: Wed Dec 09 2015 10:43:22 GMT+0800 (CST)
    -1 secs (0 hrs) behind the primary
source: NODE_C:2706
    syncedTo: Wed Dec 09 2015 10:34:35 GMT+0800 (CST)
    526 secs (0.15 hrs) behind the primary

4. some sync log

2015-12-09T10:29:01.877+0800 I NETWORK  [rsBackgroundSync] Socket recv() timeout  NODE_B:2706
2015-12-09T10:29:01.905+0800 I NETWORK  [rsBackgroundSync] SocketException: remote: NODE_B:2706 error: 9001 socket exception [RECV_TIMEOUT] server [NODE_B:2706]
2015-12-09T10:29:01.946+0800 E REPL     [rsBackgroundSync] sync producer problem: 10278 dbclient error communicating with server: NODE_B:2706
2015-12-09T10:29:01.950+0800 I REPL     [ReplicationExecutor] syncing from: NODE_B:2706
2015-12-09T10:34:42.673+0800 I NETWORK  [rsBackgroundSync] Socket recv() timeout  NODE_B:2706
2015-12-09T10:34:42.696+0800 I NETWORK  [rsBackgroundSync] SocketException: remote: NODE_B:2706 error: 9001 socket exception [RECV_TIMEOUT] server [NODE_B:2706]
2015-12-09T10:34:42.717+0800 E REPL     [rsBackgroundSync] sync producer problem: 10278 dbclient error communicating with server: NODE_B:2706
2015-12-09T10:34:42.718+0800 I REPL     [ReplicationExecutor] syncing from: NODE_B:2706
2015-12-09T10:34:44.641+0800 I REPL     [rsBackgroundSync] replSet syncTail condition 1
2015-12-09T10:34:45.656+0800 I REPL     [ReplicationExecutor] syncing from: NODE_B:2706
2015-12-09T10:35:45.853+0800 I NETWORK  [rsBackgroundSync] Socket recv() timeout  NODE_B:2706
2015-12-09T10:35:45.853+0800 I NETWORK  [rsBackgroundSync] SocketException: remote: NODE_B:2706 error: 9001 socket exception [RECV_TIMEOUT] server [NODE_B:2706]
2015-12-09T10:35:45.853+0800 E REPL     [rsBackgroundSync] sync producer problem: 10278 dbclient error communicating with server: NODE_B:2706
2015-12-09T10:35:45.853+0800 I REPL     [ReplicationExecutor] syncing from: NODE_B:2706
2015-12-09T10:36:16.721+0800 I NETWORK  [rsBackgroundSync] Socket recv() timeout  NODE_B:2706
2015-12-09T10:36:16.721+0800 I NETWORK  [rsBackgroundSync] SocketException: remote: NODE_B:2706 error: 9001 socket exception [RECV_TIMEOUT] server [NODE_B:2706]
2015-12-09T10:36:16.721+0800 E REPL     [rsBackgroundSync] sync producer problem: 10278 dbclient error communicating with server: NODE_B:2706
2015-12-09T10:36:16.721+0800 I REPL     [ReplicationExecutor] syncing from: NODE_B:2706
2015-12-09T10:37:16.933+0800 I NETWORK  [rsBackgroundSync] Socket recv() timeout  NODE_B:2706
2015-12-09T10:37:16.933+0800 I NETWORK  [rsBackgroundSync] SocketException: remote: NODE_B:2706 error: 9001 socket exception [RECV_TIMEOUT] server [NODE_B:2706]
2015-12-09T10:37:16.933+0800 E REPL     [rsBackgroundSync] sync producer problem: 10278 dbclient error communicating with server: NODE_B:2706
2015-12-09T10:37:16.933+0800 I REPL     [ReplicationExecutor] syncing from: NODE_B:2706
2015-12-09T10:37:47.189+0800 I NETWORK  [rsBackgroundSync] Socket recv() timeout  NODE_B:2706
2015-12-09T10:37:47.189+0800 I NETWORK  [rsBackgroundSync] SocketException: remote: NODE_B:2706 error: 9001 socket exception [RECV_TIMEOUT] server [NODE_B:2706]
2015-12-09T10:37:47.189+0800 E REPL     [rsBackgroundSync] sync producer problem: 10278 dbclient error communicating with server: NODE_B:2706
2015-12-09T10:37:47.189+0800 I REPL     [ReplicationExecutor] syncing from: NODE_B:2706
2015-12-09T10:37:49.456+0800 I REPL     [rsBackgroundSync] replSet syncTail condition 1
2015-12-09T10:37:50.457+0800 I REPL     [ReplicationExecutor] syncing from: NODE_B:2706
2015-12-09T10:43:44.201+0800 I NETWORK  [rsBackgroundSync] Socket recv() timeout  NODE_B:2706
2015-12-09T10:43:44.201+0800 I NETWORK  [rsBackgroundSync] SocketException: remote: NODE_B:2706 error: 9001 socket exception [RECV_TIMEOUT] server [NODE_B:2706]
2015-12-09T10:43:44.201+0800 E REPL     [rsBackgroundSync] sync producer problem: 10278 dbclient error communicating with server: NODE_B:2706
2015-12-09T10:43:44.201+0800 I REPL     [ReplicationExecutor] syncing from: NODE_B:2706
2015-12-09T10:44:07.570+0800 I REPL     [ReplicationExecutor] syncing from: NODE_A:2706 by request
2015-12-09T10:44:07.580+0800 I REPL     [SyncSourceFeedback] replset setting syncSourceFeedback to NODE_A:2706



 Comments   
Comment by Kelsey Schubert [ 11/Jan/16 ]

Hi ma6174,

Socket exception RECV_TIMEOUT likely indicates inadequate system resources for your specified configuration settings.

A replica set member will choose a new sync source if any of the following happen:

  • An error occurs on establishing a connection to a member that is the sync source, like it goes down.
  • The current sync source is more than 30 seconds behind another member.

In your replica set, C is consistently able to open new connections to B, and B does not lag significantly behind A. Therefore, C continues to sync from B.

For MongoDB-related support discussion please post on the mongodb-users group or Stack Overflow with the mongodb tag.

Questions about improving how your system handles heavy load would be best posted on the mongodb-users group.

Kind regards,
Thomas

Comment by Kelsey Schubert [ 23/Dec/15 ]

Hi ma6174,

Sorry for delay publicly responding.

So we can continue to investigate, please upload the following:

  1. The logs starting before you observe the issue for all three nodes?
  2. The output of rs.status() on NODE_C when you first observe this issue?

Thank you,
Thomas

Generated at Thu Feb 08 03:58:29 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.