[SERVER-9351] 3 node replica set fresh config - failure after initial mongoimport Created: 15/Apr/13  Updated: 16/Apr/13  Resolved: 16/Apr/13

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.4.1
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: David Sobon
Assignee: Unassigned
Resolution: Done
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu 12.04.2 LTS 3.2.0-40-virtual, 64-bit, hosted on AWS EC2


Operating System: ALL
Steps To Reproduce:

1) create and initiate clean 3-node replica set cluster.

  • place two nodes in one AZ (with sub 1msec latency and near 1gige networking)
  • place third node into another AWS AZ, 2msec away with 3 hops, via VPN.
4) mongoimport to {NODE 1}

5) wait for replication on {NODE 3} to fail.

Participants:

 Description   

After setting up replication as per architecture design pattern "Geographically Distributed Sets" (2 nodes in one AZ, 1 node in another AZ, via VPN, as per Amazon recommended design), performing a fresh import on NODE 1 (client) to NODE 1 (server) triggers replication issues.

NODE 1 - primary, AZ2 (availability zone)
NODE 2 - secondary, AZ2
NODE 3 - secondary, AZ1

PROBLEM
----------
replication "locks" up on {NODE 3} and does not recover, either by waiting or restarting mongodb server {NODE 3}.

mongo client on {NODE 3} responds very slowly (up to 30 seconds lag), even on enter with no command.

Error logs:
-------------
Mon Apr 15 08:02:55.026 [rsBackgroundSync] Socket recv() timeout {NODE 1}
Mon Apr 15 08:02:55.026 [rsBackgroundSync] SocketException: remote: {NODE 1} error: 9001 socket exception [3] server [{NODE 1}]
Mon Apr 15 08:02:55.026 [rsBackgroundSync] replSet db exception in producer: 10278 dbclient error communicating with server: {NODE 1}
Mon Apr 15 08:02:56.050 [rsSyncNotifier] Socket recv() timeout {NODE 1}
Mon Apr 15 08:02:56.050 [rsSyncNotifier] SocketException: remote: {NODE 1} error: 9001 socket exception [3] server [{NODE 1}]
Mon Apr 15 08:02:56.050 [rsSyncNotifier] DBClientCursor::init call() failed
Mon Apr 15 08:02:57.050 [rsSyncNotifier] replset tracking exception: exception: 9001 socket exception [FAILED_STATE] for {NODE 1}
Mon Apr 15 08:02:58.051 [rsSyncNotifier] replset setting oplog notifier to {NODE 1}

replication status
--------------------
{NODE 1} state - PRIMARY, optime - 1366013200
{NODE 2} state - SECONDARY, optime - 1366013200
{NODE 3} state - SECONDARY, optime - 1366012945
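
The optime values above are seconds since the Unix epoch, so their difference gives a rough measure of how far {NODE 3} had fallen behind the primary. A minimal sketch of that arithmetic (the variable names are illustrative and not part of the report):

```shell
# Optimes copied from the replication status above (seconds since the Unix epoch).
primary_optime=1366013200     # {NODE 1}
secondary_optime=1366012945   # {NODE 3}

# Replication lag of {NODE 3} behind the primary, in seconds.
lag_seconds=$((primary_optime - secondary_optime))
echo "lag: ${lag_seconds}s"   # prints "lag: 255s"
```

{NODE 2}, in the same AZ as the primary, shows no lag, which is consistent with the problem being specific to the cross-AZ link.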



 Comments   
Comment by David Sobon [ 16/Apr/13 ]

Please mark problem as INVALID.

The issue ended up being the cross-availability-zone VPN connection: the TCP connections did not have the TCP MSS set properly.

The solution was on both ends of the VPN link:

iptables -I FORWARD -p tcp --syn -s {saddr}/24 -d {daddr}/24 -j TCPMSS --set-mss 1356
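
The report does not state the MTU of the VPN tunnel. As a rough sketch, assuming IPv4 and TCP headers without options (20 bytes each), an MSS of 1356 corresponds to an effective path MTU of 1396 bytes. The helper name `mss_for_mtu` is hypothetical and only illustrates the arithmetic:

```shell
# MSS that fits in a given path MTU, assuming IPv4 and TCP headers without options.
# mss_for_mtu is a hypothetical helper name, not part of any tool.
mss_for_mtu() {
    local mtu=$1
    local ip_header=20    # IPv4 header without options
    local tcp_header=20   # TCP header without options
    echo $((mtu - ip_header - tcp_header))
}

mss_for_mtu 1500   # plain Ethernet: prints 1460
mss_for_mtu 1396   # prints 1356, the value set in the rule above
```

One way to check the assumed path MTU on Linux is a don't-fragment ping from iputils, for example `ping -M do -s 1368 -c 3 <host>`: 1368 bytes of payload plus 28 bytes of ICMP and IPv4 headers make a 1396-byte packet. If that size passes and a slightly larger one does not, the tunnel MTU is likely close to the assumed value.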
