[SERVER-57669] Replication is not catching up on one of the Secondary nodes. Created: 12/Jun/21  Updated: 12/Jun/21  Resolved: 12/Jun/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: Subha Arunachalam Assignee: Unassigned
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Participants:

 Description   

ISSUE SUMMARY
The customer had a high load on the cluster and started experiencing replication lag.
Node 00-01 has been in a DOWN state for a day.

The initial issue was that the node was in an infinite restart loop. The snapshot process started on 2021-06-11 at 07:22 and ran until 23:13, and the node encountered errors when starting up. The CoE noticed a "duplicate key" issue prior to node replacement.

At this point, the CoE has brought Node 01 back up. However, replication lag continues to increase on the secondary member 00-01. The customer has reduced the workload, but the secondary is still not catching up. Although 15,360 IOPS are available and actual usage does not exceed 1,000 IOPS, the disk shows high latency and queueing.

 

USER IMPACT
The replication lag on the Secondary is impacting production for the customer.



 Comments   
Comment by Prachi Shirodkar [ 12/Jun/21 ]

Wrong ticket type.

Generated at Thu Feb 08 05:42:28 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.