Details
Question
Resolution: Incomplete
Major - P3
Description
ISSUE SUMMARY
The customer had a high load on the cluster and began experiencing replication lag.
Node 00-01 has been in a DOWN state for a day.
The initial issue was that the node was stuck in an infinite restart loop. The snapshot process started on 2021-06-11 at 07:22 and ran until 23:13, and the node encountered errors when starting up. The CoE noticed a "duplicate key" issue prior to node replacement.
At this point the CoE has restarted Node 01. However, replication lag continues to increase on the secondary member 00-01. The customer has reduced the workload, but the secondary is still not catching up. Although 15,360 IOPS are available and actual usage does not exceed 1,000 IOPS, there is high latency and queueing on the disk.
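As a rough sanity check on the figures above, the sketch below works out how long the snapshot ran and what fraction of the available IOPS was in use. It assumes the snapshot ended on the same day it started (the summary gives only one date) and treats 1,000 as an upper bound on observed IOPS. All variable and function names are illustrative and do not come from any tooling mentioned in this ticket.

```python
from datetime import datetime

# Timestamps from the issue summary. Assumption: the 23:13 end time
# falls on the same date as the 07:22 start (2021-06-11).
snapshot_start = datetime(2021, 6, 11, 7, 22)
snapshot_end = datetime(2021, 6, 11, 23, 13)

# IOPS figures from the issue summary.
available_iops = 15_360
observed_iops_upper_bound = 1_000  # "not exceeding 1000"


def iops_utilization(observed: int, available: int) -> float:
    """Return observed IOPS as a fraction of available IOPS."""
    return observed / available


# Break the snapshot duration into whole hours and minutes.
total_seconds = int((snapshot_end - snapshot_start).total_seconds())
hours, remainder = divmod(total_seconds, 3600)
minutes = remainder // 60

utilization = iops_utilization(observed_iops_upper_bound, available_iops)

print(f"Snapshot duration: {hours}h {minutes}m")
print(f"IOPS utilization: at most {utilization:.1%}")
```

Under these assumptions the snapshot ran for 15 hours 51 minutes, and the disk was using at most about 6.5% of its available IOPS while still showing high latency and queueing.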
USER IMPACT
The replication lag on the secondary is impacting production for the customer.