Type: Bug
Resolution: Done
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 2.0.5
Component/s: Replication
Labels: None
Assigned Teams: Replication
Operating System: ALL
Confidence Status: None
Work Order: 3
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

Our primary server had a hard drive failure that left read/write performance ~10x slower than normal. We normally see few page faults, but they increased as reads started to pile up. We wanted to fail over to a server with good hardware, but we couldn't: replication on both secondaries had fallen behind, and they were getting further behind. After digging through the logs we discovered that the oplog was being sent to the secondaries extremely slowly (~5 minutes for a single query). I then spoke to Scott Hernandez on IRC and he confirmed my suspicion: the parts of the oplog we needed had been paged out and were being read from disk. Because of the degraded hardware, these reads were incredibly slow.

We had to shut down our entire service so the server could dedicate its poorly performing disks to serving the oplog, which let us fail over to better hardware.

This isn't ideal.

In my opinion, reading the oplog should always get priority over other reads: if replication falls behind and you have to hit disk to get the oplog, replication will likely keep falling further and further behind. I'd much rather see other reads start to fail and know I can fail over (whilst keeping data).
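
For anyone trying to spot this situation early, here is a minimal sketch of measuring how far each secondary is behind the primary. It assumes a status document shaped like the output of the replSetGetStatus command (a `members` list whose entries carry `name`, `stateStr` and `optimeDate`). The function name `replication_lag` and the hostnames in the sample are made up for illustration, and the sample is hand-built rather than taken from our servers.

```python
from datetime import datetime, timedelta


def replication_lag(status):
    """Return {member name: lag behind the primary} for each secondary.

    `status` is assumed to look like replSetGetStatus output: a dict with a
    "members" list, each member having "name", "stateStr" and "optimeDate".
    """
    members = status["members"]
    # The primary's last applied oplog entry is the reference point.
    primary = next(m for m in members if m["stateStr"] == "PRIMARY")
    return {
        m["name"]: primary["optimeDate"] - m["optimeDate"]
        for m in members
        if m["stateStr"] == "SECONDARY"
    }


# Hand-built sample (hypothetical hosts and timestamps).
primary_time = datetime(2012, 6, 7, 7, 40, 0)
sample_status = {
    "members": [
        {"name": "db1.example:27017", "stateStr": "PRIMARY",
         "optimeDate": primary_time},
        {"name": "db2.example:27017", "stateStr": "SECONDARY",
         "optimeDate": primary_time - timedelta(minutes=5)},
        {"name": "db3.example:27017", "stateStr": "SECONDARY",
         "optimeDate": primary_time - timedelta(minutes=12)},
    ]
}

lags = replication_lag(sample_status)
for name, lag in sorted(lags.items()):
    print(name, "is", int(lag.total_seconds()), "seconds behind")
```

With a live replica set, the same dict could be obtained through a driver, for example `client.admin.command("replSetGetStatus")` in pymongo. Sampling it twice a few minutes apart shows whether the lag is shrinking or, as in our case, growing.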

Assignee: [DO NOT USE] Backlog - Replication Team
Reporter: Colin Howe
Participants: [DO NOT USE] Backlog - Replication Team, Colin Howe, Eric Milkie, Judah Schvimer
Votes: 0
Watchers: 8

Created: Jun 07 2012 07:40:59 AM UTC
Updated: Dec 06 2022 05:32:40 AM UTC
Resolved: Jan 03 2020 07:34:53 PM UTC
