  1. Core Server
  2. SERVER-6027

Replication suffers on a server with page faults and degraded hardware

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • None
    • Affects Version/s: 2.0.5
    • Component/s: Replication
    • None
    • Replication
    • ALL

      Our primary server had a hard drive failure that made read/write performance ~10x slower than normal. We normally have a low number of page faults, but this got worse as reads started to pile up. We wanted to fail over to a server with good hardware but couldn't, because replication on both secondaries had fallen behind and was getting further behind. After some digging through logs we discovered that the oplog was being sent to the secondaries extremely slowly (~5 minutes for a single query). I then spoke to Scott Hernandez on IRC and he confirmed my suspicion: the parts of the oplog we needed had been paged out and were being read from disk. Due to the degraded hardware these reads were incredibly slow.

      We had to shut down our entire service so the server could dedicate its poorly performing disks to serving the oplog, which let the secondaries catch up so we could fail over to better hardware.

      This isn't ideal.

      In my opinion, reading the oplog should always get priority over other reads. If replication falls behind and you have to hit disk to get the oplog, then replication will likely keep falling further and further behind. I'd much rather see other reads start to fail and know I can fail over (whilst keeping data).
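
      For anyone trying to confirm the same symptom (secondaries behind and getting further behind), here is a minimal sketch of the check. It assumes you already have two samples of the document returned by the replSetGetStatus command, taken some time apart, and that each member entry carries the name, stateStr and optimeDate fields. The helper names, hostnames and timestamps are made up for illustration and are not from our deployment.

```python
from datetime import datetime, timedelta


def replication_lag(status):
    """Map each secondary's name to how far its last applied op trails the primary's."""
    members = status["members"]
    primary = next(m for m in members if m["stateStr"] == "PRIMARY")
    return {
        m["name"]: primary["optimeDate"] - m["optimeDate"]
        for m in members
        if m["stateStr"] == "SECONDARY"
    }


def falling_further_behind(earlier, later):
    """True if every secondary seen in both samples has a larger lag in the later one."""
    before, after = replication_lag(earlier), replication_lag(later)
    common = [name for name in after if name in before]
    return bool(common) and all(after[name] > before[name] for name in common)


def make_sample(primary_time, secondary_times):
    """Build a status-shaped dict (hypothetical hosts) for trying the helpers out."""
    members = [{"name": "db1:27017", "stateStr": "PRIMARY", "optimeDate": primary_time}]
    for i, t in enumerate(secondary_times, start=2):
        members.append({"name": f"db{i}:27017", "stateStr": "SECONDARY", "optimeDate": t})
    return {"members": members}


# Two made-up samples taken ten minutes apart: the primary advances ten
# minutes, but the secondaries only manage to apply a minute or two of oplog.
t0 = datetime(2000, 1, 1, 12, 0, 0)
sample_a = make_sample(t0, [t0 - timedelta(minutes=5), t0 - timedelta(minutes=7)])
sample_b = make_sample(
    t0 + timedelta(minutes=10),
    [t0 - timedelta(minutes=4), t0 - timedelta(minutes=5)],
)

if falling_further_behind(sample_a, sample_b):
    print("secondaries are losing ground; failing over now would lose writes")
```

      This only compares timestamps, so it says nothing about why the secondaries are slow. In our case the cause was the oplog being read from a degraded disk on the primary.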

            Assignee:
            backlog-server-repl [DO NOT USE] Backlog - Replication Team
            Reporter:
            colinhowe Colin Howe
            Votes:
            0
            Watchers:
            8

              Created:
              Updated:
              Resolved: