Core Server / SERVER-30785

Slow secondary kills primary

    • Type: Bug
    • Resolution: Works as Designed
    • Priority: Major - P3
    • Affects Version/s: 3.4.3, 3.4.4
    • Component/s: Replication, WiredTiger
    • Labels: None

      I'm reproducing this using cgroups to throttle IO on the secondary.
      In my setup I had two replica sets, each with two members and an arbiter.
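      For reference, a minimal sketch of how one such replica set could be initiated from the mongo shell. This is not from the ticket; the replica set name, host names and ports are placeholders:

      # Hypothetical example: one replica set with two data-bearing members and an arbiter
      mongo --host host1:27017 --eval '
        rs.initiate({
          _id: "rs0",
          members: [
            {_id: 0, host: "host1:27017"},
            {_id: 1, host: "host2:27017"},
            {_id: 2, host: "host3:27017", arbiterOnly: true}
          ]
        })'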

      Create some heavy traffic with inserts so we can stress the IO.
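      As a sketch (not from the ticket; the target host, database/collection names, document size and loop bound are arbitrary), the insert load could be generated with something like:

      # Hypothetical load generator: padded inserts with writeConcern 1 to stress the IO
      mongo --host <primary> --eval '
        for (var i = 0; i < 100000000; i++) {
          db.getSiblingDB("test").stress.insert(
            {i: i, pad: "x".repeat(4096)},
            {writeConcern: {w: 1}});
        }'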

      On the secondaries, throttle the IO. I used the following (tune the values as needed for your setup):

      # Enable the blkio subsystem
      # (CONFIG_BLK_CGROUP / CONFIG_BLK_DEV_THROTTLING are kernel build options;
      #  exporting them here is a no-op unless the kernel already has them enabled)
      umount /sys/fs/cgroup/blkio
      export CONFIG_BLK_CGROUP=y
      export CONFIG_BLK_DEV_THROTTLING=y
      mount -t cgroup -o blkio none /sys/fs/cgroup/blkio
      cgcreate -g blkio:/iothrottle
      
      # Get the major and minor numbers of the block device with
      dmsetup ls
      
      # Configure the cgroup to throttle writes, reads and iops on the block devices.
      # Important: do it for both the journal and data partitions if you have them separated
      cgset -r blkio.throttle.write_bps_device="<major>:<minor> 65536" iothrottle # 64 KB/s
      cgset -r blkio.throttle.read_bps_device="<major>:<minor> 65536" iothrottle # 64 KB/s
      cgset -r blkio.throttle.write_iops_device="<major>:<minor> 2" iothrottle # 2 iops
      cgset -r blkio.throttle.read_iops_device="<major>:<minor> 2" iothrottle # 2 iops
      
      # Check the cgroup works by adding your pid (echo $$) to it
      echo <mypid> > /sys/fs/cgroup/blkio/iothrottle/tasks
      
      # Check you are in the cgroup
      cat /proc/self/cgroup | grep iothrottle
      
      # Test it (this might take a while, you can ctrl+c and see the write speed)
      dd if=/dev/urandom of=/data/testfile bs=4K count=1024 oflag=direct
      
      # Supposedly adding the mongod pid to the cgroup should be enough, but I've seen this not work most of the time, so to make it work, add pid 1 to the cgroup
      echo 1 > /sys/fs/cgroup/blkio/iothrottle/tasks
      
      # Restart the mongod process and add its pid to the cgroup
      echo <mongodpid> > /sys/fs/cgroup/blkio/iothrottle/tasks
      
      # Check that write bytes/s and read bytes/s stay low and don't go over the limits you set in the cgroup
      iotop -o
      
      

      If the IO throttling worked, you will see the secondary starting to lag.
      Then monitor the WiredTiger dirty cache on the primary; while the secondary is lagging, it grows. Wait until it reaches 20% of the total cache.
      The primary will die.
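      A minimal sketch for watching the lag and the dirty cache ratio together (the host is a placeholder; the serverStatus().wiredTiger.cache field names are the standard WiredTiger cache metrics):

      # Poll replication lag and the primary's dirty cache percentage every few seconds
      while true; do
        mongo --host <primary> --quiet --eval '
          rs.printSlaveReplicationInfo();
          var c = db.serverStatus().wiredTiger.cache;
          print("dirty cache: " +
                (100 * c["tracked dirty bytes in the cache"] /
                       c["maximum bytes configured"]).toFixed(1) + "%");'
        sleep 5
      done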


      A secondary that is slow due to IO issues/congestion will kill the primary, and the secondary won't transition to primary because of those same issues, so the whole cluster goes down.

      We have had IO issues on our servers that left the secondary unable to write in time, making it lag until it reached a point where it was so slow that you could not even get a mongo shell.
      When that happens, the primary still sees the secondary and sees that it is lagging; it still belongs to the replica set because it responds to heartbeats, but replication is not working. So far, so good.

      What happens then is that, since the secondary starts having issues, the WiredTiger dirty cache of the primary keeps increasing until it reaches 20% of the total cache (the default value, afaik); when it reaches 20%, the primary dies.
      It dies and there is not a single error in the log (checked with verbosity 2).
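      For what it's worth, the 20% figure corresponds to WiredTiger's default eviction_dirty_trigger. If someone wants to check whether the stall point follows that setting, it can be overridden at mongod startup; this is only an experiment sketch with arbitrary values, not a recommendation:

      # Hypothetical experiment: move the dirty-cache target/trigger and see if the stall point moves
      mongod --replSet rs0 --dbpath /data/db \
             --wiredTigerEngineConfigString "eviction_dirty_target=10,eviction_dirty_trigger=40"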

      This means the whole replica set goes down because the secondary cannot transition to primary due to its IO issues.

      This was tested on Ubuntu Trusty with MongoDB versions 3.4.4 and 3.4.3; it happens in both. Worth mentioning that the file system is ext4, which, we know, is not recommended.
      But the issue I want to point out here is that if, for whatever reason, the secondary has IO issues that make it get stuck, the primary dies once the WiredTiger dirty cache reaches 20%.

      Also important to mention: during my tests the balancer was stopped so that chunk migrations would not interfere, and all the inserts performed against the cluster used writeConcern 1, never majority.
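      For completeness, a sketch of keeping the balancer out of the way (standard sharding shell helpers, run against a mongos; the host is a placeholder):

      # Stop the balancer so chunk migrations don't interfere, then confirm it is off
      mongo --host <mongos> --eval '
        sh.stopBalancer();
        print("balancer running: " + sh.getBalancerState());'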

      In the attached screenshot you can see the light blue node reach 20% and, a few minutes later, drop to almost 0; this is when it died.

        1. cachedirtymongo.png (31 kB)
        2. dirtycachewt.png (44 kB)

            Assignee: mark.agarunov Mark Agarunov
            Reporter: victorgp VictorGP
            Votes: 0
            Watchers: 12
