Priority: Major - P3
Resolution: Works as Designed
Affects Version/s: 3.4.3, 3.4.4
Fix Version/s: None
Steps To Reproduce:
I'm reproducing this using cgroups to throttle IO on the secondary.
In my setup I had two replica sets, each with two members and an arbiter.
Generate some heavy insert traffic so we can stress the IO.
On the secondaries, throttle the IO. I used the following (tune the values as you wish for your setup):

```shell
# Enable the blkio subsystem
umount /sys/fs/cgroup/blkio
export CONFIG_BLK_CGROUP=y
export CONFIG_BLK_DEV_THROTTLING=y
mount -t cgroup -o blkio none /sys/fs/cgroup/blkio
cgcreate -g blkio:/iothrottle

# Get the major and minor numbers of the block device
dmsetup ls

# Configure the cgroup to throttle writes, reads and IOPS on the block devices.
# Important: do it on both the journal and data partitions if you have them separated
cgset -r blkio.throttle.write_bps_device="<major>:<minor> 65536" iothrottle  # 64 KiB/s
cgset -r blkio.throttle.read_bps_device="<major>:<minor> 65536" iothrottle   # 64 KiB/s
cgset -r blkio.throttle.write_iops_device="<major>:<minor> 2" iothrottle     # 2 IOPS
cgset -r blkio.throttle.read_iops_device="<major>:<minor> 2" iothrottle      # 2 IOPS

# Check the cgroup works by adding your shell's pid (echo $$) to it
echo <mypid> > /sys/fs/cgroup/blkio/iothrottle/tasks
# Check you are in the cgroup
cat /proc/self/cgroup | grep iothrottle
# Test it (this might take a while; you can Ctrl+C and see the write speed)
dd if=/dev/urandom of=/data/testfile bs=4K count=1024 oflag=direct

# Supposedly adding the mongod pid to the cgroup should be enough, but I've seen
# this not work most of the time, so to make it work, add pid 1 to the cgroup
echo 1 > /sys/fs/cgroup/blkio/iothrottle/tasks
# Restart the mongod process and add its pid to the cgroup
echo <mongodpid> > /sys/fs/cgroup/blkio/iothrottle/tasks

# Check that write bytes/s and read bytes/s are low and do not exceed the limits
# you set in the cgroup
iotop -o
```

If the IO throttling worked, you will see the secondary starting to lag.
Then monitor the WiredTiger dirty cache: while the secondary is lagging, it grows. Wait until it reaches 20% of the total cache. The primary will die.
A slow secondary with IO issues/congestion will kill the primary, and since the secondary cannot transition to primary because of those same issues, the whole cluster goes down.
We have had IO issues on our servers that left the secondary unable to write in time, making it lag until it got so slow that you could not even open a mongo shell.
When that happens, the primary sees the secondary lagging but still part of the replica set, because it responds to heartbeats even though replication is not making progress. So far, so good.
What happens next is that, once the secondary starts having issues, the WiredTiger dirty cache on the primary grows until it reaches 20% of the total cache (the default threshold, afaik), and when it reaches 20% the primary dies.
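For reference, that 20% figure matches WiredTiger's `eviction_dirty_trigger` default. As far as I know it can be changed at mongod startup, which would let someone confirm the stall point moves with it. A hedged sketch (the dbpath and replica-set name are placeholders; raising the trigger is for experimentation, not a recommendation):

```shell
# Config fragment (assumption: mongod 3.4 with the WiredTiger storage engine):
# start with a different dirty-cache eviction target/trigger to test the correlation
mongod --dbpath /data/db --replSet rs0 \
  --wiredTigerEngineConfigString "eviction_dirty_target=5,eviction_dirty_trigger=30"
```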
It dies without a single error in the log (checked with verbosity 2).
This means the whole replica set goes down, because the secondary cannot transition to primary due to its IO issues.
This was tested on Ubuntu Trusty with MongoDB versions 3.4.3 and 3.4.4; it happens in both. Worth mentioning that the file system is ext4, which we know is not recommended.
But the issue I want to point out here is that if, for whatever reason, the secondary has IO issues that leave it stuck, the primary dies once its WiredTiger dirty cache reaches 20%.
Also important to mention: during my tests the balancer was stopped, so chunk migrations would not interfere, and all inserts against the cluster used writeConcern 1, never majority.
In the attached screenshot you can see the light blue node reach 20%; a few minutes later it drops to almost 0, which is when it died.