Core Server / SERVER-30785

Slow secondary kills primary

    • Type: Bug
    • Resolution: Works as Designed
    • Priority: Major - P3
    • Affects Version/s: 3.4.3, 3.4.4
    • Component/s: Replication, WiredTiger
    • Labels: None

      I'm reproducing this using cgroups to throttle IO on the secondary.
      In my setup I had two replica sets, each with two members and an arbiter.
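      For reference, a minimal sketch of how one such replica set could be initiated from the mongo shell. This is not from the ticket; the replica set name, host names and ports are placeholders:

      # Hypothetical example: one replica set with two data-bearing members and an arbiter
      mongo --host host1:27017 --eval '
        rs.initiate({
          _id: "rs0",
          members: [
            {_id: 0, host: "host1:27017"},
            {_id: 1, host: "host2:27017"},
            {_id: 2, host: "host3:27017", arbiterOnly: true}
          ]
        })'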

      Create some heavy traffic with inserts so we can stress the IO.
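      As a sketch (not from the ticket; the target host, database/collection names, document size and loop bound are arbitrary), the insert load could be generated with something like:

      # Hypothetical load generator: padded inserts with writeConcern 1 to stress the IO
      mongo --host <primary> --eval '
        for (var i = 0; i < 100000000; i++) {
          db.getSiblingDB("test").stress.insert(
            {i: i, pad: "x".repeat(4096)},
            {writeConcern: {w: 1}});
        }'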

      On the secondaries, throttle the IO. I used the following (tune the values as needed for your setup):

      # Enable the blkio subsystem
      # (CONFIG_BLK_CGROUP / CONFIG_BLK_DEV_THROTTLING are kernel build options;
      #  exporting them here is a no-op unless the kernel already has them enabled)
      umount /sys/fs/cgroup/blkio
      export CONFIG_BLK_CGROUP=y
      export CONFIG_BLK_DEV_THROTTLING=y
      mount -t cgroup -o blkio none /sys/fs/cgroup/blkio
      cgcreate -g blkio:/iothrottle
      
      # Get the major and minor numbers of the block device with
      dmsetup ls
      
      # Configure the cgroup to throttle writes, reads and iops on the block devices.
      # Important: do it for both the journal and data partitions if you have them separated
      cgset -r blkio.throttle.write_bps_device="<major>:<minor> 65536" iothrottle # 64 KB/s
      cgset -r blkio.throttle.read_bps_device="<major>:<minor> 65536" iothrottle # 64 KB/s
      cgset -r blkio.throttle.write_iops_device="<major>:<minor> 2" iothrottle # 2 iops
      cgset -r blkio.throttle.read_iops_device="<major>:<minor> 2" iothrottle # 2 iops
      
      # Check the cgroup works by adding your pid (echo $$) to it
      echo <mypid> > /sys/fs/cgroup/blkio/iothrottle/tasks
      
      # Check you are in the cgroup
      cat /proc/self/cgroup | grep iothrottle
      
      # Test it (this might take a while, you can ctrl+c and see the write speed)
      dd if=/dev/urandom of=/data/testfile bs=4K count=1024 oflag=direct
      
      # Supposedly adding the mongod pid to the cgroup should be enough, but I've seen this not work most of the time, so to make it work, add pid 1 to the cgroup
      echo 1 > /sys/fs/cgroup/blkio/iothrottle/tasks
      
      # Restart the mongod process and add its pid to the cgroup
      echo <mongodpid> > /sys/fs/cgroup/blkio/iothrottle/tasks
      
      # Check that write bytes/s and read bytes/s stay low and don't go over the limits you set in the cgroup
      iotop -o
      
      

      If the IO throttling worked, you will see the secondary starting to lag.
      Then monitor the WiredTiger dirty cache on the primary; while the secondary is lagging, it grows. Wait until it reaches 20% of the total cache.
      The primary will die.
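      A minimal sketch for watching the lag and the dirty cache ratio together (the host is a placeholder; the serverStatus().wiredTiger.cache field names are the standard WiredTiger cache metrics):

      # Poll replication lag and the primary's dirty cache percentage every few seconds
      while true; do
        mongo --host <primary> --quiet --eval '
          rs.printSlaveReplicationInfo();
          var c = db.serverStatus().wiredTiger.cache;
          print("dirty cache: " +
                (100 * c["tracked dirty bytes in the cache"] /
                       c["maximum bytes configured"]).toFixed(1) + "%");'
        sleep 5
      done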


      A secondary that is slow due to IO issues/congestion will kill the primary, and the secondary won't transition to primary because of those same issues, so the whole cluster goes down.

      We have had IO issues on our servers that left the secondary unable to write in time, making it lag until it reached a point where it was so slow that you could not even get a mongo shell.
      When that happens, the primary still sees the secondary and sees that it is lagging; it still belongs to the replica set because it responds to heartbeats, but replication is not working. So far, so good.

      What happens then is that, since the secondary starts having issues, the WiredTiger dirty cache of the primary keeps increasing until it reaches 20% of the total cache (the default value, afaik); when it reaches 20%, the primary dies.
      It dies and there is not a single error in the log (checked with verbosity 2).
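      For what it's worth, the 20% figure corresponds to WiredTiger's default eviction_dirty_trigger. If someone wants to check whether the stall point follows that setting, it can be overridden at mongod startup; this is only an experiment sketch with arbitrary values, not a recommendation:

      # Hypothetical experiment: move the dirty-cache target/trigger and see if the stall point moves
      mongod --replSet rs0 --dbpath /data/db \
             --wiredTigerEngineConfigString "eviction_dirty_target=10,eviction_dirty_trigger=40"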

      This means the whole replica set goes down because the secondary cannot transition to primary due to its IO issues.

      This was tested on Ubuntu Trusty with MongoDB versions 3.4.4 and 3.4.3; it happens in both. Worth mentioning that the file system is ext4, which, we know, is not recommended.
      But the issue I want to point out here is that if, for whatever reason, the secondary has IO issues that make it get stuck, the primary dies once the WiredTiger dirty cache reaches 20%.

      Also important to mention: during my tests the balancer was stopped so that chunk migrations would not interfere, and all the inserts performed against the cluster used writeConcern 1, never majority.
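      For completeness, a sketch of keeping the balancer out of the way (standard sharding shell helpers, run against a mongos; the host is a placeholder):

      # Stop the balancer so chunk migrations don't interfere, then confirm it is off
      mongo --host <mongos> --eval '
        sh.stopBalancer();
        print("balancer running: " + sh.getBalancerState());'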

      In the attached screenshot you can see the light blue node reach 20% and, a few minutes later, drop to almost 0; this is when it died.

        1. cachedirtymongo.png (31 kB)
        2. dirtycachewt.png (44 kB)

            Assignee: mark.agarunov Mark Agarunov
            Reporter: victorgp VictorGP
            Votes: 0
            Watchers: 12
