[SERVER-30785] Slow secondary kills primary Created: 22/Aug/17  Updated: 27/Oct/23  Resolved: 14/Sep/17

Status: Closed
Project: Core Server
Component/s: Replication, WiredTiger
Affects Version/s: 3.4.3, 3.4.4
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: VictorGP Assignee: Mark Agarunov
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File cachedirtymongo.png     PNG File dirtycachewt.png    
Issue Links:
Related
is related to SERVER-31099 Automate testing when oldest_timestam... Closed
Operating System: ALL
Steps To Reproduce:

I'm reproducing this using cgroups to throttle IO in the secondary.
In my setup I had two replica sets, each with two members and an arbiter.

Create some heavy traffic with inserts so we can stress the IO.

On the secondaries, throttle the IO. I used this (tune the values as you wish for your setup):

#Enable blkio subsystem
umount /sys/fs/cgroup/blkio
export CONFIG_BLK_CGROUP=y
export CONFIG_BLK_DEV_THROTTLING=y
mount -t cgroup -o blkio none /sys/fs/cgroup/blkio
cgcreate -g blkio:/iothrottle
 
# Get the major and minor numbers of the block device with
dmsetup ls
 
# Configure the cgroup to throttle write/read bytes per second and iops on the block devices.
# Important: do it on both the journal and data partitions if you have them separated
cgset -r blkio.throttle.write_bps_device="<major>:<minor> 65536" iothrottle # 64 KB/s
cgset -r blkio.throttle.read_bps_device="<major>:<minor> 65536" iothrottle # 64 KB/s
cgset -r blkio.throttle.write_iops_device="<major>:<minor> 2" iothrottle # 2 iops
cgset -r blkio.throttle.read_iops_device="<major>:<minor> 2" iothrottle # 2 iops
 
# Check the cgroup works by adding your pid (echo $$) to it
echo <mypid> > /sys/fs/cgroup/blkio/iothrottle/tasks
 
# Check you are in the cgroup
cat /proc/self/cgroup | grep iothrottle
 
# Test it (this might take a while, you can ctrl+c and see the write speed)
dd if=/dev/urandom of=/data/testfile bs=4K count=1024 oflag=direct
 
# Supposedly adding the mongod pid to the cgroup should be enough, but I've seen this not work most of the time, so to make it work, add pid 1 to the cgroup
echo 1 > /sys/fs/cgroup/blkio/iothrottle/tasks
 
# Restart mongod process and add its pid to the cgroup
echo <mongodpid> > /sys/fs/cgroup/blkio/iothrottle/tasks
 
# Check that write bytes/s and read bytes/s stay low and don't go over the limit you set in the cgroup
iotop -o

If the IO throttling worked, you will see the secondary start to lag.
Then monitor the WiredTiger dirty cache; while the secondary is lagging, it grows (see the monitoring sketch after these steps). Wait until it reaches 20% of the total cache.
The primary will die.
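
For reference, a rough sketch of how to watch the dirty cache ratio from the shell (this assumes a mongod listening on localhost:27017 and uses the WiredTiger cache counters reported by db.serverStatus(); adjust the connection details for your setup):

# Poll the WiredTiger dirty-cache percentage every 10 seconds
while true; do
  mongo --quiet --eval '
    var c = db.serverStatus().wiredTiger.cache;
    var dirty = c["tracked dirty bytes in the cache"];
    var max = c["maximum bytes configured"];
    print(new Date().toISOString() + " dirty cache: " + (100 * dirty / max).toFixed(1) + "%");'
  sleep 10
done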

Participants:

 Description   

A secondary that is slow due to IO issues/congestion will kill the primary, and the secondary won't transition to primary because of its own issues, so the whole cluster goes down.

We have had IO issues on our servers that left the secondary unable to write in time, making it lag until it reached a point where it is so slow that you can't even get a mongo shell.
When that happens, the primary still sees the secondary and sees that it is lagging; the secondary still belongs to the replica set because it responds to heartbeats, but replication is not working. So far, so good.

What happens then is that, once the secondary starts having issues, the WiredTiger dirty cache on the primary starts increasing until it reaches 20% of the total cache (the default threshold, afaik); when it reaches 20%, the primary dies.
It dies and there is not a single error in the log (checked with verbosity 2).

This means the whole replica set goes down, because the secondary cannot transition to primary due to its IO issues.

This was tested on Ubuntu Trusty with MongoDB versions 3.4.4 and 3.4.3; it happens in both. Worth mentioning that the file system is EXT4, which we know is not recommended.
But the issue I want to point out here is that if, for whatever reason, the secondary has IO issues that make it get stuck, the primary dies once the WiredTiger dirty cache reaches 20%.

Also, it is important to mention that during my tests the balancer was stopped, so chunk migrations don't interfere with this, and that all the inserts performed against the cluster used writeConcern 1, never majority.

In the attached screenshot you can see the light blue node reach 20% and, a few minutes later, drop to almost 0; that is when it died.
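
For reference, a rough way to observe the lag that the primary sees, using the standard helpers of a 3.4 mongo shell connected to any member (connection details depend on your setup):

# Sketch: check how far the secondaries are behind
mongo --quiet --eval 'rs.printSlaveReplicationInfo()'

# Or inspect each member's state and optime directly
mongo --quiet --eval 'rs.status().members.forEach(function(m) { print(m.name, m.stateStr, m.optimeDate); })'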



 Comments   
Comment by VictorGP [ 05/Oct/17 ]

Interesting, good to know, thanks for the clarification. We will disable it then.

Comment by Eric Milkie [ 05/Oct/17 ]

The parameter indeed changes the behavior and performance profile of the server. It does so to be prepared in case a read with a read concern level of majority is issued. It actually doesn’t matter if no reads ever do occur. The parameter increases the CPU and memory consumption of the server, proportional to the number of writes issued.
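
For context, a sketch of where this setting is controlled on 3.4; treat the exact option name and config path as something to verify against the documentation for your version:

# Startup flag form (enables majority read concern support); other flags shown are illustrative
mongod --enableMajorityReadConcern --replSet rs0 --dbpath /data/db

# YAML config form; leave it out or set it to false to run without it
#   replication:
#     enableMajorityReadConcern: false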

Comment by VictorGP [ 05/Oct/17 ]

Hi Eric,

Although enableMajorityReadConcern is in the configuration, none of our clients were using majority read concern. In the example I did with cgroups to reproduce the issue, there are only writes, no reads, so I believe this parameter doesn't matter here.

Comment by Eric Milkie [ 11/Sep/17 ]

Thanks for confirming the behavior. I wanted to make sure nothing crashed, as that behavior would be unexpected.
As Michael mentioned above, the WiredTiger storage engine can slow down remarkably when it begins to run out of workspace in memory. This particular situation was triggered by running with enableMajorityReadConcern active, having a majority of secondaries fall behind, and doing writes with a write concern less than "w:majority". Since you are running with only one secondary node, this makes a majority of one and is the most fragile configuration of nodes.
The easiest way to avoid this situation is to do all writes with "w:majority", which will also guarantee your writes will not roll back in the face of a failure of one of the nodes in the cluster. Running with the watchdog as you have done here certainly will help things in this situation, by forcefully triggering the secondary to elect itself, but you will lose a lot of writes when this occurs.
In the meantime, we continue to make improvements in WiredTiger to ensure that it degrades more gracefully in cases like this, where it would simply accept fewer and fewer writes as cache memory filled.
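
For illustration, a minimal sketch of such a write from the mongo shell (the database and collection names here are made up):

# Insert acknowledged by a majority of the replica set, with a timeout so the
# client is told when a majority cannot confirm the write in time
mongo --quiet --eval '
  db.getSiblingDB("test").mycoll.insert(
    { x: 1 },
    { writeConcern: { w: "majority", wtimeout: 5000 } }
  );'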

Comment by VictorGP [ 08/Sep/17 ]

Hi Eric,

I missed that log line when it receives the SIGTERM.
It looks like I had enabled a watchdog that we implemented due to https://jira.mongodb.org/browse/SERVER-14139, and it killed the primary first with SIGTERM and probably then with SIGKILL.

So my analysis of what happened is partially incorrect: the primary died because this watchdog killed it.

However, the point about the primary having issues because the secondary has IO issues still applies. The watchdog killed the primary because it wasn't responding to queries; this watchdog basically runs db.stats(), and if that fails several times in a row it kills the process (a rough sketch of what it does is included below).
When the dirty cache on the primary reaches 20%, the primary is not able to respond to any query, not even a simple one like db.stats(); our whole cluster went down because the primary cannot process queries (or maybe it can process them, but they take forever). Another issue is that primary failover doesn't happen, because replica-set-wide everything looks ok: the network threads still respond to heartbeats.
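
For clarity, a rough sketch of what that watchdog does; this is an illustration, not the actual script we run, and the thresholds and pkill target are made up:

# db.stats()-based watchdog: if the check fails MAX_FAILS times in a row, terminate mongod
FAILS=0
MAX_FAILS=5
while true; do
  if ! timeout 10 mongo --quiet --eval 'db.stats()' > /dev/null 2>&1; then
    FAILS=$((FAILS + 1))
  else
    FAILS=0
  fi
  if [ "$FAILS" -ge "$MAX_FAILS" ]; then
    echo "mongod unresponsive, sending SIGTERM"
    pkill -TERM mongod
    FAILS=0
  fi
  sleep 5
done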

If you manage to reproduce the problem you will see it; it doesn't matter whether the primary actually dies or not, the important fact is that the primary is unusable (or extremely slow).

Regarding your explanation, "as a majority of secondaries becomes more lagged, the primary must store more interim data states in order to serve read concern majority requests":

That makes sense, but it looks like once this data is 'dirty' and reaches 20%, the primary suffers.

Comment by Eric Milkie [ 07/Sep/17 ]

The log from the primary shows it was shut down via a SIGTERM signal delivered from another process on the system:

2017-08-22T14:51:45.391-0700 I CONTROL  [signalProcessingThread] got signal 15 (Terminated), will terminate after current cmd ends

The primary does take a while to shut down, however, and by 14:52 the process may have ended by itself before it finished all the shutdown tasks. Could it be that something delivered a SIGKILL? Maybe it was the same process that delivered the original SIGTERM?
Is this not the correct log I'm looking at?
Do you know if anything would have sent a SIGTERM signal to shut down the server?

To answer your question about cache data, as a majority of secondaries becomes more lagged, the primary must store more interim data states in order to serve read concern majority requests.

Comment by VictorGP [ 07/Sep/17 ]

Hi Mark,

That's the exact moment when the primary reaches 20% of the dirty cache.
I guess you are talking about the primary logs; at 14:49:35 there was no shutdown, actually there are logs until 14:52:05.313, when the mongod process stopped/crashed by itself, with no manual stop and without printing any error, stack trace, or shutdown message in the logs.
Then at 14:53:54.714 the server was started manually.

Beyond the question of when the primary stopped processing requests or when it crashed, the main issue here is why the primary's dirty cache increases when there are IO issues on the secondary. Why can that happen?

As an extra bit of information, we were running other experiments with a replica set of 4 members plus an arbiter, and one of the secondaries had these IO issues; in that case the dirty cache grew to 20% not only on the primary but also on the other secondaries. I think it creates a chain effect depending on who replicates from whom.

Comment by Mark Agarunov [ 07/Sep/17 ]

Hello victorgp,

Thank you for providing this data. Looking over the logs, it appears that the server was processing requests until 14:49:35, at which point it was shut down. Could you clarify what you mean by the node being 'dead', and if the node was manually shut down?

Thanks,
Mark

Comment by VictorGP [ 24/Aug/17 ]

Done.

Along with the uploaded files I attach the corresponding dirty cache graph so you can compare timings.

As you can see in the screenshot chart, the green line is the primary (msc2-2) and the yellow line is the secondary (msc2-1); the dirty cache starts growing at 14:31 and reaches 20% at 14:49.

In the secondary logs, you can see that the first "msc2-2.stg:27017; HostUnreachable:" happens at 14:52:08 (the primary is unresponsive at this point) and then in the primary logs the server restarts at 14:53:54.

You can also see in the secondary logs that an election starts at 14:52:15; it wins the dry-run election but never becomes primary due to: "Not starting an election, since we are not electable due to: Not standing for election again; already candidate" (we have also opened a ticket for this issue: https://jira.mongodb.org/browse/SERVER-30727).

I had to filter out the COMMAND, QUERY and INSERT log lines, for security purposes due to sensitive data, and also because with them the log file was 2 GB.

Comment by Mark Agarunov [ 24/Aug/17 ]

Hello victorgp,

I've created a secure upload portal so that you can send us these files privately.

Thanks,
Mark

Comment by VictorGP [ 24/Aug/17 ]

Can you provide me with a private channel to upload this?

Comment by Mark Agarunov [ 24/Aug/17 ]

Hello victorgp,

Thank you for the detailed report. To get some better insight into what may be causing this, could you please provide:

  • The complete logs from mongod when this issue is present,
  • An archive (tar or zip) of the $dbpath/diagnostic.data directory,
  • The configuration you are using for mongod

This should provide some more information to better diagnose this issue.

Thanks,
Mark

Comment by VictorGP [ 22/Aug/17 ]

I forgot to mention that if you stop/kill the secondary before the WiredTiger dirty cache reaches 20%, the cache keeps growing until you add the secondary back, this time without IO issues.
