[SERVER-30785] Slow secondary kills primary Created: 22/Aug/17 Updated: 27/Oct/23 Resolved: 14/Sep/17 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication, WiredTiger |
| Affects Version/s: | 3.4.3, 3.4.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | VictorGP | Assignee: | Mark Agarunov |
| Resolution: | Works as Designed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Issue Links: |
|
| Operating System: | ALL |
| Steps To Reproduce: | I'm reproducing this using cgroups to throttle IO on the secondary. Create some heavy insert traffic so we can stress the IO. On the secondaries, throttle the IO; I used cgroups for this (tune the values as you wish for your setup; an illustrative sketch follows below).
If the IO throttling worked, you will see the secondary start to lag. |
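The original throttling commands are not shown above; as an illustrative sketch only (not the reporter's actual commands), cgroup v1 blkio throttling of the secondary could look like the following, with placeholder cgroup name, device major:minor, byte limit, and mongod PID (run as root on the secondary host):

{code:python}
# Illustrative sketch: throttle the secondary's disk writes with cgroup v1
# blkio so that it starts to lag. The cgroup name, device numbers, limit and
# PID below are assumptions, not the reporter's original values.
import os

CGROUP = "/sys/fs/cgroup/blkio/mongo-throttle"  # hypothetical cgroup
DEVICE = "8:16"                 # major:minor of the data volume (see `lsblk`)
WRITE_BPS = 5 * 1024 * 1024     # cap writes at ~5 MB/s
MONGOD_PID = 12345              # placeholder: pid of the secondary's mongod

os.makedirs(CGROUP, exist_ok=True)

# Limit write throughput on the data device for this cgroup.
with open(os.path.join(CGROUP, "blkio.throttle.write_bps_device"), "w") as f:
    f.write(f"{DEVICE} {WRITE_BPS}\n")

# Move the secondary's mongod process into the throttled cgroup.
with open(os.path.join(CGROUP, "cgroup.procs"), "w") as f:
    f.write(f"{MONGOD_PID}\n")
{code}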
| Participants: | |
| Description |
|
A secondary that is slow due to IO issues/congestion will kill the primary, and since the secondary cannot transition to primary because of those same issues, the whole cluster goes down. We have had IO issues on our servers that kept the secondary from writing in time, making it lag until it reached a point where it was so slow that you couldn't even get a mongo shell. What happens is that once the secondary starts having issues, the WiredTiger dirty cache of the primary keeps increasing until it reaches 20% of the total cache (the default threshold, as far as I know); when it reaches 20%, the primary dies. This takes the whole replica set down, because the secondary cannot transition to primary due to its IO issues. This was tested on Ubuntu Trusty with MongoDB 3.4.3 and 3.4.4, and it happens in both. Worth mentioning that the file system is EXT4, which we know is not recommended. Also important to mention: during my tests the balancer was stopped so that chunk migrations would not interfere, and all inserts to the cluster used writeConcern 1, never majority. In the attached screenshot you can see the light blue node reach 20% and, a few minutes later, drop to almost 0; this is when it died. |
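For reference, the dirty-cache ratio described above can be watched via serverStatus; the sketch below is an assumption-laden illustration (pymongo, a local connection, and the WiredTiger cache statistic names reported by 3.4-era servers, which may differ between versions):

{code:python}
# Sketch: report the WiredTiger dirty-cache ratio on a node.
# Connection string and statistic names are assumptions based on
# serverStatus().wiredTiger.cache output.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
cache = client.admin.command("serverStatus")["wiredTiger"]["cache"]

dirty = cache["tracked dirty bytes in the cache"]
configured = cache["maximum bytes configured"]
print(f"dirty cache: {dirty / configured:.1%} of configured cache")
{code}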
| Comments |
| Comment by VictorGP [ 05/Oct/17 ] | |
|
Interesting, good to know, thanks for the clarification. We will disable it then. | |
| Comment by Eric Milkie [ 05/Oct/17 ] | |
|
The parameter indeed changes the behavior and performance profile of the server. It does so in order to be prepared in case a read with a read concern level of majority is issued; it doesn't matter whether any such reads ever actually occur. The parameter increases the CPU and memory consumption of the server, proportional to the number of writes issued. | |
| Comment by VictorGP [ 05/Oct/17 ] | |
|
Hi Eric, Although enableMajorityReadConcern is in the configuration, none of our clients were using majority read concern. In the example I did with cgroups to reproduce the issue, there are only writes, no reads, so I believe this parameter doesn't matter here. | |
| Comment by Eric Milkie [ 11/Sep/17 ] | |
|
Thanks for confirming the behavior. I wanted to make sure nothing crashed, as that behavior would be unexpected. | |
| Comment by VictorGP [ 08/Sep/17 ] | |
|
Hi Eric, I missed that log line where it receives the SIGTERM. So my analysis of what happened is partially incorrect: the primary died because this watchdog killed it. However, the point that the primary has issues because the secondary has IO issues still applies. The watchdog killed the primary because it wasn't responding to queries; it basically runs db.stats(), and if that fails several times in a row it kills the process (a rough sketch of such a watchdog is given after this comment). If you manage to reproduce the problem you will see it; it doesn't matter whether the primary actually dies or not, the important fact is that the primary is unusable (or super slow).
That makes sense, but it looks like if this data is 'dirty' and reaches 20%, the primary suffers. | |
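The actual watchdog script is not shown on the ticket; the following is only an illustrative sketch of a watchdog of the kind described above, assuming pymongo and placeholder values for the PID, probe interval, and failure threshold:

{code:python}
# Illustrative sketch of an external watchdog as described above: probe the
# node with db.stats() and SIGTERM mongod after repeated failures.
# The PID, thresholds, and timeouts are assumptions.
import os
import signal
import time

from pymongo import MongoClient
from pymongo.errors import PyMongoError

MONGOD_PID = 12345     # placeholder: pid of the local mongod
MAX_FAILURES = 5       # consecutive failed probes before killing the process
CHECK_INTERVAL = 10    # seconds between probes

client = MongoClient("mongodb://localhost:27017",
                     serverSelectionTimeoutMS=5000,
                     socketTimeoutMS=5000)

failures = 0
while True:
    try:
        client.test.command("dbstats")   # shell equivalent: db.stats()
        failures = 0
    except PyMongoError:
        failures += 1
        if failures >= MAX_FAILURES:
            # Matches the SIGTERM observed in the primary's log.
            os.kill(MONGOD_PID, signal.SIGTERM)
            break
    time.sleep(CHECK_INTERVAL)
{code}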
| Comment by Eric Milkie [ 07/Sep/17 ] | |
|
The log from the primary shows it was shut down via a SIGTERM signal delivered from another process on the system:
The primary does take a while to shut down, however, and by 14:52 the process appears to have ended before it finished all of the shutdown tasks. Could it be that something delivered a SIGKILL? Maybe it was the same process that delivered the original SIGTERM? To answer your question about cache data: as a majority of secondaries becomes more lagged, the primary must store more interim data states in order to serve read concern majority requests. | |
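Relatedly, how far the secondaries trail the primary can be observed with replSetGetStatus; the sketch below is an illustration under assumptions (pymongo, a local connection, and 3.4-era field names such as optimeDate and stateStr):

{code:python}
# Sketch: print how far each secondary trails the primary, based on
# replSetGetStatus. Connection string and field names are assumptions.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
status = client.admin.command("replSetGetStatus")

primary_optime = None
for m in status["members"]:
    if m["stateStr"] == "PRIMARY":
        primary_optime = m["optimeDate"]

for m in status["members"]:
    if primary_optime is not None and m["stateStr"] == "SECONDARY":
        lag = (primary_optime - m["optimeDate"]).total_seconds()
        print(f'{m["name"]}: {lag:.0f}s behind the primary')
{code}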
| Comment by VictorGP [ 07/Sep/17 ] | |
|
Hi Mark, That's the exact moment when the primary reaches 20% of the dirty cache. Beyond the question of when the primary stopped processing requests or crashed, the main issue here is why the primary's dirty cache increases when there are IO issues on the secondary. Why can that happen? As an extra bit of information, we ran other experiments with a replica set of 4 members plus an arbiter, where one of the secondaries had these IO issues; in that case the dirty cache grew to 20% not only on the primary but also on the other secondaries. I think it creates a chain effect depending on who replicates from whom. | |
| Comment by Mark Agarunov [ 07/Sep/17 ] | |
|
Hello victorgp, Thank you for providing this data. Looking over the logs, it appears that the server was processing requests until 14:49:35, at which point it was shut down. Could you clarify what you mean by the node being 'dead', and if the node was manually shut down? Thanks, | |
| Comment by VictorGP [ 24/Aug/17 ] | |
|
Done. Along with the uploaded files I attach the corresponding dirty-cache graph so you can compare timings. As you can see in the screenshot chart, the green line is the primary (msc2-2) and the yellow line is the secondary (msc2-1); the dirty cache starts growing at 14:31 and reaches 20% at 14:49. In the secondary logs you can see that the first "msc2-2.stg:27017; HostUnreachable:" appears at 14:52:08 (the primary is unresponsive at this point), and in the primary logs the server restarts at 14:53:54. You can also see in the secondary logs that an election starts at 14:52:15; it wins the dry-run election but never becomes primary due to: "Not starting an election, since we are not electable due to: Not standing for election again; already candidate" (we have also opened a ticket for this issue: https://jira.mongodb.org/browse/SERVER-30727). I had to filter out COMMAND, QUERY and INSERT log traces for security purposes, due to sensitive data, and also because the log file including them was 2 GB. | |
| Comment by Mark Agarunov [ 24/Aug/17 ] | |
|
Hello victorgp, I've created a secure upload portal so that you can send us these files privately. Thanks, | |
| Comment by VictorGP [ 24/Aug/17 ] | |
|
Can you provide me with a private channel to upload this? | |
| Comment by Mark Agarunov [ 24/Aug/17 ] | |
|
Hello victorgp, Thank you for the detailed report. To get some better insight into what may be causing this, could you please provide:
This should provide some more information to better diagnose this issue. Thanks, | |
| Comment by VictorGP [ 22/Aug/17 ] | |
|
I forgot to mention that if you stop/kill the secondary before the WiredTiger dirty cache reaches 20%, the dirty cache keeps growing until you add the secondary back, this time without IO issues. |