[SERVER-43038] Commit point can be stale on slaveDelay nodes and cause memory pressure Created: 26/Aug/19 Updated: 06/Dec/22 Resolved: 03/Jan/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | WiredTiger |
| Affects Version/s: | 4.0.12 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Aristarkh Zagorodnikov | Assignee: | Backlog - Replication Team |
| Resolution: | Won't Fix | Votes: | 1 |
| Labels: | caching, wiredtiger |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: |
Ubuntu 16.04.6 LTS, db version v4.0.12, 2x Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz |
| Attachments: | |
| Issue Links: | |
| Assigned Teams: | Replication |
| Operating System: | ALL |
| Participants: | |
| Description |
|
After restarting a stale slaveDelay node, the higher OpTime it had learned in the same term can be forgotten. If the commit point is in a higher term on all other nodes, the slaveDelay node can only advance its commit point via getMore responses on its oplog cursor. The issue resolves once the slaveDelay node starts to apply oplog entries from the latest term. The resulting memory pressure and evictions on slaveDelay nodes are undesirable but not harmful.

The same issue can occur without a restart. Say an election happens in term 8 at time T0, but the node is delayed by 5 days and is still applying entries from term 7. At T0 + 2 days, another election occurs in term 9, so the commit point is now in term 9. At T0 + 5 days, when the delayed node starts to apply entries from term 8, it cannot advance its commit point beyond its last applied OpTime. Eventually, when the node starts to apply entries from term 9, everything is fine again.

=======================================

After upgrading from 3.6 to 4.0.12 we encountered overly high CPU consumption on our slave-delayed hidden replica set member. Restarting the member doesn't help: CPU consumption goes down, but climbs back up after some time. "Before" means before the CPU spike, "after" means after it (it occurred at about 15:47:31 +/- 5s).

When CPU consumption is high, `perf report` shows about 96% of the time spent in `__wt_evict` (see `mongod-after.perf.txt` and `mongod-after.perf.data`). This coincides with the `cache overflow score` metric jumping from 0 to 100 (see `caches-before.log` and `caches-after.log`), despite `bytes currently in the cache` (5703522791) being much smaller than `maximum bytes configured` (8589934592).

This is a hidden delayed secondary, so there should be next to no load except replicating writes, which are fairly low-volume. Before upgrading to 4.0 we did not have any issues with this service. |
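To see whether a delayed node is in this state, one can compare its reported majority commit point with its last applied OpTime. A minimal sketch, assuming the 4.0-era `replSetGetStatus` field names (`optimes.lastCommittedOpTime`, `optimes.readConcernMajorityOpTime`, `optimes.appliedOpTime`); verify against your own output:

```javascript
// Hedged diagnostic sketch (MongoDB 4.0-era field names assumed).
// Run in the mongo shell on the delayed secondary.
var st = rs.status();

// The node's view of the majority commit point vs. what it has applied locally.
printjson({
    lastCommittedOpTime: st.optimes.lastCommittedOpTime,
    readConcernMajorityOpTime: st.optimes.readConcernMajorityOpTime,
    appliedOpTime: st.optimes.appliedOpTime
});

// If lastCommittedOpTime stays frozen for long stretches while appliedOpTime keeps
// advancing, the node is stuck in the window described above: its commit point only
// moves when it fetches new oplog entries (getMore) or reaches the latest term.
```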
| Comments |
| Comment by Aristarkh Zagorodnikov [ 04/Jan/20 ] |
|
Tess, thank you for the clarification! |
| Comment by Tess Avitabile (Inactive) [ 03/Jan/20 ] |
|
Apologies, I accidentally posted my comment as internal-only. We are working on a project to make the server much more resilient to commit point lag in general, so we are not going to address this particular bug. |
| Comment by Aristarkh Zagorodnikov [ 03/Jan/20 ] |
|
Not even a single line about why this is not going to be fixed at all? |
| Comment by Aristarkh Zagorodnikov [ 28/Oct/19 ] |
|
Well, I'm glad you got to the bottom of this, and thank you for keeping us updated on the internal details. Looking forward to the permanent fix (even if it takes some time to be released). |
| Comment by Danny Hatcher (Inactive) [ 28/Oct/19 ] |
|
onyxmaster, at this point we're fairly confident in the following analysis. Because this involves our internal mechanisms, I have condensed the information, but please let us know if you'd like more detail. The high CPU usage is due to heavy cache pressure, which forces application threads to be used for data eviction. That cache pressure was caused by lag in our read concern "majority" mechanism, and the lag in turn was caused by a two-term difference between the replica set and the operations being applied on the delayed node. You noticed the problem after restarts for two reasons:
As long as two elections occurred within the slaveDelay window, the node would have reached this state even without being restarted. The good news is that this scenario recovers automatically once the delayed node starts applying operations from the same term as the commit point. This is not desirable behavior, but because so many factors have to line up, we will treat it as a lower priority for a permanent fix. |
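For reference, the cache-pressure symptoms described here can be watched from the shell. A minimal sketch, assuming the WiredTiger statistic names quoted elsewhere in this ticket (`cache overflow score`, `bytes currently in the cache`, `maximum bytes configured`) plus `tracked dirty bytes in the cache`; verify the exact names against your own `serverStatus()` output:

```javascript
// Hedged monitoring sketch; statistic names should be checked against
// serverStatus() on your build (MongoDB 4.0 / WiredTiger).
var cache = db.serverStatus().wiredTiger.cache;

printjson({
    bytesInCache: cache["bytes currently in the cache"],
    maxBytesConfigured: cache["maximum bytes configured"],
    dirtyBytes: cache["tracked dirty bytes in the cache"],
    cacheOverflowScore: cache["cache overflow score"]  // 0 normally; 100 indicates heavy overflow/lookaside use
});

// When the overflow score pins at 100 while bytes-in-cache stays well under the
// configured maximum (as in this ticket), eviction work is being pushed onto
// application threads, which shows up as the __wt_evict CPU spike in perf.
```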
| Comment by Aristarkh Zagorodnikov [ 16/Oct/19 ] |
|
Daniel, unfortunately we don't have this data, since it rolled over a long time ago. Our delayed secondaries do not currently experience this issue, but I restarted them in an attempt to trigger it. Sorry I couldn't help more. |
| Comment by Danny Hatcher (Inactive) [ 15/Oct/19 ] |
|
onyxmaster, apologies for the delay in response; this has been a tough problem to crack. We think we have a solid idea but are missing a key piece of evidence. Earlier in the ticket I asked for the logs and diagnostic.data of the shard Primary. It looks like the diagnostics are actually from a different replica set Primary. I know it's been a while but do you still have the diagnostics for the Primary for the "driveFS-files-1" shard covering September 11th through 13th? If this data has rolled over but you can see the same symptoms of large CPU usage on another delayed Secondary, could you please upload logs and diagnostic.data for both the delayed node and the Primary for that shard? |
| Comment by Bruce Lucas (Inactive) [ 25/Sep/19 ] |
|
Summary (from above and as discussed in the meeting): for some reason the delayed secondary stops updating its timestamps, causing data to be pinned, which in turn causes the CPU symptom the customer observed. This was not happening before the node was restarted, but began happening repeatedly afterwards. Regarding the "Restarting oplog query due to error: CursorNotFound" behavior specifically, we see that happening repeatedly both before and after the restart:
I think the reason for this is illustrated by the cycle from A to B.
This cycle happens repeatedly, but since it occurs both in the good state before the restart at C and afterwards, it is not itself the cause of the problem after the restart. The underlying issue is that, for some reason, after the restart at C we are no longer regularly updating our timestamps ("set timestamp calls") even though we are regularly applying batches. We do occasionally update the timestamp (e.g. D and E), and as you noted this happens at the same time that we fetch more oplog entries (as described above). Interestingly, if we zoom in on the few seconds after the restart we see something similar:
From C to D there is a burst of oplog application because the node has been down for a bit and has to catch up to the delay lag of 5 days. During this time we are in fact updating our timestamps and are not pinning data. Because we're applying ops faster than we do in steady state, we are also doing regular getmores. But at D we're finished "catching up" and no longer need to do regular getmores for a while (a couple of hours, actually). This is also when we stop updating our timestamps, even though we are still regularly applying ops and batches. This is the same behavior that we see later: when we next need to do getmores, we also update the timestamps. To summarize, it looks like:
I couldn't spot what changed after the restart that might have triggered this behavior. Maybe some race condition? Suggest forwarding to the replication team. |
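One way to observe the "timestamps not updating / data pinned" behavior described above is through the WiredTiger transaction statistics. A hedged sketch: "set timestamp calls" is quoted in the comment above, while the pinned-range statistic name and the one-minute sampling interval are my assumptions and should be checked against your own `serverStatus()` output:

```javascript
// Hedged sketch: compare the rate of "set timestamp calls" against batch application.
// Statistic names are assumptions based on 4.0-era WiredTiger output; verify locally.
function sample() {
    var ss = db.serverStatus();
    return {
        setTimestampCalls: ss.wiredTiger.transaction["set timestamp calls"],
        pinnedRange: ss.wiredTiger.transaction["transaction range of timestamps pinned by the oldest timestamp"],
        appliedBatches: ss.metrics.repl.apply.batches.num
    };
}

var a = sample();
sleep(60 * 1000);  // sample again after one minute
var b = sample();

// If batches keep being applied while "set timestamp calls" barely moves and the
// pinned timestamp range keeps growing, the node is in the stuck state described above.
printjson({
    setTimestampCallsDelta: b.setTimestampCalls - a.setTimestampCalls,
    appliedBatchesDelta: b.appliedBatches - a.appliedBatches,
    pinnedRangeNow: b.pinnedRange
});
```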
| Comment by Aristarkh Zagorodnikov [ 17/Sep/19 ] |
|
Attaching logs: mongod-new-logs-2.zip |
| Comment by Danny Hatcher (Inactive) [ 16/Sep/19 ] |
|
Thank you for recovering the logs; they do appear complete. It would be very helpful to retain the default MongoDB logs for the length of this investigation, but I understand if that's not possible. I can see that the logical session cache is attempting a refresh on the secondary, but it seems to do nothing until the node shuts down a day later. The config metrics and logs that you provided are from the original Primary of the config replica set, but right at the start of the problem (09-12T08:31) that node is restarted and "c2.fs.drive.bru" becomes the Primary. Do you have the logs and metrics for that node as well? |
| Comment by Aristarkh Zagorodnikov [ 13/Sep/19 ] |
|
I converted the logs from our system into something that looks like raw logs: mongod-new-logs.zip. Please note that the timestamps differ: most of the times mentioned here are local (UTC+3), while the logs are in UTC. |
| Comment by Aristarkh Zagorodnikov [ 13/Sep/19 ] |
|
Daniel, that's my bad: slave delay is enabled and set to 5 days ("slaveDelay" : NumberLong(432000)). I turned it off while diagnosing, and my colleague turned it back on (we use this node as a fat-finger DR copy), but I forgot about that and assumed it was still off. Anyway, it was turned on before the restart. It's been about 35 hours since the restart and the node still spikes CPU. Attaching shard primary metrics: metrics-shard-primary.zip. I'll try to get the logs, but we don't store raw logs for more than 24 hours; we ship them to a log processing facility that has no easy way of recovering raw logs.
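For anyone reproducing this setup, the delay can be checked or toggled through the replica set configuration. A minimal sketch using the standard `rs.conf()`/`rs.reconfig()` helpers; the member index `2` is a placeholder for whichever member is the hidden delayed node in your own config:

```javascript
// Minimal sketch: inspect and set slaveDelay on a hidden DR member (MongoDB 4.0).
// Member index 2 is a placeholder; pick the delayed node in your own config.
var cfg = rs.conf();
var idx = 2;

printjson({
    host: cfg.members[idx].host,
    hidden: cfg.members[idx].hidden,
    priority: cfg.members[idx].priority,
    slaveDelay: cfg.members[idx].slaveDelay
});

// A delayed member must have priority 0 (and is typically hidden).
cfg.members[idx].priority = 0;
cfg.members[idx].hidden = true;
cfg.members[idx].slaveDelay = NumberLong(432000);  // 5 days, as in this ticket

rs.reconfig(cfg);  // run on the primary
```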
| Comment by Danny Hatcher (Inactive) [ 13/Sep/19 ] |
|
Can you please confirm that there is no slaveDelay configured in the replica set config? I see a consistent 5-day "lag" for this node, which would typically indicate a 5-day slaveDelay. Once the node restarted, it triggered a full refresh of the sharding metadata. 24 hours does seem like a long time for that refresh to be running, but if the sharded collection is incredibly large it could take a while. Could you please provide the following, covering the restart until now?
|
| Comment by Aristarkh Zagorodnikov [ 13/Sep/19 ] |
|
It appears that the upgrade has nothing to do with this; simply restarting the service may put the slave-delayed node into this "bad" state. I've attached mongo-metrics.zip. |
| Comment by Aristarkh Zagorodnikov [ 28/Aug/19 ] |
|
Okay. |
| Comment by Danny Hatcher (Inactive) [ 28/Aug/19 ] |
|
I can see a rise in CPU, but the timeframe covered by the provided diagnostic.data is too short to draw any real conclusions about the root cause. I propose that we close this ticket as Incomplete; when you encounter the problem again, you can leave a comment here with the relevant data. |
| Comment by Aristarkh Zagorodnikov [ 28/Aug/19 ] |
|
Unfortunately we encountered this issue more than a week ago, and the node is already resynced and slave delay is turned off (we’re going to try turning it on a bit later), so there are no relevant FTDC data files that we can provide. |
| Comment by Danny Hatcher (Inactive) [ 27/Aug/19 ] |
|
Thank you for providing those files. So that we can compare MongoDB's internal metrics over time, could you please attach a few hours of diagnostic.data from before the node's upgrade and a few hours from after it? |