[SERVER-43038] Commit point can be stale on slaveDelay nodes and cause memory pressure Created: 26/Aug/19  Updated: 06/Dec/22  Resolved: 03/Jan/20

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 4.0.12
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Aristarkh Zagorodnikov Assignee: Backlog - Replication Team
Resolution: Won't Fix Votes: 1
Labels: caching, wiredtiger
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu 16.04.6 LTS
Linux scorpius 4.15.0-58-generic #64~16.04.1-Ubuntu SMP Wed Aug 7 14:10:35 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

db version v4.0.12
git version: 5776e3cbf9e7afe86e6b29e22520ffb6766e95d4
OpenSSL version: OpenSSL 1.0.2g 1 Mar 2016
allocator: tcmalloc
modules: none
build environment:
distmod: ubuntu1604
distarch: x86_64
target_arch: x86_64

2xIntel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz

total used free shared buff/cache available
Mem: 125G 372M 61G 2.5M 64G 124G
Swap: 29G 0B 29G


Attachments: PNG File Screen Shot 2019-09-24 at 1.23.47 PM.png     PNG File Screen Shot 2019-10-09 at 12.47.34 PM.png     Text File caches.log     Zip Archive metrics-csrs-primary-2.zip     Zip Archive metrics-csrs-primary.zip     Zip Archive metrics-shard-primary.zip     File metrics.2019-08-16T12-02-49Z-00000     Zip Archive mongo-metrics.zip     File mongod-after.perf.data     Text File mongod-after.perf.txt     Text File mongod-after.perf.txt     File mongod-before.perf.data     Zip Archive mongod-new-logs-2.zip     Zip Archive mongod-new-logs.zip     Text File mongod.log     File mongodb-driveFS-files-1.conf     PNG File overview.png     PNG File scorpius-cache.png     PNG File scorpius-cpu-new.png     PNG File scorpius-cpu.png     PNG File scorpius-cpu2.png     PNG File zoom.png    
Issue Links:
Depends
Related
is related to SERVER-43632 Possible memory leak in 4.0 Closed
Assigned Teams:
Replication
Operating System: ALL
Participants:

 Description   

After restarting a stale slaveDelay node, the higher OpTime it had learned in the same term can be forgotten. If the commit point is in a higher term on all other nodes, the slaveDelay node can only advance its commit point on getMore responses due to SERVER-39831. On slaveDelay nodes the buffer is likely already full, so the applier has to apply 16MB worth of oplog entries to make room for bgsync to insert the last fetched batch and call another getMore. Applying 16MB of oplog entries may be enough to trigger memory pressure, causing evictions.
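
For reference, the buffer fill level described here can be watched from the shell on the delayed node (a minimal sketch; the fields are the standard serverStatus repl buffer metrics):

    // Run on the delayed secondary: shows how full the replication buffer is.
    var b = db.serverStatus().metrics.repl.buffer;
    printjson({ count: b.count, sizeBytes: b.sizeBytes, maxSizeBytes: b.maxSizeBytes });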

The issue resolves itself once the slaveDelay node starts to apply oplog entries from the latest term. Memory pressure and evictions on slaveDelay nodes are undesirable but not harmful.

The same issue can happen without a restart. Say an election happens in term 8 at time T0, but the node is delayed by 5 days and is still applying entries from term 7. At T0 + 2 days, another election occurs in term 9, so the commit point is now in term 9. At T0 + 5 days, when the delayed node starts to apply entries in term 8, it cannot advance its commit point beyond its last applied optime. Eventually, when the node starts to apply entries from term 9, everything is fine again.
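
A quick way to see this state from the shell on the delayed node (a hedged illustration; the field paths are from replSetGetStatus):

    // Compare the term ("t") of the majority commit point with the term of the last applied op.
    var o = rs.status().optimes;
    printjson({ lastCommitted: o.lastCommittedOpTime, applied: o.appliedOpTime });
    // While the node is still applying entries from an older term, the reported
    // lastCommittedOpTime stays stale and cannot advance past the last applied optime.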

=======================================
Original title and description:
WT eviction threads consume a lot of CPU even when there is no apparent cache pressure

After upgrading from 3.6 to 4.0.12 we encountered overly high CPU consumption on our slave-delayed hidden replica set member. Restarting the member doesn't help: the CPU consumption goes down but climbs back up after some time.
We recorded some logs, perf traces and statistics snapshots; see the attached files. Also included are FTDC files for the relevant interval and some graphs from our monitoring system.

"Before" means before the CPU spike, "after" – after it (occured about 15:47:31 +/- 5s).

When CPU consumption is high, according to `perf report` about 96% of the time is spent in `__wt_evict` (see `mongod-after.perf.txt` and `mongod-after.perf.data`). This coincides with the `cache overflow score` metric jumping from 0 to 100 (see `caches-before.log` and `caches-after.log`), despite `bytes currently in the cache` (5703522791) being much smaller than `maximum bytes configured` (8589934592).
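
The same counters can be read live from the shell (a sketch; "cache overflow score" is assumed to be surfaced under wiredTiger.cache the same way FTDC reports it):

    var c = db.serverStatus().wiredTiger.cache;
    print("bytes currently in the cache: " + c["bytes currently in the cache"]);
    print("maximum bytes configured:     " + c["maximum bytes configured"]);
    print("cache overflow score:         " + c["cache overflow score"]);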

This is a hidden delayed secondary, so there should be next to no load other than replicating writes, which are fairly low-volume. Before upgrading to 4.0 we did not have any issues with this service.



 Comments   
Comment by Aristarkh Zagorodnikov [ 04/Jan/20 ]

Tess, thank you for the clarification!

Comment by Tess Avitabile (Inactive) [ 03/Jan/20 ]

Apologies, I accidentally posted my comment as internal-only. We are working on a project to make the server much more resilient to commit point lag in general, so we are not going to address this particular bug.

Comment by Aristarkh Zagorodnikov [ 03/Jan/20 ]

Not even a single line about why this is not going to be fixed at all?

Comment by Aristarkh Zagorodnikov [ 28/Oct/19 ]

Well, I'm glad you got to the bottom of this, and thank you for keeping us updated on the inner details.

Looking forward to the permanent fix (even if it would take some time for it to be released).

Comment by Danny Hatcher (Inactive) [ 28/Oct/19 ]

onyxmaster, at this point we're fairly confident in the following analysis. Because this is due to our internal mechanisms, I have attempted to condense the information, but please let us know if you're looking for more detail.

The high CPU usage is due to high cache pressure, which triggers application threads being used for data eviction. That cache pressure was caused by lag in our read concern "majority" mechanism, which in turn was caused by a two-term difference between the rest of the replica set and the operations the delayed node was applying.

You've noticed the problem after restarts for two reasons:

  • Restarts can trigger elections which change term
  • The node loses its in-memory representation of some metadata

As long as two elections occurred during the slaveDelay window, the node would have reached this state even without being restarted.

The good news is that this scenario recovers automatically once the delayed node starts applying operations from the same term. This is not desirable behavior, but because so many factors have to line up, we will be treating a permanent fix as a lower priority.

Comment by Aristarkh Zagorodnikov [ 16/Oct/19 ]

Daniel, unfortunately we don't have this data, since it rolled over a long time ago. Our delayed secondaries do not currently experience this issue, but I restarted them to try to trigger it. Sorry I couldn't help much.

Comment by Danny Hatcher (Inactive) [ 15/Oct/19 ]

onyxmaster, apologies for the delay in response; this has been a tough problem to crack. We think we have a solid idea but are missing a key piece of evidence. Earlier in the ticket I asked for the logs and diagnostic.data of the shard Primary. It looks like the diagnostics are actually from a different replica set Primary. I know it's been a while but do you still have the diagnostics for the Primary for the "driveFS-files-1" shard covering September 11th through 13th?

If this data has rolled over but you can see the same symptoms of large CPU usage on another delayed Secondary, could you please upload logs and diagnostic.data for both the delayed node and the Primary for that shard?

Comment by Bruce Lucas (Inactive) [ 25/Sep/19 ]

Summary (from above and as discussed in the meeting): for some reason the delayed secondary stops updating its timestamps, causing data to be pinned, which causes the CPU symptom the customer observed. This was not happening before the node was restarted, but began happening repeatedly afterwards.

Regarding the "Restarting oplog query due to error: CursorNotFound" behavior specifically, we see that happening repeatedly both before and after the restart:

I think the reason for this is illustrated by the cycle from A to B.

  • At A we establish a new oplog query ("readersCreate") and fetch some oplog entries ("getmores num") until we fill the replication buffer ("repl buffer sizeBytes").
  • Then from A to B we are applying ops and draining the buffer, but not fetching any new data (note that "repl buffer sizeBytes" is declining from A to B), I imagine because the repl buffer is "full enough".
  • At B we hit a lower bound threshold (I hypothesize) and decide to fetch more bytes. But by this time since we haven't used the oplog query cursor for several hours since A it has timed out on the primary (I hypothesize), so we get CursorNotFound and have to restart the oplog query.

This cycle happens repeatedly, but since that is happening both during the good state before the restart at C and after, this is not itself the cause of the problem after the restart.
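
If it helps confirm the timeout hypothesis, the primary's idle-cursor timeout can be read with a standard getParameter call (a sketch; the value shown is the usual default):

    db.adminCommand({ getParameter: 1, cursorTimeoutMillis: 1 })
    // e.g. { "cursorTimeoutMillis" : NumberLong(600000), "ok" : 1 }  // 10 minutes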

The underlying issue is that for some reason, after the restart at C, we are no longer regularly updating our timestamps ("set timestamp calls") even though we are regularly applying batches. However, we do occasionally update the timestamp (e.g. at D and E), and as you noted this occurs at the same time that we fetch some more oplog entries (as described above).
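
For reference, "set timestamp calls" is a WiredTiger transaction statistic; a shell sketch for reading it directly, assuming the field path matches how FTDC surfaces it:

    // Run on the delayed secondary; the counter should increase as timestamps are updated.
    db.serverStatus().wiredTiger.transaction["set timestamp calls"]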

Interestingly, if we zoom in on the few seconds after the restart, we see something similar:

From C to D there is a burst of oplog application because the node has been down for a bit and has to catch back up to its 5-day delay target. During this time we are in fact updating our timestamps and are not pinning data. Because we're applying ops faster than in steady state, we are also doing regular getmores. But at D we've finished "catching up" and no longer need to do regular getmores for a while (a couple of hours, actually). This is also when we stop updating our timestamps, even though we are still regularly applying ops and batches.

This is the same behavior that we see later: when we next need to do getmores, we also update the timestamps.

To summarize it looks like

  • after the restart we appear to only be updating timestamps when we are doing getmores to replenish the buffer, even though we are actively applying ops and batches
  • but before the restart we were updating the timestamps even when we didn't need to do getmores for several hours but were actively applying ops that we had buffered

I couldn't spot what might have changed after the restart to trigger this behavior. Maybe some race condition? Suggest forwarding to the replication team.

Comment by Aristarkh Zagorodnikov [ 17/Sep/19 ]

Attaching logs: mongod-new-logs-2.zip and metrics (11th September got rotated out): metrics-csrs-primary-2.zip

Comment by Danny Hatcher (Inactive) [ 16/Sep/19 ]

Thank you for recovering the logs; they do appear complete. It would be very helpful to retain the default MongoDB logs for the duration of this investigation, but I understand if that's not possible.

I can see that the logical session cache is attempting a refresh on the secondary, but it seems to do nothing until the node shuts down a day later. The config metrics and logs that you provided are from the original Primary of the config replica set, but right at the start of the problem (09-12T08:31) that node is restarted and "c2.fs.drive.bru" becomes the Primary. Do you have the logs and metrics for that node as well?

Comment by Aristarkh Zagorodnikov [ 13/Sep/19 ]

I converted the logs from our system to something that looks like raw logs: mongod-new-logs.zip

Please note that the timestamps differ: most of the times mentioned here are local (UTC+3), but the logs are in UTC.

Comment by Aristarkh Zagorodnikov [ 13/Sep/19 ]

Daniel, that's my bad: slave delay is enabled and is 5 days ("slaveDelay" : NumberLong(432000)). I turned it off while diagnosing, my colleague turned it back on (we use this node as a fat-finger DR copy), but I forgot about it and assumed it was still off. In any case, it was turned on before the restart. It's been about 35 hours since the restart and the CPU still spikes.
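
For reference, the configured delay can be confirmed from the shell (member index 2 is illustrative):

    rs.conf().members[2].slaveDelay   // NumberLong(432000) == 5 days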

Attaching shard primary metrics: metrics-shard-primary.zip and CSRS primary metrics: metrics-csrs-primary.zip.

I'll try to get logs, but we do not store raw logs for more than 24 hours and ship them to a log processing facility which has no easy way of recovering raw logs.

Comment by Danny Hatcher (Inactive) [ 13/Sep/19 ]

Can you please confirm that there is no slaveDelay configured in the replica set config? I see a consistent 5-day "lag" for this node, which would typically indicate a 5-day slaveDelay.

Once the node restarted, it triggered a full refresh of the sharding metadata. 24 hours does seem like a long time for that refresh to be running, but if the sharded collection is incredibly large it could take a while. Could you please provide the following, covering the period from the restart until now?

  • mongod log + diagnostic.data for the node having the problem
  • mongod log + diagnostic.data for the shard Primary
  • mongod log + diagnostic.data for the config Primary
Comment by Aristarkh Zagorodnikov [ 13/Sep/19 ]

It appears that the issue has nothing to do with the upgrade or with slave delay. Simply reloading (restarting) the service may put it into this "bad" state.

I've attached mongo-metrics.zip for the specified interval (see graph below). The server was restarted at 11:31 and the issue was back again. There is no slaveDelay configured this time, and no upgrade (4.0.12 before and after the restart).

Comment by Aristarkh Zagorodnikov [ 28/Aug/19 ]

Okay.

Comment by Danny Hatcher (Inactive) [ 28/Aug/19 ]

I can see a rise in CPU but the timeframe in the diagnostic.data provided is too short to draw any real conclusions as to a root cause. I propose that we close this ticket as Incomplete and when you encounter the problem again you can leave a comment here with the relevant data.

Comment by Aristarkh Zagorodnikov [ 28/Aug/19 ]

Unfortunately we encountered this issue more than a week ago, and the node is already resynced and slave delay is turned off (we’re going to try turning it on a bit later), so there are no relevant FTDC data files that we can provide.

Comment by Danny Hatcher (Inactive) [ 27/Aug/19 ]

Thank you for providing those files. In order for us to compare MongoDB's internal metrics over time, could you please attach a few hours of diagnostic.data covering before the upgrade of the node and a few hours after the upgrade?
