[SERVER-44740] huge oplog configuration causes memory use to grow without bound Created: 19/Nov/19  Updated: 10/Apr/20  Resolved: 09/Apr/20

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Keith Bostic (Inactive) Assignee: Rachelle Palmer
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Zip Archive diag.zip    
Issue Links:
Duplicate
is duplicated by SERVER-44646 Test oplog stone behavior with very v... Closed
Operating System: ALL
Sprint: Execution Team 2019-12-16
Participants:
Case:

 Description   

In a customer case, configuring the oplog to 3 TB resulted in memory usage growing over time, apparently without bound.

Reducing the oplog size resolved the customer issue, but doing some testing around huge oplog configurations seems warranted.

EDIT:
Apparently the size of the oplog was not actually reduced; rather, metrics show the application reduced the amount of oplog generated per hour (GB/hr). This raises the possibility that oplog deletion is lagging behind, and that it is not the size of the oplog itself that is the principal concern.
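
(For reference, a minimal mongo shell sketch of checking the configured vs. actual oplog size and, on 3.6+, shrinking it online; on 3.4 the size is fixed at startup via replication.oplogSizeMB. Field names assume the standard collStats output for the capped local.oplog.rs collection, and the ~50 GB target below is purely illustrative.)

    // Inspect configured vs. actual oplog size (bytes -> GB).
    var stats = db.getSiblingDB("local").oplog.rs.stats();
    print("configured maxSize (GB): " + (stats.maxSize / Math.pow(1024, 3)).toFixed(1));
    print("current data size (GB):  " + (stats.size / Math.pow(1024, 3)).toFixed(1));

    // On 3.6+ the oplog can be resized online; size is in MB (here ~50 GB).
    // Not available on 3.4, where replication.oplogSizeMB is fixed at startup.
    // db.adminCommand({ replSetResizeOplog: 1, size: 51200 });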



 Comments   
Comment by Bruce Lucas (Inactive) [ 20/Dec/19 ]

So yeah, no sign of the leak in this repro - "allocated minus wt cache" remains steady.

Maybe smaller entries and/or 2-node repl set for another try?
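
("Allocated minus wt cache" above is tcmalloc's currently allocated bytes minus the bytes held in the WiredTiger cache. A minimal mongo shell sketch for pulling both numbers, assuming the standard tcmalloc and wiredTiger sections of serverStatus are present:)

    // Compute "allocated minus wt cache" from serverStatus.
    var ss = db.serverStatus();
    var allocated = ss.tcmalloc.generic.current_allocated_bytes;          // heap bytes tcmalloc has handed out
    var wtCache   = ss.wiredTiger.cache["bytes currently in the cache"];  // WT cache contents
    print("allocated minus wt cache (GB): " +
          ((allocated - wtCache) / Math.pow(1024, 3)).toFixed(2));

(FTDC samples serverStatus periodically, so the same two counters can be tracked over time.)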

Comment by Eric Milkie [ 20/Dec/19 ]

diag.zip Attaching new diagnostic data from a longer run.

Comment by Eric Milkie [ 16/Dec/19 ]

Thanks Bruce. This was indeed compiled locally from the 3.4.23 tag. I ran a single-node replica set, which I figured would have the same oplog behavior apart from the read load (a standalone would not have an oplog at all).
It's an interesting observation about the oplog entry sizes; I could try trimming those down. I didn't keep the experiment running, but I can run it longer now and provide more ftdc data.

Comment by Bruce Lucas (Inactive) [ 16/Dec/19 ]

Thanks Eric. There does seem to be a very slight increase in allocated minus cache; I'll be interested to see the results after running for a few days. Can you attach the latest ftdc data?
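
(FTDC is written continuously to the diagnostic.data directory under the server's dbpath. A minimal sketch for confirming capture is enabled and locating the files; the storage.dbPath lookup assumes the path was set explicitly on the command line or in the config file:)

    // Confirm FTDC capture is enabled and locate the diagnostic.data directory.
    printjson(db.adminCommand({ getParameter: 1, diagnosticDataCollectionEnabled: 1 }));
    var opts = db.adminCommand({ getCmdLineOpts: 1 });
    print("ftdc files under: " + opts.parsed.storage.dbPath + "/diagnostic.data");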

I spotted a couple of differences between this and the customer issue, significance unknown:

  • Customer was running 3.4.23, while this build is identified as 0.0.0 - was it built from the 3.4.23 codebase?
  • Customer was running a replica set, while this looks like a standalone? Given that the issue only happened on the primary of the customer's replica set and not on the secondary, it seems possible that it might not reproduce on a standalone either.
  • Average oplog entry size for the customer was about 79 kB, vs. about 1 MB for the repro.

Comment by Eric Milkie [ 13/Dec/19 ]

diag.zip Adding diagnostics from my running instance.

Comment by Eric Milkie [ 12/Dec/19 ]

That's a good idea, I'll rerun and collect that for you.

Comment by Bruce Lucas (Inactive) [ 12/Dec/19 ]

milkie, do you have the ftdc data from this run? I'd be interested in comparing it with the data from the customer that hit this issue.

Comment by Eric Milkie [ 12/Dec/19 ]

I set up a 3.4.23 server with a 3 GB oplog, then set up several shell workloads to fill up the oplog. I ran it for a couple of days and ran the VTune memory allocation analyzer on it. Unfortunately, I was unable to reproduce any heap memory growth.
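
(A minimal sketch of the kind of shell workload described here: repeatedly inserting ~1 MB documents so the capped oplog keeps wrapping. The database/collection names, payload size, and loop count are illustrative; shrinking the payload to ~79 kB would better match the customer's average entry size.)

    // Oplog-churning workload: insert ~1 MB documents in a loop so a small
    // (e.g. 3 GB) oplog rolls over continuously.
    var coll = db.getSiblingDB("oplogtest").fill;
    var payload = new Array(1024 * 1024).join("x");   // ~1 MB string
    for (var i = 0; i < 100000; i++) {
        coll.insert({ _id: i, pad: payload });
        if (i % 1000 === 0) {
            print("inserted " + i + " docs");
        }
    }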

Comment by Bruce Lucas (Inactive) [ 20/Nov/19 ]

Is this a duplicate of SERVER-44646 (or vice versa)?
