[SERVER-31177] MongoDB consumes all free memory, leading to throttled replication Created: 20/Sep/17  Updated: 17/Oct/17  Resolved: 25/Sep/17

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.4.1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Max Bennedich Assignee: Mark Agarunov
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File mongodb-throttling-3.PNG     PNG File mongodb-throttling-5.PNG     PNG File mongodb-throttling-6.PNG     PNG File mongodb-throttling-7.PNG    
Issue Links:
Duplicate
duplicates WT-2670 Inefficient I/O when read full DB (po... Closed
Operating System: ALL
Participants:

 Description   

We have a small three member replicaset running MongoDB 3.4.1:

  • Primary. Physical local server, Windows Server 2012, 64 GB RAM, 6 cores. Hosted in Scandinavia.
  • Secondary. Amazon EC2, Windows Server 2016, r4.2xlarge, 61 GB RAM, 8 vCPUs. Hosted in Germany.
  • Arbiter. Tiny cloud-based Linux instance.

The WiredTiger cache has its default size of around 50% of RAM, so the mongod process consumes around 32 GB in our case. Additionally, over time, MongoDB uses up all free memory via the filesystem cache (memory mapping). This is expected behavior AFAIK.

However, what we are seeing is that once the amount of available memory on the server drops below 1-4% (by Windows' definition of "Available memory"), replication from the primary to the secondary instance is throttled/capped at just over 20 MBit/s. That is, replication never goes above that speed, and if there is more data to replicate, it queues up and results in replication lag.

This is not a pure bandwidth issue; for example while throttling is taking place, we can transfer data between the two servers over FTP at far more than 20 MBit/s.

To prove that low memory is causing this throttling, we ran a small script that allocated and freed around 10 GB of memory on the server. Since there was almost no available memory, this memory was reclaimed mainly from the filesystem cache, and in part from the mongod process. The throttling stopped immediately, as shown in the attached screenshot, and replication occurred at full speed. This lasted for around 3 days, until MongoDB had again consumed all free memory through the filesystem cache, at which point replication was throttled once more.

This is 100% reproducible, and it is in fact the workaround we are currently resorting to in order to avoid replication lag.
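
For illustration, such an allocate-and-free pass can be as simple as the following Java sketch (this is not the exact script we used; the chunk size, pacing and class name are illustrative). It has to be run with a heap limit above the target amount, e.g. -Xmx12g.

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative only: allocates ~10 GB of heap in chunks, touches every page so
    // Windows has to reclaim physical memory from the filesystem cache, then drops
    // all references so the memory becomes available again.
    public class MemoryPressure {
        public static void main(String[] args) throws InterruptedException {
            final int chunkMb = 256;       // allocate in 256 MB chunks
            final int totalMb = 10 * 1024; // ~10 GB in total
            List<byte[]> chunks = new ArrayList<>();
            for (int allocated = 0; allocated < totalMb; allocated += chunkMb) {
                byte[] chunk = new byte[chunkMb * 1024 * 1024];
                for (int i = 0; i < chunk.length; i += 4096) {
                    chunk[i] = 1;          // touch each 4 KB page so it is actually backed by RAM
                }
                chunks.add(chunk);
                Thread.sleep(1000);        // allocate gradually rather than in one burst
            }
            chunks.clear();                // release everything; whether the JVM returns the heap
            System.gc();                   // to the OS promptly depends on the collector settings
        }
    }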

We haven't found a way to configure the amount of memory used for memory mapping, and our current thinking is that this must be a bug within MongoDB. We haven't found anything useful in the logs explaining why the throttling takes place, and we tried looking for duplicates of this bug without success.



 Comments   
Comment by Max Bennedich [ 25/Sep/17 ]

The upgrade to 3.4.9 indeed solved the problem! The server memory consumption stays at a constant ~55%. Thanks for looking into this so quickly.

Comment by Mark Agarunov [ 21/Sep/17 ]

Hello mbl54,

Thank you for the information. I'll leave this ticket open until you can confirm that this has resolved the issue.

Thanks,
Mark

Comment by Max Bennedich [ 21/Sep/17 ]

Thanks for the update! I have upgraded our members to version 3.4.9. In the past, it has taken a few days for all memory to be consumed, so I will probably need until next week before I can tell you whether this solved the problem.

Comment by Mark Agarunov [ 20/Sep/17 ]

Hello mbl54,

Thank you for providing this information. Looking over this, I suspect this may be due to WT-2670, which can cause the system to not automatically unmap memory once it has been used. Luckily this has been fixed in MongoDB 3.4.2, so my recommendation would be to upgrade to the latest version of MongoDB if possible and see if this resolves the issue.

Thanks,
Mark

Comment by Max Bennedich [ 20/Sep/17 ]

Thanks for looking into this! I am attaching logs and additional screenshots from an event of interest at 2017-09-18 22:40 CEST.

By 2017-09-18, we had experienced replication throttling for around 3 weeks nonstop (which is when we last restarted MongoDB).

At 2017-09-18 22:26 we started a Java program which slowly allocated 20 GB of memory on the primary instance. This memory was all released at 22:38. The throttling stopped immediately (you can see in the graph that we had speeds well above 20 MBit/s already at 22:42).

Note: You may see some fluctuations in network traffic in the graph, and change in traffic pattern in the logs, between 2017-09-18 13:10 and 2017-09-18 15:01. During this period we switched the primary over to the AWS instance.

Also a note about time zones:

  • Attached graphs and timestamps in this message are given in CEST time zone.
  • All MongoDB log time stamps are one hour earlier, i.e. the event described above at 22:40 should be looked for around "21:40:00.000+0100".

Comment by Mark Agarunov [ 20/Sep/17 ]

Hello mbl54,

I've generated a secure upload portal so that you can send us this data privately. Files uploaded to the portal can only be accessed by MongoDB.

Thanks,
Mark

Comment by Mark Agarunov [ 20/Sep/17 ]

Hello mbl54,

Thank you for the report. To get a better idea of why the memory usage is limiting replication speed, could you please provide the following:

  • The complete logs from all mongod instances when this issue is present,
  • An archive (tar or zip) of the $dbpath/diagnostic.data directory from all mongod instances.

This should give us some insight into what may be causing this.

Thanks,
Mark
