[SERVER-36346] Large memory consumption per pending getmore Created: 30/Jul/18  Updated: 29/Oct/23  Resolved: 26/Oct/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 5.2.0

Type: Improvement Priority: Major - P3
Reporter: Bruce Lucas (Inactive) Assignee: Mohammad Dashti (Inactive)
Resolution: Fixed Votes: 4
Labels: query-director-triage
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File comparison.png     Zip Archive diagnostic.data-prealloc.zip     Zip Archive diagnostic.data.no-prealloc.zip     PNG File no-prealloc-memory-usage.png     PNG File prealloc-memory-usage.png    
Issue Links:
Depends
depends on SERVER-60137 Improve Buffer Allocation for Common ... Closed
depends on SERVER-60138 Improve the allocated memory for getM... Closed
Related
Backwards Compatibility: Fully Compatible
Sprint: Query 2020-03-09, Query 2020-03-23, Query 2020-04-06, Query 2020-04-20, Query 2020-05-04, Query 2020-05-18, Query 2020-06-01, QE 2021-09-20, QE 2021-10-04, QE 2021-10-18, QE 2021-11-01
Linked BF Score: 135

 Description   

For getmores we allocate a 32 MB buffer for the response up front, which remains allocated while the getmore is pending. (This is the 16 MB BSON maximum document size, plus some amount for the response message overhead, rounded up to 32 MB by the buffer allocator.) This can add up to a lot of memory overhead if there are a large number of pending getmores, for example if you have a lot of oplog tails or change streams.

Can this be improved? For example maybe we don't need to allocate for the worst case up front? Even the simple expedient of teaching the buffer allocator that it doesn't always need to stick to powers of two would improve this by a factor of two.
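
For illustration, a minimal standalone sketch (a toy next-power-of-two growth policy with a hypothetical overhead constant, not the actual BufBuilder code) of how a ~16 MB reservation ends up as a 32 MB allocation:

    // Toy model: the getmore reply buffer needs the 16 MB BSON limit plus a
    // little message overhead, and rounding that request up to the next power
    // of two doubles it to 32 MB.
    #include <cstddef>
    #include <iostream>

    std::size_t nextPowerOfTwo(std::size_t n) {
        std::size_t p = 1;
        while (p < n)
            p <<= 1;
        return p;
    }

    int main() {
        const std::size_t kMaxBSONObjSize = 16 * 1024 * 1024;  // 16 MB document limit
        const std::size_t kReplyOverhead = 16 * 1024;           // hypothetical message overhead
        const std::size_t requested = kMaxBSONObjSize + kReplyOverhead;
        std::cout << "requested: " << requested << " bytes\n"                      // 16793600
                  << "allocated: " << nextPowerOfTwo(requested) << " bytes\n";     // 33554432 (32 MB)
    }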



 Comments   
Comment by Mohammad Dashti (Inactive) [ 05/Nov/21 ]

In HELP-7195 the customer uses the Meteor framework to accomplish something similar to change streams. They were watching 800 collections for updates this way, resulting in 27 GB of memory allocated for the getmores.

800 * 32MB = 25.6 GB (out of the reported 27 GB) should've been related to the buffers allocated for pending getMores on tailable cursors. These buffers were likely retained for an extended period of time, waiting for a change to happen so it could be reported back to the user. Now, with the change in this ticket (and all related tickets), we hold at most a 16MB buffer, and only for a very limited amount of time (i.e., from the moment we have at least a single record to return until we return it to the user, which is usually sub-second).

Basically, for N pending getMore requests on tailable cursors, we were allocating N*32MB for the whole period until a change occurred on the watched collections. A framework like Meteor keeps these N cursors open constantly, so N*32MB stayed reserved indefinitely. With the change in this ticket, almost nothing is reserved while a getMore is pending.

Comment by Mohammad Dashti (Inactive) [ 26/Oct/21 ]

Author:

{'name': 'Mohammad Dashti', 'email': 'mdashti@gmail.com', 'username': 'mdashti'}

Message: SERVER-60138 Improved the allocated memory for the `getMore` commands.
Branch: master
https://github.com/mongodb/mongo/commit/f6d975343b9f919b61e6d1b4e7333fe0ecfe50c4

Comment by Bruce Lucas (Inactive) [ 18/May/20 ]

Tip: it's much easier to make comparisons if the two runs are plotted on the same charts. If they are separated by a substantial gap in time you can use the timeline "condensed" option to remove the gap so they both occupy about half of the width.

Left is new, without prealloc; right is old, with prealloc

Generally speaking, this is a very light load with very little concurrency. A heavier load with more concurrency may or may not look different.

                        left, new            right, old
                        no  prealloc         w/ prealloc
 
read latency            1.5-2x higher
scavenge, decommit                           higher, as expected
insegs                  higher (!?)
retrans                                      higher (but insignificant)

  • Read latency as measured at the server is higher without prealloc. However, the timer for read latency begins after we have allocated the 32 MB response buffer, so the "old" numbers don't include the time required for allocating the buffer whereas the "new" numbers do; this could be misleading. What did you see for latency at the client?
  • Scavenge and decommit rates are better without prealloc. This is expected, as tcmalloc drives its rate of scavenging based on the rate at which bytes are returned to the pageheap, and the huge buffers we're allocating for getmores can be a significant driver of that and can cause notable performance problems under heavier loads. So this is good, although this test was not stressful enough to see any improvement for the application as a result.
  • TCP input segments are higher without the change. Not really sure why; maybe just due to the artificial nature of the test?
  • Retransmits are lower with the change, though not a particularly significant number to begin with. Also maybe just an artifact of the synthetic test.

Comment by Mathias Stearn [ 17/Apr/20 ]

In addition to tailable cursors, we should also consider whether there is any batchSize restriction in the getMore request. It is also wasteful to allocate the giant buffer when the client is asking for only a few documents at a time.

Additionally, we should probably add a special case to https://github.com/mongodb/mongo/blob/05eb5a55e9a7f50db04c0a7a12e6cd06248bbab5/src/mongo/bson/util/builder.h#L404-L416, so that whenever we would allocate 16MB, we allocate 16MB + ~64KB instead. The 0.5% overallocation will frequently avoid the 99.5% overallocation we do today, which seems like a good tradeoff. Also, while we are in there, it is probably worth downsizing the allocation size by sizeof(SharedBuffer::Holder) (as long as it is still > minSize) to counteract the extra space requested at https://github.com/mongodb/mongo/blob/master/src/mongo/util/shared_buffer.h#L70, which is probably bumping us up to a higher TCMalloc size class.
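
A hedged sketch of that growth special case (standalone, with made-up constants; not the actual builder.h grow logic):

    #include <algorithm>
    #include <cstddef>

    constexpr std::size_t kMaxBSONObjSize = 16 * 1024 * 1024;
    constexpr std::size_t kSlop = 64 * 1024;       // ~64 KB for reply overhead
    constexpr std::size_t kHolderSize = 16;        // stand-in for sizeof(SharedBuffer::Holder)

    // Returns the capacity to allocate when the builder must grow to hold at
    // least minSize bytes, given its current capacity.
    std::size_t chooseGrowSize(std::size_t minSize, std::size_t currentCapacity) {
        std::size_t grown = std::max(minSize, currentCapacity * 2);
        // Special case: a request at or just past the 16 MB boundary gets
        // 16 MB + slop, so a max-size document plus message overhead fits
        // without a later doubling to 32 MB (~0.5% over instead of ~100%).
        if (grown >= kMaxBSONObjSize && grown <= kMaxBSONObjSize + kSlop)
            grown = kMaxBSONObjSize + kSlop;
        // Shave off the Holder bytes that SharedBuffer adds on top, as long as
        // the result still covers minSize, to stay in the lower size class.
        if (grown - kHolderSize >= minSize)
            grown -= kHolderSize;
        return grown;
    }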

Comment by Charlie Swanson [ 13/Aug/18 ]

david.storch I'd actually like to nominate this ticket for quick wins. After investigating a little more, it looks like this should be pretty doable.

It looks like the getMore command overrides Command::reserveBytesForReply() to specify that we should reserve ~16MB. At first glance it looks like we could move this method to the parsed Invocation of the command, which has access to a GetMoreRequest. I think there's only one call site of reserveBytesForReply(), in service_entry_point_common.cpp. If we have the GetMoreRequest, we could conditionalize the number of bytes to preallocate based on whether the cursor is tailable. If the cursor is tailable we usually wouldn't expect to need 16MB, since we'll only be getting the newest data. Does that sound crazy? It seems like it would certainly be a win for change streams, which I'd expect to usually require far less than 16MB. For other tailable cursors it seems possible that there are a substantial number of full 16MB batches before the tailable part, but I'm not sure what better heuristic we could ever use, or how noticeable a regression like that would be.
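
A rough sketch of that idea (stand-in types and a made-up small reserve size, not the real Invocation interface):

    #include <cstddef>

    // Stand-in for the parsed getMore request; the real Invocation already
    // has a GetMoreRequest with this information.
    struct GetMoreRequest {
        bool tailable = false;
    };

    std::size_t reserveBytesForReply(const GetMoreRequest& req) {
        constexpr std::size_t kMaxBSONObjSize = 16 * 1024 * 1024;
        constexpr std::size_t kSmallReserve = 64 * 1024;  // made-up small default

        // Tailable cursors (oplog tails, change streams) usually return only
        // the newest documents, so skip the worst-case ~16 MB preallocation.
        return req.tailable ? kSmallReserve : kMaxBSONObjSize;
    }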

Moving this back to 'Needs Triage' to discuss.
