[SERVER-17561] Fragmentation/non-linearity in mmapv1 capped collections

Created: 12/Mar/15  Updated: 06/Dec/22  Resolved: 14/Sep/18

| Status: | Closed |
| Project: | Core Server |
| Component/s: | MMAPv1 |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug |
| Priority: | Major - P3 |
| Reporter: | Kevin Pulo |
| Assignee: | Backlog - Storage Execution Team |
| Resolution: | Won't Fix |
| Votes: | 3 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Assigned Teams: | Storage Execution |
| Operating System: | ALL |
| Participants: | |
| Description |
Overview

Capped collections (mmapv1) do not necessarily store documents linearly (both within extents, and from extent to extent). Aside from the substantial shock to the "ring buffer" mental model commonly used for capped collections, this directly causes poor performance in (at least) two regards:
The situation is worsened by severely heterogeneous document sizes, and occurs least when every document is exactly the same size. The issue is important because, for replication, the oplog is a single point of contention that can't be worked around.

Impact

Prior to the … The fix for … Unfortunately, as the results below show, although the fix in …

Cause?

I believe the reason for this is that CappedRecordStoreV1::allocRecord (previously cappedAlloc) uses the freelist to find a suitably sized record. After the first pass through, if none is found, then a document is removed from the front of the capped collection (in natural order), and it tries again. But the found record might be much earlier or later in the extent, or in another extent completely (perhaps — there does seem to be some code guarding against this, in at least some cases). The approach of removing documents in natural order to "make space" seems to assume they are in a strict linear order. However, this isn't the case, because the space near the end of an extent can only be used by small documents (larger documents must go in the next extent).

Suggested fix

One possible approach to fix this is for MMAPv1 capped collections to have a purely linear allocation strategy (even after the collection wraps around). For example, an algorithm something like the following (a rough sketch is given after the pros and cons below):
Pros:
Cons:
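To make the suggestion concrete, here is a minimal sketch of what a purely linear allocator could look like (illustrative only; the LinearCappedAllocator name and the evict callback are assumptions, not MMAPv1 internals or a finished design):

```js
// Hypothetical sketch of a purely linear ("bump") allocator for a capped
// collection.  Names and structure are illustrative assumptions.
class LinearCappedAllocator {
  constructor(extentSizes) {
    // Only the extent sizes and a single cursor (extent index + offset)
    // are needed; there is no freelist at all.
    this.extents = extentSizes.slice();
    this.extentIdx = 0; // extent currently being filled
    this.offset = 0;    // next free byte within that extent
  }

  // Allocate space for a record of `len` bytes, always at the cursor.
  // `evict(extentIdx, offset, len)` is a caller-supplied callback that
  // deletes (in natural order) the old documents overlapping the region
  // about to be reused.
  alloc(len, evict) {
    if (this.offset + len > this.extents[this.extentIdx]) {
      // The record does not fit in the tail of this extent: skip the
      // remainder and wrap to the start of the next extent.
      this.extentIdx = (this.extentIdx + 1) % this.extents.length;
      this.offset = 0;
    }
    // Because the cursor only ever moves forward, the documents being
    // overwritten are exactly the physically-next (and oldest) ones.
    evict(this.extentIdx, this.offset, len);
    var loc = { extent: this.extentIdx, offset: this.offset };
    this.offset += len;
    return loc;
  }
}
```

With a scheme of this shape, natural order and physical order always coincide, so the "ring buffer" mental model holds exactly; the price is the wasted space at the end of an extent whenever the next document does not fit there.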
|
| Comments |
| Comment by Kevin Pulo [ 16/Feb/17 ] |
Determining if a capped collection is affected

Approach 1: freelist length
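This is the deletedCount-based check described in more detail in the 16/Mar/16 comment below. A minimal way to run it from the mongo shell might look like the following (the collection name and the threshold of 20 are placeholders; validate with full: true scans the whole collection, so run it on a secondary or during quiet time):

```js
// Inspect the freelist length via the validate command.
var res = db.getSiblingDB("local").runCommand({ validate: "oplog.rs", full: true });
print("freelist length (deletedCount): " + res.deletedCount);
if (res.deletedCount > 20) {
  print("long freelist - the capped collection may be badly fragmented");
} else {
  print("short freelist - but note this does not rule out fragmentation");
}
```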
Approach 2: capped collection layout analysis
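A sketch of how such a layout analysis can be done from the shell, assuming the 2.6-era showDiskLoc() cursor helper and 512MB data files (smallfiles=true, as in the repros below; the oplog is used here only as an example collection). It walks the collection in $natural order and counts backward jumps in the linearised on-disk position:

```js
// Walk the capped collection in $natural order and look for places where
// the on-disk position jumps backwards (i.e. the layout is not linear).
var FILE_SIZE = 512 * 1024 * 1024;  // adjust if smallfiles is not in use
var coll = db.getSiblingDB("local").getCollection("oplog.rs");
var prevPos = -1, discontinuities = 0, docs = 0;
coll.find().sort({ $natural: 1 }).showDiskLoc().forEach(function (doc) {
  var pos = doc.$diskLoc.file * FILE_SIZE + doc.$diskLoc.offset;
  if (prevPos >= 0 && pos < prevPos) {
    discontinuities++;   // a "jump" back to an earlier position on disk
  }
  prevPos = pos;
  docs++;
});
print(docs + " documents, " + discontinuities + " discontinuities");
```

A healthy capped collection that has wrapped shows at most one such backward jump (the wrap-around point); a heavily fragmented one shows many.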
Regenerating the capped collection
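For a non-oplog capped collection, the general shape of the procedure is to copy the documents, in natural order, into a freshly created capped collection of the same size and then swap it into place. A sketch, with placeholder names and size, and assuming writes to the collection are stopped while it runs:

```js
// Sketch: rebuild a fragmented capped collection by copying it, in natural
// order, into a freshly created capped collection of the same size.
var dbh = db.getSiblingDB("mydb");
dbh.createCollection("mycapped_new", { capped: true, size: 100 * 1024 * 1024 /* same size as the original */ });
dbh.mycapped.find().sort({ $natural: 1 }).forEach(function (doc) {
  dbh.mycapped_new.insert(doc);
});
// Swap the rebuilt collection into place (the second argument drops the old one).
dbh.mycapped_new.renameCollection("mycapped", true);
```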
Alternatively, if the capped collection is the oplog, then it is possible to instead follow the procedure to Change the Size of the Oplog — but when re-creating the oplog collection, the size can be kept the same.
| Comment by Kevin Pulo [ 16/Mar/16 ] |
|
I have done some further investigation into this issue with some simple workloads, including some amount of looking at how the problem worsens over time.

Another indicator of this problem is a long freelist in the capped collection, which can be found by looking at the deletedCount field in the output of the validate command. In a nominally performing capped collection this value is low (2-4), but if it is high (e.g. over 20), then the capped collection is suffering from fragmentation. (Note that the converse does not necessarily hold, i.e. it is possible to have a capped collection which is highly fragmented, yet with a short freelist.) (EDIT: This approach is unlikely to work, see Case 4 below.)

The repros below were run on MongoDB 2.6.3, with a 100MB capped collection. 100 shells were used to insert documents with a spread of sizes. Every 5 minutes, the incoming operations were temporarily stopped so as to get a consistent view of the state of the capped collection.

The animations contain charts similar to those earlier in this ticket. The y-axis is the diskLoc offset into the capped collection. The x-axis is the document number in $natural order, "rotated" so that the "physically first" document (the one with the lowest diskLoc) is at x=0 (which gives a more stable view of the physical layout of the documents in the collection). Documents are joined with lines, so that an ideal capped collection is a diagonal line emanating from (0,0), and fragmentation appears wherever the line is vertical, i.e. a "discontinuity" or "jump" in the diagonal line. The animations are particularly good at showing the progression of fragmentation over time, and how the quality of the capped collection layout degrades.

Case 1: Small documents
Case 2: Large documents
Case 3: Growing documents
Repro details
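The workload described above was many parallel shells inserting documents with a spread of sizes into a 100MB capped collection. A rough sketch of one such shell loop (the field names and the exact size distribution are assumptions):

```js
// Sketch of the Case 1-3 style workload: insert documents with a spread of
// sizes into a 100MB capped collection.  Run this loop from many shells in
// parallel (the ticket used 100).
var dbh = db.getSiblingDB("test");
if (dbh.getCollectionNames().indexOf("frag") < 0) {
  dbh.createCollection("frag", { capped: true, size: 100 * 1024 * 1024 });
}
while (true) {
  // Pick a payload size anywhere between ~16 bytes and ~4KB.
  var size = Math.floor(Math.random() * 4096) + 16;
  var payload = new Array(size + 1).join("x");
  dbh.frag.insert({ ts: new Date(), payload: payload });
}
```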
Case 4: Small documents (powers of 2)

EDIT: I re-ran Case 1, but with some additional code to pad documents so that their sizes were powers of 2. This shows that the fragmentation still occurs, albeit at a slightly slower rate. This makes sense — the fundamental problem here is that when documents have different sizes, some freelist entries get skipped and then returned to later. Although padding to powers of 2 quantises the sizes, and so reduces the number of distinct document sizes, this freelist skipping still happens, and so the fragmentation still occurs.

The workaround of padding documents so that they are all always exactly the same size (where a suitable practical maximum document size can be identified) DOES work — as you would expect, the layout is always perfectly linear throughout the extents, and performance is as ideal as possible, since each incoming document displaces exactly one (perfectly sized) old document. The animation is just a diagonal line, too boring to bother showing. A rough sketch of this padding workaround is given below.
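A minimal sketch of the fixed-size padding workaround, assuming a known practical maximum document size (the 4KB target and the pad field name are placeholders):

```js
// Pad every document to the same total size before inserting it.  The
// target must be at least as large as the largest real document.
var TARGET = 4096;
function insertPadded(coll, doc) {
  var base = Object.bsonsize(doc);
  if (base > TARGET) {
    throw new Error("document larger than the padding target");
  }
  // Add a filler string so the final BSON size is constant (the exact
  // value includes a few bytes of field overhead, which is also constant).
  doc.pad = new Array(TARGET - base + 1).join(" ");
  coll.insert(doc);
}
```

Usage would be, for example, insertPadded(db.frag, { ts: new Date(), payload: payload }).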
|
| Comment by Kevin Pulo [ 12/Mar/15 ] |
Workload 2: Pure gridfs

The workload was 10 parallel processes putting a bunch of large files into the db with mongofiles. This is done twice, and then all the files are removed. Repeat.
The files were a random collection of largish files, with sizes:
Results

This is a single member replset, with a 750MB (smallfiles=true) oplog, on the same host as Workload 1. (I think that smallfiles=false doesn't solve the problem, it just means you need ~4 times the workload to hit the problem, or would hit it ~4 times less frequently.)

The results show that the oplog is initially not too bad:

But it gets worse:

And finally looks like this when the maxPasses assertion is tripped:

I also tested a similar workload which, in addition to the GridFS workload above, was inserting documents with a single field. The field started out as an empty string and grew by 1 byte with each inserted document, until it reached 32kb (and this process was repeated over and over). Although this didn't hit the maxPasses assertion, the oplog still ended up looking pretty awful:
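For reference, a sketch of that supplementary growing-field workload (collection and field names are assumptions):

```js
// A single string field that starts empty and grows by one byte per
// insert, wrapping back to an empty string after 32KB.
var coll = db.getSiblingDB("test").growing;
var len = 0;
while (true) {
  coll.insert({ s: new Array(len + 1).join("a") });
  len = (len + 1) % (32 * 1024);
}
```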
| Comment by Kevin Pulo [ 12/Mar/15 ] |
Workload 1: Aggressive

Attached is a fairly pathological reproducer (could be tweaked to be worse). The workload has two stages:
Results

I ran this on an m3.xlarge with smallfiles=true and a 750MB capped collection (so two extents: one full extent of 512MB and one with the remaining ~250MB). First I ran with 2.6.5, then upgraded to a version with …
The magenta dots (labelled "2.6.5-maxPasses") are inserts that ended early due to the "maxPasses" assertion, while the blue dots ("2.6.5-no-maxPasses") are inserts that succeeded. The green dots ("2.8.0-rc3-pre-") are from a version with …

(The gap in magenta dots around 2014-12-12 12:00 is when I briefly tried a milder workload, to see if it would still trigger the assertion.)

The mean and median of slow operations (100+ms) have possibly increased by about ~120ms as a result of …
Here is the layout of the capped collection at the labelled points in the run. The x-axis is documents in natural order. The y-axis is the linearised on-disk position (i.e. computed as $diskLoc.file * 512MB + $diskLoc.offset).
Notable features:
|