[SERVER-17561] Fragmentation/non-linearity in mmapv1 capped collections Created: 12/Mar/15  Updated: 06/Dec/22  Resolved: 14/Sep/18

Status: Closed
Project: Core Server
Component/s: MMAPv1
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Kevin Pulo Assignee: Backlog - Storage Execution Team
Resolution: Won't Fix Votes: 3
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File 0-freelist.png     PNG File 1b-freelist.png     PNG File 2b-freelist.png     PNG File figure_1_annotated.png     HTML File generate-disklocs     File poundfreelist.js     PNG File printOplogSizes-2extents-1-primed.png     PNG File printOplogSizes-2extents-2-hit.png     PNG File printOplogSizes-2extents-3-many.png     PNG File printOplogSizes-2extents-4-start-2.8.0-rc3-pre-.png     PNG File printOplogSizes-2extents-5-afterabit-2.8.0-rc3-pre-.png     PNG File printOplogSizes-2extents-6-afterlonger-2.8.0-rc3-pre-.png     PNG File printlayout-2.4.0-1-firstpass.png     PNG File printlayout-gridfs-7-newworkload-priming-outofdiskspace.png     PNG File printlayout-puregridfs-1-nearstart.png     PNG File printlayout-puregridfs-2-afterabit.png     PNG File printlayout-puregridfs-3-afteralot-crashes.png     File printlayout.js    
Issue Links:
Related
related to SERVER-6981 maxPasses assertion (on allocation fa... Closed
related to SERVER-12058 Primary should abort if encountered p... Closed
Assigned Teams:
Storage Execution
Operating System: ALL
Participants:

 Description   
Overview

Capped collections (mmapv1) do not necessarily store documents linearly (both within extents, and from extent to extent).

Aside from the substantial shock to the "ring buffer" mental model commonly used for capped collections, this directly causes poor performance in (at least) two regards:

  1. Insertion time suffers, because too many documents may need to be deleted to make room for the incoming document.
  2. The fragmented natural order causes non-linear page access, and hence more paging/IO than might otherwise be necessary.

The situation is worsened by severely heterogeneous document sizes. It occurs least when every document is exactly the same size.

The issue is important because for replication, the oplog is a single point of contention that can't be worked around.

Impact

Prior to the SERVER-6981 fix, this issue caused the insert to fail with the exception "passes >= maxPasses in NamespaceDetails::cappedAlloc". If this happened on the oplog, then prior to SERVER-12058 being fixed (in 2.6.6), the oplog was rendered inconsistent. After SERVER-12058, the server would instead fassert if the oplog insert failed.

The fix for SERVER-6981 removed the assertion, but does not address the root cause of capped collection fragmentation. The fix there is simply to keep removing documents until the new one can be inserted, potentially causing a very large number of "extra" removes. This might cause performance problems, since previously maxPasses bounded the runtime. There is also the possible (unquantified) side-effect of allowing the fragmentation to grow without bound (where previously maxPasses limited it). SERVER-6981 has 3 duplicates, most notably SERVER-15265.

Unfortunately, as the results below show, although the fix in SERVER-6981 is necessary to address the problem, it is not sufficient. (Although the results below mix versions, I also re-tested from scratch with the version that has SERVER-6981 fixed, and the results were the same.)

Cause?

I believe the reason for this is that CappedRecordStoreV1::allocRecord (previously cappedAlloc) uses the freelist to find a suitably sized record. After the first pass through, if none is found, then a document is removed from the front of the capped collection (in natural order), and it tries again. But the record found might be much earlier or later in the extent, or perhaps even in another extent entirely (there does seem to be some code guarding against the latter, at least in some cases).

The approach of removing documents in natural order to "make space" seems to assume they are in a strict linear order. However, this isn't the case, because the space near the end of extents can only be used by small documents (larger documents must go in the next extent).
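
As a rough sketch (pseudocode only, not the actual CappedRecordStoreV1 code; all helper names are placeholders), the behaviour described above looks something like:

    // Pseudocode sketch of the allocation loop described above.  The helper
    // names (searchFreelist, removeOldestRecord) are placeholders, not real
    // functions in the server.
    function cappedAllocSketch(lengthNeeded) {
        // (Before the SERVER-6981 fix, maxPasses limited the number of passes
        // through this loop.)
        while (true) {
            var loc = searchFreelist(lengthNeeded);   // look for a big-enough entry on the freelist
            if (!loc.isNull()) {
                // The entry found may be anywhere in the extent(s), not
                // necessarily "next" in natural order, which is where the
                // fragmentation comes from.
                return loc;
            }
            // No suitable deleted record: remove the oldest document (the
            // front of natural order) and try the freelist again.  The freed
            // space may be nowhere near where the new record eventually lands.
            removeOldestRecord();
        }
    }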

Suggested fix

One possible approach to fixing this is for MMAPv1 capped collections to use a purely linear allocation strategy (even after the collection wraps around). For example, an algorithm along the following lines (a rough pseudocode sketch follows the list):

  1. Find the "next" position where a new record can be stored. (i.e. at the "end" of the collection, after the final record, in natural order.)
  2. Will the incoming document fit between this point and the end of the extent?
    1. If so, then use this point as the start of the new record.
    2. If not, then use the start of the next extent as the start of the new record.
  3. Either way, traverse the existing documents from the start of the collection, removing them from the record linked lists (but not freelisting them), until the current position - start position for new record >= new record length (i.e. enough space has been cleared for the incoming record).
  4. Write the incoming record (over the top of the old data) and add it to the linked lists (i.e. just put it there, don't look up freelists).
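
A rough pseudocode sketch of the above (all helper names are placeholders, not existing functions; wrap-around at the end of the last extent and record header bookkeeping are glossed over):

    // Pseudocode sketch of the suggested linear allocation.
    function linearCappedAllocSketch(recLen) {
        // Step 1: the position just after the final record, in natural order.
        var pos = positionAfterLastRecord();
        // Step 2: if the record will not fit in the remainder of this extent,
        // start it at the beginning of the next extent instead.
        if (recLen > bytesToEndOfExtent(pos)) {
            pos = startOfNextExtent(pos);
        }
        // Step 3: unlink (but do not freelist) records from the front of the
        // collection, in natural order, until enough space has been cleared
        // for the new record.
        while (positionOfOldestSurvivingRecord() - pos < recLen) {
            unlinkOldestRecord();
        }
        // Step 4: write the new record over the old data and relink it; no
        // freelist lookup is involved.
        writeRecordAt(pos, recLen);
        return pos;
    }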

Pros:

  • Closer to the actual intention of the MMAPv1 capped allocation code (I believe).
  • Closer to the common "ring buffer" mental model.
  • Capped insertion would be constant time (well, linear in the size of the inserted document, of course).
  • Potentially much simpler code.
  • All of the messy fragmented capped collections shown in the graphs above would not occur.

Cons:

  • Some wasted space at the end of extents (worst case only ~16MB out of every 512MB (~3%)).
  • Large, invasive, and potentially hard-to-test change.


 Comments   
Comment by Kevin Pulo [ 16/Feb/17 ]

Determining if a capped collection is affected

Approach 1: freelist length

  1. Run validate() on the capped collection.
    • If the capped collection is the oplog, then this should be done while the replica set member has been temporarily restarted as a standalone.
  2. The length of the freelist is the value of the deletedCount field.
  3. The number of extents is the value of the extentCount field.
  4. If the length of the freelist is significantly more than approximately 26 times the number of extents, then this indicates fragmentation, and the capped collection should be regenerated.
  5. Otherwise, continue with Approach 2. (A mongo shell snippet that automates this check is shown after this list.)
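
For example, something like the following in the mongo shell (a sketch only; it assumes the oplog, and relies on the deletedCount and extentCount fields reported by validate() under MMAPv1):

    // Sketch of the Approach 1 check, run while the node is a standalone.
    // Adjust the collection if it is not the oplog.
    var coll = db.getSiblingDB("local").oplog.rs;
    var v = coll.validate(true);
    var ratio = v.deletedCount / v.extentCount;
    print("freelist length: " + v.deletedCount +
          ", extents: " + v.extentCount +
          ", ratio: " + ratio.toFixed(1));
    if (ratio > 26) {
        print("freelist is long relative to the number of extents: likely fragmented, consider regenerating");
    } else {
        print("freelist length looks nominal: continue with Approach 2");
    }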

Approach 2: capped collection layout analysis

  1. Stop all writes to the capped collection.
    • If the capped collection is the oplog, then this should be done while the replica set member has been temporarily restarted as a standalone.
  2. Run the following command to obtain the layout of the capped collection (on the host in question, for performance reasons). This may take a long time to complete, depending on the size of the capped collection. Adjust the start of the script to reflect the relevant capped collection, if it is not the oplog.

    mongo --quiet --port PORT --eval 'db.getSiblingDB("local").oplog.rs.find({}, { _id: 0, $diskLoc: 1 }).showDiskLoc().sort({$natural:1}).forEach(function (doc) { print( doc["$recordId"] ? doc["$recordId"].toNumber() : (doc["$diskLoc"].file * 2048*1048576 + doc["$diskLoc"].offset) ); });' | grep '^[0-9-]\+$' > layout
    

  3. Writes to the capped collection can now be resumed.
    • If the capped collection is the oplog, then the replica set member can be restarted in the replica set again.
  4. Analyse the fragmentation present in the layout of the capped collection by running the following commands (an equivalent mongo shell sketch is shown after this list).

    rmsnomaxneg='NR == 1 { prev = $1 / 1048576; t = 0; maxneg = 0; } NR > 1 { v = $1 / 1048576; diff = (v - prev); t += diff*diff; if (diff < maxneg) { maxneg = diff; } prev = v; } END { print sqrt(t - maxneg*maxneg); }'
    actual="$(cat layout | awk "$rmsnomaxneg")" ; best="$(cat layout | sort -n | awk "$rmsnomaxneg")" ; bc <<<"scale=2; $actual / $best"
    

  5. Results:
    • For a capped collection with little fragmentation, the analysis will output a value near 1.
    • Values larger than 1 indicate the degree to which the capped collection is fragmented.
    • Problems have been observed with values over approximately 60; if the value returned is larger than this, the capped collection should be regenerated.
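
For reference, the same metric can be computed directly in the mongo shell; the sketch below is intended to be equivalent to the awk pipeline above (it assumes the whole layout fits comfortably in memory):

    // Sketch of the same analysis in the mongo shell.  rmsNoMaxNeg() is the
    // RMS of the jumps between consecutive on-disk positions (in MB), with
    // the single largest backward jump removed (that jump is the expected
    // wrap-around of the capped collection).
    function rmsNoMaxNeg(vals) {
        var t = 0, maxneg = 0;
        for (var i = 1; i < vals.length; i++) {
            var diff = (vals[i] - vals[i - 1]) / 1048576;
            t += diff * diff;
            if (diff < maxneg) maxneg = diff;
        }
        return Math.sqrt(t - maxneg * maxneg);
    }
    // layout is the array of linearised positions in $natural order, i.e. the
    // numbers produced by the find() in step 2.
    function fragmentationRatio(layout) {
        var sorted = layout.slice().sort(function (a, b) { return a - b; });
        return rmsNoMaxNeg(layout) / rmsNoMaxNeg(sorted);
    }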

Regenerating the capped collection

  1. Stop all writes to the capped collection.
    • If the capped collection is the oplog, then this should be done while the replica set member has been temporarily restarted as a standalone.
  2. Use mongodump to dump a copy of the capped collection.
  3. Drop the capped collection.
  4. Use mongorestore to restore the dump of the capped collection.
  5. Writes to the capped collection can now be resumed.
    • If the capped collection is the oplog, then the replica set member can be restarted in the replica set again.

Alternatively, if the capped collection is the oplog, then it is possible to instead follow the procedure to Change the Size of the Oplog — but when re-creating the oplog collection, the size can be kept the same.

Comment by Kevin Pulo [ 16/Mar/16 ]

I have done some further investigation into this issue with some simple workloads, including some amount of looking at how the problem worsens over time.

Another indicator of this problem is a long freelist in the capped collection, which can be found by looking at the deletedCount field in the output of the validate command. In a nominally performing capped collection, this value is low (2 – 4), but if it is high (e.g. over 20), then this indicates that the capped collection is suffering from fragmentation. (Note that the converse does not necessarily hold, i.e. it is possible to have a capped collection which is highly fragmented, yet with a short freelist.)

(EDIT: This approach is unlikely to work; see Case 4 below.)
This work suggests another possible way of fixing this problem — adjust the capped record allocation strategy so that it is based on records being powers of 2, like regular collections. This would likely be less effective than completely reworking capped record allocation to not even have a freelist, but should be much less invasive and at least somewhat beneficial.

The below repros were run on MongoDB 2.6.3, with a 100MB capped collection. 100 shells were used to insert documents with a spread of sizes. Every 5 minutes, the incoming operations were temporarily stopped so as to get a consistent view of the state of the capped collection.

The animations contain charts that are similar to those earlier in this ticket. The y-axis is the diskLoc offset into the capped collection. The x-axis is the document number in $natural order, "rotated" so that the "physically first" document (the one with the lowest diskLoc) is at x=0 (which gives a more stable view of the physical layout of the documents in the collection). Documents are joined with lines, so that an ideal capped collection is a diagonal line emanating from (0,0), and fragmentation appears where the line is vertical, i.e. a "discontinuity" or "jump" in the diagonal line. The animations are particularly good at showing the progression of fragmentation over time, and how the quality of the capped collection layout degrades.

Case 1: Small documents

  • Incoming documents were sized at ~850 bytes +/- 730 bytes, i.e. ~120 bytes – 1580 bytes.
  • Collection layout over time (youtube):
  • Freelist length over time:

Case 2: Large documents

  • Incoming documents were sized at ~100kb +/- 90kb, i.e. ~10kb – 190kb.
  • Collection layout over time (youtube):
  • Freelist length over time:

Case 3: Growing documents

  • Incoming documents were initially sized at ~350 bytes +/- 330 bytes, i.e. ~20 bytes – 680 bytes.
  • Every minute the "base document size" (starting at 350 bytes) was increased by 1 byte.
  • This case shows emergent behaviour where the fragmentation spontaneously goes away after a while.
  • Collection layout over time (youtube):
  • Freelist length over time:

Repro details

  • setup.js

    db.dropDatabase()
    db.createCollection("test", { capped: true, size: 104857600 } )
    

  • genload.js

    var base = base || 128;
    var spread = spread || 0;
    var bumptime = bumptime || 60*1000;
    var bump = bump || 0;
     
    // bump = 1 makes base size increase by 1 byte per min (about 1kb every 17 hrs)
     
    var lastbump = Date.now();
    while (1) {
            var status = "run";
            try {
                    status = cat("status").trim();
            } catch (e) {
            }
            if (status == "stop") quit();
            if (status == "pause") { sleep(1000); continue; }
            db.test.insert({a:(new Array(base + Random.randInt(2*spread) - spread)).join("x")});
            var now = Date.now();
            if (now - lastbump > bumptime) {
                    // base size increases each minute
                    base += bump;
                    lastbump = now;
            }
    }
    

  • Starting a run (eg. case 1) (on same host)

    rm status ; for i in {1..100}; do mongo --eval 'base=850; spread=730' genload.js & done
    

  • Stopping a run

    echo stop > status
    

  • Pausing a run

    echo pause > status
    

  • Resuming a run

    echo run > status
    

  • grabstats (from another host)

    #!/bin/bash
     
    if [ -e stop ]; then
            rm stop
            exit
    fi
     
    host="$1"
    port="$2"
    base="$3"
    spread="$4"
     
    if [ "$host" == "" -o "$port" == "" -o "$base" == "" -o "$spread" == "" ]; then
            echo "Usage: grabstats <host> <port> <base> <spread>" 1>&2
            exit 1
    fi
     
    layoutfile="$(date "+%Y%m%d-%H%M%S").layout"
     
    ssh "$host" echo pause \> status
    sleep 1
     
    mongo --host "$host" --port "$port" printlayout.js > "$layoutfile"
     
    ssh "$host" echo run \> status
     
    sleep 5m
    exec "$0" "$@"
    

  • printlayout.js

    // wait for curop to have no inserts on db.test.getFullName()
    while ( db.currentOp({op:"insert", ns: db.test.getFullName()}).inprog.length > 0 ) sleep(100);
     
    printjson(db.stats());
    printjson(db.test.stats());
    printjson(db.test.validate(true));
     
    db.test.find({}, { _id: 0, $diskLoc: 1 }).showDiskLoc().sort({$natural:1}).forEach(printjson);
    

Case 4: Small documents (powers of 2)

EDIT

I re-ran Case 1, but with some additional code to pad documents so that they were powers of 2. This shows that the fragmentation still occurs, albeit at a slightly slower rate. This makes sense — the fundamental problem here is that when documents have different sizes, some freelist entries get skipped and then returned to later. Although padding to powers of 2 quantises the size, and so reduces the number of distinct document sizes, this freelist skipping still happens, and so the fragmentation still occurs.

The workaround of padding documents so that they are all always exactly the same size (where a suitable practical maximum document size can be identified) DOES work — as you would expect, the layout is always perfectly linear throughout the extents, and performance is as ideal as possible, since each incoming document displaces exactly one (perfectly sized) old document. The animation is just a diagonal line, too boring to bother showing.
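
A minimal sketch of that fixed-size workaround (padToFixedSize() is only an illustration, not part of any repro script; it assumes a chosen target size that every padded document fits within):

    // Sketch only: pad every document to one constant BSON size so that each
    // insert displaces exactly one old record.  The +17 allows for the _id
    // the server adds; the extra "_" field costs 8 bytes of overhead plus its
    // string contents.
    function padToFixedSize(doc, targetSize) {
        var docSize = Object.bsonsize(doc) + 17;
        if (docSize === targetSize) return doc;          // already the right size
        var padLen = targetSize - docSize - 8;
        if (padLen < 0) throw "document too large to pad to targetSize";
        doc._ = (new Array(padLen + 1)).join("x");
        return doc;
    }
    // e.g. db.test.insert(padToFixedSize({a: payload}, 2048));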

  • Incoming documents were sized at ~850 bytes +/- 730 bytes, rounded up to the next power of 2, i.e. exactly 256, 512, 1024 or 2048 bytes.
  • Collection layout over time (youtube):
  • genload.js

    var base = base || 128;
    var spread = spread || 0;
    var bumptime = bumptime || 60*1000;
    var bump = bump || 0;
    var pow2 = pow2 || false;
     
    // bump = 1 makes base size increase by 1 byte per min (about 1kb every 17 hrs)
     
    var lastbump = Date.now();
    while (1) {
            var status = "run";
            try {
                    status = cat("status").trim();
            } catch (e) {
            }
            if (status == "stop") quit();
            if (status == "pause") { sleep(1000); continue; }
            var doc = {a:(new Array(base + Random.randInt(2*spread) - spread)).join("x")};
            if (pow2) {
                    var docSize = Object.bsonsize(doc) + 17; // account for the _id that will be added
                    var targetSize = Math.pow(2, Math.ceil(Math.log(docSize)/Math.log(2)))
                    if (targetSize > docSize) {
                            if (targetSize - docSize < 8) {
                                    // extra field must be at least 8 bytes, because of overheads
                                    // so if we are too close, go up to the next power of 2
                                    targetSize *= 2;
                            }
                            doc._ = (new Array(targetSize - docSize - 8 + 1)).join("x");
                    }
            }
            db.test.insert(doc);
            var now = Date.now();
            if (now - lastbump > bumptime) {
                    // base size increases each minute
                    base += bump;
                    lastbump = now;
            }
    }
    

  • Starting a run (e.g. Case 4) (on same host)

    rm status ; for i in {1..100}; do mongo --eval 'base=850; spread=730; pow2=true;' genload.js & done
    

Comment by Kevin Pulo [ 12/Mar/15 ]
Workload 2: Pure gridfs

The workload was 10 parallel processes putting a set of large files into the db with mongofiles. This is done twice, then all the files are removed, and the cycle repeats.

while sleep 1; do
	for j in {1..10}; do
		for i in file*; do
			mongofiles --port 47017 put $i
		done &
	done
	wait
	for j in {1..10}; do
		for i in file*; do
			mongofiles --port 47017 put $i
		done &
	done
	wait
	for i in file*; do
		mongofiles --port 47017 delete $i
	done
done

The files were a random collection of largish files, with sizes:

$ ls -l file*
-rw-rw-r-- 1 ubuntu ubuntu 41096769 Dec 10 21:25 file01
-rw-rw-r-- 1 ubuntu ubuntu  1886192 Dec 16 04:14 file02
-rw-rw-r-- 1 ubuntu ubuntu  2053004 Dec 16 04:53 file03
-rw-rw-r-- 1 ubuntu ubuntu 46373628 Dec 12 03:13 file04
-rw-rw-r-- 1 ubuntu ubuntu   381256 Dec 11 05:42 file05
-rw-rw-r-- 1 ubuntu ubuntu  5512124 Dec 10 01:03 file06
-rw-rw-r-- 1 ubuntu ubuntu   247320 Dec 12 06:02 file07
-rw-rw-r-- 1 ubuntu ubuntu  3699232 Dec 12 07:11 file08
-rw-rw-r-- 1 ubuntu ubuntu  3834736 Dec 12 20:17 file09
-rw-rw-r-- 1 ubuntu ubuntu  3935780 Dec 12 20:47 file10
-rw-rw-r-- 1 ubuntu ubuntu  1588448 Dec 12 21:38 file11
-rw-rw-r-- 1 ubuntu ubuntu  1603004 Dec 13 04:19 file12
-rw-rw-r-- 1 ubuntu ubuntu  5587372 Dec 10 05:02 file13
-rw-rw-r-- 1 ubuntu ubuntu  4607272 Dec 10 05:09 file14
-rw-rw-r-- 1 ubuntu ubuntu  5173496 Dec 10 10:35 file15
-rw-rw-r-- 1 ubuntu ubuntu    96968 Dec 10 19:38 file16
-rw-rw-r-- 1 ubuntu ubuntu   249316 Dec 10 23:07 file17
-rw-rw-r-- 1 ubuntu ubuntu  5548184 Dec 10 23:51 file18
-rw-rw-r-- 1 ubuntu ubuntu  5393160 Dec 11 04:13 file19

Results

This is a single-member replica set, with a 750MB (smallfiles=true) oplog, on the same host as Workload 1. (I think that smallfiles=false doesn't solve the problem; it just means you need ~4 times the workload to hit the problem, or would hit it ~4 times less frequently.)

The results show that the oplog is initially not too bad:

But it gets worse:

And finally looks like this when the maxPasses assertion is tripped:

I also tested a similar workload which, in addition to the GridFS workload above, was inserting documents with a single field. The field started out as an empty string and increased by 1 byte with each document inserted, until it reached 32kb (and this process was repeated over and over). Although this didn't hit the maxPasses assertion, the oplog still ended up looking pretty awful:

Comment by Kevin Pulo [ 12/Mar/15 ]
Workload 1: Aggressive

Attached is a fairly pathological reproducer (could be tweaked to be worse). The workload has two stages:

  1. Prime the capped collection by throwing documents of many different sizes at it (every freelist bucket).
  2. Background noise of many small documents (under ~256 bytes), with periodic bursts of larger documents (100 docs of 4MB every 60 seconds).
Results

I ran this on an m3.xlarge with smallfiles=true and a 750MB capped collection (so two extents, one full extent of 512MB and one with the remaining ~250MB). First I ran with 2.6.5, then upgraded to a version with SERVER-6981 fixed (the nightly 2.8.0-rc3-pre-344df1a62e8efc748f7feded04ab2a20fdc3619f-2014-12-10). (Although these are from a regular capped collection with _id index, I also tested an actual oplog (so no _id index), and the results look the same.)

The magenta dots (labelled "2.6.5-maxPasses") are inserts that ended early due to the "maxPasses" assertion, while the blue dots ("2.6.5-no-maxPasses") are inserts that succeeded. The green dots ("2.8.0-rc3-pre-") are a version with SERVER-6981 fixed, so the assertion no longer occurs and all inserts run to completion.

(The gap in magenta dots around 2014-12-12 12:00 is when I briefly tried a milder workload, to see if it would still trigger the assertion.)

The mean and median of slow operations (100+ms) have possibly increased by ~120ms as a result of SERVER-6981 (these results are from logLevel 0 (not 1), so are biased). The frequency of very long-running operations has also likely increased.

2.6.5

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  101.0   164.0   192.0   320.5   346.0  2195.0
 
Number of operations (>100ms):  18858
Number of operations (>1000ms): 566 (3%)

2.8.0-rc3-pre-

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  101.0   232.0   313.0   439.7   357.0  7503.0
 
Number of operations (>100ms):  14229
Number of operations (>1000ms): 632 (4.5%)

Here is the layout of the capped collection at the labelled points in the run. The x-axis is documents in natural order. The y-axis is the linearised on-disk position (i.e. computed as $diskLoc.file * 512MB + $diskLoc.offset).

  • 0: Typical layout after the first pass (no documents deleted yet):
  • 1: After priming (2.6.5):
  • 2: After hitting maxPasses (2.6.5):
  • 3: After many maxPasses assertions (2.6.5):
  • 4: Just after upgrading to 2.8.0-rc3-pre-:
  • 5: After a short while on 2.8.0-rc3-pre- (1hr):
  • 6: After a long while on 2.8.0-rc3-pre- (7hrs):

Notable features:

  1. Records "jump around" considerably. The layout looks nothing like the linear situation that exists after the first pass, and which is the mental model of a capped collection.
  2. Sometimes a jump is just for one document, sometimes for a string of them.
  3. There are sections where a relatively small number of naturally sequential documents jump back and forth across nearly the entire file range.
  4. There are sections where the overall trend is to run backwards (i.e. many sub-sections that each go forwards, but these sub-sections are arranged "in reverse").