[SERVER-9378] make server auto-read-ahead on capped collections Created: 17/Apr/13  Updated: 10/Dec/14  Resolved: 17/Mar/14

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: 2.4.1
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Dwight Merriman Assignee: Unassigned
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-9335 prefetch/fault-in the tail of the opl... Closed
Participants:

 Description   

If the filesystem readahead is set to a low value, this can slow writes to capped collections (particularly the oplog), because we need to read a page in before we can write to it.

The proposal here is to have the server, in some background fashion, do some prefetching (touching?) of the oplog region where it is going to be inserting new documents soon. So not the position of the current write, but an upcoming block of data.

Maybe 1MB at a time is touched, for example. (Note an edge case: on a tiny capped collection 1MB may be a poor choice; for a tiny collection it is probably better to skip this logic entirely.)

This touching could probably be done outside of the db locks, kind of like Record::touch().
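
Below is a minimal sketch of the idea, assuming Linux and readahead(2) (which the experiments in the comments also use). The function name, parameters, and 1MB window are illustrative only, not actual server code, and it simplifies to a single data file:

#include <fcntl.h>      // readahead(2); Linux-specific, needs _GNU_SOURCE (g++ defines it by default)
#include <sys/types.h>  // off64_t
#include <algorithm>    // std::min
#include <cerrno>
#include <cstring>      // std::strerror
#include <iostream>

static const off64_t kPrefetchWindow = 1024 * 1024;  // touch ~1MB at a time, as suggested above

// Hypothetical helper: schedule a readahead of the block of the capped
// collection's data file that inserts will reach next.
//   fd         - file descriptor of the data file backing the capped collection
//   insertOfs  - byte offset of the most recent insert within that file
//   fileLength - total length of the data file
//   (single-file simplification; the real oplog may span multiple data files)
bool prefetchUpcomingBlock(int fd, off64_t insertOfs, off64_t fileLength) {
    // Edge case noted above: on a tiny capped collection a 1MB window makes
    // no sense, so skip the logic entirely.
    if (fileLength < 2 * kPrefetchWindow)
        return true;

    // Prefetch the block *after* the one currently being written, wrapping
    // around the end of the file the way a capped collection does.
    off64_t start = insertOfs + kPrefetchWindow;
    if (start >= fileLength)
        start -= fileLength;
    off64_t len = std::min(kPrefetchWindow, fileLength - start);

    // readahead(2) asks the kernel to populate the page cache with this range,
    // so the subsequent inserts into these pages avoid a synchronous read.
    if (readahead(fd, start, static_cast<size_t>(len)) == -1) {
        std::cerr << "readahead failed: " << std::strerror(errno) << std::endl;
        return false;
    }
    return true;
}

Something like this could be invoked every N inserts from a background task, outside the write lock, in the spirit of Record::touch().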



 Comments   
Comment by Eliot Horowitz (Inactive) [ 17/Mar/14 ]

SERVER-9335 does this

Comment by Ben Becker [ 18/Feb/14 ]

Tested with the changes in SERVER-9335, and results are roughly the same. Also tried with and without the MADV_SEQUENTIAL change, but the difference is negligible.

Comment by Antoine Girbal [ 30/Apr/13 ]

Here is a repro with benchRun.
Initialize the collection and ops:

> db.getSiblingDB("local").createCollection("oplogtest", {capped: true, size: 500000000, autoIndexId: false});
> var doc = {
        "_id" : "9008729983994710815501",
        "val" : BinData(0,"AAAAnwAAAAGlOqFQAAAAAAKSOqFQAAAAAAOSOqFQAAAAAAQBAAAABZI6oVAAAAAAB2h0dHA6Ly93d3cuNDFiNTdiMjQzNTczLmNvbS83Zjc3MTUyNGM5NWMvYmI3YjEyNmRkNTc3LzJlZThlYWE5YzA2Yi8wZTkzZjU3NmY3NDQACDAwZjYzMGEzMTg4ODowZDAzYmU4YmNlZTg6MQAA//8=")
}
> ops = [{ op : "insert" , ns : "local.oplogtest" , doc : doc } ]
[
        {
                "op" : "insert",
                "ns" : "local.oplogtest",
                "doc" : {
                        "_id" : "9008729983994710815501",
                        "val" : BinData(0,"AAAAnwAAAAGlOqFQAAAAAAKSOqFQAAAAAAOSOqFQAAAAAAQBAAAABZI6oVAAAAAAB2h0dHA6Ly93d3cuNDFiNTdiMjQzNTczLmNvbS83Zjc3MTUyNGM5NWMvYmI3YjEyNmRkNTc3LzJlZThlYWE5YzA2Yi8wZTkzZjU3NmY3NDQACDAwZjYzMGEzMTg4ODowZDAzYmU4YmNlZTg6MQAA//8=")
                }
        }
]

Restart the server with readahead (RA) set to 16:

# blockdev --report | grep /dev | awk '{ print $7 ; }' | xargs -p blockdev --setra 16
# service mongod restart
# echo 3 > /proc/sys/vm/drop_caches

Then benchRun results:

> res = benchRun({ops: ops, parallel: 1, seconds: 20})
{
        "note" : "values per second",
        "errCount" : NumberLong(0),
        "trapped" : "error: not implemented",
        "insertLatencyAverageMicros" : 6.726367969494757,
        "insert" : 5245,
        "query" : 0,
        "update" : 0,
        "delete" : 0,
        "getmore" : 0,
        "command" : 52.55
}

Now restart the node with RA 32. Results:

> res = benchRun({ops: ops, parallel: 1, seconds: 20})
{
        "note" : "values per second",
        "errCount" : NumberLong(0),
        "trapped" : "error: not implemented",
        "insertLatencyAverageMicros" : 7.318571226681741,
        "insert" : 35380,
        "query" : 0,
        "update" : 0,
        "delete" : 0,
        "getmore" : 0,
        "command" : 353.9
}

Comment by Antoine Girbal [ 23/Apr/13 ]

eliot, dwight_10gen: see above for a quick repro.

Comment by Antoine Girbal [ 23/Apr/13 ]

Here is an easy way to replicate the issue on any server regardless of RAM size; just make sure the storage is spinning disk (HD), not SSD.
The following was tested on an EC2 m3.2xlarge (32GB RAM) with 8 EBS volumes of 250 PIOPS each in RAID 0, running Amazon Linux.

Create a 500MB capped collection in the local db so that there is no replication for it.
This capped collection behaves essentially the same as the oplog.

db.getSiblingDB("local").createCollection("oplogtest", {capped: true, size: 500000000, autoIndexId: false});

Then set the readahead on all drives to 16, restart mongod, and drop the FS cache.
Note that the order is important: the cache must be dropped after mongod starts, because otherwise part of the oplog gets read back in during startup.

# blockdev --report | grep /dev | awk '{ print $7 ; }' | xargs -p blockdev --setra 16
# service mongod restart
# echo 3 > /proc/sys/vm/drop_caches

Next, insert documents into the collection:

> use local
> var doc = {
        "_id" : "9008729983994710815501",
        "val" : BinData(0,"AAAAnwAAAAGlOqFQAAAAAAKSOqFQAAAAAAOSOqFQAAAAAAQBAAAABZI6oVAAAAAAB2h0dHA6Ly93d3cuNDFiNTdiMjQzNTczLmNvbS83Zjc3MTUyNGM5NWMvYmI3YjEyNmRkNTc3LzJlZThlYWE5YzA2Yi8wZTkzZjU3NmY3NDQACDAwZjYzMGEzMTg4ODowZDAzYmU4YmNlZTg6MQAA//8=")
}
> for (var i = 0; i < 3000000; ++i ) {
    db.oplogtest.insert(doc);
}

With RA 16, the result is about 5000 inserts/sec and the disk is at roughly 90% utilization:

insert  query update delete getmore command flushes mapped  vsize    res faults   locked db idx miss %     qr|qw   ar|aw  netIn netOut  conn set repl       time
  5321      0      0      0       0       1       1  10.5g  21.5g    47m    291 local:13.9%          0       0|0     0|0     1m     2k     2 foo  PRI   23:06:10 
  4974      0      0      0       0       1       1  10.5g  21.5g    47m    272  local:9.3%          0       0|0     0|0     1m     2k     2 foo  PRI   23:06:11
 
Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
dm-5              0.00     0.00  266.50   54.00  2132.00  2232.00    13.62     4.10   12.78   2.76  88.60

Now run the three commands above again to restart with RA 32. The result is now about 30k inserts/sec and the disk is only about 30% utilized:

insert  query update delete getmore command flushes mapped  vsize    res faults   locked db idx miss %     qr|qw   ar|aw  netIn netOut  conn set repl       time
 28917      0      0      0       0       1       1  10.5g  21.5g    55m      1 local:44.2%          0       0|1     0|1     7m     2k     2 foo  PRI   23:07:26 
 30371      0      0      0       0       1       1  10.5g  21.5g    68m      1 local:45.9%          0       0|0     0|1     7m     2k     2 foo  PRI   23:07:27
 
Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
dm-5              0.00     0.00  449.50  179.50 13528.00 12908.00    42.03    12.24   19.46   0.44  27.60

Normalizing the insert rate by disk utilization, the disk is about 18 times more efficient with RA 32!
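
As a rough back-of-the-envelope check using the one-second samples above (approximate, but it shows where the 18x comes from):

RA 16: 5321 inserts/sec at 88.6% util ≈ 60 inserts/sec per % of disk utilization
RA 32: 30371 inserts/sec at 27.6% util ≈ 1100 inserts/sec per % of disk utilization
1100 / 60 ≈ 18x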

Comment by Dwight Merriman [ 19/Apr/13 ]

fwiw:
http://landley.net/kdocs/ols/2007/ols2007v2-pages-273-284.pdf

Comment by Antoine Girbal [ 17/Apr/13 ]

I've been trying to test this by using the following function in oplog.cpp:

    void readaheadOplog() {
        // oplogCount accumulates the sizes of the documents inserted into the
        // oplog; with ~300 byte documents this branch fires roughly every
        // 1000 inserts (~300KB).
        if ( (oplogCount / 300) % 1000 == 0 ) {
            // DiskLoc loc = rsOplogDetails->cappedFirstDeletedInCurExtent();
            DiskLoc loc = oplogLastLoc;
            MongoDataFile* mdf = cc().database()->getFile( loc.a() );
            massert( 162382, "can't fetch extent file structure", mdf );
            // ask the kernel to pull 1MB starting at the last inserted
            // location into the page cache
            if ( -1 == readahead( mdf->getFd(), loc.getOfs(), 1000000 ) ) {
                massert( 162372, str::stream() << "readahead failed on fd " << mdf->getFd()
                         << " offset " << loc.getOfs() << " len " << 1000000
                         << " : " << errnoWithDescription(errno), 0 );
            }
            cout << "readahead oplog " << oplogCount << " off=" << loc.getOfs() << endl;
        }
    }

oplogCount is just a counter that sums the sizes of the documents inserted into the oplog.
A document is about 300 bytes, so the above logic kicks in roughly every 1000 documents (~300KB).
It then issues a readahead of 1MB starting from the last disk location inserted into the oplog.

I'm calling the readaheadOplog method from the updateObjects method in update.cpp, which I'm hoping is outside of any lock.
It will still slow down whichever thread happens to be doing the readahead, but since I'm using 32 threads that shouldn't be a problem.

But right now it doesn't work: the speed is still slow, and in iostat r/s is 8x w/s for the same amount of data transferred.
When the disk RA is set to 32, r/s is lower than w/s.
