[SERVER-16150] Large in memory collection scan very slow with WiredTiger compared to mmapv1 Created: 14/Nov/14  Updated: 23/Feb/15  Resolved: 05/Dec/14

Status: Closed
Project: Core Server
Component/s: Performance, Storage
Affects Version/s: 2.8.0-rc0
Fix Version/s: 2.8.0-rc1

Type: Bug Priority: Major - P3
Reporter: Adam Comerford Assignee: Mathias Stearn
Resolution: Done Votes: 0
Labels: wiredtiger
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File ttld_compression.png    
Issue Links:
Duplicate:
    is duplicated by SERVER-16164 optimize table scans on wiredtiger (Closed)
Related:
    related to SERVER-16444 Avoid copying data out of WT buffers ... (Closed)
Operating System: ALL
Steps To Reproduce:

Add a large amount of data into a collection (my test data generation is outlined in this gist).

Here are the various storage config options used:

# mmapv1
 
storage:
    dbPath: "/ssd/db/mmap"
    engine: "mmapv1"
 
# WT with compression off
 
storage:
    dbPath: "/ssd/db/wt_none"
    engine: "wiredtiger"
    wiredtiger:
        collectionConfig: "block_compressor="
 
# WT with snappy (default, so no block_compressor specified)
 
storage:
    dbPath: "/ssd/db/wt_snappy"
    engine: "wiredtiger"
 
# WT with zlib
 
storage:
    dbPath: "/ssd/db/wt_zlib"
    engine: "wiredtiger"
    wiredtiger:
        collectionConfig: "block_compressor=zlib"
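
To double-check which block compressor each collection actually ended up with, the collection stats can be inspected from the shell. A sketch only: it assumes db.data.stats() exposes the WiredTiger creation string via wiredTiger.creationString, as on later WiredTiger builds; the exact layout on 2.8.0-rc0 may differ.

// Sketch: report the block compressor the "data" collection was created with.
// Assumes stats() exposes wiredTiger.creationString; layout may differ on 2.8.0-rc0.
var s = db.data.stats();
if (s.wiredTiger && s.wiredTiger.creationString) {
    var m = s.wiredTiger.creationString.match(/block_compressor=([^,)]*)/);
    print("block_compressor: " + (m && m[1] ? m[1] : "(none)"));
} else {
    print("wiredTiger.creationString not available in this build");
}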

To force a collection scan, run the following:

var start = new Date().getTime();
db.data.find().explain("executionStats")
var end = new Date().getTime();
print("Time to touch data: " + (end - start) + "ms");

The start and end timestamps are not strictly required, since the explain output already contains timing info, but I was also using this harness to compare against other operations (for example, the touch command), so I wanted apples-to-apples timing comparisons.


 Description   

This may be "works as designed" (i.e. that WT is going to be slower for traversing large data structures in memory) but I would like to make sure and quantify the expected behavior here if that is the case.

While attempting to profile the benefits of compression in terms of IO bandwidth savings, I found that the default snappy compression (which delivered decent on-disk compression) performed worse than expected - significantly slower than mmapv1.

That led to a round of testing to better understand what was going on. I used four basic storage engine configurations:

  • mmapv1
  • WT with no compression ("block_compressor=")
  • WT with snappy (default, so no block_compressor specified)
  • WT with zlib ("block_compressor=zlib")

The only WT config that came close to mmapv1 performance was zlib, and only on the read-from-disk test. So I tested on SSD rather than spinning media; everything got a bit faster, but the relative differences remained - WT was still significantly slower.

For my initial testing methodology, since I was trying to demonstrate the benefits of compression for IO bandwidth savings, I had been clearing the caches on the system after each run.

Now that IO appeared not to be the factor, I decided to do consecutive runs of the collection scan, which makes the second run entirely in-memory (the collection is <16GiB and the test machine has 32GiB RAM, so even with indexes it would fit in memory, though indexes are not in play here).

However, the collection scan was still slow with WiredTiger even when the data was already loaded into RAM. The mmapv1 test dropped from the ~300-second range down to 13 seconds, but the WT testing showed no similar reduction - it did improve, but it remained in the hundreds of seconds rather than double digits.

I have tried tweaking cache_size, lsm, directio, and readahead (the last two before I had completely ruled out IO as the issue), but none of them produced a significant improvement.

A basic initial graph is attached; I will add detailed timing information, graphs, and perf output in the comments below to avoid bloating the description too much.



 Comments   
Comment by Mathias Stearn [ 05/Dec/14 ]

Further improvements possible with SERVER-16444

Comment by Adam Comerford [ 24/Nov/14 ]

In case anyone was concerned about the variance in WT "in-memory" performance here, it turns out that my data set was exceeding the default (15G) WiredTiger cache size. Once I increased the cache_size parameter to 24G, the results looked as below - far more consistent for WT in-memory collection scans:

Engine/Compression   Coll scan time, in-memory (s)
mmapv1/none          12.857
WT/none              22.236
WT/snappy            22.453
WT/zlib              22.617
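
As an aside, one quick way to check whether a data set fits in the configured WiredTiger cache is to compare the cache statistics reported by serverStatus. A sketch only: the field names below are the raw WiredTiger statistic names surfaced by serverStatus on WiredTiger builds; their exact availability and layout on 2.8.0-rc0 is an assumption.

// Sketch: compare the configured WiredTiger cache size to current usage.
// Field names are the raw WT statistic names surfaced by serverStatus;
// the exact layout on 2.8.0-rc0 builds is an assumption.
var cache = db.serverStatus().wiredTiger.cache;
var gib = 1024 * 1024 * 1024;
print("cache configured: " + (cache["maximum bytes configured"] / gib).toFixed(1) + " GiB");
print("cache used:       " + (cache["bytes currently in the cache"] / gib).toFixed(1) + " GiB");
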
Comment by Adam Comerford [ 17/Nov/14 ]

Summary table for a patched (improved CollectionScan) 2.8.0-rc1-pre build, SSD-based storage. The in-memory improvement is relative to the 2.8.0-rc0 results in the comment below:

Engine/Compression   Coll scan time, disk (s)   Coll scan time, in-memory (s)   In-mem improvement vs. rc0
mmapv1/none          51.789                     13.479                          -
WT/none              178.283                    34.269                          -92.1%
WT/snappy            141.284                    36.997                          -91.4%
WT/zlib              147.025                    68.585                          -67.3%
Comment by Mathias Stearn [ 15/Nov/14 ]

We will be changing CollectionScan::work to call dataFor on the result of curr() before getNext().

Comment by Alex Gorrod [ 15/Nov/14 ]

I believe the issue here is with the code in CollectionScan::work. All code examples below are pseudo code of what is actually in the tree, in an attempt to simplify the explanation:

DiskLoc nextLoc = _iter->getNext();
member->obj = _iter->dataFor(nextLoc).releaseToBson();

The code in the WiredTiger iterator WiredTigerRecordStore::Iterator::getNext is doing (in pseudo code):

toReturn = curr();
WT_CURSOR *c = getCursor();
c->next();
_curr = c->get_key();
return toReturn;

Then the code in WiredTigerRecordStore::Iterator::dataFor is doing:

WiredTigerRecordStore::Iterator::dataFor(DiskLoc loc) {
    if (loc == _curr)
        return current_value;
    else {
        // Open a new cursor and do a search, to avoid repositioning the cursor
        // being used for the traversal.
        return RecordStore::dataFor(loc);
    }
}

So for every record, the CollectionScan::work method ends up doing both a cursor next and a cursor search, because the DiskLoc it passes to dataFor is no longer the iterator's current position (the iterator has already stepped past it).
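
To make the cost concrete, here is a small toy model - plain JavaScript, runnable in the mongo shell, and not MongoDB code; ToyCursor and its curr/getNext/dataFor methods are hypothetical names that just mirror the pseudo code above. It counts how many searches each access pattern performs: advancing with getNext() before calling dataFor() forces one search per record, while fetching the current record before advancing needs none.

// Toy model (not MongoDB code): reading the record at the cursor's current
// position is cheap, reading any other position requires a fresh search.
function ToyCursor(records) {
    var pos = 0, searches = 0, nexts = 0;
    this.curr = function() { return pos < records.length ? pos : null; };
    this.getNext = function() {       // return the current loc, then advance the cursor
        var loc = this.curr();
        if (loc !== null) { pos++; nexts++; }
        return loc;
    };
    this.dataFor = function(loc) {
        if (loc === this.curr()) { return records[loc]; }  // cursor already positioned here
        searches++;                                        // cursor has moved on: re-search
        return records[loc];
    };
    this.stats = function() { return { nexts: nexts, searches: searches }; };
}

var records = ["a", "b", "c", "d"];

// Pattern before the fix: advance first, then fetch => one search per record.
var slow = new ToyCursor(records);
for (var loc = slow.getNext(); loc !== null; loc = slow.getNext()) { slow.dataFor(loc); }
print("next-then-dataFor: " + tojson(slow.stats()));   // { nexts: 4, searches: 4 }

// Pattern described in the fix: fetch the current record, then advance => no searches.
var fast = new ToyCursor(records);
while (fast.curr() !== null) { fast.dataFor(fast.curr()); fast.getNext(); }
print("dataFor-then-next: " + tojson(fast.stats()));   // { nexts: 4, searches: 0 }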

Comment by Adam Comerford [ 14/Nov/14 ]

Here's an at-a-glance timing table for the perf runs (perf report output is in the comments below), all performed on Ubuntu 14.04 with 2.8.0-rc0 on SSD storage:

Engine/Compression   Coll scan time, disk (s)   Coll scan time, in-memory (s)
mmapv1/none          51.789                     13.479
WT/none              681.683                    431.816
WT/snappy            648.605                    430.514
WT/zlib              299.718                    209.536
Comment by Adam Comerford [ 14/Nov/14 ]

For reference, here's the perf report for mmapv1 (note: all testing with perf recording is being done on SSD, so this easily eclipses even zlib since the IO bottleneck is gone):

// perf report output for initial load, subsequent scan in-mem (mmapv1)
// timing info: initial run (from disk) = 51.789s ; subsequent run (in-mem) = 13.479s
29.13%  mongod  [kernel.kallsyms]    [k] 0xffffffff8104f45a
  9.04%  mongod  mongod               [.] mongo::CollectionScan::work(unsigned long*)
  6.11%  mongod  mongod               [.] mongo::RecordStoreV1Base::getNextRecordInExtent(mongo::OperationContext*, mongo::DiskLoc const&) const
  5.93%  mongod  mongod               [.] mongo::PlanExecutor::getNext(mongo::BSONObj*, mongo::DiskLoc*)
  4.34%  mongod  mongod               [.] mongo::RecordAccessTracker::markAccessed(void const*)
  3.65%  mongod  mongod               [.] mongo::MmapV1ExtentManager::_recordForV1(mongo::DiskLoc const&) const
  3.37%  mongod  mongod               [.] mongo::WorkingSetMember::clear()
  2.58%  mongod  mongod               [.] mongo::RecordAccessTracker::checkAccessedAndMark(void const*)
  2.54%  mongod  mongod               [.] mongo::MmapV1ExtentManager::_getOpenFile(int) const
  1.99%  mongod  mongod               [.] mongo::SimpleRecordStoreV1Iterator::getNext()
  1.89%  mongod  mongod               [.] mongo::RecordAccessTracker::Slice::_get(int, unsigned long, bool)
  1.87%  mongod  mongod               [.] mongo::RecordStoreV1Base::getNextRecord(mongo::OperationContext*, mongo::DiskLoc const&) const
  1.62%  mongod  mongod               [.] mongo::WorkingSet::allocate()
  1.60%  mongod  mongod               [.] mongo::CollectionScan::returnIfMatches(mongo::WorkingSetMember*, unsigned long, unsigned long*)
  1.53%  mongod  mongod               [.] mongo::MmapV1ExtentManager::recordNeedsFetch(mongo::DiskLoc const&) const
  1.45%  mongod  mongod               [.] mongo::WorkingSet::free(unsigned long const&)
  1.41%  mongod  mongod               [.] mongo::MmapV1ExtentManager::recordForV1(mongo::DiskLoc const&) const
  1.25%  mongod  libpthread-2.19.so   [.] pthread_mutex_lock
  1.10%  mongod  mongod               [.] mongo::AndMatchExpression::matches(mongo::MatchableDocument const*, mongo::MatchDetails*) const
  1.08%  mongod  mongod               [.] mongo::PlanYieldPolicy::shouldYield()
  0.90%  mongod  mongod               [.] mongo::RecordStoreV1Base::dataFor(mongo::OperationContext*, mongo::DiskLoc const&) const
  0.85%  mongod  mongod               [.] mongo::ScopedTimer::~ScopedTimer()
  0.83%  mongod  libpthread-2.19.so   [.] pthread_mutex_unlock
  0.79%  mongod  mongod               [.] mongo::RecordAccessTracker::Rolling::access(unsigned long, short, bool)
  0.78%  mongod  mongod               [.] mongo::RecordStoreV1Base::recordFor(mongo::DiskLoc const&) const
  0.76%  mongod  [vdso]               [.] 0x0000000000000ce1
  0.61%  mongod  mongod               [.] mongo::LockerImpl<true>::inAWriteUnitOfWork() const
  0.60%  mongod  mongod               [.] mongo::CollectionScan::isEOF()
  0.60%  mongod  libc-2.19.so         [.] mincore
  0.50%  mongod  mongod               [.] mongo::SimpleRecordStoreV1Iterator::dataFor(mongo::DiskLoc const&) const                      

Comment by Adam Comerford [ 14/Nov/14 ]

Here is the snappy perf report - the profile is very similar to no compression, and notably slower than zlib even after the data is in memory:

// perf report output for initial load, subsequent scan in-mem (WT, snappy compression)
// timing info: initial run (from disk) = 648.605s ; subsequent run (in-mem) = 430.514s
 53.12%  mongod  mongod               [.] __wt_row_search
 15.44%  mongod  mongod               [.] __wt_hazard_set
  6.94%  mongod  mongod               [.] __wt_page_in_func
  3.64%  mongod  mongod               [.] __wt_hazard_clear
  2.20%  mongod  mongod               [.] 0x0000000000f39e14
  2.12%  mongod  [kernel.kallsyms]    [k] 0xffffffff8104f45a
  1.88%  mongod  mongod               [.] snappy::RawUncompress(snappy::Source*, char*)
  1.25%  mongod  mongod               [.] __wt_btcur_search
  1.00%  mongod  mongod               [.] tc_malloc
  0.84%  mongod  mongod               [.] mongo::WiredTigerSession::releaseCursor(unsigned long, __wt_cursor*)
  0.81%  mongod  mongod               [.] __wt_btcur_next
  0.78%  mongod  mongod               [.] mongo::PlanExecutor::getNext(mongo::BSONObj*, mongo::DiskLoc*)
  0.71%  mongod  mongod               [.] tc_free
  0.70%  mongod  libc-2.19.so         [.] __memmove_ssse3
  0.69%  mongod  mongod               [.] mongo::intrusive_ptr_release(mongo::SharedBuffer::Holder*)             

Comment by Adam Comerford [ 14/Nov/14 ]

Some perf data for WiredTiger testing is below. I've only done zlib and non-compressed runs so far:

// perf report output for initial load, subsequent scan in-mem (WT, no compression)
// timing info: initial run (from disk) = 681.683s ; subsequent run (in-mem) = 431.816s
 
 52.23%  mongod  mongod               [.] __wt_row_search
 15.12%  mongod  mongod               [.] __wt_hazard_set
  6.88%  mongod  mongod               [.] __wt_page_in_func
  4.98%  mongod  [kernel.kallsyms]    [k] 0xffffffff8104f45a
  3.55%  mongod  mongod               [.] __wt_hazard_clear
  3.22%  mongod  mongod               [.] 0x00000000009a2776
  1.33%  mongod  mongod               [.] __wt_btcur_search
  0.93%  mongod  mongod               [.] tc_malloc
  0.72%  mongod  mongod               [.] tc_free
  0.71%  mongod  mongod               [.] mongo::WiredTigerSession::releaseCursor(unsigned long, __wt_cursor*)
  0.65%  mongod  libc-2.19.so         [.] __memmove_ssse3
  0.64%  mongod  mongod               [.] mongo::intrusive_ptr_release(mongo::SharedBuffer::Holder*)
  0.64%  mongod  mongod               [.] __wt_btcur_next
  0.61%  mongod  mongod               [.] mongo::PlanExecutor::getNext(mongo::BSONObj*, mongo::DiskLoc*)

// perf report output for initial load, subsequent scan in-mem (WT, zlib compression) 
// timing info: initial run (from disk) = 299.718s ; subsequent run (in-mem) = 209.536s
 
 31.02%  mongod  mongod               [.] __wt_row_search
 15.69%  mongod  mongod               [.] inflate_fast
  8.42%  mongod  mongod               [.] __wt_hazard_set
  3.73%  mongod  mongod               [.] __wt_page_in_func
  3.71%  mongod  mongod               [.] 0x0000000000fec102
  3.08%  mongod  mongod               [.] adler32
  2.63%  mongod  [kernel.kallsyms]    [k] 0xffffffff8104f45a
  2.50%  mongod  mongod               [.] __wt_btcur_search
  2.50%  mongod  mongod               [.] inflate_table
  2.05%  mongod  mongod               [.] __wt_hazard_clear
  1.75%  mongod  mongod               [.] tc_malloc
  1.32%  mongod  mongod               [.] tc_free
  1.26%  mongod  mongod               [.] mongo::intrusive_ptr_release(mongo::SharedBuffer::Holder*)
  1.26%  mongod  libc-2.19.so         [.] __memmove_ssse3
  1.23%  mongod  mongod               [.] __wt_btcur_next
  1.11%  mongod  mongod               [.] mongo::PlanExecutor::getNext(mongo::BSONObj*, mongo::DiskLoc*)
  1.03%  mongod  mongod               [.] inflate
  0.93%  mongod  libc-2.19.so         [.] memset
  0.90%  mongod  mongod               [.] __wt_cursor_get_keyv
  0.82%  mongod  mongod               [.] mongo::WiredTigerSession::releaseCursor(unsigned long, __wt_cursor*)
  0.67%  mongod  mongod               [.] __wt_cursor_set_keyv
  0.64%  mongod  mongod               [.] __wt_page_inmem
  0.64%  mongod  mongod               [.] mongo::CollectionScan::work(unsigned long*)
  0.58%  mongod  mongod               [.] __wt_cursor_get_key
  0.56%  mongod  mongod               [.] mongo::Collection::documentNeedsFetch(mongo::OperationContext*, mongo::DiskLoc const&) const
  0.54%  mongod  mongod               [.] mongo::PlanYieldPolicy::shouldYield()
  0.51%  mongod  mongod               [.] __wt_cursor_get_valuev            

Comment by Eliot Horowitz (Inactive) [ 14/Nov/14 ]

Mathias was looking at this yesterday.
