[SERVER-22423] Index pages are preferentially evicted Created: 01/Feb/16  Updated: 06/Dec/22  Resolved: 10/Jun/19

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.0.9, 3.2.1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Bruce Lucas (Inactive) Assignee: Backlog - Storage Execution Team
Resolution: Done Votes: 2
Labels: 3.7BackgroundTask, WTplaybook
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File WT-2391.png     PNG File WT-2391a.png     PNG File evictions.png     File ex_mongo.py    
Issue Links:
Backports
Depends
depends on WT-2391 De-prioritize eviction from indexes Closed
depends on WT-2452 Implement a mechanism that gives prio... Closed
Related
Assigned Teams:
Storage Execution
Operating System: ALL
Backport Requested:
v3.4, v3.2
Participants:

 Description   
  • Create a collection consisting of 10 M documents, each 1 kB in size
  • Start mongod with a cache size of 10 GB, a little too small to hold the data set (including index and overhead)
  • Start reading documents at random

Observed result:

  • At point A we have read most of the index into the cache ("pages read into cache" for the index falls to near 0), whereas we continue to read collection pages into the cache all the way through B. This reflects the fact that index pages contain many more entries than collection pages, so we touch all the index pages much sooner than we touch all the collection pages.
  • At point B, where the cache fills up, we see collection pages being evicted at a rate of about 1 k/s, and the rate of reading collection pages into the cache does not change when the cache fills: with uniformly random reads, LRU over the collection is essentially random, and evicting a random collection page does not change the chance of a cache miss on the next random read, so the collection cache-miss rate stays constant after the cache fills up.
  • However, we see index pages being evicted at nearly 4 k/s, about 3-4x the rate of collection pages, even though index pages are much "hotter" than collection pages and so should be kept in cache by LRU; as a result, the rate of reading index pages into the cache goes very high (a rough back-of-the-envelope on relative page hotness follows below).
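
As a rough back-of-the-envelope, assuming about 2,000 entries per index leaf page (a figure reported in the comments below) and on the order of 30 one-kilobyte documents per collection leaf page, each random read touches one collection leaf page and one index leaf page, so any individual index page is touched far more often. A minimal sketch of that arithmetic, with both per-page counts treated as assumptions:

# Rough arithmetic only; the per-page entry counts are assumptions informed by
# the page-structure numbers reported in the comments below.
total_docs = 10 * 1000 * 1000
docs_per_coll_leaf = 30        # assumed: ~1 kB documents per collection leaf page
entries_per_idx_leaf = 2000    # assumed: ~2,000 index entries per ~12 kB index leaf page

coll_leaf_pages = total_docs / docs_per_coll_leaf    # ~330,000 collection leaf pages
idx_leaf_pages = total_docs / entries_per_idx_leaf   # ~5,000 index leaf pages

# Each random read touches one collection leaf page and one index leaf page,
# so an individual index page is touched roughly coll/idx times more often:
print("an index page is ~%.0fx hotter than a collection page"
      % (coll_leaf_pages / float(idx_leaf_pages)))   # ~65x

# Once the cache is full, the collection miss rate is simply the fraction of
# collection pages not resident; evicting a random collection page leaves that
# fraction unchanged, so the collection miss rate stays constant.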


 Comments   
Comment by Eric Milkie [ 10/Jun/19 ]

Bruce, please reopen this if you suspect this issue still exists in 4.0+.

Comment by Asya Kamsky [ 20/Oct/17 ]

bruce.lucas this is quite old - have we seen any actual workloads that may be slow at least in part because of this issue? (Also it's not clear that 3.4 has the same issue since it's not mentioned in affected versions).

Comment by Alexander Gorrod [ 14/Mar/16 ]

bruce.lucas We have also opened WT-2452, to implement a mechanism where indexes would be kept in cache preferentially.

Comment by Keith Bostic (Inactive) [ 14/Mar/16 ]

bruce.lucas, I didn't investigate why we start evicting index pages again at that point.

Is there reason to believe the index pages aren't being naturally aged out of the cache? I would expect some index pages to be touched once and then age out of the cache without getting touched again; is that unexpected in this workload?

Also, the index pages are being evicted at about half the average rate of the collection pages (844 vs 313).

Let me know if we should pursue that question and I will; otherwise, what's the next step on this ticket?

Comment by Bruce Lucas (Inactive) [ 10/Mar/16 ]

keith.bostic, the graph you attached covers quite a bit longer time range than the original graph above, so to facilitate comparison I've added markers to your graph similar to the markers on the original graph:

  • at A the index has already been fully read into the cache while we are still reading the collection, reflecting the fact that the index is much smaller than the collection and therefore much hotter. This is the same as the original graph.
  • at B the cache fills, and the behavior now looks quite a bit better than it did from this point on the original graph: we are now evicting collection pages preferentially instead of index pages.
  • C corresponds roughly to the end of the original graph.
  • however, at D something changes and we are suddenly evicting index pages at a high rate again (although not as bad as in the original graph above). Do you know what changed at this point? There's nothing in the metrics shown in the graph to indicate the sudden change; any clues in the rest of the metrics?

Comment by Keith Bostic (Inactive) [ 07/Mar/16 ]

bruce.lucas, we've just merged some changes into the WiredTiger develop branch based on WT-2391, which give me better results on your test program. I've uploaded a snapshot:

Can you take a look?

Comment by Alexander Gorrod [ 03/Mar/16 ]

david.hows Please add comments to the Python code so that it's easier to understand the intent. It would also be beneficial to review the variable names you used so that they are more self-describing.

Comment by David Hows [ 03/Mar/16 ]

Thanks keith.bostic, good catch.

I've fixed locally and pushed the changed version up.

Comment by Keith Bostic (Inactive) [ 02/Mar/16 ]

david.hows, I think there's a bug in the ex_mongo.py script:

% diff *.orig *.new
57c57
< 		idx_inl = idx_out
---
> 		idx_outl = idx_out

With that change, I see better results:

####
cache:		7736348157
inbound col-idx:	3753 : 2010
outbound col-idx:	3916 : 2098
in(mB) col-idx:	102 : 23
####
cache:		7730173864
inbound col-idx:	3754 : 2017
outbound col-idx:	3932 : 2054
in(mB) col-idx:	102 : 24
####
cache:		7730668615
inbound col-idx:	3768 : 1971
outbound col-idx:	3729 : 1994
in(mB) col-idx:	102 : 23
####
cache:		7731642074
inbound col-idx:	3771 : 2029
outbound col-idx:	3785 : 1970
in(mB) col-idx:	103 : 24
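
For context, the one-character fix above matters because ex_mongo.py prints per-interval deltas of cumulative counters. A hedged reconstruction of that bookkeeping (the attached script is not reproduced here, so everything except the idx_inl/idx_outl names from the diff is an assumption):

# Hypothetical reconstruction of the delta bookkeeping around line 57 of
# ex_mongo.py; read_index_stats() is a stand-in for the real statistics read.
import time

def read_index_stats():
    # stand-in returning fake cumulative (pages read, pages evicted) counters
    t = int(time.time())
    return t * 10, t * 40

idx_inl, idx_outl = read_index_stats()
for _ in range(5):
    time.sleep(1)
    idx_in, idx_out = read_index_stats()
    print("inbound idx:  %d" % (idx_in - idx_inl))    # per-interval read rate
    print("outbound idx: %d" % (idx_out - idx_outl))  # per-interval eviction rate
    idx_inl = idx_in
    idx_outl = idx_out   # the original typo assigned idx_out to idx_inl here, so
                         # idx_outl never advanced and the printed "outbound"
                         # figure kept growing instead of being a per-interval rate

That would explain why the outbound index column in the pre-fix output of 10/Feb climbs steadily (96971, 100929, 105522, ...) while the per-interval rates after the fix, above, are flat.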

Comment by David Hows [ 11/Feb/16 ]

I've done a comparative reproduction between 3.0 and 3.2.

Looking specifically, I can see that in both cases more pages are being evicted from indexes than from collections, but in 3.2 the difference is far smaller.

In 3.0 we see ~2,000 collection pages vs ~6,600 index pages, compared with 3.2 where we see ~2,300 vs ~2,800.

I've also looked at how the tables generated by Bruce's shell script above compare structurally with those generated by the Python script (ex_mongo.py). For the most part the 3.2 tables and the WT tables show the same numbers of internal and leaf pages (with small variation); after some deeper diving into the page structures, the only real difference between the WT and MongoDB data sets is that there are 26 very small leaf pages in the Mongo index table.

These small pages in the Mongo index table are between 1,212 B and 2,669 B in size with between 234 and 478 entries, compared with an average of 12,280 B and ~2,000 entries for both the Mongo and WT index tables. These, and some small variation in the number of entries in the larger pages of the MongoDB tables, are the only notable differences.
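
For anyone repeating the page-structure comparison on the standalone WT side, a minimal sketch using the WiredTiger Python API is below; the table URI is hypothetical, and the assumption that statistics=(all) populates the btree page counts, as well as the exact statistics names, should be checked against your WiredTiger version.

# Sketch only: the table name is hypothetical and the stat identifiers are
# assumed to exist in this form in your WiredTiger build.
import wiredtiger
from wiredtiger import stat

conn = wiredtiger.wiredtiger_open("WT_HOME", "statistics=(all)")
session = conn.open_session()

cursor = session.open_cursor("statistics:table:index", None, "statistics=(all)")
entries = cursor[stat.dsrc.btree_entries][2]            # number of key/value pairs
leaf_pages = cursor[stat.dsrc.btree_row_leaf][2]        # row-store leaf pages
internal_pages = cursor[stat.dsrc.btree_row_internal][2]
cursor.close()

print("%d entries over %d leaf / %d internal pages (~%.0f entries per leaf page)"
      % (entries, leaf_pages, internal_pages, entries / float(max(leaf_pages, 1))))
conn.close()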

Comment by David Hows [ 10/Feb/16 ]

I've written a WT Python script that aims to emulate the workload we are testing.

Given the outputs here, I currently believe the issue exists as reported in WT.

There is a noticeable difference: within this repro we are reading significantly less index data in raw terms. However, as a percentage of the size of the tables in question, we are reading a far higher proportion of the index (33%+), since the index table is around 150 MB compared to the 1.5 GB collection table.

Below you can see how WT reads far more index pages than collection pages. The index reads account for far less data, but (as mentioned) they represent a far higher proportion of that table being read into RAM each second. (A sketch of the sampling loop follows the output.)

####
cache:		7733962837
inbound col-idx:	7244 : 12196
outbound col-idx:	7549 : 96971
in(mB) col-idx:	197 : 49
####
cache:		7747829858
inbound col-idx:	7378 : 11848
outbound col-idx:	7111 : 100929
in(mB) col-idx:	200 : 50
####
cache:		7733515335
inbound col-idx:	7275 : 12261
outbound col-idx:	7572 : 105522
in(mB) col-idx:	197 : 50
####
cache:		7752287805
inbound col-idx:	7166 : 12083
outbound col-idx:	6732 : 109713
in(mB) col-idx:	195 : 52
####
cache:		7738316737
inbound col-idx:	7061 : 11843
outbound col-idx:	7687 : 113415
in(mB) col-idx:	191 : 46
####
cache:		7757094756
inbound col-idx:	6481 : 11463
outbound col-idx:	5947 : 116644
in(mB) col-idx:	176 : 39
####
cache:		7745255762
inbound col-idx:	6528 : 12039
outbound col-idx:	6551 : 120945
in(mB) col-idx:	177 : 44
####
cache:		7734035206
inbound col-idx:	6081 : 11808
outbound col-idx:	6173 : 125387
in(mB) col-idx:	165 : 48
####
cache:		7735931598
inbound col-idx:	6271 : 11757
outbound col-idx:	6278 : 129681
in(mB) col-idx:	170 : 51
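
The attached ex_mongo.py is not shown here; a minimal sketch of the kind of sampling loop that produces output like the above (workload generation omitted) might look like the following, where the table names, the 10 GB cache size, and the exact statistics identifiers are all assumptions rather than the script's actual contents:

# Hedged sketch only: "table:collection" / "table:index" and the stat names
# are assumptions; check them against your WiredTiger version.
import time
import wiredtiger
from wiredtiger import stat

conn = wiredtiger.wiredtiger_open("WT_HOME", "statistics=(all),cache_size=10GB")
session = conn.open_session()

def sample(uri):
    # return cumulative (pages read into cache, pages evicted) for one table
    c = session.open_cursor("statistics:" + uri, None, None)
    pages_in = c[stat.dsrc.cache_read][2]                 # pages read into cache
    pages_out = (c[stat.dsrc.cache_eviction_clean][2] +   # unmodified pages evicted
                 c[stat.dsrc.cache_eviction_dirty][2])    # modified pages evicted
    c.close()
    return pages_in, pages_out

prev = {uri: sample(uri) for uri in ("table:collection", "table:index")}
while True:
    time.sleep(1)
    cur = {uri: sample(uri) for uri in prev}
    col_in = cur["table:collection"][0] - prev["table:collection"][0]
    idx_in = cur["table:index"][0] - prev["table:index"][0]
    col_out = cur["table:collection"][1] - prev["table:collection"][1]
    idx_out = cur["table:index"][1] - prev["table:index"][1]
    print("####")
    print("inbound col-idx:\t%d : %d" % (col_in, idx_in))
    print("outbound col-idx:\t%d : %d" % (col_out, idx_out))
    prev = cur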

Comment by David Hows [ 05/Feb/16 ]

Scratch that; I found the difference: I had an incorrect leaf page size from an earlier test, which wasn't being displayed when I reviewed the data files due to WT-2381.

Comment by David Hows [ 04/Feb/16 ]

michael.cahill,

I've been doing some reproduction and wanted to answer what I could of your initial questions.

  1. No, it did not. Bruce's script has a restart between the load and read phases.
  2. Decreasing the leaf_page_max value had a small positive impact: at 16 KB and 8 KB the average ratio of collection to index pages evicted went from 2:3 to 3:4, meaning we evicted slightly fewer index pages for every collection page as the max size dropped (a config sketch follows below).
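
For reference, the collection's WiredTiger leaf page size can be set at creation time through the collection-level configString; a minimal pymongo sketch is below, where the collection name and the 16KB value simply mirror the experiment above rather than any recommended setting.

# Sketch: create the test collection with a smaller WiredTiger leaf_page_max.
from pymongo import MongoClient

db = MongoClient()["test"]
db.create_collection(
    "c",
    storageEngine={"wiredTiger": {"configString": "leaf_page_max=16KB"}},
)
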
Comment by Michael Cahill (Inactive) [ 04/Feb/16 ]

david.hows, interesting questions are:

  1. does restarting mongod in between the build and query phases make any difference?
  2. can we tweak anything that changes it? One theory is that indexes end up with more, smaller pages in cache than collections – does making the collection page size smaller make any difference to what's happening?
  3. can we get the same behaviour in standalone WT?

If our LRU selection for eviction were perfect, we don't think that page size or the number of pages per tree should matter.

The intuition is that index pages should stick in cache more because they're hotter – we keep revisiting pages in indexes more often because there are more keys per page. But in practice we only approximate LRU: we don't bump the read generation all that often, we only take a certain number of pages per tree, etc. Any or all of those could be contributing to suboptimal choices about which pages to evict.
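
As a toy illustration of the read-generation point (this is not WiredTiger's actual eviction code, and the 1% bump probability and 50-tick age below are arbitrary assumptions): if most reads skip the read-generation update, a page that is read on every tick can still carry an older recorded read than a page read once, 50 ticks ago, so the approximation picks the hot page even though exact LRU never would.

# Toy model only: shows how lazy read-generation bumps can make a hot page
# look evictable; the probabilities are arbitrary assumptions.
import random

random.seed(1)
BUMP_PROBABILITY = 0.01   # assumed: only 1 in 100 reads updates the read generation
COLD_PAGE_AGE = 50        # the cold page was read (and recorded) 50 ticks ago

TRIALS = 100000
wrong_choices = 0
for _ in range(TRIALS):
    # recorded age of the hot page = ticks back to its last *recorded* read,
    # even though it has actually been read on every tick
    hot_recorded_age = 0
    while random.random() >= BUMP_PROBABILITY:
        hot_recorded_age += 1
    # the approximation evicts whichever page has the older recorded read
    if hot_recorded_age > COLD_PAGE_AGE:
        wrong_choices += 1   # hot page evicted; exact LRU would have kept it

print("hot page evicted in %.0f%% of trials" % (100.0 * wrong_choices / TRIALS))
# roughly (1 - 0.01) ** 51, i.e. about 60%, with these assumed numbers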

Comment by Bruce Lucas (Inactive) [ 02/Feb/16 ]

alexander.gorrod, here's the script I used, including data collection.

dbpath=/ssd/db
threads=25     # threads for creating collection
total=10000000 # total number of records
size=1000      # size of each record
 
# sample serverStatus (with tcmalloc detail) as JSON every $delay seconds
function ss-start {
    delay=$1; shift
    mongo $* --eval "while(true) {print(JSON.stringify(db.serverStatus({tcmalloc:1}))); sleep(1000*$delay)}"
}
 
# sample collection stats as JSON every $delay seconds
function cs-start {
    db=$1; shift
    c=$1; shift
    delay=$1; shift
    mongo $db $* --eval "
        while(true) {
            s = db.$c.stats();
            s.time = new Date();
            print(JSON.stringify(s))
            sleep(1000*$delay)
        }
    "
}
 
# sample the stats of a single index as JSON every $delay seconds
function cs-inx-start {
    db=$1; shift
    c=$1; shift
    inx=$1; shift
    delay=$1; shift
    mongo $db $* --eval "
        while(true) {
            s = db.$c.stats({indexDetails:true});
            s.time = new Date();
            s.wiredTiger = s.indexDetails['$inx']
            print(JSON.stringify(s))
            sleep(1000*$delay)
        }
    "
}
 
# log serverStatus, collection stats, and _id index stats once a second
function monitor {
    ss-start 1 >$dbpath/ss.log &
    cs-start test c 1 >$dbpath/coll.log &
    cs-inx-start test c _id_ 1 >$dbpath/inx.log &
}
 
function create {
    (
        for t in $(seq $threads); do
            mongo --eval "
                x = ''
                for (var i=0; i<$size; i++)
                    x += 'x'
                count = $total / $threads
                every = 1000
                for (var i=0; i<count; ) {
                    var bulk = db.c.initializeUnorderedBulkOp();
                    for (var j=0; j<every; j++, i++)
                        bulk.insert({_id: i*$threads+$t, x: x})
                    bulk.execute()
                    if ($t==1)
                        print(i)
                }
            " &
        done
        wait
    )
}
 
function read {
    mongo --eval "
        while (true) {
            id = Math.floor(Math.random()*$total)
            db.c.findOne({_id: id})
        }
    "
}
 
function clean {
    killall -9 -w mongod mongo mongos
    rm -rf $dbpath
    mkdir -p $dbpath
}
 
function start {
    killall -w mongod mongo mongos
    mongod --dbpath $dbpath --logpath $dbpath.log --storageEngine wiredTiger --wiredTigerCacheSizeGB 10 --fork
}
 
clean; start; monitor; create  # phase 1: build the data set
start; monitor; read           # phase 2: restart mongod, then read random documents

Comment by Alexander Gorrod [ 01/Feb/16 ]

Thanks bruce.lucas - I'll get the reproducer working and work towards understanding why we are selecting pages for eviction poorly. It would be useful if you could add a shell script that generates the workload you describe.
