[SERVER-28166] Assess effects of pinning a lot of content in storage engine cache Created: 02/Mar/17  Updated: 29/Jan/18  Resolved: 24/Apr/17

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Alexander Gorrod Assignee: Daniel Gottlieb (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File cache_pressure.png    
Issue Links:
Depends
depends on WT-3296 LAS table fixes/improvements Closed
Related
related to SERVER-31099 Automate testing when oldest_timestam... Closed
Backwards Compatibility: Fully Compatible
Sprint: Storage 2017-04-17, Storage 2017-05-08
Participants:
Case:

 Description   

There is a point at which the WiredTiger storage engine sees a significant performance decrease if too much content must remain in cache to service old operations. The performance drop happens because WiredTiger starts paging content out of the cache to disk in temporary files when old updates may still be required by old readers but are no longer maintained in the database file.

The read-concern-majority functionality can cause a lot of updates to be pinned in cache, especially when a replica set using majority read concern has significant replication lag.

The goal of this ticket is to construct a workload that forces an excessive amount of history to be generated and to assess the behavior of the system under such circumstances.
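
One way to assess that behavior while the workload runs is to poll the wiredTiger section of serverStatus once a second. The loop below is a minimal sketch of such a sampler, assuming a mongod on port 3000; the statistic names ("bytes currently in the cache", "lookaside table insert calls") are taken from WiredTiger's statistics output and may differ between versions.

# Minimal sampler sketch: the port and statistic names are assumptions.
while true; do
    echo 'var c = db.serverStatus().wiredTiger.cache;
          print("cache bytes: " + c["bytes currently in the cache"] +
                ", lookaside inserts: " + c["lookaside table insert calls"]);' | mongo --quiet --port 3000
    sleep 1
done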



 Comments   
Comment by Githook User [ 28/Apr/17 ]

Author: Keith Bostic (keithbostic) <keith.bostic@mongodb.com>

Message: WT-3296 LAS table fixes/improvements (#3397)

  • Changes for SERVER-28166: Assess effects of pinning a lot of content in storage engine cache

Instead of counting the update chains we skip/don't-skip, calculate the memory held in the update chain. Then if there are skipped updates in the chain, count that memory as memory we won't get back if we rewrite the page, else count it as memory we will recover. Change the test for rewriting a page in memory from a percentage of the update chains that we skipped/didn't-skip to whether or not we'll recover at least 2KB from rewriting the page. This change should avoid rewriting pages in memory where we're just wasting time, that is, tiny pages or pages without any updates on them.

Reconciliation has a test in __rec_write_check_complete() of whether an update-restore page reconciliation should fail because there aren't enough updates being discarded that it's worth rewriting the page in memory. That test incorrectly used the count of currently allocated boundary structures in its calculations, not the count of boundary structures used in a specific reconciliation. That bug led to eviction repeatedly doing update-restore reconciliation and rewriting pages in memory, to little or no gain, instead of switching to the LAS table.

Further, the test was intended to check whether 10% of the update chains won't have to be re-instantiated when the page is rewritten. I think the test was wrong, but regardless, implementing the test correctly doesn't help with the test load I'm using. (My test is creating a snapshot and then inserting records until the cache is full.) Re-instantiating the page in memory stalls in that test case, which makes sense because rewriting a page in memory isn't evicting anything; it's just shrinking the memory being pinned down.

Change the test: if we can discard a chunk of the original page, rewrite the page in memory. Otherwise, fall back to the lookaside table if we think lookaside will work.

  • Try to shorten update chains during eviction.
  • The memory in the update chain should include memory held by aborted transactions. Rather than put another test inside the loop, split the loop into two parts, eviction and checkpoint. Checkpoint doesn't need to calculate the memory in the update chain, nor does it have to track the minimum transaction ID.
  • Add a smoke test for the LAS table.
  • Skip reserved items when inserting update records into the LAS table, they're never restored.
  • Don't fall back to the LAS table when configured for in-memory eviction.
  • Replace __wt_evict's test of WT_EVICT_IN_MEMORY with a test of WT_CONN_IN_MEMORY; there's no reason to further complicate the pass-back of the flags value to that function.
Comment by Daniel Gottlieb (Inactive) [ 20/Apr/17 ]

Attached is a screenshot of how cache pressure affects inserts into the system. The insert rate and the LAS file size are graphed, with a datapoint generated every second.

A delayed secondary with --enableMajorityReadConcern was used to prevent named snapshots from being dropped. A 1GB cache size was used. Objects were 213 bytes of highly compressible BSON. Inserts were done in a tight loop inside the mongo shell.

Named snapshots stopped being generated ~23 seconds after data collection started. The insert rate started out at ~700 inserts per second. The LAS table seems to become engaged at time 151, when two things happened. The first is a clear 1-2 second slowdown in inserts. The second is that this marks the point at which the overall insert rate dropped from ~700 to ~500 inserts per second.

Also notable is that every 60 seconds, presumably aligning with the WT flush to disk, the insert rate drops to 0 and the LAS file size increases.
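
A sketch of the kind of per-second sampler that can produce those datapoints is below. It assumes the port, dbpath, and collection name from the script later in this ticket, and that the lookaside data lives in WiredTigerLAS.wt under the primary's dbpath.

# Per-second sampler sketch: paths, port, and the LAS file name are assumptions.
while true; do
    count=$(echo 'db.bla.count()' | mongo --quiet --port 3000)
    las=$(stat -c %s /mnt/data0/rs0/db0/WiredTigerLAS.wt 2>/dev/null)
    echo "$(date +%s) count=$count las_bytes=${las:-0}"
    sleep 1
done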

Comment by Keith Bostic (Inactive) [ 08/Apr/17 ]

alexander.gorrod, daniel.gottlieb, I've taken a look at this, and the count doesn't match what I'm seeing happen at WiredTiger's level.

Once I get into a steady state, where the count is pinned, in WiredTiger I see a stuck cache, evicting aggressively, but successfully evicting 3M pages a second. The raw counts indicate we're inserting 450M rows and committing 100K transactions a second. We're doing heavy transaction rollback (maybe 1 transaction in 7), but we're still committing most of our transactions.

The reason the LAS file isn't getting involved is that eviction isn't having any problem doing normal eviction.

I'll keep digging – maybe the next thing to understand is why the named snapshot isn't preventing eviction?
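
For anyone reproducing this, a rough way to watch the same eviction and transaction rates from the MongoDB side is to diff serverStatus counters every second, as in the sketch below; the statistic names are assumptions taken from the wiredTiger section and may vary by version.

# Per-second deltas of eviction/commit/rollback counters; stat names are assumptions.
echo 'var prev = null;
while (true) {
    var ss = db.serverStatus().wiredTiger;
    var cur = {
        evicted: ss.cache["pages evicted by application threads"],
        committed: ss.transaction["transactions committed"],
        rolledBack: ss.transaction["transactions rolled back"]
    };
    if (prev !== null) {
        print("evicted/s " + (cur.evicted - prev.evicted) +
              "  committed/s " + (cur.committed - prev.committed) +
              "  rolled-back/s " + (cur.rolledBack - prev.rolledBack));
    }
    prev = cur;
    sleep(1000);
}' | mongo --quiet --port 3000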

Comment by Alexander Gorrod [ 07/Apr/17 ]

keith.bostic Would you mind taking a look at this please? It's a case where I'd expected the LAS file to allow us to make progress, but we don't make any. I suspect I'm not understanding a restriction on LAS file usage.

What the test does is create a named snapshot at a point in time, then do a bunch of inserts in separate transactions. I'd thought the old history would be kept around but paged out to the LAS file when the cache became full. Instead, what happens is that the cache gets pinned full and no progress is made.

daniel.gottlieb can help you get the workload running locally - it shouldn't need much more than the script and a local build of MongoDB.

Comment by Daniel Gottlieb (Inactive) [ 05/Apr/17 ]

An addendum: maintaining an old named snapshot in WiredTiger will prevent WiredTiger from paging content out of the cache to disk. Instead of seeing decreased performance as content is paged out, the application will eventually see its updates halt. In addition, CPU utilization becomes very high, presumably from the eviction threads working hard and making little progress until ultimately hitting a wall.

A shell script that demonstrates this behavior is attached. It uses a delayed secondary to artificially pin the named snapshots on the primary.

# Shut down mongods specific to a previous run. Sensitive to the dbpath.
for pid in `ps aux | grep mongod | grep rs0/db | awk '{print $2}'`; do
    echo "Killing $pid"
    kill $pid
    while kill -0 "$pid" 2>/dev/null; do
	sleep 0.1
    done
done
 
# Clean the dbpath.
rm -rf /mnt/data0/rs0/db0/*
rm -rf /mnt/data0/rs0/db1/*
 
# Start up two mongods that will belong to the same replica set. Use a small cache size to trigger the condition sooner. Set the oplog size larger than the cache size to avoid considering how that affects the test.
./mongod --dbpath /mnt/data0/rs0/db0/ --port 3000 --replSet rs0 --oplogSize 2000 --wiredTigerCacheSizeGB 1 --logpath /mnt/data0/rs0/db0.log --enableMajorityReadConcern --fork
./mongod --dbpath /mnt/data0/rs0/db1/ --port 3001 --replSet rs0 --oplogSize 2000 --wiredTigerCacheSizeGB 1 --logpath /mnt/data0/rs0/db1.log --enableMajorityReadConcern --fork
 
# Configure the replica set with the mongod on port 3000 as the primary. Configure the secondary to have a 1 hour delay when applying oplog entries. This forces the named snapshots on the primary to stop advancing.
echo 'rs.initiate({_id: "rs0", members: [{_id: 0, host: "keylime:3000"}, {_id: 1, host: "keylime:3001", slaveDelay:3600, priority: 0}]})' | mongo --port 3000
 
# Wait for the replica set to come up.
while echo "replSetMemberStatePrompt()" | mongo --port 3000 | grep rs0 | grep -v PRIMARY; do
    sleep 0.5
done
# And sleep some more because the previous loop apparently isn't always good enough.
sleep 2
 
# Spin up a background process to insert data.
(echo "while (true) { db.bla.insert({x: 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'}); }" | mongo --port 3000) &
tail -f /mnt/data0/rs0/db0.log

A command that counts the collection size every second:

while true; do echo "print(\"Count: \" + db.bla.count())" | mongo --port 3000 | grep "Count:" ; sleep 1; done

Sample output shows ~600 inserts per second in the normal case:

Count: 220740
Count: 221352
Count: 222019
Count: 222665
Count: 223243
Count: 223829
Count: 224451
Count: 225102
Count: 225792

And writes start to slow down considerably:

Count: 418403
Count: 418403
Count: 418405
Count: 418405
Count: 418405
Count: 418417
Count: 418435
Count: 418466
Count: 418466
Count: 418466
Count: 418466
Count: 418499
Count: 418506

Before stopping altogether:

Count: 418644
Count: 418644
Count: 418647
Count: 418647
Count: 418647
Count: 418647
Count: 418647
Count: 418647
Count: 418647
Count: 418647
Count: 418659
Count: 418659
Count: 418660
Count: 418660
Count: 418660

The count remained at 418660 for two minutes before terminating the test.

Note that the slave delay was removed via reconfiguration once during the run that produced the count output above, which allowed the snapshots to advance. Running the script alone should therefore hit the limit at a lower count. Otherwise I expect the end count to be relatively consistent between runs.
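
For reference, the reconfiguration to remove the slave delay can be done along these lines (a minimal sketch assuming the member order from the rs.initiate() call above):

# Sketch: drop the slave delay on member 1 so named snapshots can advance again.
echo 'var cfg = rs.conf();
cfg.members[1].slaveDelay = 0;
rs.reconfig(cfg);' | mongo --port 3000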
