[SERVER-28166] Assess effects of pinning a lot of content in storage engine cache Created: 02/Mar/17 Updated: 29/Jan/18 Resolved: 24/Apr/17
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Storage |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Alexander Gorrod | Assignee: | Daniel Gottlieb (Inactive) |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Sprint: | Storage 2017-04-17, Storage 2017-05-08 |
| Participants: | |
| Case: | (copied to CRM) |
| Description |
There is a point at which the WiredTiger storage engine suffers a significant performance decrease if too much content is required to remain in cache to service old operations. The performance drops because WiredTiger starts paging content out of the cache to disk in temporary files when old updates may still be required by old readers but are no longer maintained in the database file. The read-concern-majority functionality can cause a lot of updates to be pinned in cache, especially if there is significant lag in a replica set using read-concern majority. The goal of this ticket is to construct a workload that forces an excessive amount of history to be generated and to assess the behavior of the system under such circumstances.
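A minimal sketch of the kind of setup that forces this situation, assuming a two-member replica set whose mongod processes are started with --enableMajorityReadConcern (illustrative only, not the repro script attached to this ticket; host names, the delay value, and the collection name are placeholders):

```javascript
// Illustrative sketch only (not the attached repro script). With two voting
// members, a delayed secondary keeps the majority commit point -- and the
// snapshots pinned for majority readers on the primary -- far in the past.
rs.initiate({
    _id: "rs0",
    members: [
        { _id: 0, host: "localhost:27017" },
        // Priority-0, delayed member: the majority point lags by an hour.
        { _id: 1, host: "localhost:27018", priority: 0, slaveDelay: 3600 }
    ]
});

// Generate history on the primary: small documents inserted in a tight loop
// accumulate old versions that cannot be discarded from the WiredTiger cache.
var coll = db.getSiblingDB("test").pinned;
while (true) {
    coll.insert({ payload: new Array(200).join("x") });
}
```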
| Comments |
| Comment by Githook User [ 28/Apr/17 ] |

Author: Keith Bostic <keith.bostic@mongodb.com> (keithbostic)

Message:
Instead of counting the update chains we skip/don't-skip, calculate the memory held in the update chain. Then, if there are skipped updates in the chain, count that memory as memory we won't get back if we rewrite the page; else count it as memory we will recover. Change the test for rewriting a page in memory from a percentage of the update chains that we skipped/didn't-skip to whether or not we'll recover at least 2KB from rewriting the page. This change should avoid rewriting pages in memory where we're just wasting time, that is, tiny pages or pages without any updates on them.

Reconciliation has a test in __rec_write_check_complete() of whether an update-restore page reconciliation should fail because there aren't enough updates being discarded to make it worth rewriting the page in memory. That test incorrectly used the count of currently allocated boundary structures in its calculations, not the count of boundary structures used in a specific reconciliation. That bug led to eviction repeatedly doing update-restore reconciliation and rewriting pages in memory, to little or no gain, instead of switching to the LAS table.

Further, the test was intended to be whether 10% of the update chains won't have to be re-instantiated when the page is rewritten. I think the test was wrong, but regardless, implementing the test correctly doesn't help with the test load I'm using. (My test is creating a snapshot and then inserting records until the cache is full.) Re-instantiating the page in memory stalls in that test case, which makes sense because rewriting a page in memory isn't evicting anything, it's just shrinking the memory being pinned down.

Change the test: if we can discard a chunk of the original page, rewrite the page in memory. Else, fall back to the lookaside table if we think lookaside will work.
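For readers skimming the commit message, the final decision rule can be restated roughly as follows; this is a schematic in shell-style JavaScript, not WiredTiger's actual C code, and the field names are invented:

```javascript
// Schematic restatement of the decision described above; not WiredTiger code.
// 'memRecoverable' and 'lookasideUsable' are invented names for illustration.
var REWRITE_MIN_RECOVERABLE = 2 * 1024;  // only rewrite if >= 2KB comes back

function chooseEvictionPath(page, lookasideUsable) {
    if (page.memRecoverable >= REWRITE_MIN_RECOVERABLE) {
        // Discarding a chunk of the original page is worthwhile:
        // rewrite the page in memory.
        return "rewrite-in-memory";
    }
    if (lookasideUsable) {
        // Otherwise move older updates out to the lookaside (LAS) table.
        return "lookaside";
    }
    // Neither option helps; give up on evicting this page for now.
    return "give-up";
}
```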
| Comment by Daniel Gottlieb (Inactive) [ 20/Apr/17 ] |

Attached is a screenshot of how cache pressure affects inserts into the system. The insert rate and the LAS file size are graphed, with a datapoint generated every second.

A delayed secondary with --enableMajorityReadConcern was used to prevent named snapshots from being dropped. A 1GB cache size was used. Objects were 213 bytes of highly compressible BSON. Inserts were done in a tight loop inside the mongo shell. Named snapshots stopped being generated ~23 seconds after data collection was started. The insert rate started out at ~700 inserts per second. The LAS table appears to become engaged at time 151, when two things happened. The first is a clear 1-2 second slowdown in inserts. The second is that this marks the point at which the overall insert rate dropped from ~700 to ~500 inserts per second. Also notable is that every 60 seconds, presumably aligning with the WT flush to disk, the insert rate drops to 0 and the LAS file size increases.
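One plausible way to produce per-second insert-rate datapoints like those in the graph is to diff the server-wide insert counter from the shell; this is an assumption about the measurement method, not necessarily what was actually used:

```javascript
// Hedged sketch: print the per-second delta of the insert opcounter.
// The graph's actual datapoints may have been collected differently.
var prev = db.serverStatus().opcounters.insert;
while (true) {
    sleep(1000);  // mongo shell helper; argument is milliseconds
    var cur = db.serverStatus().opcounters.insert;
    print(new Date().toISOString() + " inserts/sec: " + (cur - prev));
    prev = cur;
}
```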
| Comment by Keith Bostic (Inactive) [ 08/Apr/17 ] |

alexander.gorrod, daniel.gottlieb, I've taken a look at this, and the count doesn't match what I'm seeing happen at WiredTiger's level. Once I get into a steady state, where the count is pinned, in WiredTiger I see a stuck cache that is evicting aggressively, but successfully evicting 3M pages a second. The raw counts indicate we're inserting 450M rows and committing 100K transactions a second. We're doing heavy transaction rollback (maybe 1 transaction in 7), but we're still committing most of our transactions. The reason the LAS file isn't getting involved is that eviction isn't having any problem doing normal eviction. I'll keep digging – maybe the next thing to understand is why the named snapshot isn't preventing eviction?
| Comment by Alexander Gorrod [ 07/Apr/17 ] |

keith.bostic Would you mind taking a look at this please? It's a case where I'd expected the LAS file to allow us to make progress, but we don't make any. I suspect I'm not understanding a restriction on LAS file usage. The test creates a named snapshot at a point in time, then does a bunch of inserts in separate transactions. I'd thought the old history would be kept around, but paged out to the LAS file when the cache became full. Instead, what happens is that the cache gets pinned full and no progress is made. daniel.gottlieb can help you get the workload running locally - it shouldn't need much more than the script and a local build of MongoDB.
| Comment by Daniel Gottlieb (Inactive) [ 05/Apr/17 ] |

An addendum: maintaining an old named snapshot in WiredTiger will prevent WiredTiger from paging content out of the cache to disk. Instead of seeing decreased performance as content is paged out, the application will eventually see its updates halt. In addition, CPU utilization becomes very high, presumably from the eviction threads working hard and making little progress until ultimately hitting a wall. A shell script that demonstrates this behavior is attached. It uses a delayed secondary to artificially pin the named snapshots on the primary.
A command that counts the collection size every second:
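(The command itself did not survive the export; a plausible equivalent, with the database and collection names assumed, is:)

```javascript
// Hypothetical stand-in for the lost snippet, not the original command.
// Database and collection names are assumptions.
while (true) {
    print(new Date().toISOString() + " count: " +
          db.getSiblingDB("test").pinned.count());
    sleep(1000);
}
```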
Sample output shows ~600 inserts per second in the normal case:
And writes start to slow down considerably:
Before stopping altogether:
The count remained at 418660 for two minutes before the test was terminated. Note that reconfiguring to remove the slave delay allows the snapshots to advance; this was done once in the run associated with the count output above. Thus, running the script alone should result in a lower count before hitting the limit. Otherwise I expect the end count to be relatively consistent between runs.
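The reconfiguration mentioned above (clearing the slave delay so the pinned snapshots can advance) can be done along these lines; the member index is an assumption about the setup:

```javascript
// Hedged sketch of removing the slave delay mid-run. Once the delayed member
// catches up, the majority point advances and the pinned snapshots are freed.
// Member index 1 is assumed to be the delayed secondary.
var cfg = rs.conf();
cfg.members[1].slaveDelay = 0;
rs.reconfig(cfg);
```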