[SERVER-16881] Pause due to sudden 4GB increase in cached bytes under WiredTiger Created: 15/Jan/15  Updated: 26/Jan/15  Resolved: 21/Jan/15

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: 2.8.0-rc5
Fix Version/s: 3.0.0-rc6

Type: Bug Priority: Major - P3
Reporter: Bruce Lucas (Inactive) Assignee: Keith Bostic (Inactive)
Resolution: Done Votes: 0
Labels: wiredtiger
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File 4GB-jump.png     PNG File 9cf04338af2dd4f9ac46e5d74345c7ea228d9e8f.png    
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   

During one 10-minute run of a heavy mixed workload a period of several seconds of very low throughput was seen:

  • During the pause all statistics appear to be consistent with a large number of cache evictions, resulting in a lot of writes, system cpu time, and context switches. This heavy eviction load seems like a plausible cause for the low throughput during this period.
  • Immediately before the pause the number of bytes in the cache jumped by about 4 GiB between two samples about 0.1 seconds apart. This is a plausible trigger for the heavy eviction load.
  • During the pause bytes and pages in cache declined slowly, presumably a result of the page evictions.
  • The pause ends with a sudden drop in the size of the cache by about 4 GiB between two samples about 0.1 seconds apart.

Here are the specific samples

  • just before increase
    "localTime":"2015-01-15T18:42:20.344Z"
    "bytes currently in the cache":4315898330
    "pages currently held in the cache":60756
  • just after increase
    "localTime":"2015-01-15T18:42:20.472Z"
    "bytes currently in the cache":8594367123
    "pages currently held in the cache":60598
  • just before decrease
    "localTime":"2015-01-15T18:42:25.058Z"
    "bytes currently in the cache":7242022268
    "pages currently held in the cache":35169
  • just after decrease
    "localTime":"2015-01-15T18:42:25.170Z"
    "bytes currently in the cache":2941126278
    "pages currently held in the cache":35137

The size of the sudden jump was: 7242022268 - 2941126278 = 3.98 GiB

The size of the sudden drop was: 8594367123 - 4315898330 = 4.01 GiB

Neither a sudden increase nor a sudden decrease of cache size by 4 GiB in 0.1 seconds due to real cached data is plausible.



 Comments   
Comment by Keith Bostic (Inactive) [ 26/Jan/15 ]

This issue was caused by an accounting bug in tracking the memory currently held in cache, and I recall SERVER-16902 was a problem in selecting pages for splitting. (The accounting problem certainly might have exacerbated16902, though.)

Comment by Mark Callaghan [ 26/Jan/15 ]

Is this caused by the intermittent use of huge, as in 4GB large), leaf pages? See comments from https://jira.mongodb.org/browse/SERVER-16902

Comment by Keith Bostic (Inactive) [ 20/Jan/15 ]

Thanks, bruce.lucas@10gen.com.

If I instrument a WiredTiger branch to drop core when this happens, and build a binary on MCI, would that make it possible for you to give me a stack trace? (Not sure if MCI builds objects with symbol tables?)

Generated at Thu Feb 08 03:42:35 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.