[SERVER-16997] wiredtiger's "bytes currently in the cache" exceeds "maximum bytes configured" and eviction threads are deadlocked Created: 22/Jan/15  Updated: 18/Sep/15  Resolved: 29/Jan/15

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: None
Fix Version/s: 3.0.0-rc8

Type: Bug Priority: Major - P3
Reporter: Michael O'Brien Assignee: Alexander Gorrod
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2015-01-23 at 2.27.22 pm.png     PNG File cache_accounting.png     File capped_test.js     Text File gdb.txt     Text File gdb2.txt    
Issue Links:
Related
related to SERVER-17100 mongod hung running mongo-perf Closed
related to SERVER-16941 Cache can grow indefinitely under Wir... Closed
Tested
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:
Linked BF Score: 0

 Description   

Not sure how to reproduce; it happened on Dan's test server after running a heavy workload.

dan:PRIMARY> db.serverStatus().wiredTiger.cache
{
	"tracked dirty bytes in the cache" : 108113698,
	"bytes currently in the cache" : NumberLong("2305843128198539"),
	"maximum bytes configured" : 2147483648,
	"bytes read into cache" : 74306876990,
	"bytes written from cache" : 55231308853,
	"pages evicted by application threads" : 100627896064,
	"checkpoint blocked page eviction" : 1930,
	"unmodified pages evicted" : 812357,
	"page split during eviction deepened the tree" : 4,
	"modified pages evicted" : 1753265,
	"pages selected for eviction unable to be evicted" : 1651488,
	"pages evicted because they exceeded the in-memory maximum" : 1619,
	"pages evicted because they had chains of deleted items" : 253949,
	"failed eviction of pages that exceeded the in-memory maximum" : 3363,
	"hazard pointer blocked page eviction" : 3565,
	"internal pages evicted" : 385,
	"maximum page size at eviction" : 147633879,
	"eviction server candidate queue empty when topping up" : 28707,
	"eviction server candidate queue not empty when topping up" : 255279,
	"eviction server evicting pages" : 185,
	"eviction server populating queue, but not evicting pages" : 283801,
	"eviction server unable to reach eviction goal" : 1,
	"pages split during eviction" : 6618,
	"pages walked for eviction" : 159135276,
	"in-memory page splits" : 545,
	"tracked dirty pages in the cache" : 588,
	"pages currently held in the cache" : 1145,
	"pages read into cache" : 2401943,
	"pages written from cache" : 1840049
}
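The anomaly is visible directly in the numbers above. A minimal sketch (in Python, with the values hardcoded from the serverStatus output above) that flags the inconsistent accounting:

```python
# Cache statistics copied from the db.serverStatus().wiredTiger.cache output above.
cache_stats = {
    "bytes currently in the cache": 2305843128198539,
    "maximum bytes configured": 2147483648,
    "tracked dirty bytes in the cache": 108113698,
}

def cache_overcommit_ratio(stats):
    """Return how far 'bytes currently in the cache' exceeds the configured maximum."""
    return stats["bytes currently in the cache"] / stats["maximum bytes configured"]

ratio = cache_overcommit_ratio(cache_stats)
# The reported cache contents are roughly a million times the 2 GiB limit,
# which is physically impossible and points at broken byte accounting
# rather than a genuinely full cache.
print(f"cache appears {ratio:.0f}x over its configured maximum")
```

With a counter that far past the limit, eviction can never bring "bytes currently in the cache" back under the eviction trigger, which matches the stalled behavior reported below.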

mongostat shows that the cache is completely full of dirty bytes:

insert query update delete getmore command % dirty % used flushes vsize  res qr|qw ar|aw netIn netOut conn set repl     time
  2723  7960   5648     *0       0     0|0    72.8  100.0       0  3.4G 2.8G   0|0  8|84   16m    14m  103 dan  PRI 13:28:10
  3779 10952   7308     *0       0     1|0    89.3   99.9       0  3.4G 2.8G   0|0 23|66   21m    19m  103 dan  PRI 13:28:12
  1787  5054   3349     *0       0     0|0   100.1  100.9       0  3.4G 2.8G   0|0 28|72   10m     9m  103 dan  PRI 13:28:14
    *0    *0     *0     *0       0     1|0   100.1  100.9       0  3.4G 2.8G   0|0 27|73  160b     9k  103 dan  PRI 13:28:16
    *0    *0     *0     *0       0     0|0   100.1  100.9       0  3.4G 2.8G   0|0 27|73    1k     9k  103 dan  PRI 13:28:18
    *0    *0     *0     *0       0     0|0   100.1  100.9       0  3.4G 2.8G   0|0 27|73   39b     8k  103 dan  PRI 13:28:20
     7    12     12     *0       0     1|0   100.2  100.9       0  3.4G 2.8G   0|0 20|80   31k    35k  103 dan  PRI 13:28:22
     6    20     13     *0       0     0|0   100.2  101.0       0  3.4G 2.8G   0|0 22|78   39k    42k  103 dan  PRI 13:28:24
    *0     7      4     *0       0     1|0   100.2  101.0       0  3.4G 2.8G   0|0 20|80    9k    22k  103 dan  PRI 13:28:26
    *0    *0     *0     *0       0     0|0   100.2  101.0       0  3.4G 2.8G   0|0 21|79  921b     8k  103 dan  PRI 13:28:28
    *0    *0     *0     *0       0     0|0   100.2  101.0       0  3.4G 2.8G   0|0 21|79   39b     8k  103 dan  PRI 13:28:30
    *0    *0     *0     *0       0     1|0   100.2  101.0       0  3.4G 2.8G   0|0 21|79   95b     8k  103 dan  PRI 13:28:32

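The stall pattern above (insert throughput dropping to zero while the cache stays at or above 100% dirty) can be detected mechanically. A rough sketch in Python, assuming the default mongostat column layout shown in this ticket, with a few sample lines hardcoded from the capture above:

```python
# Sketch: scan mongostat output for samples where inserts stall while the
# cache is fully dirty. Column positions assume the default mongostat layout.
SAMPLE = """\
  2723  7960   5648     *0       0     0|0    72.8  100.0       0  3.4G 2.8G   0|0  8|84   16m    14m  103 dan  PRI 13:28:10
    *0    *0     *0     *0       0     1|0   100.1  100.9       0  3.4G 2.8G   0|0 27|73  160b     9k  103 dan  PRI 13:28:16
    *0    *0     *0     *0       0     0|0   100.1  100.9       0  3.4G 2.8G   0|0 27|73    1k     9k  103 dan  PRI 13:28:18
"""

def parse_count(token):
    """mongostat prints '*0' for zero; strip the asterisk before converting."""
    return int(token.lstrip("*"))

def stalled_samples(text):
    """Yield timestamps where inserts are zero but the cache is >= 100% dirty."""
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 8 or fields[0] == "insert":  # skip headers/blank lines
            continue
        inserts, dirty_pct = parse_count(fields[0]), float(fields[6])
        if inserts == 0 and dirty_pct >= 100.0:
            yield fields[-1]  # trailing column is the sample time

print(list(stalled_samples(SAMPLE)))
```

This is only a diagnostic aid for reading the capture, not part of the repro; the real signal is that every worker thread ends up blocked (ar|aw pinned near 100) while no inserts complete.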


 Comments   
Comment by Daniel Pasette (Inactive) [ 29/Jan/15 ]

Resolved with https://github.com/wiredtiger/wiredtiger/pull/1616

Comment by Githook User [ 27/Jan/15 ]

Author: Michael Cahill <michael.cahill@wiredtiger.com> (GitHub: michaelcahill)

Message: SERVER-16997 Avoid overflow keys in WT indexes by increasing internal page max size

(cherry picked from commit bd4956d96e195e7a0072fd87e3793e4f442f92af)
Branch: v3.0
https://github.com/mongodb/mongo/commit/64f8cf8dac852d1df58e9e63ec0dba6568b09a01

Comment by Githook User [ 26/Jan/15 ]

Author: Michael Cahill <michael.cahill@wiredtiger.com> (GitHub: michaelcahill)

Message: SERVER-16997 Avoid overflow keys in WT indexes by increasing internal page max size
Branch: master
https://github.com/mongodb/mongo/commit/bd4956d96e195e7a0072fd87e3793e4f442f92af

Comment by Daniel Pasette (Inactive) [ 23/Jan/15 ]

Added a test which repros the issue quickly.

Run mongod with a very small configured wiredTiger cache:
mongod --dbpath /home/dan/wt-data/ --storageEngine wiredTiger --wiredTigerEngineConfigString='cache_size=20M'

Run capped_test.js:
mongo capped_test.js

Watch insert throughput drop to zero using mongostat:
mongostat

This does not reproduce without the change to capped collections' use of truncate:

commit 612f66c61c12b705fe3c5ed251d63525f1bf650a
Author: Mathias Stearn <mathias@10gen.com>
Date:   Wed Jan 21 18:09:10 2015 -0500
 
    SERVER-16919 Improve WT capped collection age-out
 
    * Provide a start cursor to range truncate()
    * Don't call capped deletion callback for the oplog
    * Provide backpressure when inserts get ahead of age-out

Comment by Alexander Gorrod [ 23/Jan/15 ]

> I wonder whether this hang is due to the new code to handle WT_ROLLBACK rather than the memory accounting issue.

This has been reproduced with code that doesn't contain those changes, so they aren't the cause.

Comment by Daniel Pasette (Inactive) [ 23/Jan/15 ]

Attaching another view of stats gathered earlier this afternoon. I'm able to trigger this at just 101.5% cache utilization; the cache doesn't need to be far over its limit.

Comment by Alexander Gorrod [ 23/Jan/15 ]

Stats graph

Comment by Daniel Pasette (Inactive) [ 22/Jan/15 ]

To set up the workload that I'm running which triggers this:

yum install -y git gcc automake autoconf libtool
git clone https://github.com/mongodb/mongo-c-driver
cd mongo-c-driver
./autogen.sh
make
make install
git clone https://github.com/johnlpage/WorkLoad
cd WorkLoad
make
 
Then run:
./loadsrv -h hostname:port -p 100 -d 600

Comment by Daniel Pasette (Inactive) [ 22/Jan/15 ]

Able to make this happen much faster by dropping wiredTigerCacheSizeGB to 1GB.

Comment by Keith Bostic (Inactive) [ 22/Jan/15 ]

My guess is this isn't a deadlock. The bytes-in-cache accounting has gone wrong:

"bytes currently in the cache" : NumberLong("2305843128198539"),

there aren't any pages to evict because they've all been evicted, and the system is just thrashing.
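One way a byte counter can end up impossibly large (a general illustration of the failure class, not a claim about the actual WiredTiger code path) is an unsigned 64-bit counter being decremented by more than was ever added, so it wraps around instead of going negative:

```python
# General illustration of unsigned-counter underflow (not the actual WT code):
# if eviction ever subtracts more bytes than were added, a uint64_t-style
# counter wraps around to an enormous value instead of going negative.
UINT64_MASK = (1 << 64) - 1

def sub_u64(counter, delta):
    """Subtract with 64-bit unsigned wraparound, as C's uint64_t would."""
    return (counter - delta) & UINT64_MASK

bytes_inmem = 1_000_000          # bytes the cache thinks a page occupies
bytes_freed = 1_500_000          # bytes eviction accounts for when freeing it
bytes_inmem = sub_u64(bytes_inmem, bytes_freed)
print(bytes_inmem)  # astronomically large, and eviction can never shrink it back
```

Once the counter is wildly above "maximum bytes configured", eviction keeps running but can never reach its goal, which would look exactly like the thrashing described here.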

Comment by Daniel Pasette (Inactive) [ 22/Jan/15 ]

Second occurrence attached as gdb2.txt

Comment by Geert Bosch [ 22/Jan/15 ]

We have hundreds of threads waiting on the evict_waiter_cond while an evict_server is
waiting in __wt_thread_join at evict_lru.c:181. How is conn->evict_workers protected there?

Comment by Daniel Pasette (Inactive) [ 22/Jan/15 ]

Attaching backtrace as gdb.txt

Comment by Daniel Pasette (Inactive) [ 22/Jan/15 ]
  • Server version: 612f66c61c12b705fe3c5ed251d63525f1bf650a
  • Startup: mongod --slowms 5000 --dbpath /home/dan/wt-data/ --storageEngine wiredTiger --replSet dan --oplogSize 100 --wiredTigerCacheSizeGB 2
  • run rs.initiate() from the shell
  • John Page's Workload (using commit 81338e3d5589fb5e1cd5a066cc50c9a72a92def6): ./loadsrv -h localhost:27017 -p 100 -d 6000
Generated at Thu Feb 08 03:42:57 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.