Core Server / SERVER-20876

Hang in scenario with sharded ttl collection under WiredTiger

    • Type: Bug
    • Resolution: Done
    • Priority: Critical - P2
    • Fix Version/s: 3.0.8
    • Affects Version/s: 3.0.6, 3.0.7
    • Component/s: Performance, WiredTiger
    • Labels: None
    • Backwards Compatibility: Fully Compatible

      Setup:

      • 3.0.6
      • two shards, each standalone mongod
      • shard key is _id (not hashed), using standard default ObjectIDs.
      • WT cache size 10 GB
      • a single TTL collection set to expire documents after 10 minutes
      • 4 threads inserting 1 kB documents into the TTL collection
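The setup above can be sketched as a mongosh configuration script. This is a reconstruction, not the reporter's actual script: the database/collection names, the TTL-indexed field name (`created`), and the host details are assumptions; the report only specifies the shard key, cache size, TTL, and document size.

```javascript
// Sketch of the repro setup, run against the mongos.
// Each shard is a standalone mongod started with a 10 GB WT cache, e.g.:
//   mongod --shardsvr --wiredTigerCacheSizeGB 10 ...

sh.enableSharding("test");                      // "test" db is an assumption

// Shard on plain (non-hashed) _id: default ObjectIds are monotonically
// increasing, so all new inserts land on the chunk with the highest _id
// range, i.e. one shard at a time is active.
sh.shardCollection("test.ttl", { _id: 1 });

// TTL index expiring documents 600 s (10 minutes) after `created`.
// (The report doesn't name the indexed field; `created` is an assumption.)
db.ttl.createIndex({ created: 1 }, { expireAfterSeconds: 600 });

// Insert workload: 4 client threads, each repeatedly doing roughly this
// with a ~1 kB document.
const payload = "x".repeat(1000);
db.ttl.insertOne({ created: new Date(), payload: payload });
```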

      After about 25 minutes of run time (about 15 minutes after ttl deletions started) one of the shards became completely stuck. Stack traces captured after it became stuck are attached:

      • TTLMonitor, splitVector, and serverStatus are all stuck waiting on a full cache:
        • TTLMonitor is stuck in __wt_page_in_func while traversing the collection
        • serverStatus and splitVector are both stuck in WT somewhere within mongo::WiredTigerRecoveryUnit::_txnOpen - why would a transaction open be stuck waiting for a full cache?
      • no sign of any eviction activity (or of the eviction worker threads themselves, though perhaps they are hidden in the "??" call sites).

      Here's an overview of the run for the shard that became stuck:

      • inserts are fast but unbalanced, because only one shard at a time is active in this particular test (due to the simple shard key).
      • the insert rate drops very dramatically when TTL deletions begin at A. This performance hit seems surprisingly large.
      • TTL deletion passes run A-B, C-D, E-F, G-. Each begins with a bump in "ttl passes" and ends with a bump in "ttl deletedDocuments".
      • the WT "range of IDs pinned" statistic correlates well with the deletion passes. However, I don't believe there are intentional long-running transactions; rather, I suspect one or more deletions simply run very slowly.
      • it appears that at H the cache becomes full and everything gets stuck, presumably because no evictions are happening, as per the stack traces above, which were captured after H.
      • interestingly, though, evictions appear to have been proceeding as normal right up to H - so why would they stop?
      • there is no data after H because serverStatus has become stuck as well.
      • the only clear connection to any sharding-related activity that I can find is the stuck splitVector call; however, it has been running frequently and appears stuck in the same way serverStatus is, so it seems unlikely to be the culprit.
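The cache-full condition described above can be read off the collected serverStatus samples (e.g. from the attached diagnostic.data) by comparing two WiredTiger cache statistics. A minimal sketch, assuming sample values that are made up for illustration; the 95% figure is WiredTiger's default eviction_trigger, above which application threads are pulled into eviction and, if eviction makes no progress, block when they need to bring a page into cache:

```javascript
// Sketch: compute cache fill from serverStatus().wiredTiger.cache.
// The byte counts below are illustrative, not from this incident.
const status = {
  wiredTiger: {
    cache: {
      "bytes currently in the cache": 10200547328,
      "maximum bytes configured": 10737418240, // 10 GB, as in this setup
    },
  },
};

function cacheFill(ss) {
  const c = ss.wiredTiger.cache;
  return c["bytes currently in the cache"] / c["maximum bytes configured"];
}

const fill = cacheFill(status);
// At or above the default eviction trigger (95%), operations that need a
// page in cache (like the stuck TTLMonitor here) can end up blocked.
console.log(fill.toFixed(3), fill >= 0.95 ? "cache at/above eviction trigger" : "below trigger");
```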

        Attachments:

        1. diagnostic.data.tar (233 kB)
        2. gdbmon.html (206 kB)
        3. gdbmon.log (1.17 MB)
        4. repro-04-gdbmon-s0r0.html (6.29 MB)
        5. repro-04-gdbmon-s0r0.log (6.59 MB)
        6. repro-06-diagnostic.data.tar (1.84 MB)
        7. repro-06-gdbmon.html (5.70 MB)
        8. repro-06-gdbmon.png (448 kB)
        9. repro-06-ts.png (135 kB)
        10. repro-12.png (141 kB)
        11. repro-14.png (111 kB)
        12. repro-14-diagnostic.data.tar (127 kB)
        13. repro-14-gdbmon.html (11.24 MB)
        14. repro-14-gdbmon.png (172 kB)
        15. repro-15.png (66 kB)
        16. standalone.png (130 kB)
        17. standalone-diagnostic.data.tar (444 kB)
        18. stuck.png (202 kB)

            Assignee:
            michael.cahill@mongodb.com Michael Cahill (Inactive)
            Reporter:
            bruce.lucas@mongodb.com Bruce Lucas (Inactive)
            Votes:
            0
            Watchers:
            17
