  WiredTiger / WT-2176

Raw compression can create unreasonably large pages

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Fix Version/s: WT2.8.0
    • Affects Version/s: None
    • Component/s: None
    • Labels:
      None

      This ticket has evolved. It looks like the biggest problem is that with zlib compression we are creating an on-disk page that is 2.5MB when uncompressed, while the cache size is only 1MB. A page swap that brings that page in therefore causes eviction to stall.
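
      For reference, a minimal sketch of the configuration combination involved, using the public WiredTiger API. The home directory, table name and leaf_page_max value are illustrative assumptions, not taken from the failing test/format run, and it assumes a build with zlib support (or the zlib compressor extension loaded):

      #include <stdlib.h>
      #include <wiredtiger.h>

      int
      main(void)
      {
          WT_CONNECTION *conn;
          WT_SESSION *session;

          /* Cache capped at 1MB, as in the hung test/format run. */
          if (wiredtiger_open("WT_HOME", NULL, "create,cache_size=1MB", &conn) != 0)
              return (EXIT_FAILURE);
          if (conn->open_session(conn, NULL, NULL, &session) != 0)
              return (EXIT_FAILURE);

          /*
           * A zlib raw-compressed block can pack enough entries that the
           * page is several MB once decompressed, larger than the whole
           * cache. The table name and leaf_page_max value are assumptions.
           */
          if (session->create(session, "table:repro",
              "key_format=r,value_format=u,block_compressor=zlib,leaf_page_max=128KB") != 0)
              return (EXIT_FAILURE);

          return (conn->close(conn, NULL) == 0 ? EXIT_SUCCESS : EXIT_FAILURE);
      }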

      Old analysis follows:

      There is a test/format workload that has hung due to a full cache. The configuration has a single worker thread and a 1MB cache.

      The only application thread is:

      #2  0x000000000043196e in __wt_cond_wait (session=0x2b778b0, cond=0x2b75b50,
          usecs=100000) at ../src/include/misc.i:18
      #3  0x0000000000435d08 in __wt_cache_eviction_worker (session=0x2b778b0,
          busy=false, pct_full=328) at ../src/evict/evict_lru.c:1544
      #4  0x00000000004a5802 in __wt_cache_eviction_check (session=0x2b778b0,
          busy=false, didworkp=0x0) at ../src/include/cache.i:236
      #5  0x00000000004a5f54 in __wt_txn_begin (session=0x2b778b0, cfg=0x0)
          at ../src/include/txn.i:266
      #6  0x00000000004a5fd4 in __wt_txn_autocommit_check (session=0x2b778b0)
          at ../src/include/txn.i:287
      #7  0x00000000004a842f in __wt_page_in_func (session=0x2b778b0, ref=0x345a5c0,
          flags=0, file=0x6c2f45 "../src/btree/col_srch.c", line=93)
          at ../src/btree/bt_read.c:575
      #8  0x00000000004c237d in __wt_page_swap_func (session=0x2b778b0,
          held=0x2b75ec0, want=0x345a5c0, flags=0,
          file=0x6c2f45 "../src/btree/col_srch.c", line=93)
          at ../src/include/btree.i:1260
      #9  0x00000000004c2baa in __wt_col_search (session=0x2b778b0, recno=87641,
          leaf=0x0, cbt=0x7f23b0032d40) at ../src/btree/col_srch.c:93
      #10 0x0000000000516f4d in __cursor_col_search (session=0x2b778b0,
          cbt=0x7f23b0032d40, leaf=0x0) at ../src/btree/bt_cursor.c:226
      #11 0x0000000000518821 in __wt_btcur_remove (cbt=0x7f23b0032d40)
          at ../src/btree/bt_cursor.c:670
      #12 0x00000000004da976 in __curfile_remove (cursor=0x7f23b0032d40)
          at ../src/cursor/cur_file.c:331
      #13 0x00000000004131fa in col_remove (cursor=0x7f23b0032d40,
          key=0x7f23c891cdf0, keyno=87641, notfoundp=0x7f23c891cda4)
          at ../../../test/format/ops.c:1183
      #14 0x00000000004115fa in ops (arg=0x255cd50) at ../../../test/format/ops.c:426
      

      That is, it is an auto-commit transaction doing a cache-full check before allocating a transaction ID.
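
      In outline, the blocked path in the stack above looks like the following (a paraphrased, self-contained sketch; only the function names in the comments come from the backtrace, the types and bodies are stand-ins, and the 95% trigger threshold is an assumption):

      #include <stdbool.h>
      #include <stdint.h>

      struct session;                        /* stand-in for WT_SESSION_IMPL */

      /* Stand-in for the cache-usage calculation: bytes in use as a
       * percentage of the configured cache size (328 in the hung run). */
      static uint64_t
      cache_pct_full(struct session *s)
      {
          (void)s;
          return (328);
      }

      /* Stand-in for __wt_cache_eviction_worker: in WiredTiger this loops
       * in __wt_cond_wait until eviction frees space; in the hung run it
       * never returns, because the only candidate leaf page is larger than
       * the cache. Modeled here as an immediate failure. */
      static int
      cache_eviction_worker(struct session *s, bool busy, uint64_t pct_full)
      {
          (void)s;
          (void)busy;
          return (pct_full > 100 ? -1 : 0);
      }

      /* Paraphrase of __wt_cache_eviction_check: block application threads
       * when the cache is over its trigger. */
      static int
      cache_eviction_check(struct session *s, bool busy)
      {
          uint64_t pct_full = cache_pct_full(s);

          if (pct_full > 95)                 /* assumed trigger threshold */
              return (cache_eviction_worker(s, busy, pct_full));
          return (0);
      }

      /* Paraphrase of __wt_txn_autocommit_check / __wt_txn_begin: the
       * cache-full check runs before the transaction ID is allocated, so
       * the auto-commit remove blocks here. */
      static int
      txn_begin(struct session *s)
      {
          int ret;

          if ((ret = cache_eviction_check(s, false)) != 0)
              return (ret);
          /* ...allocate snapshot and transaction ID, then do the work... */
          return (0);
      }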

      There are 5 pages in cache, 848 bytes of which are on internal pages. Two pages belong to file:wt: one is a small internal page, the other is a 2.5MB leaf page.
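
      A rough cross-check of the pct_full=328 value in frame #3: the percentage is simply bytes in cache over the configured cache size (the in-memory byte total below is an assumed figure chosen to match the backtrace, not a number taken from the core):

      #include <inttypes.h>
      #include <stdint.h>
      #include <stdio.h>

      int
      main(void)
      {
          uint64_t cache_size = 1024 * 1024;   /* the configured 1MB cache */
          uint64_t bytes_inuse = 3440000;      /* assumed in-memory total, ~3.4MB */

          /*
           * pct_full as used by the eviction check: bytes in use as a
           * percentage of the cache size. A 2.5MB (uncompressed) leaf page
           * alone is ~250% of a 1MB cache, so eviction can never get back
           * under the trigger.
           */
          printf("pct_full = %" PRIu64 "%%\n", (100 * bytes_inuse) / cache_size);
          return (0);
      }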

      An oddity is that the session that is in __wt_txn_begin already has a snapshot allocated:

      (gdb) p session->txn
      $30 = {id = 0, isolation = WT_ISO_SNAPSHOT, snap_min = 4, snap_max = 4,
        snapshot = 0x7f23b0032960, snapshot_count = 0, txn_logsync = 0, mod = 0x0,
        mod_alloc = 0, mod_count = 0, logrec = 0x0, notify = 0x0, ckpt_lsn = {
          file = 0, offset = 0}, full_ckpt = false, ckpt_nsnapshot = 0,
        ckpt_snapshot = 0x0, flags = 8}
      

      That snapshot is keeping the system-wide snap_min pinned at 4:

      (gdb) p $3->txn_global
      $22 = {current = 4, last_running = 4, oldest_id = 4, scan_count = 0,
        checkpoint_id = 0, checkpoint_gen = 0, checkpoint_pinned = 0,
        nsnap_rwlock = 0x2b731b0, nsnap_oldest_id = 0, nsnaph = {tqh_first = 0x0,
          tqh_last = 0x2b6b478}, states = 0x2b94d80}
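
      For context, a much-simplified sketch of why a pinned snap_min matters to eviction (paraphrased, not the actual WiredTiger visibility code; the function name is illustrative):

      #include <stdbool.h>
      #include <stdint.h>

      /*
       * Simplified global-visibility rule: an old version of a value only
       * becomes obsolete, and so discardable when its page is reconciled
       * for eviction, once its transaction ID is older than the global
       * oldest ID. With this session holding a snapshot at snap_min=4,
       * oldest_id stays pinned at 4, so nothing newer ever qualifies.
       */
      static bool
      txnid_visible_all(uint64_t txnid, uint64_t oldest_id)
      {
          return (txnid < oldest_id);
      }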
      

            Assignee:
            backlog-server-execution [DO NOT USE] Backlog - Storage Execution Team
            Reporter:
            alexander.gorrod@mongodb.com Alexander Gorrod
            Votes:
            0
            Watchers:
            3
