Type: Bug
Resolution: Duplicate
Priority: Critical - P2
Fix Version/s: None
Affects Version/s: 3.2.3, 3.2.4, 3.2.5
Component/s: Internal Code, WiredTiger
Labels:
None

Backwards Compatibility:
Fully Compatible
Operating System:
ALL
Steps To Reproduce:


I can't prepare a simple reproduction script.
But I think this can happen within 1~2 days under heavy read/write traffic.

Confidence Status:
None
Work Order:
0
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

I am running a performance test on a 4-shard cluster (each shard is a 3-member replica set).
Each shard receives about 1K~2K queries/second and 500~1500 updates (upserts)/second from user requests.

But a shard primary gets stuck and cannot process user requests 10~24 hours after a restart. Once this happens, user requests are blocked for a few hours or are never released. Sometimes they are released after 10~30 minutes.

I found that this happens when WiredTiger cache usage goes over 95%. During this time, all foreground threads that hold a "write ticket" are in __wt_evict() => __wt_txn_update_oldest().

From reading some of the WiredTiger source code:
Once cache usage goes over 95% (eviction_trigger), it looks like the eviction server and all foreground threads become responsible for LRU eviction. But only one thread at a time can update the oldest transaction ID, which is controlled by global_txn->scan_count. Threads that cannot update it loop, searching for the oldest transaction again and again.
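
The contention described above can be illustrated with a toy model. This is only a sketch of my understanding, not WiredTiger's actual code: the class `TxnGlobal`, the function `lockstep_rounds`, and the rule "a thread may publish a new oldest ID only when it is the sole scanner (compare-and-swap of scan_count from 1 to -1)" are assumptions made for illustration. It models the worst case, where every ticket holder enters the scan before any of them leaves.

```python
class TxnGlobal:
    """Toy stand-in for the shared transaction state (hypothetical, not the real struct)."""

    def __init__(self):
        self.scan_count = 0  # >0: number of active scanners, -1: one updater holds it
        self.oldest_id = 0

    def cas_scan_count(self, expected, new):
        # Single-threaded imitation of an atomic compare-and-swap.
        if self.scan_count == expected:
            self.scan_count = new
            return True
        return False


def lockstep_rounds(num_threads, rounds, current_oldest):
    """Worst case: all threads scan together, then all try to publish, then all leave."""
    g = TxnGlobal()
    published = 0
    for _ in range(rounds):
        # Every thread joins the scan, so scan_count climbs to num_threads.
        for _ in range(num_threads):
            g.cas_scan_count(g.scan_count, g.scan_count + 1)
        # Every thread found the same candidate; publishing needs exclusive access.
        for _ in range(num_threads):
            if g.oldest_id < current_oldest and g.cas_scan_count(1, -1):
                g.oldest_id = current_oldest
                published += 1
                g.scan_count = 1  # drop exclusivity, back to being a plain scanner
        # Every thread leaves the scan.
        for _ in range(num_threads):
            g.scan_count -= 1
    return published, g.oldest_id


if __name__ == "__main__":
    print(lockstep_rounds(1, 3, 100))  # a lone thread publishes on its first round
    print(lockstep_rounds(8, 3, 100))  # eight overlapping threads never publish
```

In this model a single thread advances the oldest ID immediately, while eight overlapping threads repeat the full scan every round without ever advancing it. Real threads do not run in perfect lockstep, so the real behavior would be less extreme than this.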

I am not sure whether capping cache usage with eviction_trigger is intended, but having all foreground threads and the eviction server start LRU eviction at the same time might be a problem. (For example, all ticket-holding foreground threads scan all sessions at the same time, and updating the oldest transaction becomes harder because scan_count rises as high as the maximum ticket count.)

Usually cache usage is 80% and everything is fine. But once cache usage goes over 80%, it keeps increasing up to 96~97%. After that, the numbers of queued readers and writers increase, and the active readers and writers are all scanning for the oldest transaction.
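
For anyone who wants to watch for the same pattern, the following is a minimal sketch that derives the cache fill ratio and the queued reader/writer counts from a serverStatus document. The helper name `cache_pressure` is hypothetical, the 0.95 default mirrors the eviction_trigger value mentioned above, and the field names are assumed to match what `db.serverStatus()` prints on 3.2; please check them against your own output.

```python
def cache_pressure(server_status, trigger=0.95):
    """Summarize cache fill and queue depth from a serverStatus-like dict."""
    cache = server_status["wiredTiger"]["cache"]
    used = cache["bytes currently in the cache"]
    limit = cache["maximum bytes configured"]
    queue = server_status["globalLock"]["currentQueue"]
    fill = used / limit
    return {
        "fill_ratio": fill,
        "over_trigger": fill >= trigger,  # True once usage reaches the trigger level
        "queued_readers": queue["readers"],
        "queued_writers": queue["writers"],
    }


if __name__ == "__main__":
    # Made-up numbers shaped like the stuck state described in this report.
    sample = {
        "wiredTiger": {
            "cache": {
                "bytes currently in the cache": 96,
                "maximum bytes configured": 100,
            }
        },
        "globalLock": {"currentQueue": {"readers": 40, "writers": 120}},
    }
    print(cache_pressure(sample))
```

With a driver such as pymongo, the input dict would come from something like `client.admin.command("serverStatus")`; the sample values here are invented and are not taken from the attached metrics.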

The attached pstack_primary.txt is a stack trace of the primary taken when this happens.
The attached pstack_secondary.txt is a stack trace of the secondary after a primary/secondary switch (stepdown). (pstack_primary.txt and pstack_secondary.txt are stack traces from the same server; only the replication role changed.)

CacheUsage_vs_QueuedReaderWriters.png, Apr 18 2016 12:38:01 PM UTC, 101 kB, matt.lee
collection.txt, Apr 18 2016 12:38:01 PM UTC, 0.5 kB, matt.lee
metrics.2016-04-18T09-51-46Z-00000.tar.gz, Apr 18 2016 03:28:50 PM UTC, 3.50 MB, matt.lee
mongod.conf, Apr 18 2016 12:38:01 PM UTC, 0.8 kB, matt.lee
mongostat.txt, Apr 18 2016 12:38:01 PM UTC, 1.37 MB, matt.lee
pstack_primary.txt, Apr 18 2016 12:38:01 PM UTC, 73.38 MB, matt.lee
pstack_secondary.txt, Apr 18 2016 12:38:01 PM UTC, 115 kB, matt.lee
WiredTiger-MetricGraphs.png, May 04 2016 09:50:37 AM UTC, 354 kB, matt.lee
WiredTiger-WT2560-MetricGraphs.png, May 05 2016 04:59:37 AM UTC, 247 kB, matt.lee
wt-2560-mongodb-3.2.diff, May 04 2016 04:46:52 AM UTC, 23 kB, Michael Cahill

duplicates

WT-2560 Stuck trying to update oldest transaction ID

Closed

is duplicated by

SERVER-23777 and all user requests are All foreground threads are stuck in __wt_evict()

Closed

links to

Pull request #1078

Assignee: Michael Cahill (Inactive)
Reporter: 아나 하리
Participants: Alexander Gorrod, Michael Cahill, Ramon Fernandez Marina, 아나 하리
Votes: 1
Watchers: 11

Created: Apr 18 2016 12:38:00 PM UTC
Updated: Feb 08 2023 03:40:58 PM UTC
Resolved: Jun 08 2016 11:45:59 PM UTC
