[SERVER-29217] Request stuck when dirty cache reaches 20% Created: 15/May/17 Updated: 15/Nov/21 Resolved: 12/Jun/17
| Status: | Closed |
| Project: | Core Server |
| Component/s: | WiredTiger |
| Affects Version/s: | 3.2.10 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Zhang Youdong | Assignee: | Kelsey Schubert |
| Resolution: | Cannot Reproduce | Votes: | 1 |
| Labels: | None |
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: | |
| Participants: | |
| Description |
mongostat data shown below.
| Comments |
| Comment by Kelsey Schubert [ 12/Jun/17 ] |

Hi zyd_com,

An examination of the diagnostic.data you've uploaded indicates that this is a custom, unsupported build. Unfortunately, since this build contains modifications to the source code, we cannot debug it. If you're able to reproduce this issue using a standard build, we'll be able to continue to investigate.

Thank you,
| Comment by Michael Cahill (Inactive) [ 16/May/17 ] |

zyd_com, I've looked through the data you uploaded. This node was in fact running MongoDB 3.2.9, and it was a secondary. I don't see large numbers of queued operations: I assume they were on the primary?

Looking at the eviction statistics, I think it would be worth trying a more recent release of MongoDB such as 3.2.13, which includes many improvements in this area. Is it possible for you to test 3.2.13, even if that is on a hidden secondary?

michael_liuxin, you are correct that WiredTiger's eviction code relies on an eviction server thread to keep the queues full so that other threads can make progress. If something blocks the eviction server thread, that could cause the behavior reported here, where operations stall. This is one of the areas we have improved in recent releases of MongoDB, which is why I think it would be good to test with the latest version to see whether this issue has been resolved.
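A minimal sketch of the arrangement described here, written with plain pthreads rather than WiredTiger's actual eviction code (every identifier below is illustrative): a single server thread is the only producer refilling a queue of eviction candidates and signalling a condition variable, while workers and cache-pressured application threads consume from that queue and sleep on the condition variable whenever it is empty. If the producer thread blocks, every consumer stays parked, which from the client's point of view looks like stalled operations.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define QUEUE_SIZE 8

static int queue[QUEUE_SIZE];
static int queue_len;
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t queue_nonempty = PTHREAD_COND_INITIALIZER;

/* The single producer: walks the cache and queues eviction candidates. */
static void *evict_server(void *arg)
{
    (void)arg;
    for (int page_id = 0;; ++page_id) {
        pthread_mutex_lock(&queue_lock);
        if (queue_len < QUEUE_SIZE) {
            queue[queue_len++] = page_id;
            pthread_cond_signal(&queue_nonempty); /* wake one consumer */
        }
        pthread_mutex_unlock(&queue_lock);
        usleep(1000); /* if this thread ever blocks for good, consumers hang */
    }
    return NULL;
}

/* A consumer: an eviction worker, or an application thread helping evict. */
static void *evict_consumer(void *arg)
{
    long id = (long)(intptr_t)arg;
    for (;;) {
        pthread_mutex_lock(&queue_lock);
        while (queue_len == 0)
            /* The parked state visible in the stack traces: waiting for
             * the server thread to refill the queue. */
            pthread_cond_wait(&queue_nonempty, &queue_lock);
        int page = queue[--queue_len];
        pthread_mutex_unlock(&queue_lock);
        printf("thread %ld evicting page %d\n", id, page);
    }
    return NULL;
}

int main(void)
{
    pthread_t server, consumers[4];

    pthread_create(&server, NULL, evict_server, NULL);
    for (intptr_t i = 0; i < 4; ++i)
        pthread_create(&consumers[i], NULL, evict_consumer, (void *)i);

    sleep(1); /* let the demo run briefly, then exit */
    return 0;
}
```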
| Comment by Michael Liu [ 16/May/17 ] |

@Thomas Schubert I am seeing a stuck problem analogous to @Youdong Zhang's, and I can reproduce it with a smaller cacheSize setting on wiredtiger-2.8.1 in MongoDB v3.2.10. Digging into the evict_server/evict_worker code, I have a few clues. I'm not sure they will help, but they may be useful:

1). In the pstack.rtf file, many __evict_page calls from txn_open (application threads) or __evict_lru_pages (evict workers) are blocked on a mutex, and the others are inside __wt_reconcile() and __rec_row_leaf(). I also see many threads waiting in __wt_cache_eviction_worker() on conn->evict_threads.wait_cond; this means the application threads are waiting for the evict server's __evict_lru_walk to wake them up.

2). In my scenario, GDB breaks at __evict_lru_pages() and shows that the evict worker invokes __evict_page(), and the __wt_reconcile() call inside it always returns EBUSY, so the evict worker retries many times. I cannot find an evict server running at all (an evict worker does not compete to become the evict server); see evict_lru.c:861.

3). If there is no evict server to wake up the application threads, each application thread times out and retries (evict_lru.c:1797), so the application's request looks blocked (see the sketch after this comment for the pattern I mean). There is still an open question in my scenario: why does __wt_reconcile() -> __rec_row_leaf_insert() -> __rec_txn_read() return EBUSY? I found only two statements that would do that.

This problem may be similar to @Youdong Zhang's, I guess, but I don't understand the code inside __rec_txn_read() very well, and I hope for your help. Thanks.
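The evict_lru.c:861 and evict_lru.c:1797 snippets referenced above are not reproduced here, so the following is only a standalone sketch, under assumed semantics, of the two behaviours the comment describes: an application thread doing short timed waits and retrying because a stalled eviction server is slow to signal it, and an eviction attempt that reports EBUSY while the page still carries an update that is not yet visible to every running transaction. Neither the names nor the visibility rule below are the real __wt_cache_eviction_worker() or __rec_txn_read() code.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* ---- EBUSY side: eviction gives up when an update is not yet visible ---- */

struct page_update {
    uint64_t txn_id;  /* transaction that made the update */
    bool committed;   /* has that transaction committed?  */
};

static uint64_t oldest_running_txn = 10; /* illustrative snapshot boundary */

static int try_evict_page(const struct page_update *upd)
{
    /* Illustrative rule: the page can only be evicted once its newest
     * update is committed and visible to every running transaction. */
    if (!upd->committed || upd->txn_id >= oldest_running_txn)
        return EBUSY; /* page stays in cache; the caller will retry */
    return 0;
}

/* ---- timed-wait side: application thread waiting for eviction progress ---- */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static bool cache_has_room = false;

static void *eviction_server_stub(void *arg)
{
    (void)arg;
    sleep(3); /* simulate an eviction server that is slow to make progress */
    pthread_mutex_lock(&lock);
    cache_has_room = true;
    pthread_cond_broadcast(&cond);
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    /* An uncommitted update keeps producing EBUSY, like the worker
     * retries observed under GDB. */
    struct page_update upd = { .txn_id = 42, .committed = false };
    for (int attempt = 1; attempt <= 3; ++attempt)
        printf("evict attempt %d -> %s\n", attempt,
               try_evict_page(&upd) == EBUSY ? "EBUSY" : "evicted");

    /* The application thread does short timed waits and retries until the
     * eviction server finally signals progress; from the client's point of
     * view the request simply looks stuck during that time. */
    pthread_t server;
    pthread_create(&server, NULL, eviction_server_stub, NULL);

    pthread_mutex_lock(&lock);
    while (!cache_has_room) {
        struct timespec deadline;
        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_sec += 1; /* short timed wait, then loop and retry */
        if (pthread_cond_timedwait(&cond, &lock, &deadline) == ETIMEDOUT)
            printf("application thread timed out, retrying\n");
    }
    pthread_mutex_unlock(&lock);

    pthread_join(server, NULL);
    printf("cache has room, request can proceed\n");
    return 0;
}
```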
| Comment by Zhang Youdong [ 16/May/17 ] |

@Thomas Schubert I have uploaded the diagnostic.data file.
| Comment by Kelsey Schubert [ 15/May/17 ] |

Hi zyd_com,

So we can investigate, would you please provide an archive of the diagnostic.data and complete log files for the affected mongod? I've created a secure upload portal for you to use. Files uploaded to this portal are only visible to MongoDB employees investigating this issue and are routinely deleted after some time.

Thank you,