[SERVER-20008] Stress test deadlock in WiredTiger
Created: 18/Aug/15 | Updated: 13/Oct/15 | Resolved: 10/Sep/15
| Status: | Closed |
| Project: | Core Server |
| Component/s: | WiredTiger |
| Affects Version/s: | None |
| Fix Version/s: | 3.0.7, 3.1.8 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Eitan Klein | Assignee: | Michael Cahill (Inactive) |
| Resolution: | Done | Votes: | 0 |
| Labels: | 32qa |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Completed: | |
| Steps To Reproduce: | Insert only workload (hammer.mongo) |
| Participants: | |
| Description |
Issue Status as of Sep 10, 2015
ISSUE SUMMARY USER IMPACT WORKAROUNDS AFFECTED VERSIONS FIX VERSION

Configuration: 3-member replica set

Two problems:
1) The primary node is up and running but is not able to perform any CRUD operations (mongostat and other db. . insert({}) hang); however, failover did not occur.
2) WiredTiger executes an endless loop in !__wt_tree_walk and holds up CRUD operations without a timeout/watchdog for robustness (see the debugger output for the lock owner).
RS.Status
|
| Comments |
| Comment by Githook User [ 26/Aug/15 ] | |
|
Author: Michael Cahill (michaelcahill, michael.cahill@mongodb.com) Message: (cherry picked from commit 33f5597916964a6b4956bccac15644b0d61bbb36) | |
| Comment by Githook User [ 26/Aug/15 ] | |
|
Author: Michael Cahill (michaelcahill, michael.cahill@mongodb.com) Message: | |
| Comment by Githook User [ 25/Aug/15 ] | |
|
Author: Alex Gorrod (agorrod, alexander.gorrod@mongodb.com) Message: Merge pull request #2148 from wiredtiger/ (cherry picked from commit 38dad395053b3eca1998c6c1402adc74fc4cba61) | |
| Comment by Githook User [ 25/Aug/15 ] | |
|
Author: Alex Gorrod (agorrod, alexander.gorrod@mongodb.com) Message: Merge pull request #2148 from wiredtiger/ | |
| Comment by Githook User [ 25/Aug/15 ] | |
|
Author: Michael Cahill (michaelcahill, michael.cahill@mongodb.com) Message: | |
| Comment by Githook User [ 25/Aug/15 ] | |
|
Author: Michael Cahill (michaelcahill, michael.cahill@mongodb.com) Message: Also, when clearing eviction walk points before the eviction server goes to sleep, clear all walks, not just ones that sessions are waiting to have cleared. | |
| Comment by Eitan Klein [ 25/Aug/15 ] | |
|
michael.cahill | |
| Comment by Michael Cahill (Inactive) [ 25/Aug/15 ] | |
|
I spent more time stepping through the eviction thread – it appears to be stuck considering a very small set of candidate pages, none of which can actually be evicted. I'm still looking today but I have some ideas about where to look... eitan.klein, how reproducible is this? If I had a patch build to test, would it be easy to tell whether the problem is fixed? | |
| Comment by Michael Cahill (Inactive) [ 24/Aug/15 ] | |
|
eitan.klein, I have spent some time stepping through the process in a debugger. It hasn't hung, as far as I can tell: there is a checkpoint in progress that seems to be taking a long time. How long was the process in this state before you attached the debugger? | |
| Comment by Michael Cahill (Inactive) [ 24/Aug/15 ] | |
|
eitan.klein, thanks for the timeseries HTML, I have taken a look and there is something interesting going on. In future, if you could just attach the server status log file, that would be great: there are many command-line options to bruce.lucas@10gen.com's timeseries tool (including restricting the range for graphing) that can help a lot in seeing the detail. Here is some analysis of the timeseries data:
As you can see from the marked timestamps, every 90 minutes something is causing a spike in activity. The hang occurred during one such spike. At these times, a checkpoint takes much longer than usual (up to 20 seconds instead of ~5). One possible explanation would be some periodic activity on the machine (e.g., backups). Do you know of anything that happens every 90 minutes? In terms of the hang itself, something has gone wrong with eviction. The eviction thread is not finding pages to add to the eviction queue, and consequently application operations are blocked because there is no free space in the cache. I have attached to the debugger and will see if I can find what is going wrong. | |
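The symptom described above (cache full, eviction making no progress) can be looked for in recorded server status samples. The following is a minimal sketch, not the timeseries tool mentioned in this thread. It assumes each sample is the `wiredTiger.cache` sub-document of a serverStatus result and that the statistic names below match your server version; the function name, the 95% threshold, and the minimum run length are hypothetical choices for illustration.

```python
from typing import Any, Iterable, List, Mapping, Tuple

# Assumed statistic names under serverStatus()["wiredTiger"]["cache"];
# check them against the output of your own server version.
BYTES_IN_CACHE = "bytes currently in the cache"
MAX_BYTES = "maximum bytes configured"
EVICTED = ("unmodified pages evicted", "modified pages evicted")


def find_eviction_stalls(
    cache_samples: Iterable[Mapping[str, Any]],
    full_ratio: float = 0.95,
    min_run: int = 3,
) -> List[Tuple[int, int]]:
    """Return (first, last) sample indexes of runs in which the cache is
    nearly full and the cumulative eviction counters did not advance."""
    samples = list(cache_samples)
    stalls: List[Tuple[int, int]] = []
    run_start = None
    prev_evicted = None
    for i, s in enumerate(samples):
        evicted = sum(s[name] for name in EVICTED)
        nearly_full = s[BYTES_IN_CACHE] >= full_ratio * s[MAX_BYTES]
        # A sample counts as stalled when no page was evicted since the
        # previous sample even though the cache is close to its limit.
        stalled = nearly_full and prev_evicted is not None and evicted == prev_evicted
        if stalled and run_start is None:
            run_start = i
        elif not stalled and run_start is not None:
            if i - run_start >= min_run:
                stalls.append((run_start, i - 1))
            run_start = None
        prev_evicted = evicted
    # Close a run that is still open at the end of the data.
    if run_start is not None and len(samples) - run_start >= min_run:
        stalls.append((run_start, len(samples) - 1))
    return stalls
```

With one sample per second, a reported run of many consecutive indexes would correspond to the kind of stall discussed here; short runs are expected on an idle server and are filtered by `min_run`.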
| Comment by Michael Cahill (Inactive) [ 19/Aug/15 ] | |
|
eitan.klein, it is normal for the eviction thread to spend a lot of time walking files looking for pages to evict, so that stack doesn't help find the source of the problem. Can you please gather server status in the usual way while the test is running:
That is the best starting point for analysing what is going on. |
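The exact command referred to above did not survive in this export. As a hedged illustration only, periodic server status collection might be sketched as follows. The helper names, the one-second interval, the `ss.log` file name, and the connection URI are assumptions, and `pymongo_fetcher` assumes the PyMongo driver is installed.

```python
import json
import time
from typing import Any, Callable, Dict, TextIO


def sample_server_status(
    fetch: Callable[[], Dict[str, Any]],
    out: TextIO,
    interval_s: float = 1.0,
    samples: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Write one JSON document per line to `out`, one line per sample.

    `fetch` is any zero-argument callable returning a serverStatus-like
    dict. Returns the number of samples written.
    """
    written = 0
    for i in range(samples):
        doc = fetch()
        # default=str keeps values JSON cannot encode (e.g. dates) readable.
        out.write(json.dumps(doc, default=str) + "\n")
        out.flush()  # keep the log usable even if the server later hangs
        written += 1
        if i + 1 < samples:
            sleep(interval_s)
    return written


def pymongo_fetcher(uri: str = "mongodb://localhost:27017") -> Callable[[], Dict[str, Any]]:
    """Hypothetical helper: build a fetch callable backed by PyMongo."""
    from pymongo import MongoClient  # imported lazily; requires PyMongo

    client = MongoClient(uri, serverSelectionTimeoutMS=5000)
    return lambda: client.admin.command("serverStatus")
```

A possible use while the test is running: `with open("ss.log", "a") as log: sample_server_status(pymongo_fetcher(), log, samples=3600)`. If the server stops responding, the call to `fetch` may block, so the last flushed line marks roughly when the hang began.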