[SERVER-32182] Deadlock in WiredTiger Created: 06/Dec/17 Updated: 29/Jan/18 Resolved: 12/Jan/18
| Status: | Closed |
| Project: | Core Server |
| Component/s: | WiredTiger |
| Affects Version/s: | 3.4.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Sergey Zagursky | Assignee: | Keith Bostic (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 1 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | Ubuntu 16.04.2 LTS, x86 |
| Attachments: | |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Sprint: | Storage 2018-01-29 |
| Participants: | |
| Description |

The symptoms of the issue are:
| Comments |
| Comment by Sergey Zagursky [ 13/Jan/18 ] |
I haven't encountered this issue anymore. We upgraded MongoDB to 3.4.10 shortly after reporting the issue here.
| Comment by Keith Bostic (Inactive) [ 12/Jan/18 ] |
sz, I'm going to close this ticket for now: I can't think of any way to pursue this problem without additional information, and I've been unable to reproduce the failure in any test. Please don't hesitate to re-open this ticket or open a new one if there's any additional information or further problems.
| Comment by Keith Bostic (Inactive) [ 04/Jan/18 ] |
sz, I was wondering whether you've seen this failure in the last month. And while reviewing the ticket, I realized we hadn't asked whether the failure occurred on the same piece of hardware each time (and, if so, whether that hardware is still running). Thanks!
| Comment by Sergey Zagursky [ 18/Dec/17 ] |
No.
Unfortunately, it isn't reproducible consistently enough. It occurred three times total: 11 Nov, 02 Dec, and 04 Dec. There is no pattern I'm aware of. 02 Dec and 04 Dec definitely weren't under heavy load; in fact, the load was pretty low. I still have the core dump here. Can I be of any help inspecting it? I can be your hands and eyes.
| Comment by Keith Bostic (Inactive) [ 18/Dec/17 ] |
sz, I'm afraid we're stuck on this one: we've reviewed the information and run experiments trying to reproduce the failure, all without success. There was the one problem we documented running on Azure, but that's the only explanation we have, and as far as we know the problem isn't happening anywhere else. Has the deadlock occurred again since you filed the ticket? If this is reproducible for you in a reasonable amount of time, the next step might be to provide an instrumented build for you to run (though that build might well have different performance characteristics from the standard builds). Would that be possible and worth the additional effort for you? Otherwise, I don't see any way to make progress on this one. I'm truly sorry for the inconvenience; I wish it were otherwise!
| Comment by Sergey Zagursky [ 07/Dec/17 ] |
We've seen this problem three times so far.
The wait time was different each time, but it definitely wasn't on the scale of seconds: IIRC the shortest was 10 minutes and the longest almost 50 minutes.
I personally don't think our issue is related to |
| Comment by Keith Bostic (Inactive) [ 07/Dec/17 ] |
sz, a couple of questions: How often have you seen this problem, does it repeat, or was it just the one time? And, how long did you wait for the secondary to wake up? There is a possibly related problem one of our developers flagged ( |
| Comment by Ramon Fernandez Marina [ 06/Dec/17 ] |
Thanks for the detailed report, sz; we're looking at the stack traces you provided.
| Comment by Sergey Zagursky [ 06/Dec/17 ] |
I've attached stack traces of all mongod threads at the moment of the deadlock.