[SERVER-32182] Deadlock in wiredtiger Created: 06/Dec/17  Updated: 29/Jan/18  Resolved: 12/Jan/18

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.4.6
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Sergey Zagursky Assignee: Keith Bostic (Inactive)
Resolution: Cannot Reproduce Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu 16.04.2 LTS, x86
AWS EC2 i3 instance


Attachments: Text File stacktrace.txt    
Issue Links:
Related
is related to WT-3504 Deadlock in wiredtiger Closed
Backwards Compatibility: Fully Compatible
Sprint: Storage 2018-01-29
Participants:

 Description   

The symptoms of the issue are:
1. One of the secondary nodes in the MongoDB replica set stops responding to all read requests. Oplog application also stops on this node.
2. Other nodes in the replica set see the failed node as healthy because it still responds to pings and status requests.
3. Investigation shows that all requests on the hung server are waiting for the GlobalLock (see the sketch below).
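
A minimal monitoring sketch along these lines, assuming the pymongo driver, a placeholder host name, and that serverStatus/replSetGetStatus still respond on the hung node (as the status requests did in this case):

    # Poll a secondary and flag the combination seen here: clients queued on
    # the global lock while oplog application has stopped advancing.
    # Host, port, and the 30-second interval are placeholder values.
    import time
    from pymongo import MongoClient

    client = MongoClient("secondary.example.net", 27017)  # hypothetical host

    prev_optime = None
    while True:
        status = client.admin.command("serverStatus")
        repl = client.admin.command("replSetGetStatus")

        queue = status["globalLock"]["currentQueue"]  # clients waiting on the global lock
        me = next(m for m in repl["members"] if m.get("self"))
        optime = me["optimeDate"]                     # wall-clock time of last applied oplog entry

        if prev_optime is not None and optime == prev_optime and queue["total"] > 0:
            print("possible hang: %d clients queued on GlobalLock, oplog apply stalled at %s"
                  % (queue["total"], optime))
        prev_optime = optime
        time.sleep(30)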



 Comments   
Comment by Sergey Zagursky [ 13/Jan/18 ]

I haven't encountered this issue since. We upgraded MongoDB to 3.4.10 shortly after reporting the issue here.

Comment by Keith Bostic (Inactive) [ 12/Jan/18 ]

sz, I'm going to close this ticket for now: I can't think of any way to pursue this problem without additional information and I've been unable to reproduce the failure in any test. Please don't hesitate to re-open this ticket or open a new one if there's any additional information or further problems.

Comment by Keith Bostic (Inactive) [ 04/Jan/18 ]

sz, I was wondering if you've seen this failure in the last month?

And while reviewing the ticket, I realized we hadn't asked whether the failure was on the same piece of hardware (and if so, whether that hardware is still running).

Thanks!

Comment by Sergey Zagursky [ 18/Dec/17 ]

Has the deadlock occurred again, since you filed the ticket?

No.

If this is reproducible for you in a reasonable amount of time, the next step might be to provide an instrumented build for you to run (but that build may well have different performance characteristics than the standard builds). Would that be possible and worth the additional effort for you?

Unfortunately, it isn't reproducible consistently enough. It occurred three times in total: 11 Nov, 02 Dec, and 04 Dec. There is no pattern I'm aware of. The 02 Dec and 04 Dec occurrences definitely weren't under heavy load; in fact the load was pretty low.
As this is a production system, we are somewhat limited performance-wise. We definitely can't accept a 2x throughput reduction, but a 10-20% performance penalty should be OK. What would the performance penalty be?

I still have the core dump here. Can I be of any help inspecting it? I can be your hands and eyes.
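
A minimal sketch of how the core dump could be inspected for lock waits, assuming gdb is installed and using placeholder paths for the mongod binary and the core file:

    # Dump per-thread backtraces from the saved core and print only the
    # threads parked in futex/condition waits, which is where a deadlock
    # would show up. Both paths below are placeholders.
    import subprocess

    MONGOD_BIN = "/usr/bin/mongod"       # must match the binary that produced the core
    CORE_FILE = "/var/tmp/core.mongod"   # hypothetical core dump location

    out = subprocess.run(
        ["gdb", "--batch", "-ex", "thread apply all bt", MONGOD_BIN, CORE_FILE],
        capture_output=True, text=True, check=True,
    ).stdout

    for block in out.split("\nThread ")[1:]:
        backtrace = block.split("\n\n")[0]
        if "__lll_lock_wait" in backtrace or "pthread_cond_wait" in backtrace:
            print("Thread " + backtrace)
            print()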

Comment by Keith Bostic (Inactive) [ 18/Dec/17 ]

sz, I'm afraid we're stuck on this one: we've reviewed the information and run experiments trying to reproduce the failure, all without success.

There was one problem we documented running on Azure, but that's the only explanation we have, and as far as we know the problem isn't happening anywhere else.

Has the deadlock occurred again, since you filed the ticket?

If this is reproducible for you in a reasonable amount of time, the next step might be to provide an instrumented build for you to run (but that build may well have different performance characteristics than the standard builds). Would that be possible and worth the additional effort for you?

Otherwise, I don't see any way to make progress on this one.

I'm truly sorry for the inconvenience, I wish it were otherwise!

Comment by Sergey Zagursky [ 07/Dec/17 ]

keith.bostic

How often have you seen this problem, does it repeat, or was it just the one time?

We've seen this problem three times so far.

And, how long did you wait for the secondary to wake up?

The wait time was different each time. But it definitely wasn't on the scale of seconds. IIRC the shortest was 10 minutes. The longest was almost 50 minutes.

There is a possibly related problem one of our developers flagged (WT-3461/SERVER-31215), but the fix for that problem isn't yet included in a MongoDB 3.4 release (it will be included in the 3.4.11 release).

I personally don't think our issue is related to WT-3461/SERVER-31215 because the system time was pretty stable at the moment of failure. There were also no manual time adjustments at that time, and the timestamps in mongod.log appear to be sequential. We can't rule out clock skew completely, though, because we have no time monitoring on the failed instances.
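
A minimal sketch of the timestamp check described above, assuming the default MongoDB 3.4 iso8601 log timestamps and a placeholder log path:

    # Scan mongod.log and flag any line whose leading timestamp jumps
    # backwards relative to the previous line, which would hint at a clock
    # adjustment around the time of the hang.
    from datetime import datetime

    prev = None
    with open("/var/log/mongodb/mongod.log") as log:  # hypothetical path
        for lineno, line in enumerate(log, 1):
            stamp = line.split(" ", 1)[0]             # e.g. 2017-12-04T12:34:56.789+0000
            try:
                ts = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%f%z")
            except ValueError:
                continue                              # not a timestamped line
            if prev is not None and ts < prev:
                print("backwards jump at line %d: %s -> %s" % (lineno, prev, ts))
            prev = ts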

Comment by Keith Bostic (Inactive) [ 07/Dec/17 ]

sz, a couple of questions:

How often have you seen this problem, does it repeat, or was it just the one time?

And, how long did you wait for the secondary to wake up?

There is a possibly related problem one of our developers flagged (WT-3461/SERVER-31215), but the fix for that problem isn't yet included in a MongoDB 3.4 release (it will be included in the 3.4.11 release). However, while that problem can occur on Linux systems, it has generally been seen on Windows systems, specifically on Azure.

Comment by Ramon Fernandez Marina [ 06/Dec/17 ]

Thanks for the detailed report sz, we're looking at the stack traces you provided.

Comment by Sergey Zagursky [ 06/Dec/17 ]

I've attached stack traces of all mongod threads at the moment of deadlock.
