[SERVER-44058] Spontaneous deadlock on mongos Created: 17/Oct/19  Updated: 24/Nov/19  Resolved: 24/Nov/19

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Sergey Zagursky Assignee: Dmitry Agranat
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File bts.txt    
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   

We've hit an issue with spontaneous mongos deadlocks. Unfortunately I can't provide any hints for reproducing it but I've managed to take a core dump from troubled mongos process. See attached file with stacktraces from the core dump.
It looks like all threads are waiting for ThreadPoolTaskExecutor to release _mutex which is acquired by waitForEvent method waiting for some event.



 Comments   
Comment by Dmitry Agranat [ 24/Nov/19 ]

Hi sz,

Thank you for the update. If this is still an issue for you, please provide additional information and we will reopen the ticket.

Regards,
Dima

Comment by Sergey Zagursky [ 20/Nov/19 ]

Hi @Dmitry Agranat!

We haven't encountered this issue since then. I can't attach the logs from diagnostic.data now because they were rolled over.

Comment by Dmitry Agranat [ 11/Nov/19 ]

Hi sz,

If this is still an issue for you, please upload the requested information.

Thanks,
Dima

Comment by Dmitry Agranat [ 20/Oct/19 ]

Thanks Sergey,

Looking at the provided mongoS log, we can see a cluster of such errors shortly after the time you've mentioned:

2019-10-17T09:10:32.875+0000 I NETWORK  [conn618771] Marking host mongo-product6-2b.REDACTED.x:REDACTED as failed :: caused by :: InterruptedDueToReplStateChange: operation was interrupted

This suggests that during this time, we had elections on shard product6, which would also explain the mentioned "waiting for some event".

To validate this theory, please upload mongoD logs the $dbpath/diagnostic.data directory (the contents are described here), covering the time of this event from all members of this shard.

Thanks,
Dima

Comment by Sergey Zagursky [ 17/Oct/19 ]

Uploaded mongos.log.20191017. The incident started at 09:25 UTC (approximately). Timestamps in the log file are UTC too.
We had another incident with several mongod s a bit earlier, so there are a lot of messages in the log about connectivity.

Comment by Dmitry Agranat [ 17/Oct/19 ]

Hi Sergey, here is the uploader link. Please also mention the time and the timezone of the reported event.

Comment by Sergey Zagursky [ 17/Oct/19 ]

As for logs I need to upload them using support uploader. Can you provide a link to it?

Comment by Sergey Zagursky [ 17/Oct/19 ]

Sure. Sorry for not reporting full info. We use 4.0.10.

$ mongos --version
mongos version v4.0.10
git version: c389e7f69f637f7a1ac3cc9fae843b635f20b766
OpenSSL version: OpenSSL 1.0.2g  1 Mar 2016
allocator: tcmalloc
modules: none
build environment:
    distmod: ubuntu1604
    distarch: x86_64
    target_arch: x86_64

$ mongod --version
db version v4.0.10
git version: c389e7f69f637f7a1ac3cc9fae843b635f20b766
OpenSSL version: OpenSSL 1.0.2g  1 Mar 2016
allocator: tcmalloc
modules: none
build environment:
    distmod: ubuntu1604
    distarch: x86_64
    target_arch: x86_64

Comment by Dmitry Agranat [ 17/Oct/19 ]

Hi sz,

Could you provide the full mongoS log covering the time of the event as well as MongoDB version?

Thanks,
Dima

Generated at Thu Feb 08 05:04:52 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.