[SERVER-40100] Mongos hangs without any log messages Created: 13/Mar/19 Updated: 08/Jan/24 Resolved: 15/Apr/19
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Internal Code |
| Affects Version/s: | 4.0.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Artem | Assignee: | Eric Sedor |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Operating System: | ALL |
| Steps To Reproduce: | Unknown. |
| Participants: | |
| Description |
Symptoms: mongos listens on its port but does not serve client requests for a few hours (until restart). Nothing suspicious was found in the logs. I made a core dump and a backtrace of all threads (the backtrace file is attached to this issue). How often: around twice a month per production environment. Version: mongos v4.0.4. Topology:
| Comments |
| Comment by Artem [ 08/May/19 ] |
After removing the queries that were being killed by the `maxTimeMS` timeout, we have had no more hangs.
| Comment by Eric Sedor [ 15/Apr/19 ] |
bozaro, we believe
| Comment by Eric Sedor [ 19/Mar/19 ] |
Thanks bozaro, this is good to know. We are attempting an internal reproduction around lots of queries with $maxTimeMS set.
| Comment by Artem [ 19/Mar/19 ] |
It looks like we have tracked down the query timeouts. The problem query looks like: `db.someCollection.find({_id: {$in: [... over 20 000 values ...]}}).maxTimeMS(10000)`. All of the requested `_id` values are present in the database.
| Comment by Artem [ 19/Mar/19 ] |
The incident with the last backtraces occurred around `2019-03-18T18:47:44.016+0000Z`. About maxTimeMS: we use this option for almost all queries, with a timeout of 10 seconds. Also, the number of timeouts has grown recently due to some new features; we are trying to reduce it. It is very likely that this is related.
| Comment by Eric Sedor [ 18/Mar/19 ] |
Thanks for the additional information bozaro; we will be attempting an internal reproduction of this. Can you clarify what time UTC that latest "similar" issue occurred? In case it helps us, can you let us know if you are setting maxTimeMS for operations, and if so what value you're providing?
| Comment by Artem [ 17/Mar/19 ] |
Today we got a very similar issue on one host. Backtraces and diagnostic data are attached.
| Comment by Mira Carey [ 13/Mar/19 ] |
My first theory: in ClusterCursorManager::stats, and specifically ClusterCursorManager::CursorEntry::isKillPending(). See thread 11:
Note that we're in checkForInterruptNoAssert from ClusterCursorManager::stats(), which in turn is calling markKilled. For why, see isKillPending()
That's a problem because:
I think this doesn't usually come up because the ops using cluster cursors usually haven't exceeded maxTimeMS. That may be an avenue for an isolated repro.
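To make the suspected failure mode concrete, here is a minimal sketch of the re-entrancy problem described above. It is an illustrative model only, not the actual mongos source: the class `ToyCursorManager`, its methods, and the use of `std::timed_mutex` are assumptions standing in for ClusterCursorManager, its lock, and the stats() / isKillPending() / checkForInterruptNoAssert() / markKilled() call chain. In the real hang the second acquisition would block forever; the sketch times out so the demo terminates.

```cpp
// Illustrative sketch only -- names and structure are assumptions, not mongos code.
// Models the reported pattern: a stats() scan that holds the cursor-manager mutex
// and, while checking for interrupt (maxTimeMS already exceeded), tries to mark the
// cursor killed, which needs the same non-reentrant mutex again -> self-deadlock.
#include <chrono>
#include <iostream>
#include <mutex>

class ToyCursorManager {
public:
    void stats() {
        // Outer lock is held for the whole stats scan.
        std::lock_guard<std::timed_mutex> lk(_mutex);
        // Simulates CursorEntry::isKillPending() performing an interrupt check.
        checkForInterruptNoAssert();
    }

private:
    void checkForInterruptNoAssert() {
        // Pretend the operation's maxTimeMS deadline has already expired,
        // so the interrupt check decides the cursor must be killed.
        markKilled();
    }

    void markKilled() {
        // Needs the manager mutex, but stats() already owns it and the mutex is
        // not reentrant. In the real hang this blocks forever; here we time out.
        if (!_mutex.try_lock_for(std::chrono::seconds(2))) {
            std::cout << "markKilled(): could not re-acquire manager mutex "
                         "-- this is the self-deadlock\n";
            return;
        }
        _mutex.unlock();
    }

    std::timed_mutex _mutex;
};

int main() {
    ToyCursorManager manager;
    manager.stats();  // prints the deadlock diagnostic after the 2s timeout
}
```

This would also be consistent with the observation that the hang only appears under a high rate of maxTimeMS-killed operations: the interrupt check inside the stats path only takes the killing branch once an operation's deadline has already expired.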
| Comment by Eric Sedor [ 13/Mar/19 ] |
Hello and thanks for the report and backtrace. Can you please also provide the logs leading up to a hang, as well as an archive (tar or zip) of the $dbpath/diagnostic.data directory (described here)?