[SERVER-53857] Successive segfaults of several cluster members Created: 17/Jan/21 Updated: 06/Dec/22 Resolved: 25/May/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Stability |
| Affects Version/s: | 4.4.2, 4.4.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Lucas Bonnet | Assignee: | Backlog - Service Architecture |
| Resolution: | Duplicate | Votes: | 2 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Assigned Teams: | Service Arch |
| Operating System: | ALL |
| Sprint: | Query Execution 2021-02-22, Query Execution 2021-03-08, Query Execution 2021-03-22, Query Execution 2021-04-05, Query Execution 2021-04-19, Query Execution 2021-05-03, Query Execution 2021-05-17 |
| Participants: | |
| Description |
|
Our primary (called "archi" in the logs, IP ending in .29) crashed with a segfault at night, during a low-traffic period. One secondary (called "sonic" in the logs, IP ending in .31) became primary to take over and immediately crashed too, with a different stack trace. Finally, after a manual restart of servers to bring the cluster back to enough voting members (our voting members were slightly misconfigured at that point), another secondary (called "loquy", IP ending in .34) tried to become primary and crashed too (stack trace identical to the first secondary's). After a last restart of all of them, they recovered.
I have attached all 3 log excerpts and stack traces. The "diagnostic.data" files for the day total ~160MB, and I wasn't sure it was good form to plop that in a ticket. They are available here: https://database.lichess.org/mongo-crash/
The primary is running 4.4.2; the secondaries are on 4.4.3 (I was waiting for a maintenance window to upgrade the primary and didn't dare do it mid-incident).
The only recent admin operation on that cluster was setting the minimum oplog window to 25h (via CLI), and restarting a few secondaries (not affected by those crashes) with the matching config file setting. |
| Comments |
| Comment by Kyle Suarez [ 25/May/21 ] | |
|
Per billy.donahue, I'm closing this issue as a duplicate of lucas@lichess.org, sorry for the delay in getting back to you. Please see | |
| Comment by Kyle Suarez [ 25/May/21 ] | |
|
Service Arch, could you please take a look at this to confirm if it's the same issue as in | |
| Comment by Billy Donahue [ 01/Apr/21 ] | |
|
It's really | |
| Comment by Dmitry Agranat [ 16/Feb/21 ] | |
|
Hi lucas@lichess.org, sorry to hear you've experienced yet another issue:
This new issue is being tracked by | |
| Comment by Lucas Bonnet [ 16/Feb/21 ] | |
|
Hello,
this cluster crash happened a second time, at peak time for us, but this time it happened in a loop (each time a server recovered, it crashed again a few seconds later) until we disabled the feature relying on those 999-level nested BSON documents. We have since rewritten this part of the code and moved the "study" collection to a distinct MongoDB server with a different data storage format, no longer relying on this undocumented feature.
However, during routine maintenance to finally upgrade our PRI server from 4.4.2 to 4.4.3, we experienced another cluster crash, with the following timeline (UTC):
I attached new stack trace excerpts (trace-2-* files), but they look different. Is this worthy of a new issue, or is it related to our first crash? Also, please tell me if you need more info, logs, or diagnostic data. | |
| Comment by Dmitry Agranat [ 26/Jan/21 ] | |
|
lucas@lichess.org, thank you for the report, I was able to reproduce this issue. We're assigning this ticket to the appropriate team to be evaluated against our currently planned work. Updates will be posted on this ticket as they happen. | |
| Comment by Lucas Bonnet [ 18/Jan/21 ] | |
|
Hello, thanks for looking into it. We actually have maxBSONDepth: 999 in our config files because we rely on deeply nested structures for one collection. I have attached our config file for one server; they're all identical. | |
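For illustration, the nesting depth of a document can be checked on the client side before it is written. The following is a minimal sketch in Python, assuming the document is available as plain dicts and lists. `nesting_depth` is a hypothetical helper, not a MongoDB or driver API, and the server may count levels slightly differently (for example, off by one at the top level).

```python
from typing import Any


def nesting_depth(value: Any) -> int:
    """Return the nesting depth of a dict/list structure; scalars count as 0.

    Hypothetical helper for illustration only (not a MongoDB API).
    """
    # Iterative traversal with an explicit stack, so that very deep
    # documents do not hit Python's own recursion limit.
    max_depth = 0
    stack = [(value, 0)]
    while stack:
        node, depth = stack.pop()
        if isinstance(node, dict):
            children = list(node.values())
        elif isinstance(node, list):
            children = node
        else:
            continue  # scalar: adds no nesting level
        depth += 1
        max_depth = max(max_depth, depth)
        stack.extend((child, depth) for child in children)
    return max_depth
```

A caller could compare the result against whatever limit applies to the deployment (100 per the documentation quoted in this ticket, or the configured maxBSONDepth) and reject or restructure documents that exceed it.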
| Comment by Dmitry Agranat [ 18/Jan/21 ] | |
|
Hi lucas@lichess.org, based on our documentation, MongoDB supports no more than 100 levels of nesting for BSON documents. By counting the lines in the backtrace, I saw exactly 100 BSONObjBuilder frames. Does this observation align with your workload? | |
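The frame count mentioned above can be reproduced from a saved backtrace with a few lines of Python. This is a minimal sketch, assuming the backtrace is plain text with one frame per line; `count_frames` is a hypothetical helper, and the exact frame format in the attached traces is not assumed here.

```python
def count_frames(backtrace: str, symbol: str = "BSONObjBuilder") -> int:
    """Count the backtrace lines that mention `symbol`.

    Hypothetical helper for illustration; it does a plain substring match
    per line and makes no assumption about the frame format.
    """
    return sum(1 for line in backtrace.splitlines() if symbol in line)
```

Reading one of the attached trace files into a string and passing it to `count_frames` would give the number of matching frames, which can then be compared with the nesting limit.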
| Comment by Lucas Bonnet [ 17/Jan/21 ] | |
|
Forgot to state the obvious:
|