[SERVER-53857] Successive segfaults of several cluster members Created: 17/Jan/21  Updated: 06/Dec/22  Resolved: 25/May/21

Status: Closed
Project: Core Server
Component/s: Stability
Affects Version/s: 4.4.2, 4.4.3
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Lucas Bonnet Assignee: Backlog - Service Architecture
Resolution: Duplicate Votes: 2
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File archi.log.bz2     File loquy.log.bz2     File mongod.conf     File sonic.log.bz2     Text File trace-2-archi.txt     Text File trace-2-higgs.txt     Text File trace-2-loquy.txt     Text File trace-archi.txt     Text File trace-sonic.txt    
Issue Links:
Duplicate
duplicates SERVER-53566 Investigate and reproduce "opCtx != n... Closed
Related
is related to SERVER-49468 Invalidate previous OperationContext ... Closed
is related to SERVER-53566 Investigate and reproduce "opCtx != n... Closed
Assigned Teams:
Service Arch
Operating System: ALL
Sprint: Query Execution 2021-02-22, Query Execution 2021-03-08, Query Execution 2021-03-22, Query Execution 2021-04-05, Query Execution 2021-04-19, Query Execution 2021-05-03, Query Execution 2021-05-17

 Description   

Our primary (called "archi" in the logs, IP ending in .29) crashed with a segfault at night, during a low-traffic period.

One secondary (called "sonic" in the logs, IP ending in .31) became primary to take over, and immediately crashed too, with a different stack trace.

Finally, after a manual restart of servers to bring the cluster back to enough voting members (our voting members were slightly misconfigured at that point), another secondary (called "loquy", IP ending in .34) tried to become primary and crashed too (stack trace identical to the first secondary's). After one last restart of all of them, they recovered.

I have attached all 3 log excerpts and stack traces. The "diagnostic.data" files for the day represent ~160MB in total; I wasn't sure it was good form to plop that in a ticket. They are available here: https://database.lichess.org/mongo-crash/

The primary is running 4.4.2; the secondaries are on 4.4.3 (I was waiting for a maintenance window to upgrade the primary and didn't dare to do it mid-incident).

The only recent admin operation on that cluster was setting the minimum oplog window to 25h (via the CLI) and restarting a few secondaries (not affected by those crashes) with the matching config file setting.
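
For reference, a sketch of that operation using the stock 4.4 mechanisms (the exact session isn't shown in this ticket; replSetResizeOplog and storage.oplogMinRetentionHours are the standard runtime command and config option for this):

    // In the mongo shell, against the primary: set the minimum oplog
    // retention window to 25 hours at runtime.
    db.adminCommand({ replSetResizeOplog: 1, minRetentionHours: 25 })

The matching mongod.conf setting applied to the restarted secondaries would look like:

    storage:
      oplogMinRetentionHours: 25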



 Comments   
Comment by Kyle Suarez [ 25/May/21 ]

Per billy.donahue, I'm closing this issue as a duplicate of SERVER-53566.

lucas@lichess.org, sorry for the delay in getting back to you. Please see SERVER-53566 for a summary of where the fix has been released.

Comment by Kyle Suarez [ 25/May/21 ]

Service Arch, could you please take a look at this to confirm if it's the same issue as in SERVER-53566 as Billy suggests?

Comment by Billy Donahue [ 01/Apr/21 ]

It's really SERVER-53566.

Comment by Dmitry Agranat [ 16/Feb/21 ]

Hi lucas@lichess.org, sorry to hear you've experienced yet another issue:

{"t":{"$date":"2021-02-16T06:49:28.076+00:00"},"s":"F",  "c":"-",        "id":23079,   "ctx":"TopologyVersionObserver","msg":"Invariant failure","attr":{"expr":"opCtx != nullptr && _opCtx == nullptr","file":"src/mongo/db/client.cpp","line":126}}

This new issue is being tracked by SERVER-49468; that ticket shows a backport to 4.4 was requested but not yet implemented.

Comment by Lucas Bonnet [ 16/Feb/21 ]

Hello,

This cluster crash happened a second time, at peak time for us, but this time in a loop (each time a server recovered, it crashed again a few seconds later) until we disabled the feature relying on those 999-level-deep nested BSON documents. We have since rewritten that part of the code and moved the "study" collection to a distinct MongoDB server with a different data storage format, no longer relying on this undocumented feature.

However, during routine maintenance to finally upgrade our PRI server from 4.4.2 to 4.4.3, we experienced another cluster crash, with the following timeline (UTC):

  • 2021-02-16T06:49:28.064 : stepDown manually sent to archi, our PRI (see the command sketch after this list)
  • 2021-02-16T06:49:28.071 : archi crashes
  • 2021-02-16T06:49:28.132 : loquy notices the change of PRI
  • 2021-02-16T06:49:28.145 : loquy becomes PRI as planned
  • 2021-02-16T06:49:28.156 : loquy crashes
  • 2021-02-16T06:49:39.004 : sonic notices there is no PRI, triggers an election
  • 2021-02-16T06:49:39.006 : sonic becomes PRI
  • 2021-02-16T06:49:39.017 : sonic crashes
  • 2021-02-16T06:50:12.904 : loquy manually restarted (right before boole crashes, but too late to rejoin the cluster in time)
  • 2021-02-16T06:50:22.704 : boole notices there is no PRI, triggers an election
  • 2021-02-16T06:50:22.709 : boole becomes PRI
  • 2021-02-16T06:50:22.719 : boole crashes
  • 2021-02-16T06:50:47.788 : sonic manually restarted
  • 2021-02-16T06:50:53.969 : sonic recovers
  • 2021-02-16T06:50:55.336 : loquy recovers (becomes PRI)
  • 2021-02-16T06:50:55.414 : higgs crashes (despite never becoming PRI)
  • 2021-02-16T06:50:55.586 : boole manually restarted
  • 2021-02-16T06:51:01.585 : boole recovers
  • 2021-02-16T06:51:15.123 : archi manually restarted after upgrade to 4.4.3
  • 2021-02-16T06:51:33.377 : archi recovers (becomes PRI), loquy becomes SEC
  • 2021-02-16T06:51:50.564 : higgs manually restarted
  • 2021-02-16T06:51:53.781 : higgs recovers
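
For context, the stepDown at the start of this timeline was the standard replica-set command; a minimal sketch (the exact arguments from that session aren't preserved here):

    // In the mongo shell on archi (the primary at the time): step down
    // and don't seek re-election for 120 seconds. rs.stepDown() wraps
    // the replSetStepDown admin command.
    rs.stepDown(120)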

I attached new stack trace excerpts (the trace-2-* files), but they look different; is that worth a new issue, or is it related to our first crash? Also, please tell me if you need more info, logs, or diagnostic data.

Comment by Dmitry Agranat [ 26/Jan/21 ]

lucas@lichess.org, thank you for the report; I was able to reproduce this issue. We're assigning this ticket to the appropriate team to be evaluated against our currently planned work. Updates will be posted on this ticket as they happen.

Comment by Lucas Bonnet [ 18/Jan/21 ]

Hello, thanks for looking into it.

We actually have maxBSONDepth: 999 in our config files because we rely on deeply nested structures for one collection. I have attached our config file for one server; they're all identical.
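
The relevant excerpt of such a mongod.conf would look roughly like this (paraphrasing the attached file rather than quoting it, and assuming the usual setParameter section for a startup server parameter):

    setParameter:
      maxBSONDepth: 999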

Comment by Dmitry Agranat [ 18/Jan/21 ]

Hi lucas@lichess.org, based on our documentation, MongoDB supports no more than 100 levels of nesting for BSON documents.

Counting the lines in the backtrace, I see exactly 100 BSONObjBuilder frames.

Does this observation align with your workload?
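
For illustration, a hypothetical mongo shell snippet (collection name taken from the report, otherwise made up) that builds a document nested past the documented 100-level limit, the kind of shape the backtrace implies:

    // Build a document nested ~150 levels deep.
    let doc = { leaf: true };
    for (let i = 0; i < 150; i++) {
        doc = { child: doc };
    }
    // With stock limits the server refuses such an insert; the reporter's
    // cluster raised maxBSONDepth to 999, so documents like this were accepted.
    db.study.insertOne(doc);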

Comment by Lucas Bonnet [ 17/Jan/21 ]

Forgot to state the obvious:

  • Ubuntu 18.04 with <1yr uptime
  • Upstream MongoDB packages
  • NVMe storage and generous hardware configuration (96 or 128GB RAM, 16 cores)
  • 1.5TB used by MongoDB data files on disk on the primary