  1. Core Server
  2. SERVER-53857

Successive segfaults of several cluster members

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major - P3
    • None
    • Affects Version/s: 4.4.2, 4.4.3
    • Component/s: Stability
    • Labels: None
    • Service Arch
    • ALL
    • Query Execution 2021-02-22, Query Execution 2021-03-08, Query Execution 2021-03-22, Query Execution 2021-04-05, Query Execution 2021-04-19, Query Execution 2021-05-03, Query Execution 2021-05-17

      Our primary (called "archi" in the logs, IP ending in .29) crashed with a segfault at night, during a low-traffic period.

      One secondary (called "sonic" in the logs, IP ending in .31) became primary to take over, and immediately crashed too, with a different stack trace.

      Finally, after we manually restarted servers to give the cluster enough voting members (our voting members were slightly misconfigured at that point), another secondary (called "loquy", IP ending in .34) tried to become primary and crashed too, with a stack trace identical to the first secondary's. After a final restart of all of them, they recovered.

      I have attached all 3 log excerpts and stack traces. The "diagnostic.data" files for the day total ~160MB, and I wasn't sure it was good form to plop that in a ticket, so they are available here: https://database.lichess.org/mongo-crash/

      The primary is running 4.4.2, the secondaries are on 4.4.3 (I was waiting for a maintenance window to upgrade the primary and didn't dare do it mid-incident).

      The only recent admin operations on that cluster were setting the minimum oplog window to 25h (via CLI) and restarting a few secondaries (not affected by these crashes) with the matching config file setting.
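
      For readers unfamiliar with that setting, here is a minimal sketch of what such a change might look like. It is an illustration, not the exact command run on this cluster: it assumes the change was made with the `replSetResizeOplog` admin command and its `minRetentionHours` field (available from MongoDB 4.4), and the helper name `build_min_oplog_window_command` is hypothetical.

```python
def build_min_oplog_window_command(min_retention_hours):
    """Build a replSetResizeOplog command document that only sets the
    minimum oplog retention window (hypothetical helper, for illustration).

    Assumption: the server is MongoDB 4.4+, where minRetentionHours exists.
    """
    hours = float(min_retention_hours)
    if hours < 0:
        raise ValueError("minimum oplog retention must be >= 0 hours")
    # The command name must be the first key of the command document;
    # Python dicts preserve insertion order.
    return {"replSetResizeOplog": 1, "minRetentionHours": hours}


# 25 hours, the value mentioned in this report.
command = build_min_oplog_window_command(25)
print(command)
```

      With a driver such as PyMongo, a document like this could be passed to `client.admin.command(command)` on each member. The config file counterpart would presumably be `storage.oplogMinRetentionHours: 25`; the attached mongod.conf is the authoritative record of what was actually set.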

        1. archi.log.bz2 (14 kB)
        2. loquy.log.bz2 (65 kB)
        3. mongod.conf (0.4 kB)
        4. sonic.log.bz2 (13 kB)
        5. trace-2-archi.txt (14 kB)
        6. trace-2-higgs.txt (33 kB)
        7. trace-2-loquy.txt (35 kB)
        8. trace-archi.txt (46 kB)
        9. trace-sonic.txt (8 kB)

            Assignee:
            backlog-server-servicearch [DO NOT USE] Backlog - Service Architecture
            Reporter:
            lucas@lichess.org Lucas Bonnet
            Votes:
            2
            Watchers:
            13

              Created:
              Updated:
              Resolved: