Core Server / SERVER-63984

Primary replica member becomes unavailable during normal operation

    • Type: Bug
    • Resolution: Community Answered
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Environment:
      Ubuntu 16.04
      XFS
      Kernel - 4.4.0-1128-aws #142-Ubuntu SMP Fri Apr 16 12:42:33 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
      Transparent Huge Pages disabled
      AWS m5.large (2 vCPU / 8 GB)
      SSD gp3, 450 GB
      mongodb-org-server - 4.2.17
    • ALL

      For no apparent reason, the primary replica set member of one of our shards became unresponsive and stayed that way until we restarted it.

      The incident lasted for about 35 minutes. During that time, we saw almost 100% consumption on the primary, and its load average reached up to 60 times its normal value.

      From the logs (starting at the beginning of the incident), we could determine only the following:
      1) The number of open connections started increasing.
      2) Some open cursors timed out.
      3) The pooled connections to other members were dropped (the log cites a shutdown, but we had not tried to shut down the primary at that time).
      ```
      I CONNPOOL [TaskExecutorPool-0] Dropping all pooled connections to some-secondary:27017 due to ShutdownInProgress: Pool for some-secondary:27017 has expired.
      ```
      4) After some time, no log entries appeared for about 25 minutes, until we restarted the primary.
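
A silent period like the one in item 4 can be located mechanically in a log file. The following is a minimal sketch, not part of the original report. It assumes the MongoDB 4.2 plain-text log format, in which every line starts with an ISO-8601 timestamp such as `2022-01-01T10:00:00.000+0000`. The function names and the 5-minute default threshold are hypothetical choices for illustration.

```python
from datetime import datetime, timedelta

# Assumed 4.2-style line prefix:
#   "2022-01-01T10:00:00.000+0000 I COMPONENT [context] message"
TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"


def parse_timestamp(line):
    """Return the leading timestamp of a log line, or None if it has none."""
    token = line.split(" ", 1)[0]
    try:
        return datetime.strptime(token, TS_FORMAT)
    except ValueError:
        return None


def find_log_gaps(lines, min_gap=timedelta(minutes=5)):
    """Return (last_seen, next_seen) pairs where logging paused for >= min_gap."""
    gaps = []
    previous = None
    for line in lines:
        current = parse_timestamp(line)
        if current is None:
            continue  # skip lines that do not start with a timestamp
        if previous is not None and current - previous >= min_gap:
            gaps.append((previous, current))
        previous = current
    return gaps


# Usage (the path is a placeholder):
#   with open("mongod.log") as f:
#       for start, end in find_log_gaps(f):
#           print(start, "->", end, "silent for", end - start)
```

Each reported pair gives the last timestamp before the silence and the first one after it, which can then be lined up with the `diagnostic.data` metrics for the same window.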

      Our cluster configuration:

      • sharded cluster with 10 shards
      • four replica set members in each shard
      • about 400 GB of data in storage size per shard

      Replica server configuration:

      • Ubuntu 16.04
      • XFS
      • Kernel - 4.4.0-1128-aws #142-Ubuntu SMP Fri Apr 16 12:42:33 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
      • Transparent Huge Pages disabled
      • AWS m5.large (2 vCPU / 8 GB)
      • SSD gp3, 450 GB
      • mongodb-org-server - 4.2.17

      The `diagnostic.data` directories from the primary and from one of the secondaries are attached to this ticket.

        1. data.diagnostics.zip
          14.66 MB
        2. diagnostic.data.zip
          48.37 MB
        3. diagnostic.data-1.zip
          14.10 MB
        4. diagnostic.data-2.zip
          19.85 MB
        5. gdb_2022-04-12_14-15-43.txt
          535 kB
        6. gdb_2022-05-12_09-59-24.zip
          146 kB
        7. gdb.html
          27 kB
        8. metrics.zip
          31.69 MB

            Assignee:
            dmitry.agranat@mongodb.com Dmitry Agranat
            Reporter:
            vladimirred456@gmail.com Vladimir Beliakov
