Server stalled after a CPU spike

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Affects Version/s: 4.2.12
    • Component/s: None
    • DevProd Last Mile
    • ALL
    • We couldn't find stable reproduction steps.
    • 3

      We have a PSA replica set; each data-bearing node has 32 cores, 64 GB of memory, and a 3 TB SSD. It has been running fine for over two years, but recently, as the data size keeps growing, we have run into a strange problem twice in a month:

      When high traffic occurred, the primary's CPU usage (we use the primaryPreferred read preference) first rose to around 90%, then dropped below 50%, and all queries slowed down after the drop.

      We have examined sysctl parameters, ulimit settings, the filesystem configuration (XFS, no THP), WiredTiger cache usage (around 80%), disk limits (throughput and IOPS), the WiredTiger dirty cache percentage (around 5%), etc., but couldn't figure out the reason behind the stall. Please help confirm whether this is a bug, or give us a clue about what we are doing wrong.
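
      For reference, the two cache percentages quoted above can be derived from the counters under wiredTiger.cache in serverStatus. The following is a minimal sketch, not the exact procedure we used: the sample numbers are hypothetical (they are not taken from the attached FTDC files), the helper name cache_ratios is made up for illustration, and the 80% / 5% targets are assumed to be the WiredTiger default eviction targets.

```python
# Sketch: derive cache fill and dirty ratios from a serverStatus document.
# SAMPLE_STATUS holds made-up numbers for illustration, not data from this incident.

FILL_TARGET_PCT = 80.0   # assumption: WiredTiger default eviction_target
DIRTY_TARGET_PCT = 5.0   # assumption: WiredTiger default eviction_dirty_target


def cache_ratios(server_status):
    """Return (fill_pct, dirty_pct), both relative to the configured cache size."""
    cache = server_status["wiredTiger"]["cache"]
    max_bytes = cache["maximum bytes configured"]
    used_bytes = cache["bytes currently in the cache"]
    dirty_bytes = cache["tracked dirty bytes in the cache"]
    return 100.0 * used_bytes / max_bytes, 100.0 * dirty_bytes / max_bytes


GIB = 1024 ** 3
SAMPLE_STATUS = {
    "wiredTiger": {
        "cache": {
            "maximum bytes configured": 32 * GIB,
            "bytes currently in the cache": 26 * GIB,
            "tracked dirty bytes in the cache": 2 * GIB,
        }
    }
}

fill_pct, dirty_pct = cache_ratios(SAMPLE_STATUS)
print(f"cache fill: {fill_pct:.2f}% (target {FILL_TARGET_PCT}%)")
print(f"cache dirty: {dirty_pct:.2f}% (target {DIRTY_TARGET_PCT}%)")
```

      On a live node the input dictionary could come from db.serverStatus() in the shell, or from client.admin.command("serverStatus") with PyMongo. Values hovering at the targets only indicate that eviction is keeping up; they do not by themselves explain a stall.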

      See attachments for related FTDC files.

      We know version 4.2.12 has reached EoL; apologies in advance if this issue is not appropriate here.

      Many Thanks!

        1. image-2024-04-16-17-47-39-450.png
          170 kB
        2. metrics.2024-04-07T20-46-39Z-00000
          9.85 MB
        3. metrics.2024-04-08T01-36-39Z-00000
          9.83 MB
        4. metrics.2024-04-08T05-37-31Z-00000
          9.85 MB

            Assignee: Chris Kelly
            Reporter: Aaron Wang
            Votes: 0
            Watchers: 5
