Core Server / SERVER-102388

Primary node is frozen momentarily post step-up from a rolling index build/drop

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: 4.4.9
    • Component/s: None
    • We have not been able to reproduce this consistently so far, so it is difficult to share exact steps; if we do see a consistent occurrence we will update the ticket.

      Summary:

      We have observed a few cases of the primary node freezing momentarily (~5-20 seconds) after step-up following a rolling index build/drop on our setup running MongoDB 4.4.9.

      At this point we have only observed this on our replica sets running Ubuntu Noble (24.04.2) with kernel version 6.8.0-1023-aws, though we suspect that our setup on Ubuntu Focal (20.04.6) with kernel version 5.15.0-1056-aws may show the same behaviour in a less aggravated form.

      Events leading to the freeze:

      We run rolling index builds on our setup as mentioned above: we build indexes on the secondaries first by restarting each one in standalone mode and then adding it back to the replica set. We then step down the active primary and run the same procedure on that node. When we see a freeze, the series of events lines up with the following (a sketch of the step-down/index-build sequence follows the list):

      • Run index builds on the secondaries by restarting mongod in standalone mode, building the index, and then adding each node back to the replica set.
      • Step down the primary node to transition it to a secondary role, and build the index on it following the same step as above.
      • The newly stepped-up primary stalls, leading to a brief unavailability window.
      • The degree of the stall varies, ranging from only a small percentage of queries getting through to all queries being completely stalled.
      • When the above-mentioned stall happens, depending on its degree, we see:
      • In most cases no election is triggered in the replica set, indicating the primary was heartbeating fine (though heartbeats could have slowed down; we do not yet know the degree of slowness since verbose heartbeat logging is not enabled in our setup).
      • In some cases, a new election term runs in the replica set, indicating the secondaries could not see a primary.
      • We also see all secondaries stop receiving further batches from the oplog fetcher cursor, indicating that it is also stalled. This also manifests as increased replication lag.
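
      For reference, the step-down and standalone index-build sequence we follow looks roughly like the sketch below (a minimal pymongo sketch; host names, ports, database/collection names and the index spec are hypothetical, and the restart into standalone mode is done by our provisioning tooling outside of this script):

      ```python
      # Hedged sketch of the rolling-index-build step run on the (former) primary.
      # Hosts, ports, database/collection names and the index spec are hypothetical.
      from pymongo import MongoClient, ASCENDING

      # 1. Step the current primary down so that a secondary takes over.
      primary = MongoClient("mongodb://node-a:27017", directConnection=True)
      primary.admin.command("replSetStepDown", 120)  # remain secondary for 120s

      # 2. The node is then restarted as a standalone mongod (outside this script)
      #    and the index is built while it is out of the replica set.
      standalone = MongoClient("mongodb://node-a:27018", directConnection=True)
      standalone["app_db"]["orders"].create_index(
          [("customer_id", ASCENDING)], name="customer_id_1"
      )

      # 3. The node is restarted with its replica set configuration and rejoins as a
      #    secondary; it later steps up, which is the point at which we see the stall.
      ```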

      Setup notes:

      While we do not see this on every rolling index build, in most of the cases where we have seen stalls of the nature described above we have had the following setup:

      • Nodes running mongod are AWS i3en.12xlarge instances (48 vCPUs, 384 GB memory).
      • Index builds are on collections ranging from 70-200 million documents and occupying ~200 GB+ of disk space; we use the zstd block compressor with WiredTiger (a configuration sketch follows this list).
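
      For completeness, one way a collection can be configured with the zstd block compressor on WiredTiger (a minimal sketch with hypothetical names; the compressor can also be set globally via the mongod configuration):

      ```python
      # Hedged sketch: per-collection zstd block compression with WiredTiger.
      # Database and collection names are hypothetical.
      from pymongo import MongoClient

      client = MongoClient("mongodb://node-a:27017")
      client["app_db"].create_collection(
          "orders",
          storageEngine={"wiredTiger": {"configString": "block_compressor=zstd"}},
      )
      ```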

      Observations and Analysis:

      When we dug into this further by looking at FTDC metrics from the time of the stall and running system-wide perf traces, some interesting observations came up (FTDC screenshots and perf flamegraphs are attached to the ticket):

      • FTDC shows a step-function increase in % CPU utilisation (system) while % CPU utilisation (user) remains extremely low, indicating that something in kernel space is starving the host.
      • We also see missing FTDC data points for the duration of the stall, indicating that the FTDC collector threads were also stalled.
      • Stack traces from perf indicate that most of the calls are contending on a kernel lock protecting the LRU lists used for OS page cache management.
      • Most of these locks are taken by the slab shrinker while trying to shrink shadow entry lists (see the monitoring sketch after this list).
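
      A rough sketch of how we correlate this from the host side: sample the workingset/shadow-entry and slab counters in /proc/vmstat alongside a trivial MongoDB round trip, so that gaps or spikes in the samples line up with the stall window (counter names vary by kernel version and the host name is hypothetical; treat this as illustrative only):

      ```python
      # Hedged sketch: sample kernel reclaim counters next to a MongoDB ping so a
      # freeze shows up as a gap/spike in the sampled output.
      import time
      from pymongo import MongoClient

      FIELDS = {"workingset_nodes", "workingset_refault_file", "nr_slab_reclaimable"}

      def vmstat_snapshot():
          snap = {}
          with open("/proc/vmstat") as f:
              for line in f:
                  key, value = line.split()
                  if key in FIELDS:
                      snap[key] = int(value)
          return snap

      client = MongoClient("mongodb://node-a:27017", serverSelectionTimeoutMS=2000)
      while True:
          start = time.monotonic()
          try:
              client.admin.command("ping")  # blocks or times out during the freeze
              ping_ms = (time.monotonic() - start) * 1000
          except Exception:
              ping_ms = None
          print(time.time(), ping_ms, vmstat_snapshot())
          time.sleep(1)
      ```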

      Some Questions:

      • We wanted to check whether any regressions around the themes above have been seen in testing or reported by the community.
      • As part of mitigating this, we are planning to add a step that flushes the OS page cache before re-adding the node to the replica set after the index build/drop (a sketch of this step follows this list). Are there any caveats around performance impact that we should be aware of?
      • Are there tunables in MongoDB/WiredTiger that we could leverage to alleviate the effects we are currently seeing?
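
      For context, the cache-flush step we are considering is roughly the following (a hedged sketch; it requires root, and its exact placement in our tooling is still being decided):

      ```python
      # Hedged sketch of the page-cache flush we are considering running on a node
      # after the standalone index build, before adding it back to the replica set.
      import os
      import subprocess

      def flush_os_page_cache():
          subprocess.run(["sync"], check=True)  # write dirty pages back first
          with open("/proc/sys/vm/drop_caches", "w") as f:
              f.write("3\n")  # drop page cache plus dentries/inodes

      if __name__ == "__main__":
          if os.geteuid() != 0:
              raise SystemExit("must be run as root")
          flush_os_page_cache()
      ```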

        1. perf-trace.png (658 kB, Deep Vyas)

            Assignee:
            Unassigned
            Reporter:
            Deep Vyas (vyasdeep.dv@gmail.com)
            Votes:
            0
            Watchers:
            5
