  1. Core Server
  2. SERVER-103499

Write an integration benchmark for gcc unwind bug

    • Type: Task
    • Resolution: Done
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Workload Scheduling
    • Fully Compatible
    • Workload Scheduling 2025-04-14, Workload Scheduling 2025-04-28

      A recent customer outage indicated that a huge number of concurrent uasserts made the system unavailable for a few seconds. That was long enough for the customer's drivers to miss their SLA and drop their entire connection pools, which caused connection churn. mathias@mongodb.com suspects this is related to gcc<12 taking a lock during stack unwinding (see the linked fix commit), and wrote the following microbenchmark in the server code to test the hypothesis:

      diff --git a/src/mongo/db/mongod.cpp b/src/mongo/db/mongod.cpp
      index 40287e87737..daac7a8eb1b 100644
      --- a/src/mongo/db/mongod.cpp
      +++ b/src/mongo/db/mongod.cpp
      @@ -31,6 +31,82 @@
       #include "mongo/util/quick_exit.h"
       #include "mongo/util/text.h"  // IWYU pragma: keep
       
      +#include <condition_variable>
      +#include <mutex>
      +#include <thread>
      +#include <vector>
      +
      +using namespace std::literals;
      +
      +std::mutex mx;
      +std::condition_variable cv;
      +bool ready = false;
      +
      +const int DEPTH = 100;
      +const int THREADS = 10'000;
      +const bool USE_STATUS = false;
      +
      +template <int depth = DEPTH>
      +[[gnu::noinline, noreturn]] void thrower() {
      +    // big string to avoid SSO
      +    std::string s =
      +        "                                                                                         ";
      +    if constexpr (depth)
      +        thrower<depth - 1>();
      +
      +    {
      +        auto lk = std::unique_lock(mx);
      +        cv.wait(lk, [] { return ready; });
      +    }
      +    uasserted(1234, "boom");
      +}
      +
      +template <int depth = DEPTH>
      +[[gnu::noinline]] mongo::Status returner() {
      +    // big string to avoid SSO
      +    std::string s =
      +        "                                                                                         ";
      +    if constexpr (depth) {
      +        auto s = returner<depth - 1>();
      +        if (!s.isOK())
      +            return s;
      +    }
      +
      +    {
      +        auto lk = std::unique_lock(mx);
      +        cv.wait(lk, [] { return ready; });
      +    }
      +    return {mongo::ErrorCodes::Error(1234), "boom"};
      +}
      +
      +int main() {
      +    std::thread([] {
      +        std::this_thread::sleep_for(1s);
      +        auto lk = std::unique_lock(mx);
      +        ready = true;
      +        cv.notify_all();
      +    }).detach();
      +    std::vector<std::thread> threads;
      +    for (int i = 0; i < THREADS; i++) {
      +        threads.emplace_back([] {
      +            if (USE_STATUS) {
      +                (void)returner();
      +            } else {
      +                try {
      +                    thrower();
      +                } catch (...) {
      +                }
      +            }
      +        });
      +    }
      +    for (auto&& t : threads) {
      +        t.join();
      +    }
      +}
      +
      +
      +#if 0
      +
       #if defined(_WIN32)
       // In Windows, wmain() is an alternate entry point for main(), and receives the same parameters
       // as main() but encoded in Windows Unicode (UTF-16); "wide" 16-bit wchar_t characters.  The
      @@ -45,3 +121,4 @@ int main(int argc, char* argv[]) {
           mongo::quickExit(mongo::mongod_main(argc, argv));
       }
       #endif
      +#endif
      
      

       

      The microbenchmark suggests this is indeed a problem, but we'd like to reproduce the issue in an integration benchmark under conditions similar to the original workload.

      The test:

      Have ~10k threads wait in mongos to acquire a connection from the connection pool, and then have all of them time out concurrently.
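The shape of that test can be sketched outside the server as follows. This is a minimal standalone approximation, not mongos code: `ExhaustedPool`, `acquireUntil`, and `timeConcurrentTimeouts` are hypothetical names, and `std::runtime_error` stands in for the exception a uassert would throw. All waiters share one deadline, so they time out and throw at roughly the same moment.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <stdexcept>
#include <thread>
#include <vector>

// Hypothetical stand-in for a connection pool that never has a free connection.
struct ExhaustedPool {
    std::mutex mx;
    std::condition_variable cv;
    bool available = false;  // never set, so every waiter times out

    // Blocks until the shared deadline, then throws (assumed analogue of a uassert).
    void acquireUntil(std::chrono::steady_clock::time_point deadline) {
        bool acquired = false;
        {
            std::unique_lock lk(mx);
            acquired = cv.wait_until(lk, deadline, [this] { return available; });
        }
        // The mutex is released before throwing so that unwinding is not
        // serialized by this sketch's own lock.
        if (!acquired)
            throw std::runtime_error("connection acquisition timed out");
    }
};

// Starts `waiters` threads that share one deadline and returns how long after
// that deadline the last thread finished handling its exception.
std::chrono::milliseconds timeConcurrentTimeouts(std::size_t waiters,
                                                 std::chrono::milliseconds timeout) {
    assert(waiters > 0);
    ExhaustedPool pool;
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    std::vector<std::thread> threads;
    threads.reserve(waiters);
    for (std::size_t i = 0; i < waiters; ++i) {
        threads.emplace_back([&pool, deadline] {
            try {
                pool.acquireUntil(deadline);
            } catch (const std::runtime_error&) {
                // Swallowed: only the cost of throwing and unwinding matters.
            }
        });
    }
    for (auto& t : threads)
        t.join();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(end - deadline);
}
```

Calling `timeConcurrentTimeouts(10'000, std::chrono::seconds(1))` would approximate the scenario described; the timeout has to be long enough for all threads to start before the deadline, otherwise thread creation time leaks into the result.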

      Acceptance criteria:

      Test against a system libgcc_s that has not been patched to include the above fix (gcc<12, but beware that Ubuntu 22.04 has patched its gcc12 to include the fix), and then test again using libgcc_s from our v5 toolchain (using LD_PRELOAD):

      > time LD_PRELOAD=/opt/mongodbtoolchain/v4/lib64/libgcc_s.so.1 bazel-bin/install/bin/test
      > time LD_PRELOAD=/opt/mongodbtoolchain/v5/lib64/libgcc_s.so.1 bazel-bin/install/bin/test
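When comparing the two runs, it can help to confirm that the preloaded library is the one actually in use. The sketch below assumes Linux, where `dl_iterate_phdr` from `<link.h>` is available; `loadedObjectsMatching` is a hypothetical helper name. It lists the paths of loaded shared objects whose name contains a given substring.

```cpp
#include <cassert>
#include <cstddef>
#include <link.h>  // dl_iterate_phdr, dl_phdr_info (Linux)
#include <string>
#include <vector>

// Returns the paths of all loaded objects whose name contains `needle`.
// The main executable is reported with an empty name.
std::vector<std::string> loadedObjectsMatching(const std::string& needle) {
    struct Context {
        const std::string* needle;
        std::vector<std::string> hits;
    } ctx{&needle, {}};

    dl_iterate_phdr(
        [](dl_phdr_info* info, std::size_t, void* data) -> int {
            assert(info != nullptr && data != nullptr);
            auto* c = static_cast<Context*>(data);
            const std::string name = info->dlpi_name ? info->dlpi_name : "";
            if (name.find(*c->needle) != std::string::npos)
                c->hits.push_back(name);
            return 0;  // zero means: keep iterating
        },
        &ctx);
    return ctx.hits;
}
```

Printing the result of `loadedObjectsMatching("libgcc_s")` at the start of the benchmark should show the toolchain path passed via LD_PRELOAD if the preload took effect.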

            Assignee:
            guillaume.racicot@mongodb.com Guillaume Racicot
            Reporter:
            matt.broadstone@mongodb.com Matt Broadstone