Type: Task
Resolution: Done
Priority: Major - P3
Affects Version/s: None
Component/s: None
Workload Scheduling
Fully Compatible
Workload Scheduling 2025-04-14, Workload Scheduling 2025-04-28
A recent customer outage showed that a huge number of concurrent uasserts made the system unavailable for a few seconds. This was long enough for the customer's drivers to miss their SLA and drop their entire connection pools (causing them to churn their connections). mathias@mongodb.com suspects this is related to gcc<12 taking a global lock during stack unwinding (see fix commit here), and wrote the following microbenchmark in the server code to test the hypothesis:
diff --git a/src/mongo/db/mongod.cpp b/src/mongo/db/mongod.cpp
index 40287e87737..daac7a8eb1b 100644
--- a/src/mongo/db/mongod.cpp
+++ b/src/mongo/db/mongod.cpp
@@ -31,6 +31,82 @@
 #include "mongo/util/quick_exit.h"
 #include "mongo/util/text.h"  // IWYU pragma: keep
 
+#include <condition_variable>
+#include <mutex>
+#include <thread>
+#include <vector>
+
+using namespace std::literals;
+
+std::mutex mx;
+std::condition_variable cv;
+bool ready = false;
+
+const int DEPTH = 100;
+const int THREADS = 10'000;
+const bool USE_STATUS = false;
+
+template <int depth = DEPTH>
+[[gnu::noinline, noreturn]] void thrower() {
+    // big string to avoid SSO
+    std::string s =
+        "                                                                        ";
+    if constexpr (depth)
+        thrower<depth - 1>();
+
+    {
+        auto lk = std::unique_lock(mx);
+        cv.wait(lk, [] { return ready; });
+    }
+    uasserted(1234, "boom");
+}
+
+template <int depth = DEPTH>
+[[gnu::noinline]] mongo::Status returner() {
+    // big string to avoid SSO
+    std::string s =
+        "                                                                        ";
+    if constexpr (depth) {
+        auto s = returner<depth - 1>();
+        if (!s.isOK())
+            return s;
+    }
+
+    {
+        auto lk = std::unique_lock(mx);
+        cv.wait(lk, [] { return ready; });
+    }
+    return {mongo::ErrorCodes::Error(1234), "boom"};
+}
+
+int main() {
+    std::thread([] {
+        std::this_thread::sleep_for(1s);
+        auto lk = std::unique_lock(mx);
+        ready = true;
+        cv.notify_all();
+    }).detach();
+    std::vector<std::thread> threads;
+    for (int i = 0; i < THREADS; i++) {
+        threads.emplace_back([] {
+            if (USE_STATUS) {
+                (void)returner();
+            } else {
+                try {
+                    thrower();
+                } catch (...) {
+                }
+            }
+        });
+    }
+    for (auto&& t : threads) {
+        t.join();
+    }
+}
+
+#if 0
+
 #if defined(_WIN32)
 // In Windows, wmain() is an alternate entry point for main(), and receives the same parameters
 // as main() but encoded in Windows Unicode (UTF-16); "wide" 16-bit wchar_t characters. The
@@ -45,3 +121,4 @@ int main(int argc, char* argv[]) {
     mongo::quickExit(mongo::mongod_main(argc, argv));
 }
 #endif
+#endif
It looks like this is indeed a problem, but we'd like to reproduce the issue in an integration test under conditions similar to the original workload.
The test:
~10k threads waiting in mongos trying to acquire a connection from the connection pool, and then all timing out concurrently.
Acceptance criteria:
Test against a system libgcc_s that has not been patched with the above fix (gcc<12; but beware that Ubuntu 22.04 has backported the fix into their gcc 12), and then test again using libgcc_s from our v5 toolchain (via LD_PRELOAD):
> time LD_PRELOAD=/opt/mongodbtoolchain/v4/lib64/libgcc_s.so.1 bazel-bin/install/bin/test
> time LD_PRELOAD=/opt/mongodbtoolchain/v5/lib64/libgcc_s.so.1 bazel-bin/install/bin/test
related to: SERVER-104316 Enable static link of the libgcc library for AL2 and AL2023

Needs Scheduling