Type: Bug
Resolution: Fixed
Priority: Major - P3
Affects Version/s: None
Component/s: None
Catalog and Routing
Fully Compatible
ALL
CAR Team 2025-05-26
Note: This is not an actual bug, but addressing this task will help avoid confusion during debugging by preventing misleading log messages.
Context
The Automerger currently holds a mutex while it fetches, from every shard, the names of the collections that have mergeable chunks. This is functionally correct, but it makes debugging harder: any unexpected delay or bug in the fetching logic (e.g., a failure in the ShardRegister) keeps the mutex held longer than expected. Other threads that try to acquire the same mutex then hang, which makes the problem appear to lie elsewhere (e.g., in the Balancer).
In particular, if the Automerger blocks on the ShardRegister call, it can block the Balancer main thread and, as a consequence, stall a server shutdown thread if one is running.
Goal
To improve clarity and reduce the chance of misdiagnosis during debugging, the fetching of collections from shards should be moved outside the mutex-protected section. Only the logic that absolutely requires synchronization should remain under the mutex. This will ensure that any issues in external services (like ShardRegister) do not lead to mutex contention or misleading failure signals.
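The restructuring described above can be sketched as follows. This is a minimal illustration, not the actual server code: `AutoMergerSketch`, `FetchFn`, `refresh`, `collectionsToCheck` and `mutexIsFree` are hypothetical names, and the per-shard fetch is modeled as an injected callback that may block for an arbitrary time.

```cpp
#include <functional>
#include <mutex>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the per-shard request; it may block or fail slowly.
using FetchFn = std::function<std::vector<std::string>(const std::string& shardId)>;

// Hypothetical class; it only illustrates where the lock is taken.
class AutoMergerSketch {
public:
    explicit AutoMergerSketch(FetchFn fetch) : _fetch(std::move(fetch)) {}

    void refresh(const std::vector<std::string>& shardIds) {
        // Step 1: slow, potentially blocking work, done WITHOUT holding _mutex.
        std::set<std::string> fetched;
        for (const auto& shardId : shardIds) {
            for (auto& nss : _fetch(shardId)) {
                fetched.insert(std::move(nss));
            }
        }

        // Step 2: short critical section that only publishes the result.
        std::lock_guard<std::mutex> lk(_mutex);
        _collectionsToCheck = std::move(fetched);
    }

    // Returns a copy of the current state, taken under the mutex.
    std::set<std::string> collectionsToCheck() const {
        std::lock_guard<std::mutex> lk(_mutex);
        return _collectionsToCheck;
    }

    // Diagnostic helper: true if the mutex can be acquired right now.
    // Must not be called by a thread that already holds _mutex.
    bool mutexIsFree() const {
        std::unique_lock<std::mutex> lk(_mutex, std::try_to_lock);
        return lk.owns_lock();
    }

private:
    FetchFn _fetch;
    mutable std::mutex _mutex;
    std::set<std::string> _collectionsToCheck;
};
```

With this shape, a fetch that hangs delays only the caller of `refresh`; threads that merely need the mutex are not held up by it. The trade-off is that the published state may be slightly stale relative to the shards by the time it is swapped in.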
Outcome
This change won’t alter current behavior but will:
- Reduce the risk of long-held mutexes due to unrelated issues.
- Prevent the Balancer and BalancerSecondary threads from being blocked unnecessarily due to unrelated bugs.
- Improve reliability and accuracy of failure diagnostics.