- Type: Improvement
- Resolution: Gone away
- Priority: Major - P3
- Affects Version/s: 3.7.3
- Component/s: Sharding
- Sharding EMEA
- Sharding EMEA 2023-05-29, Sharding EMEA 2023-06-12
Per conversation with Kal, I've been running into deadlocks while trying to replace our TLS transport. Specifically, during the ReplSetTest shutdown sequence the fsync lock is set, but shortly thereafter the Balancer attempts to start a round.
https://github.com/mongodb/mongo/blob/cdb8f2f7ad472416c579c6c13292d3fb361d94cb/src/mongo/db/s/balancer/balancer.cpp#L347
_checkOIDs throws an exception when it notices that the shards are offline (as they should be), and the exception handler then tries to log the action, which requires an (unavailable) write lock.
https://github.com/mongodb/mongo/blob/cdb8f2f7ad472416c579c6c13292d3fb361d94cb/src/mongo/db/s/balancer/balancer.cpp#L410
Meanwhile, the ReplSetTest shutdown sequence gets stuck waiting on the read lock it needs to fetch collStats, which cannot be granted because the Balancer's write lock is still pending.
https://github.com/mongodb/mongo/blob/cdb8f2f7ad472416c579c6c13292d3fb361d94cb/src/mongo/shell/replsettest.js#L1633
See also the following stack: https://gist.github.com/sgolemon/f957e2e2f38e14c0d3a0a661991c7a94
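To make the ordering described above easier to follow, here is a minimal Python sketch of a FIFO-fair reader-writer lock. It is a toy model, not the server's lock manager: the `FairLock` class, the `"R"`/`"W"` mode names, and the choice to model the fsync lock as a held shared lock are all assumptions made for illustration. It shows only how a pending write request can keep a later read request queued even though that read would be compatible with the current holder.

```python
from collections import deque


class FairLock:
    """Toy FIFO reader-writer lock (hypothetical, for illustration only).

    A new request waits if it conflicts with a current holder or if any
    earlier request is still queued.
    """

    def __init__(self):
        self.holders = []       # (owner, mode) pairs currently granted
        self.waiting = deque()  # FIFO queue of (owner, mode) not yet granted

    @staticmethod
    def _compatible(mode_a, mode_b):
        # Only two shared ("R") requests may hold the lock together.
        return mode_a == "R" and mode_b == "R"

    def request(self, owner, mode):
        """Return True if granted immediately, False if queued."""
        no_conflict = all(self._compatible(mode, held) for _, held in self.holders)
        if no_conflict and not self.waiting:
            self.holders.append((owner, mode))
            return True
        self.waiting.append((owner, mode))
        return False

    def release(self, owner):
        """Drop the owner's grant, then grant queued requests in FIFO order."""
        self.holders = [(o, m) for o, m in self.holders if o != owner]
        while self.waiting:
            _, mode = self.waiting[0]
            if not all(self._compatible(mode, held) for _, held in self.holders):
                break
            self.holders.append(self.waiting.popleft())


def simulate_shutdown():
    """Replay the three requests in the order the ticket describes."""
    lock = FairLock()
    events = {}
    # Assumption: the fsync lock is modeled as a held shared lock.
    events["fsyncLock"] = lock.request("fsyncLock", "R")
    # The Balancer's exception handler wants to write a log entry.
    events["balancer_log_write"] = lock.request("balancer", "W")
    # The shutdown sequence then asks for a read lock to fetch collStats.
    events["collStats"] = lock.request("collStats", "R")
    return lock, events
```

In this model, `simulate_shutdown()` grants only the `fsyncLock` request; the Balancer's write and the collStats read both stay queued, in that order. If the shutdown sequence releases the fsync lock only after collStats returns, neither queued request can ever be granted, which matches the hang described above.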