[SERVER-33888] Enabling fsyncLock on the config server primary may cause operations to block behind the Balancer thread
Created: 14/Mar/18  Updated: 27/Oct/23  Resolved: 07/Jun/23

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.7.3
Fix Version/s: None

Type: Improvement
Priority: Major - P3
Reporter: Sara Golemon
Assignee: Marcos José Grillo Ramirez
Resolution: Gone away
Votes: 0
Labels: sharding-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Sharding EMEA
Sprint: Sharding EMEA 2023-05-29, Sharding EMEA 2023-06-12
Participants:

 Description   

Per conversation with Kal, I've been running into deadlocks while trying to replace our TLS transport. Specifically, during the ReplSetTest shutdown sequence the fsync lock is set, but shortly thereafter the Balancer attempts to start a round.

https://github.com/mongodb/mongo/blob/cdb8f2f7ad472416c579c6c13292d3fb361d94cb/src/mongo/db/s/balancer/balancer.cpp#L347
_checkOIDs throws an exception when it notices that the shards are offline (as they should be), and the exception handler then tries to log the action, which requires a write lock that is unavailable while the fsync lock is held.
https://github.com/mongodb/mongo/blob/cdb8f2f7ad472416c579c6c13292d3fb361d94cb/src/mongo/db/s/balancer/balancer.cpp#L410

Meanwhile, the ReplSetTest shutdown sequence attempts to fetch collStats, which needs a read lock, but the request gets stuck because it is queued behind the Balancer's still-pending write lock.
https://github.com/mongodb/mongo/blob/cdb8f2f7ad472416c579c6c13292d3fb361d94cb/src/mongo/shell/replsettest.js#L1633
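
To make the blocking chain above easier to follow, here is a minimal toy model of it in Python. It is not MongoDB's lock manager. It assumes the fsync lock behaves like a held shared ("S") lock, that the Balancer's logging write needs an exclusive ("X") lock, and that requests are granted strictly in arrival order. The class `FifoLock` and the owner names are hypothetical and exist only for illustration.

```python
from collections import deque


class FifoLock:
    """Toy lock with shared (S) and exclusive (X) modes, granted in FIFO order.

    Hypothetical model for illustration only; not MongoDB's lock manager.
    """

    def __init__(self):
        self.granted = []       # list of (owner, mode) currently holding the lock
        self.pending = deque()  # queue of (owner, mode) waiting for the lock

    @staticmethod
    def _compatible(mode_a, mode_b):
        # Only two shared requests can hold the lock at the same time.
        return mode_a == "S" and mode_b == "S"

    def _grantable(self, mode):
        return all(self._compatible(mode, held) for _, held in self.granted)

    def request(self, owner, mode):
        """Return True if granted immediately, False if queued."""
        # A new request may not jump ahead of anything already waiting.
        if not self.pending and self._grantable(mode):
            self.granted.append((owner, mode))
            return True
        self.pending.append((owner, mode))
        return False

    def release(self, owner):
        self.granted = [(o, m) for o, m in self.granted if o != owner]
        # Grant from the head of the queue for as long as modes are compatible.
        while self.pending and self._grantable(self.pending[0][1]):
            self.granted.append(self.pending.popleft())


# Scenario from the description, under the assumptions stated above.
lock = FifoLock()
lock.request("fsyncLock", "S")   # fsync lock is set first and granted
lock.request("balancer", "X")    # Balancer's write queues behind the fsync lock
lock.request("collStats", "S")   # read queues behind the pending write
print([owner for owner, _ in lock.pending])
```

In this model the collStats read would be compatible with the fsync lock on its own, but it waits because the Balancer's exclusive request arrived first. Releasing the `fsyncLock` holder lets the queue drain in order, which suggests blocking rather than a true deadlock.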

See also the following stack: https://gist.github.com/sgolemon/f957e2e2f38e14c0d3a0a661991c7a94



 Comments   
Comment by Garaudy Etienne [ 31/May/23 ]

nandini.bhartiya@mongodb.com jack.mulrow@mongodb.com isn't this what we're going to do for usable backups for sharding on community? lol cc ratika.gandhi@mongodb.com

Comment by Steve Briskin (Inactive) [ 21/Mar/18 ]

alyson.cabral, Backup doesn't use fsyncLock so no impact. Thanks for checking!

Comment by Alyson Cabral (Inactive) [ 21/Mar/18 ]

steve.briskin Is this important for backup?

Comment by Kaloian Manassiev [ 16/Mar/18 ]

Marking it 3.7 Desired, because it is not a deadlock.
