[SERVER-20058] mongos deadlock while replacing catalog manager Created: 20/Aug/15  Updated: 19/Sep/15  Resolved: 20/Aug/15

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.1.7
Fix Version/s: 3.1.7

Type: Bug
Priority: Major - P3
Reporter: Andy Schwerin
Assignee: Kaloian Manassiev
Resolution: Done
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Sharding 8 08/28/15
Participants:

 Description   

The relevant stack trace from the hang analyzer is below. The key detail is the reentrant call into the catalog manager: inside a catalog manager call, ShardConnection refreshes the sharding metadata through the forwarding catalog manager. If the inner operation detects that the catalog manager needs to be replaced, the thread still holds the lock taken by the outer operation, so it waits forever for a replacement that cannot proceed.

  mongo::ForwardingCatalogManager::waitForCatalogManagerChange() ()
  mongo::ForwardingCatalogManager::getAllShards(std::vector<mongo::ShardType, std::allocator<mongo::ShardType> >*) ()
  mongo::ShardRegistry::reload() ()
  mongo::ShardRegistry::getShard(std::string const&)
  mongo::(anonymous namespace)::checkShardVersion(mongo::OperationContext*, mongo::DBClientBase*, std::string const&, std::shared_ptr<mongo::ChunkManager>, bool, int) ()
  mongo::VersionManager::checkShardVersionCB(mongo::OperationContext*, mongo::ShardConnection*, bool, int) ()
  mongo::ShardConnection::_finishInit() ()
  mongo::ShardConnection::get() ()
  mongo::DBClientMultiCommand::sendAll() ()
  mongo::ConfigCoordinator::executeBatch(mongo::BatchedCommandRequest const&, mongo::BatchedCommandResponse*) ()
  mongo::CatalogManagerLegacy::writeConfigServerDirect(mongo::BatchedCommandRequest const&, mongo::BatchedCommandResponse*) ()
  mongo::ForwardingCatalogManager::writeConfigServerDirect(mongo::BatchedCommandRequest const&, mongo::BatchedCommandResponse*) ()
  mongo::CatalogManager::update(std::string const&, mongo::BSONObj const&, mongo::BSONObj const&, bool, bool, mongo::BatchedCommandResponse*) ()
  mongo::Balancer::_ping(mongo::OperationContext*, bool) ()
  mongo::Balancer::run() ()
  mongo::BackgroundJob::jobBody() ()
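
The shape of the hang in this trace can be reproduced in miniature. The sketch below is not MongoDB code: `ForwardingSketch`, `OperationGuard`, and `waitForDrain` are hypothetical names, and the assumption is that the catalog manager can only be swapped once no operation is in flight. A timeout stands in for the real wait, which never returns.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for the forwarding catalog manager; not MongoDB code.
class ForwardingSketch {
public:
    // Marks one catalog operation as in flight for the guard's lifetime.
    class OperationGuard {
    public:
        explicit OperationGuard(ForwardingSketch& owner) : _owner(owner) {
            std::lock_guard<std::mutex> lk(_owner._mutex);
            ++_owner._inFlight;
        }
        ~OperationGuard() {
            {
                std::lock_guard<std::mutex> lk(_owner._mutex);
                --_owner._inFlight;
            }
            _owner._drained.notify_all();
        }
        OperationGuard(const OperationGuard&) = delete;
        OperationGuard& operator=(const OperationGuard&) = delete;

    private:
        ForwardingSketch& _owner;
    };

    // Returns true if all in-flight operations finished within `timeout`,
    // meaning the swap could proceed. Returns false on timeout.
    bool waitForDrain(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lk(_mutex);
        return _drained.wait_for(lk, timeout, [this] { return _inFlight == 0; });
    }

private:
    std::mutex _mutex;
    std::condition_variable _drained;
    int _inFlight = 0;
};

// Inner wait issued while the outer operation is still registered:
// the caller is waiting on itself, so the wait can only time out.
bool reentrantWaitSucceeds() {
    ForwardingSketch catalog;
    ForwardingSketch::OperationGuard outer(catalog);
    return catalog.waitForDrain(std::chrono::milliseconds(50));
}

// The same wait after the outer operation has released its registration.
bool waitAfterReleaseSucceeds() {
    ForwardingSketch catalog;
    { ForwardingSketch::OperationGuard outer(catalog); }
    return catalog.waitForDrain(std::chrono::milliseconds(50));
}
```

In this model `reentrantWaitSucceeds()` returns false and `waitAfterReleaseSucceeds()` returns true, which mirrors the report: the inner call cannot make progress until the outer call lets go.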



 Comments   
Comment by Githook User [ 20/Aug/15 ]

Author: Kaloian Manassiev (kaloianm) <kaloian.manassiev@mongodb.com>

Message: SERVER-20058 ConfigCoordinator should not do SetShardVersion exchange

Adds an argument to DBClientMultiCommand so it doesn't do SetShardVersion
exchange when talking to config servers. This causes deadlock with catalog
manager change on upgrade from Sync to CSRS.
Branch: master
https://github.com/mongodb/mongo/commit/b5180dfcadbc39dec6773a0d4a86f89508c79aa1

Comment by Kaloian Manassiev [ 20/Aug/15 ]

This is because of the fix for SERVER-19395 (this Git commit).

DBClientMultiCommand should go through ShardConnection at least for the shards, because otherwise there are cases where they are not sharding aware. However, for the config server it should be fine to create DBClientConnections.
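
A minimal sketch of that rule, assuming the caller already knows whether the target is a config server. `ConnectionKind` and `chooseConnection` are hypothetical names for illustration and are not part of the actual patch; the real classes are ShardConnection and DBClientConnection.

```cpp
// Hypothetical labels for the two real connection classes.
enum class ConnectionKind {
    kVersionedShardConnection,  // stands for ShardConnection
    kPlainConnection            // stands for DBClientConnection
};

// Hypothetical helper: shards get a sharding-aware connection,
// config servers get a plain one with no shard version exchange.
ConnectionKind chooseConnection(bool targetIsConfigServer) {
    return targetIsConfigServer ? ConnectionKind::kPlainConnection
                                : ConnectionKind::kVersionedShardConnection;
}
```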

Comment by Andy Schwerin [ 20/Aug/15 ]

Indeed, the error appears to be that DBClientMultiCommand::sendAll() is creating ShardConnections for connections of type MASTER when dispatching commands to the three config servers. VersionManager::isVersionableCB sees that the connections are of type MASTER, and decides this must mean they're to standalone shards, rather than a config server, and so treats the connections as versionable.
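
The misclassification can be illustrated with a simplified model. This is not the actual VersionManager::isVersionableCB code: `ConnType`, `ConnInfo`, and both predicates are hypothetical, and the `isConfigServer` field assumes the caller can supply the target's role.

```cpp
// Simplified, hypothetical model of a connection; not the MongoDB types.
enum class ConnType { kMaster, kOther };

struct ConnInfo {
    ConnType type;
    bool isConfigServer;  // assumption: the caller knows the target's role
};

// Type-only check as described in this comment:
// MASTER is taken to mean a standalone shard.
bool versionableByTypeOnly(const ConnInfo& c) {
    return c.type == ConnType::kMaster;
}

// Role-aware variant: a config server is never versioned, whatever its type.
bool versionableWithRole(const ConnInfo& c) {
    return !c.isConfigServer && versionableByTypeOnly(c);
}
```

With a MASTER connection to a config server, the type-only check answers true while the role-aware check answers false, which is the distinction the connection type alone cannot express.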

I suspect the error is that CatalogManagerLegacy should not be using DBClientMultiCommand to execute config server writes. kaloian.manassiev, are there other reasonable options?

Comment by Andy Schwerin [ 20/Aug/15 ]

It's strange that we're using a ShardConnection (that checks shard version information) to do operations on the config server.
