[SERVER-18504] "Lock for createDatabase is taken" triggered by mongorestore restoring multiple collections in parallel Created: 16/May/15  Updated: 19/Sep/15  Resolved: 19/May/15

Status: Closed
Project: Core Server
Component/s: Sharding, Tools
Affects Version/s: None
Fix Version/s: 3.1.4

Type: Bug
Priority: Major - P3
Reporter: Michael O'Brien
Assignee: Daniel Alabi
Resolution: Done
Votes: 0
Labels: UT
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to TOOLS-747 Gracefully handle failing to acquire ... Closed
Backwards Compatibility: Fully Compatible
Participants:

 Description   

max.hirschhorn@10gen.com located this in a patch build:

https://evergreen.mongodb.com/task/mongodb_mongo_master_linux_64_2518e60c495a700cbb44237425ecf064db970dbd_15_05_15_19_37_32_multiversion_linux_64

This is a consequence of mongorestore's parallel restore causing multiple collections to be created at once: each creation needs to get the database lock on the server side, but only one client can hold it at a time.
It is unclear whether this should be fixed in the tool itself, by serializing collection creation to avoid the contention, or in the server, by improving the lock handling (increase the timeout?) to prevent the error.
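
For illustration only, the tool-side option (serializing collection creation while the rest of the restore stays parallel) could be sketched as follows. This is a Python sketch under stated assumptions, not mongorestore's actual code: `create_collection`, `restore_collection`, `restore_all`, and `create_lock` are hypothetical names, and the creation step is a placeholder rather than a real server call.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins: none of these names are real mongorestore or driver APIs.
create_lock = threading.Lock()
events = []  # records the order of creation steps, for illustration


def create_collection(name):
    # Placeholder for the server round trip that needs the database lock.
    events.append(("create-start", name))
    events.append(("create-end", name))


def restore_collection(name):
    # Only the creation step is serialized; the data load would stay parallel.
    with create_lock:
        create_collection(name)
    return f"restored {name}"


def restore_all(names, workers=4):
    # Restore several collections concurrently, one worker per collection.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(restore_collection, names))
```

Because every worker takes `create_lock` before creating its collection, at most one creation request is in flight at a time, so the clients never compete for the server-side lock.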



 Comments   
Comment by Daniel Alabi [ 19/May/15 ]

A more permanent fix is for mongorestore to retry appropriately when a timeout results from distributed lock contention on database creation. A TOOLS ticket will be created to track this issue.
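
A minimal sketch of what such a client-side retry might look like, written in Python for brevity. It assumes the timeout surfaces as a distinguishable error; `LockTimeout` and `retry_on_lock_timeout` are hypothetical names, and the attempt count and delays are illustrative values, not ones taken from the server or the tools.

```python
import random
import time


class LockTimeout(Exception):
    """Hypothetical error standing in for the lock-timeout failure."""


def retry_on_lock_timeout(operation, attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call operation(), retrying with exponential backoff on LockTimeout."""
    for attempt in range(attempts):
        try:
            return operation()
        except LockTimeout:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the failure
            # Back off exponentially, with jitter so that parallel
            # clients do not all retry at the same moment.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))
```

The jitter matters here because the contention comes from several clients acting at the same time; retrying on a fixed schedule could simply reproduce the collision.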

Comment by Githook User [ 19/May/15 ]

Author: Daniel Alabi (alabid) <alabidan@gmail.com>

Message: SERVER-18504 Increase timeout for grabbing distributed lock when creating a database
Branch: master
https://github.com/mongodb/mongo/commit/82556682e04616d9ba31ea76b68c0e45d0dae671

Comment by Andy Schwerin [ 18/May/15 ]

While the tools should be prepared for this to fail, kaloian.manassiev pointed out that we could retry for more than 1 second to acquire the needed distributed lock. In practice, that would probably eliminate the problem. Assigning to alabid to turn up the timeout.

Comment by Kaloian Manassiev [ 18/May/15 ]

Regarding increasing the distributed lock timeout:

We could increase it by a little bit, but it is always possible that many threads creating databases or collections at the same time will end up causing contention, and the lock acquisition will time out. So it would be best if the tools expect this and retry appropriately.

Comment by Max Hirschhorn [ 16/May/15 ]

Yes, it is a sharding-only behavior. I chatted about it with kaloian.manassiev, and it happens when the second client times out acquiring the distributed lock to create the database.

Comment by Andy Schwerin [ 16/May/15 ]

This is a sharding-only behavior, right? Just clarifying.
