[SERVER-9208] database_names blocks while index is being rebuilt on a shard Created: 02/Apr/13 Updated: 15/Mar/15 Resolved: 02/Mar/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Concurrency |
| Affects Version/s: | 2.2.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | James Blackburn | Assignee: | J Rassi |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Operating System: | ALL |
| Participants: |
| Description |
c.database_names() doesn't appear to return while a large index is being rebuilt on one of the shards in a sharded cluster. We had a server crash due to exceeding the file descriptor limit, and have now restarted it with an increased fd limit. The index has taken 30 minutes to build to 20%. During this time, operations like listing database_names() seem to hang forever.
|
| Comments |
| Comment by James Blackburn [ 15/Mar/15 ] | ||||
|
In this case a single non-privileged user can make a distributed sharded cluster inaccessible for all other users. In general you have no control over how users choose to build indexes. This happened to us: a big index build locked up all other connections to the cluster. The definition of denial of service. | ||||
| Comment by Ramon Fernandez Marina [ 14/Mar/15 ] | ||||
|
While one could argue whether doing foreground index builds is a good default or not, if the documented blocking behavior (DOS is not a term that applies here at all) of foreground index builds is a concern then applications should build their indexes in the background. | ||||
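As a minimal sketch of the suggestion above, an application using PyMongo could request a background build by passing `background=True` to `create_index`. The helper name `build_index_in_background` and the `events`/`user_id` names in the usage comment are hypothetical, and the option is relevant to server versions of this ticket's era; newer servers may treat it differently.

```python
def build_index_in_background(collection, keys):
    """Ask the server to build the index in the background.

    `collection` is assumed to be a PyMongo Collection (or any object
    exposing a compatible create_index method). `keys` is a list of
    (field, direction) pairs, e.g. [("user_id", 1)].
    """
    # background=True is forwarded as an index option, so the build
    # does not hold the database write lock for its whole duration.
    return collection.create_index(keys, background=True)


# Hypothetical usage against a live deployment (not executed here):
#   from pymongo import MongoClient
#   client = MongoClient("mongodb://localhost:27017")
#   build_index_in_background(client["mydb"]["events"], [("user_id", 1)])
```

Background builds are generally slower than foreground builds, so this trades build time for availability of the database while the index is built.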
| Comment by James Blackburn [ 13/Mar/15 ] | ||||
|
Hi Ramon, | ||||
| Comment by Ramon Fernandez Marina [ 02/Mar/15 ] | ||||
|
A call to "listDatabases" (like via PyMongo's database_names()) is expected to block until the index build completes: the "listDatabases" command briefly acquires a read lock on each database in the server in order to read basic stats. Foreground index builds acquire a database write lock on the database being written to, and don't release the lock until the index build completes. However, I wasn't able to reproduce any blocking in authentication in 2.6.8 or the upcoming 3.0.0. | ||||
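The locking interaction described above can be illustrated with a toy model. This is not server code: the `ToyServer` class, the database names, and the timeout are all assumptions made for the demonstration, and a plain mutex stands in for the server's read/write lock. The real command simply waits; the timeout exists only so the sketch terminates.

```python
import threading


class ToyServer:
    """Toy model with one lock per database.

    A plain threading.Lock stands in for the per-database read/write
    lock; that simplification is enough to show the blocking.
    """

    def __init__(self, names):
        self.locks = {name: threading.Lock() for name in names}

    def list_databases(self, timeout):
        """Visit every database, briefly taking its lock.

        Returns the list of names, or None if any lock could not be
        acquired within `timeout` seconds.
        """
        visited = []
        for name, lock in self.locks.items():
            if not lock.acquire(timeout=timeout):
                return None  # stuck behind whoever holds this database
            try:
                visited.append(name)  # stands in for reading basic stats
            finally:
                lock.release()
        return visited


server = ToyServer(["app", "logs"])

# A foreground index build holds the lock on "logs" until it finishes.
server.locks["logs"].acquire()
during_build = server.list_databases(timeout=0.05)

# Once the build completes, the lock is released and listing succeeds.
server.locks["logs"].release()
after_build = server.list_databases(timeout=0.05)
```

In this model a single long-held lock on one database is enough to stall a command that has to visit every database, which matches the behaviour reported in the ticket.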
| Comment by Ramon Fernandez Marina [ 02/Mar/15 ] | ||||
|
jblackburn, we haven't heard back from you since Thomas' last question above, so I'm going to mark this ticket as resolved. Also I couldn't see in MMS any system running 2.2.3 any more, but if this is still an issue for you please feel free to re-open the ticket. Thanks, | ||||
| Comment by Thomas Rueckstiess [ 24/Jul/14 ] | ||||
|
James, I noticed this is still marked unresolved. Is this still an issue for you? Have you had a chance to upgrade to a more recent version? Regards, | ||||
| Comment by James Blackburn [ 05/Apr/13 ] | ||||
|
It looks like other databases aren't fully locked up – they're just really slow, even after the replica set reboot.
Again, there's no CPU load, and mongostat won't connect. Edit: Killing the mongods seems to have stopped the index build, phew. | ||||
| Comment by James Blackburn [ 05/Apr/13 ] | ||||
|
And it's still locked up after rebooting the replicas. Is there any workaround at all? How can I avoid 5 days of downtime? | ||||
| Comment by James Blackburn [ 05/Apr/13 ] | ||||
|
Actually, it's not locked up, it's just very, very, slow. The mongod itself isn't under any load, afaics – there's no CPU usage. mongostat blocks on connect. There's nothing interesting in the logs. I'm going to kill the primary and hope for the best. | ||||
| Comment by James Blackburn [ 05/Apr/13 ] | ||||
|
This has happened to another DB in a standalone replica set. Someone has been rebuilding an index (since Apr 3) and it's now 60% reindexed. Now, other databases in that Mongo have locked up. It's not clear how to recover from this as, AFAICS, I can't cancel the indexing job. | ||||
| Comment by James Blackburn [ 02/Apr/13 ] | ||||
|
An accidental foreground (the default) index build can take a Mongo cluster out of service for the duration. At the current rate, the rebuild will take 2 hours. Trying to cancel the operation through a mongos appears to have no effect:
| ||||
| Comment by James Blackburn [ 02/Apr/13 ] | ||||
|
The foreground index operation prevents authentication against the shard, which means we can't attempt to cancel the index operation. |