[SERVER-9208] database_names blocks while index is being rebuilt on a shard
Created: 02/Apr/13  Updated: 15/Mar/15  Resolved: 02/Mar/15

Status: Closed
Project: Core Server
Component/s: Concurrency
Affects Version/s: 2.2.3
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: James Blackburn
Assignee: J Rassi
Resolution: Done
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Participants:

 Description   

import pymongo
c = pymongo.Connection('localhost:27119')
c.database_names()

c.database_names() doesn't appear to return while a large index is being rebuilt on one of the shards in a sharded cluster.

We had a server crash due to exceeding the file descriptor limit. We have now restarted it with an increased fd limit, and the index has taken 30 minutes to build to 20%. During this time, operations like database_names() seem to hang forever:

    c.database_names()
  File "/local/home/jblackburn/net/pymongo-2.4.2/pymongo/mongo_client.py", line 1068, in database_names
    self.admin.command("listDatabases")["databases"]]
  File "/local/home/jblackburn/net/pymongo-2.4.2/pymongo/database.py", line 390, in command
    result = self["$cmd"].find_one(command, **extra_opts)
  File "/local/home/jblackburn/net/pymongo-2.4.2/pymongo/collection.py", line 598, in find_one
    for result in self.find(spec_or_id, *args, **kwargs).limit(-1):
  File "/local/home/jblackburn/net/pymongo-2.4.2/pymongo/cursor.py", line 814, in next
    if len(self.__data) or self._refresh():
  File "/local/home/jblackburn/net/pymongo-2.4.2/pymongo/cursor.py", line 763, in _refresh
    self.__uuid_subtype))
  File "/local/home/jblackburn/net/pymongo-2.4.2/pymongo/cursor.py", line 700, in __send_message
    **kwargs)
  File "/local/home/jblackburn/net/pymongo-2.4.2/pymongo/mongo_client.py", line 915, in _send_message_with_response
    return self.__send_and_receive(message, sock_info)
  File "/local/home/jblackburn/net/pymongo-2.4.2/pymongo/mongo_client.py", line 893, in __send_and_receive
    return self.__receive_message_on_socket(1, request_id, sock_info)
  File "/local/home/jblackburn/net/pymongo-2.4.2/pymongo/mongo_client.py", line 878, in __receive_message_on_socket
    header = self.__receive_data_on_socket(16, sock_info)
  File "/local/home/jblackburn/net/pymongo-2.4.2/pymongo/mongo_client.py", line 866, in __receive_data_on_socket
    chunk = sock_info.sock.recv(length)
KeyboardInterrupt
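
The traceback ends in a blocking sock.recv(), apparently with no socket timeout configured, which is why the call waits until it is interrupted. As a minimal sketch of bounding that wait, the standard-library example below uses a local socket pair instead of a real mongos; recv_header is a hypothetical helper and not part of PyMongo. For real connections, PyMongo's MongoClient accepts a socketTimeoutMS option that serves a similar purpose.

```python
import socket


def recv_header(sock, length=16, timeout=None):
    """Read up to `length` bytes, or return None if nothing arrives in time.

    Hypothetical helper for illustration. timeout=None blocks indefinitely,
    which matches the hang seen in the traceback.
    """
    sock.settimeout(timeout)
    try:
        return sock.recv(length)
    except socket.timeout:
        return None


# A connected local socket pair stands in for the client and the server.
client, server = socket.socketpair()

# The "server" is busy and sends nothing, so the bounded read gives up.
silent = recv_header(client, timeout=0.05)

# Once the "server" replies, the same read returns data.
server.sendall(b"x" * 16)
reply = recv_header(client, timeout=0.05)

client.close()
server.close()
```

A timeout does not make listDatabases succeed while the index build is running; it only turns an indefinite hang into an error the client can handle.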



 Comments   
Comment by James Blackburn [ 15/Mar/15 ]

In this case a single non-privileged user can make a distributed shared cluster inaccessible for all other users.

In general you have no control over how users choose to build indexes. This happened to us - a big index build locked up all other connections to the cluster. The definition of denial of service.

Comment by Ramon Fernandez Marina [ 14/Mar/15 ]

While one could argue whether doing foreground index builds is a good default or not, if the documented blocking behavior (DOS is not a term that applies here at all) of foreground index builds is a concern then applications should build their indexes in the background.
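
As a rough sketch of what requesting a background build could look like from PyMongo: create_index accepts a background keyword that is passed through to the server. The helper build_index_in_background and the RecordingCollection stand-in are hypothetical names used only so the example runs without a server; with a real deployment you would pass an actual pymongo collection.

```python
def build_index_in_background(collection, keys, **options):
    """Create an index, requesting a background build unless told otherwise.

    Hypothetical helper; `collection` is expected to expose
    create_index(keys, **kwargs) the way a pymongo Collection does.
    """
    options.setdefault("background", True)
    return collection.create_index(keys, **options)


class RecordingCollection:
    """Hypothetical stand-in that records calls instead of contacting a server."""

    def __init__(self):
        self.calls = []

    def create_index(self, keys, **kwargs):
        self.calls.append((keys, kwargs))
        # Mimic the conventional "<field>_<direction>" index name.
        return "_".join("%s_%s" % (field, direction) for field, direction in keys)


coll = RecordingCollection()
index_name = build_index_in_background(coll, [("timestamp", 1)])
```

Defaulting the option in one helper keeps individual call sites from falling back to a foreground build by accident.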

Comment by James Blackburn [ 13/Mar/15 ]

Hi Ramon,
From your description of the behaviour I think this is a real issue. It means anyone building a non-trivial index on a mongo cluster will not only lock up their own database, but potentially block all clients that call listDatabases. The result is that a single app can DOS an entire shared cluster.

Comment by Ramon Fernandez Marina [ 02/Mar/15 ]

A call to "listDatabases" (like via PyMongo's database_names()) is expected to block until the index build completes: the "listDatabases" command briefly acquires a read lock on each database in the server in order to read basic stats. Foreground index builds acquire a database write lock on the database being written to, and don't release the lock until the index build completes.

However I wasn't able to reproduce any blocking in authentication in 2.6.8 or the upcoming 3.0.0.
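
The lock interaction described in this comment can be illustrated with a toy model. This is only a sketch under simplifying assumptions: ToyServer is a hypothetical class, each database gets a single exclusive lock (the real server distinguishes read and write locks), and a timeout is added so the example terminates instead of blocking.

```python
import threading


class ToyServer:
    """Hypothetical model: one exclusive lock per database."""

    def __init__(self, names):
        self.locks = {name: threading.Lock() for name in names}

    def list_databases(self, timeout):
        """Visit each database in turn, briefly taking its lock.

        Returns (names listed so far, name of the database we blocked on or None).
        """
        listed = []
        for name, lock in self.locks.items():
            if not lock.acquire(timeout=timeout):
                return listed, name
            try:
                listed.append(name)
            finally:
                lock.release()
        return listed, None


server = ToyServer(["app", "diags", "logs"])

# Simulate a foreground index build holding the lock on "diags".
server.locks["diags"].acquire()
listed_during, blocked_on = server.list_databases(timeout=0.05)

# After the build finishes and releases the lock, listing completes.
server.locks["diags"].release()
listed_after, blocked_after = server.list_databases(timeout=0.05)
```

In this model the listing stalls at the first locked database, so databases after it are never reached even though they are not locked themselves.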

Comment by Ramon Fernandez Marina [ 02/Mar/15 ]

jblackburn, we haven't heard back from you since Thomas' last question above, so I'm going to mark this ticket as resolved.

Also I couldn't see in MMS any system running 2.2.3 any more, but if this is still an issue for you please feel free to re-open the ticket.

Thanks,
Ramón.

Comment by Thomas Rueckstiess [ 24/Jul/14 ]

James, I noticed this is still marked unresolved. Is this still an issue for you? Have you had a chance to upgrade to a more recent version?

Regards,
Thomas

Comment by James Blackburn [ 05/Apr/13 ]

It looks like other databases aren't fully locked up – they're just really slow, even after the replica set reboot.

time c.diags_index.diags_collection.find_one()
CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s
Wall time: 19.66 s
Out[6]: 

Again, there's no CPU load, and mongostat won't connect.

Edit: Killing the mongods seems to have stopped the index build. Phew.

Comment by James Blackburn [ 05/Apr/13 ]

And it's still locked up after rebooting the replicas.

Is there any workaround at all? How can I avoid 5 days of downtime?

Comment by James Blackburn [ 05/Apr/13 ]

Actually, it's not locked up, it's just very, very, slow.

The mongod itself isn't under any load, afaics – there's no CPU usage. mongostat blocks on connect. There's nothing interesting in the logs. I'm going to kill the primary and hope for the best.

Comment by James Blackburn [ 05/Apr/13 ]

This has happened to another DB in a standalone replicaset.

Someone has been rebuilding an index (since Apr 3) - it's now 60% reindexed.

Now other databases in that Mongo have locked up. It's not clear how to recover from this because, AFAICS, I can't cancel the indexing job.

Comment by James Blackburn [ 02/Apr/13 ]

An accidental foreground (the default) index build can take a Mongo cluster out of service for the duration. At the current rate, the rebuild will take 2 hours.

Trying to cancel the operation through a mongos appears to have no effect:

mongos> db.killOp('rs1:6860')

Comment by James Blackburn [ 02/Apr/13 ]

The foreground index operation prevents authentication against the shard, which means we can't attempt to cancel the index operation.
