[CSHARP-1682] Race condition could result in GridFS indexes not being created Created: 01/Jun/16  Updated: 23/Sep/16  Resolved: 07/Jun/16

Status: Closed
Project: C# Driver
Component/s: GridFS
Affects Version/s: 2.2.4
Fix Version/s: 2.3

Type: Bug Priority: Critical - P2
Reporter: Alessandro Catale Assignee: Robert Stam
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

Original Description

The original description is unchanged below. See additional information below under Diagnosis.

We use the C# GridFS Package 2.2.3/2.2.4 against a Replica Set running MongoDB 3.2.6 and WiredTiger.

We're facing the following problem:

  • Opening a stream using OpenDownloadStreamByNameAsync with a CancellationToken results in the application hanging
  • Our GridFS collections have about 500K to xxxxK chunks

After comparing various installations of our application on different platforms (DEV, INT, STAGING, PROD),
we noticed that the index files_id_1_n_1 on the .chunks collection and filename_1_uploadDate_1 on the .files collection are missing.
We also refer to this article: https://www.mongodb.com/blog/post/building-mongodb-applications-binary-files-using-gridfs-part-2

After creating the "files_id_1_n_1" index on the .chunks collection manually, the platform worked as expected.

After analysing the C# driver code on GitHub, we understood that the indexes are created by the driver (and not the server).

Under which conditions is this index created or not created?

  • could it be that if the cancellation token is set to a small timeout (e.g. 5 seconds), the index is never created (because it takes longer to build and the build gets cancelled)?
  • is the index needed by the driver, and does the driver block internally if it is not there?
  • is it safe to create the index manually?

This issue is really urgent for us because it affects our caching strategy, and recreating those indexes in our production environment takes too much time. When we recreate that index, MongoDB takes all available tickets, which makes our application unusable! We can't even connect to the primary through the shell or MongoChef.

Best regards,
Alessandro Catale

Diagnosis

There is a race condition in the EnsureIndexes method: https://github.com/mongodb/mongo-csharp-driver/blob/0ad339c4c889076680245c3786bddf3ddd2654e3/src/MongoDB.Driver.GridFS/GridFSBucket.cs#L768

If two threads attempt to upload a file at the same time, it is possible that the first thread will acquire the __ensuredIndexes lock on line 770 and then see a non-empty collection on line 773, because the second thread might have uploaded a different file in the meantime.
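
To make the scenario concrete, here is a minimal repro sketch (the connection string, database name and file names are illustrative, and the outcome is timing-dependent): two concurrent first uploads go through one shared GridFSBucket, after which the expected GridFS indexes may be missing.

using System;
using System.Threading.Tasks;
using MongoDB.Bson;
using MongoDB.Driver;
using MongoDB.Driver.GridFS;

public static class GridFSIndexRaceRepro
{
    public static async Task RunAsync()
    {
        var client = new MongoClient("mongodb://localhost:27017");   // illustrative connection string
        var database = client.GetDatabase("gridfs_race_test");       // illustrative database name
        var bucket = new GridFSBucket(database);                      // one bucket shared by both uploads

        var payload = new byte[1024];

        // Two concurrent "first" uploads into an empty bucket: whichever upload wins makes the
        // files collection non-empty, which can cause the driver to skip creating the indexes.
        await Task.WhenAll(
            bucket.UploadFromBytesAsync("file-1", payload),
            bucket.UploadFromBytesAsync("file-2", payload));

        // List the indexes that ended up on fs.chunks; files_id_1_n_1 may be missing.
        var chunks = database.GetCollection<BsonDocument>("fs.chunks");
        using (var cursor = await chunks.Indexes.ListAsync())
        {
            foreach (var index in await cursor.ToListAsync())
            {
                Console.WriteLine(index["name"].AsString);
            }
        }
    }
}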



 Comments   
Comment by Githook User [ 07/Jun/16 ]

Author: rstam <robert@robertstam.org>

Message: CSHARP-1682: Fix race condition in GridFS EnsureIndexes.
Branch: master
https://github.com/mongodb/mongo-csharp-driver/commit/08e449d57699ff937169ee0d726d757fcadd8e21

Comment by Robert Stam [ 01/Jun/16 ]

I think another workaround would be for each thread to create its own instance of GridFSBucket. That would also eliminate the race condition because each bucket would only have one thread using it.

Creating an instance of GridFSBucket is fairly cheap. So this workaround would not be expensive. It would result in each thread probing the files collection to see if it is empty, but that's exactly what the other workaround I suggested does.
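
A minimal sketch of that workaround, assuming the usual MongoDB.Bson, MongoDB.Driver and MongoDB.Driver.GridFS namespaces (the method name is illustrative):

// Sketch of the per-thread workaround: construct a new GridFSBucket for each upload
// instead of sharing one instance across threads. Construction is cheap, and each
// bucket then probes the files collection independently before its first upload.
public static Task<ObjectId> UploadWithFreshBucketAsync(IMongoDatabase database, string filename, byte[] content)
{
    var bucket = new GridFSBucket(database);   // one bucket per operation/thread
    return bucket.UploadFromBytesAsync(filename, content);
}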

Comment by Alessandro Catale [ 01/Jun/16 ]

Thanks! We'll test it ASAP and give you feedback right away!

Comment by Robert Stam [ 01/Jun/16 ]

I don't have a workaround for re-creating the missing indexes.

They have to be created for GridFS to work. Your only options are to create them in the foreground or the background.
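
For reference, a background re-create from the driver would look roughly like this (collection names assume the default "fs" bucket; adjust them for a custom bucket name, and note that the build is still resource-intensive on large collections, as discussed in this thread):

// Sketch: re-create the two GridFS indexes as background builds (MongoDB 3.2 semantics),
// so the server does not block other operations for the full duration of the build.
public static void RecreateGridFSIndexesInBackground(IMongoDatabase database)
{
    var filesCollection = database.GetCollection<BsonDocument>("fs.files");      // adjust for a custom bucket name
    var chunksCollection = database.GetCollection<BsonDocument>("fs.chunks");

    filesCollection.Indexes.CreateOne(
        "{ filename : 1, uploadDate : 1 }",
        new CreateIndexOptions { Background = true });

    chunksCollection.Indexes.CreateOne(
        "{ files_id : 1, n : 1 }",
        new CreateIndexOptions { Background = true, Unique = true });
}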

Comment by Robert Stam [ 01/Jun/16 ]

Here's a workaround that would prevent the issue from occurring again with the current driver. Since the problem is that there is a race condition that could result in the indexes not being created, just call this helper method first before uploading a file:

public static void EnsureGridFSIndexes(IGridFSBucket bucket)
{
    // Probe the files collection: project only _id and fetch at most one document.
    var filesCollectionName = $"{bucket.Options.BucketName}.files";
    var filesCollection = bucket.Database.GetCollection<BsonDocument>(filesCollectionName);
    var firstFileId = filesCollection.Find(new BsonDocument()).Limit(1).Project<BsonDocument>(new BsonDocument("_id", 1)).FirstOrDefault();
    if (firstFileId == null)
    {
        // The files collection is empty, so create both GridFS indexes.
        var chunksCollectionName = $"{bucket.Options.BucketName}.chunks";
        var chunksCollection = bucket.Database.GetCollection<BsonDocument>(chunksCollectionName);

        filesCollection.Indexes.CreateOne("{ filename : 1, uploadDate : 1 }");
        chunksCollection.Indexes.CreateOne("{ files_id : 1, n : 1 }", new CreateIndexOptions { Unique = true });
    }
}

This will do a quick round trip to the server to probe whether the files collection is empty or not, and if it is, it will create the indexes. It doesn't matter if multiple threads happen to see an empty collection at the same time. The server knows how to handle multiple simultaneous create index requests for the same index and will only create the index once.
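
Illustrative call site for the helper above (the database variable is assumed to be an IMongoDatabase obtained elsewhere):

// Run once per bucket (for example at application startup), before any upload.
IGridFSBucket bucket = new GridFSBucket(database);
EnsureGridFSIndexes(bucket);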

Comment by Alessandro Catale [ 01/Jun/16 ]

OK, would you like to Skype tomorrow? I'll arrange for our developers to be available as well.

Would your workaround be to create the index on a secondary? That's not a solution for us, and it would still take too much time.

We also couldn't resync the node (we took it offline in order to be able to drop these cache collections) last time. We also opened a ticket: https://jira.mongodb.org/browse/CS-31114

Comment by Robert Stam [ 01/Jun/16 ]

We don't have a release planned yet for the near future.

I can help you come up with a workaround. That would solve your problem in the short term. And be much faster than waiting for the next release.

Comment by Alessandro Catale [ 01/Jun/16 ]

Thank you very much! I tested it and got the same results.

It's really urgent because recreating the chunks index nearly causes a crash in our production cache DB. It seems like MongoDB takes all available tickets just for the recreate! We have 3'500'000 items and we're just at the beginning...

When can we expect a solution for this issue?

Comment by Robert Stam [ 01/Jun/16 ]

I would like to add that there is no safe way to drop the GridFS collections while an application that uses them is running.

The issue is that no matter which collection you drop first, there could be consistency problems between the two. There is no server command to drop two collections atomically.

There may be an upload or download in progress when you drop the collections, and that will result in either errors or a corrupted upload.

Comment by Robert Stam [ 01/Jun/16 ]

Yes, that sounds plausible. If the GridFS collections are dropped while the application is running the application won't notice and won't re-create the indexes. And restarting the application when the collections are no longer empty also won't re-create the indexes.

And it also looks like the race condition in EnsureIndexes could result in the indexes not getting created in the first place. We'll fix that.

Comment by Alessandro Catale [ 01/Jun/16 ]

In a very first try, we can confirm this:

  • we stopped all the applications, dropped the collection, and started them again => the index was there.

So we thought about how this could have happened:

  • maybe someone dropped the GridFS collections
  • files got inserted by normal operations
  • GridFS got super slow (after some time)
  • a restart of the application then did not create the indexes (because the collections were maybe not empty / had too many documents)

Do you think this could be the case?

Comment by Robert Stam [ 01/Jun/16 ]

If you dropped the collections without restarting the applications the indexes would not have been re-created. Each process only checks once whether the indexes need to be created. It doesn't check repeatedly.

And yes, GridFS would work without the indexes but would be doing full collection scans, so as soon as the collections got big enough the GridFS operations would slow significantly.

Comment by Alessandro Catale [ 01/Jun/16 ]

We also dropped the collections and the indexes were not re-created. We also found out that it runs without indexes, but after a certain time it becomes unusable.

Comment by Alessandro Catale [ 01/Jun/16 ]

Thank you Robert for your super fast answer!

We set up a new environment with an empty replica set, the latest stable driver and DB, and noticed the same problem: the driver created the collections, but on uploading the first files the GridFS-specific indexes were not created.

We then retested the same procedure with a local installation (no replica set, no authentication, etc.) and the indexes got created. That smells a lot like your idea of the problem.

We use GridFS as caching storage for pictures & videos. The likelihood of hitting the DB with many concurrent requests is high.

Comment by Robert Stam [ 01/Jun/16 ]

The indexes needed by GridFS are created when the very first file is uploaded to GridFS and the collection is empty.

The GridFS code itself never deletes these indexes. But if these indexes are ever deleted, they will not magically come back. They will have to be manually recreated (which as you have noted is resource intensive if the GridFS collections are large).

Looking at the GridFS code, it looks like there might be a race condition if multiple threads simultaneously attempt to upload the very first GridFS file. But if you've already uploaded thousands of GridFS files, this probably did not affect you.
