[CDRIVER-4296] mongoc_gridfs_file_set_id() does not work when the file has many chunks. Created: 16/Feb/22  Updated: 27/Oct/23  Resolved: 05/Apr/22

Status: Closed
Project: C Driver
Component/s: GridFS
Affects Version/s: 1.21.0
Fix Version/s: None

Type: Bug Priority: Minor - P4
Reporter: kevin wanglong_ Assignee: Jesse Williamson (Inactive)
Resolution: Gone away Votes: 0
Labels: needs-first-responder
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

Summary

mongoc_gridfs_file_set_id() does not work when the file has many chunks.

Environment

version: mongo-c-driver-1.21.0

host: debian 11 64-bit x86

gcc: gcc (Debian 10.2.1-6) 10.2.1 20210110

 

How to Reproduce

I used the example-gridfs tool to upload a big file, but the fs.files._id is not as expected.

Then I uploaded a small file, and its fs.files._id is as expected.

test> db.fs.files.find({})
[
  { _id: ObjectId("620cd61f9f63db1b8d012941"), chunkSize: 261120, filename: 'ss', length: Long("429273416"), uploadDate: ISODate("2022-02-16T10:46:55.642Z") },
  { _id: 1, chunkSize: 261120, filename: 'aa', length: Long("170027"), uploadDate: ISODate("2022-02-16T10:58:44.562Z") }
]

 



 Comments   
Comment by PM Bot [ 05/Apr/22 ]

There hasn't been any recent activity on this ticket, so we're resolving it. Thanks for reaching out! Please feel free to comment on this if you're able to provide more information.

Comment by Jesse Williamson (Inactive) [ 21/Mar/22 ]

To highlight what I think is the most straightforward workaround: use auto-id generation. Rather than setting the id manually as in the example program, use the one assigned by GridFS.

See-also discussion here:
https://jira.mongodb.org/browse/JAVA-1983

Comment by Jesse Williamson (Inactive) [ 19/Mar/22 ]

This is caused by behavior in the deprecated mongoc_gridfs_t API, which does not conform to the current GridFS API specification. It may be sufficient to update the C Driver's GridFS example program and add a note to the documentation.

For cause, reproduction, and discussion, see above.

Comment by Jesse Williamson (Inactive) [ 19/Mar/22 ]

Thank you again for reporting this issue, and for your patience while it was investigated.

Unfortunately, I see no way to check the underlying status of the is_dirty flag through the mongoc_gridfs_t API, and without that, checking whether the stream has been written appears to be possible only indirectly. Another idea is to always generate a UUID on the client side and avoid the stream call (see discussion below), but other than working around the issue I don't see a direct way of resolving this via the mongoc_gridfs_t API.

Instead, you're encouraged to use the newer mongoc_gridfs_bucket_t GridFS API, and to upgrade from mongoc_gridfs_t if possible. The mongoc_gridfs_t implementation used by the example program does not comply with the GridFS specification and has been deprecated.

You can read further information about the C Driver's support for GridFS, and possible strategies for migrating from the old and not recommended mongoc_gridfs_t implementation (used by the example program) to the newer mongoc_gridfs_bucket_t implementation, here:
http://mongoc.org/libmongoc/current/gridfs.html

You can learn more about the GridFS API specification here:
https://github.com/mongodb/specifications/blob/master/source/gridfs/gridfs-spec.rst#api

The deprecated mongoc_gridfs_t API, for reference:
http://mongoc.org/libmongoc/current/mongoc_gridfs_t.html

I've included a discussion and details on why you are seeing this behavior below.

I hope that is helpful! Thank you again for your effort in bringing this issue to our attention!

-Jesse

Discussion:

Under the right circumstances (such as being asked to seek to 0 in a large file, e.g. 2GB at the standard chunk size), an unsaved "new" file created by mongoc_gridfs_create_file_from_stream() can wind up with its "is_dirty" flag un-set. This means that operations like changing its id aren't allowed on it, even though the user has not yet saved it directly, because the file has already been written with an auto-generated id on account of a hidden mongoc_gridfs_file_seek() call.

This is inconsistent with the behavior of the same calls on smaller, non-chunked files, which will still have "is_dirty" set after the mongoc_gridfs_create_file_from_stream() and/or mongoc_gridfs_file_seek() calls.
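
The difference between the two cases can be sketched with a small stand-alone model. This is not the driver's code: model_file_t and the model_* functions are hypothetical names, and the only rule modelled is the one described in this comment, namely that a seek landing on a different chunk flushes the current page, which writes the file and clears the dirty flag.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, heavily simplified stand-in for mongoc_gridfs_file_t. */
typedef struct {
   int64_t pos;        /* current byte offset in the file */
   int32_t chunk_size; /* bytes per chunk */
   bool is_dirty;      /* true while the file has not been written out */
   int64_t id;         /* -1 stands for "auto-generated id" */
} model_file_t;

/* Seeking to a different chunk flushes the current page; in this model the
 * flush also saves the file, so the dirty flag is cleared. */
static void
model_seek_set (model_file_t *f, int64_t offset)
{
   if (offset / f->chunk_size != f->pos / f->chunk_size) {
      f->is_dirty = false;
   }
   f->pos = offset;
}

/* Models reading the whole stream and then seeking back to offset 0. */
static model_file_t
model_create_from_stream (int64_t length, int32_t chunk_size)
{
   model_file_t f = {.pos = length, .chunk_size = chunk_size, .is_dirty = true, .id = -1};
   model_seek_set (&f, 0);
   return f;
}

/* Models the "Cannot set file id after saving file." check. */
static bool
model_set_id (model_file_t *f, int64_t id)
{
   if (!f->is_dirty) {
      return false;
   }
   f->id = id;
   return true;
}
```

In this model, a 1000-byte file with a chunk size of 261120 ends on chunk 0, so the seek back to 0 stays on the same chunk and model_set_id() succeeds; a 2000000000-byte file ends on a later chunk, the seek clears the flag, and model_set_id() returns false.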

Neither mongoc_gridfs_create_file_from_stream()'s nor mongoc_gridfs_file_seek()'s documentation indicates this side effect, and in both cases the call does not produce an error.

mongoc_gridfs_create_file_from_stream()'s documentation says it returns a "newly allocated" file, and there is a note that it will read the stream until EOF; mongoc_gridfs_file_seek()'s documentation does not mention affecting the new-ness state of the file (making this behavior a bit surprising).

  • To reproduce:

In our example program, example-gridfs, this can be observed with the method suggested by the submitter:

fallocate -l 1KB foo-1kb
fallocate -l 2GB bar-2gb

./example-gridfs write foo ./foo-1kb

...and then:

./example-gridfs write bar ./bar-2gb
Cannot set file id after saving file.

Notice that the first file uses the value set by the example program, but the large file uses an auto-generated id:
db.fs.files.find({})
[
  { _id: 1, chunkSize: 261120, filename: 'foo', length: Long("1000"), uploadDate: ISODate("2022-03-19T00:04:17.945Z") },
  { _id: ObjectId("62351e38952337de1d0f8be1"), chunkSize: 261120, filename: 'bar', length: Long("2000000000"), uploadDate: ISODate("2022-03-19T00:05:12.207Z") }
]

  • Analysis:

This is because mongoc_gridfs_create_file_from_stream() (called at example-gridfs.c:116) has called mongoc_gridfs_file_seek() (mongoc-gridfs.c:329), which in turn has written the file (mongoc-gridfs-file.c:971) via _mongoc_gridfs_file_flush_page() (mongoc-gridfs-file.c:674).

Our example program assumes that the file is still new (and therefore, indirectly, that "is_dirty" is still set). When mongoc_gridfs_file_set_id() is called at example-gridfs.c:130, the function fails because the stream has already been written to a file, as described above.

Note the "mongofiles" Go tool generates ids on its own in every case rather than allowing autogeneration, so it doesn't see this issue.

  • Recommendations:

There are two approaches we might consider. The first is to see if it's possible to avoid the write in both functions, or at least in mongoc_gridfs_create_file_from_stream(). This effort might be disproportionate unless this is frequently encountered. One workaround is to do what the "mongofiles" tool does and always generate UUIDs on the client side manually; another is to update the written file chunks when a change of id is needed, which is a potentially expensive operation.

In any case, it is probably worthwhile to be sure this behavior is documented: a comment in the example and an update to the deprecated API documentation would be helpful.

Comment by Jesse Williamson (Inactive) [ 17/Feb/22 ]

Hello, thank you for reporting this issue! We will make time to investigate and compare it with CDRIVER-1976 soon.

Comment by kevin wanglong_ [ 16/Feb/22 ]

I had the same problem as https://jira.mongodb.org/browse/CDRIVER-1976.

It is normal when the file has only one chunk.
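
For reference, the number of chunks a file occupies follows from its length and chunkSize by a ceiling division. The helper name gridfs_chunk_count below is made up for this sketch and is not part of the driver.

```c
#include <stdint.h>

/* Hypothetical helper: number of chunks needed to store `length` bytes
 * when each chunk holds at most `chunk_size` bytes (ceiling division). */
static int64_t
gridfs_chunk_count (int64_t length, int32_t chunk_size)
{
   return (length + chunk_size - 1) / chunk_size;
}
```

With chunkSize 261120, the 170027-byte file 'aa' fits in 1 chunk, while the 429273416-byte file 'ss' needs 1644 chunks.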
