[CDRIVER-1983] BSON_TYPE_ARRAY - what is the right way to determine that the root document is an array? Created: 06/Jan/17 Updated: 27/Oct/23 Resolved: 17/Jan/17 |
|
| Status: | Closed |
| Project: | C Driver |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Task | Priority: | Minor - P4 |
| Reporter: | Arseny Vakhrushev | Assignee: | Unassigned |
| Resolution: | Works as Designed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Description |
|
Hi everyone, There are some API calls in mongoc that require BSON arrays as their input; aggregation pipelines are a good example. But that's hardly the main cause for the following important question. When marshalling high-level language types to and from BSON, one encounters the problem of how to reliably determine whether a root document is an array. There is no such issue for nested arrays because one can rely on the element's type - BSON_TYPE_DOCUMENT vs BSON_TYPE_ARRAY. More specifically, if I want to build a root BSON array, I need to append elements with string keys "0", "1", "2", etc., effectively converting them from the original indexes with bson_uint32_to_string(). The output BSON will be like this:
If I fail to do this, the underlying bson_append_array() routine in mongoc will complain on stderr about improperly formed array keys. But what happens if I want to restore the original array in a high-level language from such a document? Obviously, I need to somehow determine that this document is indeed an array, because it would otherwise be incorrect to marshal it back as a document (associative array) with string keys "0", "1", "2", etc. Should I first parse all the keys, converting them from strings to integers and checking that they are in ascending order, and then traverse the document again? If yes, there is no bson_string_to_uint32() call for quick backward conversion, and I am left to use slow GLIBC calls like strtol() to do that. If no, I can just check the first key and see if it is "0", but that doesn't seem reliable to me because one can forge a document like
So, what is the official canonical way to marshal root BSON arrays back to a high-level language? Of course, this leaves one with the lingering question of why BSON is designed in such a way that it needs string keys "0", "1", "2",... for arrays at all instead of having a proper way to format arrays without them. This can be done fairly easily while retaining backward compatibility. Thanks! |
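The ambiguity described above can be seen at the byte level. What follows is a minimal sketch in plain C, not libbson code: `encode_int32_list` and `put_le32` are hypothetical names, and `snprintf` stands in for bson_uint32_to_string(). It writes a list of int32 values as BSON elements keyed "0", "1", ... following the spec's layout (int32 total length, elements, trailing 0x00). The resulting top-level bytes carry no type tag of their own; the array-versus-document type byte exists only in a parent element, which a root document does not have.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Write a 32-bit value in little-endian byte order, as BSON requires. */
static void put_le32(uint8_t *p, uint32_t v)
{
    for (int i = 0; i < 4; i++) {
        p[i] = (uint8_t)(v >> (8 * i));
    }
}

/* Hypothetical helper (not part of libbson): encode vals[0..n-1] as int32
 * elements keyed "0", "1", ... Returns the number of bytes written, or 0
 * if cap is too small. */
size_t encode_int32_list(const int32_t *vals, size_t n, uint8_t *buf, size_t cap)
{
    size_t off = 4;   /* the first four bytes hold the total length */
    char key[11];     /* up to 10 decimal digits plus NUL */

    assert(buf != NULL && (vals != NULL || n == 0));
    if (cap < 5) {
        return 0;
    }
    for (size_t i = 0; i < n; i++) {
        /* snprintf stands in for bson_uint32_to_string() here */
        int klen = snprintf(key, sizeof key, "%u", (unsigned)i);
        size_t need = 1 + (size_t)klen + 1 + 4; /* type, key+NUL, value */
        if (cap - off < need + 1) {
            return 0; /* keep one byte for the document terminator */
        }
        buf[off++] = 0x10;                        /* element type: int32 */
        memcpy(buf + off, key, (size_t)klen + 1); /* key as a C string */
        off += (size_t)klen + 1;
        put_le32(buf + off, (uint32_t)vals[i]);
        off += 4;
    }
    buf[off++] = 0x00;            /* document terminator */
    put_le32(buf, (uint32_t)off); /* total length, including itself */
    return off;
}
```

Whether these bytes came from a high-level array or from a map that happens to use the keys "0" and "1" cannot be recovered from the encoding alone, so any root-level check has to inspect the keys.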
| Comments |
| Comment by Arseny Vakhrushev [ 19/Jan/17 ] | |||||||||||||||||||||||||||||
|
Single-typed arrays are a good development indeed! And I strongly vote for it.
That is indeed a very rare problem that I wouldn't vote for solving on its own. That's why I created a question instead of an issue in the first place. But if you are really following me, you can see that these things ...
... are present independently from the initial issue. They are always side-by-side with arrays in libbson, and they aren't looking great. Moreover, they can't and won't be fixed by single-typed arrays alone except, of course, if you deprecate BSON_TYPE_ARRAY altogether making it impossible to have mixed-type arrays. This will make BSON incompatible with JSON though. | |||||||||||||||||||||||||||||
| Comment by A. Jesse Jiryu Davis [ 19/Jan/17 ] | |||||||||||||||||||||||||||||
|
Arseny, thanks for your five cents. That's an interesting proposal. However, it solves a very rare problem. The case where a MongoDB client has a document and doesn't know whether it's an array or not arises very rarely - essentially, only in mongoc_collection_aggregate. We have a better and more radical proposal that we plan to implement within the next few server releases: fixed-type arrays, SERVER-9380. If we're going to make the effort to deeply change the structure of BSON, then fixed-type arrays are a better investment. Please watch and vote for that ticket. | |||||||||||||||||||||||||||||
| Comment by Arseny Vakhrushev [ 18/Jan/17 ] | |||||||||||||||||||||||||||||
|
Yes, I am aware of bson_uint32_to_string()'s pregenerated values. That's what I meant by saying that the opposite operation should be at least as fast to be worth the effort put into bson_uint32_to_string(). I must admit that the design of arrays in BSON doesn't look nice and clean to me. May I ask why the hassle of converting integer keys to strings, storing them (they occupy space), converting them back to check, etc. exists? Wouldn't it be much cleaner to specify arrays without keys altogether? Let's assume we could extend the document specification, so:
becomes:
and we demand that either the e_list part or the (value*) part should be empty. In other words, one must not mix associative and sequential parts together. With this: Yes, this might imply a new set of bson_append_-like functions or extending the existing ones, but the API will look much cleaner and the above benefits seem to be attainable for a small effort. Just my five cents... | |||||||||||||||||||||||||||||
| Comment by A. Jesse Jiryu Davis [ 18/Jan/17 ] | |||||||||||||||||||||||||||||
|
That's correct. Consider either comparing the first key to "0" or, if you need to check all N keys, then generate each key "i" from 0 to N-1 with bson_uint32_to_string and compare it to the i'th key of the document. Note that bson_uint32_to_string has pregenerated the first 1000 keys so it's effectively only the strncmp of a very short string that you're paying for. | |||||||||||||||||||||||||||||
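The full check suggested here (generate the expected key for each position and compare) might be sketched as follows. This is plain C under stated assumptions rather than libbson code: `keys_are_array_like` is a hypothetical name, the keys are passed in as an array of C strings in document order (with libbson they would come from iterating the document), and `snprintf` stands in for bson_uint32_to_string().

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of libbson): returns true when
 * keys[0..n-1] are exactly "0", "1", ..., "n-1", in that order. */
bool keys_are_array_like(const char *const *keys, size_t n)
{
    char expected[11]; /* up to 10 decimal digits plus NUL */

    assert(keys != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        /* snprintf stands in for bson_uint32_to_string() here */
        snprintf(expected, sizeof expected, "%" PRIu32, (uint32_t)i);
        if (keys[i] == NULL || strcmp(keys[i], expected) != 0) {
            return false; /* gap, wrong order, or non-index key */
        }
    }
    return true;
}
```

Comparing against generated keys avoids any string-to-integer parsing and rejects a document whose first key is "0" but whose later keys are not consecutive indexes. Note that this sketch returns true for an empty key list; an empty root document is ambiguous either way, so the caller has to decide how to treat it.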
| Comment by Arseny Vakhrushev [ 18/Jan/17 ] | |||||||||||||||||||||||||||||
|
Now we're cooking. To iterate all the top-level keys quickly and check their integer values, there clearly should be some kind of the opposite to bson_uint32_to_string(), namely bson_string_to_uint32():
Otherwise, all the effort put into bson_uint32_to_string() to speed up conversion of integer keys goes to waste on the way back. So, the only viable option to determine "array-likeness" right now is to check if the first key is "0". Is that correct? | |||||||||||||||||||||||||||||
| Comment by A. Jesse Jiryu Davis [ 17/Jan/17 ] | |||||||||||||||||||||||||||||
|
You can determine if the BSON is array-like by iterating all the top-level | |||||||||||||||||||||||||||||
| Comment by Arseny Vakhrushev [ 17/Jan/17 ] | |||||||||||||||||||||||||||||
|
Thanks for your attention, Jesse!
This is simply not true. To prove that, one can run:
And the result is:
So clearly, mongoc does in fact internally distinguish between root arrays and documents. Hence my initial question. If I am writing a binding to mongoc for a high-level language, this leads to the following problem. When a high-level array is converted to a type that wraps around bson_t, which is then fed to a mongoc_collection_aggregate() wrapper for example, there should be a way to check that the argument is a root BSON array, to let the user know. Otherwise, I'll get the above error message on stderr (which should in fact be propagated upwards, btw). There are two ways to achieve that now: A similar problem arises when I try to do transitions like: High-level Array type ---> Wrapper around bson_t ---> High-level Array type. I should be able to restore the initial high-level array from a wrapper type whenever I get a fresh copy of bson_t. To do that, I need to know how to determine if bson_t is an array. Hence my initial question. To elaborate further, consider that I am mapping two methods of a mongoc_gridfs_file_t: To sum things up, the API complains about root documents not being "properly formatted", whereas there's no way to determine whether documents are indeed "properly formatted" before providing them to other methods. Hope this will be useful.... | |||||||||||||||||||||||||||||
| Comment by A. Jesse Jiryu Davis [ 17/Jan/17 ] | |||||||||||||||||||||||||||||
|
We haven't heard back in a while, let us know if you have more questions. | |||||||||||||||||||||||||||||
| Comment by Hannes Magnusson [ 06/Jan/17 ] | |||||||||||||||||||||||||||||
|
I'm not entirely following, so excuse me if I'm being daft, but I think you are asking how to determine if the "container BSON" or "root BSON" is an array or document? There is no such thing as a "root BSON array". The "root BSON" is defined as a document per the spec, so the canonical container is always a document. Does that make sense? | |||||||||||||||||||||||||||||
| Comment by J Rassi [ 06/Jan/17 ] | |||||||||||||||||||||||||||||
|
Moved to CDRIVER. | |||||||||||||||||||||||||||||
| Comment by Arseny Vakhrushev [ 06/Jan/17 ] | |||||||||||||||||||||||||||||
|
Oh, sorry. This mostly pertains to CDRIVER and libbson, not CXX. Thanks! |