[CDRIVER-1983] BSON_TYPE_ARRAY - what is the right way to determine that the root document is an array? Created: 06/Jan/17  Updated: 27/Oct/23  Resolved: 17/Jan/17

Status: Closed
Project: C Driver
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Minor - P4
Reporter: Arseny Vakhrushev Assignee: Unassigned
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

Hi everyone,

There are some API calls in mongoc that require BSON arrays as their input. Aggregation pipelines are a good example, but they are hardly the main reason for the following important question.

When marshalling high-level language types to and from BSON, one encounters the problem of how to reliably determine whether a root document is an array. There is no such issue for nested arrays because one can rely on the element's type: BSON_TYPE_DOCUMENT vs BSON_TYPE_ARRAY.
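To illustrate why nested values are easy to classify while the root is not, here is a minimal sketch in plain C (no libbson) that reads the type tag of the first element from raw BSON bytes. It assumes the layout given in the BSON specification (a 4-byte length prefix, then elements that each start with a type byte, with 0x03 for an embedded document and 0x04 for an array). The names first_element_type, TAG_DOCUMENT, TAG_ARRAY and nested_array are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Type tags as listed in the BSON specification. */
enum { TAG_DOCUMENT = 0x03, TAG_ARRAY = 0x04 };

/* Hypothetical helper: returns the type tag of the first element of a raw
 * BSON document, or -1 if the document is empty or too short. */
static int first_element_type(const uint8_t *doc, size_t len)
{
    assert(doc != NULL);
    /* 4-byte length prefix, then either an element or the 0x00 terminator. */
    if (len < 5 || doc[4] == 0x00) {
        return -1;
    }
    return doc[4];
}

/* Raw bytes of { "a" : [ "x" ] }: the nested value is tagged 0x04 (array). */
static const uint8_t nested_array[] = {
    0x16, 0x00, 0x00, 0x00,             /* outer size: 22 bytes */
    0x04, 'a', 0x00,                    /* element: array named "a" */
    0x0e, 0x00, 0x00, 0x00,             /* inner size: 14 bytes */
    0x02, '0', 0x00,                    /* element: string named "0" */
    0x02, 0x00, 0x00, 0x00, 'x', 0x00,  /* string length 2, then "x\0" */
    0x00,                               /* inner terminator */
    0x00                                /* outer terminator */
};
```

The inner 14 bytes (starting at offset 7) carry no tag of their own. Taken by themselves as a root, they are byte-for-byte what the document { "0" : "x" } would be, which is exactly the ambiguity described here.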

More specifically, if I want to build a root BSON array, I need to append elements with string keys "0", "1", "2", etc. effectively converting them from the original indexes with bson_uint32_to_string(). The output BSON will be like this:

{ "0" : "aaa", "1" : "bbb", "2" : "ccc", ... }

If I fail to do this, the underlying bson_append_array() routine in mongoc will complain in stderr about improperly formed array keys.

But what happens if I want to restore the original array in a high-level language from such a document? Obviously, I need to somehow determine that this document is indeed an array because it would otherwise be incorrect to marshal it back as a document (associative array) with string keys "0", "1", "2", etc.

Should I first parse all the keys converting them from strings to integers and checking if they are in ascending order and then traverse the document again?

If yes, there is no bson_string_to_uint32() call for quick backward conversion, and I am left to use slow GLIBC calls like strtol() to do that.

If no, I can just check the first key and see if it is "0", but that doesn't seem to be reliable to me because one can forge a document like

{ "0" : "aaa", "2" : "bbb", "4" : "ccc" }

which will lose information if treated like an array.

So, what is the official canonical way to marshal root BSON arrays back to a high-level language?

Of course, this leaves one wondering why BSON is designed so that arrays need the string keys "0", "1", "2", ... at all, instead of having a proper way to encode arrays without them. That could be done fairly easily while retaining backward compatibility.

Thanks!



 Comments   
Comment by Arseny Vakhrushev [ 19/Jan/17 ]

Single-typed arrays are a good development indeed! And I strongly vote for it.

"The case where a MongoDB client has a document and doesn't know whether it's an array or not arises very rarely - essentially, only in mongoc_collection_aggregate."

That is indeed a very rare problem, and I wouldn't vote for solving it on its own. That's why I created a question instead of an issue in the first place.

But if you are really following me, you can see that these things ...

  • The need to fiddle with integer/string keys via bson_uint32_to_string();
  • The need to store ugly string keys "0", "1", "2" ... in documents, which wastes space;
  • The need to report ugly array errors to stderr;

... are present independently from the initial issue. They are always side-by-side with arrays in libbson, and they aren't looking great. Moreover, they can't and won't be fixed by single-typed arrays alone except, of course, if you deprecate BSON_TYPE_ARRAY altogether making it impossible to have mixed-type arrays. This will make BSON incompatible with JSON though.

Comment by A. Jesse Jiryu Davis [ 19/Jan/17 ]

Arseny, thanks for your five cents. That's an interesting proposal. However, it solves a very rare problem. The case where a MongoDB client has a document and doesn't know whether it's an array or not arises very rarely - essentially, only in mongoc_collection_aggregate.

We have a better and more radical proposal that we plan to implement within the next few server releases: fixed-type arrays, SERVER-9380. If we're going to make the effort to deeply change the structure of BSON, then fixed-type arrays are a better investment. Please watch and vote for that ticket.

Comment by Arseny Vakhrushev [ 18/Jan/17 ]

Yes, I am aware of bson_uint32_to_string()'s pregenerated values. That's what I meant by saying that the opposite operation should be at least as fast to be worth the effort put into bson_uint32_to_string().

I must admit that the design of arrays in BSON doesn't look nice and clean to me. May I ask you why the hassle with converting integer keys to strings, storing them (they occupy space), converting them back to check, etc. exists? Wouldn't it be much cleaner to specify arrays without keys altogether?

Let's assume we could extend the document specification, so:

document ::= int32 e_list "\x00"

becomes:

document ::= int32 e_list "\x00" (value*)

and we demand that either the e_list part or the (value*) part should be empty. In other words, one must not mix associative and sequential parts together.

With this:
1) We will retain backward compatibility when the (value*) part is absent;
2) We will be able to quickly determine if the document is array-like by comparing the first key to "\x00";
3) No need to fiddle with integer/string keys via bson_uint32_to_string();
4) No need to store ugly string keys "0", "1", "2" ... in documents which will save space;
5) No need to report ugly array errors to stderr;
6) We can get rid of the BSON_TYPE_ARRAY type in the future or continue using it for backward compatibility;

Yes, this might imply a new set of bson_append_-like functions or extending the existing ones, but the API will look much cleaner and the above benefits seem to be attainable for a small effort.

Just my five cents...

Comment by A. Jesse Jiryu Davis [ 18/Jan/17 ]

That's correct. Consider either comparing the first key to "0" or, if you need to check all N keys, generating each key "i" for i from 0 to N-1 with bson_uint32_to_string and comparing it to the i'th key of the document. Note that bson_uint32_to_string has pregenerated the first 1000 keys, so you are effectively paying only for the strncmp of a very short string.
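A minimal sketch of the full check in plain C follows. It is not libbson code: snprintf stands in for bson_uint32_to_string so the example builds without the library, and the names array_key_check_t, array_key_check_init, array_key_check_feed and keys_are_array_like are hypothetical. In a real binding the keys would presumably come from bson_iter_key() while walking the top level with bson_iter_next().

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical checker state: the index the next key must spell out. */
typedef struct {
    uint32_t expected;
    bool ok;
} array_key_check_t;

static void array_key_check_init(array_key_check_t *c)
{
    assert(c != NULL);
    c->expected = 0;
    c->ok = true;
}

/* Feed the top-level keys one by one, in document order.
 * Assumption: snprintf replaces bson_uint32_to_string in this sketch. */
static void array_key_check_feed(array_key_check_t *c, const char *key)
{
    char want[11]; /* "4294967295" plus the terminator */

    assert(c != NULL && key != NULL);
    if (!c->ok) {
        return; /* already rejected; nothing more to learn */
    }
    snprintf(want, sizeof want, "%" PRIu32, c->expected);
    if (strcmp(key, want) == 0) {
        c->expected++;
    } else {
        c->ok = false;
    }
}

/* Convenience wrapper for keys already collected into a C array. */
static bool keys_are_array_like(const char *const *keys, size_t n)
{
    array_key_check_t c;
    size_t i;

    array_key_check_init(&c);
    for (i = 0; i < n && c.ok; i++) {
        array_key_check_feed(&c, keys[i]);
    }
    return c.ok;
}
```

With this sketch the keys "0", "1", "2" are accepted, while "0", "2", "4" are rejected at the second key, which the first-key heuristic alone would not catch. An empty document is reported as array-like, so a caller still has to decide how to treat that ambiguous case.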

Comment by Arseny Vakhrushev [ 18/Jan/17 ]

Now we're cooking. To iterate all the top-level keys quickly and check their integer values, there clearly should be some kind of inverse of bson_uint32_to_string(), namely bson_string_to_uint32(). As I wrote in the description:

"There is no bson_string_to_uint32() call for quick backward conversion, and I am left to use slow GLIBC calls like strtol() to do that."

Otherwise, all the effort put into bson_uint32_to_string() to speed up conversion of integer keys goes to waste on the way back.

So, the only viable option to determine "array-likeness" right now is to check if the first key is "0". Is that correct?

Comment by A. Jesse Jiryu Davis [ 17/Jan/17 ]

You can determine if the BSON is array-like by iterating all the top-level keys and asserting they are the strings "0", "1", "2", .... As a heuristic, check if the first key is the string "0".


Comment by Arseny Vakhrushev [ 17/Jan/17 ]

Thanks for your attention, Jesse!
Hannes' answer merely repeated what I actually asked - "There's no way to determine if a root BSON is an array because there's no such thing as a root BSON array". Hope the following will not waste everyone's time, including mine....

"There is no such thing called "root BSON array". The "root BSON" is defined as a document per the spec, so the canonical container is always document."

This is simply not true. To prove that, one can run:

#include <mongoc.h>

int main(void) {
	mongoc_client_t *client;
	mongoc_collection_t *collection;
	mongoc_cursor_t *cursor;
	bson_t pipeline;

	mongoc_init();

	client = mongoc_client_new("mongodb://127.0.0.1");
	BSON_ASSERT(client);
	collection = mongoc_client_get_collection(client, "test-database", "test-collection");
	BSON_ASSERT(collection);

	/* Deliberately build a plain document: its first key is "a", not "0". */
	bson_init(&pipeline);
	BSON_APPEND_BOOL(&pipeline, "a", true);

	/* The pipeline is expected to be array-like, so this call produces the message shown below. */
	cursor = mongoc_collection_aggregate(collection, MONGOC_QUERY_NONE, &pipeline, NULL, NULL);
	BSON_ASSERT(cursor);

	bson_destroy(&pipeline);
	mongoc_cursor_destroy(cursor);
	mongoc_collection_destroy(collection);
	mongoc_client_destroy(client);
	mongoc_cleanup();
	return 0;
}

And the result is:

bson_append_array(): invalid array detected. first element of array parameter is not "0".

So clearly, mongoc does in fact internally distinguish between root arrays and documents. Hence my initial question.

If I am writing a binding to mongoc for a high-level language, this leads to the following problem. When a high-level array is converted to a type that wraps around bson_t, which is then fed to a mongoc_collection_aggregate() wrapper for example, there should be a way to check that the argument is a root BSON array and to let the user know if it is not. Otherwise, I'll get the above error message on stderr (which should in fact be propagated upwards, btw).

There are two ways to achieve that now:
1) I need to store additional information (a flag) along with a bson_t;
2) I need to rely on the contents of a bson_t itself to determine if it's a root array or not (note that I do exactly that for nested arrays relying on their type);

A similar problem arises when I try to do transitions like:

High-level Array type ---> Wrapper around bson_t ---> High-level Array type

I should be able to restore the initial high-level array from a wrapper type whenever I get a fresh copy of bson_t. To do that, I need to know how to determine if bson_t is an array. Hence my initial question.

To elaborate further, consider that I am mapping two methods of a mongoc_gridfs_file_t:
1) mongoc_gridfs_file_get_aliases() - returns a root BSON array which I should be able to restore as a high-level array based on its contents;
2) mongoc_gridfs_file_set_aliases() - anticipates a root BSON array; otherwise, it produces the same error message about invalid array keys;

To sum things up, the API complains about root documents not being "properly formatted" whereas there's no way to determine if documents are indeed "properly formatted" before providing them to other methods.

Hope this will be useful....

Comment by A. Jesse Jiryu Davis [ 17/Jan/17 ]

We haven't heard back in a while, let us know if you have more questions.

Comment by Hannes Magnusson [ 06/Jan/17 ]

I'm not entirely following, so excuse me if I'm being daft, but I think you are asking how to determine if the "container BSON" or "root BSON" is an array or document?

There is no such thing called "root BSON array". The "root BSON" is defined as a document per the spec, so the canonical container is always document.

Does that make sense?

Comment by J Rassi [ 06/Jan/17 ]

Moved to CDRIVER.

Comment by Arseny Vakhrushev [ 06/Jan/17 ]

Oh, sorry. This mostly pertains to CDRIVER and libbson, not CXX. Thanks!
