[CSHARP-254] Counting documents - 32 bits enough? Created: 22/Jun/11  Updated: 05/Apr/19  Resolved: 22/Jul/11

Status: Closed
Project: C# Driver
Component/s: None
Affects Version/s: 1.1
Fix Version/s: 1.2

Type: Task
Priority: Minor - P4
Reporter: Aristarkh Zagorodnikov
Assignee: Robert Stam
Resolution: Done
Votes: 0
Labels: question
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Backwards Compatibility: Minor Change

 Description   

I noticed that MongoCollection.Count(...) returns an Int32, although I do not recall MongoDB having a limit of 2^31 documents per collection (there might be one; I did not spend long checking).
I wonder if such aggregation methods should return an Int64 to prevent loss (or even misinterpretation) of information on very large collections. While I agree that most users will never run into this problem, it still seems worth looking at.
I took some time and checked the Java driver (IIRC the only other strongly typed one with a collection abstraction). It uses the "long" type (http://api.mongodb.org/java/current/com/mongodb/DBCollection.html#count() ), which is 64-bit (my Java knowledge is about 5 years out of date, but Google comes to the rescue: http://download.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html).
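
To illustrate the concern, here is a minimal sketch of what happens when a 64-bit count is narrowed to 32 bits. It is written in Java, whose int and long have the same widths as C#'s Int32 and Int64. The class and method names are hypothetical and do not come from either driver, and the collection size is an assumed value.

```java
// Hypothetical illustration, not driver code: a count that arrives as a
// 64-bit value but is exposed through a 32-bit return type.
class CountNarrowing {
    // Narrowing cast: keeps only the low 32 bits, with no error raised.
    static int countAsInt32(long serverCount) {
        return (int) serverCount;
    }

    public static void main(String[] args) {
        long serverCount = 3_000_000_000L; // assumed collection size above 2^31 - 1
        System.out.println(countAsInt32(serverCount)); // prints -1294967296
    }
}
```

The negative result is the "misinterpretation" case: the caller gets a value that looks valid to the type system but no longer reflects the real count.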

I understand that changing the return type of something like Count() might break existing code (an explicit conversion would be required), so this might not be so easy. If you see a need for a 64-bit result but don't like the idea of breaking existing code, introducing Count64() counterparts (the exact name isn't that important) to the existing aggregation methods might be the best way to go in the short term. The old 32-bit methods could eventually be deprecated and renamed to Count32(), with Count64() then becoming Count().
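
If the return type is simply widened, callers that still need a 32-bit value would have to convert explicitly. A minimal sketch of a checked conversion follows, again in Java with hypothetical names that are not part of any driver (in C#, a checked((int)value) cast plays a similar role):

```java
// Hypothetical caller-side helper, not part of any driver.
class LegacyCountAdapter {
    // Converts a 64-bit count to 32 bits, throwing ArithmeticException
    // instead of silently wrapping when the value does not fit.
    static int toInt32Count(long count) {
        return Math.toIntExact(count);
    }

    public static void main(String[] args) {
        System.out.println(toInt32Count(42L)); // prints 42
        try {
            toInt32Count(3_000_000_000L); // assumed count above the 32-bit range
        } catch (ArithmeticException e) {
            System.out.println("count does not fit in 32 bits");
        }
    }
}
```

A checked conversion like this keeps old call sites compiling after the change while turning an overflow into a visible failure rather than a wrong number.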



 Comments   
Comment by Robert Stam [ 22/Jul/11 ]

I have changed the return values of most of the items listed in my 08/Jul/11 comment from Int32 to Int64.

The only exception: I left the types of GeoNearResult as Int32 because, on closer examination, I don't believe they can overflow Int32. Reasoning: ObjectsLoaded must be a relatively small number because all results must be returned in a single document, so the number of hits is limited by the document size. NumberScanned is either equal to ObjectsLoaded or some small multiple of it (it would be larger if we have additional filter criteria or if the grid is too coarse). BTreeLocations must be less than or equal to NumberScanned.

Comment by Robert Stam [ 09/Jul/11 ]

Thanks for your comment. After reading it, I agree with you. If there is any possibility at all that the result value will ever exceed the range of Int32 then the type of that result should be Int64.

Comment by Aristarkh Zagorodnikov [ 09/Jul/11 ]

I don't see any reason (which doesn't mean none exists; my knowledge is certainly limited) to treat the values in the second group any differently from those in the first group.
While I understand that in most cases these values will not go beyond the 32-bit integer range, I really don't see much difference between, for example, MongoCursor.Count and MapReduceResults.OutputCount – both are results of a server-side query over a group of documents, and they are equally likely to produce the same results if the M/R operation doesn't have many things to reduce.

Comment by Robert Stam [ 08/Jul/11 ]

The following properties and methods are currently defined as Int32 but should have been Int64:

MongoCollection
Count

MongoCursor
Count
Size

MongoGridFSFileInfo
Length

The following properties and methods are also currently defined as Int32 but seem very unlikely to ever exceed that range (although theoretically they could):

GeoNearResult
BTreeLocations
NumberScanned

GetLastError
DocumentsAffected

MapReduceResults
EmitCount
OutputCount
InputCount

Any operation involving this second group that would cause a value outside the range of Int32 would probably time out first.

My current thinking is to just change the first group to use Int64, and leave the second group alone. Any comments?

Comment by Robert Stam [ 22/Jun/11 ]

You're right, Count should have returned an Int64.

Probably the cleanest fix in the long run is to just change the return type to Int64. We might still be early enough (barely) in the life-cycle of the driver where a few minor breaking changes can still be tolerated (for example, version 1.1 had a few breaking changes).

Note: whatever resolution is chosen for this JIRA, we should at the same time address all other places where an Int32 might not be large enough.
