[SERVER-3645] Sharded collection counts (on primary) can report too many results Created: 18/Aug/11 Updated: 27/Oct/22 Resolved: 30/Mar/18
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Querying |
| Affects Version/s: | None |
| Fix Version/s: | 3.7.4 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Greg Studer | Assignee: | Ian Boros |
| Resolution: | Done | Votes: | 52 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Sprint: | Query 2018-03-12, Query 2018-03-26, Query 2018-04-09 |
| Participants: | |
| Case: | (copied to CRM) |
| Linked BF Score: | 0 |
| Description |
Summary

Count does not filter out unowned (orphaned) documents and can therefore report larger values than one will find via a normal query, or via itcount() in the shell.

Causes

The following conditions can lead to counts being off:
- A chunk migration is in progress: while a chunk is being moved, its documents temporarily exist on both the donor and the recipient shard.
- Orphaned documents were left behind by a failed or aborted migration and have not yet been cleaned up.
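A minimal shell sketch of the symptom (the collection name orders is illustrative, and assumes a sharded cluster in one of the states above):

db.orders.count()          // metadata-based count: may include orphaned or in-migration documents
db.orders.find().itcount() // iterates the actual results, so shard filtering removes the orphans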
Workaround

A workaround to get accurate counts is to ensure that all migrations have been cleaned up and no migrations are active. To get accurate counts from non-primaries you must also ensure that there is no replication lag, including any migration data, in addition to the above requirements.

Non-Primary Reads

For issues with counts/reads from non-primaries, please see SERVER-5931.

Behavior of "fast count" and non-"fast count"

A "fast count" is a count run without a predicate. It is "fast" because the implementation only reads the collection metadata, without fetching any documents. The problem of count() reporting inaccurate results has been fixed for non-"fast counts": starting in 4.0, counts which are run with a predicate are accurate on sharded clusters. "Fast counts" (count() run without a predicate) may still report too many documents (see SERVER-33753).

In general, if one needs an accurate count of how many documents are in a collection, we do not recommend using the count command. Instead, we suggest using the $count aggregation stage.
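The inline example did not survive the export; a minimal shell sketch (the collection name orders is illustrative):

db.orders.aggregate([
  { $match: {} },  // optional predicate
  { $count: "n" }  // emits a single document of the form { n: <count> }
])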
See the $count documentation for details. For users who need the performance of "fast count", and are okay with approximate results, we suggest using the $collStats aggregation stage instead of the count command.
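Again a minimal sketch, with an illustrative collection name:

db.orders.aggregate([
  { $collStats: { count: {} } }  // metadata-based count; on a sharded collection this emits one document per shard
])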
| Comments |
| Comment by Asya Kamsky [ 05/Mar/19 ] |
lucasoares: this was a pretty large change, so we currently don't have plans to backport it.
| Comment by Lucas [ 26/Feb/19 ] |
Hello! Will this be backported to 3.6? Thank you.
| Comment by Githook User [ 30/Mar/18 ] |
Author: Ian Boros <ian.boros@10gen.com>
Message:
| Comment by Githook User [ 26/Mar/18 ] |
Author: Ian Boros <ian.boros@10gen.com>
Message: Revert 0c082a81047cee66821e295d02e3588f7934ff64:
| Comment by Ian Whalen (Inactive) [ 21/Mar/18 ] |
Apologies for the additional email, but something about JIRA's underlying DB got out of sync, so I'm reopening and closing this to get the ticket into the correct resolved state.
| Comment by Ian Boros [ 21/Mar/18 ] |
See the "Behavior of 'fast count' and non-'fast count'" section in the description for the behavior from 4.0 on.
| Comment by Githook User [ 21/Mar/18 ] |
Author: Ian Boros <ian.boros@10gen.com>
Message:
| Comment by Charlie Swanson [ 08/Jan/18 ] |
This came up again during work on
| Comment by Holger Morch [ 08/Mar/17 ] |
This is a 6
| Comment by Farid Sharipov [ 10/Nov/15 ] |
In our case we are seeing this problem in an un-sharded collection with MongoDB 3.0.
| Comment by Markus Mahlberg [ 28/Oct/15 ] |
Interestingly, an aggregation counting the documents returns the correct value:
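The snippet itself was not preserved in this export; a minimal sketch of such a counting aggregation (the collection name is illustrative, using the $group form available in 3.0-era shells):

db.collection.aggregate([
  { $group: { _id: null, n: { $sum: 1 } } }  // aggregation applies shard filtering, so orphans are excluded
])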
| Comment by Patrick Lauber [ 03/Jun/15 ] |
This bug cost me almost a week as I tried to understand why my import numbers differed from my source...
| Comment by Jon Hyman [ 09/Dec/14 ] |
Do you know if this is going to make it into 2.8? We have a 6-hour balancer window, and our counts can be wrong during 25% of the day due to it.
| Comment by Kevin J. Rice [ 28/Jul/14 ] |
@Sam Flint - THANK YOU, that explain().n saves a bunch of time! Great hint!
| Comment by sam flint [ 27/Jul/14 ] |
You can use explain() to capture the correct count, and this is much faster than itcount(). We put this in our client-side application to call explain().n rather than count() or itcount(); compare find({client_id:1}).count(), find({client_id:1}).itcount(), and find({client_id:1}).explain().n.
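A hedged sketch of the comparison Sam describes (the collection name and query are illustrative; explain().n refers to the legacy explain output of 2.x-era shells):

db.collection.find({client_id: 1}).count()     // fast count: may overcount due to orphans
db.collection.find({client_id: 1}).itcount()   // iterates the cursor: accurate but slower
db.collection.find({client_id: 1}).explain().n // number of documents the query actually matched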
| Comment by Srinivas Mutyala [ 24/Apr/14 ] |
Additional issues with a sharded set-up while sharding is in progress:
1) After the MongoDB initial sync, the numbers of documents in the individual shards are nearly equal - but not exactly.
2) While the sync is in progress, querying the total number of documents through the mongos (router) gives inconsistent results. Ideally it should be the same number all the time. If the issue were only with count() that would be fine, but what about data consistency?
3) Initial sync time is directly proportional to the total data size and is very slow. This needs a fix.
| Comment by Asya Kamsky [ 09/Jan/14 ] |
This ticket is tracking count(), which uses collection metadata to quickly get the total count of documents. There is a different ticket tracking the fact that when you query secondaries with a broadcast query (i.e. untargeted, not involving the shard key) and there is either a migration in progress or orphaned documents left from an aborted migration, the secondary doesn't know to filter them out the way the primary would. That ticket is https://jira.mongodb.org/browse/SERVER-5931 - the workaround of reading from primaries when using non-targeted queries will work for you. If you are using targeted queries (ones including the shard key) then this should not be a problem whether you read from primaries or secondaries.
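A minimal shell sketch of that workaround (the collection name and predicate are illustrative):

// Force a broadcast (non-targeted) query to read from primaries,
// which know to filter out orphaned documents:
db.orders.find({ status: "open" }).readPref("primary").itcount()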
| Comment by Jon Hyman [ 09/Jan/14 ] |
No, I have query criteria. I see; I only get the issues when using secondary reads, so perhaps it is something unrelated to this issue. I patched my driver (Moped) to print out information; here's what I got from a count. Though as I mentioned, I can also reproduce this when querying over the data and incrementing my own counter, so it's not just on a count(). As far as I can tell, when it is bad, it is always the same shard. ObjectRocket, our hosting provider, pointed me to this issue, but perhaps we can troubleshoot separately.

GOOD (returns a count of 101844):
Response [225, -645272788, 23, 1, 0, 0, 0, 1]
Operation #<Moped::Protocol::Command @length=145 @request_id=23 @response_to=0 @op_code=2004 @flags=[:slave_ok] @full_collection_name="REDACTED.$cmd" @skip=0 @limit=-1 @selector={:count=>"COLLECTION_REDACTED", :query=>{REDACTED}} @fields=nil>
Reply #<Moped::Protocol::Reply @length=225 @request_id=-645272788 @response_to=23 @op_code=1 @flags=[] @cursor_id=0 @offset=0 @count=1 @documents=[{"shards"=>{"2f7abb37320b715c8ed68c86d29d93a7"=>25754, "cabe9e1b214ce57538a20ca6688a8ee0"=>25300, "dcb91ad1cd3630601020f83d7b6883e0"=>25581, "f1678636cab6fa77c76a9264ea9963a7"=>25209}, "n"=>101844, "ok"=>1.0}]>

BAD (returns a count of 99503):
Response [225, -645270088, 24, 1, 0, 0, 0, 1]
Operation #<Moped::Protocol::Command @length=145 @request_id=24 @response_to=0 @op_code=2004 @flags=[:slave_ok] @full_collection_name="REDACTED.$cmd" @skip=0 @limit=-1 @selector={:count=>"COLLECTION_REDACTED", :query=>{REDACTED}} @fields=nil>
Reply #<Moped::Protocol::Reply @length=225 @request_id=-645270088 @response_to=24 @op_code=1 @flags=[] @cursor_id=0 @offset=0 @count=1 @documents=[{"shards"=>{"2f7abb37320b715c8ed68c86d29d93a7"=>23413, "cabe9e1b214ce57538a20ca6688a8ee0"=>25300, "dcb91ad1cd3630601020f83d7b6883e0"=>25581, "f1678636cab6fa77c76a9264ea9963a7"=>25209}, "n"=>99503, "ok"=>1.0}]>
| Comment by Asya Kamsky [ 09/Jan/14 ] |
jonhyman: are these straight count() calls without a condition? This is not tied to secondary reads - during migrations the same documents exist on more than one shard (on the primary) on a perfectly functioning system.
| Comment by Jon Hyman [ 09/Jan/14 ] |
Hi there,

We just noticed this bug (it's been happening for a long time according to our logs). We have a sharded setup and run a count over some documents using secondary reads. While troubleshooting, I noticed that even if I iterated over the query and incremented my own counter, I would get the same wrong result part of the time. It's extremely sporadic: there are periods of the day where, if I run the same query over and over again, it alternates between returning the correct count and an incorrect count.

The problem is that now I'm extremely distrusting of secondary reads. Our application uses secondary reads in a lot of places because we have hundreds of millions of documents. It seems like I need to remove all secondary read preferences from my application. Does that seem like the right solution to you? If so, this is horrible IMO, and secondary reads are broken in the worst way (silently returning incorrect data).
| Comment by Daniel Pasette (Inactive) [ 06/Nov/13 ] |
This is a major change and unfortunately did not make the cut for 2.6.
| Comment by James Smith [ 05/Nov/13 ] |
Is a fix for this targeted for 2.6?
| Comment by auto [ 05/Oct/11 ] |
Author: gregstuder (gregs) <greg@10gen.com>
Message: buildbot fix test b/c of
| Comment by Eliot Horowitz (Inactive) [ 18/Aug/11 ] |
We can't do a real count - we need to count using more metadata.