Core Server / SERVER-3645

count() reports too many results in sharded setup

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major - P3
    • Resolution: Unresolved
    • Affects Version/s: 1.9.2
    • Fix Version/s: 2.7 Desired
    • Component/s: Sharding
    • Labels:
      None
    • Backport:
      No
    • Operating System:
      ALL
    • # Replies:
      8
    • Last comment by Customer:
      true
    • Driver changes needed?:
      No driver changes needed

      Description

      When count() is run on a sharded collection with no filter, each individual shard's count is based on the applySkipLimit() results, which in turn come from NamespaceDetails::stats.nrecords. If migrations are in progress, this overcounts the documents in the collection, since documents not yet committed or not yet deleted are never filtered out.

      Potential fix: add an always-true query to the sharded count command, forcing an actual count (see the sketch below).
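
      A minimal mongo shell sketch of the behavior and the proposed workaround (the collection name "orders" is hypothetical):

      // Unfiltered count: mongos asks each shard for its metadata count
      // (nrecords) and sums the results, so documents that are mid-migration
      // can be counted on two shards at once.
      db.orders.count()

      // Proposed workaround from the description: an always-true predicate
      // forces each shard to run a real count over documents instead of
      // returning nrecords. (Per the first comment below, this alone is not
      // sufficient - more metadata would still need to be consulted.)
      db.orders.count({ _id: { $exists: true } })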

        Issue Links

          Activity

          Eliot Horowitz added a comment -

          We can't do a real count - need to count more metadata.

          auto added a comment -

          Author: {u'login': u'gregstuder', u'name': u'gregs', u'email': u'greg@10gen.com'}

          Message: buildbot fix test b/c of SERVER-3645 on migrates
          Branch: master
          https://github.com/mongodb/mongo/commit/d8f91a17afa59a31222eeb377690a88af74be498

          James Smith added a comment -

          Is a fix for this targeted for 2.6?

          Dan Pasette added a comment -

          This is a major change and, unfortunately, it did not make the cut for 2.6.

          Jon Hyman added a comment -

          Hi there,

          We just noticed this bug (judging from our logs, it has been happening for a long time). We have a sharded setup and run a count over some documents using secondary reads.

          While troubleshooting, I noticed that even if I iterated over the query and incremented my own counter, I would get the same wrong result part of the time. It's extremely sporadic: there are periods of the day where, if I run the same query over and over, it alternates between returning the correct count and an incorrect count.

          The problem is that I'm now extremely distrustful of secondary reads. Our application uses secondary reads in a lot of places because we have hundreds of millions of documents. It seems like I need to remove all secondary read preferences. Does that seem like the right solution to you? If so, this is horrible, IMO: secondary reads are broken in the worst way (silently returning incorrect data).

          Asya Kamsky added a comment -

          Jon Hyman, are these straight count() calls without a condition? This is not tied to secondary reads: during migrations, the same documents exist on more than one shard (on the primary) on a perfectly functioning system.

          Jon Hyman added a comment -

          No, I have query criteria. I see; I only get the issue when using secondary reads, so perhaps it is something unrelated to this ticket. I patched my driver (Moped) to print out information; here's what I got from a count. Though, as I mentioned, I can also reproduce this when querying over the data and incrementing my own counter, so it's not just count().

          As far as I can tell, when it is bad, it is always the same shard. ObjectRocket, our hosting provider, pointed me to this issue but perhaps we can troubleshoot separately.

          GOOD (returns a count of 101844)

          Response [225, -645272788, 23, 1, 0, 0, 0, 1]
          reply doc {"shards"=>{"2f7abb37320b715c8ed68c86d29d93a7"=>25754, "cabe9e1b214ce57538a20ca6688a8ee0"=>25300, "dcb91ad1cd3630601020f83d7b6883e0"=>25581, "f1678636cab6fa77c76a9264ea9963a7"=>25209}, "n"=>101844, "ok"=>1.0}
          Operation #<Moped::Protocol::Command
          @length=145
          @request_id=23
          @response_to=0
          @op_code=2004
          @flags=[:slave_ok]
          @full_collection_name="REDACTED.$cmd"
          @skip=0
          @limit=-1
          @selector={:count=>"COLLECTION_REDACTED", :query=>{REDACTED}
          @fields=nil>, reply #<Moped::Protocol::Reply
          @length=225
          @request_id=-645272788
          @response_to=23
          @op_code=1
          @flags=[]
          @cursor_id=0
          @offset=0
          @count=1
          @documents=[{"shards"=>{"2f7abb37320b715c8ed68c86d29d93a7"=>25754, "cabe9e1b214ce57538a20ca6688a8ee0"=>25300, "dcb91ad1cd3630601020f83d7b6883e0"=>25581, "f1678636cab6fa77c76a9264ea9963a7"=>25209}, "n"=>101844, "ok"=>1.0}]>
          => 101844

          BAD (returns a count of 99503)

          Response [225, -645270088, 24, 1, 0, 0, 0, 1]
          reply doc {"shards"=>{"2f7abb37320b715c8ed68c86d29d93a7"=>23413, "cabe9e1b214ce57538a20ca6688a8ee0"=>25300, "dcb91ad1cd3630601020f83d7b6883e0"=>25581, "f1678636cab6fa77c76a9264ea9963a7"=>25209}, "n"=>99503, "ok"=>1.0}
          Operation #<Moped::Protocol::Command
          @length=145
          @request_id=24
          @response_to=0
          @op_code=2004
          @flags=[:slave_ok]
          @full_collection_name="REDACTED.$cmd"
          @skip=0
          @limit=-1
          @selector={:count=>"COLLECTION_REDACTED", :query=>{REDACTED}
          @fields=nil>, reply #<Moped::Protocol::Reply
          @length=225
          @request_id=-645270088
          @response_to=24
          @op_code=1
          @flags=[]
          @cursor_id=0
          @offset=0
          @count=1
          @documents=[{"shards"=>{"2f7abb37320b715c8ed68c86d29d93a7"=>23413, "cabe9e1b214ce57538a20ca6688a8ee0"=>25300, "dcb91ad1cd3630601020f83d7b6883e0"=>25581, "f1678636cab6fa77c76a9264ea9963a7"=>25209}, "n"=>99503, "ok"=>1.0}]>
          => 99503

          Asya Kamsky added a comment -

          This ticket is tracking count(), which uses collection metadata to quickly get the total count of documents.

          There is a different ticket tracking the fact that when you query secondaries with a broadcast query (i.e. untargeted, not involving the shard key) and there is either a migration in progress or orphan documents left from an aborted migration, the secondary doesn't know to filter them out the way the primary would. That ticket is https://jira.mongodb.org/browse/SERVER-5931 - the workaround of reading from primaries when using non-targeted queries will work for you (see the sketch below). If you are using targeted queries (ones that include the shard key), then this should not be a problem whether you are on primaries or secondaries.
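
          A minimal mongo shell sketch of that workaround (the collection name "orders", the shard key field "customerId", and the query values are hypothetical):

          // Untargeted (broadcast) query - no shard key in the predicate.
          // Reading from the primary lets the documents that are mid-migration
          // or orphaned be filtered out, per SERVER-5931.
          db.getMongo().setReadPref("primary")
          db.orders.find({ status: "open" }).itcount()

          // Targeted query - includes the shard key ("customerId" here), so it
          // is routed only to the owning shard; per the comment above, secondary
          // reads are expected to be safe in this case.
          db.orders.find({ customerId: 12345, status: "open" }).itcount()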


            People

            • Votes: 20
            • Watchers: 40

            Dates

            • Days since reply: 14 weeks, 3 days ago