[SERVER-17518] Overflow sort after upgrade from 2.4.3 to 2.6.8 Created: 09/Mar/15  Updated: 12/Mar/15  Resolved: 10/Mar/15

Status: Closed
Project: Core Server
Component/s: Querying
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Paul Moosman Assignee: David Storch
Resolution: Done Votes: 0
Labels: exceeds, large_db, overflow, sort
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: CentOS


Operating System: Linux
Participants:

 Description   

We are getting the following error:

Overflow sort stage buffered data usage of 33554527 bytes exceeds internal limit of 33554432 bytes code:17406

For the following query:

   MyDBCollection.com.xyz.client query:
   { $query: { net: "TTY", tags: { $in: [ "us_uk", "us_ub" ] } }, $orderby: { _id:-1 } }

Here is the index:

db.com.xyz.client.getIndexes()
  [
        {
                "v" : 1,
                "key" : {
                        "_id" : 1
                },
                "ns" : "MyDBCollection.com.xyz.client",
                "name" : "_id_"
        },
        {
                "v" : 1,
                "key" : {
                        "tk" : 1
                },
                "ns" : "MyDBCollection.com.xyz.client",
                "name" : "tk_1"
        },
        {
                "v" : 1,
                "key" : {
                        "net" : 1,
                        "tags" : 1
                },
                "ns" : "MyDBCollection.com.xyz.client",
                "name" : "net_1_tags_1"
        }
  ]

We recently upgraded from 2.4.3 to 2.6.8. We run a primary and 3 replicas. No sharding.

The collection has about 780,000 records.

Here is the full log:

    2015-03-05T19:12:06.880-0500 [conn15479] query
    MyDBCollection.com.xyz.client query: { $query: { net:
    "TTY", tags: { $in: [ "us_uk", "us_ub" ] } }, $orderby: { _id:
    -1 } } planSummary: IXSCAN { net: 1, tags: 1 }, IXSCAN { net: 1, tags:
    1 } cursorid:984918561948 ntoreturn:1000 ntoskip:0 nscanned:703402
    nscannedObjects:703402 keyUpdates:0 numYields:539 locks(micros)
    r:8380719 nreturned:1000 reslen:81020 5825ms
    2015-03-05T19:12:07.016-0500 [repl writer worker 3] build index on:
    MyDBCollection2.e.a.m.client.kitapp
    properties: { v: 1, key: { _id: 1 }, name: "_id_", ns:
    "MyDBCollection2.e.a.m.client.kitapp" }
    2015-03-05T19:12:07.016-0500 [repl writer worker 3]      added index
    to empty collection
    2015-03-05T19:12:07.018-0500 [repl writer worker 3] build index on:
    MyDBCollection2.e.a.m.client.kitapp
    properties: { v: 1, key: { state: 1 }, name: "state_1", ns:
    "MyDBCollection2.e.a.m.client.kitapp" }
    2015-03-05T19:12:07.018-0500 [repl writer worker 3]      added index
    to empty collection
    2015-03-05T19:12:07.026-0500 [repl writer worker 5] build index on:
    MyDBCollection2.e.a.m.client.kitapp
    properties: { v: 1, key: { updated: 1 }, name: "updated_1", ns:
    "MyDBCollection2.e.a.m.client.kitapp",
    expireAfterSeconds: 604800 }
    2015-03-05T19:12:07.027-0500 [repl writer worker 5]      added index
    to empty collection
    2015-03-05T19:12:07.062-0500 [repl writer worker 10] build index on:
    MyDBCollection2.e.a.m.client.kitextension
    properties: { v: 1, key: { _id: 1 }, name: "_id_", ns:
    "MyDBCollection2.e.a.m.client.kitextension"
    }
    2015-03-05T19:12:07.062-0500 [repl writer worker 10]     added index
    to empty collection
    2015-03-05T19:12:07.064-0500 [repl writer worker 10] build index on:
    MyDBCollection2.e.a.m.client.kitextension
    properties: { v: 1, key: { state: 1 }, name: "state_1", ns:
    "MyDBCollection2.e.a.m.client.kitextension"
    }
    2015-03-05T19:12:07.064-0500 [repl writer worker 10]     added index
    to empty collection
    2015-03-05T19:12:07.069-0500 [repl writer worker 11] build index on:
    MyDBCollection2.e.a.m.client.kitextension
    properties: { v: 1, key: { updated: 1 }, name: "updated_1", ns:
    "MyDBCollection2.e.a.m.client.kitextension",
    expireAfterSeconds: 604800 }
    2015-03-05T19:12:07.069-0500 [repl writer worker 11]     added index
    to empty collection
    2015-03-05T19:12:07.118-0500 [repl writer worker 10] build index on:
    MyDBCollection2.com.a.m.client.kitapp properties: {
    v: 1, key: { _id: 1 }, name: "_id_", ns:
    "MyDBCollection2.com.a.m.client.kitapp" }
    2015-03-05T19:12:07.118-0500 [repl writer worker 10]     added index
    to empty collection
    2015-03-05T19:12:07.119-0500 [repl writer worker 10] build index on:
    MyDBCollection2.com.a.m.client.kitapp properties: {
    v: 1, key: { state: 1 }, name: "state_1", ns:
    "MyDBCollection2.com.a.m.client.kitapp" }
    2015-03-05T19:12:07.120-0500 [repl writer worker 10]     added index
    to empty collection
    2015-03-05T19:12:07.132-0500 [repl writer worker 3] build index on:
    MyDBCollection2.com.a.m.client.kitapp properties: {
    v: 1, key: { updated: 1 }, name: "updated_1", ns:
    "MyDBCollection2.com.a.m.client.kitapp",
    expireAfterSeconds: 604800 }
    2015-03-05T19:12:07.133-0500 [repl writer worker 3]      added index
    to empty collection
    2015-03-05T19:12:07.170-0500 [repl writer worker 3] build index on:
    MyCollection2.com.a.m.client.kitextension
    properties: { v: 1, key: { _id: 1 }, name: "_id_", ns:
    "MyCollection2.com.a.m.client.kitextension" }
    2015-03-05T19:12:07.170-0500 [repl writer worker 3]      added index
    to empty collection
    2015-03-05T19:12:07.170-0500 [repl writer worker 3] build index on:
    MyCollection2.com.a.m.client.kitextension
    properties: { v: 1, key: { state: 1 }, name: "state_1", ns:
    "MyCollection2.com.a.m.client.kitextension" }
    2015-03-05T19:12:07.171-0500 [repl writer worker 3]      added index
    to empty collection
    2015-03-05T19:12:07.175-0500 [repl writer worker 10] build index on:
    MyCollection2.com.a.m.client.kitextension
    properties: { v: 1, key: { updated: 1 }, name: "updated_1", ns:
    "MyCollection2.com.a.m.client.kitextension",
    expireAfterSeconds: 604800 }
    2015-03-05T19:12:07.175-0500 [repl writer worker 10]     added index
    to empty collection
    2015-03-05T19:12:09.376-0500 [conn15501] ERROR: Runner error, stats:
    2015-03-05T19:12:09.498-0500 [conn15501] getmore
    MyDBCollection.com.xyz.client cursorid:984918561948
    ntoreturn:1000 keyUpdates:0 exception: getMore runner error: Overflow
    sort stage buffered data usage of 33554527 bytes exceeds internal
    limit of 33554432 bytes code:17406 numYields:152 locks(micros)
    r:1951439 nreturned:1 reslen:166 1511ms
    2015-03-05T19:12:09.522-0500 [conn15501] killcursors: found 0 of 1
    2015-03-05T19:12:09.522-0500 [conn15501] killcursors  keyUpdates:0
    numYields:0 locks(micros) r:10 17ms



 Comments   
Comment by David Storch [ 12/Mar/15 ]

Please refer to the documentation on index-based sorts for further information on when a sort order can or cannot be obtained from scanning an index in order.
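
For illustration, a minimal shell sketch (not taken from this ticket): an index whose trailing key matches the sort field can sometimes supply the sort order directly, avoiding the in-memory SORT stage. Whether the 2.6 planner can do this for a multikey $in query like yours is an assumption that would need to be verified with explain() against your data:

    db.com.xyz.client.ensureIndex({net: 1, tags: 1, _id: -1})    // hypothetical index ending in the sort key
    db.com.xyz.client.find({net: "TTY", tags: {$in: ["us_uk", "us_ub"]}})
                     .sort({_id: -1})
                     .explain().scanAndOrder                      // false would mean the sort order came from the index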

Comment by Paul Moosman [ 11/Mar/15 ]

"My guess is that, even with the index, the size of the result set fed into the in-memory sort exceeds the 32MB limit. The selectivity of the index is not sufficient for avoiding a large in-memory sort."

Can you tell me more about this? What do you mean when you say "the size of the result set fed into the in-memory sort"? There is no in-memory sort with an index, correct?

Comment by David Storch [ 11/Mar/15 ]

Hi pmoosman,

"OK, to solve this problem we add an index on "sortKey". Now if we ask Mongo for the sorted documents, Mongo loads the index into memory and fetches the documents based on the index."

My guess is that, even with the index, the size of the result set fed into the in-memory sort exceeds the 32MB limit. The selectivity of the index is not sufficient for avoiding a large in-memory sort.

"The problem I think we are seeing is that sometimes Mongo uses the index and sometimes it does not. When it does not, we see the Overflow error. When it does, we don't see the overflow error. Why?"

I do not think this is the case. The logs pasted above indicate that in the case where you hit the overflow error, the server is doing an index scan over index {net: 1, tags: 1} with the correct index bounds:

                    { "type" : "IXSCAN",
                      "works" : 841000,
                      "yields" : 7377,
                      "unyields" : 7377,
                      "invalidates" : 345,
                      "advanced" : 840999,
                      "needTime" : 0,
                      "needFetch" : 0,
                      "isEOF" : 1,
                      "keyPattern" : "{ net: 1, tags: 1 }",
                      "isMultiKey" : 1,
                      "boundsVerbose" : "field #0['net']: [\"TTX\", \"TTX\"], field #1['tags']: [\"us_uk\", \"us_uk\"], [\"us_ub\", \"us_ub\"]",

"Why is Mongo using the index sometimes and sometimes not? Is the index somehow bad? Do we need to give Mongo a "hint"? Is there a bug in the Mongo server that sometimes gets confused?"

I do not see evidence of a bug in the server. Looking at your list of indices, the {net: 1, tags: 1} index is almost certainly the most efficient index available for this query. This means that the server is choosing the correct index, and that a hint would not help you.
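
A hedged way to confirm this on your side is to compare the explain output for the planner's own choice with the same query under an explicit hint; if the "cursor" and "nscanned" fields of the two plans match, the hint changes nothing:

    var q = {net: "TTY", tags: {$in: ["us_uk", "us_ub"]}};
    db.com.xyz.client.find(q).sort({_id: -1}).explain()                           // planner's choice
    db.com.xyz.client.find(q).sort({_id: -1}).hint({net: 1, tags: 1}).explain()   // forced {net: 1, tags: 1}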

If you can provide additional information that may point to the existence of a bug, please feel free to reopen the ticket. For MongoDB-related support discussion, please post on the mongodb-user group or on Stack Overflow with the mongodb tag, where your question will reach a larger audience; a question like this, which involves more back-and-forth discussion, is best suited to the mongodb-user group.

Best,
Dave

Comment by Paul Moosman [ 11/Mar/15 ]

I really appreciate your taking the time to create an example, and I now understand limit better. Thanks. I understand that if limit worked for our use case it would be a good option. However, in our use case we need batchSize. Also, your example seems to leave out the index.

Here is an example closer to our use case.

Let's say we have a collection with 10 million documents and a sort key "sortKey". Let's say we want to iterate through the entire collection, so we specify a batchSize of 1000. Without an index, Mongo loads the documents into memory and sorts them. However, with 10 million documents the in-memory sort will fail if the documents are larger than 32MB. OK, to solve this problem we add an index on "sortKey". Now if we ask Mongo for the sorted documents, Mongo loads the index into memory and fetches the documents based on the index.

The problem I think we are seeing is that sometimes Mongo uses the index and sometimes it does not. When it does not, we see the Overflow error. When it does, we don't see the overflow error. Why?

Why is Mongo using the index sometimes and sometimes not? Is the index somehow bad? Do we need to give Mongo a "hint"? Is there a bug in the Mongo server that sometimes gets confused? Something in the Mongo server is deciding whether or not to use the index. How is it making that decision?

Comment by David Storch [ 11/Mar/15 ]

Hi pmoosman,

"Are you saying that if we use batchSize the server will do an in-memory sort?"

Let me demonstrate with an example. Say you have 1000 documents in some collection foo. Each of these documents has an _id field and a sortkey field. Let's say that the sortkey field for each document is a random integer on the interval [0, 100]. Let's also say that, if no batchSize is specified, the server will use a default batchSize of 100 documents.

The first question we are interested in is, how should the server answer the following query?

db.foo.find().sort({sortkey: 1});

Well, it will have to scan the entire collection and sort the results of the scan in memory by sortkey. Once this is done, it gives back the first 100 results in the initial batch (because 100 is the default batchSize). The client then requests 9 subsequent getMores, each consisting of 100 more documents, which are still buffered by the server in sorted order.
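
(If it helps to experiment, here is a quick shell sketch of that toy setup; the collection name foo and the sortkey field simply follow the example above:)

    for (var i = 0; i < 1000; i++) {
        db.foo.insert({_id: i, sortkey: Math.floor(Math.random() * 101)});   // random sortkey in [0, 100]
    }
    db.foo.find().sort({sortkey: 1}).itcount()   // 1000: the whole sorted result set comes back across batches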

Now let's ask a similar question, namely, how should the server answer the following query?

db.foo.find().sort({sortkey: 1}).batchSize(500);

This is almost the same as the last case: scan the collection, sort the whole collection in memory, and then return two batches of results. 2.4.x versions of the server have a bug where they will sometimes return the first 500 docs for this query and then close the cursor. This is the bug which I believe your application code happens to rely on.

How do things differ for limit? Consider the following query:

db.foo.find().sort({sortkey: 1}).limit(500);

In this case the correct behavior is for the server to only return 500 documents rather than the full 1000. Therefore, it can answer this query using a top-K sort. This involves scanning the collection and keeping the smallest 500 values for sortkey buffered in memory at any given time. Since the default batchSize is 100, the server should return the 500 documents in 5 batches (using 4 getMores).

If your application were to use limit rather than batchSize, it could avoid hitting the memory limit by using a less memory intensive top-K sort, as opposed to sorting the entire result set.

"If so, then why does our query work sometimes and other times it does not? Is the server sometimes using the index and other times not using the index?"

The server's in-memory sort implementation enforces a hard memory limit of 32MB. My guess is that sometimes the query result set is within this limit, and that sometimes it exceeds this limit. Any time that the result set grows larger than 32MB, the query will fail.
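
One rough way to estimate which side of that limit a given query falls on, assuming the matching documents are close to the collection's average size, is to multiply the match count by avgObjSize from the collection stats:

    var n = db.com.xyz.client.count({net: "TTY", tags: {$in: ["us_uk", "us_ub"]}});
    var avg = db.com.xyz.client.stats().avgObjSize;    // average document size in bytes
    print("approx. result set: " + (n * avg) + " bytes; sort memory limit: " + (32 * 1024 * 1024));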

Hope this was helpful.

Best,
Dave

Comment by Paul Moosman [ 10/Mar/15 ]

I don't quite understand your response. Are you saying that if we use batchSize the server will do an in-memory sort? If so, then why does our query work sometimes and other times it does not? Is the server sometimes using the index and other times not using the index?

Comment by David Storch [ 10/Mar/15 ]

Hi pmoosman,

Thanks for the additional info. I think I understand what is happening here now after seeing your application code, so no need to gather the logLevel 1 logs. I do not think it is related to migration or replica sets.

In 2.4, queries with batchSize and a sort were treated the same way on the server as queries with limit and sort. Please refer to the example below, run against a 2.4.12 server, which uses a toy data set based on this bug report:

> db.version();
2.4.12
> t.ensureIndex({net: 1, tags: 1});
> var padding = new Array(1024);
> for (var i = 0; i < (1024*40); i++) { t.insert({net: "TTY", tags: ["us_uk", "foo"], padding: padding}); }
> t.find({net: "TTY", tags: {$in: ["us_uk", "us_ub"]}}).sort({_id: 1}).hint({net: 1, tags: 1}).batchSize(1000).itcount();
1000 // THIS IS A BUG! There are more than 1000 matches, we should have gotten all the matches back in separate batches.
> t.find({net: "TTY", tags: {$in: ["us_uk", "us_ub"]}}).sort({_id: 1}).hint({net: 1, tags: 1}).limit(1000).itcount();
1000

This demonstrates a bug in 2.4, namely that the server erroneously treated a batchSize as a hard limit. In 2.6, this bug has been fixed. However, in order to answer the batchSize query above, the server must sort the entire result set in memory. In your case, this result set is too large to be sorted in memory, leading to the observed overflow error.

You may be able to resolve this issue by updating your application code to use limit rather than batchSize. This will instruct the server to perform a top-K sort and should restore the behavior you saw with version 2.4.3.
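
In the beginFetch code pasted earlier, that would amount to calling .limit(batchSize) instead of .batchSize(batchSize) on the sorted cursor. In shell terms, a sketch of the resulting paging pattern (the page size of 1000 is taken from your logs; skip is advanced per page just as your application already does):

    var pageSize = 1000, skip = 0;   // advance skip by pageSize for each subsequent page
    db.com.xyz.client.find({net: "TTY", tags: {$in: ["us_uk", "us_ub"]}}, {_id: 0})
                     .sort({_id: -1})
                     .skip(skip)
                     .limit(pageSize)   // bounded top-K sort on the server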

It appears from the available information that the system is working as designed, so I am going to close this report.

Best,
Dave

Comment by Paul Moosman [ 10/Mar/15 ]

setWriteConcern(WriteConcern.ACKNOWLEDGED);
setReadPreference(ReadPreference.secondaryPreferred());

Comment by Paul Moosman [ 10/Mar/15 ]

As far as getting logLevel 1 logs, we are still working on reproducing this in our development environments. Do you have any hints on reproducing it? Do you think it is related to migration? Or to replica sets?

Comment by Paul Moosman [ 10/Mar/15 ]

We specify a batchSize of 1000. We specify skip 0 for the first set and skip X + 1000 for each subsequent set.

    @Override
    public Fetcher beginFetch(String appId, String net, List<String> tags, int skip, int batchSize, String... valueNames)
    {
        //get collection for this app
        DBCollection collection = deviceDB.getCollection(ourId);
        //build query for network with optional tags
        BasicDBObject query = new BasicDBObject(NET, net);
        if ((tags != null) && !tags.isEmpty())
            query.put(TAGS, new BasicDBObject("$in", tags));
        //build fields requested
        BasicDBObject fields = new BasicDBObject("_id", 0);
        if (valueNames != null)
            for (String field : valueNames)
                fields.append(field, 1);
 
        log.debug("query: " + query);
 
        //return marker for this query
        DBCursor cursor = null;
        if (getdbUseSnapshot())
            cursor = collection.find(query, fields).skip(skip).batchSize(batchSize).snapshot(); 
        else 
            cursor = collection.find(query, fields).sort(new BasicDBObject("_id", -1)).skip(skip).batchSize(batchSize);
 
        return new MongoFetcher(cursor, valueNames, batchSize);
    }
 
...
 
        public boolean nextBatch()
        {
            batchEnd = false;
            return (cursor.hasNext());
        }
 
...

getdbUseSnapshot should be false

Comment by David Storch [ 10/Mar/15 ]

Thanks for the additional info pmoosman. The logs you pasted above indicate that the server performs a top-K sort to generate the first 1000 query results, while only buffering 1000 documents at a time in memory. It hits the overflow error while doing a full in-memory sort in order to generate additional results. Presumably the client is asking for another batch of results, rather than just looking for a hard limit of 1000.

I'd like to get answers to a few more questions in order to move forward:

  • I see ntoreturn:1000 in the logs. Do you know if your application is setting a limit or a batchSize (or both)? Would it be possible to paste the snippet of your application code that generates the offending queries?
  • It would still be useful to have logLevel 1 logs covering the time period before, during, and after an instance of this error. This will clearly indicate the sequence of query and getmore requests the server is receiving from the application. Would you be able to collect these logs?

Best,
Dave

Comment by Paul Moosman [ 10/Mar/15 ]

Here is another log entry showing both the query and error:

2015-03-05T19:12:01.017-0500 [conn22260] query MyCollection.c.a.m.app query: { $query: { net: "TTX", tags: { $in: [ "us_uk", "us_ub" ] } }, $orderby: { _id: -1 } } planSummary: IXSCAN { net: 1, tags: 1 }, IXSCAN { net: 1, tags: 1 } cursorid:890486569734 ntoreturn:1000 ntoskip:0 nscanned:840999 nscannedObjects:840999 keyUpdates:0 numYields:1069 locks(micros) r:10032828 nreturned:1000 reslen:243684 6951ms
2015-03-05T19:12:03.161-0500 [conn22260] ERROR: Runner error, stats:
{ "type" : "PROJECTION",
  "works" : 943919,
  "yields" : 7377,
  "unyields" : 7377,
  "invalidates" : 345,
  "advanced" : 1000,
  "needTime" : 0,
  "needFetch" : 1,
  "isEOF" : 0,
  "children" : [ 
    { "type" : "KEEP_MUTATIONS",
      "works" : 943919,
      "yields" : 7377,
      "unyields" : 7377,
      "invalidates" : 345,
      "advanced" : 1000,
      "needTime" : 942917,
      "needFetch" : 1,
      "isEOF" : 0,
      "children" : [ 
        { "type" : "OR",
          "works" : 943919,
          "yields" : 7377,
          "unyields" : 7377,
          "invalidates" : 345,
          "advanced" : 1000,
          "needTime" : 942917,
          "needFetch" : 1,
          "isEOF" : 0,
          "dupsTested" : 1000,
          "dupsDropped" : 0,
          "locsForgotten" : 338,
          "matchTested_0" : 0,
          "matchTested_1" : 0,
          "children" : [ 
            { "type" : "SORT",
              "works" : 842003,
              "yields" : 7377,
              "unyields" : 7377,
              "invalidates" : 345,
              "advanced" : 1000,
              "needTime" : 841000,
              "needFetch" : 1,
              "isEOF" : 1,
              "forcedFetches" : 0,
              "memUsage" : 361582,
              "memLimit" : 33554432,
              "children" : [ 
                { "type" : "FETCH",
                  "works" : 841001,
                  "yields" : 7377,
                  "unyields" : 7377,
                  "invalidates" : 345,
                  "advanced" : 840999,
                  "needTime" : 0,
                  "needFetch" : 1,
                  "isEOF" : 1,
                  "alreadyHasObj" : 0,
                  "forcedFetches" : 0,
                  "matchTested" : 0,
                  "children" : [ 
                    { "type" : "IXSCAN",
                      "works" : 841000,
                      "yields" : 7377,
                      "unyields" : 7377,
                      "invalidates" : 345,
                      "advanced" : 840999,
                      "needTime" : 0,
                      "needFetch" : 0,
                      "isEOF" : 1,
                      "keyPattern" : "{ net: 1, tags: 1 }",
                      "isMultiKey" : 1,
                      "boundsVerbose" : "field #0['net']: [\"TTX\", \"TTX\"], field #1['tags']: [\"us_uk\", \"us_uk\"], [\"us_ub\", \"us_ub\"]",
                      "yieldMovedCursor" : 0,
                      "dupsTested" : 840999,
                      "dupsDropped" : 0,
                      "seenInvalidated" : 338,
                      "matchTested" : 0,
                      "keysExamined" : 840999,
                      "children" : [] } ] } ] }, 
            { "type" : "SORT",
              "works" : 101916,
              "yields" : 7377,
              "unyields" : 7377,
              "invalidates" : 345,
              "advanced" : 0,
              "needTime" : 101914,
              "needFetch" : 0,
              "isEOF" : 0,
              "forcedFetches" : 0,
              "memUsage" : 33554447,
              "memLimit" : 33554432,
              "children" : [ 
                { "type" : "FETCH",
                  "works" : 101914,
                  "yields" : 7377,
                  "unyields" : 7377,
                  "invalidates" : 345,
                  "advanced" : 101914,
                  "needTime" : 0,
                  "needFetch" : 0,
                  "isEOF" : 0,
                  "alreadyHasObj" : 0,
                  "forcedFetches" : 0,
                  "matchTested" : 0,
                  "children" : [ 
                    { "type" : "IXSCAN",
                      "works" : 101914,
                      "yields" : 7377,
                      "unyields" : 7377,
                      "invalidates" : 345,
                      "advanced" : 101914,
                      "needTime" : 0,
                      "needFetch" : 0,
                      "isEOF" : 0,
                      "keyPattern" : "{ net: 1, tags: 1 }",
                      "isMultiKey" : 1,
                      "boundsVerbose" : "field #0['net']: [\"TTX\", \"TTX\"], field #1['tags']: [\"us_uk\", \"us_uk\"], [\"us_ub\", \"us_ub\"]",
                      "yieldMovedCursor" : 0,
                      "dupsTested" : 101914,
                      "dupsDropped" : 0,
                      "seenInvalidated" : 0,
                      "matchTested" : 0,
                      "keysExamined" : 101914,
                      "children" : [] } ] } ] } ] } ] } ] }
2015-03-05T19:12:03.276-0500 [conn22260] getmore MyCollection.c.a.m.app cursorid:890486569734 ntoreturn:1000 keyUpdates:0 exception: getMore runner error: Overflow sort stage buffered data usage of 33554447 bytes exceeds internal limit of 33554432 bytes code:17406 numYields:146 locks(micros) r:1403460 nreturned:1 reslen:166 1006ms

Comment by Paul Moosman [ 10/Mar/15 ]

I pulled the log files. Here is more info from the log:

2015-03-05T19:12:09.376-0500 [conn15501] ERROR: Runner error, stats:
{ "type" : "PROJECTION",
  "works" : 849914,
  "yields" : 6644,
  "unyields" : 6644,
  "invalidates" : 60,
  "advanced" : 1000,
  "needTime" : 0,
  "needFetch" : 1,
  "isEOF" : 0,
  "children" : [ 
    { "type" : "KEEP_MUTATIONS",
      "works" : 849914,
      "yields" : 6644,
      "unyields" : 6644,
      "invalidates" : 60,
      "advanced" : 1000,
      "needTime" : 848912,
      "needFetch" : 1,
      "isEOF" : 0,
      "children" : [ 
        { "type" : "OR",
          "works" : 849914,
          "yields" : 6644,
          "unyields" : 6644,
          "invalidates" : 60,
          "advanced" : 1000,
          "needTime" : 848912,
          "needFetch" : 1,
          "isEOF" : 0,
          "dupsTested" : 1000,
          "dupsDropped" : 0,
          "locsForgotten" : 55,
          "matchTested_0" : 0,
          "matchTested_1" : 0,
          "children" : [ 
            { "type" : "SORT",
              "works" : 704406,
              "yields" : 6644,
              "unyields" : 6644,
              "invalidates" : 60,
              "advanced" : 1000,
              "needTime" : 703403,
              "needFetch" : 1,
              "isEOF" : 1,
              "forcedFetches" : 0,
              "memUsage" : 239895,
              "memLimit" : 33554432,
              "children" : [ 
                { "type" : "FETCH",
                  "works" : 703404,
                  "yields" : 6644,
                  "unyields" : 6644,
                  "invalidates" : 60,
                  "advanced" : 703402,
                  "needTime" : 0,
                  "needFetch" : 1,
                  "isEOF" : 1,
                  "alreadyHasObj" : 0,
                  "forcedFetches" : 0,
                  "matchTested" : 0,
                  "children" : [ 
                    { "type" : "IXSCAN",
                      "works" : 703403,
                      "yields" : 6644,
                      "unyields" : 6644,
                      "invalidates" : 60,
                      "advanced" : 703402,
                      "needTime" : 0,
                      "needFetch" : 0,
                      "isEOF" : 1,
                      "keyPattern" : "{ net: 1, tags: 1 }",
                      "isMultiKey" : 1,
                      "boundsVerbose" : "field #0['net']: [\"TTY\", \"TTY\"], field #1['tags']: [\"us_uk\", \"us_uk\"], [\"us_ub\", \"us_ub\"]",
                      "yieldMovedCursor" : 0,
                      "dupsTested" : 703402,
                      "dupsDropped" : 0,
                      "seenInvalidated" : 55,
                      "matchTested" : 0,
                      "keysExamined" : 703402,
                      "children" : [] } ] } ] }, 
            { "type" : "SORT",
              "works" : 145508,
              "yields" : 6644,
              "unyields" : 6644,
              "invalidates" : 60,
              "advanced" : 0,
              "needTime" : 145506,
              "needFetch" : 0,
              "isEOF" : 0,
              "forcedFetches" : 8,
              "memUsage" : 33554527,
              "memLimit" : 33554432,
              "children" : [ 
                { "type" : "FETCH",
                  "works" : 145506,
                  "yields" : 6644,
                  "unyields" : 6644,
                  "invalidates" : 60,
                  "advanced" : 145506,
                  "needTime" : 0,
                  "needFetch" : 0,
                  "isEOF" : 0,
                  "alreadyHasObj" : 0,
                  "forcedFetches" : 0,
                  "matchTested" : 0,
                  "children" : [ 
                    { "type" : "IXSCAN",
                      "works" : 145506,
                      "yields" : 6644,
                      "unyields" : 6644,
                      "invalidates" : 60,
                      "advanced" : 145506,
                      "needTime" : 0,
                      "needFetch" : 0,
                      "isEOF" : 0,
                      "keyPattern" : "{ net: 1, tags: 1 }",
                      "isMultiKey" : 1,
                      "boundsVerbose" : "field #0['net']: [\"TTY\", \"TTY\"], field #1['tags']: [\"us_uk\", \"us_uk\"], [\"us_ub\", \"us_ub\"]",
                      "yieldMovedCursor" : 0,
                      "dupsTested" : 145506,
                      "dupsDropped" : 0,
                      "seenInvalidated" : 8,
                      "matchTested" : 0,
                      "keysExamined" : 145506,
                      "children" : [] } ] } ] } ] } ] } ] }
2015-03-05T19:12:09.498-0500 [conn15501] getmore DeviceRegistrations.com.aol.mobile.aolclient cursorid:984918561948 ntoreturn:1000 keyUpdates:0 exception: getMore runner error: Overflow sort stage buffered data usage of 33554527 bytes exceeds internal limit of 33554432 bytes code:17406 numYields:152 locks(micros) r:1951439 nreturned:1 reslen:166 1511ms
2015-03-05T19:12:09.522-0500 [conn15501] killcursors: found 0 of 1
2015-03-05T19:12:09.522-0500 [conn15501] killcursors  keyUpdates:0 numYields:0 locks(micros) r:10 17ms

Comment by Paul Moosman [ 10/Mar/15 ]

I'll work on getting you the log. Yes, the query came from the Java driver; I believe the version is mongo-java-driver-2.12.3.jar, but I will double-check that.

We are still working on a consistent reproducible case.

Comment by David Storch [ 10/Mar/15 ]

Hi pmoosman,

I will need to gather a bit more information in order to determine what is happening here. The log pasted above does not show that { $query: { net: "TTY", tags: { $in: [ "us_uk", "us_ub" ] } }, $orderby: { _id:-1 } } is the query hitting the overflow error. Note that this query happens on connection conn15479 whereas the error occurs on conn15501. It looks like the stats information dumped to the log when query execution fails is missing, i.e. the following line is truncated:

2015-03-05T19:12:09.376-0500 [conn15501] ERROR: Runner error, stats:

The following information would be helpful for moving forward.

  1. Can you confirm that these queries are being delivered by the Java driver? What version of the Java driver are you using?
  2. Would it be possible to reproduce the issue while the server is running at logLevel 1, and then provide a snippet of the log covering the time period before, during, and after the error? You can increase the logLevel using the command below. Don't forget to restore your original logLevel afterwards.

    var old = db.adminCommand({setParameter: 1, logLevel: 1});
    // Reproduce the error.
    db.adminCommand({setParameter: 1, logLevel: old.was}); // Afterwards, restore the old logLevel.
    

Thanks in advance for your help.

Best,
Dave
