[SERVER-7694] external sort for find command Created: 16/Nov/12  Updated: 09/Oct/19  Resolved: 02/Oct/19

Status: Closed
Project: Core Server
Component/s: Querying
Affects Version/s: None
Fix Version/s: 4.3.1

Type: New Feature Priority: Major - P3
Reporter: Aaron Staple Assignee: David Storch
Resolution: Done Votes: 12
Labels: query_triage
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-6157 Refactor BSONObjExtSort Closed
is depended on by SERVER-7676 find usings sort,skip,limit where sor... Closed
Documented
is documented by DOCS-13066 Investigate changes in SERVER-7694: e... Closed
Duplicate
is duplicated by SERVER-5374 batchSize is a hard limit for an in m... Closed
Related
related to SERVER-9444 Use new Sorter for Aggregation $sort ... Closed
related to SERVER-23768 Document internalQueryExecMaxBlocking... Closed
related to SERVER-43683 Make find command with allowDiskUse o... Closed
is related to SERVER-4716 reexamine scan and order memory limit... Closed
is related to SERVER-3867 aggregation: use external sort to ha... Closed
Backwards Compatibility: Fully Compatible
Sprint: Query 2019-10-07
Participants:
Case:

 Description   

Right now when query results must be sorted before being returned, a top-N sort is performed in memory with a memory footprint cap enforced. The goal for this ticket is move to external sorting if too much memory is required, see discussion in SERVER-4716.



 Comments   
Comment by David Storch [ 02/Oct/19 ]

As of development version 4.3.1 (which will evolve into the 4.4 GA release), users can opt into allowing disk use for sorting in a find operation. This ability has existed for many releases for aggregate operations: see this documentation page showing an example of an agg pipeline that requires an external sort. The same functionality is now available for find operations. In the shell, the allowDiskUse:true parameter to the find command can be set using the following syntax:

db.collection.find(<match>, <projection>).sort(<sort>).allowDiskUse();

By default in versions >=4.3.1, mongod will begin spilling data to disk once the memory requirements exceed 100MB. This maximum memory consumption threshold can be configured at runtime or at startup using the internalQueryExecMaxBlockingSortBytes setParameter (see this page for details on how to configure setParameters). If internalQueryExecMaxBlockingSortBytes is exceeded when allowDiskUse is true, data will be spilled to disk during execution of the sort; if allowDiskUse is false, the query will fail.

As part of this change, the execution statistics reported by the explain command for the SORT stage have changed slightly:

  • memUsage has been replaced by totalDataSizeSorted, which gives the total number of bytes of data sorted. Since the data now might be either in memory or on disk, totalDataSizeSorted is a better metric for understanding how much data was sorted than the memory usage. A future improvement could additional add an additional metric such as peakMemoryConsumption, which would describe the maximum memory usage over the course of the execution of the sort.
  • A new boolean called usedDisk is now available in order to indicate whether or not the SORT stage had to spill data to disk. The value of usedDisk can only be true if the application has set allowDiskUse:true on either a find or aggregate operation.
Comment by Githook User [ 02/Oct/19 ]

Author:

{'username': 'dstorch', 'email': 'david.storch@mongodb.com', 'name': 'David Storch'}

Message: SERVER-7694 Enable allowing disk use for sorts in the find command.

For find commands that request a sort, the query execution
engine can now spill data to disk if the memory requirements
would exceed 'internalQueryExecBlockingSortBytes' and the
'allowDiskUse' parameter is set to true. By default, the
memory threshold is currently 100MB. This allows
applications which need to sort server-side to no longer be
subject to an arbitrary data size threshold.
Branch: master
https://github.com/mongodb/mongo/commit/62f03390c9957eba4c250eb3893b873391add3d2

Comment by Luke Thompson [ 10/Feb/14 ]

Thanks dan@10gen.com, cool, that's a great step.

I think that it doesn't address the core of the problem though.

We (and probably lots of others) may have written code and deployed it to production which is basically a ticking time-bomb because of this limitation.

Right now, our datasets are small enough that the sort is occurring in memory, but as the data grows we near the memory cap.

One day a single record will be added to a collection, and the app will start crashing.

A warning in the log that a query is using a large amount of resources for sorting and should have an index would be great, but the DB should keep responding even if the query slows when sort data sets no longer fit in memory.

Comment by Daniel Pasette (Inactive) [ 10/Feb/14 ]

lthompson@infomedia.com.au, starting in the 2.6 MongoDB release, you should be able to use the aggregation framework for unindexed sorts. See SERVER-9444.

Comment by Luke Thompson [ 31/Jan/14 ]

Thanks in advance for addressing this.

I just had a commercial support ticket (CS-10484) to clear up why this behaviour was happening:
"Is there a way to configure MongoDB to behave as we would have expected it to (basically to suck it up and sort slowly)?"

Comment by Jeffrey Yemin [ 26/Apr/13 ]

As SERVER-5374 is marked as a duplicate of this, in order to satisfy that use case, the server needs to ensure that even when the number of results is below the memory footprint cap, queries with an explicit batch size should return the same number of documents as those without. Given the wire protocol, there is no way for the server to tell the difference between limit and batch size as specified in the driver API, so to make this work properly, the server has to return a cursor if number of results > numberToReturn, regardless of whether it chooses to use external sort or in memory sort.

Comment by Eliot Horowitz (Inactive) [ 18/Jan/13 ]

@aaron - yes, exactly.

Comment by Aaron Staple [ 17/Jan/13 ]

SERVER-5374 (batchSize is a hard limit for an in memory sort) was closed as a duplicate of this ticket. This suggests that we may want to change the behavior of scan and order queries to support getMore (retrieving results in batches instead of all in the initial query).

Generated at Thu Feb 08 03:15:19 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.