[SERVER-7549] document level stats Created: 05/Nov/12  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Admin, Storage
Affects Version/s: 2.3.0
Fix Version/s: None

Type: New Feature Priority: Minor - P4
Reporter: Matt Campbell Assignee: Backlog - Query Execution
Resolution: Unresolved Votes: 1
Labels: document, stats
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-33582 Add $bsonSize expression Closed
Assigned Teams:
Query Execution
Participants:

 Description   

Implement stats similar to those found on the db and col levels.

Currently there is no efficient way of obtaining stats such as the size of a document without sending the document down the wire to the client and BSON-encoding the document.

Suggest storing document stats as metadata beside each document in a collection, but only returning such stats when requested, as shown in the following examples.

Return a summary (aggregation of stats):

db.col.findOne({}).stats();
db.col.find({}).stats();

  • would return a document similar to db.stats() and col.stats() and contain an aggregation of all documents in the server cursor
  • in the case of findOne it would represent the stats of a single document because there is only one document in the cursor (thus, by implementing it at the cursor level, it covers both the single-document and aggregation scenarios)

Return documents and stats embedded using a flag on the find() operation:

db.col.find({}, {stats: true});

  • stats could be attached as an embedded document in the _stats key on each document
  • as the stats would be located beside the document on disk it should be a quick and efficient operation to perform
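
As a rough illustration of the `_stats` idea, an embedded stats subdocument can be approximated with an aggregation pipeline instead of a find() flag. This is only a sketch: it assumes MongoDB 4.4 or later (where the `$bsonSize` expression exists), the helper name `withEmbeddedStats` is hypothetical, and it reports only the BSON data size, not the on-disk storage size or the stored metadata this ticket proposes.

```javascript
// Hypothetical helper (not part of any MongoDB API): builds a pipeline that
// attaches a _stats subdocument to every matching document.
// Assumption: the server is MongoDB 4.4+, where $bsonSize is available.
function withEmbeddedStats(filter) {
  return [
    { $match: filter },
    // $$ROOT is the whole document at this stage, so dataSize is the BSON
    // size of the document before _stats itself is added.
    { $addFields: { _stats: { dataSize: { $bsonSize: "$$ROOT" } } } }
  ];
}

// Usage in the shell (needs a running server):
//   db.col.aggregate(withEmbeddedStats({}));
```

The size is computed on the server, so the client does not have to re-encode each document to measure it, but it is still calculated per request rather than read from stored metadata.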

As you can see from the examples above, this would be best implemented on the server cursor. I would suggest storing stats metadata beside documents on disk, as opposed to storing them in a separate hash table or other data structure. This is to ensure efficient retrieval of both documents and stats in a flexible manner, and to ensure writes remain fast.



 Comments   
Comment by Eric Milkie [ 19/Feb/19 ]

The aggregation pipeline could provide this sort of information with new operators.
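
For illustration, the related SERVER-33582 (`$bsonSize`) makes a summary of this kind expressible as a pipeline. The sketch below assumes MongoDB 4.4 or later; the helper name `docStatsSummaryPipeline` is hypothetical, and the output field names simply mirror those proposed elsewhere in this ticket. It covers data size only, since `$bsonSize` does not report allocated storage.

```javascript
// Hypothetical helper: builds a pipeline that summarises the documents
// matching `filter`, roughly what find(...).stats() is asked to return.
// Assumption: the server is MongoDB 4.4+, where $bsonSize is available.
function docStatsSummaryPipeline(filter) {
  return [
    { $match: filter },
    {
      $group: {
        _id: null,                                      // one summary row for the whole result set
        count: { $sum: 1 },                             // number of matching documents
        dataSize: { $sum: { $bsonSize: "$$ROOT" } },    // total BSON bytes
        avgDocSize: { $avg: { $bsonSize: "$$ROOT" } }   // mean BSON bytes per document
      }
    },
    { $project: { _id: 0 } }                            // drop the placeholder group key
  ];
}

// Usage in the shell (needs a running server):
//   db.col.aggregate(docStatsSummaryPipeline({}));
```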

Comment by Matt Campbell [ 08/Nov/12 ]

For both single and aggregated doc stats:

ns (the collection in which the document is stored - useful if you are passing objects around a system without being context-bound, or have a wrapper)
dataSize/size (size of the document data)
storageSize (size of the storage allocated to the document, i.e. includes overheads like padding - as close as possible to physical disk usage)

For aggregated doc stats:

count (number of docs in the cursor / stats aggregate)
avgDocSize (dataSize divided by count)

Possible ideas (not considered core):

count of keys per doc (could be top level or drilldown into embedded docs)
Indexes used by a document (purely a match of collection indexes to document keys)

RATIONALE:

ns - allows a document to traverse through an application knowing its 'home' and having identity

dataSize - useful for clients which may be bandwidth-aware and want to know the size of a set of documents before choosing to pull them down the wire (i.e. think mobile or other bandwidth-constrained or resource-constrained devices). This would allow them to make decisions on how much data to pull down.

storageSize - in multi-tenant environments this would allow us to quickly report the physical disk used by a set of documents belonging to a client (contained in a single shared collection). E.g. for a multi-tenant collection of products, we would be able to quickly report the disk usage for that type of object to each user in a dashboard.

count - simple - number of docs in the cursor
avgDocSize - again useful for bandwidth aware clients when looking at a set of documents in a cursor as opposed to a single document where dataSize would make more sense. The client could calculate this using dataSize and count on a set of documents.
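
The client-side calculation mentioned above could look roughly like this. It is a sketch only: the function name `summarizeDocStats` and the per-document record shape `{ dataSize, storageSize }` are assumptions for illustration, not an existing API.

```javascript
// Hypothetical helper: folds per-document stats into the aggregated
// fields listed in this comment (count, dataSize, storageSize, avgDocSize).
// Assumed input: an array of { dataSize, storageSize } records in bytes.
function summarizeDocStats(perDocStats) {
  const count = perDocStats.length;
  let dataSize = 0;
  let storageSize = 0;
  for (const s of perDocStats) {
    dataSize += s.dataSize;
    storageSize += s.storageSize;
  }
  // avgDocSize is dataSize divided by count; report 0 for an empty set
  // rather than dividing by zero.
  const avgDocSize = count === 0 ? 0 : dataSize / count;
  return { count, dataSize, storageSize, avgDocSize };
}

const summary = summarizeDocStats([
  { dataSize: 100, storageSize: 128 },
  { dataSize: 300, storageSize: 512 }
]);
console.log(summary);
// { count: 2, dataSize: 400, storageSize: 640, avgDocSize: 200 }
```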

Comment by Eliot Horowitz (Inactive) [ 08/Nov/12 ]

Besides size, what stats are you looking for?
