[SERVER-57767] dataSize command returns wrong number of documents when there are orphaned documents Created: 16/Jun/21  Updated: 14/Jan/22  Resolved: 14/Jan/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Cheahuychou Mao Assignee: Garaudy Etienne
Resolution: Won't Do Votes: 1
Labels: query-director-triage, sharding-product-sync
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
Operating System: ALL

 Description   

Currently, the dataSize command on mongos does not target shards based on the range specified in the command. In addition, each shard counts every document that falls in the command's range, without filtering out documents it does not own. So if there are orphaned documents in that range, they are counted too and the command returns the wrong number of documents.
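
To make the failure mode concrete, here is a minimal Python sketch that models it instead of talking to a real cluster. The `Shard` class, its `owned` ranges, and the two counting functions are hypothetical names made up for illustration; only the behaviour being modelled (every shard counts by range alone, with no ownership filter) comes from the description above.

```python
from dataclasses import dataclass


@dataclass
class Shard:
    """Toy model of a shard (hypothetical, not server code)."""
    name: str
    owned: list  # (min_inclusive, max_exclusive) shard key ranges this shard owns
    keys: list   # shard key values physically stored here, orphans included

    def owns(self, key):
        return any(lo <= key < hi for lo, hi in self.owned)


def count_unfiltered(shards, lo, hi):
    """Reported behaviour: ask every shard, count purely by range."""
    return sum(1 for s in shards for k in s.keys if lo <= k < hi)


def count_filtered(shards, lo, hi):
    """Same count, but each shard skips documents it does not own."""
    return sum(1 for s in shards for k in s.keys if lo <= k < hi and s.owns(k))


# Assumed scenario: chunk [0, 10) moved from shard A to shard B,
# and A still holds orphaned copies of keys 3 and 7.
shard_a = Shard("A", owned=[(10, 20)], keys=[3, 7, 12, 15])
shard_b = Shard("B", owned=[(0, 10)], keys=[1, 3, 7])

unfiltered = count_unfiltered([shard_a, shard_b], 0, 10)
filtered = count_filtered([shard_a, shard_b], 0, 10)
print(unfiltered, filtered)  # prints "5 3"
```

In this toy scenario the two orphans on shard A inflate the count for range [0, 10) from 3 to 5.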



 Comments   
Comment by Max Hirschhorn [ 21/Sep/21 ]

Thanks for the thoughtful questions kyle.suarez. I've flagged this ticket for the Sharding product sync meeting so we can discuss/research more about the use cases for the dataSize command.

To add my own thoughts here:

  • If the dataSize command supported readConcern options, then there'd be a natural way of using {level: "available"} to indicate whether to include unowned documents on the shards.
  • The storageStats returned by the $collStats stage are likely more helpful for users who want to understand the physical size on disk, because their configuration (by default) has WiredTiger compression enabled.
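
For readers comparing the two approaches, the following Python sketch builds the request documents for each. It is a hedged illustration, not a recommendation from the ticket: the helper names, the `test.orders` namespace, and the `customerId` shard key are assumptions, and nothing here is sent to a server. The dataSize reply reports `size` and `numObjects` for the requested range, while the `storageStats` section of `$collStats` also includes on-disk figures such as `storageSize`.

```python
def build_data_size_command(namespace, key_pattern, range_min, range_max):
    """Command document for dataSize over a key range.

    The command name has to be the first key; Python dicts keep insertion order.
    """
    return {
        "dataSize": namespace,
        "keyPattern": key_pattern,
        "min": range_min,
        "max": range_max,
    }


def build_storage_stats_pipeline():
    """Aggregation pipeline that asks $collStats for its storageStats section."""
    return [{"$collStats": {"storageStats": {}}}]


# Hypothetical namespace and shard key, for illustration only.
cmd = build_data_size_command(
    "test.orders", {"customerId": 1}, {"customerId": 0}, {"customerId": 1000}
)
pipeline = build_storage_stats_pipeline()

# With a driver such as PyMongo these would be sent roughly as:
#   client["test"].command(cmd)
#   client["test"]["orders"].aggregate(pipeline)
```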
Comment by Kyle Suarez [ 21/Sep/21 ]

After a discussion with christopher.harris, while we think that changing dataSize to exclude orphans makes sense, we also want to point out that there might be a use case for including orphans: specifically, if an administrator were interested in understanding the true physical size of a collection on disk, orphans and all.

cheahuychou.mao, what was the original use case that led to this ticket? Do you think it would make sense from a user perspective to, say, change the default behavior of dataSize to ignore orphans but also introduce a new option flag that will include orphans if specified?
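
The proposal in the question above (exclude orphans by default, with an opt-in flag to include them) could be sketched as follows. This is a per-shard toy model in Python, not server code: the function name, the `include_orphans` parameter, and the sample data are all hypothetical, and no such option exists on the real command as far as this ticket states.

```python
def data_size_on_shard(docs, owned_ranges, lo, hi, include_orphans=False):
    """Model of one shard answering a dataSize-style request.

    docs:         list of (shard_key, size_in_bytes) stored on this shard
    owned_ranges: list of (min_inclusive, max_exclusive) ranges the shard owns
    include_orphans: hypothetical opt-in flag; False applies the ownership filter
    """
    def owned(key):
        return any(a <= key < b for a, b in owned_ranges)

    selected = [
        size
        for key, size in docs
        if lo <= key < hi and (include_orphans or owned(key))
    ]
    return {"numObjects": len(selected), "size": sum(selected)}


# Assumed data: keys 3 and 7 are orphans because the shard only owns [10, 20).
docs = [(3, 100), (7, 150), (12, 200)]
owned_ranges = [(10, 20)]

default_result = data_size_on_shard(docs, owned_ranges, 0, 20)
with_orphans = data_size_on_shard(docs, owned_ranges, 0, 20, include_orphans=True)
print(default_result)  # {'numObjects': 1, 'size': 200}
print(with_orphans)    # {'numObjects': 3, 'size': 450}
```

The default answers "how much data does this shard own in the range", while the flag answers the administrator's question about everything physically present.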

Comment by Cheahuychou Mao [ 21/Sep/21 ]

Confirmed with max.hirschhorn that we don't use the dataSize command internally in sharding. We use Collection::dataSize() in the code for splitVector and chunk migration cloning. So I think we should make the dataSize command do shard filtering.

Comment by Kyle Suarez [ 21/Sep/21 ]

If we want the size of owned documents, then yes, it sounds like we should add a SHARDING_FILTER. But if an administrator wants to know the actual true size of data on disk, then the SHARDING_FILTER is potentially omitting relevant documents.

cheahuychou.mao, do you know if we use the dataSize command internally? For example, is it used by sharding to determine if we need to make a chunk migration?

Sending to sebastien.mendez's team for investigation.
