[SERVER-57767] dataSize command returns wrong number of documents when there are orphaned documents Created: 16/Jun/21 Updated: 14/Jan/22 Resolved: 14/Jan/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Cheahuychou Mao | Assignee: | Garaudy Etienne |
| Resolution: | Won't Do | Votes: | 1 |
| Labels: | query-director-triage, sharding-product-sync | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Operating System: | ALL | ||||
| Participants: | |||||
| Case: | (copied to CRM) | ||||
| Description |
|
Currently, the dataSize command on mongos does not target shards based on the range specified in the command. In addition, each shard counts every document it holds within that range, without filtering out documents it does not own. So if there are orphaned documents in the range, the command returns the wrong (inflated) number of documents. |
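To make the failure mode concrete, here is a minimal in-memory sketch of the behavior described above. It is a toy model, not MongoDB code: the `Shard` class, the `mongos_data_size` helper, the shard key `x`, and the per-document `size` field are all hypothetical names chosen for illustration. It assumes a chunk was migrated from shard A to shard B and that A's stale copies have not been cleaned up yet.

```python
class Shard:
    """Toy shard: holds documents and knows which key ranges it owns."""

    def __init__(self, name, owned_ranges, docs):
        self.name = name
        self.owned_ranges = owned_ranges  # list of half-open (lo, hi) ranges
        self.docs = docs                  # dicts with shard key "x" and "size"

    def data_size(self, lo, hi):
        # Counts every local document in [lo, hi), owned or orphaned.
        matched = [d for d in self.docs if lo <= d["x"] < hi]
        return {
            "numObjects": len(matched),
            "size": sum(d["size"] for d in matched),
        }


def mongos_data_size(shards, lo, hi):
    # Broadcasts to all shards (no range-based targeting) and sums the replies.
    replies = [s.data_size(lo, hi) for s in shards]
    return {
        "numObjects": sum(r["numObjects"] for r in replies),
        "size": sum(r["size"] for r in replies),
    }


# Chunk [0, 10) moved from A to B; A still holds the old copies as orphans.
docs = [{"x": i, "size": 100} for i in range(10)]
shard_a = Shard("A", owned_ranges=[(10, 20)], docs=[dict(d) for d in docs])
shard_b = Shard("B", owned_ranges=[(0, 10)], docs=[dict(d) for d in docs])

result = mongos_data_size([shard_a, shard_b], 0, 10)
print(result)  # {'numObjects': 20, 'size': 2000}
```

The collection logically contains 10 documents in the range, but the summed reply reports 20 because shard A's orphans are counted alongside shard B's owned copies.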
| Comments |
| Comment by Max Hirschhorn [ 21/Sep/21 ] |
|
Thanks for the thoughtful questions kyle.suarez. I've flagged this ticket for the Sharding product sync meeting so we can discuss/research more about the use cases for the dataSize command. To add my own thoughts here:
|
| Comment by Kyle Suarez [ 21/Sep/21 ] |
|
After a discussion with christopher.harris, while we think that changing dataSize to exclude orphans makes sense, we want to point out that there may also be a use case for including orphans: specifically, an administrator who wants to understand the true physical size of a collection on disk, orphans and all. cheahuychou.mao, what was the original use case that led to this ticket? Do you think it would make sense from a user perspective to, say, change the default behavior of dataSize to ignore orphans but also introduce a new option flag that includes orphans when specified? |
| Comment by Cheahuychou Mao [ 21/Sep/21 ] |
|
Confirmed with max.hirschhorn that sharding does not use the dataSize command internally. We use Collection::dataSize() in the code for splitVector and chunk migration cloning. So I think we should make the dataSize command do shard filtering. |
| Comment by Kyle Suarez [ 21/Sep/21 ] |
|
If we want the size of owned documents, then yes, it sounds like we should add a SHARDING_FILTER. But if an administrator wants to know the true size of the data on disk, then the SHARDING_FILTER potentially omits relevant documents. cheahuychou.mao, do you know if we use the dataSize command internally? For example, is it used by sharding to determine whether we need to perform a chunk migration? Sending to sebastien.mendez's team for investigation. |
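The trade-off discussed in this thread can be sketched as a per-shard count that applies an ownership filter by default and skips it on request. This is only an illustration under stated assumptions: `owns`, `shard_data_size`, and the `include_orphans` parameter are hypothetical names, not an existing dataSize option or the server's SHARDING_FILTER implementation.

```python
def owns(owned_ranges, key):
    """True if the key falls in a half-open range this shard currently owns."""
    return any(lo <= key < hi for lo, hi in owned_ranges)


def shard_data_size(docs, owned_ranges, lo, hi, include_orphans=False):
    # Hypothetical flag: by default only owned documents are counted,
    # mimicking what a sharding filter would do; include_orphans=True
    # reports everything physically present in the range.
    matched = [
        d for d in docs
        if lo <= d["x"] < hi
        and (include_orphans or owns(owned_ranges, d["x"]))
    ]
    return {
        "numObjects": len(matched),
        "size": sum(d["size"] for d in matched),
    }


# This shard owns [10, 20) but still holds orphans with keys 0..9.
local_docs = [{"x": i, "size": 100} for i in range(15)]
owned = [(10, 20)]

filtered = shard_data_size(local_docs, owned, 0, 20)
physical = shard_data_size(local_docs, owned, 0, 20, include_orphans=True)
print(filtered)  # {'numObjects': 5, 'size': 500}
print(physical)  # {'numObjects': 15, 'size': 1500}
```

The filtered result answers "how much data does this shard own in the range", while the unfiltered one answers "how much data is physically stored here", which is the administrator use case raised in the comments.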