[SERVER-72513] Validation skips reporting inconsistencies on big collections Created: 04/Jan/23 Updated: 27/Oct/23 Resolved: 10/Jan/23 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Yuhong Zhang | Assignee: | Backlog - Storage Execution Team |
| Resolution: | Works as Designed | Votes: | 0 |
| Labels: | None | | |
| Remaining Estimate: | Not Specified | | |
| Time Spent: | Not Specified | | |
| Original Estimate: | Not Specified | | |
| Issue Links: | |
| Assigned Teams: | Storage Execution |
| Operating System: | ALL |
| Participants: | |
| Description |
|
During validation, we hash every keystring into an array of 65536 buckets to check index consistency efficiently. Each bucket stores a key count and a byte count. The key count is incremented while traversing the collection and decremented while traversing the indexes; the byte count is incremented during both passes. After both passes, any bucket with a non-zero count indicates an inconsistency, and validation reports the offending keys only if the bucket's byte count does not exceed the validation memory limit. However, on big collections with many indexes, hash collisions become common: even when the problematic key itself is small enough to report, validation may decide it will not fit, because the bucket's byte count reflects the total size of all keys hashed to that position, not just the inconsistent one. |
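A minimal sketch of the bucketing scheme described above, assuming a plain std::hash over the keystring bytes; all type and function names here are illustrative, not the actual server implementation:

```cpp
#include <array>
#include <cstdint>
#include <functional>
#include <string>

// Illustrative names only -- not the actual server implementation.
struct Bucket {
    int64_t keyCount = 0;    // +1 per collection key, -1 per index key
    uint64_t byteCount = 0;  // grows on BOTH passes, so collisions inflate it
};

constexpr std::size_t kNumBuckets = 1 << 16;  // 65536

class IndexConsistencyCheck {
public:
    // First pass: walk the collection and count each generated keystring.
    void addCollectionKey(const std::string& keyString) {
        Bucket& b = _buckets[_bucketOf(keyString)];
        b.keyCount += 1;
        b.byteCount += keyString.size();
    }

    // Second pass: walk the indexes; matching keys cancel out to zero.
    void addIndexKey(const std::string& keyString) {
        Bucket& b = _buckets[_bucketOf(keyString)];
        b.keyCount -= 1;
        b.byteCount += keyString.size();
    }

    // Reporting: a non-zero count marks an inconsistent bucket, but the key
    // is only recorded if the bucket's TOTAL byte count fits the memory
    // limit -- unrelated colliding keys can push it over and cause a skip.
    bool shouldReportKey(const std::string& keyString,
                         uint64_t memoryLimitBytes) const {
        const Bucket& b = _buckets[_bucketOf(keyString)];
        return b.keyCount != 0 && b.byteCount <= memoryLimitBytes;
    }

private:
    static std::size_t _bucketOf(const std::string& keyString) {
        return std::hash<std::string>{}(keyString) % kNumBuckets;
    }

    std::array<Bucket, kNumBuckets> _buckets{};
};
```

The key point is that `byteCount` grows during both the collection and index passes, so every key that collides into a bucket inflates the size that the reporting phase later compares against the memory limit.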
| Comments |
| Comment by Yuhong Zhang [ 04/Jan/23 ] |
|
With the hash doing its job, each bucket in the array maps to roughly (size of all keystrings) / 65536 bytes of data, which should be far below the memory used by the dataset itself. As a result, it should be safe to always keep at least one non-zero bucket and report its inconsistency during validate. We can probably close this ticket as works-as-designed and do the fix in |
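A back-of-the-envelope illustration of that claim (figures hypothetical):

```
expected bytes per bucket ≈ (total size of all keystrings) / 65536
e.g. 64 GiB of keystrings / 2^16 buckets = 1 MiB per bucket
```

so retaining a single bucket's worth of keys for reporting should cost on the order of megabytes, not anywhere near the validation memory limit.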