[SERVER-54675] Round collection data size to zero if found to be negative on startup or coming out of replication rollback Created: 20/Feb/21 Updated: 29/Oct/23 Resolved: 25/Mar/21
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | 4.4.7, 5.0.0-rc0 |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Max Hirschhorn | Assignee: | Gregory Noma |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Backport Requested: | v4.4 |
| Sprint: | Execution Team 2021-03-22, Execution Team 2021-04-05 |
| Participants: | |
| Linked BF Score: | 34 |
| Description |
|
Replication rollback may cause the collection's reported data size to drift from its actual data size. As observed in the random_moveChunk_broadcast_delete_transaction.js FSM workload when run with stepdowns enabled, the collection's reported data size can become negative when the effects of delete operations are overcounted. (Note that while Collection::dataSize() returns a uint64_t, the value may actually represent a signed 64-bit integer.) Overcounting the effects of the delete operations may cause a chunk migration to incorrectly fail with a ChunkTooBig error response due to calculating a nonsensical average document size of 737869762948382062 bytes. This is likely only an issue in testing because it effectively requires the workload to be deleting all of the documents in the collection.
|
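To illustrate the failure mode, here is a minimal, self-contained sketch (not MongoDB code; the variable names and the document count are assumptions for illustration) of how a slightly negative signed data size, exposed through a uint64_t accessor, turns into an astronomically large average document size:

```cpp
#include <cstdint>
#include <iostream>

int main() {
    // Assume the persisted data size counter drifted below zero after
    // replication rollback overcounted the deletes.
    int64_t driftedDataSize = -25;

    // Collection::dataSize() returns a uint64_t, so the negative value is
    // reinterpreted as a huge unsigned number near 2^64.
    uint64_t reportedDataSize = static_cast<uint64_t>(driftedDataSize);

    // A chunk migration estimating the average document size then divides
    // this huge value by the (hypothetical) number of documents.
    uint64_t numRecords = 25;
    uint64_t avgDocSize = reportedDataSize / numRecords;

    std::cout << "reported data size: " << reportedDataSize << "\n"; // 18446744073709551591
    std::cout << "average doc size:   " << avgDocSize << "\n";      // ~7.4e17 bytes,
    // the same order of magnitude as the 737869762948382062-byte average above
    return 0;
}
```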
| Comments |
| Comment by Githook User [ 17/May/21 ] |
|
Author: Gregory Noma <gregory.noma@gmail.com> (gregorynoma)
Message: (cherry picked from commit 9989aca5d45bf2d7c21f69d8a0192347c39cfcb5) |
| Comment by Githook User [ 25/Mar/21 ] |
|
Author: Gregory Noma <gregory.noma@gmail.com> (gregorynoma)
Message: |
| Comment by Daniel Gottlieb (Inactive) [ 22/Feb/21 ] |
|
I think the storage/collection layer should prevent underflow and return a 0 here. But for completeness, I think it's worth being clear that the data size will drift every time we crash or roll back data on a collection (and correcting that would be a cost paid by restart/rollback recovery). For example, I would expect insert-only collections to report a larger data size than what the collection actually contains, and every rollback would increase the error in absolute bytes. Hopefully that's not a concern for other related calculations sharding/moveChunk makes. |
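As a rough sketch of the underflow prevention suggested here (hypothetical helper, not the actual storage-layer API), the size adjustment for a delete could saturate at zero instead of wrapping:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical illustration only: clamp the counter at zero so an
// overcounted delete can never drive the data size negative.
int64_t applyDeleteToDataSize(int64_t currentDataSize, int64_t bytesDeleted) {
    return std::max<int64_t>(int64_t{0}, currentDataSize - bytesDeleted);
}
```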
| Comment by Max Hirschhorn [ 22/Feb/21 ] |
|
That's a fair point, kaloian.manassiev. I had been under the impression there wouldn't be an appetite for making the data size accurate following replication rollback, so I didn't think to ask the storage execution team for anything else. geert.bosch, louis.williams, daniel.gottlieb, what do you all think about clamping the data size to zero if the value is found to be negative on startup or coming out of replication rollback? |
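A minimal sketch of the clamping proposed here (illustrative names only; the real fix lives in the server's collection/record store recovery code):

```cpp
#include <cstdint>

// On startup or when coming out of replication rollback, round a negative
// persisted data size up to zero rather than letting it be reinterpreted
// as a huge unsigned value by callers of Collection::dataSize().
int64_t clampDataSizeToZero(int64_t recoveredDataSize) {
    return recoveredDataSize < 0 ? int64_t{0} : recoveredDataSize;
}
```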
| Comment by Kaloian Manassiev [ 22/Feb/21 ] |
|
It seems like a bug to me that after recovery, Collection::dataSize can return a nonsensical value. Shouldn't this be an execution/storage bug instead? |