Replace this data size comparison with a comparison of the number of chunks. The two are equivalent here, because this part of the test only needs to verify that no move happened during the balancing rounds that are expected to be no-ops.
When a range deletion happens:
1. The documents in range are deleted.
2. The orphans counter is updated, which also triggers an update of the orphans counter in the BalancerStatsRegistry.
Note that steps 1 and 2 do not happen in the same storage transaction and, even if they did, the update of the in-memory stats registry would still not happen atomically.
The test uses the get stats for balancing command, which, when invoked for a namespace:
A. Retrieves the total data size for the collection from the storage stats.
B. Retrieves the number of orphans for the collection from the BalancerStatsRegistry.
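As a rough illustration of why the two-step deletion matters for a reader that combines both values, here is a minimal sketch in Python. It is a toy model, not the server code: `ShardModel` and its methods are hypothetical names, and sizes are counted in documents rather than bytes.

```python
from dataclasses import dataclass


@dataclass
class ShardModel:
    """Toy model of one shard; sizes are counted in documents, not bytes."""
    total_docs: int       # what the storage stats would report (owned + orphaned)
    orphans_counter: int  # what the in-memory stats registry would report

    def delete_orphan(self) -> None:
        # Step 1: the orphaned document is removed from storage.
        self.total_docs -= 1

    def update_orphans_counter(self) -> None:
        # Step 2: the counter catches up in a separate, later step.
        self.orphans_counter -= 1

    def actual_data_size(self) -> int:
        # Reads A and B, combined the way the test combines them.
        return self.total_docs - self.orphans_counter


shard = ShardModel(total_docs=8, orphans_counter=1)  # 7 owned + 1 orphan
before = shard.actual_data_size()
shard.delete_orphan()
between = shard.actual_data_size()   # observed inside the non-atomic window
shard.update_orphans_counter()
after = shard.actual_data_size()
print(before, between, after)  # 7 6 7
```

In this model, a value observed between the two steps is off by one even though no owned document was added, removed, or moved.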
The test does the following:
- Wait for the collection to be fully balanced.
- Retrieve the actual collection data size (dataSize - num orphans) from the most loaded shard.
- Wait for 3 balancing rounds that must not move any data.
- Retrieve the actual collection data size (dataSize - num orphans) from the most loaded shard again.
- Make sure the actual data size on the most loaded shard didn't change following the no-op balancing rounds.
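A check based on chunk counts does not depend on the orphans counter at all. The sketch below shows the idea in Python under stated assumptions: `no_chunk_moved`, `get_chunk_counts`, and `wait_for_noop_rounds` are hypothetical stand-ins for whatever helpers the test actually uses, and the comparison is done per shard.

```python
from typing import Callable, Dict

ChunkCounts = Dict[str, int]  # shard name -> number of chunks


def no_chunk_moved(get_chunk_counts: Callable[[], ChunkCounts],
                   wait_for_noop_rounds: Callable[[], None]) -> bool:
    """Return True if per-shard chunk counts are identical before and after the rounds."""
    before = dict(get_chunk_counts())  # copy, so later mutations are not aliased
    wait_for_noop_rounds()
    after = dict(get_chunk_counts())
    return before == after


# Stand-in for cluster state; a real test would query the cluster instead.
cluster = {"shard0": 3, "shard1": 3}

# Rounds that do nothing leave the counts unchanged.
unchanged = no_chunk_moved(lambda: cluster, lambda: None)


def move_one_chunk() -> None:
    # Simulates a balancing round that does move a chunk.
    cluster["shard0"] -= 1
    cluster["shard1"] += 1


moved_detected = not no_chunk_moved(lambda: cluster, move_one_chunk)
print(unchanged, moved_detected)  # True True
```

Because chunk ownership only changes when a move commits, this comparison is unaffected by the timing of range deletions on the donor.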
According to the failure, the actual data size on the donor was 6 before the no-op rounds and 7 after them. However, no move happened: the rounds really were no-ops.
The only viable explanation is that the following interleaving happened: 1 - A - 2 - B. In other words, the data size was retrieved after an orphaned document had been deleted but before the number of orphans was updated. This resulted in an off-by-one.