Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 6.0.3, 6.1.0-rc0
Affects Version/s: None
Component/s: None
Labels: None

Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v6.0
Sprint: Sharding EMEA 2022-06-13
Linked BF Score: 21
Confidence Status: None
Work Order: 3
CAR Domain/s: None

Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

Replace the data size comparison in the test with a comparison of the number of chunks. The two checks are equivalent here, because that part of the test only needs to verify that the expected no-op balancing rounds did not move anything.
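A minimal sketch of the proposed check, written in Python for illustration only (it is not the actual test code). The `chunks_per_shard` and `no_migration_happened` helpers and the shape of the chunk documents (a dict with a `shard` key) are assumptions made for this example. The idea is that per-shard chunk counts are unaffected by range deletions, so they stay stable across no-op rounds even while orphans are being cleaned up.

```python
from collections import Counter

def chunks_per_shard(chunks):
    """Count how many chunks each shard owns.

    `chunks` is a list of dicts with a 'shard' key (hypothetical shape,
    chosen only for this sketch).
    """
    return Counter(chunk["shard"] for chunk in chunks)

def no_migration_happened(chunks_before, chunks_after):
    """Return True if the per-shard chunk counts are identical."""
    return chunks_per_shard(chunks_before) == chunks_per_shard(chunks_after)

# Snapshot taken once the collection is balanced.
before = [{"shard": "s0"}, {"shard": "s0"}, {"shard": "s1"}, {"shard": "s1"}]

# Snapshot after the no-op rounds: same ownership, so the check passes.
after_noop = [{"shard": "s0"}, {"shard": "s0"}, {"shard": "s1"}, {"shard": "s1"}]

# Snapshot after a hypothetical migration of one chunk from s0 to s1.
after_move = [{"shard": "s0"}, {"shard": "s1"}, {"shard": "s1"}, {"shard": "s1"}]

unchanged = no_migration_happened(before, after_noop)
moved = not no_migration_happened(before, after_move)
```

Comparing counts alone would not notice two shards swapping one chunk each, which is acceptable in this sketch because the rounds being checked are expected to move nothing at all.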

Long explanation (from a build failure)

When a range deletion happens:
1. The documents in range are deleted.
2. The counter of orphans is updated, which also triggers an update of the orphans counter in the BalancerStatsRegistry.

Note that steps 1 and 2 do not happen in the same storage transaction and, even if they did, the update of the in-memory stats registry would still not happen atomically with them.
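The window between the two steps can be illustrated with a toy model in Python. This is a sketch under strong assumptions, not the server's accounting: each document counts as one unit of data size, a single orphan is deleted, and `ShardState` and `delete_one_orphan` are hypothetical names. The numbers are chosen only for illustration.

```python
from dataclasses import dataclass

@dataclass
class ShardState:
    data_size: int     # owned + orphaned documents, one unit each (assumption)
    orphan_count: int  # stand-in for the in-memory orphans counter

    def derived_size(self) -> int:
        # Same formula the test uses: dataSize - num orphans.
        return self.data_size - self.orphan_count

def delete_one_orphan(state):
    """Run the two non-atomic steps, yielding the derived size after each."""
    state.data_size -= 1       # step 1: the orphaned document is deleted
    yield state.derived_size()
    state.orphan_count -= 1    # step 2: the orphans counter is updated
    yield state.derived_size()

# 7 owned documents plus 1 orphan awaiting deletion.
state = ShardState(data_size=8, orphan_count=1)
before = state.derived_size()
in_window, after = delete_one_orphan(state)
```

In this model the derived size is correct before and after the deletion, but an observer that reads both values between the two steps sees a value that is one lower, even though no owned document changed.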

The test uses the get stats for balancing command which, when invoked for a namespace:
A. Retrieves the total data size for the collection from the storage stats.
B. Retrieves the number of orphans for the collection from the BalancerStatsRegistry.

The test does the following:

- Wait for the collection to be fully balanced.
- Retrieve the actual collection data size (dataSize - num orphans) from the most loaded shard.
- Wait for 3 balancing rounds that must not move any data.
- Retrieve the actual collection data size (dataSize - num orphans) from the most loaded shard again.
- Make sure the actual data size on the most loaded shard did not change across the no-op balancing rounds.

According to the failure, the actual data size on the donor was 6 before the no-op rounds and 7 after them. However, no move happened and the rounds really were no-ops.

The only viable explanation is the following interleaving: 1 - A - 2 - B. In other words, the stats command retrieved the data size after an orphaned document was deleted but before the number of orphans was updated, which resulted in an off-by-one.

Assignee: Pierlauro Sciarelli
Reporter: Pierlauro Sciarelli
Participants: Githook User, Pierlauro Sciarelli
Votes: 0
Watchers: 2

Created: May 27 2022 03:19:04 PM UTC
Updated: Oct 29 2023 09:37:37 PM UTC
Resolved: Jun 01 2022 07:31:33 PM UTC
Confidence Status Last Update: 01/Jun/22 11:34 AM
