[SERVER-70602] Handle faulty balancerCompliant reporting by waiting for some no-op balancing rounds Created: 17/Oct/22  Updated: 29/Oct/23  Resolved: 03/Nov/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 6.1.1, 6.0.3, 6.2.0-rc0

Type: Bug Priority: Major - P3
Reporter: Silvia Surroca Assignee: Silvia Surroca
Resolution: Fixed Votes: 0
Labels: auto-reverted
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Problem/Incident
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v6.1, v6.0
Sprint: Sharding EMEA 2022-10-31, Sharding EMEA 2022-11-14
Participants:
Linked BF Score: 35

 Description   

The tests already double-check the collection's balancerCompliant value (SERVER-69136) through the awaitCollectionBalance function, because of the "Won't Fix" bug SERVER-68777.

However, in SERVER-68777 we only considered the possibility of having a wrong collection size on the donor and not on the recipient.

Why may the recipient report a wrong collection size?

Because the actual migration of the documents is not executed atomically with the update of orphanCounter on the recipient side. So the recipient shard reports a larger collection size than the actual one if the request is served after the documents have been migrated but before orphanCounter has been updated.

NOTE: The recipient clears the 'orphan' tag of the received chunk once the migration commit state is reached.
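
The window described above can be illustrated with a small simulation. This is only a sketch under a simplified, assumed model (reported size = stored documents minus tracked orphans); the class, the method names and the unit document size are hypothetical and do not correspond to server code. The figures are taken from the failure example in this ticket.

```python
from dataclasses import dataclass

DOC_SIZE = 1  # hypothetical: every document counts as one size unit


@dataclass
class RecipientShard:
    stored_docs: int = 0     # documents physically present on the shard
    orphan_counter: int = 0  # documents currently tracked as orphans

    def reported_size(self) -> int:
        # Assumed model: the shard reports only what it believes it owns.
        return (self.stored_docs - self.orphan_counter) * DOC_SIZE

    def clone_documents(self, n: int) -> None:
        # First half of the non-atomic sequence: the documents arrive.
        self.stored_docs += n

    def track_orphans(self, n: int) -> None:
        # Second half: the orphan counter catches up with the new documents.
        self.orphan_counter += n

    def commit(self, n: int) -> None:
        # On migration commit the received range is no longer orphaned.
        self.orphan_counter -= n


def simulate(owned_docs: int, migrated_docs: int) -> list[int]:
    """Return the size reported at each step of one incoming migration."""
    shard = RecipientShard(stored_docs=owned_docs)
    sizes = [shard.reported_size()]      # before the migration
    shard.clone_documents(migrated_docs)
    sizes.append(shard.reported_size())  # faulty window: size is inflated
    shard.track_orphans(migrated_docs)
    sizes.append(shard.reported_size())  # counter updated: size is correct
    shard.commit(migrated_docs)
    sizes.append(shard.reported_size())  # after commit: new, larger size
    return sizes


print(simulate(5243010, 1048602))
```

The second value in the returned list is the inflated size that a balancerCompliant request would observe if it were served inside the window.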

Example of failure

 
1. The first balancerCompliant check returns too early because the donor shard reports a lower collection size than the actual one, due to SERVER-68777. At that moment the last migration is still in progress.

DONOR size                        RECIPIENT size
7340214 MB (actual: 8388816 MB)   5243010 MB

2. At that moment there are no orphans so we don't have to wait for them to be 0.

3. The second balancerCompliant check returns too early as well, because the recipient has already received the documents but has not yet updated the orphan counter. So we wrongly return from awaitCollectionBalance.

DONOR size    RECIPIENT size
8388816 MB    6291612 MB (actual: 5243010 MB)

4. Once the migration ends we finally get the proper size on each shard

DONOR size    RECIPIENT size
7340214 MB    6291612 MB
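
The idea in the ticket title, waiting for some no-op balancing rounds before trusting the result, can be sketched as follows. This is not the committed fix, whose code is not shown in this ticket; the function name, its input (the number of migrations scheduled in each balancing round) and the default of 3 rounds are assumptions chosen for illustration.

```python
from typing import Iterable, Optional


def rounds_until_balanced(
    migrations_per_round: Iterable[int],
    required_noop_rounds: int = 3,  # hypothetical threshold
) -> Optional[int]:
    """Return the 1-based round at which balance is confirmed, else None.

    A round is a no-op when the balancer schedules no migration in it.
    Balance is only confirmed after `required_noop_rounds` consecutive
    no-op rounds, so a single early "compliant" answer is not trusted.
    """
    consecutive_noops = 0
    for round_number, migrations in enumerate(migrations_per_round, start=1):
        if migrations == 0:
            consecutive_noops += 1
            if consecutive_noops >= required_noop_rounds:
                return round_number
        else:
            # A migration was scheduled, so the earlier quiet rounds
            # were based on a faulty size report: start counting again.
            consecutive_noops = 0
    return None


# One quiet round, then a migration, then three quiet rounds in a row.
print(rounds_until_balanced([0, 1, 0, 0, 0]))
```

Requiring several consecutive quiet rounds gives an in-flight migration time to finish and the reported sizes time to settle, at the cost of a longer wait in the tests.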


 Comments   
Comment by Githook User [ 04/Nov/22 ]

Author:

{'name': 'Silvia Surroca', 'email': 'silvia.surroca@mongodb.com', 'username': 'silviasuhu'}

Message: SERVER-70602 Handle faulty balancerCompliant reporting by waiting for some no-op balancing rounds

(cherry picked from commit 8e7978fb75cad95f864255810c655f62a0a9408d)
Branch: v6.0
https://github.com/mongodb/mongo/commit/826697fdc36740ab8543d81880d6d641eba0685a

Comment by Githook User [ 04/Nov/22 ]

Author:

{'name': 'Silvia Surroca', 'email': 'silvia.surroca@mongodb.com', 'username': 'silviasuhu'}

Message: SERVER-70602 Handle faulty balancerCompliant reporting by waiting for some no-op balancing rounds

(cherry picked from commit 8e7978fb75cad95f864255810c655f62a0a9408d)
Branch: v6.1
https://github.com/mongodb/mongo/commit/ea4b2d784664f3240b56ad99ba66b33ff4e0330f

Comment by Silvia Surroca [ 03/Nov/22 ]

Yes, I've just created both backports

  • BACKPORT-13952
  • BACKPORT-13953
Comment by Tommaso Tocci [ 03/Nov/22 ]

silvia.surroca@mongodb.com is this affecting also 6.0 and 6.1? Do we need a backport?

Comment by Githook User [ 03/Nov/22 ]

Author:

{'name': 'Silvia Surroca', 'email': 'silvia.surroca@mongodb.com', 'username': 'silviasuhu'}

Message: SERVER-70602 Handle faulty balancerCompliant reporting by waiting for some no-op balancing rounds
Branch: master
https://github.com/mongodb/mongo/commit/8e7978fb75cad95f864255810c655f62a0a9408d

Comment by xgen-buildbaron-user [ 22/Oct/22 ]

Ticket re-opened due to revert. concurrency_sharded_with_stepdowns_and_balancer began a consistent failure of jstests/concurrency/fsm_workloads/collection_defragmentation.js

Comment by Githook User [ 22/Oct/22 ]

Author:

{'name': 'auto-revert-processor', 'email': 'dev-prod-dag@mongodb.com'}

Message: Revert "SERVER-70602 Handle faulty balancerCompliant reporting by waiting for some no-op balancing rounds"

This reverts commit 06e43e02a452ae1c4fffcffb0242b4a528bdacb4.
Branch: master
https://github.com/mongodb/mongo/commit/56a7b05b1ca726ab6ba719f244902126e6285c53

Comment by Githook User [ 21/Oct/22 ]

Author:

{'name': 'Silvia Surroca', 'email': 'silvia.surroca@mongodb.com', 'username': 'silviasuhu'}

Message: SERVER-70602 Handle faulty balancerCompliant reporting by waiting for some no-op balancing rounds
Branch: master
https://github.com/mongodb/mongo/commit/06e43e02a452ae1c4fffcffb0242b4a528bdacb4
