[SERVER-78052] Properly handle conflict between balancer splitting due to zoning and auto-merger Created: 13/Jun/23  Updated: 29/Oct/23  Resolved: 06/Jul/23

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 7.1.0-rc0, 7.0.0-rc7

Type: Bug
Priority: Major - P3
Reporter: Pierlauro Sciarelli
Assignee: Pierlauro Sciarelli
Resolution: Fixed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Problem/Incident
is caused by SERVER-74872 Auto-merger must keep on issuing requ... Closed
Related
Assigned Teams:
Sharding EMEA
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v7.0
Sprint: Sharding EMEA 2023-06-26, Sharding EMEA 2023-07-10
Participants:

 Description   

The auto-merger currently runs on a secondary thread, concurrently with the balancer thread. Its behavior can be summarized as follows:

  • (1) while the balancer is enabled:
    • (2) while there are <collection, shard> pairs with mergeable chunks (mergeability requirements are documented in DOCS-15976)
      • (3) for each <collection, shard> pair discovered by (2):
        • (4) squash together mergeable chunks
        • (5) sleep for 15 seconds
    • (6) sleep for 1 hour
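
The loop above can be sketched as follows. This is only an illustrative model of steps (1) to (6), not the server implementation: the callback names (balancer_enabled, find_mergeable, merge_all_chunks_on_shard, sleep) are hypothetical and are injected so the control flow can be exercised without a cluster. The 15-second and 1-hour values come from the description.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (collection namespace, shard id); hypothetical representation


def auto_merger_loop(
    balancer_enabled: Callable[[], bool],
    find_mergeable: Callable[[], List[Pair]],
    merge_all_chunks_on_shard: Callable[[str, str], None],
    sleep: Callable[[float], None],
    throttle_secs: float = 15,
    interval_secs: float = 3600,
) -> None:
    """Model of the auto-merger control flow described in steps (1)-(6)."""
    while balancer_enabled():                            # (1)
        while True:
            pairs = find_mergeable()                     # (2)
            if not pairs:
                break
            for collection, shard in pairs:              # (3)
                merge_all_chunks_on_shard(collection, shard)  # (4)
                sleep(throttle_secs)                     # (5)
        sleep(interval_secs)                             # (6)
```

Note that the 1-hour sleep in (6) is reached only once (2) finds nothing left to merge; as long as new mergeable chunks keep appearing, the inner loop keeps running with only the 15-second pauses from (5).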

As part of a balancing round, the balancer splits chunks along the boundaries of the configured zones so that they can then be moved off. Since a split does not change chunk ownership, two or more chunks produced by a split are always mergeable as long as they have resided on the same shard for at least the history window (defined in DOCS-15976).
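
To illustrate why freshly split chunks are immediately mergeable, here is a simplified model. It is a sketch under assumptions: the Chunk fields, the split and mergeable helpers, and the numeric history window are all hypothetical, and the check covers only the conditions mentioned above (same shard, contiguous ranges, no ownership change within the history window). The full requirements are those documented in DOCS-15976.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Chunk:
    min_key: int
    max_key: int           # range is [min_key, max_key)
    shard: str
    on_shard_since: float  # time of the last ownership change, in seconds


def split(chunk: Chunk, at: int) -> Tuple[Chunk, Chunk]:
    """Split a chunk in two; owner and on_shard_since are inherited unchanged."""
    assert chunk.min_key < at < chunk.max_key
    return replace(chunk, max_key=at), replace(chunk, min_key=at)


def mergeable(a: Chunk, b: Chunk, now: float, history_window_secs: float) -> bool:
    """Simplified check: same shard, contiguous, both owned for the whole window."""
    return (
        a.shard == b.shard
        and a.max_key == b.min_key
        and now - a.on_shard_since >= history_window_secs
        and now - b.on_shard_since >= history_window_secs
    )
```

In this model the two halves of a split inherit the parent's ownership timestamp, so they satisfy the check right away unless one of them is migrated first.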

The conflict between the balancer splitting chunks for zoning and the auto-merger squashing together mergeable chunks had been considered acceptable based on the following reasoning:

  • The auto-merger may merge chunks that belong to different zones but currently reside on the same shard
  • However, the auto-merger then "goes to sleep" for 1 hour
  • This leaves enough time for the balancer to split again and keep moving data (avoiding future merges)

It turns out that, because splits are extremely slow when there are several hundred zones, there is an interleaving that leads to the following continuous conflict between the balancer and the auto-merger:

  • (A) The balancer starts splitting chunks
  • (B) The auto-merger discovers mergeable chunks due to (2)
  • (C) Due to (4), the auto-merger squashes together chunks that were just split because of (A)
  • (D) The auto-merger sleeps for 15 seconds due to (5) while (A) is still running, then discovers newly split chunks due to (2)
  • (E) The balancer finishes (A), but some of the split chunks have been merged back
  • Back to (A), repeat
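
The conflict in (A) to (E) can be reproduced on a toy model. The sketch below assumes integer shard keys and a single shard; split_at_zone_boundaries and merge_contiguous are hypothetical helpers, not server functions. Passing keep_boundaries shows one possible way a merge could take zones into account (never merging across a zone boundary); it is an assumption for illustration, not a description of the actual change to `commitMergeAllChunksOnShard`.

```python
from typing import AbstractSet, Iterable, List, Tuple

Range = Tuple[int, int]  # [min, max) over an integer shard key


def split_at_zone_boundaries(chunks: Iterable[Range], boundaries: AbstractSet[int]) -> List[Range]:
    """Model of step (A): cut every chunk at each zone boundary falling inside it."""
    out: List[Range] = []
    for lo, hi in chunks:
        cuts = [b for b in sorted(boundaries) if lo < b < hi]
        edges = [lo] + cuts + [hi]
        out.extend(zip(edges, edges[1:]))
    return out


def merge_contiguous(chunks: Iterable[Range], keep_boundaries: AbstractSet[int] = frozenset()) -> List[Range]:
    """Model of step (4): squash contiguous chunks, except across keep_boundaries."""
    merged: List[Range] = []
    for lo, hi in sorted(chunks):
        if merged and merged[-1][1] == lo and lo not in keep_boundaries:
            merged[-1] = (merged[-1][0], hi)  # extend the previous chunk
        else:
            merged.append((lo, hi))
    return merged


zones = {25, 50, 75}
after_split = split_at_zone_boundaries([(0, 100)], zones)
undone = merge_contiguous(after_split)            # zone-unaware merge reverts the splits
preserved = merge_contiguous(after_split, zones)  # zone-aware merge keeps them
```

With a zone-unaware merge, the balancer's work from (A) is reverted and the cycle restarts; with the boundaries preserved, the split chunks survive until the balancer can move them.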


 Comments   
Comment by Githook User [ 10/Jul/23 ]

Author:

{'name': 'Pierlauro Sciarelli', 'email': 'pierlauro.sciarelli@mongodb.com', 'username': 'pierlauro'}

Message: SERVER-78052 + SERVER-78767 Adapt `commitMergeAllChunksOnShard` to take into account zones
Branch: v7.0
https://github.com/mongodb/mongo/commit/9eee01c3bd439752e5c3c7d6164dd605cd6a71d8

Comment by Githook User [ 06/Jul/23 ]

Author:

{'name': 'Pierlauro Sciarelli', 'email': 'pierlauro.sciarelli@mongodb.com', 'username': 'pierlauro'}

Message: SERVER-78052 Adapt `commitMergeAllChunksOnShard` to take into account zones
Branch: master
https://github.com/mongodb/mongo/commit/c1faace9f55cc80a8d70703fe5b36d15c0082846
