  1. Core Server
  2. SERVER-92349

Fix noBalance reset on sharding DDL coordinator

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: None
    • Cluster Scalability
    • Fully Compatible
    • v8.0
    • Cluster Scalability Priorities
    • 200

      TLDR: When constructShardingDDLCoordinatorInstance() is triggered with an operationType of kReshardCollection, it appears that noBalance is reset to false during the reshard, even if the collection previously had noBalance: true set. We want this value to persist through the reshard rather than being reset to its default of false.

      Background:
      BF-33587 showed a failure of the indexed_insert_ttl.js test in 2 separate suites that were configured for continuous resharding. The test failed because the TTL Monitor couldn't remove the TTL-indexed documents from the collection while it was constantly being resharded: the TTL Monitor cannot proceed while a collection is being resharded.

      Initial proposed fix:
      The proposed fix was to have add_remove_shards.py check each collection for noBalance:true and, if it is set, skip the collection so that it is not added to tracked_unsharded_colls, in the hope that resharding would stop once the JS test started the TTL check. This should work because, before the TTL check, balancing is turned off, which sets noBalance:true on the collection that has the TTL indexes.

      Code change in add_remove_shards.py around line 511:

      for coll in tracked_colls:
      +   if "noBalance" in coll and coll["noBalance"] == True:
      +       continue
          if "unsplittable" in coll:
              tracked_unsharded_colls.append(coll)
          else:
              sharded_colls.append(coll) 
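      For reference, the filtering step in the diff above can be sketched as a standalone function. This is only an illustration: the function name partition_tracked_colls and the sample documents are hypothetical and not part of add_remove_shards.py, and the documents are assumed to be plain dicts shaped like the collection entries in the diff.

```python
def partition_tracked_colls(tracked_colls):
    """Split collection entries into (unsharded, sharded), skipping noBalance ones.

    Hypothetical helper mirroring the loop in the diff; not the hook's real API.
    """
    tracked_unsharded_colls = []
    sharded_colls = []
    for coll in tracked_colls:
        # Skip any collection that has balancing explicitly disabled.
        if coll.get("noBalance") is True:
            continue
        if "unsplittable" in coll:
            tracked_unsharded_colls.append(coll)
        else:
            sharded_colls.append(coll)
    return tracked_unsharded_colls, sharded_colls


# Made-up sample entries, for illustration only.
sample_colls = [
    {"_id": "test.ttl_coll", "noBalance": True},
    {"_id": "test.unsharded", "unsplittable": True},
    {"_id": "test.sharded"},
    {"_id": "test.balanced", "noBalance": False},
]
unsharded, sharded = partition_tracked_colls(sample_colls)
```

      Note that the skip happens before either list is appended to, so a collection with noBalance:true ends up in neither list.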

      Problem: This fix did not always work. We found instances where a reshard started shortly after noBalance:true was set on the collection. Once the sharding DDL coordinator started the reshard, the noBalance field was reset to false, which allowed the collection with the TTL index to continue being resharded throughout the check and caused the test to fail.

      Task at hand:
      When constructShardingDDLCoordinatorInstance() is triggered and starts reshardCollection, it seems to reset the noBalance value to false at some point. We want this value to persist through the reshard. That should allow the initial proposed fix above to resolve the original BF.
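      The desired behavior can be illustrated with a toy model. This is not the server's code and makes no claim about where the reset actually happens: reshard_entry, its carry_no_balance flag, and the dict layout are all hypothetical, chosen only to contrast "rebuild metadata from defaults" with "carry noBalance over".

```python
def reshard_entry(old_entry, new_shard_key, carry_no_balance=True):
    """Toy model: build the post-reshard metadata for a collection.

    Hypothetical; it only illustrates the invariant requested in this ticket.
    """
    # Start from defaults, which is the suspected source of the reset.
    new_entry = {
        "_id": old_entry["_id"],
        "key": new_shard_key,
        "noBalance": False,
    }
    if carry_no_balance:
        # Desired behavior: keep whatever the collection had before.
        new_entry["noBalance"] = old_entry.get("noBalance", False)
    return new_entry


before = {"_id": "test.ttl_coll", "key": {"a": 1}, "noBalance": True}

# Observed behavior (modeled): the flag falls back to the default.
reset = reshard_entry(before, {"b": 1}, carry_no_balance=False)

# Requested behavior: the flag survives the reshard.
kept = reshard_entry(before, {"b": 1})
```

      A regression test for the real fix would presumably assert the same invariant: noBalance read after the reshard equals noBalance read before it.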

            Assignee:
            cheahuychou.mao@mongodb.com Cheahuychou Mao
            Reporter:
            dominic.hernandez@mongodb.com Dominic Hernandez
            Votes:
            0
            Watchers:
            6

              Created:
              Updated: