Investigate why resharding timeseries abort tests fail with BadValue in sharding_last_lts multiversion context

    • Type: Task
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Cluster Scalability
    • 200

      Background

      While working on SERVER-126217, commit 56cd9c722fb removed both multiversion_incompatible and featureFlagReshardingForTimeseries tags from:

      • jstests/sharding/resharding_timeseries/reshard_timeseries_abort_command.js
      • jstests/sharding/resharding_timeseries/reshard_timeseries_abort_during_building_index.js
      • jstests/sharding/resharding_timeseries/reshard_timeseries_abort_in_preparing_to_donate.js
      • jstests/sharding/resharding_timeseries/reshard_timeseries_abort_while_monitoring_to_commit.js

      Why the Removal Was Considered Safe

      Based on Abdul's (zorro786) reasoning in PR #54052 (SERVER-126218), inline thread on jstests/sharding/libs/resharding_failover_helpers.js:

      > "multiversion_incompatible is only needed if you kill/shutdown nodes. It is not needed when you perform a stepUp operation. The reason restarts break multiversion is that MongoRunner would relaunch the binary using the new version's mongod from disk. Pure stepdown does not relaunch any binary, so it's safe."

      None of the four abort tests restart nodes — they only use failpoints. The featureFlagReshardingForTimeseries removal was also sound: the flag has default: true, version: 8.0 in src/mongo/s/resharding/resharding_feature_flag.idl, making it redundant with requires_fcv_80.

      The Actual Failure

      The four tests fail consistently (not flaky) in sharding_last_lts across all 13 burn_in variants in Evergreen patch 6a186d950b47bf0007b2aa9b.

      Failure Mechanism

      In reshard_timeseries_abort_command.js:

      1. The test waits on failpoint reshardingPauseRecipientBeforeCloning (maxTimeMS: 600000).
      2. The resharding thread fails silently before reaching it. The only trace is this log line: "Ignoring response from the resharding thread: { ok: 0, errmsg: 'Chunk range must start at global min for new shard key', code: 2, codeName: 'BadValue' }"

      3. Since the operation never passes validation, the failpoint is never entered.
      4. waitForFailPoint times out with MaxTimeMSExpired (code 50) after 10 minutes.

      All four tests follow this same pattern — each waits on a different failpoint, but all require the resharding operation to pass initial validation, which it never does.
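
      The pattern above can be modelled in a few lines of plain JavaScript. This is only an illustrative sketch: waitForFailPointOnly and waitForFailPointOrThreadError are hypothetical names, not helpers from the jstests libraries, and polling counts stand in for the real maxTimeMS timeout. It shows why a wait that only watches the failpoint reports a timeout, whereas a wait that also inspects the resharding thread's response could surface the BadValue immediately.

```javascript
// Hypothetical model of the wait loop; these are not real jstest helpers.
// A wait that only watches the failpoint, as the abort tests effectively do.
function waitForFailPointOnly({isFailPointEntered, maxPolls}) {
    for (let poll = 1; poll <= maxPolls; poll++) {
        if (isFailPointEntered()) {
            return {status: "entered", polls: poll};
        }
    }
    // Stands in for MaxTimeMSExpired (code 50) in the real test.
    return {status: "timedOut", polls: maxPolls};
}

// A wait that also checks whether the resharding thread already failed.
function waitForFailPointOrThreadError({isFailPointEntered, getThreadResult, maxPolls}) {
    for (let poll = 1; poll <= maxPolls; poll++) {
        if (isFailPointEntered()) {
            return {status: "entered", polls: poll};
        }
        const res = getThreadResult();
        if (res !== null && res.ok === 0) {
            return {status: "threadFailed", polls: poll, error: res};
        }
    }
    return {status: "timedOut", polls: maxPolls};
}

// The response logged by the resharding thread in the failing runs.
const threadResponse = {
    ok: 0,
    errmsg: "Chunk range must start at global min for new shard key",
    code: 2,
    codeName: "BadValue",
};
const failPointNeverEntered = () => false;

console.log(waitForFailPointOnly({
    isFailPointEntered: failPointNeverEntered,
    maxPolls: 1000,
}).status);  // timedOut

console.log(waitForFailPointOrThreadError({
    isFailPointEntered: failPointNeverEntered,
    getThreadResult: () => threadResponse,
    maxPolls: 1000,
}).status);  // threadFailed
```

      Whether the real tests should adopt such a fail-fast wait is a separate question; the sketch is only meant to make the 10-minute timeout easier to reason about.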

      Potential Root Cause: checkForHolesAndOverlapsInChunks

      The BadValue (code 2) originates in src/mongo/db/s/resharding/resharding_util.cpp:181:

      uassert(ErrorCodes::BadValue,
              "Chunk range must start at global min for new shard key",
              SimpleBSONObjComparator::kInstance.evaluate(chunks.front().getMin() ==
                                                          keyPattern.globalMin()));
      

      In the mixed-version cluster of sharding_last_lts (master coordinator + 8.0 shard nodes), the chunk ranges produced for timeseries resharding apparently do not start at globalMin of the new shard key, so the uassert above fires. The same tests pass on a pure-master cluster and fail only in the mixed-version one.
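
      To make the failing condition concrete, here is a simplified JavaScript model of a holes-and-overlaps check. It is not the server implementation: it assumes a single numeric shard key field, uses -Infinity and Infinity as stand-ins for MinKey and MaxKey, and checkChunkRangesModel is a hypothetical name. Only the "must start at global min" message is taken from the log above; the contiguity and global-max checks and their messages are assumptions about what such a check would cover.

```javascript
// Simplified model only: one numeric shard key field, with -Infinity and
// Infinity standing in for MinKey and MaxKey.
const GLOBAL_MIN = -Infinity;
const GLOBAL_MAX = Infinity;

// Build an error shaped like the BadValue response seen in the logs.
function badValue(errmsg) {
    const err = new Error(errmsg);
    err.code = 2;
    err.codeName = "BadValue";
    return err;
}

// Hypothetical model of the check; chunks is an array of {min, max}.
function checkChunkRangesModel(chunks) {
    if (chunks.length === 0) {
        // Model-only guard, not a message from the server.
        throw badValue("model: expected at least one chunk");
    }
    // Sort by lower bound; avoid subtraction because -Infinity - -Infinity is NaN.
    const sorted = [...chunks].sort((a, b) => (a.min < b.min ? -1 : a.min > b.min ? 1 : 0));

    // The condition that fails in sharding_last_lts.
    if (sorted[0].min !== GLOBAL_MIN) {
        throw badValue("Chunk range must start at global min for new shard key");
    }
    // Assumed: each chunk must begin where the previous one ended.
    for (let i = 1; i < sorted.length; i++) {
        if (sorted[i].min !== sorted[i - 1].max) {
            throw badValue("model: chunk ranges must be contiguous");
        }
    }
    // Assumed: the last chunk must reach the global max.
    if (sorted[sorted.length - 1].max !== GLOBAL_MAX) {
        throw badValue("model: chunk range must end at global max");
    }
}

// A full cover of the key space passes.
checkChunkRangesModel([{min: GLOBAL_MIN, max: 0}, {min: 0, max: GLOBAL_MAX}]);

// A first chunk that starts above the global min reproduces the BadValue.
try {
    checkChunkRangesModel([{min: 0, max: GLOBAL_MAX}]);
} catch (e) {
    console.log(e.codeName, e.code, e.message);
}
```

      In this model, any difference in how the first chunk's lower bound is constructed (for example, a different key pattern shape between versions) would be enough to trip the first check, which is one thing to look for when comparing the 8.0 and master code paths.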

      The enterprise-rhel-8-64-bit-multiversion variant additionally produced core dumps from both mongod-8.0. and mongod. during teardown, suggesting the ReshardingTest fixture crashes when cleaning up after the failed operation.

      Parsley Log Links

      reshard_timeseries_abort_command.js (job1, ~10 min failure):
      https://parsley.corp.mongodb.com/test/mongodb_mongo_master_enterprise_rhel_8_64_bit_dynamic_generated_by_burn_in_tags_burn_in:sharding_last_lts_enterprise_rhel_8_64_bit_dynamic_generated_by_burn_in_tags_0_patch_5bad8a762b61fdedab5a9420d7ce78170318a12b_6a186d950b47bf0007b2aa9b_26_05_28_16_31_30/0/8a10bee3bfa50daeab5d5546e4752d87?shareLine=0

      reshard_timeseries_abort_while_monitoring_to_commit.js (multiversion, with core dumps):
      https://parsley.corp.mongodb.com/test/mongodb_mongo_master_enterprise_rhel_8_64_bit_multiversion_generated_by_burn_in_tags_burn_in:sharding_last_lts_enterprise_rhel_8_64_bit_multiversion_generated_by_burn_in_tags_3_patch_5bad8a762b61fdedab5a9420d7ce78170318a12b_6a186d950b47bf0007b2aa9b_26_05_28_16_31_30/0/133355d35086d40ce2b483717b09ed0c?shareLine=0

      Files to Investigate

      • src/mongo/db/s/resharding/resharding_util.cpp, checkForHolesAndOverlapsInChunks(), line 181: why do chunk ranges not start at globalMin for timeseries resharding in a mixed 8.0/master cluster?
      • src/mongo/s/resharding/resharding_feature_flag.idl — verify featureFlagReshardingForTimeseries FCV gating (check_against_fcv: legacy_fcv_snapshot_only) in mixed-version contexts.
      • jstests/sharding/libs/resharding_test_fixture.js — how does ReshardingTest initialize chunk ranges for timeseries, and does this path differ between 8.0 and master?

      Short-Term Workaround

      multiversion_incompatible has been added back to the four abort tests in SERVER-126217 to unblock CI.

            Assignee:
            Anja Kalaba
            Reporter:
            Anja Kalaba