Core Server / SERVER-103379

dispatchShardPipeline uses stale CRI when targeting a specific mergeShardId, leading to assertion failure

    • Type: Task
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Query Optimization

      When dispatchShardPipeline dispatches to a specific mergeShardId, it continues to use the CollectionRoutingInfo (CRI) of the originally targeted namespace. The database primary shard recorded in that CRI may not match the shard the pipeline is actually sent to.

      This violates the database version protocol and trips the following tassert in buildDatabaseVersionedRequest:

          tassert(
              10162102,
              fmt::format("Expected exactly one shard matching the database primary shard when no shard "
                          "version is required. Found shard: {}, expected: {}, for namespace: {}.",
                          shardId.toString(),
                          cri.getDbPrimaryShardId().toString(),
                          nss.toStringForErrorMsg()),
              shardId == cri.getDbPrimaryShardId());
      

      Here is an example of a failing patch

      This happens when:

      • An aggregation (for example, one using $listClusterCatalog) runs against a user database.
      • The pipeline sets mergeShardId = "config".
      • The CRI passed to dispatchShardPipeline still corresponds to the original user database, whose primary shard is not "config".

      The result is an assertion failure, because the shardId used for dispatch does not match cri.getDbPrimaryShardId().
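
      The mismatch described above can be modeled in a few lines. This is only a sketch under stated assumptions: RoutingInfoModel, isValidDbVersionedTarget, and reproducesMismatch are hypothetical stand-ins rather than MongoDB server code, and the primary shard name "shard0" is invented for illustration. The only part taken from the ticket is the condition itself, shardId == cri.getDbPrimaryShardId().

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified stand-in for CollectionRoutingInfo.
// It keeps only the field that the tassert condition inspects.
struct RoutingInfoModel {
    std::string dbName;
    std::string dbPrimaryShardId;
};

// Models the tassert condition: a database-versioned request is valid only
// when the target shard equals the primary shard recorded in the routing info.
bool isValidDbVersionedTarget(const RoutingInfoModel& cri,
                              const std::string& targetShardId) {
    return targetShardId == cri.dbPrimaryShardId;
}

// Reproduces the failing scenario: routing info for a user database
// (primary "shard0", an assumed name) combined with mergeShardId = "config".
bool reproducesMismatch() {
    const RoutingInfoModel userDbCri{"userDb", "shard0"};
    const std::string mergeShardId = "config";
    return !isValidDbVersionedTarget(userDbCri, mergeShardId);
}
```

      In this model, reproducesMismatch() returns true because the routing info still describes the user database while the request is sent to "config".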

      Another example in this log:

      • The aggregation ran on the test database, which has primary shard shard-rs1.
      • The $merge stage wrote to a collection (outColl) that was already created on shard-rs0.
      • As a result, the aggregation was dispatched to shard-rs0 for execution.
      • However, the CollectionRoutingInfo used was still for the test database (primary: shard-rs1), causing a mismatch with the target shard and triggering the tassert above.

      NOTE: For now, we have added an exception that skips the tassert. The intention is to remove that exception and re-enable the tassert, so that it catches future misuse when versioning a request.

      To summarize: when dispatching a database-versioned request to a specific shard, you must use a CollectionRoutingInfo (CRI) whose dbPrimaryShardId matches the target shard.
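
      One way to picture this rule is a lookup that pairs a target shard only with routing info whose primary shard matches it. This is a hedged sketch, not the actual fix: RoutingInfoModel and selectRoutingInfoForShard are hypothetical names, and the candidate list is an assumption about what routing information a caller might have on hand.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for CollectionRoutingInfo.
struct RoutingInfoModel {
    std::string dbName;
    std::string dbPrimaryShardId;
};

// Returns the first candidate whose primary shard equals the target shard,
// or std::nullopt when none matches. Returning nullopt lets the caller
// handle the mismatch explicitly instead of versioning the request with
// routing info that belongs to a different shard.
std::optional<RoutingInfoModel> selectRoutingInfoForShard(
    const std::vector<RoutingInfoModel>& candidates,
    const std::string& targetShardId) {
    for (const auto& cri : candidates) {
        if (cri.dbPrimaryShardId == targetShardId) {
            return cri;
        }
    }
    return std::nullopt;
}
```

      With candidates for the test database (primary shard-rs1) and an assumed entry whose primary is "config", a request targeted at "config" is paired with the second entry, while a request targeted at shard-rs0 finds no match and has to be handled by the caller.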

            Assignee: Unassigned
            Reporter: Meryama Nadim (meryama.nadim@mongodb.com)
            Votes: 0
            Watchers: 6
