  1. Core Server
  2. SERVER-45119

CollectionShardingState::getCurrentShardVersionIfKnown returns collection version instead of shard version

    • Backwards Compatibility: Fully Compatible
    • Operating System: ALL
    • Sprint: Sharding 2019-12-30, Sharding 2020-04-20

      Issue Status as of Jul 31, 2020

      ISSUE DESCRIPTION AND IMPACT

      A bug in shard version checking causes a race condition between parallel chunk migrations and auto-split activity.

      If the race condition occurs, an affected shard becomes unable to update its sharding metadata, and operations that require data from that shard will fail.

      While it is possible for the issue to clear on its own, it is likely to persist until action is taken.

      DIAGNOSIS AND AFFECTED VERSIONS

      Sharded clusters with two or more shards running MongoDB versions <=4.2.5, including 4.0.17, are affected. The bug is much more likely to be triggered on 4.0.17 than on other versions.

      If the bug is triggered, client operations begin failing with "version mismatch detected" (StaleConfig) errors, and the corresponding mongos logs include "requested shard version differs from config shard version" error messages.

      REMEDIATION AND WORKAROUNDS

      If running MongoDB version 4.0.17, downgrade to 4.0.16 or upgrade to 4.0.18 when it becomes available.

      If running MongoDB version 4.2.5, upgrade to version 4.2.6 when it becomes available.

      If a version change is not possible, this issue can be partially mitigated by:

      • Disabling the balancer.
      • Waiting for the balancer to stop running.
      • Running the following command on the primary replica set member of each shard, replacing ns with the full namespace ("<database>.<collection>") of the affected collection, given as a string:
      db.adminCommand({_flushRoutingTableCacheUpdates: ns, syncFromConfig: true})
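
The command above can be wrapped in a small helper so that the namespace is validated before it is sent. This is a minimal sketch, not an official tool: buildFlushCommand and flushRoutingTable are hypothetical names, and the only assumption about the connection object is that it exposes adminCommand(), as the shell's db object does.

```javascript
// Hypothetical helper: builds the command document used in the mitigation step.
// The command name must be the first key, which insertion order guarantees here.
function buildFlushCommand(ns) {
  if (typeof ns !== "string" || !/^[^.]+\..+$/.test(ns)) {
    throw new Error('expected a namespace of the form "<database>.<collection>"');
  }
  return { _flushRoutingTableCacheUpdates: ns, syncFromConfig: true };
}

// Hypothetical helper: sends the command through any object that exposes
// adminCommand() and fails loudly if the server does not report ok: 1.
function flushRoutingTable(conn, ns) {
  const res = conn.adminCommand(buildFlushCommand(ns));
  if (!res || res.ok !== 1) {
    throw new Error("flush failed for " + ns + ": " + JSON.stringify(res));
  }
  return res;
}
```

When connected to the primary of a shard, this would be called as flushRoutingTable(db, "test.users"), where "test.users" is a placeholder namespace. It has to be repeated on the primary of every shard, and only after the balancer has stopped.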
      

      If you re-enable the balancer, the bug can be triggered again.

      Note: If the sharded cluster is running with authentication enabled, you need to grant the internal action on the cluster resource in order to run the _flushRoutingTableCacheUpdates command.

      One way to do this is to create a new role with the internal privilege on the cluster resource and then grant that role to the admin user, as shown below. Replace ADMIN_USER with the username of the admin user.

      use admin;
      db.createRole({
        role: "flush_routing_table_cache_updates",
        privileges: [
           { resource: { cluster: true }, actions: [ "internal" ] },
        ],
        roles: [  ]
      });
      
      db.grantRolesToUser("ADMIN_USER", ["flush_routing_table_cache_updates"])
      

      FIX VERSIONS

      4.2.6 and 4.0.18

      original description

      This function should call getShardVersion() instead of getCollVersion(). Its only usage is here. Fortunately, the check here is still valid even though we were returning the collection version. Basically, if a shard knows about collection version X and shard version Y, then it is not possible for the actual shard version to be between X and Y, because otherwise the shard would know about it.
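
The distinction between the two versions can be illustrated with a toy model. This is a sketch under simplifying assumptions, not the server's implementation: chunk versions are reduced to a {major, minor} pair with the epoch omitted, a shard that owns no chunks is modeled as version 0|0, and all function names and sample data are hypothetical. The collection version is taken as the highest chunk version across all shards, and the shard version as the highest chunk version among the chunks owned by one shard.

```javascript
// Illustrative model only: orders versions by major, then minor.
function compareVersions(a, b) {
  return a.major !== b.major ? a.major - b.major : a.minor - b.minor;
}

// Highest version in a list of chunks; 0|0 when the list is empty.
function maxVersion(chunks) {
  return chunks.reduce(
    (best, c) => (compareVersions(c.version, best) > 0 ? c.version : best),
    { major: 0, minor: 0 }
  );
}

// Collection version: highest chunk version across every shard.
function collectionVersion(chunks) {
  return maxVersion(chunks);
}

// Shard version: highest chunk version among chunks owned by shardId.
function shardVersion(chunks, shardId) {
  return maxVersion(chunks.filter((c) => c.shard === shardId));
}

// Made-up sample data for two shards.
const chunks = [
  { shard: "shardA", version: { major: 3, minor: 1 } },
  { shard: "shardB", version: { major: 4, minor: 0 } },
  { shard: "shardA", version: { major: 2, minor: 5 } },
];
```

With this sample data, collectionVersion(chunks) is 4|0 while shardVersion(chunks, "shardA") is 3|1, so a function that returns the collection version in place of the shard version reports a higher value for shardA than shardA's own chunks justify.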

        1. SERVER-45119 - Repro.js
          1.0 kB
          Kaloian Manassiev

            Assignee:
            kaloian.manassiev@mongodb.com Kaloian Manassiev
            Reporter:
            matthew.saltz@mongodb.com Matthew Saltz (Inactive)
            Votes:
            1
            Watchers:
            24