  1. Core Server
  2. SERVER-45119

CollectionShardingState::getCurrentShardVersionIfKnown returns collection version instead of shard version


    Details

    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Sprint:
      Sharding 2019-12-30, Sharding 2020-04-20
    • Case:

      Description

      Issue Status as of Jul 31, 2020

      ISSUE DESCRIPTION AND IMPACT

      A bug in shard version checking causes a race condition between parallel chunk migrations and auto-split activity.

      If the race condition occurs, an affected shard becomes unable to update its sharding metadata, and operations that require data from that shard will fail.

      While it is possible for the issue to clear on its own, it is likely to persist until action is taken.

      DIAGNOSIS AND AFFECTED VERSIONS

      Sharded clusters with two or more shards running MongoDB 4.2.5 or earlier, including 4.0.17, are affected. The bug is much more likely to be triggered on 4.0.17 than on other versions.

      If the bug is triggered, client operations begin failing with "version mismatch detected" (StaleConfig) errors, and the corresponding mongos logs include "requested shard version differs from config shard version" error messages.
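As a rough aid for the diagnosis above, the following sketch scans an array of log lines for the two error strings quoted in this ticket. The function name `findSymptoms` and the `kind` labels are hypothetical, and the sketch assumes the log has already been read into an array of strings; it only does substring matching and does not parse the log format.

```javascript
// Error strings quoted in this ticket's diagnosis section.
const CLIENT_SYMPTOM = "version mismatch detected";
const MONGOS_SYMPTOM = "requested shard version differs from config shard version";

// Hypothetical helper: returns every line containing one of the two
// symptoms, tagged with which symptom matched.
function findSymptoms(logLines) {
  const hits = [];
  for (const line of logLines) {
    if (line.includes(MONGOS_SYMPTOM)) {
      hits.push({ kind: "mongos", line: line });
    } else if (line.includes(CLIENT_SYMPTOM)) {
      hits.push({ kind: "client", line: line });
    }
  }
  return hits;
}
```

A non-empty result is only a hint: StaleConfig errors also occur transiently during normal operation, so persistent, repeated matches are the signal to look for.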

      REMEDIATION AND WORKAROUNDS

      If running MongoDB version 4.0.17, downgrade to 4.0.16 or upgrade to 4.0.18 when it becomes available.

      If running MongoDB version 4.2.5, upgrade to version 4.2.6 when it becomes available.

      If a version change is not possible, this issue can be partially mitigated by:

      • Disabling the balancer.
      • Waiting for the balancer to stop running.
      • Running the following command on the primary replica set member of each shard:

      // ns is the full namespace string ("<database>.<collection>") of the sharded collection
      db.adminCommand({_flushRoutingTableCacheUpdates: ns, syncFromConfig: true})
      

      If you re-enable the balancer, the bug can be triggered again.
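When several collections are involved, the last mitigation step can be wrapped in a small helper. This is a minimal sketch, not an official procedure: `buildFlushCommand` and `flushNamespaces` are hypothetical names, and the `shardPrimaryDb` argument is assumed to be a shell `db` object connected directly to a shard's primary (anything exposing `adminCommand` will do). The balancer is assumed to be already disabled and idle.

```javascript
// Hypothetical helper: builds the command document shown in this ticket.
// The command name must be the first key, which insertion order guarantees.
function buildFlushCommand(ns) {
  const dot = typeof ns === "string" ? ns.indexOf(".") : -1;
  if (dot <= 0 || dot === ns.length - 1) {
    throw new Error("expected a namespace like <database>.<collection>, got: " + ns);
  }
  return { _flushRoutingTableCacheUpdates: ns, syncFromConfig: true };
}

// Hypothetical helper: runs the flush for each namespace against one shard
// primary and collects the raw command results for inspection.
function flushNamespaces(shardPrimaryDb, namespaces) {
  return namespaces.map(function (ns) {
    return { ns: ns, result: shardPrimaryDb.adminCommand(buildFlushCommand(ns)) };
  });
}
```

In a shell connected to a shard primary, this could be called as `flushNamespaces(db, ["mydb.mycoll"])`, where `mydb.mycoll` is a placeholder namespace, and then repeated on the primary of every other shard.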

      Note: If the sharded cluster is running with authentication enabled, you must grant the internal action on the cluster resource in order to run the _flushRoutingTableCacheUpdates command.

      To do so, create a new role with the internal privilege on the cluster resource and grant that role to the admin user, as below. Replace ADMIN_USER with the admin user's username.

      use admin;
      db.createRole({
        role: "flush_routing_table_cache_updates",
        privileges: [
           { resource: { cluster: true }, actions: [ "internal" ] },
        ],
        roles: [  ]
      });
       
      db.grantRolesToUser("ADMIN_USER", ["flush_routing_table_cache_updates"])
      

      FIX VERSIONS

      4.2.6 and 4.0.18
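The fix versions can be turned into a simple check. The sketch below is an illustration only: `hasFix` and `FIRST_FIXED_PATCH` are hypothetical names, and the function deliberately returns `null` for any release series other than 4.0 and 4.2, because this ticket lists fix versions only for those two.

```javascript
// First patch release containing the fix, per release series, as listed
// under FIX VERSIONS in this ticket.
const FIRST_FIXED_PATCH = { "4.0": 18, "4.2": 6 };

// Hypothetical helper: true if the version contains the fix, false if it
// does not, null if the ticket gives no fix version for that series.
function hasFix(version) {
  const m = /^(\d+)\.(\d+)\.(\d+)/.exec(version);
  if (!m) {
    throw new Error("unrecognized version string: " + version);
  }
  const series = m[1] + "." + m[2];
  if (!Object.prototype.hasOwnProperty.call(FIRST_FIXED_PATCH, series)) {
    return null;
  }
  return Number(m[3]) >= FIRST_FIXED_PATCH[series];
}
```

In the shell, the running server's version string is available from `db.version()`, so `hasFix(db.version())` gives a quick answer for the connected node.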

      ORIGINAL DESCRIPTION

      This should call getShardVersion() instead of getCollVersion(). Its only usage is here. Fortunately, the check here is still valid even though we were returning the collection version: if a shard knows about collection version X and shard version Y, then the actual shard version cannot lie between X and Y, because otherwise the shard would know about it.
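To make the distinction concrete: the collection version is the highest chunk version across all chunks of the collection, while a shard's version is the highest chunk version among the chunks that shard owns. The sketch below is a simplified model, not the server's C++ implementation. Versions are reduced to a `{major, minor}` pair (the epoch is ignored), and all names and the sample chunk list are hypothetical.

```javascript
// Orders two simplified chunk versions: major first, then minor.
function compareVersions(a, b) {
  return a.major !== b.major ? a.major - b.major : a.minor - b.minor;
}

// Highest version in a list of chunks, or null if the list is empty.
function maxVersion(chunks) {
  return chunks.reduce(function (best, chunk) {
    return best === null || compareVersions(chunk.version, best) > 0 ? chunk.version : best;
  }, null);
}

// Collection version: highest version over every chunk in the collection.
function collectionVersion(chunks) {
  return maxVersion(chunks);
}

// Shard version: highest version over the chunks owned by one shard.
function shardVersion(chunks, shardId) {
  return maxVersion(chunks.filter(function (chunk) { return chunk.shard === shardId; }));
}

// Hypothetical routing table with two shards.
const exampleChunks = [
  { shard: "shardA", version: { major: 3, minor: 0 } },
  { shard: "shardA", version: { major: 2, minor: 1 } },
  { shard: "shardB", version: { major: 4, minor: 1 } }
];
```

With this sample data, `collectionVersion(exampleChunks)` is `{major: 4, minor: 1}` while `shardVersion(exampleChunks, "shardA")` is `{major: 3, minor: 0}`, which is the kind of mismatch a caller expecting a shard version would see when handed the collection version instead.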


              People

              Assignee:
              kaloian.manassiev Kaloian Manassiev
              Reporter:
              matthew.saltz Matthew Saltz