Core Server / SERVER-46845

Shard which received a StaleShardVersion can get stuck indefinitely in a moveChunk command

    • Fully Compatible
    • ALL
    • v4.4
    • Sharding 2020-03-23, Sharding 2020-04-06, Sharding 2020-04-20
    • 15

      When a shard updates its knowledge of its shard version after a migration commit, it logs a message like this:

      [ShardedClusterFixture:job0:shard1:primary] 2020-02-04T15:18:15.510+0000 I  COMMAND  [conn291] command admin.$cmd appName: "tid:54" command: getMore { getMore: 3975999160804323048, collection: "$cmd.aggregate", lsid: { id: UUID("d2eb0b2e-2ff8-4263-ab12-e5f9514ff6a4"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, $clusterTime: { clusterTime: Timestamp(1580829495, 39), signat
      [ShardedClusterFixture:job0:shard1:primary] 2020-02-04T15:18:11.560+0000 I  SHARDING [conn55] Updating metadata for collection config.system.sessions from collection version: 15|0||5e398add924cca4d6c4487b2, shard version: 0|0||5e398add924cca4d6c4487b2 to collection version: 16|0||5e398add924cca4d6c4487b2, shard version: 16|0||5e398add924cca4d6c4487b2 due to version change
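
To make the version fields in the SHARDING message easier to read, here is a minimal sketch of a parser for them. It assumes the major|minor||epoch layout that the logged values appear to follow; ParsedVersion and parseVersion are hypothetical names for illustration, not part of the server code.

```cpp
#include <cstddef>
#include <exception>
#include <string>

// Hypothetical holder for one logged version such as
// "16|0||5e398add924cca4d6c4487b2" (assumed layout: major|minor||epoch).
struct ParsedVersion {
    unsigned long majorVersion = 0;
    unsigned long minorVersion = 0;
    std::string epoch;
};

// Returns true and fills `out` if `text` matches the assumed layout,
// false if a separator is missing or a numeric field is not a number.
bool parseVersion(const std::string& text, ParsedVersion& out) {
    const std::size_t bar = text.find('|');
    if (bar == std::string::npos) return false;
    const std::size_t doubleBar = text.find("||", bar + 1);
    if (doubleBar == std::string::npos) return false;

    const std::string majorText = text.substr(0, bar);
    const std::string minorText = text.substr(bar + 1, doubleBar - bar - 1);
    try {
        std::size_t used = 0;
        out.majorVersion = std::stoul(majorText, &used);
        if (used != majorText.size()) return false;  // trailing junk
        out.minorVersion = std::stoul(minorText, &used);
        if (used != minorText.size()) return false;
    } catch (const std::exception&) {
        return false;  // std::stoul rejects empty or non-numeric fields
    }
    out.epoch = text.substr(doubleBar + 2);
    return true;
}
```

Parsing the two shard versions from the message above with this sketch gives major versions 0 and 16 with the same epoch, which is the change the log line reports.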
      

      This log line is built using CollectionMetadata::toStringBasic(), and inside that function the call to log the current shard version invokes ChunkManager::getVersion(ShardId).

      If the CatalogCache's entry for a collection gets invalidated for the local shard id while a migration is running concurrently, the completion of the chunk migration can get stuck indefinitely: ChunkManager::getVersion will keep throwing ShardInvalidatedForTargetingInfo exceptions, and the operation will keep being retried under refreshFilteringMetadataUntilSuccess.
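
The failure mode can be illustrated with a toy model. This is only a sketch under stated assumptions, not the server's implementation: ToyChunkManager, ShardInvalidatedError, describeForLog and refreshUntilSuccess are hypothetical stand-ins, the retry loop is bounded so the example terminates, and it assumes that nothing inside the retry clears the invalidation.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the ShardInvalidatedForTargetingInfo error.
struct ShardInvalidatedError : std::runtime_error {
    ShardInvalidatedError() : std::runtime_error("shard invalidated for targeting") {}
};

// Toy routing-table entry. `invalidated` models the cache entry having been
// invalidated for the local shard; nothing in this sketch ever clears it.
struct ToyChunkManager {
    bool invalidated = false;
    int shardMajorVersion = 0;

    int getVersion() const {
        if (invalidated) throw ShardInvalidatedError();
        return shardMajorVersion;
    }
};

// Plays the role of toStringBasic(): building the log text reads the version,
// so logging itself can throw.
std::string describeForLog(const ToyChunkManager& cm) {
    return "shard version: " + std::to_string(cm.getVersion());
}

// Retry loop in the spirit of refreshFilteringMetadataUntilSuccess, bounded
// here for demonstration. Returns the attempt that succeeded, or -1 if every
// attempt failed.
int refreshUntilSuccess(const ToyChunkManager& cm, int maxAttempts) {
    assert(maxAttempts > 0);
    for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
        try {
            describeForLog(cm);  // the logging step is what throws
            return attempt;
        } catch (const ShardInvalidatedError&) {
            // Retry. The flag is unchanged, so the next attempt fails the same way.
        }
    }
    return -1;
}
```

With `invalidated` set, every attempt fails identically no matter how large `maxAttempts` is. An unbounded version of the same loop would therefore never return, which matches the stuck moveChunk described above.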

      I think the bug is currently not happening because, after the logging changes were committed, this line is somehow no longer logged in the test output. For example here.

            Assignee:
            blake.oler@mongodb.com Blake Oler
            Reporter:
            kaloian.manassiev@mongodb.com Kaloian Manassiev