Core Server / SERVER-46845

Shard which received a StaleShardVersion can get stuck indefinitely in a moveChunk command

    Details

    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Backport Requested:
      v4.4
    • Sprint:
      Sharding 2020-03-23, Sharding 2020-04-06, Sharding 2020-04-20
    • Linked BF Score:
      15

      Description

      When a shard updates its knowledge of its shard version after a migration commit, it logs a message that looks like this:

      [ShardedClusterFixture:job0:shard1:primary] 2020-02-04T15:18:15.510+0000 I  COMMAND  [conn291] command admin.$cmd appName: "tid:54" command: getMore { getMore: 3975999160804323048, collection: "$cmd.aggregate", lsid: { id: UUID("d2eb0b2e-2ff8-4263-ab12-e5f9514ff6a4"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, $clusterTime: { clusterTime: Timestamp(1580829495, 39), signat
      [ShardedClusterFixture:job0:shard1:primary] 2020-02-04T15:18:11.560+0000 I  SHARDING [conn55] Updating metadata for collection config.system.sessions from collection version: 15|0||5e398add924cca4d6c4487b2, shard version: 0|0||5e398add924cca4d6c4487b2 to collection version: 16|0||5e398add924cca4d6c4487b2, shard version: 16|0||5e398add924cca4d6c4487b2 due to version change
      

      This log line comes from here, and if we zoom inside CollectionMetadata::toStringBasic(), the call that logs the current shard version invokes ChunkManager::getVersion(ShardId).

      If the CatalogCache's entry for a collection gets invalidated with the local shard id while a migration is running concurrently, the completion of the chunk migration can get stuck indefinitely: ChunkManager::getVersion will keep throwing ShardInvalidatedForTargetingInfo exceptions, and each failed attempt will be retried under refreshFilteringMetadataUntilSuccess.
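
      The failure mode can be illustrated with a toy model. This is only a sketch and not the server code: ToyRoutingInfo, ShardInvalidated, updateMetadataAndLog, refreshUntilSuccess and the maxAttempts bound are all hypothetical names invented for the illustration. It assumes that the stale marker is never cleared inside the retry loop, which is the situation described above. The bound exists only so the sketch terminates; the real loop has no such limit.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the ShardInvalidatedForTargetingInfo error.
struct ShardInvalidated : std::runtime_error {
    ShardInvalidated() : std::runtime_error("shard invalidated for targeting") {}
};

// Toy routing entry: once the local shard is marked stale, every version
// lookup throws until something outside the retry loop clears the flag.
struct ToyRoutingInfo {
    bool localShardStale = false;
    int shardVersion = 16;

    int getVersion() const {
        if (localShardStale) {
            throw ShardInvalidated();
        }
        return shardVersion;
    }
};

// Models the metadata update: building the log message formats the shard
// version, which goes through the throwing getVersion() lookup.
std::string updateMetadataAndLog(const ToyRoutingInfo& info) {
    return "Updating metadata ... shard version: " + std::to_string(info.getVersion());
}

// Toy retry loop. Returns the attempt number that succeeded, or -1 if all
// maxAttempts attempts failed (an assumption added so the sketch terminates).
int refreshUntilSuccess(const ToyRoutingInfo& info, int maxAttempts) {
    for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
        try {
            updateMetadataAndLog(info);
            return attempt;
        } catch (const ShardInvalidated&) {
            // Nothing here clears the stale flag, so the next attempt
            // fails in exactly the same way.
        }
    }
    return -1;
}
```

      With a healthy entry the first attempt succeeds; with the stale flag set, no number of retries helps, because the exception is raised by the logging path itself on every pass.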

      I think the bug is not currently reproducing because, after the logging changes were committed, this line is no longer logged in the test output. For example, here.

            People

            Assignee:
            blake.oler Blake Oler
            Reporter:
            kaloian.manassiev Kaloian Manassiev