SERVER-38472

A config server can return early for a shardCollection command even if the shard hasn't finished its own shardCollection command

    Details

    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Backport Requested:
      v4.0
    • Sprint:
      Sharding 2018-12-31, Sharding 2019-01-14
    • Linked BF Score:
      71

      Description

      1. A config server calls shardCollection.
      2. The shard begins shardCollection.
      3. The shard writes chunks and the metadata entry for the collection.
      4. The config server steps down, cancelling its shardCollection command.
      5. A new config server steps up.
      6. The new config server retries the shardCollection command.
      7. The new config server sees that the metadata entry for the collection has been written and erroneously assumes that its existence means the shard has finished its shardCollection command. It therefore returns early and releases the distributed lock, which allows chunk migrations and splits to proceed.
      8. A subsequent moveChunk operation can then acquire the collection distributed lock and attempt to acquire the critical section, which currently crashes the server.

      A config server should not be able to return early if the shard's shardCollection command is not complete.
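
      The sequence above can be modeled with a small Python sketch. It is only an illustration of the race, not the server's actual code or the fix that was applied: the `Shard` and `ConfigServer` classes, the `finished` flag, and the `retry_flawed` / `retry_guarded` methods are all hypothetical names introduced here. The sketch assumes the shard can expose some explicit completion signal that is separate from the metadata entry.

```python
from dataclasses import dataclass


@dataclass
class Shard:
    """Hypothetical stand-in for the shard running its own shardCollection."""
    metadata_written: bool = False  # step 3: chunks and metadata entry exist
    finished: bool = False          # assumed signal: the shard's command has completed

    def begin_shard_collection(self) -> None:
        # The metadata entry is written before the command as a whole completes.
        self.metadata_written = True


@dataclass
class ConfigServer:
    """Hypothetical config server that holds the collection distributed lock."""
    shard: Shard
    dist_lock_held: bool = True

    def retry_flawed(self) -> bool:
        # Flawed check (step 7): treats the metadata entry as proof of completion.
        if self.shard.metadata_written:
            self.dist_lock_held = False
            return True
        return False

    def retry_guarded(self) -> bool:
        # Guarded check: return only once the shard reports that it has finished.
        if self.shard.metadata_written and self.shard.finished:
            self.dist_lock_held = False
            return True
        return False


def move_chunk_can_start(config: ConfigServer) -> bool:
    # Step 8: moveChunk needs the collection distributed lock to be free.
    return not config.dist_lock_held


if __name__ == "__main__":
    shard = Shard()
    shard.begin_shard_collection()  # steps 2-3, then the old config server steps down

    flawed = ConfigServer(shard)    # steps 5-6: a new config server retries
    flawed.retry_flawed()
    print("flawed retry lets moveChunk in:", move_chunk_can_start(flawed))

    guarded = ConfigServer(shard)
    guarded.retry_guarded()
    print("guarded retry lets moveChunk in:", move_chunk_can_start(guarded))
```

      In this model the flawed retry releases the lock while `finished` is still false, so a moveChunk can start during the shard's shardCollection. The guarded retry keeps the lock until the shard reports completion.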
