MongoDB Database Tools / TOOLS-1310

Mongodump Cursor Timeout on Sharded Clusters with --gzip option

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Affects Version/s: 3.2.6
    • Component/s: mongodump
    • Labels: None
    • Environment:
      AWS EC2 Environment
      Ubuntu 12.04 LTS
      6 x i2.4xlarge instances, in 2 groups of 3
      MongoDB 3.0.12
      18 Shards, 9 shards per group
      1 Primary, 1 Secondary, 1 Hidden Secondary per Group
      MongoDump host running local MongoS
    • Platforms 2016-08-26
    • v3.2

      NOTE: I am aware of TOOLS-902 and that the cause was not found. The workaround of increasing the cursor timeout globally across every MongoD only masks the underlying issue.
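
      For context, the global workaround amounts to raising the idle-cursor timeout on every mongod in the cluster. A rough sketch (the timeout value and host are placeholders; cursorTimeoutMillis defaults to 600000 ms, the 10-minute window discussed below):

          # At startup, on each shard mongod:
          mongod --setParameter cursorTimeoutMillis=3600000 ...

          # Or at runtime, against each shard:
          mongo shard-host:27018 --eval 'db.adminCommand({setParameter: 1, cursorTimeoutMillis: 3600000})'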

      When using the --gzip option with Mongodump, which compresses each collection as it pulls them down from the Cluster, it is possible to encounter "getMore: cursor didn't exist on server" failures during the dump process against sharded collections.
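
      For illustration, an invocation of the failing shape (host, database, and output path are placeholders):

          # Dump a sharded database through the local mongos, compressing as it goes:
          mongodump --host localhost:27017 --db mydb --gzip --out /backup/mydb

          # Against large sharded collections, this can abort partway through with the
          # "getMore: cursor didn't exist on server" failure described above.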

      There are a number of moving parts involved in recreating the failure (some of which are still under investigation). However, one fact is present in every failure: the collection is sharded and distributed across all 18 shards.

      Initially, we suspected an unbalanced shard distribution, as our first failures were against collections that had uneven document distribution (caused by wide size variation of the documents). However, a more recent failure was against a collection which was very evenly balanced, and contained ~150 million documents.

      I believe the core issue is that MongoS opens a cursor against each shard for a given collection, and then depends on Mongodump iterating through the collection fast enough that every cursor on every shard is advanced at least once every 10 minutes (the default idle-cursor timeout). If Mongodump doesn't iterate through the collection fast enough (i.e. gzip compression is slowing it down, or host CPU is consumed by other processes, or one shard holds a large number of the documents vs other shards), the cursors opened on a given shard may be closed due to the timeout.
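
      One way to check this hypothesis (a diagnostic sketch; the shard host/port is a placeholder): each mongod counts the cursors it has timed out, so a counter that climbs on a shard while the dump is running would support the theory.

          # Run against each shard's primary during the dump and watch
          # metrics.cursor.timedOut for increases:
          mongo shard-host:27018 --eval 'printjson(db.serverStatus().metrics.cursor)'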

      This would also mean that MongoDump with --gzip could be very sensitive to the performance of the host it runs on. It would likewise suggest that the likelihood of encountering this issue increases with the number of shards in the cluster.

      Additional notes for consideration:

      • When the --gzip option is excluded, the mongodump works fine (but takes 3-5x longer).
      • When the --gzip option is used AND --numParallelCollections is set to 1, the mongodump succeeds (but again, takes 3-5x longer). Both passing invocations are sketched below.
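
      Sketches of both passing invocations (hosts and paths are placeholders):

          # Without compression:
          mongodump --host localhost:27017 --db mydb --out /backup/mydb

          # With compression, but dumping one collection at a time:
          mongodump --host localhost:27017 --db mydb --gzip --numParallelCollections 1 --out /backup/mydb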

      It makes sense that, in TOOLS-902, increasing the global cursor timeout helped resolve the issue for one customer. However, I don't believe this is a genuine fix for the problem.

      Suggestions:

      • Setting the cursor batch size to a smaller number (e.g. the default divided by the number of shards)? Or making it an option for the end user to specify when they hit this issue? (See the sketch after this list.)
      • Encouraging the SERVER team to implement per-cursor timeout values to allow a custom longer window (vs using the global method)?
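
      To illustrate the batch-size suggestion (a sketch only; mongodump itself exposes no batch-size option, and the host/collection names are placeholders): a smaller batch means the client issues getMore commands more often, touching the per-shard cursors more frequently.

          # In the shell, the analogous knob is cursor.batchSize(); each
          # exhausted batch triggers a getMore back through mongos:
          mongo mongos-host:27017/mydb --eval '
              var c = db.mycoll.find().batchSize(100);
              while (c.hasNext()) { c.next(); }
          '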

            Assignee:
            gabriel.russell@mongodb.com Gabriel Russell (Inactive)
            Reporter:
            dave.muysson@360pi.com Dave Muysson
            Votes:
            0
            Watchers:
            1
