  1. MongoDB Database Tools
  2. TOOLS-2770

Allow for mongodump to parallelize a single collection.

Details

    • New Feature
    • Status: Accepted
    • Major - P3
    • Resolution: Unresolved
    • None
    • 100.7.0
    • mongodump
    • None

    Description

      When reading data that suffers from high network and I/O latency, a common technique to improve throughput is to parallelize the workload.

      For mongodump, a parallelized scan of a single collection could work as follows:

      1. $sample the collection and record a good number of _id values.
      2. Sort the _id values and add pairs of [start _id, end _id] into a work queue.
      3. Have some configurable number of worker threads pop work items, perform range queries and write the results.
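
      The three steps above can be sketched roughly as follows. This is only an illustration in Python, not mongodump's implementation: an in-memory list stands in for the collection, random.sample stands in for $sample, and a list filter stands in for a range query on _id. The names sample_ids, build_ranges, range_query and dump_parallel are hypothetical. The sketch also assumes half-open ranges with open-ended first and last ranges, so that documents outside the sampled _id values are still covered.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def sample_ids(docs, sample_size, seed=0):
    """Step 1: stand-in for $sample; pick a random subset of _id values."""
    rng = random.Random(seed)
    ids = [d["_id"] for d in docs]
    return rng.sample(ids, min(sample_size, len(ids)))


def build_ranges(sampled_ids):
    """Step 2: sort the sampled ids and pair them into [start, end) ranges.

    None marks an unbounded side (an assumption of this sketch), so the
    first and last ranges also cover ids below and above the sample.
    """
    bounds = sorted(set(sampled_ids))
    edges = [None] + bounds + [None]
    return list(zip(edges[:-1], edges[1:]))


def range_query(docs, start, end):
    """Stand-in for a range query such as {_id: {$gte: start, $lt: end}}."""
    return [
        d for d in docs
        if (start is None or d["_id"] >= start)
        and (end is None or d["_id"] < end)
    ]


def dump_parallel(docs, sample_size=4, num_workers=3):
    """Step 3: a configurable number of workers each process ranges."""
    ranges = build_ranges(sample_ids(docs, sample_size))
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        chunks = pool.map(lambda r: range_query(docs, *r), ranges)
        return [doc for chunk in chunks for doc in chunk]
```

      Because the ranges are half-open and contiguous, each document falls into exactly one range, so no document is written twice or skipped. How evenly the work is split depends on how well the sample reflects the _id distribution.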

      This algorithm has trade-offs: sequential scans are replaced with random-access lookups, so I wouldn't recommend making it the default.

      Additionally, the feature is less valuable when mongodump is already distributing work by scanning multiple collections in parallel. It's not clear to me if or how the existing --numParallelCollections option should interact with this.

            People

              Assignee: Unassigned
              Reporter: Daniel Gottlieb (daniel.gottlieb@mongodb.com)