MongoDB Database Tools / TOOLS-2770

Allow mongodump to parallelize the dump of a single collection.

    • Type: New Feature
    • Resolution: Unresolved
    • Priority: Major - P3
    • 100.8.1
    • Affects Version/s: None
    • Component/s: mongodump
    • Labels:
    • 1,201

      When reads suffer from high network and I/O latency, a common technique to improve throughput is to parallelize the workload.

      For mongodump, a parallelized scan of a single collection could work as follows:

      1. $sample the collection and record a good number of _id values.
      2. Sort the _id values and add pairs of [start _id, end _id] into a work queue.
      3. Have some configurable number of worker threads pop work items, perform range queries and write the results.
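
The three steps above can be sketched roughly as follows. This is a minimal illustration, not mongodump's implementation: `build_ranges`, `in_range`, `parallel_dump`, `fetch_range`, and `write` are hypothetical names, the collection is an in-memory list standing in for a real server, and the open-ended first and last ranges are an added assumption so that documents outside the sampled _id values are still covered.

```python
import random
import threading
from concurrent.futures import ThreadPoolExecutor


def build_ranges(sampled_ids):
    """Step 2: sort the sampled _id values and pair them into [start, end) ranges.

    None marks an open bound (assumption), so the ranges cover the whole key space.
    """
    bounds = sorted(set(sampled_ids))
    edges = [None] + bounds + [None]
    return list(zip(edges[:-1], edges[1:]))


def in_range(value, start, end):
    """True if start <= value < end, treating None as unbounded."""
    return (start is None or value >= start) and (end is None or value < end)


def parallel_dump(sampled_ids, fetch_range, write, num_workers=4):
    """Step 3: workers take ranges, run one range query each, and write the results.

    The executor's internal queue plays the role of the work queue.
    Returns the total number of documents written.
    """
    def dump_one(bounds):
        start, end = bounds
        count = 0
        for doc in fetch_range(start, end):
            write(doc)
            count += 1
        return count

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return sum(pool.map(dump_one, build_ranges(sampled_ids)))


# In-memory stand-in for a collection (hypothetical data, not a real server).
collection = [{"_id": i, "v": i * i} for i in range(1000)]
out, out_lock = [], threading.Lock()


def fetch_range(start, end):
    # Stand-in for a range query on _id.
    return [d for d in collection if in_range(d["_id"], start, end)]


def write(doc):
    # Stand-in for writing to the dump output; the lock serializes writers.
    with out_lock:
        out.append(doc)


# Step 1: a random pick of _id values stands in for $sample.
sampled = random.Random(0).sample(range(1000), 20)
total = parallel_dump(sampled, fetch_range, write, num_workers=4)
```

Against a real deployment, the sample would presumably come from a `$sample` aggregation stage projected down to `_id`, and `fetch_range` would issue a query with `$gte`/`$lt` bounds on `_id`. Note that documents are written in whatever order the workers finish, not in `_id` order.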

      This algorithm has trade-offs: sequential scans are replaced with random-access lookups, so I wouldn't recommend making it the default.

      Additionally, the feature is less valuable when mongodump is already distributing work by scanning multiple collections in parallel. It's not clear to me if or how the existing --numParallelCollections option should interact with this.

            Assignee: Unassigned
            Reporter: Daniel Gottlieb (Inactive)