  1. MongoDB Database Tools
  2. TOOLS-1672

Improve docs for forceTableScan

    • Type: Task
    • Resolution: Unresolved
    • Priority: Minor - P4
    • None
    • Affects Version/s: None
    • Component/s: mongodump
    • Labels:

      This ticket is really a documentation update request, but I figure it's going to require technical verification, so I'm creating it here in Tools first.

      I was caught out in some testing when I did a mongodump + mongorestore against a server using WiredTiger (all v3.4). I wanted the data to be re-inserted in _id order. I was not using the --forceTableScan option of mongodump or the --numInsertionWorkersPerCollection option of mongorestore, but by looking at the recordId values while iterating the documents by _id, I found the re-inserted documents were in an order that looked exactly like the original collection-scan order.

      Pseudo repro code
      db.coll.find().showRecordId().sort({_id: 1}).forEach(function (doc) { print(doc._id + ": " + doc.$recordId); });
      1: 1
      2: 605003
      3: 1277000
      4: 1805464
      5: 2
      2: 605004
      3: 1277001
      db.coll.find().showRecordId().sort({_id: 1}).forEach(function (doc) { print(doc._id + ": " + doc.$recordId); });
      1: 1
      2: 621037
      3: 1284322
      4: 1791763
      5: 2
      2: 621038
      3: 1284324
      // (same story at high _id values, not just those starting from 1)
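
      The check I was doing by eye can be sketched as a small function. This is only an illustration in plain JavaScript: recordIdsFollowIdOrder is a hypothetical helper, not part of the mongo shell, and the sample array is just the recordId column of the first listing above, in the order it was printed.

```javascript
// Hypothetical helper (not a mongo shell API): given recordIds listed in
// ascending _id order, report whether the recordIds are strictly ascending too.
function recordIdsFollowIdOrder(recordIds) {
  for (var i = 1; i < recordIds.length; i++) {
    if (recordIds[i] <= recordIds[i - 1]) {
      return false; // storage order diverges from _id order at this point
    }
  }
  return true;
}

// recordId column of the first listing above, in printed order.
var observedRecordIds = [1, 605003, 1277000, 1805464, 2, 605004, 1277001];
var storedInIdOrder = recordIdsFollowIdOrder(observedRecordIds);
```

      If the restore had really inserted documents in _id order, a check like this should come back true for the restored collection.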

      I spent a fair bit of time working this out, as I had a big dataset whose exact document data I wanted to reuse, just rearranged more serially in the underlying storage for performance-comparison purposes.
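
      For what it's worth, one way to get that rearrangement without relying on dump order might be to copy the documents in the shell, sorted by _id. This is a minimal sketch, assuming a single-threaded copy and that the source and target arguments are mongo shell collection objects; copyInIdOrder is a hypothetical name, and I have not measured how this performs on a large dataset.

```javascript
// Hypothetical helper: copy every document from `source` into `target`,
// inserting one at a time in ascending _id order.
// `source` and `target` are assumed to be mongo shell collection objects.
function copyInIdOrder(source, target) {
  var copied = 0;
  source.find().sort({ _id: 1 }).forEach(function (doc) {
    target.insert(doc);
    copied += 1;
  });
  return copied; // number of documents inserted
}
```

      In the shell this would be called as something like copyInIdOrder(db.coll, db.coll_sorted), where coll_sorted is a made-up name for an empty target collection.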

      It seems my expectation that the data would be dumped in _id order is outdated. Reading the code today, I found that expectation was valid for MMAPv1 because of the hint shown below, so long as you don't specify a query or a sort. (Code taken from the v3.2 branch, but I assume it is the same in 3.0 through 3.4.)

      void fillOutPlannerParams(OperationContext* txn,
                                Collection* collection,
                                CanonicalQuery* canonicalQuery,
                                QueryPlannerParams* plannerParams) {
          // MMAPv1 storage engine should have snapshot() perform an index scan on _id rather than a
          // collection scan since a collection scan on the MMAP storage engine can return duplicates
          // or miss documents.
          if (isMMAPV1()) {
              plannerParams->options |= QueryPlannerParams::SNAPSHOT_USE_ID;

      But without that hint being added (as it is for the MMAP case), it seems that you will get a plain collection scan.

      Status QueryPlanner::plan(const CanonicalQuery& query,
                                const QueryPlannerParams& params,
                                std::vector<QuerySolution*>* out) {
          // If snapshot is set, default to collscanning. If the query param SNAPSHOT_USE_ID is set,
          // snapshot is a form of a hint, so try to use _id index to make a real plan. If that fails,
          // just scan the _id index.
          // Don't do this if the query is a geonear or text as as text search queries must be answered
          // using full text indices and geoNear queries must be answered using geospatial indices.
          if (query.getParsed().isSnapshot() &&
              !QueryPlannerCommon::hasNode(query.root(), MatchExpression::GEO_NEAR) &&
              !QueryPlannerCommon::hasNode(query.root(), MatchExpression::TEXT)) {
              const bool useIXScan = params.options & QueryPlannerParams::SNAPSHOT_USE_ID;
              if (!useIXScan) {
                  QuerySolution* soln = buildCollscanSoln(query, isTailable, params);
                  if (soln) {
                  return Status::OK();
              } else {
                  // Find the ID index in indexKeyPatterns. It's our hint.
                  for (size_t i = 0; i < params.indices.size(); ++i) {
                      if (isIdIndex(params.indices[i].keyPattern)) {
                          hintIndex = params.indices[i].keyPattern;
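
      Putting the two excerpts together, my reading of the decision is roughly the following. This is a simplified model in JavaScript, not the server code: chooseSnapshotPlan and the isSnapshot, hasGeoNear and hasText fields are names made up for illustration, and the fallback described in the comment ("if that fails, just scan the _id index") is not modeled.

```javascript
// Simplified, hypothetical model of the snapshot planning decision shown in
// the two C++ excerpts above. All names here are invented for illustration.
function chooseSnapshotPlan(query, storageEngine) {
  // Mirrors fillOutPlannerParams: only MMAPv1 sets SNAPSHOT_USE_ID.
  var snapshotUseId = storageEngine === "mmapv1";

  // Mirrors the guard in QueryPlanner::plan: non-snapshot, geoNear and text
  // queries do not take the snapshot shortcut.
  if (!query.isSnapshot || query.hasGeoNear || query.hasText) {
    return "regular planning";
  }

  // Snapshot query: _id index scan only when the MMAPv1 hint was set.
  return snapshotUseId ? "index scan on _id" : "collection scan";
}

var wiredTigerPlan = chooseSnapshotPlan({ isSnapshot: true }, "wiredTiger");
var mmapPlan = chooseSnapshotPlan({ isSnapshot: true }, "mmapv1");
```

      Under this model a snapshot query on WiredTiger resolves to a collection scan, which would match the recordId order I saw.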

      So the real request of this ticket is this: I think the "it's typically _id order" advice given in the mongodump documentation quoted below is now misleading, since WiredTiger is the more common storage engine these days, and we should remove it.


      Forces mongodump to scan the data store directly: typically, mongodump saves entries as they appear in the index of the _id field. If you specify a query --query, mongodump will use the most appropriate index to support that query.

      Use --forceTableScan to skip the index and scan the data directly. Typically there are two cases where this behavior is preferable to the default:

      • If you have key sizes over 800 bytes that would not be present in the _id index.
      • Your database uses a custom _id field.

      When you run with --forceTableScan, mongodump does not use $snapshot. As a result, the dump produced by mongodump can reflect the state of the database at many different points in time.

      Use --forceTableScan with extreme caution and consideration.

      If you agree I can make suggestions for the wording and port it to a docs ticket.

            Assignee: Unassigned
            Reporter: Akira Kurogane (akira.kurogane)