  1. MongoDB Database Tools
  2. TOOLS-1672

Improve docs for forceTableScan

    • Type: Task
    • Resolution: Unresolved
    • Priority: Minor - P4
    • None
    • Affects Version/s: None
    • Component/s: mongodump
    • Labels:
      None

      This ticket is really a documentation update request, but I figure it will require technical verification, so I'm creating it here in Tools first.

      I was caught out in some testing when I ran mongodump followed by mongorestore against a server using WiredTiger (all v3.4). I wanted the data to be re-inserted in _id order. I was not using the --forceTableScan option of mongodump or the --numInsertionWorkersPerCollection option of mongorestore, but when I looked at the recordId values while iterating the documents by _id, I found the re-inserted documents were in an order that looked exactly like the original collection-scan order.

      Pseudo repro code
      db.coll.find().sort({_id: 1}).showRecordId().forEach(function (doc) { print(doc._id + ": " + doc.$recordId); });
      1: 1
      2: 605003
      3: 1277000
      4: 1805464
      5: 2
      2: 605004
      3: 1277001
      ...
      mongodump
      mongorestore
      db.coll.find().sort({_id: 1}).showRecordId().forEach(function (doc) { print(doc._id + ": " + doc.$recordId); });
      1: 1
      2: 621037
      3: 1284322
      4: 1791763
      5: 2
      2: 621038
      3: 1284324
      ...
      //(same story at high _id values, not just from 1 ~)
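
      One way to quantify the observation above is to count how often recordId goes backwards while walking the documents in _id order. The following is only a minimal sketch in plain JavaScript: the helper name countRecordIdInversions and the {id, recordId} object shape are made up for illustration, and the sample values are the first five lines of each listing above.

```javascript
// Hypothetical helper (not part of any MongoDB tool): counts how often
// recordId decreases between consecutive documents listed in ascending _id order.
// `pairs` must already be sorted by id.
function countRecordIdInversions(pairs) {
  let inversions = 0;
  for (let i = 1; i < pairs.length; i++) {
    if (pairs[i].recordId < pairs[i - 1].recordId) {
      inversions++;
    }
  }
  return inversions;
}

// First five documents from the listing taken before mongodump.
const beforeDump = [
  { id: 1, recordId: 1 },
  { id: 2, recordId: 605003 },
  { id: 3, recordId: 1277000 },
  { id: 4, recordId: 1805464 },
  { id: 5, recordId: 2 },
];

// First five documents from the listing taken after mongorestore.
const afterRestore = [
  { id: 1, recordId: 1 },
  { id: 2, recordId: 621037 },
  { id: 3, recordId: 1284322 },
  { id: 4, recordId: 1791763 },
  { id: 5, recordId: 2 },
];

console.log(countRecordIdInversions(beforeDump));   // 1
console.log(countRecordIdInversions(afterRestore)); // 1
```

      A count of 0 would mean the storage order follows _id order. Both samples still show an inversion at _id 5, which matches the impression that the restore preserved the original collection-scan order.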
      

      I spent a fair bit of time working this out, as I had a big dataset whose exact document data I wanted to reuse, just rearranged more serially in the underlying storage for performance comparison reasons.
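
      The effect can be illustrated with a toy model. This is a sketch under stated assumptions, not how the tools are implemented: it assumes the restore assigns fresh, increasing recordIds in the order documents arrive (a single insertion worker), and the function names and the five-document collection are invented for the example.

```javascript
// Toy model: a collection is an array of {id, recordId} objects.

// Assumption: restore hands out new recordIds 1, 2, 3, ... in arrival order.
function restore(docsInDumpOrder) {
  return docsInDumpOrder.map((doc, i) => ({ id: doc.id, recordId: i + 1 }));
}

// Dump in storage order, as a plain collection scan would.
function dumpByCollectionScan(coll) {
  return [...coll].sort((a, b) => a.recordId - b.recordId);
}

// Dump in _id order, as a scan of the _id index would.
function dumpByIdIndex(coll) {
  return [...coll].sort((a, b) => a.id - b.id);
}

// List the recordIds as seen when iterating the collection by _id.
function recordIdsInIdOrder(coll) {
  return dumpByIdIndex(coll).map((doc) => doc.recordId);
}

// Made-up collection where _id 5 was inserted second.
const original = [
  { id: 1, recordId: 1 },
  { id: 5, recordId: 2 },
  { id: 2, recordId: 3 },
  { id: 3, recordId: 4 },
  { id: 4, recordId: 5 },
];

const viaCollScan = restore(dumpByCollectionScan(original));
const viaIdIndex = restore(dumpByIdIndex(original));

console.log(recordIdsInIdOrder(original));    // [ 1, 3, 4, 5, 2 ]
console.log(recordIdsInIdOrder(viaCollScan)); // [ 1, 3, 4, 5, 2 ]
console.log(recordIdsInIdOrder(viaIdIndex));  // [ 1, 2, 3, 4, 5 ]
```

      In this model only the _id-ordered dump yields a restored collection whose storage order is serial in _id, which is the rearrangement I was after.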

      It seems my expectation that the data would be dumped in _id order is outdated. Reading the code today, I found that it was valid for MMAPv1 because of the hint added below, so long as you don't specify a query or a sort. (Code taken from the v3.2 branch, but I assume it is the same for 3.0 - 3.4.)

      void fillOutPlannerParams(OperationContext* txn,
                                Collection* collection,
                                CanonicalQuery* canonicalQuery,
                                QueryPlannerParams* plannerParams) {
      ...
      ...
      
          // MMAPv1 storage engine should have snapshot() perform an index scan on _id rather than a
          // collection scan since a collection scan on the MMAP storage engine can return duplicates
          // or miss documents.
          if (isMMAPV1()) {
              plannerParams->options |= QueryPlannerParams::SNAPSHOT_USE_ID;
          }
      }
      

      But without that hint (which is added only in the MMAPv1 case), it seems that you get a plain collection scan.

      Status QueryPlanner::plan(const CanonicalQuery& query,
                                const QueryPlannerParams& params,
                                std::vector<QuerySolution*>* out) {
          ...
          ...
      
          // If snapshot is set, default to collscanning. If the query param SNAPSHOT_USE_ID is set,
          // snapshot is a form of a hint, so try to use _id index to make a real plan. If that fails,
          // just scan the _id index.
          //
          // Don't do this if the query is a geonear or text as text search queries must be answered
          // using full text indices and geoNear queries must be answered using geospatial indices.
          if (query.getParsed().isSnapshot() &&
              !QueryPlannerCommon::hasNode(query.root(), MatchExpression::GEO_NEAR) &&
              !QueryPlannerCommon::hasNode(query.root(), MatchExpression::TEXT)) {
              const bool useIXScan = params.options & QueryPlannerParams::SNAPSHOT_USE_ID;
      
              if (!useIXScan) {
                  QuerySolution* soln = buildCollscanSoln(query, isTailable, params);
                  if (soln) {
                      out->push_back(soln);
                  }
                  return Status::OK();
              } else {
                  // Find the ID index in indexKeyPatterns. It's our hint.
                  for (size_t i = 0; i < params.indices.size(); ++i) {
                      if (isIdIndex(params.indices[i].keyPattern)) {
                          hintIndex = params.indices[i].keyPattern;
                          break;
                      }
                  }
              }
          }
      
          ...
      }
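
      Read together, the two excerpts above reduce to a small decision table. The sketch below is a simplified JavaScript model of that branch only, not the server code: the function name chooseSnapshotPlan and the returned labels are hypothetical, and everything the planner does outside the snapshot branch is collapsed into "REGULAR_PLANNING".

```javascript
// Hypothetical model of the snapshot branch in QueryPlanner::plan.
// Flags mirror the conditions in the quoted C++; the return labels are invented.
function chooseSnapshotPlan({ isSnapshot, hasGeoNear, hasText, isMMAPv1 }) {
  // The snapshot shortcut is skipped for non-snapshot, geoNear, and text queries.
  if (!isSnapshot || hasGeoNear || hasText) {
    return "REGULAR_PLANNING";
  }
  // fillOutPlannerParams sets SNAPSHOT_USE_ID only when the engine is MMAPv1.
  const useIdIndex = isMMAPv1;
  return useIdIndex ? "HINT_ID_INDEX" : "COLLSCAN";
}

const snapshotQuery = { isSnapshot: true, hasGeoNear: false, hasText: false };

console.log(chooseSnapshotPlan({ ...snapshotQuery, isMMAPv1: true }));  // HINT_ID_INDEX
console.log(chooseSnapshotPlan({ ...snapshotQuery, isMMAPv1: false })); // COLLSCAN
```

      Under this reading, a snapshot query on any engine other than MMAPv1, WiredTiger included, takes the collection-scan path.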
      

      So the real request of this ticket is this: I think the 'it's typically _id order' advice in the mongodump documentation quoted below is now misleading, since WiredTiger is more common, and we should remove it.

      --forceTableScan

      Forces mongodump to scan the data store directly: typically, mongodump saves entries as they appear in the index of the _id field. If you specify a query --query, mongodump will use the most appropriate index to support that query.

      Use --forceTableScan to skip the index and scan the data directly. Typically there are two cases where this behavior is preferable to the default:

      • If you have key sizes over 800 bytes that would not be present in the _id index.
      • Your database uses a custom _id field.
      When you run with --forceTableScan, mongodump does not use $snapshot. As a result, the dump produced by mongodump can reflect the state of the database at many different points in time.

      IMPORTANT
      Use --forceTableScan with extreme caution and consideration.

      If you agree, I can suggest wording and port this to a docs ticket.

            Assignee: Unassigned
            Reporter: Akira Kurogane (akira.kurogane)
            Votes: 0
            Watchers: 4

              Created:
              Updated: