
Type: Task
Resolution: Unresolved
Priority: Minor - P4
Fix Version/s: None
Affects Version/s: None
Component/s: mongodump
Labels: None
Epic Link: TOOLS UX quick wins
Days in Current Status: 2,111
Backport Requested: v4.2

This ticket is really a documentation update request, but I expect it will require technical verification, so I'm creating it here in TOOLS first.

I was caught out in some testing when I did a mongodump + mongorestore against a server using WiredTiger (all v3.4). I wanted the data to be re-inserted in _id order. I was not using mongodump's --forceTableScan option or mongorestore's --numInsertionWorkersPerCollection option, but when I looked at the recordId values while iterating the documents by _id, I found the re-inserted documents were in an order that looked exactly like the original collection-scan order.

Pseudo repro code:

db.coll.find({}, ..., {showRecordId: true}).sort({_id: 1}).forEach(print(_id + ": " + recordId));
1: 1
2: 605003
3: 1277000
4: 1805464
5: 2
2: 605004
3: 1277001
...
mongodump
mongorestore
db.coll.find({}, ..., {showRecordId: true}).sort({_id: 1}).forEach(print(_id + ": " + recordId));
1: 1
2: 621037
3: 1284322
4: 1791763
5: 2
2: 621038
3: 1284324
...
// (same story at high _id values, not only those starting from 1)
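
The check I did by eye on the output above can be sketched as a small helper. This is only an illustration: the function name and the row shape ({_id, recordId}) are hypothetical, and the sample rows are just the first five pairs from the pre-dump output.

```javascript
// Hypothetical helper: rows are {_id, recordId} pairs already sorted by _id.
// Returns true only if recordId strictly increases along with _id, i.e. the
// storage (collection-scan) order matches _id order.
function recordIdsFollowIdOrder(rows) {
  for (let i = 1; i < rows.length; i++) {
    if (rows[i].recordId <= rows[i - 1].recordId) {
      return false;
    }
  }
  return true;
}

// First five pairs from the pre-dump output above.
const beforeDump = [
  { _id: 1, recordId: 1 },
  { _id: 2, recordId: 605003 },
  { _id: 3, recordId: 1277000 },
  { _id: 4, recordId: 1805464 },
  { _id: 5, recordId: 2 },
];

console.log(recordIdsFollowIdOrder(beforeDump)); // false
```

If the dump and restore had rewritten the collection in _id order, the same check on the post-restore output would be expected to return true.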

I spent a fair bit of time working this out because I had a big dataset whose exact document data I wanted to reuse, just rearranged more serially in the underlying storage for performance comparison.

It seems my expectation that the data would be dumped in _id order is outdated. Reading the code today, I found that it held for MMAPv1 because of the hint added below, so long as you don't specify a query or a sort. (Code taken from the v3.2 branch, but I assume it is the same in 3.0 - 3.4.)

void fillOutPlannerParams(OperationContext* txn,
                          Collection* collection,
                          CanonicalQuery* canonicalQuery,
                          QueryPlannerParams* plannerParams) {
...
...

    // MMAPv1 storage engine should have snapshot() perform an index scan on _id rather than a
    // collection scan since a collection scan on the MMAP storage engine can return duplicates
    // or miss documents.
    if (isMMAPV1()) {
        plannerParams->options |= QueryPlannerParams::SNAPSHOT_USE_ID;
    }
}

But without that hint being added (as it is for the MMAP case), it seems that you will get a plain collection scan.

Status QueryPlanner::plan(const CanonicalQuery& query,
                          const QueryPlannerParams& params,
                          std::vector<QuerySolution*>* out) {
    ...
    ...

    // If snapshot is set, default to collscanning. If the query param SNAPSHOT_USE_ID is set,
    // snapshot is a form of a hint, so try to use _id index to make a real plan. If that fails,
    // just scan the _id index.
    //
    // Don't do this if the query is a geonear or text as as text search queries must be answered
    // using full text indices and geoNear queries must be answered using geospatial indices.
    if (query.getParsed().isSnapshot() &&
        !QueryPlannerCommon::hasNode(query.root(), MatchExpression::GEO_NEAR) &&
        !QueryPlannerCommon::hasNode(query.root(), MatchExpression::TEXT)) {
        const bool useIXScan = params.options & QueryPlannerParams::SNAPSHOT_USE_ID;

        if (!useIXScan) {
            QuerySolution* soln = buildCollscanSoln(query, isTailable, params);
            if (soln) {
                out->push_back(soln);
            }
            return Status::OK();
        } else {
            // Find the ID index in indexKeyPatterns. It's our hint.
            for (size_t i = 0; i < params.indices.size(); ++i) {
                if (isIdIndex(params.indices[i].keyPattern)) {
                    hintIndex = params.indices[i].keyPattern;
                    break;
                }
            }
        }
    }

    ...
}
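
To summarise how I read the two excerpts above, here is a simplified model of the snapshot branch. It is not the server's code: the function name, the input fields, and the returned labels are all made up for illustration, and it assumes mongodump's default (non --forceTableScan) read sets the snapshot option.

```javascript
// Simplified model of the snapshot branch in QueryPlanner::plan.
// All names are illustrative, not real server identifiers.
function snapshotPlanKind({ isSnapshot, isMMAPv1, hasGeoNearOrText }) {
  if (!isSnapshot || hasGeoNearOrText) {
    // The snapshot branch is skipped; ordinary planning applies.
    return "NORMAL_PLANNING";
  }
  // fillOutPlannerParams sets SNAPSHOT_USE_ID only when isMMAPV1() is true.
  const useIXScan = isMMAPv1;
  return useIXScan ? "HINT_ID_INDEX" : "COLLSCAN";
}

// WiredTiger: no SNAPSHOT_USE_ID, so the snapshot read is a collection scan.
console.log(snapshotPlanKind({ isSnapshot: true, isMMAPv1: false, hasGeoNearOrText: false }));
// MMAPv1: the _id index is used as a hint.
console.log(snapshotPlanKind({ isSnapshot: true, isMMAPv1: true, hasGeoNearOrText: false }));
```

Under this reading, only the MMAPv1 case yields documents in _id order.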

So the real request of this ticket is: I think the "it's typically _id order" advice given below in the mongodump documentation is now misleading, as WiredTiger is more common, and we should remove it.

--forceTableScan

Forces mongodump to scan the data store directly: typically, mongodump saves entries as they appear in the index of the _id field. If you specify a query --query, mongodump will use the most appropriate index to support that query.

Use --forceTableScan to skip the index and scan the data directly. Typically there are two cases where this behavior is preferable to the default:

If you have key sizes over 800 bytes that would not be present in the _id index.
Your database uses a custom _id field.
When you run with --forceTableScan, mongodump does not use $snapshot. As a result, the dump produced by mongodump can reflect the state of the database at many different points in time.

IMPORTANT
Use --forceTableScan with extreme caution and consideration.

If you agree I can make suggestions for the wording and port it to a docs ticket.

Depends on: TOOLS-1952 Use --forceTableScan by default when running against WiredTiger nodes (Closed)
