Type: Task
Resolution: Unresolved
Priority: Minor - P4
Affects Version/s: None
Component/s: mongodump
1,915
v4.2
This ticket is really a documentation update request, but I figure it's going to require technical verification, so I'm creating it here in Tools first.
I was caught out in some testing when I did a mongodump + mongorestore against a server using WiredTiger (all v3.4). I wanted the data to be re-inserted in _id order. I was not using the --forceTableScan option of mongodump or the --numInsertionWorkersPerCollection option of mongorestore, but by looking at the recordId values while iterating the documents by _id I found that the re-inserted documents were in an order that looked exactly like the original collection-scan order.
db.coll.find({}, ..., {showRecordId: true}).sort({_id: 1}).forEach(print(_id + ": " recordId));
1: 1
2: 605003
3: 1277000
4: 1805464
5: 2
2: 605004
3: 1277001
...

mongodump
mongorestore

db.coll.find({}, ..., {showRecordId: true}).sort({_id: 1}).forEach(print(_id + ": " recordId));
1: 1
2: 621037
3: 1284322
4: 1791763
5: 2
2: 621038
3: 1284324
...
//(same story at high _id values, not just from 1 ~)
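The comparison above was done by eye. As a rough sketch of the same check, the snippet below takes (_id, recordId) pairs that are already sorted by _id and reports whether the record ids also increase. The helper name `recordIdsFollowIdOrder` and the plain-object shape of the pairs are assumptions made for illustration; the sample values are the first five pairs from each listing above.

```javascript
// Hypothetical helper (not part of the mongo shell or any driver).
// `docs` must already be sorted by _id; each entry carries the recordId
// the server reported for that document.
function recordIdsFollowIdOrder(docs) {
  for (let i = 1; i < docs.length; i++) {
    // A recordId that does not increase means storage order differs from _id order.
    if (docs[i].recordId <= docs[i - 1].recordId) {
      return false;
    }
  }
  return true;
}

// First five pairs from the listing taken before the dump.
const beforeDump = [
  { _id: 1, recordId: 1 },
  { _id: 2, recordId: 605003 },
  { _id: 3, recordId: 1277000 },
  { _id: 4, recordId: 1805464 },
  { _id: 5, recordId: 2 },
];

// First five pairs from the listing taken after mongodump + mongorestore.
const afterRestore = [
  { _id: 1, recordId: 1 },
  { _id: 2, recordId: 621037 },
  { _id: 3, recordId: 1284322 },
  { _id: 4, recordId: 1791763 },
  { _id: 5, recordId: 2 },
];

console.log(recordIdsFollowIdOrder(beforeDump));   // false
console.log(recordIdsFollowIdOrder(afterRestore)); // false
```

Both listings fail the check in the same place (_id 5 sits at recordId 2), which is what suggested the restore had preserved the collection-scan order instead of re-inserting by _id.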
I spent a fair bit of time working this out, as I had a big dataset whose exact document data I wanted to reuse, just rearranged more serially in the underlying storage for performance comparison reasons.
It seems my expectation that it would be dumped in _id order is outdated. Reading today I found it was valid for MMAP due to the hint used below, so long as you don't specify a query or a sort. (Code taken from the v3.2 branch, but I assume it is the same in 3.0 - 3.4.)
void fillOutPlannerParams(OperationContext* txn,
                          Collection* collection,
                          CanonicalQuery* canonicalQuery,
                          QueryPlannerParams* plannerParams) {
    ...
    ...
    // MMAPv1 storage engine should have snapshot() perform an index scan on _id rather than a
    // collection scan since a collection scan on the MMAP storage engine can return duplicates
    // or miss documents.
    if (isMMAPV1()) {
        plannerParams->options |= QueryPlannerParams::SNAPSHOT_USE_ID;
    }
}
But without that hint being added (as it is for the MMAP case), it seems that you will get a plain collection scan.
Status QueryPlanner::plan(const CanonicalQuery& query,
                          const QueryPlannerParams& params,
                          std::vector<QuerySolution*>* out) {
    ...
    ...
    // If snapshot is set, default to collscanning. If the query param SNAPSHOT_USE_ID is set,
    // snapshot is a form of a hint, so try to use _id index to make a real plan. If that fails,
    // just scan the _id index.
    //
    // Don't do this if the query is a geonear or text as as text search queries must be answered
    // using full text indices and geoNear queries must be answered using geospatial indices.
    if (query.getParsed().isSnapshot() &&
        !QueryPlannerCommon::hasNode(query.root(), MatchExpression::GEO_NEAR) &&
        !QueryPlannerCommon::hasNode(query.root(), MatchExpression::TEXT)) {
        const bool useIXScan = params.options & QueryPlannerParams::SNAPSHOT_USE_ID;

        if (!useIXScan) {
            QuerySolution* soln = buildCollscanSoln(query, isTailable, params);
            if (soln) {
                out->push_back(soln);
            }
            return Status::OK();
        } else {
            // Find the ID index in indexKeyPatterns. It's our hint.
            for (size_t i = 0; i < params.indices.size(); ++i) {
                if (isIdIndex(params.indices[i].keyPattern)) {
                    hintIndex = params.indices[i].keyPattern;
                    break;
                }
            }
        }
    }
    ...
}
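To make the combined effect of the two excerpts easier to see, here is a simplified model of the decision in JavaScript. It is only a sketch of my reading of the quoted code, not the server's implementation: the function name `chooseSnapshotPlan`, its option names, and the returned labels are all made up for illustration.

```javascript
// Simplified, hypothetical model of the snapshot planning branch quoted above.
// Names and return labels are illustrative only; they do not exist in the server.
function chooseSnapshotPlan({ isSnapshot, isMMAPv1, hasGeoNear = false, hasText = false }) {
  // The special-casing only applies to snapshot queries without geoNear or text.
  if (!isSnapshot || hasGeoNear || hasText) {
    return "NORMAL_PLANNING";
  }
  // fillOutPlannerParams sets SNAPSHOT_USE_ID only when the engine is MMAPv1.
  const useIdIndex = isMMAPv1;
  // Without that option the planner builds a collection scan and returns early;
  // with it, the _id index is used as a hint.
  return useIdIndex ? "ID_INDEX_HINT" : "COLLSCAN";
}

console.log(chooseSnapshotPlan({ isSnapshot: true, isMMAPv1: true }));  // ID_INDEX_HINT
console.log(chooseSnapshotPlan({ isSnapshot: true, isMMAPv1: false })); // COLLSCAN
```

Under this reading, a snapshot dump from a WiredTiger server takes the second path, which matches the collection-scan order observed in the test.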
So the real request of this ticket is: I think the 'it's typically _id order' advice given in the mongodump documentation quoted below is now misleading, as WiredTiger is more common, and we should remove it.
--forceTableScan
Forces mongodump to scan the data store directly: typically, mongodump saves entries as they appear in the index of the _id field. If you specify a query --query, mongodump will use the most appropriate index to support that query.
Use --forceTableScan to skip the index and scan the data directly. Typically there are two cases where this behavior is preferable to the default:
If you have key sizes over 800 bytes that would not be present in the _id index.
Your database uses a custom _id field.
When you run with --forceTableScan, mongodump does not use $snapshot. As a result, the dump produced by mongodump can reflect the state of the database at many different points in time.
IMPORTANT
Use --forceTableScan with extreme caution and consideration.
If you agree I can make suggestions for the wording and port it to a docs ticket.
depends on: TOOLS-1952 Use --forceTableScan by default when running against WiredTiger nodes (Closed)