Details
Task
Status: Accepted
Minor - P4
Resolution: Unresolved
v4.2
Description
This ticket is really a documentation update request, but I figure it's going to require technical verification, so I'm creating it here in Tools first.
I was caught out in some testing when I did a mongodump + mongorestore with a WiredTiger-using server. (All v3.4.) I wanted the data to be re-inserted in _id order. I was not using the --forceTableScan option of mongodump, or the --numInsertionWorkersPerCollection arg of mongorestore, but by looking at the recordId values when iterating the documents by _id I found the re-inserted documents were in an order that looked exactly like the original collection-scan order.
pseudo repro code
db.coll.find().showRecordId().sort({_id: 1}).forEach(function (doc) { print(doc._id + ": " + doc.$recordId); });
1: 1
2: 605003
3: 1277000
4: 1805464
5: 2
2: 605004
3: 1277001
...
mongodump
mongorestore
db.coll.find().showRecordId().sort({_id: 1}).forEach(function (doc) { print(doc._id + ": " + doc.$recordId); });
1: 1
2: 621037
3: 1284322
4: 1791763
5: 2
2: 621038
3: 1284324
...
// (same story at high _id values, not just from 1 ~)
I spent a fair bit of time working this out, as I had a big dataset whose exact document data I wanted to reuse. (Just rearranged more serially in the underlying storage, for performance comparison reasons.)
It seems my expectation that it would be dumped in _id order is outdated. Reading today I found it was valid for MMAP due to the hint used below, so long as you don't specify a query or a sort. (Code taken from the v3.2 branch, but I assume it is the same in 3.0 - 3.4.)
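Why a collection-scan dump reproduces the original layout can be illustrated with a toy model. This is a sketch under loud assumptions: a "collection" is just an array, recordIds are handed out sequentially on insert, _id values are numeric, and the restore inserts documents in dump order with a single worker. All function names are hypothetical; none of this is mongodump or mongorestore code.

```javascript
// Toy model only: recordId is assigned sequentially in insertion order.
function insertAll(docs) {
  return docs.map((doc, i) => ({ recordId: i + 1, doc }));
}

// Collection-scan dump: documents come out in recordId order.
function dumpByCollectionScan(coll) {
  return coll
    .slice()
    .sort((a, b) => a.recordId - b.recordId)
    .map((r) => r.doc);
}

// _id-ordered dump: the ordering this ticket expected (numeric _id assumed).
function dumpByIdOrder(coll) {
  return coll.map((r) => r.doc).sort((a, b) => a._id - b._id);
}

// Original collection, inserted out of _id order.
const original = insertAll([{ _id: 1 }, { _id: 5 }, { _id: 2 }, { _id: 4 }, { _id: 3 }]);

// "Restore" each dump into a fresh collection.
const restoredScan = insertAll(dumpByCollectionScan(original));
const restoredById = insertAll(dumpByIdOrder(original));

console.log(restoredScan.map((r) => r.doc._id)); // [ 1, 5, 2, 4, 3 ]
console.log(restoredById.map((r) => r.doc._id)); // [ 1, 2, 3, 4, 5 ]
```

In this model the collection-scan dump carries the original storage order straight into the restored collection, which matches what the recordId values above suggest.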
void fillOutPlannerParams(OperationContext* txn,
                          Collection* collection,
                          CanonicalQuery* canonicalQuery,
                          QueryPlannerParams* plannerParams) {
    ...
    ...

    // MMAPv1 storage engine should have snapshot() perform an index scan on _id rather than a
    // collection scan since a collection scan on the MMAP storage engine can return duplicates
    // or miss documents.
    if (isMMAPV1()) {
        plannerParams->options |= QueryPlannerParams::SNAPSHOT_USE_ID;
    }
}
But without that hint being added (as it is for the MMAP case), it seems that you will get a plain collection scan.
Status QueryPlanner::plan(const CanonicalQuery& query,
                          const QueryPlannerParams& params,
                          std::vector<QuerySolution*>* out) {
    ...
    ...

    // If snapshot is set, default to collscanning. If the query param SNAPSHOT_USE_ID is set,
    // snapshot is a form of a hint, so try to use _id index to make a real plan. If that fails,
    // just scan the _id index.
    //
    // Don't do this if the query is a geonear or text as as text search queries must be answered
    // using full text indices and geoNear queries must be answered using geospatial indices.
    if (query.getParsed().isSnapshot() &&
        !QueryPlannerCommon::hasNode(query.root(), MatchExpression::GEO_NEAR) &&
        !QueryPlannerCommon::hasNode(query.root(), MatchExpression::TEXT)) {
        const bool useIXScan = params.options & QueryPlannerParams::SNAPSHOT_USE_ID;

        if (!useIXScan) {
            QuerySolution* soln = buildCollscanSoln(query, isTailable, params);
            if (soln) {
                out->push_back(soln);
            }
            return Status::OK();
        } else {
            // Find the ID index in indexKeyPatterns. It's our hint.
            for (size_t i = 0; i < params.indices.size(); ++i) {
                if (isIdIndex(params.indices[i].keyPattern)) {
                    hintIndex = params.indices[i].keyPattern;
                    break;
                }
            }
        }
    }

    ...
}
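Taken together, the two quoted snippets amount to a small decision table. The sketch below restates that reading in plain JavaScript. It is a simplified model, not the server's code: `chooseSnapshotPlan`, its option names, and the returned strings are all hypothetical labels.

```javascript
// Simplified model of the two quoted C++ snippets; every name here is
// hypothetical and chosen only for illustration.
function chooseSnapshotPlan({ isSnapshot, isMMAPv1, hasGeoNear = false, hasText = false }) {
  // The snapshot branch is skipped for non-snapshot, geoNear and text queries.
  if (!isSnapshot || hasGeoNear || hasText) {
    return "normal planning";
  }
  // fillOutPlannerParams sets SNAPSHOT_USE_ID only when the engine is MMAPv1.
  const snapshotUseId = isMMAPv1;
  // Without that option the planner returns a plain collection scan.
  return snapshotUseId ? "plan hinted on _id index" : "collection scan";
}

console.log(chooseSnapshotPlan({ isSnapshot: true, isMMAPv1: true }));  // plan hinted on _id index
console.log(chooseSnapshotPlan({ isSnapshot: true, isMMAPv1: false })); // collection scan
```

Under this reading, a snapshot query against a non-MMAPv1 engine such as WiredTiger always lands in the collection-scan branch.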
So the real request of this ticket is: I think the 'it's typically _id order' advice given below in the documentation for mongodump is now misleading, as WiredTiger is more common, and we should remove it.
--forceTableScan
Forces mongodump to scan the data store directly: typically, mongodump saves entries as they appear in the index of the _id field. If you specify a query --query, mongodump will use the most appropriate index to support that query.
Use --forceTableScan to skip the index and scan the data directly. Typically there are two cases where this behavior is preferable to the default:
- If you have key sizes over 800 bytes that would not be present in the _id index.
- Your database uses a custom _id field.
When you run with --forceTableScan, mongodump does not use $snapshot. As a result, the dump produced by mongodump can reflect the state of the database at many different points in time.
IMPORTANT
Use --forceTableScan with extreme caution and consideration.
If you agree I can make suggestions for the wording and port it to a docs ticket.
Issue Links
- depends on: TOOLS-1952 Use --forceTableScan by default when running against WiredTiger nodes (Closed)