- Type: New Feature
- Resolution: Unresolved
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Aggregation Framework, Query Execution
- v8.1, v8.0, v7.0, v6.0
Summary
Verification slowness frequently complicates large migrations. To mitigate this, we would like the server to compute a per-document hash that verification can fetch in lieu of the full document.
This would confer at least these advantages:
- Reduced network I/O. Instead of sending, say, a 50 KiB document, the server would send only an 8-byte hash.
- Reduced server memory usage. Instead of caching entire documents in memory to serve verification queries, the server would only cache document hashes.
Details
The envisioned operator would look like this in a pipeline:
{ $bsonHash: { input: "$$ROOT", algorithm: "fnv1a_64" } }
... and would output a BinData of the hash. (I suggest little-endian encoding and subtype=0.)
The only hash algorithm that C2C uses presently is fnv1a_64.
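For illustration, a verifier could request per-document hashes with a read of roughly this shape (the collection and projected field names are placeholders, not part of the proposal):
{code:javascript}
// Illustrative verification read using the proposed operator.
// "coll" and the projected field names are hypothetical.
db.coll.aggregate([
    { $project: {
        _id: 1,
        // 8-byte BinData (subtype 0, little-endian) fnv1a_64 hash of the whole document
        hash: { $bsonHash: { input: "$$ROOT", algorithm: "fnv1a_64" } }
    } }
])
{code}
The verifier would then compare { _id, hash } pairs between source and destination instead of full documents.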
Proofs of Concept
migration-verifier
I made a migration-verifier branch that compares the result of $toHashedIndexKey rather than full documents.
($toHashedIndexKey is not suitable for full verification because it elides numeric type differences.)
With this change, the total time to verify a 31-GiB dataset fell from 37 minutes to under 15 minutes. The load on migration-verifier also appeared far lower than with the upstream version.
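For reference, a hash-based read of roughly this shape can already be issued against today's server via $toHashedIndexKey (the collection and field names are placeholders; the branch's actual pipeline may differ):
{code:javascript}
// Hash-based verification read that works with today's server, using the
// existing $toHashedIndexKey expression. "coll" is hypothetical.
db.coll.aggregate([
    { $project: {
        _id: 1,
        // 64-bit hash of the whole document, returned as a NumberLong
        hash: { $toHashedIndexKey: "$$ROOT" }
    } }
])
{code}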
$bsonHash operator
I created a rough proof-of-concept $bsonHash operator. This may facilitate assessment of the work needed.
Real-World Cases
Adobe
In [Adobe's recent migration|HELP-73941], Migration Factory had to revert to checking manually sampled documents because C2C's standalone migration-verifier at one point generated enough load to cause OOMs on both the source and destination clusters.
We have since [addressed this problem|REP-5996], but others like it may yet emerge.
WISE
The initial verification of this 47 TiB migration took 7 days. Because of the long migration time, a large number of documents had to be rechecked, and each enqueued recheck was then deleted. Deleting them took long enough that, by the time migration-verifier resumed checking, it had fallen off the oplog, and verification had to restart.
We have since [accelerated recheck deletions|REP-5963], but the 7-day verification time remains a problem. During that period the customer cannot perform any DDL operations and, because verification reads every document in full, may suffer degraded production workload performance, impacting revenue.
Additional Notes
- The server already appears to have code to compute FNV hashes at src/third_party/wiredtiger/src/support/hash_fnv.c, so no additional third-party code should be needed. (I didn’t use it in my proof of concept because I didn’t know how to pull it in.) A reference sketch of the algorithm follows this list.
- The further back this can be backported, the more migrations it can benefit.
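For reference, a minimal client-side sketch of FNV-1a 64 plus the proposed little-endian, subtype-0 BinData encoding, assuming the hash is computed over the document's raw BSON bytes (the function names are illustrative):
{code:javascript}
// FNV-1a 64: standard offset basis and prime, arithmetic modulo 2^64.
const FNV_OFFSET_BASIS = 0xcbf29ce484222325n;
const FNV_PRIME = 0x100000001b3n;
const MASK_64 = (1n << 64n) - 1n;

// "bsonBytes" is assumed to be a Uint8Array holding the document's raw BSON.
function fnv1a64(bsonBytes) {
    let hash = FNV_OFFSET_BASIS;
    for (const byte of bsonBytes) {
        hash ^= BigInt(byte);
        hash = (hash * FNV_PRIME) & MASK_64;
    }
    return hash; // unsigned 64-bit value as a BigInt
}

// Encode the hash as the 8 little-endian bytes that the proposed
// BinData (subtype 0) output would carry.
function toLittleEndianBytes(hash) {
    const out = new Uint8Array(8);
    for (let i = 0; i < 8; i++) {
        out[i] = Number((hash >> BigInt(8 * i)) & 0xffn);
    }
    return out;
}
{code}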