- Type: Task
- Resolution: Unresolved
- Priority: Major - P3
- None
- Affects Version/s: None
- Component/s: None
- Query Execution
- Fully Compatible
- v8.0, v7.0
- QE 2024-10-14, QE 2024-10-28, QE 2024-11-11, QE 2024-11-25, QE 2024-12-09, QE 2024-12-23
- 200
Context:
We (Atlas Search) plan to use a $natural order collection scan in mongot for logical initial-sync to address a frequent performance bottleneck seen by multiple customers in production (one example: HELP-55062). Today, mongot uses an aggregate command with a sort stage over the _id field so that initial sync can be resumed, either in the normal flow or after a transient error. The issue arises when mongot rebuilds search indexes (e.g. new index, definition change, unrecoverable exception) and the layout of documents in WT does not correlate well with _id order. In such cases we see severe degradation in server performance due to disk latency and available IOPS. A natural order collection scan will solve the issue. However, it has a major drawback (see Problem below).
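To illustrate why a scan in _id order can be expensive when the storage layout does not follow _id, here is a toy model in Python. It is only a sketch: recordIds stand in for storage positions, a "seek" is counted whenever the next record visited is not adjacent to the previous one, and the names `count_seeks`, `natural_order` and `id_order` are hypothetical. Real WT behaviour (caching, page layout, read-ahead) is far more involved.

```python
import random


def count_seeks(visited_record_ids):
    """Count steps where the next record is not adjacent to the previous one."""
    pairs = zip(visited_record_ids, visited_record_ids[1:])
    return sum(1 for prev, cur in pairs if cur != prev + 1)


random.seed(0)
n = 1000

# Assumption: recordId i (storage position) holds the document with _id ids[i],
# and _id values are uncorrelated with storage order.
ids = list(range(n))
random.shuffle(ids)

# Natural order visits recordIds in storage order.
natural_order = list(range(n))
# _id order visits recordIds sorted by the _id they hold.
id_order = sorted(range(n), key=lambda rid: ids[rid])

natural_seeks = count_seeks(natural_order)
id_order_seeks = count_seeks(id_order)
```

In this model the natural order scan never jumps, while the _id order scan jumps on almost every step, which is the access pattern that stresses disk latency and IOPS.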
Problem:
The search logical initial-sync algorithm relies heavily on resume support to avoid buffering change-stream (oplog) updates in mongot during the collection scan. We do so by alternating between the collection scan and catching up with the collection's change stream. The current implementation of the aggregate command is unable to start or resume a natural order scan if `$_resumeAfter` points to a deleted document / recordId. This limitation is a significant concern for us, as it could lead to a new set of production issues depending on the customer's workload.
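The failure mode can be sketched with a small in-memory model in Python. This is not the server implementation: `scan_resume_after`, `RecordNotFound` and the dict-based "collection" are hypothetical, and the only behaviour taken from the description above is that a resume fails when the recordId in the token has been deleted.

```python
class RecordNotFound(Exception):
    """Raised when the resume token points at a recordId that no longer exists."""


def scan_resume_after(records, resume_after=None, batch_size=2):
    """Toy natural-order scan over a dict mapping recordId -> document.

    Resuming requires the recordId in the token to still exist (exact match).
    Returns (documents, resume_token).
    """
    rids = sorted(records)
    if resume_after is None:
        start = 0
    else:
        if resume_after not in records:
            raise RecordNotFound(f"recordId {resume_after} no longer exists")
        start = rids.index(resume_after) + 1
    batch = rids[start:start + batch_size]
    token = batch[-1] if batch else resume_after
    return [records[r] for r in batch], token


records = {1: "a", 2: "b", 3: "c", 4: "d"}

# First scan batch, then pause to catch up with the change stream.
docs, token = scan_resume_after(records)

# While catching up, the last scanned document is deleted.
del records[token]

try:
    scan_resume_after(records, resume_after=token)
    resumed = True
except RecordNotFound:
    resumed = False
```

Because the scan and the change-stream catch-up alternate, any delete that lands on the last scanned recordId between two scan batches is enough to trigger this case.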
Ask:
Mongod should support starting or resuming $natural collection scans after a deleted recordId.
UPDATE:
Following HELP-59576, we distilled the ask to mongod:
1. Provide $gt(e) / $lt(e) semantics for a recordId in the aggregation pipeline.
2. Aggregation will (can) provide a resume token in conjunction with $gt(e) / $lt(e).
3. The resume token provided in (2) can be passed to the $gt(e) / $lt(e) aggregation stage in (1) to restart a query. The resume must not fail if the passed recordId does not exist or has been deleted (implicit).
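The requested semantics can be sketched with the same kind of in-memory model in Python. This is an illustration under stated assumptions, not a proposed server design: `scan_gt` is a hypothetical name, and `bisect` over sorted recordIds stands in for "seek to the first recordId greater than the token".

```python
import bisect


def scan_gt(records, after=None, batch_size=2):
    """Toy natural-order scan with $gt semantics on recordId.

    Starts at the first recordId strictly greater than `after`, whether or
    not `after` itself still exists. Returns (documents, resume_token).
    """
    rids = sorted(records)
    start = 0 if after is None else bisect.bisect_right(rids, after)
    batch = rids[start:start + batch_size]
    token = batch[-1] if batch else after
    return [records[r] for r in batch], token


records = {1: "a", 2: "b", 3: "c", 4: "d"}

first, token = scan_gt(records)

# The document behind the resume token is deleted before the scan resumes.
del records[token]

# Resuming still works: the scan continues from the next surviving recordId.
second, token2 = scan_gt(records, after=token)
```

Using `bisect.bisect_left` instead of `bisect.bisect_right` would give the $gte variant, and a mirrored version over reversed recordIds would model $lt(e).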
causes:
- SERVER-98756 Test QueryTestBackwardCollscanWithResumeScanPointFails sometimes hangs (Closed)
is related to:
- SERVER-90497 Support resume on reversed natural order scan (Backlog)
related to:
- SERVER-89910 Add ability to perform a natural scan on a specific recordId range (Needs Scheduling)