-
Type:
Improvement
-
Resolution: Unresolved
-
Priority:
Major - P3
-
Affects Version/s: None
-
Component/s: None
-
Storage Execution
-
Storage Execution 2025-12-22
An annoying limitation of resumable index builds is that they are only resumable once.
However, I think lifting this limitation is feasible using a trick that was added in 8.0 to support independent oplog writing and application. Essentially, rather than waiting for the top of the oplog to be majority-committed (which we do on the primary in steady state), during startup recovery or secondary steady state we wait for the startIndexBuild oplog entry to be majority-committed. This is trivially true during startup recovery because we're replaying writes that are majority-acknowledged.
The existing restrictions about not being resumable in replication recovery would then no longer be needed, and deleting them should "just work", with additional testing to verify the theory.
This idea doesn't work. An index build is only "resumable" when 1) it reads only data that cannot be rolled back (i.e. majority-committed data) and 2) it also persists its state during clean shutdown. The majority commit point is unavailable during replication recovery, and the stable checkpoint can be arbitrarily lagged, so we don't know what timestamp is safe to read from to guarantee that we only read majority-committed data and keep the index build re-resumable. We can't just wait until the majority commit point is available, because a commit oplog entry applied during recovery would get stuck.
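To make the two conditions concrete, here is a minimal sketch in Python. It is not server code: the function name, parameters, and the use of plain integers for timestamps are all assumptions made for illustration.

```python
from typing import Optional


def can_resume_again(
    read_timestamp: int,
    majority_commit_point: Optional[int],
    state_persisted_on_clean_shutdown: bool,
) -> bool:
    """Hypothetical check for whether a resumed build stays resumable.

    Timestamps are modeled as plain integers. `majority_commit_point` is
    None while it is unavailable, e.g. during replication recovery.
    """
    if majority_commit_point is None:
        # Without a majority commit point we cannot prove that the data
        # read so far is safe from rollback.
        return False
    # Condition 1: everything read is at or before the majority point.
    reads_only_majority_committed = read_timestamp <= majority_commit_point
    # Condition 2: the build wrote its resume state at clean shutdown.
    return reads_only_majority_committed and state_persisted_on_clean_shutdown
```

In this model, a build that is resumed during replication recovery always fails the check, which is the problem described above.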
An alternative idea:
- After resuming the index build, if the majority commit point isn't available yet (i.e. startup recovery hasn't completed), we will read at the stable timestamp. This is safe because draining before commit is a best-effort step whose only purpose is to avoid a long critical section during commit. In the commit critical section, we revert to reading at lastApplied for correctness, to make sure we apply everything (this is already true).
- If the commit is received after startup recovery, then the majority commit point will have already been established, and we will do a final drain reading at majority before committing (this would operate as it does today).
- If the commit is received during startup recovery, we will need to apply any side writes from the stable timestamp to the commit timestamp. This will still block startup, but it is still better than a full index rebuild. When paired with SERVER-112315 (persist the resume info after checkpointing and before voting for commit), this means we should have very little work to do in this phase.
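The three cases above can be sketched as a small decision function. This is an illustrative Python model only: `ReadSource`, `NodeState`, `drain_read_source`, and `side_writes_to_apply_on_commit` are hypothetical names, and timestamps are modeled as plain integers rather than real server types.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional, Tuple


class ReadSource(Enum):
    MAJORITY_COMMITTED = auto()
    STABLE_TIMESTAMP = auto()
    LAST_APPLIED = auto()


@dataclass
class NodeState:
    # Hypothetical snapshot of what the node knows at a given moment.
    in_startup_recovery: bool
    majority_commit_point: Optional[int]  # None until established
    stable_timestamp: int


def drain_read_source(state: NodeState, in_commit_critical_section: bool) -> ReadSource:
    """Pick where a resumed index build reads from while draining side writes."""
    if in_commit_critical_section:
        # Correctness requires applying everything, so read at lastApplied.
        return ReadSource.LAST_APPLIED
    if state.majority_commit_point is None:
        # Best-effort drain before commit: fall back to the stable timestamp.
        return ReadSource.STABLE_TIMESTAMP
    return ReadSource.MAJORITY_COMMITTED


def side_writes_to_apply_on_commit(
    state: NodeState, commit_timestamp: int
) -> Optional[Tuple[int, int]]:
    """Timestamp range of side writes to apply when commit arrives in recovery."""
    if state.in_startup_recovery:
        return (state.stable_timestamp, commit_timestamp)
    # After recovery, the usual final drain at majority covers this.
    return None
```

For example, a node still in startup recovery with a stable timestamp of 100 that receives a commit at timestamp 150 would, in this model, need to apply side writes in the range (100, 150) before finishing startup. The smaller that range, the less startup is blocked.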
- is related to
-
SERVER-112315 Avoid full index rebuild during startup when crashing after commit
-
- In Progress
-