[SERVER-55397] Index build restart ident drops are not timestamped during startup recovery Created: 22/Mar/21  Updated: 29/Oct/23  Resolved: 05/May/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.9.0
Fix Version/s: 5.0.0-rc0

Type: Bug Priority: Major - P3
Reporter: Louis Williams Assignee: Gregory Noma
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
related to SERVER-55552 Unreplicated collection idents can ge... Closed
related to SERVER-38910 Remove redundant rollback handling on... Closed
is related to SERVER-56639 Timestamp index ident drops for start... Closed
is related to SERVER-58159 Extend relaxation of index ident reco... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Execution Team 2021-04-19, Execution Team 2021-05-03, Execution Team 2021-05-17
Participants:
Linked BF Score: 137

 Description   

This assertion added in SERVER-38910, made an assumption that all ident drops are timestamped. During startup recovery, if an index build is unfinished and needs to be restarted, we drop the original ident without a timestamp. This can cause this assertion to fail in certain circumstances.

I think we need to either:

  • Timestamp the ident drop on startup recovery. This would ensure we don't drop the ident until it is majority-committed. Unfortunately, we aren't starting the index build from oplog entry, so we would potentially have to use a timestamp in the future, ahead of the recovery timestamp.
  • Relax this assertion to exclude unfinished index builds. Similar to the original code, if we discover that an ident is missing for an index, we only need to guarantee it will be rebuilt. If we do this, we also need to ensure that we don't error if we try to drop an ident that doesn't exist.


 Comments   
Comment by Githook User [ 05/May/21 ]

Author:

{'name': 'Gregory Noma', 'email': 'gregory.noma@gmail.com', 'username': 'gregorynoma'}

Message: SERVER-55397 Relax index ident reconciliation invariant when restarting an index build after an unclean shutdown
Branch: master
https://github.com/mongodb/mongo/commit/95bf4b5d5d1c88002452cf8bd1e7ffcc47b68d9a

Comment by Louis Williams [ 06/Apr/21 ]

Update: It turns out that the reason the index is dropped and restarted at all is that the rollback fuzzer disables commitQuorum for index builds. As a result, the index build does not qualify as "resumable", requiring the ident to be dropped and recreated at startup.

Since index builds are resumable from every phase in 4.4, both of these bug conditions only exist when the server is configured with enableIndexBuildCommitQuorum=false. As far as I know, this is not a parameter that we document.

I can confidently say that this is not required for 5.0. Users may, however, disable resumable index builds by providing a commitQuorum other than "votingMembers".

Comment by Louis Williams [ 05/Apr/21 ]

Considering this, I think there is probably a different bug, as follows:

  1. During startup recovery, there is an unfinished two-phase index build. We drop the ident and perform an un-timestamped write to the catalog entry to restart the index build.
  2. Before replicating a commitIndexBuild oplog entry, take a checkpoint and shut down. This persists the new catalog entry and the wiped-out history (the effect of the un-timestamped write) for that catalog entry.
  3. The server restarts. There is no fatal assertion because the catalog entry's index matches an existing index table. The catalog entry shows incorrect history for all timestamped reads between the oldest and stable timestamps.

If this were just a single startup (i.e. just step 1) this would not be possible, because we set collection's minimumVisibleSnapshot to the stable timestamp when the catalog entry is not visible at the oldest timestamp. But in this scenario, the catalog entry would be visible at both the stable timestamp and the oldest timestamp. As a result, we don't set the minimumVisible, allowing reads to observe a potentially incorrect historical state of the catalog entry.

Comment by Daniel Gottlieb (Inactive) [ 02/Apr/21 ]

Daniel Gottlieb, when we do the drop of the index entry, this happens inside of one WUOW, so there is never a state without an index entry.

To clarify, I was claiming that performing a write to an _mdb_catalog entry (for a replicated collection) without a timestamp can corrupt the update chain.

Prior to durable history, a case like this would be "okay" so long as we haven't performed any replication recovery (meaning the 0 timestamped write is only covering up the version as of the stable checkpoint MDB is starting up against). As you demonstrated, that single WUOW just changes one index ident value for another.

With durable history, we might be okay (modulo WT bugs coping with the complexity of this MDB requirement), but that would true only if we never read the _mdb_catalog at an arbitrarily early value (presumably between the oldest and stable timestamps).

Comment by Louis Williams [ 02/Apr/21 ]

daniel.gottlieb, when we do the drop of the index entry, this happens inside of one WUOW, so there is never a state without an index entry.

I think we could try to pass a larger timestamp to the timestamp monitor so that the old ident is not dropped until the next checkpoint.

Comment by Daniel Gottlieb (Inactive) [ 02/Apr/21 ]

louis.williams, does "dropping the ident without a timestamp" mean removing it from the _mdb_catalog entry for that collection?

For a replicated collection I would assume that to be a data corrupting write. Am I missing something?

edit
Or maybe "timestamping the ident drop" means just giving a non-zero value to the monitor that tracks for when to remove an index.

As per what gregory.wlodarek found in SERVER-55552, it's never okay to pass a zero-timestamp to the ident dropping monitor. We should probably add an invariant there.

Generated at Thu Feb 08 05:36:22 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.