[SERVER-77496] [BF-28463] IndexScanStageBase::prepareImpl() tassert(4938500) should be uassert Created: 25/May/23  Updated: 29/Oct/23  Resolved: 26/May/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 7.1.0-rc0, 7.0.0-rc2
Fix Version/s: 7.1.0-rc0

Type: Bug Priority: Major - P3
Reporter: Kevin Cherkauer Assignee: Kevin Cherkauer
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v7.0
Participants:
Linked BF Score: 14

 Description   

In ix_scan.cpp, IndexScanStageBase::prepareImpl() contains a uassert and some tasserts. tassert(4938500) should actually be a uassert because it is really a valid user-facing failure if the index the plan uses happened to get dropped just before the query tried to use it. The other tasserts are checking things that are programming errors.

The tassert causes the BF-28463 failure in secondary_reads_with_catalog_changes.js when the timing of this concurrent test is such that the index used by a query plan gets dropped just before the query tries to find that index during execution. Per mihai.andrei@mongodb.com, the Evergreen test harness considers every tassert to be a real test failure even if the test itself expects a failure with that error code (i.e. it's not okay to expect an error code from a programming bug, which is what a tassert denotes). This is not really a programming bug though - in this case the query should just fail gracefully.



 Comments   
Comment by David Storch [ 31/May/23 ]

Is it decided to revert now and look for another fix afterward, or is the decision waiting for more inputs? (I.e. what is the decision process, and has it concluded yet?)

I think yes, this has been decided, but kyle.suarez@mongodb.com could you confirm?

Is this considered a 7.0 release blocker? (It is possible for a customer to hit a tassert without any fix, however it would be rare as index drops are not that common and hitting the timing window seems hard to do from my experiments.)

I'd say it's probably not a release blocker, though it does sound like a bug we should aim to fix. Do we know which versions is affects? Was it a regression in 7.0 or does it go all the way back to 6.0?

Who should own the "look for another fix afterward" piece?

If you don't have bandwidth for it, this can go back into the triage queue to follow the normal process for work assignment.

Comment by Kevin Cherkauer [ 30/May/23 ]

david.storch@mongodb.com kyle.suarez@mongodb.com There is some ambiguity here to resolve:

  1. Is it decided to revert now and look for another fix afterward, or is the decision waiting for more inputs? (I.e. what is the decision process, and has it concluded yet?)
  2. Is this considered a 7.0 release blocker? (It is possible for a customer to hit a tassert without any fix, however it would be rare as index drops are not that common and hitting the timing window seems hard to do from my experiments.)
  3. Who should own the "look for another fix afterward" piece?
Comment by David Storch [ 30/May/23 ]

kevin.cherkauer@mongodb.com zixuan.zhuang@mongodb.com projjal.chanda@mongodb.com amr.elhelw@mongodb.com kyle.suarez@mongodb.com I took a brief look at this since the backport came up during the North America QE triage meeting. At first blush, this does not look like the correct fix to me. The current design for SBE is that the caller is expected to ensure that the collections and indexes used by the plan exist prior to calling prepare() on a plan and starting to execute it. With either lock-free or locked reads, the reader's view of the catalog remains stable except at yield points. That's why the various steps of the query optimization and plan compilation pipeline generally don't have to worry about indexes or collections disappearing at any time, but rather can obtain a view of the catalog at the beginning and assume that it is stable while choosing, constructing, and preparing a plan.

However, at yield points we advance our view of the catalog and might see that indexes or collections which we previously relied upon have been dropped. For this reason, when recovering from yield an SBE plan will uassert() if, for example, an index has been dropped: https://github.com/mongodb/mongo/blob/59a148452a15629161437ee4b5dc37bf5dbb4e0d/src/mongo/db/exec/sbe/stages/ix_scan.cpp#L183-L186. Similarly, when it is possible for a yield to occur in between the construction of an SBE plan and calling prepare() on that plan, we need to add special checks that the indexes and collections required by the plan still exist. In some places, we use the AllIndicesRequiredChecker for this purpose. The current design is that it is the caller's responsibility to ensure that the sbe::PlanStage tree remains valid before calling prepare(). While theoretically we could change the design such that prepare() tolerates arbitrary catalog changes at yield points happening in between construction of the plan and prepare(), it seems easier to me to preserve the design as it currently exists. At the very least, perhaps we could dig deeper into the root cause of the build failure – perhaps there is a code path related to one of the runtime planners which is missing a check related to index drops?

I propose that we revert the changes from both this ticket and SERVER-76909 while we study the issue further.

Comment by Kevin Cherkauer [ 26/May/23 ]

This fix needs to be backported to 7.0 (BACKPORT-16100) to make BF-28463 stop occurring there. (7.0 BACKPORT-16012 is also needed.)

Comment by Githook User [ 26/May/23 ]

Author:

{'name': 'Kevin Cherkauer', 'email': 'kevin.cherkauer@mongodb.com', 'username': 'kevin-cherkauer'}

Message: SERVER-77496 Change ix_scan.cpp tassert(4938500) to uassert
Branch: master
https://github.com/mongodb/mongo/commit/09d4923123730bbec3a3ade4d79744e7cbd01f82

Generated at Thu Feb 08 06:35:45 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.