[SERVER-62345] Consider re-evaluating fuzzer testing philosophy Created: 04/Jan/22  Updated: 06/Dec/22  Resolved: 09/Feb/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: Jennifer Peshansky (Inactive) Assignee: Backlog - Query Execution
Resolution: Done Votes: 0
Labels: None

Issue Links:
Depends
is depended on by SERVER-61389 Fuzzer does not tolerate failures dur... Closed
Assigned Teams:
Query Execution

 Description   

Our fuzzers are designed to alert us to any change in behavior between different versions or configurations of mongo. They currently generate a BFG every time a query fails on one version and succeeds on another. Often these are invalid queries that contain some improper value or operation. However, the offending part of the query may be optimized away on one side before it has a chance to error, or it may trigger a different error first because it executes in a different order.

The problem increased sharply when we turned the SBE engine on by default. Since SBE re-implements MQL expressions from scratch, its undocumented behavior differs from non-SBE versions of mongo in many places. So far, we have addressed these BFs by adding workarounds to the fuzzer to suppress BFGs in specific scenarios. The list of these workarounds in the fuzzers has grown long.

We dedicate a lot of time and resources to these BFs. Many of them are connected to queries that customers should not be running, since they are invalid MQL. Our documentation does not promise consistent behavior for invalid queries.

  • How likely are customers to ever suffer as a result of these behavior changes in practice?
  • How much value does it provide to spend developer time resolving these types of differences?
  • What behavior changes do we want to prioritize identifying and fixing?
  • How can we most efficiently change the fuzzers to reflect these priorities?

One proposal is to stop the fuzzers from generating BFGs when one side errors and the other doesn’t, or when the error messages don’t match, since this usually happens with queries that are already invalid. In this world, the fuzzer would only generate a BFG when both sides succeed, but return different results for the same query.
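The proposed policy can be expressed as a small triage routine. This is a minimal sketch only: the `run_on_a`/`run_on_b` runners, the `QueryError` exception, and the result shape are hypothetical stand-ins for whatever interface the fuzzer harness actually uses.

```python
# Sketch of the proposed comparison policy, assuming a hypothetical
# harness interface: run_on_a/run_on_b execute one query against the two
# configurations under comparison and raise QueryError on failure.

class QueryError(Exception):
    """A configuration rejected or failed the query."""

def compare(query, run_on_a, run_on_b):
    """Return a mismatch worth a BFG, or None if the run is ignored.

    Error/error and error/success outcomes are ignored (such queries are
    usually invalid MQL anyway); only success/success runs that return
    different results are flagged.
    """
    try:
        results_a = run_on_a(query)
        results_b = run_on_b(query)
    except QueryError:
        return None  # at least one side errored: not BFG-worthy
    if results_a != results_b:
        return {"query": query, "a": results_a, "b": results_b}
    return None
```

Under this sketch, mismatched error messages never reach the comparison at all, since any `QueryError` short-circuits the run.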

There is a risk that a new version of mongo could introduce a bug where a valid query suddenly errors. This is something the fuzzers would currently catch, but we should discuss how a problem like this would be identified if we were to make this change.



 Comments   
Comment by Kyle Suarez [ 09/Feb/22 ]

Closing this as Done, as we had several meetings and have a path forward with regard to ignoring certain classes of errors.

Comment by Max Hirschhorn [ 05/Jan/22 ]

We dedicate a lot of time and resources to these BFs. Many of them are connected to queries that customers should not be running, since they are invalid MQL.

I'd be curious to see tabulated numbers on this. On the Sharding team there had been a lot of anecdotal evidence that any failure in the sharding_csrs_continuous_config_stepdown Evergreen task was a testing-only problem and would get solved by excluding the offending test. I was surprised to find—and reported in SERVER-59891—that for every 2 testing-only issues there has still been 1 real code issue. Not a good ratio by any means but much better than what I had anticipated discovering.

The list of questions to evaluate in the description is spot-on. Testing is about managing risk, and that means weighing severity and likelihood against time scarcity and diminishing returns.

One proposal is to stop the fuzzers from generating BFGs when one side errors and the other doesn’t, or when the error messages don’t match, since this usually happens with queries that are already invalid. In this world, the fuzzer would only generate a BFG when both sides succeed, but return different results for the same query.

Has it ever been the case that the query results from the configuration which didn't error are actually nonsensical? In other words, has the differential testing from the fuzzer ever detected valid but missing error checking?

There are testing techniques beyond differential testing which could be applied to databases (e.g. metamorphic testing, pivoted query synthesis). Section 6, Related Work, of https://arxiv.org/pdf/2007.08292.pdf might be a reasonable place to start for anyone who is interested. It would be a fun research project to apply them to aggregation pipelines and not just find queries! I also enjoyed reading Section 3.4, Corner Cases and Limitations, as it basically summarizes the complexities we've hit with the fuzzers over the years. With any of these testing techniques, there's a tradeoff between the kinds of bugs it can detect (because the inputs/queries/transformations/etc. are restricted) and how much of an idealized model must be implemented elsewhere to verify the results. The ones mentioned in the paper still accommodate our lack of a ground truth for the system's behavior (i.e. a test oracle), though.
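As a concrete illustration of the metamorphic-testing idea mentioned above (not the paper's exact method): one relation for MQL-like filters is that conjoining a tautology onto a `$match` predicate must not change the result set, so the relation itself serves as the oracle and no second engine is needed. The `run_pipeline` evaluator here is a hypothetical stand-in for the system under test.

```python
# Illustrative metamorphic relation for MQL-like filters: a query and the
# same query AND-ed with an always-true predicate must return the same
# documents. run_pipeline is a hypothetical stand-in for the engine.

def with_tautology(match_stage):
    """Rewrite {"$match": P} into {"$match": {"$and": [P, <tautology>]}}."""
    tautology = {"$expr": {"$eq": [1, 1]}}
    return {"$match": {"$and": [match_stage["$match"], tautology]}}

def check_metamorphic(run_pipeline, docs, match_stage):
    """True when the base and transformed pipelines agree on the results."""
    base = run_pipeline(docs, [match_stage])
    transformed = run_pipeline(docs, [with_tautology(match_stage)])
    return base == transformed
```

For any correct engine, `check_metamorphic(run_pipeline, docs, {"$match": {"x": 1}})` should hold; a violation points at a real evaluation bug without needing a second configuration to diff against.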

Generated at Thu Feb 08 05:54:52 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.