[SERVER-62345] Consider re-evaluating fuzzer testing philosophy Created: 04/Jan/22 Updated: 06/Dec/22 Resolved: 09/Feb/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Question | Priority: | Major - P3 |
| Reporter: | Jennifer Peshansky (Inactive) | Assignee: | Backlog - Query Execution |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Assigned Teams: | Query Execution |
| Description |
|
Our fuzzers are designed to alert us to any change in behavior between different versions or configurations of mongo. They currently generate a BFG every time a query fails on one version and succeeds on another. Often, these are invalid queries that contain some improper value or operation. However, the offending part of the query may be optimized away on one side before it has a chance to error, or it may trigger a different error first because it executes in a different order. The problem grew sharply when we turned the SBE engine on by default. Since SBE re-implements MQL expressions from scratch, its undocumented behavior differs in many ways from non-SBE versions of mongo.
So far, we have addressed these BFs by adding workarounds to the fuzzer that suppress BFGs in specific scenarios. The list of these workarounds is growing long, and we dedicate a lot of time and resources to these BFs. Many of them are connected to queries that customers should not be running, since they are invalid MQL. Our documentation does not promise consistent behavior for invalid queries.
One proposal is to stop the fuzzers from generating BFGs when one side errors and the other doesn’t, or when the error messages don’t match, since this usually happens with queries that are already invalid. Under this proposal, the fuzzer would only generate a BFG when both sides succeed but return different results for the same query. There is a risk that a new version of mongo could introduce a bug where a valid query suddenly errors. The fuzzers currently catch this kind of regression, so we should discuss how it would be identified if we were to make this change. |
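The difference between the current behavior and the proposal can be sketched as two comparison policies. This is only an illustration in Python, not the fuzzer's actual code: the `Outcome` type and both function names are hypothetical, and the sketch assumes that the current policy also flags mismatched error messages and that results can be compared directly for equality.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Outcome:
    """Hypothetical record of running one query on one side."""
    results: Optional[list] = None  # documents returned if the query succeeded
    error: Optional[str] = None     # error message if the query failed

    @property
    def ok(self) -> bool:
        return self.error is None


def files_bfg_current(a: Outcome, b: Outcome) -> bool:
    """Assumed current policy: any observable difference is reported."""
    if a.ok != b.ok:
        return True                  # one side errored, the other did not
    if not a.ok:
        return a.error != b.error    # both errored, messages differ
    return a.results != b.results    # both succeeded, results differ


def files_bfg_proposed(a: Outcome, b: Outcome) -> bool:
    """Proposed policy: report only when both sides succeed and disagree."""
    if not (a.ok and b.ok):
        return False                 # any error on either side is ignored
    return a.results != b.results
```

Under the proposed policy, a valid query that starts erroring on one side falls into the ignored branch, which is exactly the risk raised above.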
| Comments |
| Comment by Kyle Suarez [ 09/Feb/22 ] |
|
Closing this as Done, as we had several meetings and have a path forward with regard to ignoring certain classes of errors. |
| Comment by Max Hirschhorn [ 05/Jan/22 ] |
I'd be curious to see tabulated numbers on this. On the Sharding team there had been a lot of anecdotal evidence that any failure in the sharding_csrs_continuous_config_stepdown Evergreen task was a testing-only problem and would get solved by excluding the offending test. I was surprised to find (and reported in SERVER-59891) that for every 2 testing-only issues there has still been 1 real code issue. Not a good ratio by any means, but much better than I had anticipated. The questions raised in the description are spot-on. Testing is about managing risk, and that means weighing severity and likelihood in the context of time scarcity and diminishing returns.
Has it ever been the case that the query results from the configuration which didn't error are actually nonsensical? In other words, has the differential testing from the fuzzer ever detected a case where the error was valid and the other side was missing error checking? There are testing techniques beyond differential testing which could be applied to databases (e.g. metamorphic testing, pivoted query synthesis). Section 6. Related Work of https://arxiv.org/pdf/2007.08292.pdf might be a reasonable place to start for anyone who is interested. It'll be a fun research project to apply them to aggregation pipelines and not only find queries! I also enjoyed reading Section 3.4. Corner Cases and Limitations, as I found it basically summarizes the complexities we've hit with the fuzzers over the years. With any of these testing techniques, there's a tradeoff between the kinds of bugs it can detect due to limiting the inputs/queries/transformations/etc. and how much of an idealized model must be implemented elsewhere to verify the results. The ones mentioned in the paper still accommodate our lack of a ground truth for the system behavior (aka a test oracle) though. |