[SERVER-66770] Consider turning knobs that increase the chaos in our tests Created: 25/May/22 Updated: 29/Oct/23 Resolved: 15/Sep/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 6.2.0-rc0 |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Judah Schvimer | Assignee: | Sulabh Mahajan |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Sprint: | Execution Team 2022-06-27, Execution Team 2022-07-11, Execution Team 2022-08-22, Execution Team 2022-09-05, Execution Team 2022-09-19 |
| Participants: | |
| Linked BF Score: | 0 |
| Description |
|
There are a number of knobs we can turn to increase the chaos in our tests; we should consider whether to run our tests in these more "aggressive" scenarios. |
| Comments |
| Comment by Githook User [ 15/Sep/22 ] |
|
Author: {'name': 'Sulabh Mahajan', 'email': 'sulabh.mahajan@mongodb.com', 'username': 'sulabhM'}
Message: |
| Comment by Sulabh Mahajan [ 13/Sep/22 ] |
|
I have moved this ticket into the blocked state. The patch test has found a bug, |
| Comment by Sulabh Mahajan [ 31/Aug/22 ] |
|
After going through all the comments and reading the code and the functionality of the various options, I am considering fuzzing the following (see the sketch after this list):
Eviction related:
Checkpoint related:
Others:
I decided not to pursue the following and will leave them for another round of fuzzing:
|
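As a rough illustration of the approach, here is a minimal Python sketch of how eviction- and checkpoint-related settings could be fuzzed from a single recorded seed so a failing combination can be replayed. Only the option names are real wiredtiger_open settings; the ranges, probabilities, and the helper itself are assumptions, not the actual patch.

```python
import random


def fuzz_wt_config(seed):
    """Build a wiredtiger_open configuration fragment from one seed.
    The option names are real wiredtiger_open settings; the ranges and the
    choice of exactly these options are illustrative assumptions."""
    rng = random.Random(seed)
    eviction_target = rng.randint(50, 80)                     # % of cache
    eviction_trigger = rng.randint(eviction_target + 5, 95)   # must exceed target
    dirty_target = rng.randint(5, 40)
    dirty_trigger = rng.randint(dirty_target + 5, 80)
    checkpoint_wait = rng.randint(1, 120)                     # seconds between checkpoints
    return (
        "eviction_target=%d,eviction_trigger=%d,"
        "eviction_dirty_target=%d,eviction_dirty_trigger=%d,"
        "checkpoint=(wait=%d)"
        % (eviction_target, eviction_trigger, dirty_target, dirty_trigger,
           checkpoint_wait)
    )


if __name__ == "__main__":
    seed = random.randrange(2**32)
    print("config fuzz seed:", seed)   # log the seed so a failing run can be reproduced
    print(fuzz_wt_config(seed))
```

Logging the seed alongside the test output is what keeps this kind of fuzzing reproducible.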
| Comment by Sulabh Mahajan [ 30/Aug/22 ] |
|
In addition to the options donald.anderson@mongodb.com has already listed, there are a few more to consider. Debug options:
Eviction options:
Next, I will decide which parameters to focus on and will probably make some code changes to experiment with them (a sketch of the debug_mode flags follows below). |
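Below is a hedged sketch of how the debug_mode knobs mentioned in this discussion (eviction, realloc_exact, rollback_error) could be toggled randomly and handed to mongod. The probabilities, N values, and the helper are illustrative assumptions; only the flag names come from the wiredtiger_open debug_mode configuration.

```python
import random


def fuzz_debug_mode(seed):
    """Randomly enable some of the WT debug_mode settings discussed above.
    Flag names are real debug_mode options; probabilities and N values are
    arbitrary choices for illustration."""
    rng = random.Random(seed)
    flags = []
    if rng.random() < 0.5:
        flags.append("eviction=true")        # aggressively evict pages
    if rng.random() < 0.5:
        flags.append("realloc_exact=true")   # exact-size reallocations
    if rng.random() < 0.2:
        # Make roughly 1 in N transaction commits return WT_ROLLBACK so the
        # caller's rollback/retry error paths get exercised.
        flags.append("rollback_error=%d" % rng.choice([10, 50, 100]))
    return ("debug_mode=(%s)" % ",".join(flags)) if flags else ""


# The resulting fragment could be passed to mongod via, for example:
#   mongod --wiredTigerEngineConfigString "debug_mode=(eviction=true)"
```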
| Comment by Sulabh Mahajan [ 30/Aug/22 ] |
|
Just for information, here is a list of the parameters that are already fuzzed.
Eviction related:
File handle management:
Table-specific settings:
|
| Comment by Sulabh Mahajan [ 29/Aug/22 ] |
Sorry for the late reply. Dan, it is not obvious to me whether lowering the oldest_timestamp window would uncover |
| Comment by Daniel Gottlieb (Inactive) [ 27/Jun/22 ] |
|
sulabh.mahajan@mongodb.com, do you think lowering the oldest_timestamp lag from our 5-minute default would have helped in uncovering |
| Comment by Donald Anderson [ 02/Jun/22 ] |
|
judah.schvimer@mongodb.com Thanks for your thoughts on this. Looking at test/format, there are of course many random configurations, including key/value size variations, the read/write/modify/update percentage mix, whether to do backups, logging, compression (and what kind), encryption, cache size, checkpoint frequency, and I'm sure many things that MongoDB may not vary. The timing_stress_for_test values are mixed in as well to expose races where we can. Of course, with zillions of possible combinations, we test dozens, perhaps hundreds, of combinations in one of our Evergreen stress runs; I'm sure there must be a thousand different combinations tried in a day.

There are several items I mentioned above that test/format doesn't exercise. Some are covered elsewhere; I see test/checkpoint turns on debug_mode=(eviction=true). Some others, like realloc_exact, are not regularly tested in any stressful way and might be interesting to add (but we do ASAN runs, so we probably get similar bug coverage for that particular one). Others, like debug_mode=(rollback_error=N), are designed to find bugs in the caller. It's not so interesting to find rollback coding errors in test/format or our other test programs, but finding them in MongoDB may be a good exercise to test error paths.

I also think it could be valuable for MongoDB to turn on some of the things that affect timing. Tempting as it is to just use our undocumented timing_stress_for_test, it might be better to come up with a new debug API in WT, something like debug_mode=(vary_timing=X), where X is a value that MongoDB could select randomly (and report in a log somewhere). Based on X, WT would internally vary the timing of certain operations in a predictable way, so that a run can be replayed if we see a problem. That's at the WT level. If there are known potential race points at higher levels in MDB, I would encourage something similar: a debug mode that can add artificial delays at key points. |
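As a rough illustration of the "add artificial delays at key points" idea, here is a hypothetical test-side helper in Python. It is not the proposed debug_mode=(vary_timing=X) API and not an existing WT or MongoDB facility; all names and values are assumptions. Every delay decision derives from one logged seed, so the decision sequence can be regenerated, though thread scheduling itself still varies between runs.

```python
import random
import threading
import time


class TimingChaos:
    """Hypothetical test-side helper: at named, race-prone points in a test,
    occasionally sleep for a short random interval to perturb thread
    interleavings.  Every decision comes from one logged seed."""

    def __init__(self, seed, delay_probability=0.1, max_delay_ms=5):
        self._rng = random.Random(seed)
        self._lock = threading.Lock()       # keeps the RNG stream ordered across threads
        self._probability = delay_probability
        self._max_delay_s = max_delay_ms / 1000.0
        print("timing-chaos seed:", seed)   # record the seed with the test output

    def maybe_delay(self, point_name):
        # point_name exists so delays can be logged or limited to certain points.
        with self._lock:
            roll = self._rng.random()
            delay = self._rng.uniform(0.0, self._max_delay_s)
        if roll < self._probability:
            time.sleep(delay)               # time.sleep(0) would merely yield the thread


# Example use inside a test's worker threads:
#   chaos = TimingChaos(seed=12345)
#   chaos.maybe_delay("before_commit")
```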
| Comment by Donald Anderson [ 01/Jun/22 ] |
|
judah.schvimer@mongodb.com, here are some thoughts. On trying to increase chaos, we might try these options in the wiredtiger_open configuration:

A lot of those, from cache_size down, are really tuning parameters, so they may specifically cause more disk activity, which will change the timing of the application. They may also have overlapping effects. Expect workloads to run (possibly much) longer.

There's also a class of options under timing_stress_for_test=VALUES, where VALUES would choose from aggressive_sweep, history_store_sweep_race, history_store_search. There are many more of these values, but they are pretty specific to WT operations. They generally inject calls to yield threads or do something else that is likely to change the timing and expose thread races. They are designed to expose WT problems, but it's possible that they change the timing for callers enough to create a small amount of "chaos". timing_stress_for_test is undocumented and subject to change, so if you want to head down that path, we should talk further about the best way to do that. You might get a similar result by injecting yield operations in the application threads.

I made this list by looking at the WiredTiger documentation and dist/api_data.py in the WT tree, which indirectly generates the docs and also contains the undocumented settings. Feel free to peruse it yourself and ask about anything that looks like it might be useful. I'm going to reassign back to you. |
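For reference, here is a minimal sketch using WiredTiger's Python bindings of what opening a connection with some of these more "chaotic" settings could look like. The specific values are illustrative, not recommendations, and (as noted above) timing_stress_for_test is undocumented and subject to change.

```python
import os

from wiredtiger import wiredtiger_open

home = "WT_CHAOS_HOME"
os.makedirs(home, exist_ok=True)        # wiredtiger_open expects the directory to exist

config = ",".join([
    "create",
    "cache_size=50MB",                  # small cache => more eviction and disk activity
    "checkpoint=(wait=5)",              # checkpoint every 5 seconds
    "timing_stress_for_test=[aggressive_sweep,history_store_search]",
])
conn = wiredtiger_open(home, config)
session = conn.open_session()
# ... run the workload under these settings ...
session.close()
conn.close()
```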
| Comment by Daniel Gottlieb (Inactive) [ 25/May/22 ] |
|
We do have a "framework" for turning knobs. It only runs as a task (I believe against the concurrency_replication suite). There are some complexities in trying to push more tests through configuration fuzzing (around reproducing test failures and knowing when a failure was introduced), though those concerns are largely hypothetical, as I'm not aware of the current incarnation having found anything interesting. |
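On the reproducibility concern, the usual mitigation is to derive every fuzzed value from one seed and persist the chosen options next to the test logs so a failure can be rerun with identical knobs. A small sketch follows; the file name, option choice, and value ranges are hypothetical.

```python
import json
import random


def make_fuzzed_options(seed):
    """Derive all fuzzed choices from one seed (illustrative option only)."""
    rng = random.Random(seed)
    target = rng.randint(50, 80)
    return {
        "seed": seed,
        "wiredTigerEngineConfigString":
            "eviction_target=%d,eviction_trigger=%d"
            % (target, rng.randint(target + 5, 95)),
    }


def record(options, path="fuzzed_mongod_options.json"):
    # Written next to the test logs so a failing run can be repeated exactly.
    with open(path, "w") as f:
        json.dump(options, f, indent=2)


def reproduce(path="fuzzed_mongod_options.json"):
    # Re-read the recorded options (or just the seed) to rerun the failure.
    with open(path) as f:
        return json.load(f)
```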