[SERVER-66770] Consider turning knobs that increase the chaos in our tests Created: 25/May/22  Updated: 29/Oct/23  Resolved: 15/Sep/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 6.2.0-rc0

Type: Task Priority: Major - P3
Reporter: Judah Schvimer Assignee: Sulabh Mahajan
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-69615 Rollback fuzzing in WiredTiger leads ... Closed
Related
Backwards Compatibility: Fully Compatible
Sprint: Execution Team 2022-06-27, Execution Team 2022-07-11, Execution Team 2022-08-22, Execution Team 2022-09-05, Execution Team 2022-09-19
Participants:
Linked BF Score: 0

 Description   

There are a number of knobs we can turn to increase the chaos in our tests. We should consider whether to run our tests in these more "aggressive" scenarios.



 Comments   
Comment by Githook User [ 15/Sep/22 ]

Author:

{'name': 'Sulabh Mahajan', 'email': 'sulabh.mahajan@mongodb.com', 'username': 'sulabhM'}

Message: SERVER-66770 Add more WiredTiger configurations to config fuzz
Branch: master
https://github.com/mongodb/mongo/commit/1b942407096c139fdb69b1d61cfe20e4e5c0f1eb

Comment by Sulabh Mahajan [ 13/Sep/22 ]

I have moved this ticket into the blocked state. The patch test has found a bug, SERVER-69615, that needs to be fixed before the change can go in.

Comment by Sulabh Mahajan [ 31/Aug/22 ]

After going through all the comments and reading the code and functionality of the various options, I am considering fuzzing the following (a sketch of how these knobs might be combined follows the lists below):

Eviction related:

  • eviction_updates_trigger and eviction_updates_target - change when eviction starts and stops based on the total size of updates in the cache
  • debug_mode=(eviction=true) - for more aggressive eviction

Checkpoint related:

  • syncDelay - more/less frequent checkpoints
  • debug_mode=(slow_checkpoint=true) - slow down checkpoint creation by slowing down internal page processing

Others:

  • Lowering the oldest_timestamp - affects checkpoint cleanup/garbage collection and lets reconciliation discard updates more frequently.
  • debug_mode=(rollback_error=N) - force a rollback error to be returned every N operations.
  • debug_mode=(realloc_exact=true) - forces WT to realloc exact amount of memory needed. More likely to find memory issues.
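
A minimal sketch of how these knobs could be turned into a WiredTiger configuration string plus a server parameter. This is illustrative only: the helper name and the value ranges are my assumptions, not the real config fuzzer's code or ranges.

    import random

    def fuzz_storage_chaos_knobs(rng=random):
        # Illustrative only: pick random values for the knobs shortlisted above.
        # The ranges are assumptions for this sketch, not the fuzzer's real ranges.

        # Updates-dirty eviction: keep the target below the trigger.
        updates_target = rng.randint(2, 10)
        updates_trigger = rng.randint(updates_target + 1, 20)

        wt_config = ",".join([
            "eviction_updates_target=%d" % updates_target,
            "eviction_updates_trigger=%d" % updates_trigger,
            # Aggressive eviction, slow checkpoints, and the error-injection
            # style debug flags discussed above (rollback_error=0 disables it).
            "debug_mode=(eviction=true,slow_checkpoint=true,"
            "rollback_error=%d,realloc_exact=true)" % rng.choice([0, 10, 50]),
        ])

        # syncdelay is the checkpoint interval in seconds (60 is the default).
        server_parameters = {"syncdelay": rng.choice([10, 30, 60, 180])}
        return wt_config, server_parameters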

I decided not to pursue the following, and leave them for another round of fuzzing:

  • stepdown_interval_ms - Judah reported that this was useful in one of the BFs. I worked with Louis, and it looks like the fuzzer job runs in parallel with stepdown scripts, so we should get some amount of random stepdowns. We can explore this more in the future.
  • os_cache_dirty_max, os_cache_max - I feel like adjusting these won't uncover functional bugs, but will mostly affect performance, which we don't care about in these tests.
  • timing_stress_for_test=<VALUES> - This is an interesting one. These trigger internal "stress" points and could be useful. But as Don mentioned, it is an internal API and we would need to think harder about using it outside WT testing.
  • wiredTigerCacheSizeGB - Running WiredTiger with a lower cache could be useful, but on the other hand, the failures we find with a smaller cache are usually expected behaviour rather than bugs. I fear we will not get a good signal on valid bugs with this one. Also, the various eviction targets/triggers are calculated from the cache size, which is easier to reason about with a static/default value. From experience with WiredTiger's test/format fuzzing, getting good tests for smaller cache sizes is hard when fuzzing other parameters too. This parameter would be worth considering in a future round of fuzzing.
Comment by Sulabh Mahajan [ 30/Aug/22 ]

Among others donald.anderson@mongodb.com has already listed, there are a few more options to consider:

Debug options:

  • debug_mode=(slow_checkpoint=true) - slow down checkpoint creation by slowing down internal page processing
  • debug_mode=(cursor_reposition=true) - encourages more eviction - a feature we developed to stop cursors from pinning hot pages and blocking their eviction - but we have not turned it on yet as we did not see the perf improvements we expected.
  • debug_mode=(update_restore_evict=true) - forces eviction of dirty pages through update restore path.

Eviction options:
WiredTiger also keeps a separate accounting of the memory consumed by updates. There have been cases where these targets/triggers have come into play because the cache was consumed mostly by tiny updates, so it will be worthwhile to fuzz these values:

  • eviction_updates_target
  • eviction_updates_trigger

Next, I will take a call on what parameters to focus on and probably make some code changes to play with them.
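
For reference, WiredTiger's configuration syntax lets several of these debug flags be combined into a single nested debug_mode group. A small illustration (the particular mix of flags is just an example):

    # Illustrative only: combine the debug flags discussed above into one
    # WiredTiger configuration fragment using WT's nested config syntax.
    debug_config = ("debug_mode=(slow_checkpoint=true,"
                    "cursor_reposition=true,"
                    "update_restore_evict=true)")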

Comment by Sulabh Mahajan [ 30/Aug/22 ]

Just for information, here is a list of the parameters that are already fuzzed (a sketch of the ordering constraints among the eviction values follows the lists):

Eviction related:

  • eviction_checkpoint_target
  • eviction_target
  • eviction_trigger
  • eviction_dirty_target
  • eviction_dirty_trigger

File handle management:

  • close_idle_time_secs
  • close_handle_minimum
  • close_scan_interval

Table specific settings:

  • internal_page_max
  • leaf_page_max
  • leaf_value_max
  • memory_page_max
  • split_pct
  • prefix_compression
  • block_compressor
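
For the eviction values in particular, any fuzzer has to respect the relative ordering WiredTiger expects: as I understand it, each target must stay below its corresponding trigger. A sketch of that constraint with illustrative ranges (not the real fuzzer's ranges or code):

    import random

    def pick_eviction_settings(rng=random):
        # Illustrative sketch: choose eviction targets/triggers so that each
        # target stays below its corresponding trigger, which is my
        # understanding of what WiredTiger's configuration validation expects.
        target = rng.randint(50, 80)
        trigger = rng.randint(target + 1, 95)
        dirty_target = rng.randint(2, 20)
        dirty_trigger = rng.randint(dirty_target + 1, 30)
        # Keeping the checkpoint target at or below the dirty target is an
        # illustrative choice; the real fuzzer's constraints may differ.
        checkpoint_target = rng.randint(1, dirty_target)
        return {
            "eviction_target": target,
            "eviction_trigger": trigger,
            "eviction_dirty_target": dirty_target,
            "eviction_dirty_trigger": dirty_trigger,
            "eviction_checkpoint_target": checkpoint_target,
        }
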
Comment by Sulabh Mahajan [ 29/Aug/22 ]

Sulabh Mahajan, do you think lowering the oldest_timestamp lag down from our 5 minute default would have helped in uncovering WT-9500?

Sorry for the late reply. Dan, it is not obvious to me whether lowering the oldest_timestamp window would have uncovered WT-9500. On the other hand, the window duration is a crucial parameter that changes how eviction behaves and triggers checkpoint cleanup more often, so it is worth fuzzing. But I think most tests would finish within 5 minutes, so it should not make a lot of difference in functional testing.
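
For context, a minimal sketch of how a test could lower that window, assuming (my assumption, not something stated here) that the 5 minute lag corresponds to the minSnapshotHistoryWindowInSeconds server parameter:

    from pymongo import MongoClient

    # Assumption: the 5 minute oldest_timestamp lag discussed above maps to the
    # minSnapshotHistoryWindowInSeconds server parameter (default 300 seconds).
    client = MongoClient("localhost", 27017)
    client.admin.command({"setParameter": 1, "minSnapshotHistoryWindowInSeconds": 30})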

Comment by Daniel Gottlieb (Inactive) [ 27/Jun/22 ]

sulabh.mahajan@mongodb.com, do you think lowering the oldest_timestamp lag down from our 5 minute default would have helped in uncovering WT-9500?

Comment by Donald Anderson [ 02/Jun/22 ]

judah.schvimer@mongodb.com Thanks for your thoughts on this. Looking at test/format, there are of course many random configurations, including key/value size variations, read/write/modify/update percentage mix, whether to do backups, logging, compression (and what kind), encryption, cache size, checkpoint frequency, and I'm sure many things that MongoDB may not vary. The timing_stress_for_test values are mixed in as well, to expose races if we can. Of course, out of zillions of possible combinations, we test dozens, perhaps hundreds, in one of our Evergreen stress runs; I'm sure there must be a thousand different combinations tried in a day.

There are several items I mentioned above that test/format doesn't test. Some are done elsewhere: I see test/checkpoint turns on debug_mode=(eviction=true). Some others, like realloc_exact, are not regularly tested in any stressful way and might be interesting to add (but we do ASAN runs, so we probably get similar bug coverage for that particular one). Others, like debug_mode=(rollback_error=N), are designed to find bugs in the caller. It's not so interesting to find rollback coding errors in test/format or our other test programs, but finding them in MongoDB may be a good exercise to test error paths.

I also think it could be valuable for MongoDB to turn on some of the things that affect timing. Tempting as it is to just use our undocumented timing_stress_for_test, it might be better to come up with a new debug API in WT, like debug_mode=(vary_timing=X), where X is a value that MongoDB could select randomly (and report in a log somewhere); based on X, WT would internally vary the timing of certain operations in a predictable way, so that a run can be replayed if we see a problem. That's at the WT level. If there are known potential race points at higher levels in MDB, I would encourage something similar: a debug mode that adds artificial delays at key points.

Comment by Donald Anderson [ 01/Jun/22 ]

judah.schvimer@mongodb.com, here are some thoughts. To increase chaos, we might try these options in the wiredtiger_open configuration (a sketch combining several of them appears after this list):

  • debug_mode=(eviction=true) more aggressive eviction - may change timing
  • debug_mode=(rollback_error=N) force a rollback error to be returned every N operations. Most WT operations must be prepared for rollback errors, this will really test that.
  • debug_mode=(realloc_exact=true) when WT reallocs, get the exact amount needed, instead of growing with extra space. This is mainly targeted to making WT bounds bugs more likely, but it will change the memory allocation pattern a bit.
  • cache_size=N. Setting cache lower will stress eviction, may change timing, and artificially create timeouts, which may test error paths.
  • os_cache_dirty_max (more fsync-ing after writes to a file)
  • os_cache_max (calls to posix_fadvise to say we don't need blocks)
  • eviction_dirty_target
  • eviction_dirty_trigger
  • eviction_target
  • eviction_trigger
  • eviction_updates_target
  • eviction_updates_trigger
  • frequency of checkpoints

A lot of those, from cache_size down, are really tuning parameters, so they may specifically cause more disk activity, which will change application timing. They may also have overlapping effects. Expect workloads to run (possibly much) longer.
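
To make that concrete, here is a rough sketch of how several of the options above could be combined into a single wiredtiger_open configuration, using the WiredTiger Python bindings. The particular values are arbitrary examples chosen for the sketch, not recommendations, and it assumes the bindings are built and a WT_HOME directory exists.

    import random
    import wiredtiger  # WiredTiger's Python bindings, built from the WT tree

    # Arbitrary example values; a fuzzer would randomize these per run.
    config = ",".join([
        "create",
        "cache_size=%dMB" % random.choice([256, 512, 1024]),   # a smaller cache stresses eviction
        "checkpoint=(wait=%d)" % random.choice([10, 60, 120]),  # checkpoint frequency in seconds
        "eviction_dirty_target=5,eviction_dirty_trigger=20",
        "eviction_target=70,eviction_trigger=90",
        "debug_mode=(eviction=true,rollback_error=50,realloc_exact=true)",
    ])

    conn = wiredtiger.wiredtiger_open("WT_HOME", config)  # assumes ./WT_HOME exists
    conn.close()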

There's also a class of options under timing_stress_for_test=VALUES where VALUES would choose from aggressive_sweep, history_store_sweep_race, history_store_search. There are many more of these values but they are pretty specific to WT operations. They generally inject calls to yield threads or do something else that is likely to change the timing and expose thread races. They are designed to expose WT problems, but it's possible that they may change timing for callers, enough to create a small amount of "chaos". timing_stress_for_test is undocumented and is subject to change, so if you want to head down that path, we should talk further about the best way to do that. You might get the same result by similarly injecting yield operations in the application threads.

I made this list by looking at WiredTiger documentation and dist/api_data.py in the WT tree, which indirectly generates the doc and also has the undocumented things. Feel free to peruse it yourself and ask about anything that looks like it might be useful.

I'm going to reassign back to you.

Comment by Daniel Gottlieb (Inactive) [ 25/May/22 ]

We do have a "framework" for turning knobs. It only runs as a task (I believe against the concurrency_replication suite). There are some complexities in trying to push more tests through configuration fuzzing (reproducing test failures and knowing when a failure was introduced), though those concerns are largely hypothetical, as I'm not aware of the current incarnation having found anything interesting.
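
One general way to handle the reproducibility concern (a common pattern, not necessarily how the existing framework works) is to derive every fuzzed value from a single seed and log that seed, so a failing configuration can be regenerated:

    import random
    import time

    # General reproducibility pattern, not the actual fuzzer implementation:
    # derive all fuzzed values from one logged seed so a failure can be replayed.
    seed = int(time.time())
    print("config fuzzer seed: %d" % seed)  # a real harness would record this with the task

    rng = random.Random(seed)
    fuzzed_syncdelay = rng.choice([10, 30, 60, 180])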
