[SERVER-31099] Automate testing when oldest_timestamp stalls Created: 15/Sep/17  Updated: 30/Oct/23  Resolved: 01/Nov/17

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: None
Fix Version/s: 3.6.0-rc3

Type: Task Priority: Major - P3
Reporter: Eric Milkie Assignee: Sulabh Mahajan
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File perf_latest.png     PNG File perf_test.png    
Issue Links:
Depends
depends on WT-3652 Skip lookaside reads for checkpoints ... Closed
Related
related to SERVER-30785 Slow secondary kills primary Closed
is related to SERVER-28166 Assess effects of pinning a lot of co... Closed
Backwards Compatibility: Fully Compatible
Sprint: Storage 2017-10-02, Storage 2017-10-23, Storage 2017-11-13
Participants:

 Description   

Devise tests for behavior of the system when timestamped writes continue to happen while the oldest_timestamp ceases to be updated. This situation can happen when a majority of secondaries stop replicating in a replica set.
The expectation is that, as the cache fills with dirty data, the system degrades smoothly and lookaside table usage increases.
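
One way to turn "degrades smoothly" into a checkable condition is to scan per-second insert counts for long runs of zero throughput. The following is a minimal sketch only, not the test that was committed for this ticket. The function name find_stalls, the min_seconds threshold, and the sample data are all hypothetical and chosen for illustration.

```python
from typing import List, Tuple


def find_stalls(inserts_per_sec: List[int], min_seconds: int = 10) -> List[Tuple[int, int]]:
    """Return (start, end) offsets in seconds, end exclusive, of every run of
    zero inserts that lasts at least min_seconds.

    Hypothetical helper: the 10 second default is an assumed threshold,
    not a value taken from the ticket.
    """
    stalls = []
    run_start = None
    for t, count in enumerate(inserts_per_sec):
        if count == 0:
            # Start of a new zero-throughput run, or continuation of one.
            if run_start is None:
                run_start = t
        else:
            # Run ended; record it only if it was long enough.
            if run_start is not None and t - run_start >= min_seconds:
                stalls.append((run_start, t))
            run_start = None
    # Handle a run that extends to the end of the samples.
    if run_start is not None and len(inserts_per_sec) - run_start >= min_seconds:
        stalls.append((run_start, len(inserts_per_sec)))
    return stalls


# Made-up sample data: throughput that declines but never reaches zero,
# and throughput that drops to zero for 30 seconds.
smooth = [1000 - 5 * t for t in range(100)]
stalled = [1000] * 20 + [0] * 30 + [400] * 10

print(find_stalls(smooth))   # []
print(find_stalls(stalled))  # [(20, 50)]
```

A test built on this idea would fail when find_stalls returns a non-empty list, while still tolerating a gradual decline in throughput.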



 Comments   
Comment by Githook User [ 01/Nov/17 ]

Author: Sulabh Mahajan (sulabhM) <sulabh.mahajan@mongodb.com>

Message: SERVER-31099 Add automated test for stall when WiredTiger uses LAS file
Branch: master
https://github.com/mongodb/mongo/commit/0342b7bd64be6a8fec25a18ab633f2f9a27f0558

Comment by Sulabh Mahajan [ 27/Oct/17 ]

I retested today with the latest WT develop and mongo master. I no longer see stalls.

Comment by Alexander Gorrod [ 27/Oct/17 ]

sulabh.mahajan, some additional work done in WT-3652 has been merged into the WiredTiger develop branch. It would be helpful if you could re-run the test and regenerate the graph using the new code.

Comment by Sulabh Mahajan [ 27/Sep/17 ]

milkie, unfortunately that's true. I have discussed these results with michael.cahill, so he is aware of these stalls. They correspond to checkpoints reading data back from the LAS file and then writing out the checkpoint. The work on WT-3435 is still ongoing; I will re-run this test when that ticket concludes.

Comment by Eric Milkie [ 27/Sep/17 ]

If I am reading this graph correctly, it says that for about 100 seconds there were 0 writes per second? (From ~380 to ~500.)

Comment by Sulabh Mahajan [ 27/Sep/17 ]

Attached is a graph of the performance degradation as the LAS file gets used because of a pinned timestamp, for MongoDB-3.6 with the changes being made by Michael for WT-3435:
In the graph, the insert count per second is in blue and the LAS file size on disk is in red, both plotted against elapsed time in seconds.

Comment by Eric Milkie [ 21/Sep/17 ]

Thanks for that testing, Sulabh.
Eventually I'd also like to see an analysis similar to the one originally done for SERVER-28166, graphing the performance degradation as pinned timestamp data gets spilled into an LAS file. However, that will have to wait for the conclusion of WT-3435.

Comment by Sulabh Mahajan [ 21/Sep/17 ]

I did some testing for this ticket today. I ran the test with the setup and workload from SERVER-28166 and came to the following conclusions:

1. On mongodb master I got a stall similar to the one in SERVER-28166. This is expected and detailed in this ticket. WT-3435 is expected to bring in changes to fix the stall.
2. On mongodb-3.4 I got a similar stall again. This is expected to be due to SERVER-30785 and fixed by WT-3296.
3. I patched changes from WT-3296 into mongodb-3.4 and ran the test again. I did not see a stall this time.

Comment by Eric Milkie [ 20/Sep/17 ]

Coincidentally, redbeard0531 has already done a bit of testing here, although unintentionally. While testing the performance of the server with timestamps, he encountered a bug in one-node replica sets for the inMemory storage engine that caused oldest_timestamp to never be updated. I'll be filing a ticket about this soon. Mathias can also assist Sulabh with this oldest_timestamp testing in general.

Comment by Alexander Gorrod [ 20/Sep/17 ]

sulabh.mahajan Please take a look at this ticket, and think about crafting a use case. I expect the workload to be similar to the work done in SERVER-28166, which has a workload attached.
