[SERVER-23359] WiredTiger should not cache updates between named snapshots Created: 25/Mar/16  Updated: 06/Dec/22  Resolved: 21/Dec/16

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Mathias Stearn Assignee: Backlog - Storage Execution Team
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: monitor.py, named_snapshot_behavior.png, plot.py, workload.js
Assigned Teams:
Storage Execution
Backwards Compatibility: Fully Compatible
Participants:

 Description   

See the attached image. An identical workload is run against two replica sets (test and control), each with two data-bearing nodes and an arbiter. All hosts are configured with a 1GB WT cache. The workload starts by inserting one million documents in batches of 100, each document carrying a 900-byte random string. When this completes (indicated by a blue vertical line in the image), the secondary of the test set is killed, which prevents the primary from advancing its commit point or deleting old snapshots; the primary keeps creating new snapshots until the limit of 1000 uncommitted snapshots is hit. After the secondary is killed, the workload switches to updating documents for 20 minutes. The updates are done in batches of 1000 sequential documents.

The test set appears to use an unbounded amount of disk space and suffers from some extreme pauses. During some, but not all, of these pauses, the system seems to be completely idle with barely any CPU or disk utilization.

To confirm that the problem was not related to there being 1000 snapshots, I limited the server to keeping 3 total snapshots by setting the uncommitted snapshot limit to 2 at https://github.com/mongodb/mongo/blob/r3.3.3/src/mongo/db/repl/oplog.cpp#L1100. This didn't seem to make much of a difference.

Also, moving the testSet.stop() line above beginState('insert') makes the snapshots cover an empty collection, so all inserts happen after the snapshots are taken. Even in this case, disk usage appears to be unbounded.

Repro:

  1. Download the .js and .py files to a directory that contains a mongod binary
  2. If needed, install the Python 2 libraries pymongo and matplotlib
  3. Launch a mongod on the default port (27017) for reporting and IPC
  4. Run mongo workload.js (this launches the replica sets, runs monitor.py, and drives the workload)
  5. Once the workload starts, run python plot.py (the plot updates as new data is collected)
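For reference, the two workload phases described above can be sketched in Python. This is a simplified stand-in for the attached workload.js, not a copy of it; the field names ("_id", "payload") and generator structure are assumptions.

```python
import random
import string

DOC_VALUE_BYTES = 900   # each document carries a 900-byte random string
INSERT_BATCH = 100      # insert phase: batches of 100 documents
UPDATE_BATCH = 1000     # update phase: batches of 1000 sequential documents
TOTAL_DOCS = 1_000_000

def random_payload(n=DOC_VALUE_BYTES):
    """Return an n-character random ASCII string, like the workload's filler."""
    return "".join(random.choice(string.ascii_letters) for _ in range(n))

def insert_batches(total=TOTAL_DOCS, batch=INSERT_BATCH):
    """Yield lists of documents sized for insert_many()."""
    for start in range(0, total, batch):
        yield [{"_id": i, "payload": random_payload()}
               for i in range(start, min(start + batch, total))]

def update_batches(total=TOTAL_DOCS, batch=UPDATE_BATCH):
    """Yield (filter, update) pairs touching `batch` sequential _ids."""
    for start in range(0, total, batch):
        yield ({"_id": {"$gte": start, "$lt": min(start + batch, total)}},
               {"$set": {"payload": random_payload()}})

# Against a live server these generators would drive pymongo, e.g.:
#   for docs in insert_batches():
#       coll.insert_many(docs)
#   for filt, upd in update_batches():
#       coll.update_many(filt, upd)
```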


 Comments   
Comment by Alexander Gorrod [ 21/Dec/16 ]

Cleaning up intermediate updates is incompatible with the requirements of future multi-document transaction support.

Comment by Michael Cahill (Inactive) [ 31/Mar/16 ]

redbeard0531, we did discuss discarding intermediate versions, but that didn't make it into 3.2. It shouldn't be too hard, but it requires tracking transaction snapshots beyond what we maintain today. I'll use this test case to measure how effective the solution is.

Comment by Mathias Stearn [ 29/Mar/16 ]

michael.cahill When we were working on the design for this, it sounded like WT would automatically purge unneeded intermediate versions of documents: if there is a snapshot at version 1 of a document and it is then updated 1000 times, only version 1 and the latest version would be kept. If this isn't the case, we may need to rethink our snapshot retention policy.
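The retention policy being discussed can be illustrated with a toy per-document version chain. This is a simplified model for the sake of the argument, not WiredTiger's actual data structures:

```python
def retained_versions(versions, snapshots, purge_intermediate):
    """Toy model of per-document version retention.

    versions:  ascending version numbers that exist for one document
    snapshots: version numbers at which read snapshots are pinned
    """
    latest = versions[-1]
    # Each snapshot reads the newest version at or before its position.
    visible = {max(v for v in versions if v <= s) for s in snapshots}
    if purge_intermediate:
        # Proposed behavior: keep only snapshot-visible versions plus the latest.
        return sorted(visible | {latest})
    # Current behavior: keep everything from the oldest visible version onward.
    floor = min(visible)
    return sorted(v for v in versions if v >= floor)

# A document at version 1 with a snapshot pinned there, then updated 1000 times,
# retains 2 versions under purging but all 1001 under the current design.
history = list(range(1, 1002))
```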

Comment by Michael Cahill (Inactive) [ 29/Mar/16 ]

This looks like expected behavior given the current design: WiredTiger keeps all updates after the oldest snapshot. Once the cache becomes full, they overflow into the "lookaside table" (WiredTigerLAS.wt).

The next step should be to run the workload and gather diagnostic data to confirm that versions are overflowing into the lookaside table. We should also confirm where threads are blocked when no progress is being made.
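The overflow behavior described above can be modeled roughly as follows. This is a toy illustration of the lookaside idea (versions pinned by an old snapshot cannot be discarded, so once the cache budget is exceeded they spill to disk); it is not WiredTiger's actual eviction algorithm.

```python
class ToyCache:
    """Toy model: retained update versions live in cache until a byte budget
    is exceeded, then spill to an on-disk "lookaside" table (WiredTigerLAS.wt
    in the real system)."""

    def __init__(self, budget_bytes):
        self.budget = budget_bytes
        self.in_cache = []    # (key, version, size) tuples still in memory
        self.lookaside = []   # versions spilled to the lookaside table
        self.used = 0

    def add_update(self, key, version, size):
        self.in_cache.append((key, version, size))
        self.used += size
        while self.used > self.budget:
            # Evict the oldest cached version. An old snapshot still needs
            # it, so it cannot be discarded, only written to lookaside.
            old_key, old_ver, old_size = self.in_cache.pop(0)
            self.lookaside.append((old_key, old_ver, old_size))
            self.used -= old_size
```

With a pinned snapshot, every additional update past the budget lands in the lookaside table, which is consistent with the unbounded disk growth observed in the test set.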

Generated at Thu Feb 08 04:03:09 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.