[SERVER-32154] Unbounded lookaside table growth during insert workload Created: 03/Dec/17  Updated: 30/Oct/23  Resolved: 08/Aug/18

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: 3.6.0-rc8
Fix Version/s: 3.6.6, 4.0.0

Type: Bug Priority: Major - P3
Reporter: Bruce Lucas (Inactive) Assignee: Backlog - Storage Execution Team
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2017-12-15 at 4.12.53 pm.png     File dd-rc8-small.tar     PNG File growth.png     PNG File index-build.png    
Issue Links:
Depends
depends on WT-3766 Lookaside sweep for obsolete updates Closed
depends on WT-3911 Ignore lookaside data on shutdown Closed
depends on WT-3937 Tune lookaside sweep to react to work... Closed
Related
is related to SERVER-32139 Oplog truncation creates large amount... Closed
is related to WT-3766 Lookaside sweep for obsolete updates Closed
Assigned Teams:
Storage Execution
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:
Case:

 Description   

1-node repl set, SSD, 50 GB oplog, 4 GB cache, 5 threads inserting 130 kB docs.

It appears that the lookaside table grows without bound during a heavy write workload, both in number of entries and in the size of WiredTigerLAS.wt (the "csv size" metric). Even after the workload is stopped, the table doesn't shrink.

FTDC data attached.
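
For reference, a minimal sketch of the workload and one way to watch the lookaside counters from serverStatus (assumes a local mongod and pymongo; the namespace, batch size, and sampling window are illustrative, and the counters are matched by substring because their names vary by version, "lookaside" in 3.6 vs "cache overflow" later):

# Reproduction sketch: 5 threads inserting ~130 kB documents while the
# lookaside-related serverStatus counters are sampled once a second.
# Assumes a local mongod and pymongo; stat names are matched by substring
# because they differ across versions ("lookaside" vs "cache overflow").
import threading
import time

from pymongo import MongoClient

THREADS = 5
DOC_BYTES = 130_000

def insert_loop(stop):
    coll = MongoClient()["test"]["lookaside_repro"]
    while not stop.is_set():
        coll.insert_many([{"payload": "x" * DOC_BYTES} for _ in range(10)])

def lookaside_stats(client):
    cache = client.admin.command("serverStatus")["wiredTiger"]["cache"]
    return {k: v for k, v in cache.items() if "lookaside" in k or "overflow" in k}

if __name__ == "__main__":
    client = MongoClient()
    stop = threading.Event()
    workers = [threading.Thread(target=insert_loop, args=(stop,)) for _ in range(THREADS)]
    for w in workers:
        w.start()
    try:
        for _ in range(300):  # sample for ~5 minutes
            print(lookaside_stats(client))
            time.sleep(1)
    finally:
        stop.set()
        for w in workers:
            w.join()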



 Comments   
Comment by Bruce Lucas (Inactive) [ 08/Aug/18 ]

Yes, this workload is much better behaved now on 3.6.6. Some of the work was done in earlier tickets, but 3.6.6 is the highest fix version on the linked tickets, so I think we should mark this as fixed in 3.6.6.

Comment by Kelsey Schubert [ 07/Aug/18 ]

bruce.lucas, has this issue been resolved, or is there additional work that needs to be done?

Comment by Michael Cahill (Inactive) [ 25/Feb/18 ]

I've opened WT-3937 for tuning of the lookaside sweep and linked it here.

Comment by Ian Whalen (Inactive) [ 23/Feb/18 ]

OK, I've added the link to WT-3766. It would also be good if michael.cahill could comment on whether we should file another WT ticket based on the linked comment.

Comment by Bruce Lucas (Inactive) [ 23/Feb/18 ]

I think this ticket needs both WT-3911 and WT-3766. The latter is marked as fixed in 3.6.3, but I think it might need more work.

Comment by Ian Whalen (Inactive) [ 23/Feb/18 ]

Marking as fixversion=Backlog because there are no plans to work on this ticket directly and it will instead be resolved with the work on WT-3911.

Comment by Bruce Lucas (Inactive) [ 10/Feb/18 ]

michael.cahill, following up on our conversation yesterday, I'm still seeing the same lookaside-related shutdown behavior in 3.6.2: shutdown takes a very long time and uses a very large amount of memory, many times the size of the cache.

Comment by Bruce Lucas (Inactive) [ 21/Dec/17 ]

Thanks Alex, good to hear confirmation that the tickets you mention will fix this. I'm recording the symptoms I'm seeing here to help in diagnosis pending those fixes.

I'll also note that the shutdown process that you describe is extremely memory-intensive. Example:

  • Heavy insert workload of about 40 MB/s built up a replication lag of about 2.5 min, and resulted in a lookaside table of about 83 M entries with lookaside file of about 3.7 GB on the primary.
  • Let the system quiesce for a few minutes so that the secondary caught up, there were no longer any transactions pinned by timestamps, and all the data had been checkpointed by the periodic checkpoints.
  • Shutdown of mongod then took many minutes while memory usage grew from ~ 8 GB (the configured cache size) to use the full 64 GB of memory on the machine. (Actually I didn't wait for shutdown to complete because the memory usage had brought the machine to a standstill.)
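
A rough sketch of how the shutdown-time memory and lookaside-file growth can be watched from outside the process (assumes Linux /proc and a dbpath of /data/db, both illustrative and to be adjusted for the actual deployment; the WiredTigerLAS.wt file name is as described above):

# Sketch: sample mongod resident memory and WiredTigerLAS.wt size once a second
# while shutdown is in progress. Assumes Linux /proc and a dbpath of /data/db
# (both illustrative; adjust for the actual deployment).
import os
import sys
import time

DBPATH = "/data/db"
LAS_FILE = os.path.join(DBPATH, "WiredTigerLAS.wt")

def rss_kb(pid):
    # VmRSS line from /proc/<pid>/status, reported in kB.
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return 0

def main(pid):
    while os.path.exists(f"/proc/{pid}"):
        las_mb = os.path.getsize(LAS_FILE) / 2**20 if os.path.exists(LAS_FILE) else 0
        print(f"rss={rss_kb(pid) / 1024:.0f} MB  las_file={las_mb:.0f} MB")
        time.sleep(1)
    print("mongod exited")

if __name__ == "__main__":
    main(int(sys.argv[1]))  # pass the mongod pid as the only argument
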
Comment by Alexander Gorrod [ 21/Dec/17 ]

bruce.lucas The reason it takes a long time is that while the oldest ID is pinned we need to write data into the lookaside file instead of the data table. Then, when we do the shutdown and the ID is no longer pinned, that content is read from the lookaside file and written into the data file. There isn't much we can do to avoid that read. It would be possible to skip that step and instead recover via the oplog from further back in time, but I expect that would make startup time much longer, so I don't think it'd be a net win.

On the other hand the changes in WT-3767 and WT-3768 do have relevant lookaside table optimizations, so I expect the behavior to improve once those changes are done and available in MongoDB.

Comment by Bruce Lucas (Inactive) [ 20/Dec/17 ]

Here's an example:

  • Start an index build of a collection with 1.5 M documents at B on the primary
  • It completes on the primary ~100 seconds later at C, and the index build on the secondary starts
  • This halts replication from C-D, which creates cache pressure due to having to keep uncommitted updates, and generates 3.7 M lookaside table entries.
  • The index build on the secondary completes ~40 seconds later at D, and we begin to catch up.

A subsequent shutdown of the primary (not shown) took ~45 seconds. In a larger example with a larger lag shutdown took many minutes.

Stack traces like the following indicate lookaside (LAS) activity during shutdown:

#0  0x0000561e818751c4 in bool snappy::InternalUncompress<snappy::SnappyArrayWriter>(snappy::Source*, snappy::SnappyArrayWriter*) ()
#1  0x0000561e8187587d in snappy::RawUncompress(char const*, unsigned long, char*) ()
#2  0x0000561e81873d3e in snappy_uncompress ()
#3  0x0000561e8023de55 in snappy_decompression ()
#4  0x0000561e802489ea in __wt_bt_read ()
#5  0x0000561e8024fc02 in __wt_page_in_func ()
#6  0x0000561e802792d4 in __wt_row_search ()
#7  0x0000561e802d3342 in __wt_btcur_search_near ()
#8  0x0000561e8028bcbd in __curfile_search_near ()
#9  0x0000561e8027c48e in __wt_las_cursor_position ()
#10 0x0000561e80250e72 in __wt_page_in_func ()
#11 0x0000561e8026bec0 in __wt_page_swap_func ()
#12 0x0000561e8026e02a in __tree_walk_internal ()
#13 0x0000561e802a05d7 in __wt_evict_file ()
#14 0x0000561e80262518 in __wt_cache_op ()
#15 0x0000561e802334f1 in __checkpoint_tree ()
#16 0x0000561e80236b7f in __wt_checkpoint_close ()
#17 0x0000561e801d7a63 in __wt_conn_dhandle_close ()
#18 0x0000561e801d84a7 in __wt_conn_dhandle_discard_single ()
#19 0x0000561e801d85e4 in __wt_conn_dhandle_discard ()
#20 0x0000561e801d9548 in __wt_connection_close ()
#21 0x0000561e801d1a02 in __conn_close ()
#22 0x0000561e801a7208 in mongo::WiredTigerKVEngine::cleanShutdown() ()
#23 0x0000561e8037d3ee in mongo::ServiceContextMongoD::shutdownGlobalStorageEngineCleanly() ()
#24 0x0000561e8016ed2e in mongo::(anonymous namespace)::shutdownTask() ()

Comment by Bruce Lucas (Inactive) [ 20/Dec/17 ]

One consequence of this that I've observed: if there has been substantial lookaside activity, for example because replication lag during a write-heavy workload has created cache pressure due to the requirement to keep uncommitted updates in cache on the primary, then normal shutdown of the node can take a very long time, which I believe (based on collected stack traces) is related to lookaside table cleanup.

Comment by Alexander Gorrod [ 15/Dec/17 ]

I've reproduced this behavior and have a few observations, but no clear conclusion yet:

  • There is no obvious correlation between checkpoints or other periodic activity and lookaside writes.
  • There does appear to be a correlation between IO saturation and lookaside writes.
  • There is a correlation between page splits and lookaside writes, but it isn't clear to me whether it's cause or effect.
  • Lookaside records are being removed, but it appears as though not all of them are being cleaned up.
  • There is lag in the readConcernMajority corresponding to the times when lookaside is used.
  • The cache isn't at the dirty limit when lookaside writes start happening.

My suspicion is that this workload overwhelms IO. I've been running with the journal and data files on the same drive. I think a journal flush is probably taking a long time due to the disk saturation, which means the oldest timestamp isn't being updated for several seconds, which in turn leads to lookaside usage as we try to evict pages with history that's required to satisfy read concern majority.
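
A sketch of one way to check that suspicion directly against serverStatus, sampling per-second deltas of journal sync time and lookaside inserts (assumes pymongo and a local mongod; the counters are matched by substring because exact stat names vary by version):

# Sketch: per-second deltas of journal sync time vs. lookaside inserts, to see
# whether lookaside writes line up with slow journal syncs. Assumes pymongo and
# a local mongod; counters are matched by substring because exact stat names
# vary by version ("lookaside" vs "cache overflow").
import time

from pymongo import MongoClient

def snapshot(client):
    wt = client.admin.command("serverStatus")["wiredTiger"]
    sync_usecs = sum(v for k, v in wt["log"].items()
                     if "sync" in k and "duration" in k and isinstance(v, int))
    las_inserts = sum(v for k, v in wt["cache"].items()
                      if ("lookaside" in k or "overflow" in k)
                      and "insert" in k and isinstance(v, int))
    return sync_usecs, las_inserts

if __name__ == "__main__":
    client = MongoClient()
    prev = snapshot(client)
    while True:
        time.sleep(1)
        cur = snapshot(client)
        print(f"journal sync {cur[0] - prev[0]} usecs/s, lookaside inserts {cur[1] - prev[1]}/s")
        prev = cur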

I've attempted to capture a set of statistics that reflect my above observations in the attached screenshot.

There are some things I'd like to understand:

  • Why isn't more space being reclaimed from the lookaside file? (I suspect it's because the workload is never reading the data back in, so it's not being cleaned up. The work in WT-3766 should address that issue.)
  • Why is the workload triggering lookaside writes? I don't think they are helpful here, and I suspect we could make this workload run better by adjusting either the types of pages we choose to write to the lookaside or the "lookaside score" heuristic used to decide whether to enable lookaside writes.
Comment by Alexander Gorrod [ 04/Dec/17 ]

bruce.lucas I believe you are noticing an absence of WT-3766.

The way lookaside (cache overflow) works is that if we need to reclaim space from the cache, we write that content into the lookaside file. The content in the lookaside file is then removed the next time that data is accessed or when a table is dropped. It's not surprising to me that a write-only workload will show lookaside file growth, since there are no queries causing WiredTiger to remove redundant content from the lookaside file. Implementing an alternative mechanism for reclaiming that space is the goal of WT-3766.
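
To make that lifecycle concrete, here is a toy model (purely illustrative, not actual WiredTiger code or APIs): entries evicted under cache pressure move to a lookaside structure and are only removed when the same data is read back, so a write-only workload leaves the lookaside growing.

# Toy model of the lookaside lifecycle described above (illustrative only,
# not actual WiredTiger code or APIs).
class ToyCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = {}      # in-memory pages
        self.lookaside = {}  # content evicted under cache pressure

    def write(self, key, value):
        if len(self.cache) >= self.capacity:
            # Under cache pressure, evict the oldest cached entry into the lookaside.
            oldest = next(iter(self.cache))
            self.lookaside[oldest] = self.cache.pop(oldest)
        self.cache[key] = value

    def read(self, key):
        # Reading the data back is what removes it from the lookaside.
        if key in self.lookaside:
            return self.lookaside.pop(key)
        return self.cache.get(key)

if __name__ == "__main__":
    c = ToyCache(capacity=100)
    # Write-only workload: the lookaside only ever grows.
    for i in range(10_000):
        c.write(i, "x" * 10)
    print("after write-only workload:", len(c.lookaside), "lookaside entries")
    # Reading everything back is what cleans it up.
    for i in range(10_000):
        c.read(i)
    print("after reading everything back:", len(c.lookaside), "lookaside entries")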

I believe this particular workload could use the lookaside file less - but that's likely going to be covered by work in SERVER-32139.
