[SERVER-28589] Reduce how often threads wait before flushing sizeStorer with WiredTiger Created: 03/Apr/17  Updated: 29/Jan/18  Resolved: 06/Nov/17

Status: Closed
Project: Core Server
Component/s: Storage, WiredTiger
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Alexander Gorrod Assignee: Xiangyu Yao (Inactive)
Resolution: Won't Fix Votes: 0
Labels: bkp
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File SERVER-28589-test-by-yxy.js    
Issue Links:
Backports
Related
Backwards Compatibility: Fully Compatible
Backport Requested:
v3.4
Sprint: Storage 2017-11-13
Participants:

 Description   

With the current implementation of the size storer, we observe the following.

There is one thread in:

Thread 410 (Thread 0x7fb7892de700 (LWP 2741)):
#0  0x00007fb794c2fa82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000001a6813b in __wt_cond_wait_signal ()
#2  0x0000000001a3e79c in __wt_cache_eviction_worker ()
#3  0x0000000001a9ab16 in __session_begin_transaction ()
#4  0x00000000010a8775 in mongo::WiredTigerSizeStorer::syncCache(bool) ()
#5  0x00000000010908a6 in mongo::WiredTigerKVEngine::syncSizeInfo(bool) const ()
#6  0x0000000001091282 in mongo::WiredTigerKVEngine::haveDropsQueued() const ()
#7  0x00000000010a7299 in mongo::WiredTigerSessionCache::releaseSession(mongo::WiredTigerSession*) ()
#8  0x00000000010a779d in mongo::WiredTigerSessionCache::waitUntilDurable(bool) ()
#9  0x0000000001094bcb in mongo::WiredTigerKVEngine::WiredTigerJournalFlusher::run() ()
#10 0x00000000012b6c70 in mongo::BackgroundJob::jobBody() ()

i.e., it holds the WiredTigerSizeStorer cursor lock and is blocked waiting for space in the WiredTiger cache.

There are 34 other threads waiting for the WiredTigerSizeStorer cursor lock:
(NB: This call stack is from a 3.2 mongod)

Thread 322 (Thread 0x7fb768a65700 (LWP 6178)):
#0  0x00007fb794c31f4d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fb794c2dd02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007fb794c2dc08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00000000010a8431 in mongo::WiredTigerSizeStorer::syncCache(bool) ()
#4  0x00000000010908a6 in mongo::WiredTigerKVEngine::syncSizeInfo(bool) const ()
#5  0x0000000001091282 in mongo::WiredTigerKVEngine::haveDropsQueued() const ()
#6  0x00000000010a7299 in mongo::WiredTigerSessionCache::releaseSession(mongo::WiredTigerSession*) ()
#7  0x00000000010a4d57 in mongo::WiredTigerRecoveryUnit::~WiredTigerRecoveryUnit() ()
#8  0x00000000010a4de1 in mongo::WiredTigerRecoveryUnit::~WiredTigerRecoveryUnit() ()
#9  0x0000000000d20640 in mongo::OperationContextImpl::~OperationContextImpl() ()
#10 0x0000000000d206d1 in mongo::OperationContextImpl::~OperationContextImpl() ()
#11 0x00000000009b76e7 in mongo::MyMessageHandler::process(mongo::Message&, mongo::AbstractMessagingPort*) ()

So we are essentially serializing threads through the size storer sync. The thread holding the lock is waiting for space in the cache, which may block for an extended time.

Ideally, in this case, the 34 threads waiting for the lock would skip flushing the size storer; they don't need it to be flushed for correctness.
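
To make the proposed behaviour concrete, here is a self-contained illustration in plain C++ (standard library only; the function names, thread count, and the simulated one-second "eviction stall" are assumptions for the sketch - this is not the actual WiredTigerSizeStorer code). With a blocking lock the flushing threads serialize behind whichever thread is stuck; with a try-lock, waiters skip the flush and move on. The comments below prototype the same idea inside syncCache() itself.

#include <chrono>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

std::mutex cursorMutex;  // stand-in for the size storer's cursor lock

// Current pattern: every caller queues on the lock while the holder is
// (potentially) stuck waiting for cache eviction, simulated here by a sleep.
void syncCacheBlocking() {
    std::lock_guard<std::mutex> lk(cursorMutex);
    std::this_thread::sleep_for(std::chrono::seconds(1));  // "waiting for cache space"
}

// Proposed pattern: if someone else already holds the lock, a flush is in
// progress (or about to be), so this thread can safely skip its own flush.
void syncCacheSkipIfBusy() {
    std::unique_lock<std::mutex> lk(cursorMutex, std::try_to_lock);
    if (!lk.owns_lock())
        return;  // another thread is flushing; correctness doesn't require us to
    std::this_thread::sleep_for(std::chrono::seconds(1));
}

int main() {
    auto run = [](void (*fn)(), const char* label) {
        auto start = std::chrono::steady_clock::now();
        std::vector<std::thread> threads;
        for (int i = 0; i < 8; ++i)
            threads.emplace_back(fn);
        for (auto& t : threads)
            t.join();
        auto secs = std::chrono::duration_cast<std::chrono::seconds>(
                        std::chrono::steady_clock::now() - start).count();
        std::cout << label << ": ~" << secs << "s\n";
    };
    run(syncCacheBlocking, "blocking lock");    // ~8s: threads serialize on the flush
    run(syncCacheSkipIfBusy, "try-lock skip");  // ~1s: waiters skip the flush
    return 0;
}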



 Comments   
Comment by Alexander Gorrod [ 26/Oct/17 ]

Thanks for the detailed description, xiangyu.yao. I agree that the way the elapsed time tracker works makes it difficult to see a path where all threads could end up waiting.

A possibility is that the timer is being checked a lot in some workloads. You say:

have to do 100,000 operations. Then that thread would become blocked

I believe what actually needs to happen is that intervalHasElapsed needs to be called 100,000 times. Is it possible that a path is calling that a surprising number of times? If you can rule that out via code inspection, I'd be happy for this ticket to be closed. It has been a long time since 3.2, and queued drops have seen a number of changes since this ticket was opened.

Comment by Xiangyu Yao (Inactive) [ 26/Oct/17 ]

I have some updates on this ticket. Here is what I discovered today:
I wrote a JavaScript test to try to reproduce this issue; it is attached to this comment.

Experiment #1:
I set the WiredTiger cache size to 10MB and wrote 40,000 documents whose total size was ~40MB. Once that completed, I used 40 threads to continuously (and randomly) update the documents; each thread updates its own set of 1,000 documents, so there is no lock contention at the document level.

I changed syncCache()'s lock call on cursorLock to be a try-lock instead. If the try-lock fails, it means another thread is already syncing the cache, so we just return. The code looks like this (I also put a log line there for debugging):

stdx::unique_lock<stdx::mutex> cursorLock(_cursorMutex, stdx::try_to_lock); // try lock
if (!cursorLock) {
    // Someone else is syncing the cache, so we just skip this.
    log() << "Someone is doing the job";
    return;
}

I started the test with the newly compiled mongod and used mongostat to monitor the server. After a while, the 'cache used' stat in mongostat stayed between 98% and 100%, and "application thread time waiting for cache (usecs)" was ~90 seconds, meaning we had the desired full-cache situation.

Result: checking the mongod log, I could not find any occurrence of "Someone is doing the job", meaning that this issue didn't happen naturally (on my machine) in the full-cache situation.

I investigated further...

The backtrace in the ticket's description indicates that syncSizeInfo() was called from WiredTigerKVEngine::haveDropsQueued(). We call syncSizeInfo() either after it hasn't been called for 60 seconds or after 100,000 "pings", whichever comes first (the numbers are hardcoded here).
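
For reference, the throttling described above works roughly as follows. This is a simplified, single-threaded sketch under the stated 60-second / 100,000-hit thresholds; the class name and members are illustrative, and the server's real ElapsedTracker is shared between threads and implemented differently.

#include <chrono>
#include <cstdint>

// Report that the interval has elapsed after hitsBetweenMarks calls or after
// msBetweenMarks of wall-clock time, whichever comes first, then reset.
class SimpleElapsedTracker {
public:
    SimpleElapsedTracker(int32_t hitsBetweenMarks, std::chrono::milliseconds msBetweenMarks)
        : _hitsBetweenMarks(hitsBetweenMarks),
          _msBetweenMarks(msBetweenMarks),
          _lastMark(std::chrono::steady_clock::now()) {}

    // Called on every "ping" (in the scenario above, each pass through
    // haveDropsQueued()). Returns true when it is time to sync the size info.
    bool intervalHasElapsed() {
        const auto now = std::chrono::steady_clock::now();
        if (++_hits >= _hitsBetweenMarks || now - _lastMark >= _msBetweenMarks) {
            _hits = 0;
            _lastMark = now;
            return true;
        }
        return false;
    }

private:
    const int32_t _hitsBetweenMarks;
    const std::chrono::milliseconds _msBetweenMarks;
    int32_t _hits = 0;
    std::chrono::steady_clock::time_point _lastMark;
};

// With the hardcoded values mentioned above, the call site is effectively:
//   SimpleElapsedTracker tracker(100000, std::chrono::minutes(1));
//   if (tracker.intervalHasElapsed())
//       syncSizeInfo(false);

Note that in such a sketch, setting the hit threshold to 0 (as in Experiment #2 below) makes intervalHasElapsed() return true on every ping, which is what forces every thread through syncSizeInfo().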

So I set up Experiment #2:
I changed the ElapsedTracker's _hitsBetweenMarks from 100,000 to 0, so that essentially every time a thread goes into haveDropsQueued() it also calls syncSizeInfo().

Result: "Someone is doing the job" got logged a lot. (13640 occurrences in 5-minute period). This means that a lot of threads were skipping
the unnecessary cache syncing that they would normally perform. This is not surprising given that we forced all of threads to call syncSizeInfo(), but it was still a useful sanity check.

This left me confused as to how the situation in the ticket arose. If _hitsBetweenMarks is 100,000, what would have to happen for 34 threads to be blocked in syncSizeInfo() at the same time? I assume that the JournalFlusher is holding cursorLock.

For the first thread to enter syncSizeInfo(), all of the (presumably 34) threads would, in total, have to do 100,000 operations. Then that thread would become blocked waiting for the cursorLock. The (presumably 33) remaining threads would then have to do 100,000 more operations for another thread to get blocked. This would repeat until there was only one unblocked thread, which would have to do 100,000 operations by itself in order to call syncSizeInfo() and also get blocked waiting for the cursorLock. In other words, on the order of 34 x 100,000 = 3.4 million hits would be needed while the JournalFlusher holds the lock, with each successive round accumulating more slowly as fewer unblocked threads remain.

(It's also possible that there were originally many more than 34 threads, and that the stack trace was taken only after the first 34 of them became blocked, but the situation would be similar).

Either way, it seems like this scenario would take far too long to actually happen. Maybe on a much faster machine?

I also did some other experiments and tried to use a benchmark to show the performance improvement, but stats such as throughput are not suitable here because the full cache bottlenecks them. Still, since Experiment #2 conceptually shows that some threads can end up waiting on this lock, I think the change would be worthwhile.
Test script:
SERVER-28589-test-by-yxy.js

Comment by Alexander Gorrod [ 23/Oct/17 ]

The key to reproducing this will be putting enough pressure on the cache that threads are waiting for space to become available. The most effective way to do that is to induce a disk bottleneck, and a write load is the best way to do that, so I'd recommend:

  • Create a data set that is much larger than cache (4x?)
  • Run a workload that does random updates across the entire data range with many (20?) threads in parallel.

Once you have set up the workload, look at the server statistic "application thread time waiting for cache" - it should be high if your workload is achieving the desired bottleneck.

Comment by Eric Milkie [ 03/Apr/17 ]

Interestingly, the threads are blocked in the destruction of the OperationContext that was created to handle the request. At this point all message processing has finished and we are about to send the response back over the socket. The only reason we destruct the OperationContext before sending back the data is that we want to update the CurrentOp data so that the operation no longer appears once the client has the response. I wonder if we could rearchitect this to defer more of the heavy destruction until after the message is returned.
