[SERVER-32236] Avoid stalling ftdc when system is struggling Created: 08/Dec/17  Updated: 14/Aug/18  Resolved: 17/Jul/18

Status: Closed
Project: Core Server
Component/s: Diagnostics
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Bruce Lucas (Inactive) Assignee: Mark Benvenuto
Resolution: Done Votes: 34
Labels: SWDI
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-32876 Don't stall ftdc due to WT cache full Closed
depends on SERVER-32875 Don't stall ftdc due to running out o... Closed
Related
is related to SERVER-32226 oldest_timestamp should track the las... Closed
Operating System: ALL
Participants:
Case:

 Description   

FTDC can stall for a few reasons when the system is struggling, which is exactly when its data would be most helpful. Reasons include

  • out of tickets
  • cache is full and ftdc thread gets used to do evictions - the ignore_cache_size session option (WT-2932) might help here

We should arrange for FTDC not to get stalled by these conditions.



 Comments   
Comment by Daniel Pasette (Inactive) [ 24/Jan/18 ]

I agree that a partial solution here is better than waiting for a perfect one. If necessary, can we identify and split out the ones we know about so we can make incremental progress here?

Comment by Bruce Lucas (Inactive) [ 29/Dec/17 ]

Agreed, we may have to identify more causes of blocking than the two called out in the initial comment. However, my experience is that those are the two most common, so if we can address those we will have made a large improvement.

Comment by Mark Benvenuto [ 28/Dec/17 ]

In general this is a difficult problem. We can make incremental improvements to serverStatus, write ticket handling and a few other things to reduce the chances of FTDC blocking, but it will be difficult to eliminate blocking completely.

If we make a more aggressive design change, we could treat the various input sources into FTDC as independent and query them asynchronously in multiple threads. Basically, do a scatter-gather query across all the sources, but then each source will be from a slightly different time, which may make the results less useful. Also, this would not eliminate the blocking. It would just mean that when a server is stuck, you get some data, but you will still be missing serverStatus, for instance, since it is the one that tends to get blocked.
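
As a rough illustration of the scatter-gather idea above (not the server's actual implementation, which is C++), here is a minimal Python sketch. The names collect_sample, server_status, system_metrics and the 0.2 second deadline are hypothetical and chosen only for this example. Each source is queried on its own thread, and a source that has not answered by the deadline is recorded as a gap instead of stalling the whole sample.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor, wait


def collect_sample(sources, timeout_s, executor):
    """Query every source in parallel; keep whatever finished within timeout_s.

    `sources` maps a name to a zero-argument callable (hypothetical collectors).
    Sources that are still blocked at the deadline are reported as None.
    """
    futures = {name: executor.submit(fn) for name, fn in sources.items()}
    done, _not_done = wait(futures.values(), timeout=timeout_s)
    sample = {}
    for name, fut in futures.items():
        if fut in done and fut.exception() is None:
            # Each source returns its own timestamp, since sources finish
            # at slightly different times.
            sample[name] = fut.result()
        else:
            sample[name] = None  # blocked or failed: leave a gap
    return sample


# Stands in for whatever a stuck source is waiting on (e.g. a ticket).
release = threading.Event()


def server_status():
    # Simulated blocked collector: waits until released (5 s at most).
    release.wait(timeout=5)
    return (time.time(), {"connections": 0})


def system_metrics():
    # Simulated collector that never blocks.
    return (time.time(), {"cpu": 0.5})


executor = ThreadPoolExecutor(max_workers=4)
sample = collect_sample(
    {"serverStatus": server_status, "systemMetrics": system_metrics},
    timeout_s=0.2,
    executor=executor,
)
release.set()                    # unblock the simulated stuck source
executor.shutdown(wait=True)     # let the worker threads finish cleanly
```

In this sketch the blocked source still ties up a worker thread until it returns, which matches the point above: the approach yields partial data during a stall but does not remove the blocking itself.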

Comment by Bruce Lucas (Inactive) [ 13/Dec/17 ]

I'm thinking that's due to accumulating blocked read operations during the index build and then running out of tickets as a result, in which case it's covered by the first item, but if you have information to the contrary it would be useful to look into that.

Comment by James Kovacs [ 13/Dec/17 ]

A long-running foreground index build on a secondary can also stall FTDC data collection. The system need not even be struggling for the stall to occur.

Generated at Thu Feb 08 04:29:38 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.