[SERVER-43870] High insert/delete workload with vectored insert results in unexpected flow control behavior/replication lag measurement Created: 07/Oct/19 Updated: 29/Oct/19 Resolved: 29/Oct/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 4.2.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Louis Williams | Assignee: | Maria van Keulen |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Operating System: | ALL |
| Steps To Reproduce: |
|
| Sprint: | Execution Team 2019-11-04 |
| Participants: |
| Description |
|
I wrote a workload that does heavy inserts and deletes, and I'm observing many WT cache stuck errors ("Cache stuck for too long, giving up") and replication lag exceeding 47s per node. I raised the vectored insert batch limit from 256K to 1MB; however, I wouldn't expect this to significantly alter the behavior of flow control, as I understand it. The workload repetitively performs a large number of inserts followed by deletes. The goal is to have the inserts batch together and get inserted as a "vector" through the groupAndApplyInserts function during secondary batch application. This is the patch where I first encountered the issue. I discussed this with maria.vankeulen, and FTDC reports that the lastCommitted and wall time lags are inconsistent with the member lag: Flow Control, as a direct consumer of lastCommitted lag, will only throttle if this lag is greater than 5 seconds. Additionally, below are the FTDC stats obtained for Flow Control: Even when Flow Control is not actively throttling writes, we expect there to be FTDC data on the number of locks each operation takes, since we sample this data regardless of whether Flow Control is enabled. However, this data is missing. |
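For illustration, here is a minimal sketch of the kind of insert/delete workload described above; it is not the actual repro.js, and the database/collection names and batch sizes are assumptions. Large unordered bulk inserts are used so that secondary oplog application can group the inserts into vectored batches.

```javascript
// Hypothetical sketch of the heavy insert/delete workload (not the actual repro.js).
// Assumes a connection to a replica set; "test.coll", docsPerBatch, and iterations are illustrative.
const coll = db.getSiblingDB("test").getCollection("coll");
const docsPerBatch = 1000;   // large bulk writes so secondary batch application can group inserts
const iterations = 100;

for (let i = 0; i < iterations; i++) {
    // Insert a burst of documents in a single bulk write so they replicate as a vectored insert.
    const docs = [];
    for (let j = 0; j < docsPerBatch; j++) {
        docs.push({iter: i, seq: j, payload: "x".repeat(1024)});
    }
    coll.insertMany(docs, {ordered: false});

    // Immediately delete the batch, keeping a steady stream of insert/delete pressure.
    coll.deleteMany({iter: i});
}
```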
| Comments |
| Comment by Maria van Keulen [ 29/Oct/19 ] |
|
bruce.lucas and I had an offline conversation discussing some ideas to capture sufficient data for locksPerOp in light of the integer restriction in FTDC. I've filed a follow-up ticket for that work. Closing this ticket as Done, since we've investigated the issues of interest. |
| Comment by Bruce Lucas (Inactive) [ 29/Oct/19 ] |
|
Everything is stored in FTDC as a 64-bit integer, so I assume the value is getting rounded down when it is stored. If fractional values are important, then you could store a scaled value or store the reciprocal (with a suitable name change), depending on the expected interesting range of values. |
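As a rough illustration of the scaling idea (not the server's implementation), a fractional metric such as locksPerOp that is below 1 truncates to 0 when stored as an integer, but scaling it first preserves the interesting precision. The factor of 1000 and the derived metric name are assumptions.

```javascript
// Hypothetical illustration: FTDC stores integers, so locksPerOp < 1 would truncate to 0.
// Scaling before storage keeps the useful precision; the factor 1000 is an assumed choice.
const locksPerOp = 0.07;                              // example fractional value from serverStatus
const locksPerKiloOp = Math.round(locksPerOp * 1000); // store this integer in FTDC instead
print(`locksPerOp=${locksPerOp} -> locksPerKiloOp=${locksPerKiloOp}`);
```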
| Comment by Maria van Keulen [ 29/Oct/19 ] |
|
Given that the lastCommitted lag discrepancy is due to our testing setup rather than a bug in the lag calculation, the remaining interesting behavior to investigate is why locksPerOp is reported as zero when we know it should be calculated regardless of lastCommitted lag. I ran the repro.js script locally and added extra logging to see the locksPerOp values reported by serverStatus. It looks like locksPerOp is nonzero (albeit < 0.1) on the primary for much of the execution of the script. bruce.lucas, is it expected for very small values to be reported as zero in t2? |
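A small sketch of the kind of extra logging described above: periodically sample the flowControl section of serverStatus and print locksPerOp alongside a couple of related fields. The field names follow the serverStatus "flowControl" document; the polling interval and iteration count are arbitrary.

```javascript
// Poll the flowControl section of serverStatus and print locksPerOp (sketch, not the actual logging).
for (let i = 0; i < 60; i++) {
    const fc = db.serverStatus().flowControl;
    print(`isLagged=${fc.isLagged} targetRateLimit=${fc.targetRateLimit} locksPerOp=${fc.locksPerOp}`);
    sleep(1000);   // mongo shell sleep, in milliseconds
}
```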
| Comment by Maria van Keulen [ 08/Oct/19 ] |
|
louis.williams and I discussed yesterday whether the changes in the patch build that produced the above data could have affected the results. I took a look at how the repro.js script behaves locally on this commit of v4.2. This workload caused small, short-lived spikes of reported lag on the individual secondaries, but those spikes did not appear in the lastCommitted lag. It is plausible that the replica set recovered from these spikes in between lastCommitted lag measurements, in which case Flow Control would not need to throttle the writes. Below is a screenshot from a debug build, for parity with the patch build results. It seems that the changes from the patch build added some strain on the replica set. What remains to be seen is why the lastCommitted replication lag accuracy differs in these two cases. Additionally, in both cases there is no reported locksPerOp data, so whoever picks up this ticket should investigate why this occurs as well. louis.williams, could you please update the ticket description with a brief summary of the changes in the patch build? |
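For spot-checking the per-secondary lag spikes discussed above while the workload runs, a minimal sketch derived from rs.status() is shown below; it compares each secondary's optimeDate to the primary's. Note this is per-member lag, which is distinct from the lastCommitted lag that Flow Control consumes, and it assumes a primary is currently elected.

```javascript
// Sketch: report each secondary's lag behind the primary using rs.status() (assumes a primary exists).
const status = rs.status();
const primary = status.members.find(m => m.stateStr === "PRIMARY");
status.members
    .filter(m => m.stateStr === "SECONDARY")
    .forEach(m => {
        const lagSecs = (primary.optimeDate - m.optimeDate) / 1000;  // Date subtraction yields ms
        print(`${m.name} lags the primary by ${lagSecs}s`);
    });
```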