[SERVER-57100] Investigate critical section timeout error for ReshardCollection.yml genny workload Created: 20/May/21 Updated: 02/Jun/21 Resolved: 02/Jun/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Question | Priority: | Critical - P2 |
| Reporter: | Lamont Nelson | Assignee: | Kshitij Gupta |
| Resolution: | Done | Votes: | 0 |
| Labels: | PM-234-M3, PM-234-T-autocommits, post-rc0 | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Sprint: | Sharding 2021-06-14, Sharding 2021-05-31 | ||||||||||||
| Participants: | |||||||||||||
| Story Points: | 2 | ||||||||||||
| Description |
|
While attempting to reshard a collection in the https://github.com/mongodb/genny/blob/master/src/workloads/sharding/ReshardCollection.yml workload, we hit the following error in multiple runs:
[genny::What*] = Resharding critical section timed out.: generic server error
The log messages in the genny output are sparse. The purpose of this ticket is to investigate why the failures occurred, add logging if required, and identify appropriate t2 or other metrics to help diagnose future issues. |
| Comments |
| Comment by Kshitij Gupta [ 02/Jun/21 ] |
|
|
| Comment by Lamont Nelson [ 27/May/21 ] |
|
We did observe the reported estimated time decaying as expected, but we still need to verify that the values are reasonable. kshitij.gupta, can you post those log lines and the graphs for the resharding metrics from your experiments in this ticket? It seems clear that the fetcher's query to obtain more work is having issues. |
| Comment by Lamont Nelson [ 21/May/21 ] |
|
We looked at the failure data and determined that the incoming write workload may have been faster than the rate at which the recipient applied oplog entries. It would be useful to know what estimated time to completion the recipients reported over time. There was also a checkpoint near the point where we started blocking writes. I'm not sure if this is relevant, but I'm mentioning it here anyway.
Metric to add to server status: a t2 metric for recipients showing the estimated remaining operationTime.
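
To illustrate the hypothesis above, here is a minimal sketch of a toy rate model, not the server's actual estimator. It assumes the recipient holds a backlog of unapplied oplog entries, applies them at a constant rate, and receives new writes at a constant rate. The function names, parameters, and numbers are hypothetical and chosen only for illustration.

```python
import math


def estimated_drain_seconds(backlog_ops, apply_rate, write_rate):
    """Toy estimate of the seconds needed to drain the oplog backlog.

    backlog_ops: unapplied oplog entries on the recipient (hypothetical input)
    apply_rate:  entries applied per second by the recipient
    write_rate:  new entries generated per second by incoming writes
    """
    if backlog_ops <= 0:
        return 0.0
    net_rate = apply_rate - write_rate
    if net_rate <= 0:
        # Writes arrive at least as fast as they are applied, so the
        # backlog never shrinks and the estimate never converges.
        return math.inf
    return backlog_ops / net_rate


def fits_in_critical_section(backlog_ops, apply_rate, timeout_secs):
    """Check whether the backlog can drain before a (hypothetical) timeout.

    Once writes are blocked, the incoming write rate is modeled as zero.
    """
    return estimated_drain_seconds(backlog_ops, apply_rate, 0.0) <= timeout_secs


if __name__ == "__main__":
    print(estimated_drain_seconds(1000, 500, 300))
    print(estimated_drain_seconds(1000, 300, 500))
    print(fits_in_critical_section(10000, 500, 5.0))
```

In this model, a recipient that applies entries more slowly than writes arrive never catches up, and a backlog that is still large when writes are blocked may not drain within the critical section window. That is the behavior a per-recipient estimated-remaining-time metric would help confirm or rule out.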