[SERVER-57100] Investigate critical section timeout error for ReshardCollection.yml genny workload Created: 20/May/21  Updated: 02/Jun/21  Resolved: 02/Jun/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Question Priority: Critical - P2
Reporter: Lamont Nelson Assignee: Kshitij Gupta
Resolution: Done Votes: 0
Labels: PM-234-M3, PM-234-T-autocommits, post-rc0
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-57276 Capture max/min percent complete in F... Closed
related to SERVER-57303 Create transaction history iterator s... Closed
Sprint: Sharding 2021-06-14, Sharding 2021-05-31
Participants:
Story Points: 2

 Description   

While attempting to reshard a collection in the https://github.com/mongodb/genny/blob/master/src/workloads/sharding/ReshardCollection.yml workload, we hit the following error in multiple runs:

[genny::What*] = Resharding critical section timed out.: generic server error

The log messages in the genny output are sparse. The purpose of this ticket is to investigate why the failures occurred, add additional logging if required, and identify appropriate t2 or other metrics to assist in diagnosing future issues.



 Comments   
Comment by Kshitij Gupta [ 02/Jun/21 ]

SERVER-57303 was created to track the performance fix and SERVER-57276 was created to track the remaining operation time metric that the commit monitor sees.

Comment by Lamont Nelson [ 27/May/21 ]

We did observe the reported estimated time decaying as expected. We still need to verify that these values are reasonable. kshitij.gupta, can you post those log lines and the graphs for resharding metrics from your experiments in this ticket? It seems clear that the fetcher's query to obtain more work is having issues.

Comment by Lamont Nelson [ 21/May/21 ]

We looked at the failure data and determined that the incoming write workload may have been faster than the rate at which the recipient applies oplog entries. It would be useful to know the estimated time to completion that the recipients reported over time.

There was also a checkpoint near the point where we started blocking writes. I'm not sure whether this is relevant, but I'm mentioning it here anyway.

Metric to add to server status: t2 metric for recipients to show the estimated remaining operationTime
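
The reasoning above (writes arriving faster than the recipient applies them) can be illustrated with a rough back-of-the-envelope model. This is only a sketch, not the server's actual estimator: the function names, the notion of a single "backlog" count, and the constant rates are all assumptions made for illustration.

```python
from typing import Optional


def catch_up_seconds(backlog: float, apply_rate: float,
                     write_rate: float = 0.0) -> Optional[float]:
    """Estimate seconds until the recipient drains its oplog backlog.

    All names here are hypothetical. Rates are entries per second and are
    assumed constant. Returns None when the backlog can never drain.
    """
    if backlog <= 0:
        return 0.0
    net_rate = apply_rate - write_rate  # entries actually drained per second
    if net_rate <= 0:
        return None  # writes arrive at least as fast as they are applied
    return backlog / net_rate


def fits_in_critical_section(backlog: float, apply_rate: float,
                             timeout_seconds: float) -> bool:
    """Check whether the backlog can drain before a (hypothetical) timeout.

    Writes are assumed to be blocked here, so write_rate is 0.
    """
    remaining = catch_up_seconds(backlog, apply_rate)
    return remaining is not None and remaining <= timeout_seconds
```

Under this model, if the write rate stays at or above the apply rate, the estimate never converges before writes are blocked, and a large backlog at that point can exceed whatever timeout applies to the critical section.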
