- Type: Task
- Resolution: Unresolved
- Priority: Major - P3
- None
- Affects Version/s: None
- Component/s: Checkpoints
- None
- Storage Engines - Server Integration
- 296.398
- SE Persistence backlog
- None
SERVER-124519 enabled parallel checkpoints for disaggregated storage. By count, this was a net improvement (20 improvements vs. 15 regressions); however, some of the regressions are severe (on the order of 500%).
While the regressions were discussed when the ticket was merged, that discussion never established why they are acceptable. There was also some speculation that they were just noise, but this has turned out not to be the case: these 500% regressions have proven to be "sticky".
The good news is that all of the big regressions came from just the find_one_and_update tasks, and only on the 11-node disagg variant. We should investigate what is happening in these tasks on that variant and come away with at least a reasonable explanation, and ideally a fix.