[SERVER-79544] Make the task activator handle errors more intelligently Created: 14/Jul/23  Updated: 29/Oct/23  Resolved: 02/Aug/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 7.1.0-rc0

Type: Task Priority: Minor - P4
Reporter: Memento Slack Bot Assignee: Jeffrey Zambory
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Backwards Compatibility: Fully Compatible
Sprint: DAG 2023-08-07
Participants:
Story Points: 1

 Description   

It seems like the configure task endpoint is pretty consistently timing out for this particular patch/task when it tries to activate the tasks generated for it.

Here's the exact endpoint being hit

This patch is large, with a pretty large number of generated tasks within it. All of the tasks under the [JSTEST AFFECTED] variants are attempting to be activated.

Is there anything that can be done in order to allow for this call to succeed? Should this endpoint be batched instead on the client side so that we try to activate only a certain number of tasks within each network call? Are there any guidelines or best practices around how many tasks we should be activating at a single time? 

------------------------------------------------------------------------------------------------------------------------------------------------------------

AC:

  • Update the task activation code in the mongo codebase to batch how many tasks get activated at a time
    • Experiment with different batch sizes
    • The patch I linked in this ticket would be a good base to begin experiments on
  • Have the task activator handle errors more intelligently
    • We can't batch much more on the client side but we can at least activate as many tasks as possible
    • We should also log a more user-friendly message when some tasks could not be activated
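
A minimal sketch of the second AC item, assuming activation happens one task at a time through some client call. The `activate_task` callable and the function name are hypothetical placeholders, not the actual activator code: the idea is only to keep going after a failure, activate as many tasks as possible, and log one readable summary at the end.

```python
import logging
from typing import Callable, Iterable, List, Tuple

LOGGER = logging.getLogger(__name__)


def activate_best_effort(
    task_ids: Iterable[str],
    activate_task: Callable[[str], None],  # hypothetical: wraps the real API call
) -> Tuple[List[str], List[str]]:
    """Try to activate every task; return (activated, failed) task ID lists."""
    activated: List[str] = []
    failed: List[str] = []
    for task_id in task_ids:
        try:
            activate_task(task_id)
            activated.append(task_id)
        except Exception as err:  # assumption: any exception means this task failed
            LOGGER.warning("Could not activate task %s: %s", task_id, err)
            failed.append(task_id)
    if failed:
        # One summary message so the user sees exactly which tasks need a retry.
        LOGGER.error(
            "Unable to activate %d of %d tasks: %s",
            len(failed),
            len(activated) + len(failed),
            ", ".join(failed),
        )
    return activated, failed
```

Returning the failed IDs lets the caller decide whether to retry them or fail the activation task outright.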


 Comments   
Comment by Githook User [ 02/Aug/23 ]

Author:

{'name': 'Jeff Zambory', 'email': 'jeff.zambory@mongodb.com', 'username': ''}

Message: SERVER-79544: Make the task activator handle errors more intelligently
Branch: minh.luu-no_compile_sys-perf
https://github.com/mongodb/mongo/commit/e01e49ad9a3ea7756ab5ed7720c67640379ee0d7

Comment by Githook User [ 01/Aug/23 ]

Author:

{'name': 'Jeff Zambory', 'email': 'jeff.zambory@mongodb.com', 'username': ''}

Message: SERVER-79544: Make the task activator handle errors more intelligently
Branch: master
https://github.com/mongodb/mongo/commit/e01e49ad9a3ea7756ab5ed7720c67640379ee0d7

Comment by Jeffrey Zambory [ 21/Jul/23 ]

Gotcha, thanks kimberly.tao@mongodb.com.

Moving this over to DAG so that we batch how many tasks get activated at a time.

Comment by Kim Tao [ 19/Jul/23 ]

1. The main limit is the 60 second server timeout on all requests. I don't think we have a hard correlation between # tasks to activate and time to run (due to configuration-specific concerns like # dependencies that the tasks have), but activating ~1000 at a time should be feasible within 60 seconds. You can experiment with the particular batch sizes to see what's reasonable, or maybe try activating individual build variants rather than individual tasks.
2. Assuming we want to activate N tasks in the patch, I wouldn't expect batching the requests to change the time taken to activate N tasks that drastically, because it shouldn't change the total number of documents that need to be updated. This is just a best guess though, so experimental verification seems like a reasonable approach here as well.
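
A rough sketch of how the batch-size experiment suggested above could be set up, assuming the client holds a flat list of task IDs. `send_batch` is a hypothetical stand-in for whatever call submits one batch; the 60-second figure is the server timeout mentioned in point 1.

```python
import time
from typing import Callable, Iterator, List, Sequence

REQUEST_TIMEOUT_SECS = 60  # server-side request timeout cited in the comment above


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each holding at most `batch_size` entries."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def time_batches(
    task_ids: Sequence[str],
    batch_size: int,
    send_batch: Callable[[List[str]], None],  # hypothetical: submits one batch
) -> List[float]:
    """Send each batch in turn and return the elapsed seconds per request."""
    timings: List[float] = []
    for batch in batched(task_ids, batch_size):
        start = time.monotonic()
        send_batch(batch)
        timings.append(time.monotonic() - start)
    return timings
```

For example, 2,500 task IDs with a batch size of 1,000 produce three requests of 1,000, 1,000, and 500 tasks. Comparing the per-request timings against `REQUEST_TIMEOUT_SECS` for a few batch sizes would show how much headroom each size leaves.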

Comment by Jeffrey Zambory [ 18/Jul/23 ]

That would likely be doable, but I would like to learn more about how this might affect things. If we begin batching calls, we might wind up with some patches in a half-activated state if some later calls fail, which isn't the worst thing in the world but is definitely a weird state to be in.

Some questions:

  1. What would be a good batching size to use here? How many tasks can we expect to activate in a "timely" manner at a time?
  2. How would batching like this impact the overall runtime of the activation task? Are multiple rounds of batch calls going to cause the task to take longer on average?

Comment by Kim Tao [ 17/Jul/23 ]

Is it a sufficient workaround to submit configure requests in batches rather than in one big request (as described in the thread)? This likely is not that easily fixable since it's a performance issue with patch configuration.

Generated at Thu Feb 08 06:41:15 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.