Description
We have seen several outages in which, after some event, jobs appear to spin and never time out. Recovery requires manual intervention: marking the in-progress jobs complete and then restarting the app servers. Instead, every job should have a timeout as a backstop. The timeout should be long enough that it is never reached unless something has gone wrong.
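
One possible shape for the backstop, as a rough sketch only: a periodic sweep that expires any job that has been in progress longer than the limit. This ticket does not specify the job system, so everything named here is hypothetical: the `Job` record, its status values, `expire_stuck_jobs`, and the six-hour `BACKSTOP_TIMEOUT`.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Optional

# Hypothetical limit; the real value should sit well above the slowest healthy job.
BACKSTOP_TIMEOUT = timedelta(hours=6)


@dataclass
class Job:
    # Hypothetical job record; field names and status values are illustrative only.
    job_id: str
    status: str  # "in_progress", "complete", or "timed_out"
    started_at: datetime


def expire_stuck_jobs(
    jobs: Iterable[Job],
    now: Optional[datetime] = None,
    timeout: timedelta = BACKSTOP_TIMEOUT,
) -> List[Job]:
    """Mark in-progress jobs older than `timeout` as timed out and return them."""
    if now is None:
        now = datetime.now(timezone.utc)
    expired = []
    for job in jobs:
        if job.status == "in_progress" and now - job.started_at > timeout:
            job.status = "timed_out"
            expired.append(job)
    return expired
```

A sweep like this only updates the job record. It does not stop a worker that is still spinning, so the real implementation may also need to interrupt or kill the worker.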