Type: Task
Resolution: Unresolved
Priority: Unknown
Component/s: Astrolabe
During Astrolabe testing, we are occasionally running into timeouts. For example:
[2022/04/18 05:48:51.009] INFO:astrolabe.runner:Running test 'retryReads_primaryTakeover' on cluster 'ea8fd37faa'
[2022/04/18 05:48:51.127] INFO:astrolabe.utils:Starting workload executor subprocess
[2022/04/18 05:48:51.127] INFO:astrolabe.utils:Started workload executor [PID: 2236]
...
[2022/04/18 05:49:41.198] INFO:astrolabe.runner:Waiting to wait for cluster ea8fd37faa to become idle
[2022/04/18 05:49:41.198] INFO:astrolabe.runner:Waiting for cluster ea8fd37faa to become idle
[2022/04/18 05:49:41.260] INFO:astrolabe.runner:Cluster ea8fd37faa: current state: updating; wanted state: idle; waited for 0.1 sec
[2022/04/18 05:50:41.810] INFO:astrolabe.runner:Cluster ea8fd37faa: current state: updating; wanted state: idle; waited for 60.6 sec
[2022/04/18 05:51:42.434] INFO:astrolabe.runner:Cluster ea8fd37faa: current state: updating; wanted state: idle; waited for 121.2 sec
...
[2022/04/18 06:49:21.790] INFO:astrolabe.runner:Cluster ea8fd37faa: current state: updating; wanted state: idle; waited for 3580.5 sec
[2022/04/18 06:49:41.718] INFO:astrolabe.utils:Stopped workload executor [PID: 2236]
...
[2022/04/18 06:49:41.720] astrolabe.exceptions.PollingTimeoutError: Polling timed out after 3600.0 seconds
[2022/04/18 06:49:41.724] dotnet cancel workload> Canceling the workload task...Done.
Because this reconfiguration involved a rolling replacement of all cluster nodes, one hour was not quite enough for Atlas to complete the operation. When the run above hit the one-hour timeout, the reconfiguration was very close to finishing: it was only waiting for the third node to come back online.
We should increase the timeout for Atlas reconfigurations so that rolling replacements of all cluster nodes have enough time to complete.
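For illustration, here is a minimal Python sketch of the kind of polling loop involved, with the timeout made configurable and raised beyond one hour. The names wait_until_idle, get_cluster_state, and the two-hour default are hypothetical and not astrolabe's actual API; the only details taken from the ticket are the idle/updating states and the one-hour (3600 second) timeout being too short.

import time


class PollingTimeoutError(Exception):
    """Raised when the cluster does not reach the wanted state in time."""


def wait_until_idle(get_cluster_state, cluster_name,
                    timeout=7200.0, interval=60.0):
    """Poll the cluster state until it becomes 'idle' or `timeout` elapses.

    `get_cluster_state` is a hypothetical callable returning the current
    state string (e.g. 'updating' or 'idle') for `cluster_name`. The
    default timeout here is two hours instead of one, so a rolling
    replacement of all cluster nodes has room to finish.
    """
    start = time.monotonic()
    while True:
        state = get_cluster_state(cluster_name)
        waited = time.monotonic() - start
        if state == "idle":
            return waited
        if waited >= timeout:
            raise PollingTimeoutError(
                f"Polling timed out after {timeout} seconds "
                f"(last state: {state})")
        print(f"Cluster {cluster_name}: current state: {state}; "
              f"wanted state: idle; waited for {waited:.1f} sec")
        time.sleep(interval)

Whatever the exact mechanism in astrolabe, the key change is the same: the maximum wait for the cluster to return to the idle state should comfortably exceed the time a full rolling node replacement takes, rather than sitting just below it.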