-
Type: Bug
-
Resolution: Fixed
-
Priority: Major - P3
-
Affects Version/s: Backlog
-
Component/s: Testing Infrastructure
-
None
-
Fully Compatible
-
ALL
-
STM 2020-07-27, STM 2020-09-07
-
0
-
0
The failures manifest themselves like
[2020/06/03 20:24:22.459] ERROR [2020-06-03 20:24:22,458] main - jepsen.cli Oh jeez, I'm sorry, Jepsen broke. Here's why: [2020/06/03 20:24:22.459] java.util.concurrent.ExecutionException: java.lang.RuntimeException: Mongo setup on ip-10-122-57-2:20000 timed out! [2020/06/03 20:24:22.459] at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[na:1.8.0_252] [2020/06/03 20:24:22.459] at java.util.concurrent.FutureTask.get(FutureTask.java:192) ~[na:1.8.0_252] [2020/06/03 20:24:22.459] at clojure.core$deref_future.invokeStatic(core.clj:2208) ~[clojure-1.8.0.jar:na] [2020/06/03 20:24:22.459] at clojure.core$future_call$reify__6962.deref(core.clj:6688) ~[clojure-1.8.0.jar:na] [2020/06/03 20:24:22.459] at clojure.core$deref.invokeStatic(core.clj:2228) ~[clojure-1.8.0.jar:na] [2020/06/03 20:24:22.459] at clojure.core$deref.invoke(core.clj:2214) ~[clojure-1.8.0.jar:na] [2020/06/03 20:24:22.459] at clojure.core$map$fn__4785.invoke(core.clj:2644) ~[clojure-1.8.0.jar:na] [2020/06/03 20:24:22.459] at clojure.lang.LazySeq.sval(LazySeq.java:40) ~[clojure-1.8.0.jar:na] [2020/06/03 20:24:22.459] at clojure.lang.LazySeq.seq(LazySeq.java:49) ~[clojure-1.8.0.jar:na] [2020/06/03 20:24:22.459] at clojure.lang.RT.seq(RT.java:521) ~[clojure-1.8.0.jar:na] [2020/06/03 20:24:22.459] at clojure.core$seq__4357.invokeStatic(core.clj:137) ~[clojure-1.8.0.jar:na] [2020/06/03 20:24:22.459] at clojure.core.protocols$seq_reduce.invokeStatic(protocols.clj:24) ~[clojure-1.8.0.jar:na] [2020/06/03 20:24:22.459] at clojure.core.protocols$fn__6738.invokeStatic(protocols.clj:75) ~[clojure-1.8.0.jar:na] [2020/06/03 20:24:22.459] at clojure.core.protocols$fn__6738.invoke(protocols.clj:75) ~[clojure-1.8.0.jar:na] [2020/06/03 20:24:22.459] at clojure.core.protocols$fn__6684$G__6679__6697.invoke(protocols.clj:13) ~[clojure-1.8.0.jar:na] [2020/06/03 20:24:22.459] at clojure.core$reduce.invokeStatic(core.clj:6545) ~[clojure-1.8.0.jar:na] [2020/06/03 20:24:22.459] at clojure.core$into.invokeStatic(core.clj:6610) ~[clojure-1.8.0.jar:na] [2020/06/03 20:24:22.459] at clojure.core$into.invoke(core.clj:6604) ~[clojure-1.8.0.jar:na] [2020/06/03 20:24:22.459] at jepsen.control$on_nodes.invokeStatic(control.clj:373) ~[jepsen-0.1.8.jar:na] [2020/06/03 20:24:22.459] at jepsen.control$on_nodes.invoke(control.clj:357) ~[jepsen-0.1.8.jar:na] [2020/06/03 20:24:22.459] at jepsen.control$on_nodes.invokeStatic(control.clj:362) ~[jepsen-0.1.8.jar:na] [2020/06/03 20:24:22.459] at jepsen.control$on_nodes.invoke(control.clj:357) ~[jepsen-0.1.8.jar:na] [2020/06/03 20:24:22.459] at jepsen.core$run_BANG_$fn__4284$fn__4287.invoke(core.clj:584) ~[jepsen-0.1.8.jar:na] [2020/06/03 20:24:22.459] at jepsen.core$run_BANG_$fn__4284.invoke(core.clj:572) ~[jepsen-0.1.8.jar:na] [2020/06/03 20:24:22.459] at jepsen.core$run_BANG_.invokeStatic(core.clj:553) ~[jepsen-0.1.8.jar:na] [2020/06/03 20:24:22.459] at jepsen.core$run_BANG_.invoke(core.clj:500) ~[jepsen-0.1.8.jar:na] [2020/06/03 20:24:22.459] at jepsen.cli$single_test_cmd$fn__4984.invoke(cli.clj:329) ~[jepsen-0.1.8.jar:na] [2020/06/03 20:24:22.459] at jepsen.cli$run_BANG_.invokeStatic(cli.clj:273) [jepsen-0.1.8.jar:na] [2020/06/03 20:24:22.459] at jepsen.cli$run_BANG_.invoke(cli.clj:203) [jepsen-0.1.8.jar:na] [2020/06/03 20:24:22.459] at jepsen.mongodb.runner$_main.invokeStatic(runner.clj:164) [classes/:na] [2020/06/03 20:24:22.459] at jepsen.mongodb.runner$_main.doInvoke(runner.clj:162) [classes/:na] [2020/06/03 20:24:22.459] at clojure.lang.RestFn.invoke(RestFn.java:3894) [clojure-1.8.0.jar:na] [2020/06/03 20:24:22.459] at clojure.lang.Var.invoke(Var.java:676) [clojure-1.8.0.jar:na] [2020/06/03 20:24:22.459] at user$eval5.invokeStatic(form-init3654414589850886228.clj:1) [na:na] [2020/06/03 20:24:22.459] at user$eval5.invoke(form-init3654414589850886228.clj:1) [na:na] [2020/06/03 20:24:22.459] at clojure.lang.Compiler.eval(Compiler.java:6927) [clojure-1.8.0.jar:na] [2020/06/03 20:24:22.459] at clojure.lang.Compiler.eval(Compiler.java:6917) [clojure-1.8.0.jar:na] [2020/06/03 20:24:22.459] at clojure.lang.Compiler.load(Compiler.java:7379) [clojure-1.8.0.jar:na] [2020/06/03 20:24:22.459] at clojure.lang.Compiler.loadFile(Compiler.java:7317) [clojure-1.8.0.jar:na] [2020/06/03 20:24:22.459] at clojure.main$load_script.invokeStatic(main.clj:275) [clojure-1.8.0.jar:na] [2020/06/03 20:24:22.459] at clojure.main$init_opt.invokeStatic(main.clj:277) [clojure-1.8.0.jar:na] [2020/06/03 20:24:22.459] at clojure.main$init_opt.invoke(main.clj:277) [clojure-1.8.0.jar:na] [2020/06/03 20:24:22.459] at clojure.main$initialize.invokeStatic(main.clj:308) [clojure-1.8.0.jar:na] [2020/06/03 20:24:22.459] at clojure.main$null_opt.invokeStatic(main.clj:342) [clojure-1.8.0.jar:na] [2020/06/03 20:24:22.459] at clojure.main$null_opt.invoke(main.clj:339) [clojure-1.8.0.jar:na] [2020/06/03 20:24:22.459] at clojure.main$main.invokeStatic(main.clj:421) [clojure-1.8.0.jar:na] [2020/06/03 20:24:22.459] at clojure.main$main.doInvoke(main.clj:384) [clojure-1.8.0.jar:na] [2020/06/03 20:24:22.459] at clojure.lang.RestFn.invoke(RestFn.java:421) [clojure-1.8.0.jar:na] [2020/06/03 20:24:22.459] at clojure.lang.Var.invoke(Var.java:383) [clojure-1.8.0.jar:na] [2020/06/03 20:24:22.459] at clojure.lang.AFn.applyToHelper(AFn.java:156) [clojure-1.8.0.jar:na] [2020/06/03 20:24:22.459] at clojure.lang.Var.applyTo(Var.java:700) [clojure-1.8.0.jar:na] [2020/06/03 20:24:22.459] at clojure.main.main(main.java:37) [clojure-1.8.0.jar:na] [2020/06/03 20:24:22.459] Caused by: java.lang.RuntimeException: Mongo setup on ip-10-122-57-2:20000 timed out! [2020/06/03 20:24:22.459] at jepsen.mongodb.core$db$reify__2125.setup_BANG_(core.clj:330) ~[classes/:na] [2020/06/03 20:24:22.459] at jepsen.db$cycle_BANG_.invokeStatic(db.clj:25) ~[jepsen-0.1.8.jar:na] [2020/06/03 20:24:22.459] at jepsen.db$cycle_BANG_.invoke(db.clj:20) ~[jepsen-0.1.8.jar:na] [2020/06/03 20:24:22.459] at clojure.core$partial$fn__4759.invoke(core.clj:2516) ~[clojure-1.8.0.jar:na] [2020/06/03 20:24:22.459] at jepsen.control$on_nodes$fn__2069.invoke(control.clj:372) ~[jepsen-0.1.8.jar:na] [2020/06/03 20:24:22.459] at clojure.lang.AFn.applyToHelper(AFn.java:154) [clojure-1.8.0.jar:na] [2020/06/03 20:24:22.459] at clojure.lang.AFn.applyTo(AFn.java:144) [clojure-1.8.0.jar:na] [2020/06/03 20:24:22.459] at clojure.core$apply.invokeStatic(core.clj:646) ~[clojure-1.8.0.jar:na] [2020/06/03 20:24:22.459] at clojure.core$with_bindings_STAR_.invokeStatic(core.clj:1881) ~[clojure-1.8.0.jar:na] [2020/06/03 20:24:22.459] at clojure.core$with_bindings_STAR_.doInvoke(core.clj:1881) ~[clojure-1.8.0.jar:na] [2020/06/03 20:24:22.459] at clojure.lang.RestFn.applyTo(RestFn.java:142) [clojure-1.8.0.jar:na] [2020/06/03 20:24:22.459] at clojure.core$apply.invokeStatic(core.clj:650) ~[clojure-1.8.0.jar:na] [2020/06/03 20:24:22.459] at clojure.core$bound_fn_STAR_$fn__4671.doInvoke(core.clj:1911) ~[clojure-1.8.0.jar:na] [2020/06/03 20:24:22.459] at clojure.lang.RestFn.invoke(RestFn.java:408) [clojure-1.8.0.jar:na] [2020/06/03 20:24:22.459] at jepsen.util$real_pmap$launcher__1160$fn__1161.invoke(util.clj:49) ~[jepsen-0.1.8.jar:na] [2020/06/03 20:24:22.459] at clojure.core$binding_conveyor_fn$fn__4676.invoke(core.clj:1938) ~[clojure-1.8.0.jar:na] [2020/06/03 20:24:22.459] at clojure.lang.AFn.call(AFn.java:18) [clojure-1.8.0.jar:na] [2020/06/03 20:24:22.459] at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_252] [2020/06/03 20:24:22.459] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_252] [2020/06/03 20:24:22.459] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_252] [2020/06/03 20:24:22.459] at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_252]
A repeated execution of the latest failure is green indicating that this is a transient error rather than something linked to the server. This represents a bug in the test or infrastructure rather than in the server.
The fixes I see are:
- Disable this task
- Add retry logic to this task
- Dig deeper into the jepsen test itself to figure out why this happens roughly 15% of the time.