Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-49203

Jepsen-Smoke Has a 15% System-Failure Rate

    • Type: Icon: Bug Bug
    • Resolution: Fixed
    • Priority: Icon: Major - P3 Major - P3
    • 4.7.0
    • Affects Version/s: Backlog
    • Component/s: Testing Infrastructure
    • Labels:
      None
    • Fully Compatible
    • ALL
    • STM 2020-07-27, STM 2020-09-07
    • 0
    • 0

      The failures manifest themselves like

       [2020/06/03 20:24:22.459] ERROR [2020-06-03 20:24:22,458] main - jepsen.cli Oh jeez, I'm sorry, Jepsen broke. Here's why:
       [2020/06/03 20:24:22.459] java.util.concurrent.ExecutionException: java.lang.RuntimeException: Mongo setup on ip-10-122-57-2:20000 timed out!
       [2020/06/03 20:24:22.459] 	at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[na:1.8.0_252]
       [2020/06/03 20:24:22.459] 	at java.util.concurrent.FutureTask.get(FutureTask.java:192) ~[na:1.8.0_252]
       [2020/06/03 20:24:22.459] 	at clojure.core$deref_future.invokeStatic(core.clj:2208) ~[clojure-1.8.0.jar:na]
       [2020/06/03 20:24:22.459] 	at clojure.core$future_call$reify__6962.deref(core.clj:6688) ~[clojure-1.8.0.jar:na]
       [2020/06/03 20:24:22.459] 	at clojure.core$deref.invokeStatic(core.clj:2228) ~[clojure-1.8.0.jar:na]
       [2020/06/03 20:24:22.459] 	at clojure.core$deref.invoke(core.clj:2214) ~[clojure-1.8.0.jar:na]
       [2020/06/03 20:24:22.459] 	at clojure.core$map$fn__4785.invoke(core.clj:2644) ~[clojure-1.8.0.jar:na]
       [2020/06/03 20:24:22.459] 	at clojure.lang.LazySeq.sval(LazySeq.java:40) ~[clojure-1.8.0.jar:na]
       [2020/06/03 20:24:22.459] 	at clojure.lang.LazySeq.seq(LazySeq.java:49) ~[clojure-1.8.0.jar:na]
       [2020/06/03 20:24:22.459] 	at clojure.lang.RT.seq(RT.java:521) ~[clojure-1.8.0.jar:na]
       [2020/06/03 20:24:22.459] 	at clojure.core$seq__4357.invokeStatic(core.clj:137) ~[clojure-1.8.0.jar:na]
       [2020/06/03 20:24:22.459] 	at clojure.core.protocols$seq_reduce.invokeStatic(protocols.clj:24) ~[clojure-1.8.0.jar:na]
       [2020/06/03 20:24:22.459] 	at clojure.core.protocols$fn__6738.invokeStatic(protocols.clj:75) ~[clojure-1.8.0.jar:na]
       [2020/06/03 20:24:22.459] 	at clojure.core.protocols$fn__6738.invoke(protocols.clj:75) ~[clojure-1.8.0.jar:na]
       [2020/06/03 20:24:22.459] 	at clojure.core.protocols$fn__6684$G__6679__6697.invoke(protocols.clj:13) ~[clojure-1.8.0.jar:na]
       [2020/06/03 20:24:22.459] 	at clojure.core$reduce.invokeStatic(core.clj:6545) ~[clojure-1.8.0.jar:na]
       [2020/06/03 20:24:22.459] 	at clojure.core$into.invokeStatic(core.clj:6610) ~[clojure-1.8.0.jar:na]
       [2020/06/03 20:24:22.459] 	at clojure.core$into.invoke(core.clj:6604) ~[clojure-1.8.0.jar:na]
       [2020/06/03 20:24:22.459] 	at jepsen.control$on_nodes.invokeStatic(control.clj:373) ~[jepsen-0.1.8.jar:na]
       [2020/06/03 20:24:22.459] 	at jepsen.control$on_nodes.invoke(control.clj:357) ~[jepsen-0.1.8.jar:na]
       [2020/06/03 20:24:22.459] 	at jepsen.control$on_nodes.invokeStatic(control.clj:362) ~[jepsen-0.1.8.jar:na]
       [2020/06/03 20:24:22.459] 	at jepsen.control$on_nodes.invoke(control.clj:357) ~[jepsen-0.1.8.jar:na]
       [2020/06/03 20:24:22.459] 	at jepsen.core$run_BANG_$fn__4284$fn__4287.invoke(core.clj:584) ~[jepsen-0.1.8.jar:na]
       [2020/06/03 20:24:22.459] 	at jepsen.core$run_BANG_$fn__4284.invoke(core.clj:572) ~[jepsen-0.1.8.jar:na]
       [2020/06/03 20:24:22.459] 	at jepsen.core$run_BANG_.invokeStatic(core.clj:553) ~[jepsen-0.1.8.jar:na]
       [2020/06/03 20:24:22.459] 	at jepsen.core$run_BANG_.invoke(core.clj:500) ~[jepsen-0.1.8.jar:na]
       [2020/06/03 20:24:22.459] 	at jepsen.cli$single_test_cmd$fn__4984.invoke(cli.clj:329) ~[jepsen-0.1.8.jar:na]
       [2020/06/03 20:24:22.459] 	at jepsen.cli$run_BANG_.invokeStatic(cli.clj:273) [jepsen-0.1.8.jar:na]
       [2020/06/03 20:24:22.459] 	at jepsen.cli$run_BANG_.invoke(cli.clj:203) [jepsen-0.1.8.jar:na]
       [2020/06/03 20:24:22.459] 	at jepsen.mongodb.runner$_main.invokeStatic(runner.clj:164) [classes/:na]
       [2020/06/03 20:24:22.459] 	at jepsen.mongodb.runner$_main.doInvoke(runner.clj:162) [classes/:na]
       [2020/06/03 20:24:22.459] 	at clojure.lang.RestFn.invoke(RestFn.java:3894) [clojure-1.8.0.jar:na]
       [2020/06/03 20:24:22.459] 	at clojure.lang.Var.invoke(Var.java:676) [clojure-1.8.0.jar:na]
       [2020/06/03 20:24:22.459] 	at user$eval5.invokeStatic(form-init3654414589850886228.clj:1) [na:na]
       [2020/06/03 20:24:22.459] 	at user$eval5.invoke(form-init3654414589850886228.clj:1) [na:na]
       [2020/06/03 20:24:22.459] 	at clojure.lang.Compiler.eval(Compiler.java:6927) [clojure-1.8.0.jar:na]
       [2020/06/03 20:24:22.459] 	at clojure.lang.Compiler.eval(Compiler.java:6917) [clojure-1.8.0.jar:na]
       [2020/06/03 20:24:22.459] 	at clojure.lang.Compiler.load(Compiler.java:7379) [clojure-1.8.0.jar:na]
       [2020/06/03 20:24:22.459] 	at clojure.lang.Compiler.loadFile(Compiler.java:7317) [clojure-1.8.0.jar:na]
       [2020/06/03 20:24:22.459] 	at clojure.main$load_script.invokeStatic(main.clj:275) [clojure-1.8.0.jar:na]
       [2020/06/03 20:24:22.459] 	at clojure.main$init_opt.invokeStatic(main.clj:277) [clojure-1.8.0.jar:na]
       [2020/06/03 20:24:22.459] 	at clojure.main$init_opt.invoke(main.clj:277) [clojure-1.8.0.jar:na]
       [2020/06/03 20:24:22.459] 	at clojure.main$initialize.invokeStatic(main.clj:308) [clojure-1.8.0.jar:na]
       [2020/06/03 20:24:22.459] 	at clojure.main$null_opt.invokeStatic(main.clj:342) [clojure-1.8.0.jar:na]
       [2020/06/03 20:24:22.459] 	at clojure.main$null_opt.invoke(main.clj:339) [clojure-1.8.0.jar:na]
       [2020/06/03 20:24:22.459] 	at clojure.main$main.invokeStatic(main.clj:421) [clojure-1.8.0.jar:na]
       [2020/06/03 20:24:22.459] 	at clojure.main$main.doInvoke(main.clj:384) [clojure-1.8.0.jar:na]
       [2020/06/03 20:24:22.459] 	at clojure.lang.RestFn.invoke(RestFn.java:421) [clojure-1.8.0.jar:na]
       [2020/06/03 20:24:22.459] 	at clojure.lang.Var.invoke(Var.java:383) [clojure-1.8.0.jar:na]
       [2020/06/03 20:24:22.459] 	at clojure.lang.AFn.applyToHelper(AFn.java:156) [clojure-1.8.0.jar:na]
       [2020/06/03 20:24:22.459] 	at clojure.lang.Var.applyTo(Var.java:700) [clojure-1.8.0.jar:na]
       [2020/06/03 20:24:22.459] 	at clojure.main.main(main.java:37) [clojure-1.8.0.jar:na]
       [2020/06/03 20:24:22.459] Caused by: java.lang.RuntimeException: Mongo setup on ip-10-122-57-2:20000 timed out!
       [2020/06/03 20:24:22.459] 	at jepsen.mongodb.core$db$reify__2125.setup_BANG_(core.clj:330) ~[classes/:na]
       [2020/06/03 20:24:22.459] 	at jepsen.db$cycle_BANG_.invokeStatic(db.clj:25) ~[jepsen-0.1.8.jar:na]
       [2020/06/03 20:24:22.459] 	at jepsen.db$cycle_BANG_.invoke(db.clj:20) ~[jepsen-0.1.8.jar:na]
       [2020/06/03 20:24:22.459] 	at clojure.core$partial$fn__4759.invoke(core.clj:2516) ~[clojure-1.8.0.jar:na]
       [2020/06/03 20:24:22.459] 	at jepsen.control$on_nodes$fn__2069.invoke(control.clj:372) ~[jepsen-0.1.8.jar:na]
       [2020/06/03 20:24:22.459] 	at clojure.lang.AFn.applyToHelper(AFn.java:154) [clojure-1.8.0.jar:na]
       [2020/06/03 20:24:22.459] 	at clojure.lang.AFn.applyTo(AFn.java:144) [clojure-1.8.0.jar:na]
       [2020/06/03 20:24:22.459] 	at clojure.core$apply.invokeStatic(core.clj:646) ~[clojure-1.8.0.jar:na]
       [2020/06/03 20:24:22.459] 	at clojure.core$with_bindings_STAR_.invokeStatic(core.clj:1881) ~[clojure-1.8.0.jar:na]
       [2020/06/03 20:24:22.459] 	at clojure.core$with_bindings_STAR_.doInvoke(core.clj:1881) ~[clojure-1.8.0.jar:na]
       [2020/06/03 20:24:22.459] 	at clojure.lang.RestFn.applyTo(RestFn.java:142) [clojure-1.8.0.jar:na]
       [2020/06/03 20:24:22.459] 	at clojure.core$apply.invokeStatic(core.clj:650) ~[clojure-1.8.0.jar:na]
       [2020/06/03 20:24:22.459] 	at clojure.core$bound_fn_STAR_$fn__4671.doInvoke(core.clj:1911) ~[clojure-1.8.0.jar:na]
       [2020/06/03 20:24:22.459] 	at clojure.lang.RestFn.invoke(RestFn.java:408) [clojure-1.8.0.jar:na]
       [2020/06/03 20:24:22.459] 	at jepsen.util$real_pmap$launcher__1160$fn__1161.invoke(util.clj:49) ~[jepsen-0.1.8.jar:na]
       [2020/06/03 20:24:22.459] 	at clojure.core$binding_conveyor_fn$fn__4676.invoke(core.clj:1938) ~[clojure-1.8.0.jar:na]
       [2020/06/03 20:24:22.459] 	at clojure.lang.AFn.call(AFn.java:18) [clojure-1.8.0.jar:na]
       [2020/06/03 20:24:22.459] 	at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_252]
       [2020/06/03 20:24:22.459] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_252]
       [2020/06/03 20:24:22.459] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_252]
       [2020/06/03 20:24:22.459] 	at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_252]
      

      A repeated execution of the latest failure is green indicating that this is a transient error rather than something linked to the server. This represents a bug in the test or infrastructure rather than in the server.

      The fixes I see are:

      1. Disable this task
      2. Add retry logic to this task
      3. Dig deeper into the jepsen test itself to figure out why this happens roughly 15% of the time.

            Assignee:
            robert.guo@mongodb.com Robert Guo (Inactive)
            Reporter:
            ryan.timmons@mongodb.com Ryan Timmons
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

              Created:
              Updated:
              Resolved: