SERVER-34783

powercycle should ensure it has the complete output from mongod before crashing the machine

    Details

    • Linked BF Score:
      4

      Description

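      While investigating this powercycle failure, Maria van Keulen and I noticed a roughly 2-minute gap in the mongod log: the last message before the crash is timestamped 02:36:09, but the machine was not crashed until 02:38:00 and the server was restarted at 02:38:26.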

      2018-04-07T02:36:09.802+0000 I NETWORK  [conn2] received client metadata from 127.0.0.1:49695 conn2: { driver: { name: "PyMongo", version: "3.5.1" }, os: { type: "Linux", name: "debian jessie/sid", architecture: "x86_64", version: "3.13.0-24-generic" }, platform: "CPython 2.7.13.final.0" }
      2018-04-07T02:36:09.805+0000 I NETWORK  [conn2] end connection 127.0.0.1:49695 (1 connection now open)
      2018-04-07T02:36:09.805+0000 I NETWORK  [conn1] end connection 127.0.0.1:49694 (0 connections now open)
      2018-04-07T02:38:26.366+0000 I CONTROL  [main] ***** SERVER RESTARTED *****
      2018-04-07T02:38:26.452+0000 I CONTROL  [initandlisten] MongoDB starting : pid=1639 port=20001 dbpath=/data/db 64-bit host=ip-10-122-6-192
      2018-04-07T02:38:26.452+0000 I CONTROL  [initandlisten] db version v3.6.4-rc0-1-g4c5a017
      

      [2018/04/07 02:38:00.632] Server crashing now
      

      I believe this behavior is "as designed", given that we only call flush() on the log file and not fsync(): flush() hands the buffered data to the operating system, but nothing forces it onto disk before the machine is crashed. This is undesirable because the mongod logs are a record of the operations that happened against the database, which is useful to consult when investigating a data inconsistency failure. After discussing with Eric Milkie via Slack, I think there are a few ways to address this issue:

      1. Change powertest.py to send the mongod process a SIGUSR1 signal to cause it to reopen its log file (the mongod process is started with --logappend). I don't think this has the intended effect because close() doesn't implicitly call fsync().

      2. Change powertest.py to rsync the mongod.log file back to the controller before crashing the machine. This seems potentially tricky, given that we'd like the machine to be crashed immediately after the journaled canary insert finishes.

      3. Add a testing-only mode where mongod opens the log file with O_DSYNC. (http://man7.org/linux/man-pages/man2/open.2.html is a handy reference.)

      4. Add a testing-only command that causes mongod to fsync() its log file.
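
To make the flush() versus fsync() distinction behind options 3 and 4 concrete, here is a minimal Python sketch. It is only an illustration and not mongod's actual logging code (which is C++); the helper names append_with_fsync and open_log_dsync are hypothetical, and O_DSYNC is assumed to be available only on platforms that define it.

```python
import os
import tempfile


def append_with_fsync(path, line):
    """Append a line, then push it all the way to stable storage.

    flush() only moves data from the user-space buffer into the kernel's
    page cache; os.fsync() asks the kernel to write it to the device.
    (Hypothetical helper, analogous in spirit to option 4.)
    """
    with open(path, "a") as log_file:
        log_file.write(line)
        log_file.flush()
        os.fsync(log_file.fileno())


def open_log_dsync(path):
    """Open a log file so that each write() is synchronized on return.

    (Hypothetical helper, analogous in spirit to option 3.) O_DSYNC is not
    defined on every platform, so fall back to a plain append-mode open.
    """
    flags = os.O_WRONLY | os.O_APPEND | os.O_CREAT
    flags |= getattr(os, "O_DSYNC", 0)
    return os.open(path, flags, 0o644)


if __name__ == "__main__":
    log_path = os.path.join(tempfile.mkdtemp(), "mongod.log")

    # Option 4 style: explicit fsync() after the write.
    append_with_fsync(log_path, "canary insert acknowledged\n")

    # Option 3 style: the file descriptor itself is opened with O_DSYNC.
    fd = open_log_dsync(log_path)
    try:
        os.write(fd, b"Server crashing now\n")
    finally:
        os.close(fd)

    with open(log_path) as log_file:
        print(log_file.read(), end="")
```

The trade-off is the usual one: O_DSYNC makes every log write pay the synchronization cost, whereas an explicit fsync() can be issued once, right before the machine is crashed.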
