Core Server / SERVER-34783

powercycle should ensure it has the complete output from mongod before crashing the machine


Details


    Description

      While investigating this powercycle failure, maria.vankeulen and I noticed a roughly 2-minute gap in the mongod log between the last message recorded before the machine was crashed and the message for the server restart.

      2018-04-07T02:36:09.802+0000 I NETWORK  [conn2] received client metadata from 127.0.0.1:49695 conn2: { driver: { name: "PyMongo", version: "3.5.1" }, os: { type: "Linux", name: "debian jessie/sid", architecture: "x86_64", version: "3.13.0-24-generic" }, platform: "CPython 2.7.13.final.0" }
      2018-04-07T02:36:09.805+0000 I NETWORK  [conn2] end connection 127.0.0.1:49695 (1 connection now open)
      2018-04-07T02:36:09.805+0000 I NETWORK  [conn1] end connection 127.0.0.1:49694 (0 connections now open)
      2018-04-07T02:38:26.366+0000 I CONTROL  [main] ***** SERVER RESTARTED *****
      2018-04-07T02:38:26.452+0000 I CONTROL  [initandlisten] MongoDB starting : pid=1639 port=20001 dbpath=/data/db 64-bit host=ip-10-122-6-192
      2018-04-07T02:38:26.452+0000 I CONTROL  [initandlisten] db version v3.6.4-rc0-1-g4c5a017
      

      [2018/04/07 02:38:00.632] Server crashing now
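
To make the size of the gap concrete, here is a small sketch that computes it from the three timestamps quoted above. It assumes the controller's "Server crashing now" timestamp is in UTC like the mongod log timestamps (the controller line carries no time zone), and the variable names are only illustrative.

```python
from datetime import datetime

MONGOD_FORMAT = "%Y-%m-%dT%H:%M:%S.%f"
CONTROLLER_FORMAT = "%Y/%m/%d %H:%M:%S.%f"

# Last mongod log line that survived the crash (the "+0000" suffix is dropped).
last_logged = datetime.strptime("2018-04-07T02:36:09.805", MONGOD_FORMAT)
# Controller-side timestamp of "Server crashing now" (assumed to be UTC as well).
crashed = datetime.strptime("2018/04/07 02:38:00.632", CONTROLLER_FORMAT)
# First mongod log line after the restart.
restarted = datetime.strptime("2018-04-07T02:38:26.366", MONGOD_FORMAT)

# Window of mongod activity with no surviving log output before the crash.
missing_log_seconds = (crashed - last_logged).total_seconds()
# Time between crashing the machine and mongod starting again.
restart_seconds = (restarted - crashed).total_seconds()

print(f"missing log output before crash: {missing_log_seconds:.3f}s")
print(f"crash to restart: {restart_seconds:.3f}s")
```

Under that assumption, most of the gap falls before the crash, which is consistent with log output that was written by mongod but never reached the disk.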
      

      I believe this behavior is "as designed", given that we only call flush() on the log file and not fsync(). This is undesirable because the mongod logs are a record of the operations that happened against the database, which is useful to consult when investigating a data inconsistency failure. After discussing with milkie via Slack, I think there are a few ways to address this issue:

      1. Change powertest.py to send the mongod process a SIGUSR1 signal to cause it to reopen its log file (the mongod process is started with --logappend). I don't think this has the intended effect because close() doesn't implicitly call fsync().

      2. Change powertest.py to rsync the mongod.log file back to the controller before crashing the machine. This seems potentially tricky with how we'd like for the machine to be crashed immediately after the journaled canary insert finishes.

      3. Add a testing-only mode where mongod opens the log file with O_DSYNC. (http://man7.org/linux/man-pages/man2/open.2.html is a handy reference.)

      4. Add a testing-only command that causes mongod to fsync() its log file.
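
The difference between flush(), fsync(), and O_DSYNC that options 3 and 4 rely on can be sketched in Python. This is only an illustration of the system-call behavior, not mongod's logging code; the function names are hypothetical, and `os.O_DSYNC` is assumed to be available (it is on Linux; the `getattr` fallback simply omits the flag elsewhere).

```python
import os
import tempfile


def append_flush_only(path, line):
    """flush() hands the data to the kernel page cache; a power loss
    before writeback can still lose it."""
    with open(path, "a") as f:
        f.write(line)
        f.flush()


def append_and_fsync(path, line):
    """flush() followed by fsync() asks the kernel to push the data
    to stable storage before returning (the idea behind option 4)."""
    with open(path, "a") as f:
        f.write(line)
        f.flush()
        os.fsync(f.fileno())


def append_with_dsync(path, line):
    """With O_DSYNC, each write() returns only after the data has been
    written to storage (the idea behind option 3)."""
    flags = os.O_WRONLY | os.O_APPEND | os.O_CREAT | getattr(os, "O_DSYNC", 0)
    fd = os.open(path, flags, 0o644)
    try:
        os.write(fd, line.encode("utf-8"))
    finally:
        os.close(fd)


log_dir = tempfile.mkdtemp()
log_path = os.path.join(log_dir, "example.log")
append_flush_only(log_path, "flushed only\n")
append_and_fsync(log_path, "flushed and fsynced\n")
append_with_dsync(log_path, "written with O_DSYNC\n")

with open(log_path) as f:
    lines = f.read().splitlines()
```

All three variants leave the same bytes in the file when the machine stays up; they differ only in what is guaranteed to survive a sudden power loss, which is why the choice matters for powercycle and is invisible in ordinary testing.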


          People

            backlog-server-stm Backlog - Server Tooling and Methods (STM) (Inactive)
            max.hirschhorn@mongodb.com Max Hirschhorn