[SERVER-34783] powercycle should ensure it has the complete output from mongod before crashing the machine Created: 02/May/18  Updated: 06/Dec/22  Resolved: 05/Nov/21

Status: Closed
Project: Core Server
Component/s: Logging, Testing Infrastructure
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Max Hirschhorn Assignee: Backlog - Server Tooling and Methods (STM) (Inactive)
Resolution: Won't Fix Votes: 0
Labels: tig-powercycle
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Server Tooling & Methods
Linked BF Score: 4

Description

While investigating this powercycle failure, maria.vankeulen and I noticed a 2-minute gap in the mongod log between its last entry before the crash and the server restarting.

2018-04-07T02:36:09.802+0000 I NETWORK  [conn2] received client metadata from 127.0.0.1:49695 conn2: { driver: { name: "PyMongo", version: "3.5.1" }, os: { type: "Linux", name: "debian jessie/sid", architecture: "x86_64", version: "3.13.0-24-generic" }, platform: "CPython 2.7.13.final.0" }
2018-04-07T02:36:09.805+0000 I NETWORK  [conn2] end connection 127.0.0.1:49695 (1 connection now open)
2018-04-07T02:36:09.805+0000 I NETWORK  [conn1] end connection 127.0.0.1:49694 (0 connections now open)
2018-04-07T02:38:26.366+0000 I CONTROL  [main] ***** SERVER RESTARTED *****
2018-04-07T02:38:26.452+0000 I CONTROL  [initandlisten] MongoDB starting : pid=1639 port=20001 dbpath=/data/db 64-bit host=ip-10-122-6-192
2018-04-07T02:38:26.452+0000 I CONTROL  [initandlisten] db version v3.6.4-rc0-1-g4c5a017

[2018/04/07 02:38:00.632] Server crashing now

I believe this behavior is "as designed", given that we only call flush() on the log file and never fsync(). This is undesirable because the mongod logs are a record of the operations that ran against the database, which is useful to consult when investigating a data inconsistency failure. After discussing with milkie via Slack, I think there are a few ways to address this issue (illustrative sketches follow the list):

1. Change powertest.py to send the mongod process a SIGUSR1 signal to cause it to reopen its log file (the mongod process is started with --logappend). I don't think this has the intended effect because close() doesn't implicitly call fsync().

2. Change powertest.py to rsync the mongod.log file back to the controller before crashing the machine. This seems potentially tricky with how we'd like for the machine to be crashed immediately after the journaled canary insert finishes.

3. Add a testing-only mode where mongod opens the log file with O_DSYNC. (http://man7.org/linux/man-pages/man2/open.2.html is a handy reference.)

4. Add a testing-only command that causes mongod to fsync() its log file.
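For context on why the current flush()-only behavior loses output (and why options 1 and 4 hinge on an explicit fsync()), here is a minimal Python sketch of the distinction; the file name and log line are illustrative, and mongod's real logging of course lives in its C++ logging subsystem:

import os

# Hypothetical illustration of the flush()/fsync() distinction.
with open("mongod.log", "a") as log_file:
    log_file.write("canary insert acknowledged\n")

    # flush() only drains the process's userspace buffer into the kernel
    # page cache; a power loss at this point can still discard the line.
    log_file.flush()

    # fsync() asks the kernel to write the cached pages to the device.
    # Only once this returns should the line survive a hard crash.
    os.fsync(log_file.fileno())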


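Option 3 would amount to something like the following sketch, translated to Python for brevity; the flag semantics come from open(2), and mongod would do the equivalent in C++ behind a testing-only option:

import os

# Illustrative only: with O_DSYNC, each write() does not return until the
# data has reached the device, so no separate fsync() step is needed.
fd = os.open("mongod.log",
             os.O_WRONLY | os.O_CREAT | os.O_APPEND | os.O_DSYNC, 0o644)
try:
    # Durability is paid for on every write, which is also why O_DSYNC
    # could slow the test down considerably (see the comments below).
    os.write(fd, b"written durably before returning\n")
finally:
    os.close(fd)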

Comments
Comment by Brooke Miller [ 05/Nov/21 ]

We're not planning on enhancing powercycle for this. Please consider reaching out to robert.guo about using Antithesis for additional debugging information.

Comment by Max Hirschhorn [ 13/Jun/18 ]

I discussed with sam.kleinman and he didn't recommend the remote syslog approach. We'll build something ourselves; it could be curator sending the logs somewhere, but there are a number of other options. My current thought is to continuously copy the mongod log to the Evergreen-spawned machine while the powercycle test runs, rather than blocking before the crash to wait for the logs to be flushed somewhere more persistent.
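A rough sketch of that idea, assuming plain rsync over SSH from the Evergreen-spawned machine; the host name, paths, and interval are placeholders rather than the eventual implementation:

import subprocess
import threading

def mirror_mongod_log(remote_host, stop_event, interval_secs=5):
    """Periodically copy the remote mongod log while the test runs.

    Anything mongod flushed before the crash (even if never fsynced)
    should be captured by the last successful copy.
    """
    while not stop_event.is_set():
        # A failed copy (e.g. once the machine has been crashed) is
        # simply ignored and retried on the next iteration.
        subprocess.call([
            "rsync", "--inplace",
            "{}:/log/mongod.log".format(remote_host),
            "mongod.log.mirror",
        ])
        stop_event.wait(interval_secs)

stop_event = threading.Event()
copier = threading.Thread(
    target=mirror_mongod_log, args=("ec2-user@remote-host", stop_event))
copier.start()
# ... run the powercycle workload and crash the machine ...
stop_event.set()
copier.join()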

Comment by Andrew Morrow (Inactive) [ 07/Jun/18 ]

max.hirschhorn - I don't believe calls to the remote syslog are buffered though. As soon as you log the message, it goes over the wire. Why wouldn't that be sufficient, as long as the machine you are crashing isn't the machine running the syslog daemon?
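As an illustration of the "goes over the wire as soon as you log it" property (mongod itself has --syslog support; the Python standard library is used here only to show the mechanism, and the host name is a placeholder):

import logging
import logging.handlers

# Each record is sent to the remote syslog daemon as a UDP datagram the
# moment it is logged; nothing accumulates in a local file buffer that a
# crash could discard. Note that UDP delivery itself is best-effort.
handler = logging.handlers.SysLogHandler(address=("syslog-host", 514))
logger = logging.getLogger("powercycle")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.info("canary insert acknowledged")  # already handed to the network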

Comment by Max Hirschhorn [ 23/May/18 ]

> Max Hirschhorn - Even a separate filesystem might flow through the same IO controller, no?

I don't know how computers work so I couldn't really say.

> Regarding flushing the stream, can we point to some actual code to discuss? Where is the write to the stream that you would like to augment with a flush?

Using the word "flush" was probably misleading. What I was getting at is that the remote syslog would have some queue of messages that we'd want drained before triggering the crash. I don't think there's any code I can point to because it would be some behavior of the remote syslog.

Comment by Andrew Morrow (Inactive) [ 23/May/18 ]

max.hirschhorn - Even a separate filesystem might flow through the same IO controller, no? Regarding flushing the stream, can we point to some actual code to discuss? Where is the write to the stream that you would like to augment with a flush?

Comment by Max Hirschhorn [ 23/May/18 ]

> Here is another suggestion: what about logging to a remote syslog?

acm, that doesn't seem much different from #2, as we'll still want to attempt to flush any messages to the remote syslog before crashing the machine. Could you explain why a separate filesystem for the mongod logs is potentially insufficient for avoiding an impact on the data files and journal files?

Comment by Andrew Morrow (Inactive) [ 23/May/18 ]

max.hirschhorn - I'm inclined to agree with Bruce that any sort of O_DSYNC/fsync option is likely to subvert the intention of the powercycle test, and I'm not particularly excited to plumb that option all the way down through the logging subsystem. We hope to make the logging implementation simpler and less configurable so that it can be made "immortal" in ways that it currently isn't, and that is definitely going to reduce its configurability. Here is another suggestion: what about logging to a remote syslog?

Comment by Ian Whalen (Inactive) [ 04/May/18 ]

Assigning to Platform since they have ownership of the logging subsystem and its different output methods.

Comment by Jonathan Abrahams [ 02/May/18 ]

The powercycle test already uses a /log drive for the logpath and a /data drive for the dbpath.

Comment by Susan LoVerso [ 02/May/18 ]

bruce.lucas's concern is valid. However, options 3 and 4 also seem like the most straightforward ways to solve the problem. Is there a way to measure a weakening of the test before and after such a change? I'd also define weakening as a severe reduction in the amount of work the test can do, if, say, O_DSYNC causes huge delays.

Perhaps one way to mitigate some durability interactions would be to apply option 3 or 4 and put the logpath of the mongodb log file on a separate file system from the database directory and journal files. That would be a third file system if the database and journal are already separated.

Comment by Bruce Lucas (Inactive) [ 02/May/18 ]

One concern is that fsyncing the log file may weaken the test by making more than just the log file durable (depending on kernel and device behavior), and also by changing the timing.

Comment by Max Hirschhorn [ 02/May/18 ]

I've sent this to the Storage team to triage because I suspect they are in the best position to comment on how to make mongod log files durable.
