[SERVER-42315] Don't copy data files from a running mongod after a test fails Created: 22/Jul/19  Updated: 06/Dec/22  Resolved: 09/Jan/20

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Louis Williams Assignee: Backlog - Server Tooling and Methods (STM) (Inactive)
Resolution: Done Votes: 0
Labels: tig-resmoke
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-43049 Test failure file archiving can miss ... Closed
Assigned Teams:
Server Tooling & Methods
Participants:
Linked BF Score: 0

 Description   

Our test infrastructure copies data files for archival while the process is still running, because shutting down mongod may modify those files and make debugging more challenging.

If a checkpoint is active in WiredTiger, the copied data files can end up completely inconsistent and unusable (e.g. we copy a data file, then copy the WT metadata, which may point to a new checkpoint that is absent from the already-copied data file). We should find a way to stop checkpoints, run fsyncLock, or simply SIGKILL the process before copying data files. I think SIGKILL is the simplest approach and would guarantee that no files are modified before archival.
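A minimal sketch of the SIGKILL-then-copy ordering, assuming we control the mongod process and know its dbpath (the function and paths here are illustrative, not the actual resmoke.py archival code):

{code:python}
import os
import shutil
import signal


def archive_data_files(mongod_pid, dbpath, archive_dir):
    """Illustrative only: kill mongod before copying so WiredTiger cannot
    start a new checkpoint while the dbpath is being copied."""
    # SIGKILL terminates the process immediately, with no clean shutdown,
    # so the on-disk files are left exactly as they were at failure time.
    os.kill(mongod_pid, signal.SIGKILL)

    # Reap the child so we know it has actually exited before we copy.
    # (Assumes mongod is a child of this process; otherwise poll the pid.)
    os.waitpid(mongod_pid, 0)

    # Only now is it safe to copy the data files for archival.
    shutil.copytree(dbpath, archive_dir)
{code}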

Here's a task where the data files are corrupt on node1 because the files were copied during an active checkpoint.



 Comments   
Comment by Max Hirschhorn [ 02/Aug/19 ]

Max Hirschhorn, could we prioritize this fix?

What I've understood from chatting in person with vesselina.ratcheva and louis.williams is that Server engineers want:

  • To have the in-memory state from the mongod process immediately after the test failure. This is because data consistency bugs may result from an inconsistency between the in-memory state and the on-disk state. Today, we only take core dumps on test timeouts and not for data inconsistency issues.
  • To have the data files from the mongod process without it going through clean shutdown. Clean shutdown rewrites the data files in a way which may mask the original data inconsistency issue.
    • The way archival in resmoke.py handles this today is to attempt to copy the data files while the process is still running. We've had a number of issues with this approach (especially on Windows, due to file-sharing permissions), so the STM team is eager to do work in this area. The issue Louis is pointing out here is that even though the test has finished and no client will be performing writes, it is still possible for WiredTiger to take a new checkpoint while resmoke.py archival is copying the data directory, so the data files gathered end up being unusable. We need to prevent new checkpoints from being taken while we are archiving the data files. Killing the mongod process is Louis's suggestion for how to achieve this (see the sketch after this list).
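A rough sketch of the order of operations described above, assuming a Linux host where gcore (from gdb) is available; the function and paths are illustrative and this is not the actual resmoke.py implementation:

{code:python}
import os
import shutil
import signal
import subprocess


def capture_failure_state(mongod_pid, dbpath, out_dir):
    """Illustrative only: capture in-memory state, then on-disk state,
    without letting mongod go through a clean shutdown."""
    # 1. Dump the in-memory state first, while the process is still alive.
    #    gcore writes a core file without terminating the target process.
    subprocess.check_call(
        ["gcore", "-o", os.path.join(out_dir, "mongod"), str(mongod_pid)])

    # 2. SIGKILL the process so the data files are not rewritten by a clean
    #    shutdown and no further checkpoints can be taken.
    os.kill(mongod_pid, signal.SIGKILL)
    os.waitpid(mongod_pid, 0)  # assumes mongod is a child of this process

    # 3. Copy the data files only after the process has exited.
    shutil.copytree(dbpath, os.path.join(out_dir, "data"))
{code}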

kelsey.schubert, the fix is very likely an epic-worthy project - it needs a scope document.

Comment by Kelsey Schubert [ 02/Aug/19 ]

max.hirschhorn, could we prioritize this fix?

Comment by Eric Milkie [ 22/Jul/19 ]

Or, alternatively, it could flush to disk some in-memory data that is critical to debugging a task. Writing a new checkpoint just means that the previous checkpoint, from however many seconds or minutes ago, may be overwritten.
I guess I'm unconvinced it would actually be detrimental to problem diagnosis.

Comment by Louis Williams [ 22/Jul/19 ]

milkie, my understanding of why we don't do this today is that the act of writing a checkpoint could mask or overwrite data in files critical to debugging a task.

Comment by Eric Milkie [ 22/Jul/19 ]

I'm not sure we even need that new option – why not just call fsync, wait for it to return, and then copy the files?
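For concreteness, calling fsync through the driver might look like the following (a pymongo sketch; whether a later checkpoint can still race with the copy is exactly the question being discussed):

{code:python}
from pymongo import MongoClient

client = MongoClient("localhost", 27017)  # assumes the mongod under test

# The fsync command blocks until all pending writes have been flushed to
# disk. Passing lock=True would additionally block new writes until
# fsyncUnlock is run (the fsyncLock approach mentioned in the description).
client.admin.command("fsync")

# ... copy the data files here ...
{code}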

Comment by Louis Williams [ 22/Jul/19 ]

max.hirschhorn, my proposal is only to make sure file archival results in readable and reliable data files, which are much more useful than data files that have been corrupted because they were copied during an active checkpoint. I also don't understand how the current procedure makes any guarantee that writes are journaled or in the stable checkpoint. If we want to guarantee all writes are durable before doing SIGKILL, we could potentially add an option to fsync that only calls waitUntilDurable (which flushes the log files) and does not force a checkpoint.

Comment by Max Hirschhorn [ 22/Jul/19 ]

louis.williams, I don't see how we could send SIGKILL when the test isn't guaranteed to have waited for a journal flush when doing its writes (i.e. we aren't necessarily using j=true). Isn't it then possible for the corrupted data to not appear in the stable checkpoint at all, which would also make the data files useless?
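For reference, a journaled write through the driver looks roughly like this (pymongo sketch; the point above is that tests are not guaranteed to use it):

{code:python}
from pymongo import MongoClient, WriteConcern

client = MongoClient("localhost", 27017)
coll = client.test.get_collection(
    "coll", write_concern=WriteConcern(j=True))  # j=true: wait for journal flush

# insert_one only returns once the write has been made durable in the journal.
coll.insert_one({"x": 1})
{code}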
