[SERVER-60728] Improved MDB crash recovery testing Created: 15/Oct/21  Updated: 06/Nov/23  Resolved: 08/Feb/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 5.3.0

Type: New Feature
Priority: Major - P3
Reporter: Daniel Gottlieb (Inactive)
Assignee: Gregory Wlodarek
Resolution: Fixed
Votes: 0
Labels: CA-PM, post-mortem
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File stop_cont_crash_testing.diff    
Issue Links:
Problem/Incident
causes SERVER-82781 Simulate crash test hook may leave be... Closed
causes WT-11821 WiredTiger data corruption detected i... Closed
Related
related to SERVER-66273 Add CleanupConcurrencyWorkloads hook ... Closed
is related to SERVER-60636 Create a passthrough suite to termina... Closed
Backwards Compatibility: Fully Compatible
Sprint: Execution Team 2022-02-07, Execution Team 2022-02-21
Participants:

 Description   

MDB currently has powercycle and process termination tests, which have historically discovered durability bugs. It's not obvious that those tests are fundamentally insufficient, but we've found other durability bugs using other techniques. Specifically, we can set up a MongoDB cluster and:

  • Run a workload against the cluster
  • SIGSTOP one of the mongodb processes
  • Copy that process' dbpath to a tmp path with direct I/O
  • SIGCONT the paused process
  • Start a mongod on the tmp path
  • Run validate on all collections in the tmp path
  • Repeat in a tight loop.
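
The steps above could be sketched roughly as follows. This is a minimal illustration, not the attached patch and not the hook that was eventually committed. The names `snapshot_while_stopped`, `crash_test_loop`, and the `validate_copy` callback are hypothetical. The sketch also copies with an ordinary buffered `shutil.copytree` rather than direct I/O, so it does not reproduce that part of the procedure.

```python
import os
import shutil
import signal
import tempfile


def snapshot_while_stopped(pid, dbpath, dest):
    """Pause `pid`, copy `dbpath` to `dest`, and always resume `pid`."""
    # kill() only queues the signal. A stricter version would confirm the
    # stopped state before copying (e.g. waitpid with os.WUNTRACED when the
    # target is a child process).
    os.kill(pid, signal.SIGSTOP)
    try:
        # Assumption: plain buffered copy. The ticket's procedure uses
        # direct I/O for this step; that is not reproduced here.
        shutil.copytree(dbpath, dest)
    finally:
        # Resume the paused process even if the copy fails.
        os.kill(pid, signal.SIGCONT)
    return dest


def crash_test_loop(pid, dbpath, validate_copy, iterations):
    """Repeat snapshot + validation; return one result per iteration."""
    results = []
    for _ in range(iterations):
        scratch = tempfile.mkdtemp(prefix="crash_snapshot_")
        dest = os.path.join(scratch, "db")
        try:
            snapshot_while_stopped(pid, dbpath, dest)
            # Hypothetical callback: expected to start a mongod on `dest`,
            # run validate on every collection, and shut that mongod down.
            results.append(validate_copy(dest))
        finally:
            # Discard the copied data files before the next iteration.
            shutil.rmtree(scratch, ignore_errors=True)
    return results
```

In a real setup, `validate_copy` would presumably launch `mongod --dbpath` on the copied directory, issue the `validate` command against each collection, and report any failures. Keeping it as a callback leaves the pause/copy/resume logic independent of how the server is started.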

There's interest in permanently adding this to our test suites. Attached is an unrefined (apologies) patch that can be used as a starting point for implementing the above.



 Comments   
Comment by Githook User [ 08/Feb/22 ]

Author: Gregory Wlodarek <gregory.wlodarek@mongodb.com> (GWlodarek)

Message: SERVER-60728 Add a new test hook to simulate crashes and validate data files after recovery
Branch: master
https://github.com/mongodb/mongo/commit/f61906c0621c146115b5f8e00b9b79c59ebc1088

Comment by Daniel Gottlieb (Inactive) [ 30/Nov/21 ]

I wouldn't say we understand the gap between the existing tests and what this proposes. What the patch offers did find some data corruption bugs that the existing tests seemed unable to uncover. It was requested we file a ticket to productionize the new testing, but it's not clear to me exactly what form this should take.

Comment by Connie Chen [ 29/Nov/21 ]

daniel.gottlieb, would you be able to answer judah.schvimer's question above?

Comment by Judah Schvimer [ 08/Nov/21 ]

Do we plan to add this testing to our existing terminate/kill passthroughs? Do we understand the gap between our existing tests and what this proposes?
