[SERVER-42615] Run chkdsk command on Windows after each powercycle loop Created: 02/Aug/19  Updated: 29/Oct/23  Resolved: 02/Aug/19

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: 4.2.1, 4.3.1

Type: New Feature Priority: Major - P3
Reporter: Max Hirschhorn Assignee: Max Hirschhorn
Resolution: Fixed Votes: 0
Labels: tig-powercycle
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Related
related to SERVER-54898 Run the sync utility after initial po... Closed
is related to SERVER-42571 Collect Windows event logs on remote ... Closed
Backwards Compatibility: Fully Compatible
Backport Requested:
v4.2
Sprint: STM 2019-08-12
Participants:
Linked BF Score: 26
Story Points: 2

 Description   

We've seen a variety of errors during powercycle testing on Windows after upgrading to Windows Server 2016, none of which are indicative of a MongoDB issue:

  • StartService fails with "The service did not respond to the start or control request in a timely fashion"
  • StartService fails with "The device is not ready"
  • StartService fails with "Access is denied"
  • StartService fails with "Error performing inpage operation"
  • The mongod-powertest service terminates unexpectedly due to not being able to access some file (unnamed by the Application event logs)

We should run the chkdsk command in read-only mode (i.e. without any extra parameters) to see if we can collect diagnostics indicating the NTFS volume is corrupt after using notmyfault.exe to crash the machine.
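A minimal sketch of how the read-only chkdsk step could be wired into a Python test harness. The helper names `chkdsk_command` and `run_chkdsk` are hypothetical and are not taken from the actual commit. The sketch assumes chkdsk is on the PATH of the Windows host and that the caller logs the captured output alongside the other powercycle diagnostics.

```python
import subprocess
from typing import Callable, List, Tuple


def chkdsk_command(drive: str) -> List[str]:
    """Build a read-only chkdsk invocation (no /F, /R, or other switches)."""
    # Accept "C", "c:", or "C:\" and normalize to "C:".
    letter = drive.rstrip(":\\/").upper()
    if len(letter) != 1 or not ("A" <= letter <= "Z"):
        raise ValueError(f"expected a single drive letter, got {drive!r}")
    return ["chkdsk", f"{letter}:"]


def run_chkdsk(drive: str, run: Callable = subprocess.run) -> Tuple[int, str]:
    """Run chkdsk on one volume and return (exit code, combined output).

    Microsoft's chkdsk documentation lists exit code 0 as "no errors found".
    Any other code is returned unchanged so the caller can log it rather than
    fail the test outright.
    """
    proc = run(
        chkdsk_command(drive),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # keep diagnostics in a single stream
        text=True,
        check=False,  # a nonzero exit is data to collect, not an exception
    )
    return proc.returncode, proc.stdout
```

After each powercycle loop the harness could call `run_chkdsk("C:")` for every volume of interest and write the returned output to the task log. The `run` parameter exists only so the helper can be exercised without a Windows host.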



 Comments   
Comment by Githook User [ 13/Aug/19 ]

Author:

{'name': 'Max Hirschhorn', 'username': 'visemet', 'email': 'max.hirschhorn@mongodb.com'}

Message: SERVER-42615 Run chkdsk command on Windows after each powercycle loop.

(cherry picked from commit e6ef0ca20e99b2b3a6682952c2588e6e2d1ba8a9)
Branch: v4.2
https://github.com/mongodb/mongo/commit/5a70e4129ae8b8808d43ed41ded4b68ede816d9f

Comment by Max Hirschhorn [ 05/Aug/19 ]

jonathan.reams had another theory: we're simply not waiting for the data (e.g. the mongod.exe executable that was scp'd over) to have been durably written to disk before starting the first powercycle loop. mark.benvenuto had mentioned that the "Error performing inpage operation" message means we tried to call CreateProcess() on a binary that couldn't be fully read from disk (i.e. an unrecoverable page fault error), so that at least fits with the theory.

If we see more failures, https://docs.microsoft.com/en-us/sysinternals/downloads/sync looks to be a utility we can use to ensure the contents of the C:, D:, and E: drives are all flushed.
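A rough sketch of how the flush step might look if it were added to a Python harness. It assumes the Sysinternals Sync utility accepts drive letters as positional arguments, as its documentation page describes. The helper names and the `sync.exe` location are hypothetical.

```python
import subprocess
from typing import Callable, Iterable, List


def sync_command(drives: Iterable[str], sync_exe: str = "sync.exe") -> List[str]:
    """Build a Sync invocation that flushes only the listed drives.

    Assumption: Sync takes bare drive letters (e.g. "c d e") as arguments.
    """
    letters = []
    for drive in drives:
        # Accept "C", "c:", or "C:\" and normalize to a bare lowercase letter.
        letter = drive.rstrip(":\\/").lower()
        if len(letter) != 1 or not ("a" <= letter <= "z"):
            raise ValueError(f"expected a single drive letter, got {drive!r}")
        letters.append(letter)
    return [sync_exe, *letters]


def flush_drives(drives: Iterable[str], run: Callable = subprocess.run) -> int:
    """Flush the given drives and return Sync's exit code."""
    proc = run(sync_command(drives), check=False)
    return proc.returncode
```

Calling `flush_drives(["C:", "D:", "E:"])` after copying the binaries over, and before the first powercycle loop, would match the theory above.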

Comment by Githook User [ 02/Aug/19 ]

Author:

{'name': 'Max Hirschhorn', 'email': 'max.hirschhorn@mongodb.com', 'username': 'visemet'}

Message: SERVER-42615 Run chkdsk command on Windows after each powercycle loop.
Branch: master
https://github.com/mongodb/mongo/commit/e6ef0ca20e99b2b3a6682952c2588e6e2d1ba8a9

Generated at Thu Feb 08 05:00:58 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.