[SERVER-26867] Timeout-related assert.soon failures should trigger the hang analyzer Created: 01/Nov/16  Updated: 19/Nov/19  Resolved: 31/Oct/19

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: 4.3.1

Type: Improvement Priority: Major - P3
Reporter: Mathias Stearn Assignee: Ryan Timmons
Resolution: Done Votes: 11
Labels: stm, tig-hanganalyzer
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-43256 Fix incorrect uses of assert.soon and... Closed
Duplicate
is duplicated by SERVER-43255 Automatically call whatthefailure fro... Closed
is duplicated by SERVER-19030 Add hang-analyzer support for MongoRu... Closed
Related
related to SERVER-40857 Remove write concern wtimeouts that a... Backlog
related to SERVER-25346 Allow doassert() to invoke external e... Closed
is related to SERVER-44739 Running gdb concurrently with testing... Closed
Backwards Compatibility: Fully Compatible
Sprint: STM 2019-10-31
Participants:
Linked BF Score: 22
Story Points: 2

 Description   

When a test fails due to a timeout, it would be nice if that failure triggered the hang analyzer before the processes are torn down. This would let us see what the server is doing that prevents it from doing what we expect.

Timeout-related failures include:

  • assert.soon()
  • wtimeout expiring
  • $maxTimeMS expiring

Update: Per the discussion below, we will limit this to assert.soon, assert.soonNoExcept, assert.retry, and assert.retryNoExcept.
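For illustration, a minimal sketch of the intended assert.soon() behavior (not the shipped patch; the default timeout and the MongoRunner.runHangAnalyzer() call follow the discussion below):

```javascript
// Minimal sketch of the intended behavior, not the actual implementation:
// when the wait times out, capture stacks before failing the test.
assert.soon = function(func, msg, timeoutMS, intervalMS) {
    const start = new Date();
    timeoutMS = timeoutMS || 5 * 60 * 1000;  // current default wait
    while (new Date() - start < timeoutMS) {
        if (func()) {
            return;
        }
        sleep(intervalMS || 200);
    }
    // Invoke hang_analyzer.py on shell/resmoke-spawned processes (SERVER-43254).
    MongoRunner.runHangAnalyzer();
    doassert(msg || "assert.soon timed out: " + func);
};
```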



 Comments   
Comment by Ryan Timmons [ 31/Oct/19 ]

This has been released. assert.soon, assert.retry, and assert.time will now call the hang-analyzer prior to throwing.

Comment by Githook User [ 31/Oct/19 ]

Author:

{'name': 'Ryan Timmons', 'username': 'rtimmons', 'email': 'ryan.timmons@mongodb.com'}

Message: SERVER-26867 Call hang_analyzer.py from assert functions
Branch: master
https://github.com/mongodb/mongo/commit/f97d73eefc1a4302e949d9cbc85d89fd07847ee6

Comment by Judah Schvimer [ 16/Oct/19 ]

SGTM. We've decided to largely stop providing wtimeout and maxTimeMS timeouts, so that tests hang until the process is killed and the hang analyzer output is provided. This may not be ideal, but it works.

Comment by Ryan Timmons [ 16/Oct/19 ]

You can now call MongoRunner.runHangAnalyzer() in the shell to invoke hang_analyzer.py on all processes started by the shell and resmoke. This was done in SERVER-43254. (There's some follow-up work to get it working when debug symbols haven't already been inflated - SERVER-44017.)
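As an illustration of the drop-in usage (the ReplSetTest instance `rst` here is assumed for the example):

```javascript
// Example: manually invoke the hang analyzer from test logic when a timeout
// surfaces as an exception, then rethrow so the test still fails.
try {
    rst.awaitReplication();
} catch (e) {
    MongoRunner.runHangAnalyzer();  // analyzes all shell/resmoke-spawned processes
    throw e;
}
```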

It is trivial to call this from assert.soon prior to actually throwing. But the original text of this ticket indicates also calling this for wtimeout and $maxTimeMs expiring.

I'm less confident about how we'd actually hook into those mechanisms within the shell without doing some trickery to monkeypatch calls to runCommand and inspect args for wtimeout/$maxTimeMS (the same argument applies to assert.commandWorked).
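Purely for context, a rough hypothetical illustration of the kind of monkeypatching being referred to; the runCommand signature, argument shapes, and error-code checks here are assumptions, not a proposed implementation:

```javascript
// Hypothetical only: wrap runCommand, and if a command that carried a
// wtimeout/maxTimeMS comes back with a timeout error, run the hang analyzer.
const originalRunCommand = Mongo.prototype.runCommand;
Mongo.prototype.runCommand = function(dbName, cmdObj, options) {
    const res = originalRunCommand.apply(this, arguments);
    const hasTimeout = cmdObj.hasOwnProperty("maxTimeMS") ||
        (cmdObj.writeConcern && cmdObj.writeConcern.hasOwnProperty("wtimeout"));
    const timedOut = res.code === ErrorCodes.MaxTimeMSExpired ||
        (res.writeConcernError &&
         res.writeConcernError.code === ErrorCodes.WriteConcernFailed);
    if (hasTimeout && timedOut) {
        MongoRunner.runHangAnalyzer();
    }
    return res;
};
```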

It seems like we get most of the benefit from

  1. Having MongoRunner.runHangAnalyzer() as a thing that can be dropped in as-needed into catch blocks, and
  2. Automatically calling this from assert.soon, assert.soonNoExcept, assert.retry, and assert.retryNoExcept.

Unless there's a simple way to incorporate this function call into other timeout scenarios, I propose that we don't automatically run hang_analyzer.py for anything other than these 4 assert methods. This aligns with the original scope doc for hang_analyzer.py integration (PM-1546), but I'm curious whether there is a pressing need for more and/or an easy way to cover these other cases.

Open questions to anyone watching this ticket:

  1. Is there a straightforward way to inject calls to hang_analyzer.py in all places where a query/command may have a wtimeout (and similarly for maxTimeMS)?
  2. How useful is automatically calling hang_analyzer.py from places other than assert.soon, given that you can detect timeouts via exceptions and call MongoRunner.runHangAnalyzer() within test logic?

If there are no objections in the coming day or two, I'll update this ticket's text/description to reflect just the updates to assert.soon and friends. We can evaluate incorporating it further in future tickets.

Please feel free to reach out to me/STM on slack if easier.

Comment by Max Hirschhorn [ 07/Aug/19 ]

Max Hirschhorn and Robert Guo, what is the timeline for doing #2?

We're planning to write a scope document for this work in the next iteration. I don't imagine it'll need a design document, but it should have a sketch of the implementation as an appendix for reviewers to read through. It'd be worth having a conversation as to whether we'd want to bubble this up over the work on the query and aggregation fuzzers currently in progress and the performance testing with slow networks that was slated to be next.

Would a repl team member be able to do it effectively?

Let's decide on that after the scope document (esp. the appendix section) is written.

Comment by Judah Schvimer [ 07/Aug/19 ]

max.hirschhorn and robert.guo, what is the timeline for doing #2? Would a repl team member be able to do it effectively?

Comment by Max Hirschhorn [ 06/Aug/19 ]

I chatted with william.schultz in person yesterday. I believe he and I are on the same page about the local testing experience and why we'd rather consider having different timeouts of "forever" when running in Evergreen versus when running locally. (I stand by my position that 5 minutes is already too long to have to wait locally to find out you wrote the wrong condition inside of an assert.soon() while iterating on writing/debugging a test, but it'd be challenging to reverse SERVER-25977 at this point.)

To summarize the different options for SERVER-26867:

  1. We could increase the assert.soon() and ReplSetTest.kDefaultTimeoutMS timeouts to 24 hours when running in Evergreen. (We would leave their current values when running locally.) The complication with this approach is that tests which run for a long period of time produce an obnoxious amount of log output, which makes the Build Baron tooling owned by the DAG team unhappy. While the Evergreen task would eventually time out, it may be after 30 minutes, 2 hours, or whatever timeout.update value has been set.
  2. We could run python buildscripts/hang_analyzer.py ... from within assert.soon() when it would throw a JavaScript exception. This is more involved than the approach in #1 because we'd need to (i) ask the ProgramRunner instance in the mongo shell which MongoDB processes it spawned, and (ii) plumb the pids of the MongoDB processes spawned by the resmoke.py job thread for the currently running test.
    • One question may be whether there should be a JavaScript function which can be called from a test directly that is a no-op when run locally and runs python buildscripts/hang_analyzer.py ... in Evergreen. This would allow the debugger to be attached at an arbitrary point in the test and not just during assert.soon().
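A minimal sketch of the helper mentioned in the sub-bullet of option #2; the helper name, the TestData.inEvergreen flag, the hang_analyzer.py arguments, and the pid plumbing are all assumptions for illustration:

```javascript
// Hypothetical helper: a no-op when run locally, runs the hang analyzer in Evergreen.
function analyzeHangIfInEvergreen(pids) {
    if (typeof TestData === "undefined" || !TestData.inEvergreen) {
        return;  // local run: keep the fast feedback loop, do nothing
    }
    // Assumes the pids of shell/resmoke-spawned processes were plumbed through.
    runProgram("python", "buildscripts/hang_analyzer.py", "-d", pids.join(","));
}
```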

I had imagined we'd achieve this ticket by doing #2.

Comment by William Schultz (Inactive) [ 05/Aug/19 ]

BF-14272 (investigated by samy.lanka) is another case where getting stack traces after an awaitReplication timeout would have been very helpful. In a recent replication meeting we raised the question of why it was not considered acceptable to simply raise the default awaitReplication timeout to match the task timeout, so that the timeouts would trigger the hang analyzer. We thought that there may have been a prior objection to this on the grounds that it would make the local testing experience worse, but the details of the objection were not entirely clear. It would be good to clarify what the possible objections may be.

Another proposal that came up was to allow the awaitReplication timeout to be configured separately when running in Evergreen versus running locally. This would avoid any negative side effects on the local testing experience. One option would be to allow the awaitReplication timeout to be overridden by setting a "TestData" parameter. Something similar was done for altering log verbosity as part of SERVER-32907. We could, for example, have an "awaitReplicationTimeout" parameter that we set to the appropriate task timeout when running in Evergreen, but that defaults to the existing value (10 minutes) when tests are run locally. This would not necessarily fix the issue for all varieties of these timeouts (e.g. wtimeout, assert.soon), but starting with awaitReplication could be a good step.
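For example, a minimal sketch of the override being described (the TestData parameter name and the plumbing here are assumptions):

```javascript
// Hypothetical: let Evergreen set TestData.awaitReplicationTimeoutMS, while local
// runs keep the existing 10-minute default.
const kDefaultAwaitReplicationTimeoutMS = 10 * 60 * 1000;
const awaitReplicationTimeoutMS =
    (typeof TestData !== "undefined" && TestData.awaitReplicationTimeoutMS) ||
    kDefaultAwaitReplicationTimeoutMS;
```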

Comment by Siyuan Zhou [ 21/Jun/19 ]

BF-13678 doesn't have sufficient evidence to determine what's causing the timeout. A core dump would also be very helpful when awaitReplication() times out.

Comment by Dianna Hohensee (Inactive) [ 21/Apr/17 ]

To jump on the bandwagon of providing use-case incentives for this improvement:

It would be nice for assert.soon failures like BF-5335, where we're concerned about deadlock because, inexplicably, nothing is happening and nothing of interest is getting logged (and we've been making significant sharding code additions). I'm going to try to artificially provoke the hang analyzer, somehow, to get stack dumps, along with bumping the verbosity, in hopes of gleaning more information – this ticket would do it all for me, apparently.

Comment by Judah Schvimer [ 14/Nov/16 ]

This would make BF-3913 much easier to debug. It appears to be a hang elsewhere in the system that leads to writeConcern timing out. Knowing what else is happening on the secondary would make it clear what was hanging and causing the timeout.

Comment by Mathias Stearn [ 14/Nov/16 ]

Expanding scope to include other timeouts after seeing a hard to debug failure of wtimeout.

Comment by Eric Milkie [ 01/Nov/16 ]

Or rather, the threads are working normally (or are all idle), and it is the logic of the test that set up the failure to begin with.

Comment by Judah Schvimer [ 01/Nov/16 ]

If we do this, assert.retry() should also probably be included.

Comment by Eric Milkie [ 01/Nov/16 ]

I don't think this would be that useful. The overwhelming majority of assert.soon failures that I've seen fail without the server threads actually being hung.
