[SERVER-42356] teardown(finished=True) isn't ever called for the NoOpFixture Created: 23/Jul/19  Updated: 29/Oct/23  Resolved: 06/Aug/19

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: 4.2.1, 4.3.1

Type: Bug Priority: Major - P3
Reporter: Robert Guo (Inactive) Assignee: Max Hirschhorn
Resolution: Fixed Votes: 0
Labels: tig-resmoke
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Related
related to SERVER-42622 resmoke.py doesn't attempt to tear th... Closed
related to SERVER-42623 sched module in Python 3 causes close... Closed
is related to SERVER-28940 Make resmoke fixture setup/teardown t... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.2
Sprint: STM 2019-08-12
Participants:
Linked BF Score: 31
Story Points: 1

 Description   

The flush thread will block forever for the next event if there isn't one lined up.

We should make it not wait. One way could be to use the non-blocking version of scheduler.run()



 Comments   
Comment by Githook User [ 13/Aug/19 ]

Author:

{'name': 'Max Hirschhorn', 'username': 'visemet', 'email': 'max.hirschhorn@mongodb.com'}

Message: SERVER-42356 Remove special casing for NoOpFixture setup and teardown.

(cherry picked from commit 39e0f9648d012e556b24f3ee1728675bd9b197bb)
Branch: v4.2
https://github.com/mongodb/mongo/commit/a20c374983724f59ae5e6700445070045aac7c84

Comment by Githook User [ 06/Aug/19 ]

Author:

{'name': 'Max Hirschhorn', 'email': 'max.hirschhorn@mongodb.com', 'username': 'visemet'}

Message: SERVER-42356 Remove special casing for NoOpFixture setup and teardown.
Branch: master
https://github.com/mongodb/mongo/commit/39e0f9648d012e556b24f3ee1728675bd9b197bb

Comment by Max Hirschhorn [ 04/Aug/19 ]

1. we missed calling flush.close_later() on some Logger instance

After adding some tracing to BufferedHandler.__init__(), BufferedHandler.flush(), and BufferedHandler.close(), it turns we weren't calling close() on the FixtureLogger instance for the NoOpFixture. This is because Job.teardown_fixture() immediately returns without calling NoOpFixture.teardown().

I also found a couple other cases where close() wouldn't be called which I've written up in SERVER-42622 and SERVER-42623.


Steps to reproduce

Apply the following patch to add the CleanEveryN hook to a test suite that uses the NoOpFixture. Note that specifying n=1 isn't sufficient on its own and ASAN_OPTIONS=detect_leaks=1 is necessary because it causes the CleanEveryN hook to log a message. (BufferedHandler.emit() lazily schedules the first flush() event and would otherwise never be called.)

ASAN_OPTIONS=detect_leaks=1 python buildscripts/resmoke.py --suite=no_passthrough jstests/noPassthrough/shell_error_codes.js --log=buildlogger

diff --git a/buildscripts/resmokeconfig/suites/no_passthrough.yml b/buildscripts/resmokeconfig/suites/no_passthrough.yml
index 453ba6d861..41688103ed 100644
--- a/buildscripts/resmokeconfig/suites/no_passthrough.yml
+++ b/buildscripts/resmokeconfig/suites/no_passthrough.yml
@@ -22,3 +22,6 @@ executor:
     shell_options:
       nodb: ''
       readMode: commands
+  hooks:
+  - class: CleanEveryN
+    n: 1

Comment by Max Hirschhorn [ 25/Jul/19 ]

Yeah the queue is not empty.

I don't think the issue has anything to do with scheduler.run() then. I believe are two possibilities:

  1. we missed calling flush.close_later() on some Logger instance, or
  2. we called logging.Handler.close() but somehow failed to cancel the flush event. Such a bug did exist at one point but was thought to be fixed by the changes from 2db32e1 as part of SERVER-40514.
Comment by Robert Guo (Inactive) [ 24/Jul/19 ]

Yeah the queue is not empty. There's one thing [Event(time=1564005134.2195768, priority=0, action=<bound method BufferedHandler.flush of <BuildloggerGlobalHandler (NOTSET)>>, argument=(), kwargs={})] that keeps getting rescheduled in BufferedHandler.flush. Also close_later()'s scheduled handler.close doesn't show up in the scheduler...

Comment by Max Hirschhorn [ 24/Jul/19 ]

What were the contents of self.__scheduler._queue when you were debugging this? scheduler.run() won't return if there's something in the queue after all.

Comment by Robert Guo (Inactive) [ 24/Jul/19 ]

Odd.. I'm not seeing scheduler.run() return.

Here's my git diff:

diff --git a/buildscripts/resmokeconfig/suites/fle.yml b/buildscripts/resmokeconfig/suites/fle.yml
index 18926486..a829aacf 100644
--- a/buildscripts/resmokeconfig/suites/fle.yml
+++ b/buildscripts/resmokeconfig/suites/fle.yml
@@ -2,7 +2,7 @@ test_kind: js_test
 
 selector:
   roots:
-  - src/mongo/db/modules/*/jstests/fle/*.js
+  - src/mongo/db/modules/*/jstests/fle/fle_c*.js
 
 executor:
   config:
@@ -16,4 +16,4 @@ executor:
       tlsCertificateKeyFile: jstests/libs/client.pem
   hooks:
   - class: CleanEveryN
-    n: 20
+    n: 1
diff --git a/buildscripts/resmokelib/logging/flush.py b/buildscripts/resmokelib/logging/flush.py
index b812fc2b..f36f35cd 100644
--- a/buildscripts/resmokelib/logging/flush.py
+++ b/buildscripts/resmokelib/logging/flush.py
@@ -103,7 +103,9 @@ class _FlushThread(threading.Thread):
 
         try:
             while not (self.__should_stop.is_set() and self.__scheduler.empty()):
+                print("BEFORE SCHEDULER RUN")
                 self.__scheduler.run()
+                print("AFTER SCHEDULER RUN")
 
                 # Reset 'self.__schedule_updated' here since we've processed all the events
                 # thought to exist. Either the queue won't be empty or 'self.__schedule_updated'

Here's the log output: https://gist.github.com/guoyr/cdeb0253bad5915b81409ce6d5d3ac41

Comment by Max Hirschhorn [ 24/Jul/19 ]

Don't run buildlogger flush thread if there are no jobs

robert.guo, I'm not following this issue description. Are you saying the flush thread still runs even after the job threads all exit? That behavior on its own isn't a bug - the close_later() API for the flush thread exists because we don't wait for logs to be flushed to logkeeper before running the next test or test suite.

The flush thread will block forever for the next event if there isn't one lined up.

We should make it not wait. One way could be to use the non-blocking version of scheduler.run()

import sched
scheduler = sched.scheduler()
scheduler.run()

returns immediately. The blocking=False parameter would make it so the scheduler doesn't call the delayfunc parameter. However, the purpose of the interruptible_sleep() function is to make it so the flush thread can be woken up if new work enters the queue or if the thread should exit.

Generated at Thu Feb 08 05:00:16 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.