[SERVER-40514] Race between BufferedHandler.close() and BufferedHandler.flush() leads to resmoke.py hanging with --log=buildlogger Created: 06/Apr/19  Updated: 29/Oct/23  Resolved: 08/Apr/19

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: 3.4.21, 3.6.13, 4.1.10, 4.0.10

Type: Bug Priority: Major - P3
Reporter: Max Hirschhorn Assignee: Mathew Robinson (Inactive)
Resolution: Fixed Votes: 0
Labels: tig-resmoke
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Problem/Incident
is caused by SERVER-28991 Eliminate having individual threads i... Closed
Related
related to SERVER-42228 LoggerRuntimeConfigError exceptions c... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.0, v3.6, v3.4
Sprint: Dev Tools 2019-04-08, Dev Tools 2019-04-22
Participants:
Linked BF Score: 5

 Description   

The changes from a4d2c52 as part of SERVER-28991 made it so a dedicated "flush thread" is used to do the actually sending of logs to logkeeper. The implementation was rewritten from using a custom AlarmClock thread for each logging handler to instead use a sched.scheduler instance to manage when to flush logs to logkeeper after every 10 seconds of the test running.

The change from AlarmClock.dismiss() to sched.scheduler.cancel() was flawed in that canceling the next flush event may fail if the BufferedHandler.flush() method is being concurrently called. This can be seen in how the BufferedHandler.close() method attempts to cancel self.__flush_event but ignores the return value.

def close(self):
    """Flush the buffer and tidies up any resources used by this handler."""
 
    with self.__emit_lock:
        if self.__flush_event is not None:
            flush.cancel(self.__flush_event)

As mathew.robinson and I saw when debugging together, it is therefore possible for a concurrent BufferedHandler.flush() call to schedule another flush event after sched.scheduler.cancel() fails to cancel the current one which leads to BufferedHandler.flush() being called perpetually.

def flush(self):
    """Ensure all logging output has been flushed."""
 
    self.__flush(close_called=False)
 
    with self.__emit_lock:
        if self.__flush_event is not None:
            # We cancel 'self.__flush_event' in case flush() was called by someone other than
            # the flush thread to avoid having multiple flush() events scheduled.
            flush.cancel(self.__flush_event)
            self.__flush_event = flush.flush_after(self, delay=self.interval_secs)
            self.__flush_scheduled_by_emit = False

A straightforward fix is to have BufferedHandler.close() record it was called and to have BufferedHandler.flush() not continuing to schedule itself again in that case.



 Comments   
Comment by Githook User [ 03/May/19 ]

Author:

{'email': 'chasinglogic@gmail.com', 'name': 'Mathew Robinson', 'username': 'chasinglogic'}

Message: SERVER-40514 Fix lint in resmoke handlers.py
Branch: v4.0
https://github.com/mongodb/mongo/commit/ed33d2f8e66cd907ef13ee162f6a1dac3ff25672

Comment by Githook User [ 01/May/19 ]

Author:

{'name': 'Mathew Robinson', 'username': 'chasinglogic', 'email': 'chasinglogic@gmail.com'}

Message: SERVER-40514 Fix race in Resmoke
Branch: v4.0
https://github.com/mongodb/mongo/commit/b67e3c53ea2be50c8ecdc27e1499604ee98d445c

Comment by Githook User [ 01/May/19 ]

Author:

{'name': 'Mathew Robinson', 'username': 'chasinglogic', 'email': 'chasinglogic@gmail.com'}

Message: SERVER-40514 Fix race in Resmoke
Branch: v3.6
https://github.com/mongodb/mongo/commit/9d3315ed0de0f70e944ce1e2b3c9e97808da131c

Comment by Githook User [ 01/May/19 ]

Author:

{'email': 'chasinglogic@gmail.com', 'name': 'Mathew Robinson', 'username': 'chasinglogic'}

Message: SERVER-40514 Fix race in Resmoke
Branch: v3.4
https://github.com/mongodb/mongo/commit/1a379e6d17fc344be69137cc16e8811f302af08c

Comment by Githook User [ 08/Apr/19 ]

Author:

{'name': 'Mathew Robinson', 'username': 'chasinglogic', 'email': 'chasinglogic@gmail.com'}

Message: SERVER-40514 Fix race in Resmoke
Branch: master
https://github.com/mongodb/mongo/commit/2db32e10f3e1e5aedb80703104fae838acfcf696

Generated at Thu Feb 08 04:55:12 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.