Python Driver / PYTHON-3186

AWS Lambda/FaaS pause and resume behavior causes SDAM heartbeats to timeout

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 4.1
    • Affects Version/s: None
    • Component/s: None
    • None

      AWS Lambda (and likely other FaaS services) pauses the app process when it is idle and resumes it later on demand, when a new request comes in. This pause/resume behavior causes SDAM heartbeats to time out, which then clears the connection pool and marks the server Unknown. The result is connection churn and increased latency, since the servers need to be rediscovered and all pooled connections need to be recreated.

      This behavior can be simulated locally using SIGSTOP + SIGCONT:

      2022-03-25 14:40:38,915 INFO event_loggers Heartbeat sent to server ('localhost', 27018)
      2022-03-25 14:40:38,916 INFO event_loggers Heartbeat sent to server ('localhost', 27019)
      [1]  + 93208 suspended (signal)  python repro-DRIVERS-2246.py
      $ sleep 60
      $ kill -SIGCONT 93208
      2022-03-25 14:42:16,835 WARNING event_loggers Heartbeat to server ('localhost', 27017) failed with error localhost:27017: timed out
      2022-03-25 14:42:16,835 WARNING event_loggers Heartbeat to server ('localhost', 27018) failed with error localhost:27018: timed out
      2022-03-25 14:42:16,836 INFO event_loggers Heartbeat sent to server ('localhost', 27017)
      2022-03-25 14:42:16,836 INFO event_loggers Heartbeat sent to server ('localhost', 27018)
      2022-03-25 14:42:16,836 WARNING event_loggers Heartbeat to server ('localhost', 27019) failed with error localhost:27019: timed out
      2022-03-25 14:42:16,837 INFO event_loggers Heartbeat sent to server ('localhost', 27019)
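
The same pause can be driven from a script instead of the shell. The following is a minimal, standard-library-only sketch (POSIX only) and is not the repro script referenced above. The helper names `pause_and_resume` and `largest_gap_while_paused` are hypothetical, and the child process is a stand-in that just prints clock readings where a real repro would run a client with heartbeat logging.

```python
import os
import signal
import subprocess
import sys
import time


def pause_and_resume(pid: int, pause_seconds: float) -> None:
    """Freeze a process with SIGSTOP, wait, then thaw it with SIGCONT."""
    os.kill(pid, signal.SIGSTOP)
    try:
        time.sleep(pause_seconds)
    finally:
        # Always resume the process, even if the sleep is interrupted.
        os.kill(pid, signal.SIGCONT)


def largest_gap_while_paused(pause_seconds: float = 0.5) -> float:
    """Pause a ticking child process and return the largest gap between ticks."""
    # Stand-in workload: print a monotonic clock reading every 50 ms.
    child_code = (
        "import time\n"
        "for _ in range(20):\n"
        "    print(time.monotonic(), flush=True)\n"
        "    time.sleep(0.05)\n"
    )
    with subprocess.Popen(
        [sys.executable, "-c", child_code],
        stdout=subprocess.PIPE,
        text=True,
    ) as proc:
        # Wait for the first tick so the child is known to be running.
        first = float(proc.stdout.readline())
        pause_and_resume(proc.pid, pause_seconds)
        rest = [float(line) for line in proc.stdout.read().split()]
    ticks = [first] + rest
    return max(later - earlier for earlier, later in zip(ticks, ticks[1:]))


if __name__ == "__main__":
    print(f"largest gap between ticks: {largest_gap_while_paused(0.5):.2f}s")
```

From the child's point of view the clock jumps forward by at least the pause length between two consecutive ticks. A heartbeat deadline set before the pause has therefore already passed by the time the process resumes.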
      

      We can mitigate this issue by performing one non-blocking check to see if the socket is readable after a timeout:

      2022-03-29 15:24:52,344 INFO event_loggers Heartbeat sent to server ('localhost', 27017)
      [1]  + 30988 suspended (signal)  python3.10 repro-DRIVERS-2246.py
      $ sleep 30 && kill -SIGCONT 30988
      2022-03-29 15:25:37,944 INFO event_loggers Heartbeat to server ('localhost', 27017) succeeded with reply {'topologyVersion': ...
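
The mitigation can be illustrated with a short sketch. It assumes a plain blocking socket with a timeout set. The function name `recv_with_late_reply_check` is hypothetical, and this is not the driver's actual implementation. After a read times out, it polls the socket once with a zero timeout and only treats the timeout as real if no data is waiting.

```python
import select
import socket


def recv_with_late_reply_check(sock, nbytes: int) -> bytes:
    """Read from sock; after a timeout, check once whether a reply is already waiting."""
    try:
        return sock.recv(nbytes)
    except socket.timeout:
        # A zero timeout makes select() return immediately, so this check
        # never blocks. If the process was paused, the reply may already be
        # sitting in the socket buffer even though the deadline has passed.
        readable, _, _ = select.select([sock], [], [], 0)
        if not readable:
            raise  # Nothing arrived: treat it as a genuine timeout.
        return sock.recv(nbytes)
```

The check is performed only once, so a server that really is unresponsive still fails with the original timeout error and without additional waiting.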
      

            Assignee:
            shane.harvey@mongodb.com Shane Harvey
            Reporter:
            shane.harvey@mongodb.com Shane Harvey