  1. Drivers
  2. DRIVERS-2246

Heartbeat build up with streaming protocol when driver process is stopped (FAAS)



    • Type: Bug
    • Status: Blocked
    • Priority: Major - P3
    • Resolution: Unresolved
    • Component: SDAM
    • Needed



      The SDAM Monitoring spec defines the streamable hello protocol as a way of having the server send a hello response as soon as there is a change, or once maxAwaitTimeMS elapses without one. In FAAS (functions as a service) environments, process execution is frozen between invocations, so the driver cannot consume the hello responses that the server keeps sending every maxAwaitTimeMS. When the FAAS environment wakes up, the driver must process every heartbeat that is waiting on the socket to be read. This causes performance delays compared to typical environments.

      An associated bug that the Node.js driver encountered is that the FAAS environment allows timers to continue running until expiration between invocations while it keeps execution frozen. Once the FAAS environment woke up, the Node.js driver processed socket timeout errors before reading from the socket. This order of operations is inherent to the Node.js environment, where timers always come first in the event loop. It may also indicate an additional issue with streaming in any environment where the time at which a timeout fires cannot be considered reliable.

      We were able to solve this issue in Node.js by ensuring that timeout errors are handled only after the runtime has been allowed to read from the socket. If the read succeeds, the erroneous timeout error is cleared; otherwise the timeout error is handled as normal. This required ordering could be worth encoding in the spec as part of fixing this related issue.
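
      The ordering described above can be sketched as a small state model. This is an illustration only, not the Node.js driver's actual implementation: `HeartbeatWait`, `noteTimerExpired`, `noteData`, and `settle` are hypothetical names. The timer callback merely records that the deadline passed, and the decision is made in `settle`, which is assumed to run only after the runtime has had a chance to read from the socket (in Node.js, for example, from a callback scheduled for a later event loop phase).

```typescript
// Hypothetical model of one awaited heartbeat; none of these names come from a driver.
type Outcome =
  | { kind: "response"; payload: string }
  | { kind: "timeout" }
  | { kind: "pending" };

class HeartbeatWait {
  private timerExpired = false;
  private latest: string | undefined;

  // Timer callback: only record that the deadline passed, do not fail yet.
  noteTimerExpired(): void {
    this.timerExpired = true;
  }

  // Socket data callback: keep the newest response read from the socket.
  noteData(payload: string): void {
    this.latest = payload;
  }

  // Assumed to run only after the runtime has had a chance to read from the socket.
  settle(): Outcome {
    if (this.latest !== undefined) {
      const payload = this.latest;
      this.latest = undefined;
      this.timerExpired = false; // a successful read clears the stale timeout
      return { kind: "response", payload };
    }
    if (this.timerExpired) {
      this.timerExpired = false;
      return { kind: "timeout" }; // nothing was readable: handle the timeout as normal
    }
    return { kind: "pending" };
  }
}

// After a freeze, the timer callback runs before the socket is read.
const wait = new HeartbeatWait();
wait.noteTimerExpired();
wait.noteData("hello #1");
const afterThaw = wait.settle(); // kind is "response": the timeout was cleared

// With no data on the socket, the timeout stands.
const idle = new HeartbeatWait();
idle.noteTimerExpired();
const noData = idle.settle(); // kind is "timeout"
```

      Deferring the decision keeps the timer callback cheap and makes the read-before-timeout ordering explicit, independent of the order in which the runtime happens to deliver callbacks.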


      Who is the affected end user?

      FAAS users.

      How does this affect the end user?

      Performance concerns, or out of date TopologyDescription.

      How likely is it that this problem or use case will occur?

      Main path. The bug is not a blocker, but it will occur consistently on every invocation. The wider the gap between invocations relative to the heartbeatFrequencyMS setting, the larger the number of heartbeats that need to be processed.

      If the problem does occur, what are the consequences and how severe are they?

      FAAS environments usually charge per execution, factoring in CPU time and memory usage. Heartbeats that pile up on the socket between invocations add to both.
      The issue is mitigated by the limits imposed by TCP flow control (eventually the send and receive buffers fill up), but it can still result in thousands of hello responses needing to be processed.
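
      A rough way to size the backlog on one monitoring connection is to take the number of heartbeats sent during the freeze and cap it by what fits in the TCP buffers. The sketch below is a back-of-the-envelope estimate, not a measurement: `estimateQueuedHeartbeats` is a hypothetical helper, and the buffer capacity (4 MiB) and hello response size (1 KiB) are assumed figures chosen only for illustration. The 10000 ms value is the default heartbeatFrequencyMS.

```typescript
// Hypothetical inputs; bufferBytes and helloResponseBytes are assumptions, not measured values.
interface BacklogInputs {
  frozenMs: number;             // time the process spent frozen between invocations
  heartbeatFrequencyMS: number; // interval at which the server streams hello responses
  bufferBytes: number;          // assumed combined TCP send + receive buffer capacity
  helloResponseBytes: number;   // assumed size of one hello response on the wire
}

function estimateQueuedHeartbeats(i: BacklogInputs): number {
  if (i.heartbeatFrequencyMS <= 0 || i.helloResponseBytes <= 0) {
    throw new RangeError("heartbeatFrequencyMS and helloResponseBytes must be positive");
  }
  // Heartbeats the server would have sent while the process was frozen.
  const sentWhileFrozen = Math.floor(i.frozenMs / i.heartbeatFrequencyMS);
  // TCP flow control stops the sender once the buffers are full.
  const fitInBuffers = Math.floor(i.bufferBytes / i.helloResponseBytes);
  return Math.min(sentWhileFrozen, fitInBuffers);
}

const assumed = { heartbeatFrequencyMS: 10_000, bufferBytes: 4 * 1024 * 1024, helloResponseBytes: 1024 };

// Frozen for 10 minutes: limited by elapsed time.
const tenMinutes = estimateQueuedHeartbeats({ ...assumed, frozenMs: 10 * 60 * 1000 }); // 60

// Frozen for 24 hours: limited by the assumed buffer capacity.
const oneDay = estimateQueuedHeartbeats({ ...assumed, frozenMs: 24 * 60 * 60 * 1000 }); // 4096
```

      The estimate applies per monitored server, so a topology with several members multiplies the backlog accordingly.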

      Is this issue urgent?

      I think investigating a solution has "Major" (from JIRA) priority. There have been some proposals to add a knob that forces the driver into polling mode, but that comes with its own downsides (an out of date TopologyDescription). The priority of implementing the agreed-upon solution can be decided on a per driver basis.

      Is this ticket required by a downstream team?


      Is this ticket only for tests?






              Assignee: Unassigned
              Reporter: Neal Beeken (neal.beeken@mongodb.com)