[SERVER-83980] Investigate using callback gRPC API Created: 07/Dec/23  Updated: 05/Jan/24  Resolved: 11/Dec/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Erin McNulty Assignee: Erin McNulty
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Backwards Compatibility: Fully Compatible
Sprint: Service Arch 2023-12-11
Participants:

 Description   

When doing initial performance testing in SERVER-80572, we found that gRPC was significantly slower than ASIO, and flamegraphs showed that much of that slowness came from gRPC acquiring a process-wide mutex to synchronize access to the completion queue.

After looking around, we saw on the gRPC performance page that the sync server is not recommended for performance-sensitive applications. gRPC instead provides a callback API, which is supposed to be easier to use than the async API but faster than the sync API.
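For context, here is a minimal sketch of what the server side of the callback API looks like, assuming a hypothetical generated Echo service with a bidirectional-streaming EchoStream method and an EchoMsg message type (MongoDB's actual gRPC service and message types differ):

{code:cpp}
#include <grpcpp/grpcpp.h>

#include "echo.grpc.pb.h"  // hypothetical generated service and messages

// With the callback API, each RPC is driven by a reactor whose
// OnReadDone/OnWriteDone hooks run on gRPC-managed threads; application
// code never polls a shared completion queue directly.
class EchoReactor final : public grpc::ServerBidiReactor<EchoMsg, EchoMsg> {
public:
    EchoReactor() {
        StartRead(&_request);  // kick off the first read
    }

    void OnReadDone(bool ok) override {
        if (!ok) {  // client half-closed the stream
            Finish(grpc::Status::OK);
            return;
        }
        _response = _request;  // echo the message back
        StartWrite(&_response);
    }

    void OnWriteDone(bool ok) override {
        if (!ok) {
            Finish(grpc::Status::CANCELLED);
            return;
        }
        StartRead(&_request);  // wait for the next message
    }

    void OnDone() override {
        delete this;  // the reactor owns itself until the RPC completes
    }

private:
    EchoMsg _request;
    EchoMsg _response;
};

class EchoService final : public Echo::CallbackService {
    grpc::ServerBidiReactor<EchoMsg, EchoMsg>* EchoStream(
        grpc::CallbackServerContext*) override {
        return new EchoReactor();
    }
};
{code}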

Investigate the changes required to implement the callback API instead of the sync API, and run initial performance benchmarks on them.



 Comments   
Comment by Erin McNulty [ 11/Dec/23 ]

Update: it was easy to change the threading model using the method I described above. I was able to build and connect to the server using the callback gRPC API, and run some benchrun trials with it.

TLDR is that we did not see significant improvements for gRPC with the callback API, but I would caution against using this data to justify ignoring the callback API in future investigations, because this was a very rough POC without any optimizations implemented.

The results are summarized here:

num threads   metric                             with sync API   with callback API
1             gRPC Latency / MongoRPC Latency    1.97            2.06
1             gRPC ops/s / MongoRPC ops/s        0.51            0.49
4             gRPC Latency / MongoRPC Latency    3.10            3.71
4             gRPC ops/s / MongoRPC ops/s        0.32            0.27

As seen above, the callback API was not faster in initial performance runs relative to MongoRPC on 1 or 4 threads -- both implementations are around 2x slower on one thread and around 3-4x slower with 4 threads, with the sync API actually winning out by a bit. The more detailed results are here.

The POC branch is here; focus on the changes in the grpc/* folder. I am putting this down in order to focus on completing the correctness work for this project, but I think the next steps for PM-3366 are to:

  • Profile this build, and try to pinpoint what exactly in the session workflow is slowing us down wrt ASIO. I think a good place to start might be putting ScopedTimers throughout the session workflow (see the sketch after this list).
  • Use the driver POCs to profile and evaluate this build -- the egress side of this POC is still using the sync API. Try to isolate the server's ingress performance from its egress performance in order to gain a more complete understanding of where the gRPC slowness lies.
  • Do some research into how other performance-oriented applications implement the gRPC callback API. Building on the callback API with that research could put us in a better position to improve the performance of our gRPC implementation, because the sync implementation is explicitly documented as not providing strong performance, while the callback API is supposed to be performant.
  • Evaluate the threading model in this POC -- right now, I just spin up another thread to run the handleStream() workflow, which essentially mimics the thread-per-client model that MongoRPC implements. I think putting the threading model back in our hands could make investigating performance a lot easier, because at the cost of very little code we would be relying on our own synchronization primitives, which we have better intuition for, instead of gRPC's.
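As a rough illustration of the ScopedTimer idea in the first bullet, here is a minimal RAII timer sketch (a generic stand-in -- MongoDB's actual ScopedTimer has a different interface) that could bracket individual stages of the session workflow:

{code:cpp}
#include <chrono>
#include <iostream>
#include <string>

// Hypothetical RAII timer; MongoDB's real ScopedTimer differs. On
// destruction it logs how long the enclosing scope took.
class ScopedTimerSketch {
public:
    explicit ScopedTimerSketch(std::string label)
        : _label(std::move(label)), _start(std::chrono::steady_clock::now()) {}

    ~ScopedTimerSketch() {
        auto elapsed = std::chrono::duration_cast<std::chrono::microseconds>(
            std::chrono::steady_clock::now() - _start);
        std::cerr << _label << ": " << elapsed.count() << "us\n";
    }

private:
    std::string _label;
    std::chrono::steady_clock::time_point _start;
};

// Usage: bracket each suspected stage of the session workflow, e.g.
//     {
//         ScopedTimerSketch t("sourceMessage");
//         // ... source the message ...
//     }
{code}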
Comment by Erin McNulty [ 07/Dec/23 ]

I was able to make changes so that gRPC was using the callback API, but I think that changing the threading model is not as simple as we thought. I ran into a segfault when connecting to the server, and the backtrace led me to the spot in session_workflow right before we call sourceMessage(). I also saw that right before this call, we switched threads. When I tried out the inline model, just to see what would happen, it locked up the entire server and I had to kill it.

TLDR is that it's easy enough to switch to the callback API in terms of our direct gRPC code, but the threading model might not be so simple. I think my next step might be to give each gRPC stream its own thread at the handleStream() level, and then still use kInline when we enter the session workflow. All of my changes are here (this branch includes all of my perf testing, so only focus on the changes in the grpc/* folder if you are interested in looking).
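For reference, a minimal sketch of the per-stream threading idea described above, under the assumption of a hypothetical per-stream hook (the names GrpcSession, handleStream, and onNewGrpcStream are illustrative, not MongoDB's actual interfaces):

{code:cpp}
#include <memory>
#include <thread>

// Hypothetical stand-ins; the real session type and stream entry point
// in the POC differ.
struct GrpcSession {};
void handleStream(std::shared_ptr<GrpcSession> session);

// Invoked once per accepted gRPC stream. Mimics MongoRPC's
// thread-per-client model: each stream gets a dedicated thread, so the
// session workflow can then run inline (kInline) on that thread instead
// of hopping between gRPC-managed threads.
void onNewGrpcStream(std::shared_ptr<GrpcSession> session) {
    std::thread([session = std::move(session)]() mutable {
        handleStream(std::move(session));  // runs the full session workflow
    }).detach();
}
{code}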
