[GODRIVER-1404] Improve performance of SelectServer Created: 07/Nov/19  Updated: 28/Oct/23  Resolved: 15/Nov/19

Status: Closed
Project: Go Driver
Component/s: Server Selection
Affects Version/s: None
Fix Version/s: 1.2.0

Type: Improvement Priority: Major - P3
Reporter: David Bartley Assignee: Divjot Arora (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File cpu.pprof     File server_selection.go    
Backwards Compatibility: Fully Compatible

 Description   

We're seeing a large fraction of time being spent in server selection; around 30% of the time taken by "driver.Operation.Execute" is being spent in "driver.Operation.selectServer", vs 45% for "driver.Operation.roundTrip", which obviously includes network roundtrip(s).

A large fraction of "driver.Operation.selectServer" seems to be around "Subscribe", and it seems that could easily be avoidable in an initial fast path, where you'd expect a server to always be available. Related to this, we're also seeing server selection timeouts with a 20ms server selection timeout – it appears that the subscription logic is sometimes slow enough that we're hitting that timeout even if server selection could otherwise be satisfied.



 Comments   
Comment by Divjot Arora (Inactive) [ 28/Feb/20 ]

That makes sense to me. I've filed GODRIVER-1499 for this.

Comment by David Bartley [ 27/Feb/20 ]

Sorry, tbc, I was suggesting removing both the ctx.Done and timeoutChan checks from selectServerFromDescription.

I guess it depends on whether you consider the fast-path server selection to be something that could actually be considered blocking? It seems like you'd almost always follow server selection with an actual command, which would check for cancellation before hitting the network?

Comment by Divjot Arora (Inactive) [ 27/Feb/20 ]

bartle Thank you for your feedback on this change. I don't think removing the timeoutChan check is sufficient, as a user could also provide a timeout through the Context sent to the function. It would be strange if a timeout provided through SetServerSelectionTimeout behaved differently than one in the Context.

I did consider the possibility of adding a fast path that did no timeout checks so we could ensure that one server selection check is always done, but this brings up the weird case of a user running an operation with an already-cancelled context. In this case, we should fail fast and return an error at first possibly blocking point, which would be server selection. Thoughts?

Comment by David Bartley [ 26/Feb/20 ]

We recently upgraded to 1.3.0, so I didn't get a chance to test this before, but I'm curious if we should actually be checking selectionState.timeoutChan in selectServerFromDescription? If you were to set ServerSelectionTimeout to 1ms, if you wanted fail-fast behaviour, it seems there's some risk that you could still fail server selection even if there were a primary available?

It also seems redundant, given that selectServerFromSubscription does this check as well.

Comment by Githook User [ 15/Nov/19 ]

Author:

{'name': 'Divjot Arora', 'username': 'divjotarora', 'email': 'divjot.arora@10gen.com'}

Message: GODRIVER-1404 Remove topology subscription initial server selection (#221)
Branch: master
https://github.com/mongodb/mongo-go-driver/commit/08f9f9e42b104c4ee367feb1f2bfff1def66eb54

Comment by Divjot Arora (Inactive) [ 12/Nov/19 ]

PR: https://github.com/mongodb/mongo-go-driver/pull/221

Comment by David Bartley [ 09/Nov/19 ]

Yeah, that looks pretty similar!

Comment by Divjot Arora (Inactive) [ 08/Nov/19 ]

I can consistently repro by doing 10000 inserts with 10000 goroutines on a 3 node replicaset. I've attached my repro script and the pprof output I'm seeing. Note that Subscribe doesn't consistently show up in the pprof output across runs. I've attached my repro script and a sample pprof output from one of the runs that does have Subscribe in it. Can you confirm that this is similar to what you're seeing?

server_selection.go

cpu.pprof

Comment by David Bartley [ 08/Nov/19 ]

No, failover isn't necessary, we were seeing these timeouts steady-state across all of our 5 node replsets. I wonder if modifying your test to issue inserts (or even queries) in parallel might be more realistic?

Comment by Divjot Arora (Inactive) [ 08/Nov/19 ]

bartle Thanks for the info. I do think it's worth figuring out if Subscribe is the issue and why it's taking so long if it is. I tried to repro this by setting up a Client with a 20ms server selection timeout and running 10,000 serial inserts against the same collection. pprof output does show ~20ms spent in SelectServer, but the bulk of the time is spent in timer creation and pthread condition variable singaling code. On one of the runs, it showed that the bulk of the time was spent in runtime.duffcopy. I have not been able to repro a situation where Subscribe is taking a lot of time.

These repros were running against a healthy 3-node replica set that did not fail over during program execution. Is this too isolated of an environment to repro what you're seeing? You mentioned the low server selection timeout was to fail fast during a failover. Is a failover necessary to repro the server selection timeout?

Comment by David Bartley [ 07/Nov/19 ]

We're seeing this in an internal proxy, so I'm not sure I can share much more detail. We're setting ServerSelectionTimeout to 20ms (we basically want to fail fast requests and let clients handle retries, typically during failovers) and are seeing regular "server selection timeout" errors. I added some additional logging and confirmed that we're hitting https://github.com/mongodb/mongo-go-driver/blob/85a8e363c138e7c405976deb9816c72efe54f7b4/x/mongo/driver/topology/topology.go#L402 in the first loop iteration, despite the server selection being satisfiable.

Comment by Divjot Arora (Inactive) [ 07/Nov/19 ]

Hi bartle,

Thank you for the report. Can you provide some more details about the perf issues you're seeing with SelectServer? The Subscribe function does take t.subLock which could be a potential perf issue but otherwise does not do much work (gets current description, updates a map, and makes a channel). Some concrete numbers or another type of profiling output might help us figure out what's causing the issue.

– Divjot

Comment by David Bartley [ 07/Nov/19 ]

I suspect https://github.com/mongodb/mongo-go-driver/blob/e82d777bfb1457703cc5b88d7b8f8bb8cc22a42a/x/mongo/driver/topology/topology.go#L406-L411 is also redundant, since things like "ReadPrefSelector" ultimately use "selectByKind", which obviously filter by e.g. RSPrimary or RSSecondary already.

Generated at Thu Feb 08 08:36:19 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.