[JAVA-3457] Gracefully handle mongos nodes exiting via mongodb+srv:// Created: 10/Oct/19 Updated: 27/Oct/23 Resolved: 27/Nov/19 |
|
| Status: | Closed |
| Project: | Java Driver |
| Component/s: | Cluster Management |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | New Feature | Priority: | Major - P3 |
| Reporter: | Ben Picolo | Assignee: | Jeffrey Yemin |
| Resolution: | Works as Designed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Description |
|
We recently set up a shared cluster of mongos servers in Kubernetes via the fairly new mongodb+srv record support (https://www.mongodb.com/blog/post/mongodb-3-6-here-to-SRV-you-with-easier-replica-set-connections). In Kubernetes, when nodes enter a terminating state, they are removed from the SRV record and their DNS names no longer resolve. In some cases (depending on configuration), they may still be available to handle connections for some amount of time, until the pod has fully terminated. The MongoDB Java driver currently rescans SRV records every 60 seconds, which is hardcoded.
When a mongos pod enters termination, that leaves an up-to-60-second gap during which, to my understanding, we can hit issues in the Java driver through the following path.
There seem to be a couple of reasonable ways to improve this. When using mongodb+srv, for example, hosts that have experienced DNS failures could be blacklisted until the next SRV refresh, though other options seem viable as well.
We'd be happy to contribute a patch here if there's an agreed upon handling strategy for us to pursue.
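For reference, here's a minimal sketch of how an application connects through the SRV record with the Java driver (the hostname and database below are placeholders):

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;

public class SrvConnectSketch {
    public static void main(String[] args) {
        // mongodb+srv:// tells the driver to resolve the mongos hosts from the
        // _mongodb._tcp SRV record and then to rescan that record periodically.
        try (MongoClient client =
                     MongoClients.create("mongodb+srv://cluster.example.internal/test")) {
            client.getDatabase("test").listCollectionNames()
                    .forEach(System.out::println);
        }
    }
}
```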
|
| Comments |
| Comment by Jeffrey Yemin [ 27/Nov/19 ] |
|
Closing this out as I believe we've answered all the open questions and demonstrated how to orchestrate a service such that there are no visible application effects. If you have further questions, please post them and we can re-open. |
| Comment by Jeffrey Yemin [ 17/Oct/19 ] |
|
| Comment by Ben Picolo [ 17/Oct/19 ] |
|
Which timeouts and server monitor frequencies are adjustable to help out here?
The second part you mention may be the missing piece of the puzzle here, but we'll have to figure out whether there's a strategy for disallowing new connections efficiently. I'll look into that path, and I appreciate the response on this.
Unfortunately, I don't believe we get tailored control over the timings for SRV records in Kubernetes (that's a path we were looking into as well). |
| Comment by Jeffrey Yemin [ 17/Oct/19 ] |
|
bpicolo@squarespace.com, the driver does handle application shutdown. Though there is a window during which one or more application threads may get exceptions, the window is fairly short, and can be controlled by the client through the setting of various timeouts and server monitor frequencies. The problem you seem to be having is due to the host being removed from DNS entirely prior to shutting the mongos process down. I can think of a few things you could do to improve your situation:
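On the driver-configuration side, a rough sketch of tuning those timeouts and monitor frequencies through MongoClientSettings might look like the following; the values are hypothetical and would need to match how long a terminating mongos keeps accepting connections in your environment:

```java
import java.util.concurrent.TimeUnit;

import com.mongodb.ConnectionString;
import com.mongodb.MongoClientSettings;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;

public class TunedClientSketch {
    public static void main(String[] args) {
        MongoClientSettings settings = MongoClientSettings.builder()
                .applyConnectionString(
                        new ConnectionString("mongodb+srv://cluster.example.internal/test"))
                // Server monitor frequency: how often each known host is re-checked.
                .applyToServerSettings(b -> b
                        .heartbeatFrequency(5, TimeUnit.SECONDS)
                        .minHeartbeatFrequency(500, TimeUnit.MILLISECONDS))
                // How long an operation waits for a usable server before failing.
                .applyToClusterSettings(b -> b
                        .serverSelectionTimeout(10, TimeUnit.SECONDS))
                // Bound how long a socket to a dying mongos can hang.
                .applyToSocketSettings(b -> b
                        .connectTimeout(5, TimeUnit.SECONDS)
                        .readTimeout(10, TimeUnit.SECONDS))
                .build();

        try (MongoClient client = MongoClients.create(settings)) {
            // Application work goes here; a faster heartbeat shortens the window
            // during which a removed mongos is still considered selectable.
        }
    }
}
```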
|
| Comment by Louis Plissonneau (Inactive) [ 15/Oct/19 ] |
|
andrey.belik if you manually kill/remove the pod, it will spin up a new one almost immediately. When mongos crashes on the pod, the automation agent will try to restart the mongos process. The liveness (every 30 seconds) and readiness (every 5 seconds) probes will detect the loss, but they have a failure threshold (to prevent over-reacting), so it will take a minimum of 3 minutes for Kubernetes to react (we need 6 liveness failures in a row, and it's longer for the readiness probe).
Thinking about this, it's about time we revisited the liveness probe |
| Comment by Ben Picolo [ 14/Oct/19 ] |
|
@Andrey - worth clarifying, the driver currently handles neither case, as far as I can tell (clean or unclean application shutdown). |
| Comment by Andrey Belik (Inactive) [ 14/Oct/19 ] |
|
louis.plissonneau please confirm if I am correct here. All mongos processes are fronted with a Service that exposes SRV records. When a mongos is terminated, the K8S controller updates DNS pretty much immediately (though it is an eventual consistency model). When a mongos crashes, it will be detected by K8S, which could take longer (a few seconds), and then it will be taken out of DNS and a new one provisioned.
|
| Comment by Ben Picolo [ 10/Oct/19 ] |
|
I'll check whether that would be a factor for us - I'm not sure what sort of SLA we have in place. Let me consult some folk in my organization. |
| Comment by Jeffrey Yemin [ 10/Oct/19 ] |
|
No problem with opening a ticket directly here, but just be advised that there is no SLA in place when you do it this way.
|
| Comment by Ben Picolo [ 10/Oct/19 ] |
|
I am not - we thought that this board may be the best first point of discussion, but happy to redirect wherever would be best. |
| Comment by Jeffrey Yemin [ 10/Oct/19 ] |
|
It was not a bot. Changed it back to what you intended. Are you in contact with our technical support organization on this already by any chance? |
| Comment by Ben Picolo [ 10/Oct/19 ] |
|
@jeff.yemin - I see you or a bot version of you tweaked some wording for me (thanks!). Want to note that "shared" was intentional, though. The sharding isn't new in this case; the shared MongoS fleet is. |
| Comment by Ben Picolo [ 10/Oct/19 ] |
|
I don't appear to have permissions to edit my ticket, but here's the link I had intended for the DefaultSrvRecordMonitorFactory:
Also worth mentioning - we're currently using the latest 3.x driver. |
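For anyone digging into the monitoring side, here is a rough standalone sketch of the kind of SRV lookup such a periodic monitor performs, done with plain JNDI; the query name is a placeholder and the driver's internal classes may differ:

```java
import java.util.ArrayList;
import java.util.Hashtable;
import java.util.List;
import javax.naming.Context;
import javax.naming.NamingEnumeration;
import javax.naming.directory.Attribute;
import javax.naming.directory.InitialDirContext;

public class SrvLookupSketch {
    public static void main(String[] args) throws Exception {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.dns.DnsContextFactory");
        env.put(Context.PROVIDER_URL, "dns:");
        InitialDirContext ctx = new InitialDirContext(env);

        // Query the SRV record that mongodb+srv:// connection strings are based on.
        Attribute srv = ctx.getAttributes("_mongodb._tcp.cluster.example.internal",
                new String[]{"SRV"}).get("SRV");

        List<String> hosts = new ArrayList<>();
        for (NamingEnumeration<?> records = srv.getAll(); records.hasMore(); ) {
            // Each record looks like: "<priority> <weight> <port> <target>."
            String[] parts = records.next().toString().split(" ");
            hosts.add(parts[3].replaceAll("\\.$", "") + ":" + parts[2]);
        }
        ctx.close();

        // Hosts currently advertised by the SRV record; a terminating pod drops out
        // of this list, but the driver only repeats a lookup like this every 60s.
        System.out.println(hosts);
    }
}
```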