[JAVA-4974] try another srv-record on MongoSocketException Created: 24/May/23 Updated: 27/Oct/23 Resolved: 31/May/23 |
|
| Status: | Closed |
| Project: | Java Driver |
| Component/s: | Retryability |
| Affects Version/s: | 4.0.5 |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Unknown |
| Reporter: | Nikita Sokolov | Assignee: | Valentin Kavalenka |
| Resolution: | Works as Designed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
|
| Documentation Changes Summary: | 1. What would you like to communicate to the user about this feature? |
| Description |
|
When opening a connection fails, com.mongodb.internal.operation.OperationHelper#withReadConnectionSource could try another server, and keep doing so until there are none left, so that users would not get a MongoSocketException when mongo/mongos instances are moved.
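The requested behaviour can be illustrated with a minimal, driver-independent sketch. It is not the driver's code and does not use any driver API: the class TryEachServerSketch, the Attempt interface, and the host names are all hypothetical, and the sketch only shows the idea of moving on to the next candidate host when an attempt fails with an I/O error such as UnknownHostException.

```java
import java.io.IOException;
import java.net.UnknownHostException;
import java.util.List;

// Hypothetical illustration of "try another server until there are none left".
// None of these names exist in the MongoDB Java driver.
public final class TryEachServerSketch {

    /** One attempt against a single host; it may fail with an I/O error. */
    @FunctionalInterface
    public interface Attempt<T> {
        T run(String host) throws IOException;
    }

    /**
     * Runs the attempt against each host in order and returns the first result.
     * If every host fails, throws an exception that carries each failure as suppressed.
     */
    public static <T> T firstSuccessful(List<String> hosts, Attempt<T> attempt) {
        IllegalStateException failure =
                new IllegalStateException("no candidate host could be reached: " + hosts);
        for (String host : hosts) {
            try {
                return attempt.run(host);
            } catch (IOException e) {
                // Remember why this host failed, then move on to the next one.
                failure.addSuppressed(e);
            }
        }
        throw failure;
    }

    public static void main(String[] args) {
        // Assumed host names: the first one simulates a host that has been moved.
        List<String> hosts = List.of("gone.example", "alive.example");
        String used = firstSuccessful(hosts, host -> {
            if (host.startsWith("gone")) {
                throw new UnknownHostException(host);
            }
            return host;
        });
        System.out.println(used); // prints alive.example
    }
}
```

Collecting the per-host failures as suppressed exceptions keeps every cause visible when no host can be reached.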
|
| Comments |
| Comment by Valentin Kavalenka [ 12/Jul/23 ] |
|
Roughly speaking, yes. More precisely, if the driver retries an operation, then it goes through selecting a server for the operation again. If the driver does not consider a server available, it will not select it for an operation. Note that https://jira.mongodb.org/ is intended to be a place for reporting bugs and requesting features. If you have more questions, consider using https://www.mongodb.com/community/forums/. |
| Comment by Nikita Sokolov [ 12/Jul/23 ] |
|
But after failing to look up 10-232-86-5.backend-mongos-production-user-critical.backend-mongos.svc.hoffman.local, one of the hosts to which the mongodb+srv:// address was resolved, another host would be tried, right? |
| Comment by Valentin Kavalenka [ 11/Jul/23 ] |
No, the driver does not combine the exception from the second failed attempt with the exception from the first failed attempt. However, I am not sure I correctly interpret what you mean here. The MongoConnectionPoolClearedException you posted means that there was a thread that caught a MongoSocketException (caused by UnknownHostException) when it was looking up 10-232-86-5.backend-mongos-production-user-critical.backend-mongos.svc.hoffman.local. As a result, that thread marked the corresponding server in the driver's view of the cluster as not available, and paused its connection pool. Concurrently, a different thread was able to select that same server for an operation, but when it tried to check out a connection from the server's pool, the pool reported that it was paused (hence the MongoConnectionPoolClearedException), and that the cause of it being paused was the MongoSocketException.
If you suspect the operations are not retried, you can
1 "Retrying ..." messages may be duplicated, that's a known fixed bug. |
| Comment by Nikita Sokolov [ 11/Jul/23 ] |
|
Wouldn't there be two different servers in the exception if that were the case? Also, this would make the exceptions rare, but on our side they are as frequent as they were before. |
| Comment by Valentin Kavalenka [ 11/Jul/23 ] |
|
Hi faucct@joom.com, |
| Comment by Nikita Sokolov [ 10/Jul/23 ] |
|
We have updated the mongo driver to 4.7.2 and are still getting the exception:
|
| Comment by Valentin Kavalenka [ 31/May/23 ] |
|
faucct@joom.com I noticed there was a bug https://jira.mongodb.org/browse/JAVA-4684 in the retry logic introduced in 4.4, which was fixed only in 4.7. Therefore, if you decide to upgrade, and you can't upgrade to the latest version, then at the very least upgrade to 4.7.2. |
| Comment by Valentin Kavalenka [ 31/May/23 ] |
As of 4.4 (see https://jira.mongodb.org/browse/JAVA-4034), retryable reads and writes do re-select the server during the second attempt. However, since the driver currently retries only once, it is still possible for it to select another server that is gone as a result of the downscaling you mentioned, if the driver has not yet detected that the server is gone. If that happens, you may still observe the error you reported even if you upgrade the driver to 4.4.2 (given how big this jump is for you, you might as well try upgrading to the latest version, which is 4.9.1, provided that it is compatible with your MongoDB server version).
As an alternative to upgrading, or in addition to it, you may try removing the DNS SRV records that point to hosts that the downscaling will remove, before you do the downscaling. If the time interval between removing those SRV records and the downscaling is larger than 60 seconds (the driver's SRV records monitoring interval) + the SRV record TTL, then the driver should see the corresponding seed hosts gone and remove them from its cluster view, which should eliminate the errors you reported. However, if the driver discovered some of the hosts removed by downscaling by means other than the SRV records, then it may still select them for an operation and fail with the error you reported until it realizes that they are no longer available / part of the cluster. |
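The timing rule above (wait longer than 60 seconds plus the SRV record TTL between removing the SRV records and downscaling) can be written down as a small calculation. This is only a sketch of the arithmetic: the class SrvRemovalDelaySketch and its method names are hypothetical, the 30-second TTL is an assumed example value, and the 60-second figure is taken from the comment above rather than read from the driver.

```java
import java.time.Duration;

// Hypothetical helper for a deployment script; not part of the MongoDB Java driver.
public final class SrvRemovalDelaySketch {

    // The driver's SRV records monitoring interval, as stated in the comment above.
    static final Duration SRV_MONITORING_INTERVAL = Duration.ofSeconds(60);

    /** The wait must be strictly larger than this value: monitoring interval + SRV record TTL. */
    public static Duration minimumWaitBeforeDownscaling(Duration srvRecordTtl) {
        if (srvRecordTtl.isNegative()) {
            throw new IllegalArgumentException("TTL must not be negative: " + srvRecordTtl);
        }
        return SRV_MONITORING_INTERVAL.plus(srvRecordTtl);
    }

    /** True if enough time has passed since the SRV records were removed. */
    public static boolean isLongEnough(Duration elapsedSinceSrvRemoval, Duration srvRecordTtl) {
        return elapsedSinceSrvRemoval.compareTo(minimumWaitBeforeDownscaling(srvRecordTtl)) > 0;
    }

    public static void main(String[] args) {
        Duration ttl = Duration.ofSeconds(30); // assumed example TTL
        System.out.println(minimumWaitBeforeDownscaling(ttl));      // prints PT1M30S
        System.out.println(isLongEnough(Duration.ofSeconds(90), ttl));  // prints false
        System.out.println(isLongEnough(Duration.ofSeconds(120), ttl)); // prints true
    }
}
```

As the comment notes, passing this check only covers hosts that the driver learned about from the SRV records; hosts discovered in other ways may still be selected until the driver notices that they are gone.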
| Comment by Nikita Sokolov [ 24/May/23 ] |
|
Yes, we are using the mongodb+srv scheme. |
| Comment by Jeffrey Yemin [ 24/May/23 ] |
|
Thanks. And just to confirm: you're using a mongodb+srv connection string? Also, is this a blocker for you? Do you expect DNS failures like this in a production environment? If so, can you elaborate on the scenario? Why would a subset of mongos servers not be available in a DNS lookup like this? |
| Comment by Nikita Sokolov [ 24/May/23 ] |
|
I have added the UnknownHostException. The rest is spring-data further up the stack. |
| Comment by Jeffrey Yemin [ 24/May/23 ] |
|
There should be a nested IOException available. Can you post that as well? |
| Comment by Nikita Sokolov [ 24/May/23 ] |
|
| Comment by Nikita Sokolov [ 24/May/23 ] |
|
The driver version is 4.0.5. |
| Comment by PM Bot [ 24/May/23 ] |
|
Hi faucct@joom.com, thank you for reporting this issue! The team will look into it and get back to you soon. |