[SERVER-6760] Gracefully handle shard closing connections (primary stepdown) Created: 10/Aug/12 Updated: 06/Dec/22 Resolved: 21/Mar/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Gopi | Assignee: | [DO NOT USE] Backlog - Sharding Team |
| Resolution: | Done | Votes: | 3 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Ubuntu 10.04.3 LTS |
||
| Assigned Teams: |
Sharding
|
| Participants: |
| Description |
|
cluster of 3 (or 4 or 5) nodes |
| Comments |
| Comment by Gregory McKeon (Inactive) [ 21/Mar/18 ] |
|
Automatic retry logic for queries was added in 3.2. |
| Comment by Vinay Gupta [ 14/Aug/12 ] |
|
We are using MongoDB v2.0.2. |
| Comment by Jeffrey Yemin [ 14/Aug/12 ] |
|
Vinay, what version of mongos are you using? |
| Comment by Jeffrey Yemin [ 14/Aug/12 ] |
|
Moved this over because the issue is whether mongos is effectively handling retries of queries during a primary stepdown. |
| Comment by Vinay Gupta [ 13/Aug/12 ] |
|
Yes, please go ahead and move the ticket to the appropriate bucket. Thanks |
| Comment by Jeffrey Yemin [ 13/Aug/12 ] |
|
Vinay, you should open a ticket in the SERVER project, since mongos is in the best position to gracefully handle an rs.stepDown() if you're using a sharded cluster. Or if you prefer, I can move this ticket to the SERVER project. |
| Comment by Vinay Gupta [ 13/Aug/12 ] |
|
Looks like we're using the 2.7.1 Java driver. |
| Comment by Jeffrey Yemin [ 13/Aug/12 ] |
|
The driver does not retry writes. For reads, it will retry twice, which is not configurable. It only retries if connected in replica set mode and if the exception is not a SocketTimeoutException. The idea is that if the query is just slow, retrying is unlikely to help. There is also no delay between retries. For this particular stack trace, no retry was done, because the error is being reported by mongos, which is unable to communicate with one of the shards. So this particular case is more of an issue that needs to be addressed in mongos. |
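The retry rules described in this comment can be summarized as a small decision function. This is only an illustrative sketch of the stated behavior, not the driver's actual code: the class `ReadRetryRule`, the method `shouldRetry`, and its parameters are hypothetical names invented for this example.

```java
import java.net.SocketTimeoutException;

// Hypothetical helper that mirrors the rules stated in the comment above.
// It is NOT part of the Java driver's API.
final class ReadRetryRule {
    // "For reads, it will retry twice, which is not configurable."
    static final int MAX_READ_RETRIES = 2;

    private ReadRetryRule() {}

    static boolean shouldRetry(boolean isRead, boolean replicaSetMode,
                               Exception error, int retriesSoFar) {
        if (!isRead) {
            return false; // writes are never retried
        }
        if (!replicaSetMode) {
            return false; // retries happen only in replica set mode
        }
        if (error instanceof SocketTimeoutException) {
            return false; // a slow query is unlikely to benefit from a retry
        }
        // No delay is applied between retries; only the count is bounded.
        return retriesSoFar < MAX_READ_RETRIES;
    }
}
```

Under these assumptions, a read that fails with a connection error in replica set mode is retried up to two times, while a write, or any operation that fails with a SocketTimeoutException, is not retried.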
| Comment by Vinay Gupta [ 13/Aug/12 ] |
|
Jeff, the concern is for both reads and writes. Here is a sample stack trace. (I changed the primary node address to primary-rs0-0.xcom.net just to make it obvious.) Meanwhile can you tell us if Thx ------------ |
| Comment by Jeffrey Yemin [ 11/Aug/12 ] |
|
Ah, I see what you're getting at now. It's a bit tricky since a replica set can be without a primary for an arbitrarily long period of time, and the driver generally tends towards fail-fast behavior. Couple of questions. Is your concern more for reads, or for writes, or both? What version of the driver are you using? Can you provide a sample stack trace for the exception that you're getting, as there are a number of code paths where this can come up? Thanks, |
| Comment by Igor Bolotin [ 11/Aug/12 ] |
|
The driver does reconnect, but only after failing a few requests that were issued during the re-election period. It would be great if, instead of throwing these exceptions to the application to handle, the driver recognized recoverable failures and retried the requests automatically. This behavior can be made optional so that it does not break existing applications. It also makes sense to make the number of retries and the delay between attempts configurable. |
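Until something like this exists in the driver or in mongos, an application can approximate the requested behavior with its own wrapper. The following is a minimal sketch under stated assumptions: `RetryingCaller` and its parameters are hypothetical names, and the `recoverable` predicate is supplied by the caller, because which exceptions count as recoverable depends on the driver version in use.

```java
import java.util.concurrent.Callable;
import java.util.function.Predicate;

// Hypothetical application-side wrapper; not a driver or mongos feature.
final class RetryingCaller {
    private final int maxRetries;                  // retries after the first attempt
    private final long delayMillis;                // fixed pause between attempts
    private final Predicate<Exception> recoverable; // caller decides what is retryable

    RetryingCaller(int maxRetries, long delayMillis, Predicate<Exception> recoverable) {
        if (maxRetries < 0 || delayMillis < 0) {
            throw new IllegalArgumentException("maxRetries and delayMillis must be >= 0");
        }
        this.maxRetries = maxRetries;
        this.delayMillis = delayMillis;
        this.recoverable = recoverable;
    }

    <T> T call(Callable<T> operation) throws Exception {
        int retries = 0;
        while (true) {
            try {
                return operation.call();
            } catch (Exception e) {
                // Give up on non-recoverable errors or once the budget is spent.
                if (!recoverable.test(e) || retries >= maxRetries) {
                    throw e;
                }
                retries++;
                Thread.sleep(delayMillis); // wait for a new primary to be elected
            }
        }
    }
}
```

A caller might construct it as `new RetryingCaller(3, 500, e -> e instanceof java.io.IOException)` and wrap each read in `call(...)`. Wrapping writes this way is only safe when the write is idempotent, since a failed attempt may already have been applied before the connection dropped.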
| Comment by Jeffrey Yemin [ 10/Aug/12 ] |
|
The Java driver will retry on a stepdown of the primary if you connect in replica set mode (use the Mongo constructor that takes a list of ServerAddress instances). Is that how you are connecting? |