[SERVER-6760] Gracefully handle shard closing connections (primary stepdown)
Created: 10/Aug/12  Updated: 06/Dec/22  Resolved: 21/Mar/18

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Improvement
Priority: Major - P3
Reporter: Gopi
Assignee: [DO NOT USE] Backlog - Sharding Team
Resolution: Done
Votes: 3
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu 10.04.3 LTS


Assigned Teams:
Sharding
Participants:

 Description   

A cluster of 3 (or 4 or 5) nodes.
Issuing rs.stepdown() on the current primary causes mongos to lose its connection. Instead, the Java driver should retry and reconnect.



 Comments   
Comment by Gregory McKeon (Inactive) [ 21/Mar/18 ]

Automatic retry logic for queries was added in 3.2.

Comment by Vinay Gupta [ 14/Aug/12 ]

We are using mongodb v2.0.2

Comment by Jeffrey Yemin [ 14/Aug/12 ]

Vinay, what version of mongos are you using?

Comment by Jeffrey Yemin [ 14/Aug/12 ]

Moved this over because the issue is whether mongos is effectively handling retries of queries during a primary stepdown.

Comment by Vinay Gupta [ 13/Aug/12 ]

Yes, please go ahead and move the ticket to the appropriate bucket. Thanks.

Comment by Jeffrey Yemin [ 13/Aug/12 ]

Vinay,

You should open a ticket in the SERVER project, since mongos is in the best position to gracefully handle an rs.stepdown() if you're using a sharded cluster. Or if you prefer, I can move this ticket to the SERVER project.

Comment by Vinay Gupta [ 13/Aug/12 ]

Looks like we're using the 2.7.1 Java driver.

Comment by Jeffrey Yemin [ 13/Aug/12 ]

The driver does not retry writes. For reads, it will retry twice, and that count is not configurable. It only retries if connected in replica set mode and if the exception is not a SocketTimeoutException, the idea being that if the query is just slow, retrying will likely not be useful. There is also no delay between retries.

For this particular stack trace, there was no retry done, because the error is being reported by mongos, which is unable to communicate with one of the shards. So this particular case is more of an issue that needs to be addressed in mongos.
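
The retry rules described in this comment can be modeled roughly as follows. This is only a sketch of the stated behavior, not the driver's actual source: the class ReadRetrySketch, the helper names shouldRetry and call, and the ServerReportedException type are all hypothetical. It also assumes that "retry twice" means up to two retries after the first attempt, and that only I/O-level failures seen by the driver itself are candidates for a retry.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;

// Illustrative model only; this is not the Java driver's source code.
class ReadRetrySketch {
    // "It will retry twice", and the count is fixed (assumed: 2 retries after the first try).
    static final int MAX_READ_RETRIES = 2;

    // Hypothetical stand-in for an error that mongos reports back to the client.
    static class ServerReportedException extends RuntimeException {
        ServerReportedException(String message) { super(message); }
    }

    // Decides whether a failed operation would be retried under the stated rules.
    public static boolean shouldRetry(Exception e, boolean isRead,
                                      boolean replicaSetMode, int retriesLeft) {
        if (!isRead || !replicaSetMode || retriesLeft <= 0) return false; // writes never retried
        if (e instanceof SocketTimeoutException) return false;            // slow query: no retry
        return e instanceof IOException;  // assumption: only driver-side I/O failures qualify
    }

    // Runs the operation, retrying immediately (no delay) while the rules allow it.
    public static <T> T call(Callable<T> op, boolean isRead, boolean replicaSetMode)
            throws Exception {
        int retriesLeft = MAX_READ_RETRIES;
        while (true) {
            try {
                return op.call();
            } catch (Exception e) {
                if (!shouldRetry(e, isRead, replicaSetMode, retriesLeft)) throw e;
                retriesLeft--;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] attempts = {0};
        try {
            call(() -> { attempts[0]++; throw new IOException("connection reset"); }, true, true);
        } catch (IOException e) {
            System.out.println("gave up after " + attempts[0] + " attempts");
        }
    }
}
```

In this model an error relayed by mongos (the hypothetical ServerReportedException) is never an I/O failure from the driver's point of view, so it is surfaced to the application on the first attempt, which matches the stack trace below.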

Comment by Vinay Gupta [ 13/Aug/12 ]

Jeff, the concern is for both reads and writes.

Here is a sample stack trace. (I changed the primary node address to primary-rs0-0.xcom.net just to make it obvious.)
I will find out the version of the driver and let you know soon.

Meanwhile, can you tell us:
1) Are the number of retries and the delay between attempts configurable in any way?
2) What are the default values for the above?

Thx

------------
Caused by: com.mongodb.MongoException: dbclient error communicating with server: primary-rs0-0.xcom.net:27087
at com.mongodb.MongoException.parse(MongoException.java:82)
at com.mongodb.DBApiLayer$MyCollection.__find(DBApiLayer.java:312)
at com.mongodb.DBCursor._check(DBCursor.java:369)
at com.mongodb.DBCursor._hasNext(DBCursor.java:504)
at com.mongodb.DBCursor.hasNext(DBCursor.java:529)
at com.x.infra.xfabric.dao.mongo.BaseDAOImpl.toCollection(BaseDAOImpl.java:199)
at com.x.infra.xfabric.dao.mongo.BaseDAOImpl.toCollection(BaseDAOImpl.java:182)
at com.x.infra.xfabric.dao.mongo.MongoEncryptionKeyDAOImpl.getMaxVesionForKeyPurpose(MongoEncryptionKeyDAOImpl.java:87)
at com.x.infra.xfabric.authz.impl.TokenCipherManager.getCurrentTokenEncryptionKey(TokenCipherManager.java:47)
at com.x.infra.xfabric.dao.mongo.MongoAuthorizationDAOImpl.getAuthorizationByAuthToken(MongoAuthorizationDAOImpl.java:127)
at com.x.infra.xfabric.facade.FabricReadFacade.getAuthorizationByAuthToken(FabricReadFacade.java:85)
at com.x.infra.xfabric.filter.XSSFilterBus.authorizePublisher(XSSFilterBus.java:304)
... 16 more
---------------

Comment by Jeffrey Yemin [ 11/Aug/12 ]

Ah, I see what you're getting at now. It's a bit tricky since a replica set can be without a primary for an arbitrarily long period of time, and the driver generally tends towards fail-fast behavior.

A couple of questions. Is your concern more for reads, for writes, or for both? What version of the driver are you using? Can you provide a sample stack trace for the exception that you're getting? There are a number of code paths where this can come up.

Thanks,

Comment by Igor Bolotin [ 11/Aug/12 ]

The driver does reconnect, but only after failing a few requests that were issued during the re-election period.

It would be great if, instead of throwing these exceptions to the application to handle, the driver would recognize recoverable failures and retry the requests automatically. This behavior can be made optional so as not to break existing applications. It also makes sense to allow the number of retries and the delay between attempts to be configurable.
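
To make the request concrete, here is a minimal application-side sketch of what such an opt-in, configurable retry could look like. Nothing in it is an existing driver API: the class ConfigurableRetry, its constructor parameters (enabled, maxRetries, delayMillis), and the caller-supplied "recoverable" predicate are all assumptions made for illustration.

```java
import java.util.concurrent.Callable;
import java.util.function.Predicate;

// Hypothetical wrapper; not part of the MongoDB Java driver.
class ConfigurableRetry {
    private final boolean enabled;   // opt-in, so existing applications keep fail-fast behavior
    private final int maxRetries;    // retries allowed after the first attempt
    private final long delayMillis;  // pause between attempts

    public ConfigurableRetry(boolean enabled, int maxRetries, long delayMillis) {
        if (maxRetries < 0 || delayMillis < 0) {
            throw new IllegalArgumentException("maxRetries and delayMillis must be >= 0");
        }
        this.enabled = enabled;
        this.maxRetries = maxRetries;
        this.delayMillis = delayMillis;
    }

    // Runs op; on a failure the caller classifies as recoverable, waits and tries again.
    public <T> T execute(Callable<T> op, Predicate<Exception> recoverable) throws Exception {
        int retriesUsed = 0;
        while (true) {
            try {
                return op.call();
            } catch (Exception e) {
                if (!enabled || retriesUsed >= maxRetries || !recoverable.test(e)) {
                    throw e; // disabled, out of retries, or not recoverable: surface the error
                }
                retriesUsed++;
                Thread.sleep(delayMillis);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        ConfigurableRetry retry = new ConfigurableRetry(true, 5, 200L);
        int[] calls = {0};
        // Simulated read that fails twice (as during a re-election) and then succeeds.
        String result = retry.execute(
                () -> {
                    if (calls[0]++ < 2) throw new java.io.IOException("no primary yet");
                    return "document";
                },
                e -> e instanceof java.io.IOException);
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

A wrapper like this is safest around reads; retrying a write that may already have been applied needs extra care on the application side.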

Comment by Jeffrey Yemin [ 10/Aug/12 ]

The Java driver will retry on a stepdown of the primary if you connect in replica set mode (use the Mongo constructor that takes a list of ServerAddress instances). Is that how you are connecting?

Generated at Thu Feb 08 03:12:37 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.