[JAVA-2543] Mongo driver exception "Replication is shutting down" on mongo save during replicaset failover and election process Created: 21/Jun/17 Updated: 27/Oct/23 Resolved: 02/Jan/18 |
|
| Status: | Closed |
| Project: | Java Driver |
| Component/s: | Cluster Management |
| Affects Version/s: | 3.4.1 |
| Fix Version/s: | None |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Garrett Donnelly | Assignee: | Unassigned |
| Resolution: | Gone away | Votes: | 2 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Windows Server 2012 Release 2 |
| Attachments: |
|
| Description |
|
Main Question: Should I change application code to resolve exceptions caused by primary step-down and election of a new primary by re-opening the connection, or should I be able to handle this with the driver timeout settings and a retry of the save on exception?

Background: I have a four-node replica set that we are putting through initial development / resilience tests; we are pre-production. I'm testing failover of the primary by stopping the mongod service. The election occurs and a new primary is elected; there are no errors in the mongod logs. However, I have a process that connects to the replica set over SSL, and when we are running long-running batch jobs performing batch mongo saves, the client gets the error below when the service fails over. The exception is caught, but the mongo save throws again in the window before the new primary is elected. I have tried adjusting electionTimeoutMillis up and down, reducing the heartbeat, etc. We are currently on a MongoDB 3.4.5 database on Windows and the replica set uses protocol version 1. The client connection uses the default timeout values and we write with "majority" write concern. For now I'm simply using the following.

The stacktrace and config are attached in configuration.txt, but the main error I get is:

org.springframework.data.mongodb.UncategorizedMongoDbException: Query failed with error code 11600 and error message 'interrupted at shutdown' on server xxxx; nested exception is com.mongodb.MongoQueryException: Query failed with error code 11600 and error message 'interrupted at shutdown'

I would be grateful for some ideas on whether a settings change would let the client code detect the server problem and then wait for the election process to complete. I've tried pinging the server after the first mongo save, but this hasn't helped. Do we need to re-architect our application code with a more sophisticated approach than retry on the initial exception? Thanks in advance. |
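MongoDB 3.4 and the 3.4.x Java driver predate built-in retryable writes, so one common approach is an application-level retry around the save. The sketch below is only an illustration of that idea, not the reporter's actual code: the connection string, host names, database/collection names, retry count, and sleep interval are placeholder assumptions, and error codes 91 (ShutdownInProgress) and 11600 (InterruptedAtShutdown) are treated as the transient failover errors quoted in this ticket.

```java
import com.mongodb.MongoClient;
import com.mongodb.MongoClientURI;
import com.mongodb.MongoException;
import com.mongodb.WriteConcern;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class RetryingSaveExample {

    // Hypothetical retry wrapper: codes 91 and 11600 are treated as transient
    // errors seen during a failover, so the write is retried after a short
    // pause until a new primary is available or the attempt limit is reached.
    static void insertWithRetry(MongoCollection<Document> collection, Document doc)
            throws InterruptedException {
        int attempts = 0;
        while (true) {
            try {
                collection.insertOne(doc);
                return;
            } catch (MongoException e) {
                attempts++;
                boolean transientError = e.getCode() == 91 || e.getCode() == 11600;
                if (!transientError || attempts >= 5) {
                    throw e; // not a failover error, or retries exhausted
                }
                Thread.sleep(2000); // allow time for the election to complete
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Placeholder connection string; hosts, SSL and replica set name are assumptions.
        MongoClientURI uri = new MongoClientURI(
                "mongodb://host1:27017,host2:27017,host3:27017,host4:27017/?ssl=true&replicaSet=rs0");
        MongoClient client = new MongoClient(uri);
        try {
            MongoCollection<Document> collection = client.getDatabase("test")
                    .getCollection("batch")
                    .withWriteConcern(WriteConcern.MAJORITY);
            insertWithRetry(collection, new Document("example", true));
        } finally {
            client.close();
        }
    }
}
```

Connection-string options such as serverSelectionTimeoutMS and heartbeatFrequencyMS can also be tuned so the driver notices the new primary sooner, but a retry of some form is still needed for writes that were already in flight when the step-down occurred.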
| Comments |
| Comment by Jeffrey Yemin [ 08/Dec/17 ] |
|
gearoid68 I apologize that no one has responded to your initial question. As it's been quite a while now, can I ask what your current status is on this issue and if you still need assistance? Also, note that the JAVA project is for specific questions about the Java driver, and as this is more of an architectural question about how to properly use MongoDB (albeit in a Java environment), it's best handled by asking in the MongoDB user forum. Regards, |
| Comment by Garrett Donnelly [ 23/Jun/17 ] |
|
I have noticed the following in the mongod logs for the new primary after election:

2017-06-15T14:16:04.642+0100 I REPL [ReplicationExecutor] My optime is most up-to-date, skipping catch-up and completing transition to primary.

Is it possible this is linked to the driver error code 91?

org.springframework.dao.DataIntegrityViolationException: Write failed with error code 91 and error message 'Replication is being shut down'; nested exception is com.mongodb.WriteConcernException: Write failed with error code 91 and error message 'Replication is being shut down'

My theory is that the error is caused by the oplog being too small to accommodate replication in a batch scenario, and that the driver is detecting the resulting data loss. Does anyone know whether missing oplog entries can cause the "Replication is shutting down" message?

In more detail, replicating a long-running batch in MongoDB may need a larger oplog, much as a rollback segment has to be sized appropriately for batch work in Oracle. The following advice appears in the MongoDB documentation and seems relevant to my example of 2,000 medium-size documents being saved in batch mode: "For example, you might need to change the oplog size if your applications perform large numbers of multi-updates or deletes in short periods of time."

The oplog is a capped collection, so what appears to be happening is that some of the oplog entries are being lost on step-down of the primary. Increasing the oplog size might mitigate this. There is a rather complex procedure to increase the oplog size which could be attempted. |
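To test the oplog-sizing theory, the replication window the oplog currently covers can be inspected from the driver (the shell equivalent is rs.printReplicationInfo()). This is a rough sketch only, assuming the application has privileges to read the local database; the connection string is a placeholder.

```java
import com.mongodb.MongoClient;
import com.mongodb.MongoClientURI;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

public class OplogInfoExample {
    public static void main(String[] args) {
        // Placeholder connection string; hosts and replica set name are assumptions.
        MongoClient client = new MongoClient(
                new MongoClientURI("mongodb://host1:27017/?replicaSet=rs0"));
        try {
            MongoDatabase local = client.getDatabase("local");

            // collStats on oplog.rs reports the configured maximum size of the capped collection.
            Document stats = local.runCommand(new Document("collStats", "oplog.rs"));
            System.out.println("oplog max size (bytes): " + stats.get("maxSize"));
            System.out.println("oplog current size (bytes): " + stats.get("size"));

            // The first and last entries bound the time window the oplog currently covers.
            Document first = local.getCollection("oplog.rs").find()
                    .sort(new Document("$natural", 1)).limit(1).first();
            Document last = local.getCollection("oplog.rs").find()
                    .sort(new Document("$natural", -1)).limit(1).first();
            if (first != null && last != null) {
                System.out.println("oldest oplog entry ts: " + first.get("ts"));
                System.out.println("newest oplog entry ts: " + last.get("ts"));
            }
        } finally {
            client.close();
        }
    }
}
```

If the window between the oldest and newest entries is shorter than the duration of the batch job plus the failover, a larger oplog would be worth investigating; in MongoDB 3.4 resizing it still requires the documented multi-step procedure rather than a single command.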