[JAVA-4204] Scala failover test leads to a write concern exception Created: 18/Jun/21 Updated: 04/May/22 Resolved: 19/Jul/21

| Status: | Closed |
| Project: | Java Driver |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Valtteri Pirttilä | Assignee: | Ross Lawley |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | external-user |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
| Description |
|
Hi! I set up a test with the 4.2.3 Scala driver and Atlas. The test is a straightforward one: I increment a field recursively and then start a Failover test on Atlas, while the recursive incrementation continues throughout the test. In my test case, I set the recursion to go through the incrementation 200 000 times, as my experimentation showed that to be long enough to last through the entire Failover test. My expectation was that when the primary election is underway, the driver would pause operation during that time and resume once a primary was again available. However, this didn't occur: the driver threw an exception and interrupted processing. The exception was a MongoWriteConcernException. The connection had retryWrites=true&w=majority set. There's not a whole lot in the documentation about how to handle Failover. I was expecting the driver to handle the election without needing exception handling in our own code. Is this assumption correct? If it is, there seems to be an issue with the driver. Below is the relevant code from the test: |
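The original snippet is not included above; a rough, hypothetical reconstruction of such a test against the Scala driver might look like the following (the URI placeholder, database/collection names and the "value" counter field are assumptions, not the reporter's actual code):

{code:scala}
// Hypothetical reconstruction, not the reporter's actual code: a single-writer test
// that repeatedly increments one counter document while an Atlas failover test runs.
import org.mongodb.scala._
import org.mongodb.scala.model.{Filters, UpdateOptions, Updates}
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}

object FailoverIncrementTest extends App {
  implicit val ec: ExecutionContext = ExecutionContext.global

  val client   = MongoClient("mongodb+srv://<user>:<password>@<cluster>/?retryWrites=true&w=majority")
  val counters = client.getDatabase("test").getCollection("counters")

  // Recursively increment the same document 200 000 times, one write at a time.
  def increment(remaining: Int): Future[Unit] =
    if (remaining == 0) Future.unit
    else
      counters
        .updateOne(
          Filters.equal("_id", "failover-test"),
          Updates.inc("value", 1),
          UpdateOptions().upsert(true))
        .toFuture()
        .flatMap(_ => increment(remaining - 1))

  Await.result(increment(200000), Duration.Inf)
  client.close()
}
{code}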
| Comments |
| Comment by Ross Lawley [ 19/Jul/21 ] |
|
Hi valtteri.pirttila@traplightgames.com, Apologies for the lack of response on this ticket. I can now confirm that the error from the Atlas failover test is caused by a bug in the Java driver: the driver is failing to recognize the error as retryable when it should. The error scenario itself has shown to be racy; the nearer the code runs to the server, the higher the likelihood of the race condition occurring. This explains why you've seen this error on the local AWS servers and not others. Marking this ticket as a duplicate of the underlying driver bug. Many thanks for your patience and help with this ticket! Ross |
| Comment by Ross Lawley [ 30/Jun/21 ] |
|
Hi valtteri.pirttila@traplightgames.com, Following on from our conversation on the community forums, I'm reopening this ticket. Ross
|
| Comment by Ross Lawley [ 23/Jun/21 ] |
|
Marking as 'works as designed'. Writes might not be retried even when using retryWrites=true: write concern errors may need further investigation to determine whether the write was successfully replicated. If a primary node goes down during replication of a write such that the write concern cannot be fulfilled, the user will need to determine whether the write succeeded and was eventually replicated to the rest of the replica set. |
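As a hedged illustration of such a check (assuming the single-writer counter test sketched in the description, with the same assumed collection and field names), the application could read the counter back after a MongoWriteConcernException and compare it with the value it expects to have applied:

{code:scala}
// Sketch only, not from the ticket: after a MongoWriteConcernException the write may
// still have been applied on the new primary. With a single writer, one way to check
// is to read the counter back and compare it with the expected value.
import org.bson.BsonNumber
import org.mongodb.scala._
import org.mongodb.scala.model.Filters
import scala.concurrent.{ExecutionContext, Future}

def verifyIncrementApplied(counters: MongoCollection[Document], expectedValue: Long)
                          (implicit ec: ExecutionContext): Future[Boolean] =
  counters
    .find(Filters.equal("_id", "failover-test"))
    .toFuture()                                            // Future[Seq[Document]]
    .map(_.headOption.flatMap(_.get("value")).exists {
      case n: BsonNumber => n.longValue() == expectedValue // counter reached the expected value
      case _             => false
    })
{code}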
| Comment by Ross Lawley [ 23/Jun/21 ] |
|
Hi valtteri.pirttila@traplightgames.com, Thanks for the feedback, I will close this ticket. Ross |
| Comment by Valtteri Pirttilä [ 23/Jun/21 ] |
|
Thank you for your response Ross! I investigated the issue further with the help of the information you provided. Based on several tests, the failover testing feature of Atlas always results in a write concern exception which, if I understand correctly, is not a retryable error. I created further test cases that step down the primary in other ways, and in those tests the driver handled the primary stepdowns and election periods without a problem. I am somewhat saddened that handling the Atlas failover test appears to require application-level handling, the options for which are not well covered in the documentation. However, that is not a driver issue and the report can be marked as resolved. Thank you again for your assistance! Valtteri |
| Comment by Ross Lawley [ 21/Jun/21 ] |
|
Hi valtteri.pirttila@traplightgames.com,
At a certain level that is similar to what occurs: the Java driver has to select a server, and once it determines that the primary has changed it updates its list of servers. Server selection and monitoring are covered by the server selection and the server discovery and monitoring specifications. The Java driver follows the retryable writes specification. Not all errors are retryable; see the determining retryable errors section of the spec for more details. A MongoWriteConcernError may indicate that a write succeeded on the primary (and was possibly replicated to some nodes) but, due to changes in topology, was not able to meet the write concern as a whole. More information may be available in the error message or server logs. There is also the process of retrying a write; if that retry fails (e.g. server selection times out), then the original error is returned. How best to handle failed writes? It really depends on the scenario: idempotent operations can simply be repeated, but not all writes are idempotent in nature and may need custom checks. I hope that helps, Ross |
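To make the last point concrete, here is a hedged sketch (not from the ticket) of application-level handling around the increment: retryable errors are left to the driver, while a MongoWriteConcernException is caught and routed to a custom check such as the read-back sketched earlier. The helper name and the decision logic are illustrative assumptions.

{code:scala}
// Illustrative sketch: wrap a single increment so that a MongoWriteConcernException
// triggers an application-level verification instead of failing the whole run.
// verifyIncrementApplied is the hypothetical helper sketched above.
import com.mongodb.MongoWriteConcernException
import org.mongodb.scala._
import org.mongodb.scala.model.{Filters, Updates}
import scala.concurrent.{ExecutionContext, Future}

def incrementWithCheck(counters: MongoCollection[Document], expectedAfter: Long)
                      (implicit ec: ExecutionContext): Future[Unit] =
  counters
    .updateOne(Filters.equal("_id", "failover-test"), Updates.inc("value", 1))
    .toFuture()
    .map(_ => ())
    .recoverWith {
      case _: MongoWriteConcernException =>
        // The write may have been applied even though the write concern was not met;
        // verify before deciding whether to repeat the (non-idempotent) increment.
        verifyIncrementApplied(counters, expectedAfter).flatMap {
          case true  => Future.unit
          case false => incrementWithCheck(counters, expectedAfter)
        }
    }
{code}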