[SERVER-28916] Make mongos automatically retry failed retryable writes Created: 21/Apr/17 Updated: 30/Oct/23 Resolved: 07/Nov/17
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | 3.6.0-rc4 |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Kaloian Manassiev | Assignee: | Jack Mulrow |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Sprint: | Sharding 2017-11-13 |
| Participants: | |
| Description |
This task should be done once all the other machinery is in place. It is to make mongos automatically retry writes that fail with a network error or another retryable error.
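For illustration, a minimal sketch of the intended behavior. Everything below is hypothetical and simplified, not actual server code; the real implementation classifies errors by ErrorCodes category, and a retryable write is retried at most once:

```cpp
#include <stdexcept>

// Hypothetical error classification, for illustration only.
enum class ErrorCategory { kNetworkError, kNotMasterError, kOther };

struct WriteError : std::runtime_error {
    ErrorCategory category;
    WriteError(ErrorCategory c, const char* what)
        : std::runtime_error(what), category(c) {}
};

bool isRetryable(ErrorCategory c) {
    return c == ErrorCategory::kNetworkError ||
           c == ErrorCategory::kNotMasterError;
}

// Re-issue the write once when the first attempt fails with a
// retryable error; surface all other errors to the client unchanged.
template <typename WriteFn>
void executeRetryableWrite(WriteFn write) {
    try {
        write();  // first attempt
    } catch (const WriteError& e) {
        if (!isRetryable(e.category))
            throw;
        write();  // single retry; the txnNumber makes this safe to re-send
    }
}
```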
| Comments |
| Comment by Githook User [ 07/Nov/17 ] |
Author: Jack Mulrow (username: jsmulrow, email: jack.mulrow@mongodb.com)
Message:
| Comment by Jeremy Mikola [ 02/Nov/17 ] |
kaloian.manassiev clarified that a 3.6 mongos cannot communicate with a 3.4 mongod binary, so the above question is irrelevant.
| Comment by Jeremy Mikola [ 02/Nov/17 ] |
That seems to be at odds with SPEC-980 and the goals of the product team. The driver already decides whether to include lsid and txnNumber, and whether to allow retryable behavior, based on the wire protocol version and the presence of logicalSessionTimeoutMinutes in isMaster. Without such logic, we're prone to assuming that a pre-3.6 server supports retryable writes. If the driver unconditionally sends lsid and txnNumber, we are exposed to accidentally issuing a write twice in the event of a network error. Inferring whether the server actually supports retryable writes from the information we have in isMaster seems like the conservative, safe approach (no risk of executing a write twice), and it also insulates applications from some errors.

Given that
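For reference, a hedged sketch of the driver-side check described above; the struct and field names here are illustrative stand-ins, not a real driver API:

```cpp
#include <optional>

// Illustrative shape of the relevant isMaster fields.
struct IsMasterResponse {
    int maxWireVersion = 0;                           // 6 corresponds to 3.6
    std::optional<int> logicalSessionTimeoutMinutes;  // absent before 3.6
};

// Only attach lsid/txnNumber (and enable retry behavior) when the
// server advertises session support.
bool supportsRetryableWrites(const IsMasterResponse& res) {
    return res.maxWireVersion >= 6 &&
           res.logicalSessionTimeoutMinutes.has_value();
}
```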
| Comment by Kaloian Manassiev [ 02/Nov/17 ] |
The driver should not be in the business of querying the cluster for whether retryable writes are supported or not. If retryWrites=true on the driver, the driver should unconditionally send the sessionId/txnNumber to the server, and the server will decide what to do. For the scenario you described, in the absence of failing for FCV 3.4, the write will return a batch of responses, where some will be successful and some will have an ErrorCodes::NotSupported error code.
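To make the per-item outcome concrete, here is an illustrative sketch (assumed, simplified structures; the real reply is a BSON document with ok: 1 at the top level and a writeErrors array for the failed items):

```cpp
#include <string>
#include <vector>

// Illustrative stand-ins for the batch reply shape.
struct WriteErrorDetail {
    int index;             // position of the failed item in the batch
    std::string codeName;  // e.g. "NotSupported" from a shard that
                           // rejected the txnNumber
};

struct BatchWriteReply {
    bool ok = true;  // top-level status stays OK despite per-item errors
    std::vector<WriteErrorDetail> writeErrors;
};

// Any batch index without an entry in writeErrors succeeded. Note that
// a driver which only inspects `ok` would treat the whole command as
// successful.
bool itemSucceeded(const BatchWriteReply& reply, int index) {
    for (const auto& err : reply.writeErrors) {
        if (err.index == index)
            return false;
    }
    return true;
}
```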
| Comment by Jeremy Mikola [ 02/Nov/17 ] |
In SPEC-980, shane.harvey proposed that we relax the driver behavior when retryWrites=true and the cluster does not support sessions and retryable writes. Instead of raising an error to the user, we will simply leave the transaction ID out of the write command and forgo any retry logic (i.e. behave as if retryWrites=false). Hypothetically, how will mongos operate when distributing a write across three shards, where two shards support retryable writes and the third does not? Will it pass on a transaction ID to the supporting shards (enabling retry behavior for those) and omit the transaction ID and retry behavior for the third? I understand that sessions are intended to be cluster-wide and persist across all shards, but is it possible for the above topology to come about? Perhaps by:
| Comment by Kaloian Manassiev [ 01/Nov/17 ] |
Based on a discussion with the drivers team, it appears that the SDAM retry spec only looks at the top-level error of the response and does not peek inside the statuses of the individual batch entries. As a result, a network error between mongos and a shard during the execution of a write batch will not be retried by the driver, because it results in a response with an OK top-level status but an error in the batch entry. Because of this, we should make the ARS use the kIdempotent retry policy when the request is retryable (i.e. contains a txnNumber). As part of the implementation, we also need to write a unit test to ensure this works as expected.
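A minimal sketch of the proposed policy selection. The RetryPolicy names mirror the server's Shard::RetryPolicy, but the request type here is a simplified stand-in for illustration:

```cpp
#include <optional>

// Simplified stand-ins; mongos's real code carries the txnNumber in the
// request/operation context, and the policy enum lives on Shard.
enum class RetryPolicy { kNoRetry, kIdempotent };

struct ShardWriteRequest {
    std::optional<long long> txnNumber;  // present only for retryable writes
};

// A retryable write carries a txnNumber, so re-sending it to a shard is
// safe: the shard deduplicates by (lsid, txnNumber) and returns the
// previously recorded result instead of executing the write again. The
// ARS can therefore use the kIdempotent policy for such requests.
RetryPolicy chooseRetryPolicy(const ShardWriteRequest& request) {
    return request.txnNumber ? RetryPolicy::kIdempotent
                             : RetryPolicy::kNoRetry;
}
```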