[CDRIVER-4532] Add retry behavior for handshake failures in mongoc_cluster_stream_for_server Created: 05/Dec/22 Updated: 16/Jan/24 |
|
| Status: | Backlog |
| Project: | C Driver |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Unknown |
| Reporter: | Jeremy Mikola | Assignee: | Unassigned |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||||||
| Quarter: | FY25Q1, FY25Q2 | ||||||||||||||||||||
| Description |
|
Quoting mongodb/mongo-c-driver#1141 for
|
| Comments |
| Comment by Ezra Chung [ 11/Jan/23 ] |
|
Moving to backlog, pending further updates in DRIVERS-746 or related tickets regarding concrete actions to take regarding this issue. |
| Comment by Kevin Albertson [ 12/Dec/22 ] |
Given the comment on DRIVERS-746, I assume this is no longer a concern.
The handshake retry may be invalidated by DRIVERS-555 or DRIVERS-1262, so I prefer not to add much additional complexity. But the handshake retry may still be a benefit to users until DRIVERS-555 or DRIVERS-1262 are addressed. This seems reasonable solution in the meantime. I.e. if a serverId option is passed, enable handshake retry inĀ mongoc_cluster_stream_for_server. |
| Comment by Jeremy Mikola [ 09/Dec/22 ] |
|
ezra.chung@mongodb.com: See my comment in DRIVERS-746. In paritcular:
This would have been caught by spec tests in DRIVERS-2247, which don't yet exist. My read of the spec also suggests that handshakes should never be retried for aggregate with $out/$merge, since neither spec considers those to be retryable. If you agree, I think this warrants a separate CDRIVER ticket to change the logic for whether handshakes should be retried to also consider the operation itself (more important for writes IMO, as I think all read ops are retryable). |
| Comment by Jeremy Mikola [ 09/Dec/22 ] |
Looks like the diff link doesn't work because GitHub defaults to hiding mongoc-cluster.c. I was referring to this portion of _mongoc_cluster_stream_for_optype where you add a label iff the operation type is not MONGOC_SS_READ.
I'm not sure how relevant this is, and there is evidently no test coverage for this. I would have expected there to be a test using a non-retryable write (e.g. aggregation with $out) and expecting no error label to be added; however, there don't appear to be any handshakeError tests that expect operations to fail – everything is "<operation> succeeds after retryable handshake server error". Looking back at the original commit for DRIVERS-746, I think the spec contradicts itself. The newly introduced paragraph talks about adding the label based on the retryWrites URI option, but the Q&A section talks about only doing this for retryable writes (with a txnNumber field). I think it's just a coincidence that the commit happened to modify a line in that Q&A entry (just whitespace). I'll follow up on DRIVERS-746 to ask for clarification on this (and potentially open a new ticket if that seems prudent). As for tests to assert that the handshake's retry attempt prevents the subsequent operation from retrying, I suppose that would not apply in drivers that implement CSOT. Likewise, there'd be no way to assert that the handshake failure adds a label (or does not) since CSOT would ultimately allow the subsequent operation to succeed. I think I misunderstood the purpose of mongoc_cluster_stream_for_aggr_with_write. This is necessary because it's a write operation that also takes a read preference, correct? Regarding MONGOC_SS_AGGREGATE_WITH_WRITE: looking back at the logic for adding a RetryableWriteError label, it looks like that might also apply to aggregation writes. That may be a bug, although quite minor as I don't think the label has much impact other than confusing the user. I'm unsure about mongoc_cluster_stream_for_aggr_with_write consulting the retryWrites URI option for initialization of is_retryable, which is later used to determine whether handshakes can be retried. AFAICT, the specs don't really provide any guidance on what to do there. Aggregations are not discussed at all in the retryable writes spec and pipelines containing $out and $merge are explicitly designated as not retryable in the reads spec. I think this is something we could clarify when following up on DRIVERS-746, since it's the same question of how do we handle handshake retries for operations that are otherwise not retryable (i.e. writes with no txnNumber). It seems unfortunate that handshake retries are being conflated with retryable reads/writes (echoing Matt's sentiment in this comment). Talking this through, I now understand why Daria highlighted DRIVERS-2247 (comment) and Patrick referred to DRIVERS-1262 (comment).
This sounds like something the spec is ambiguous about, and potentially may never be addressed now that the spec has been modified to consider CSOT and all previous mentions of "retry-once" behavior removed. That said, I think your understanding is correct given the following:
Given that, I think it makes sense to change the execution methods I referenced above (and potentially others) to use a stream constructor that conditionally allows handshake retries and then use _stream_for_server (i.e. never retries handshakes) internally when you need to create a new stream for retrying an operation. |