[CDRIVER-4532] Add retry behavior for handshake failures in mongoc_cluster_stream_for_server Created: 05/Dec/22  Updated: 16/Jan/24

Status: Backlog
Project: C Driver
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Unknown
Reporter: Jeremy Mikola Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by PHPLIB-1042 Drivers should retry operations if co... Blocked
is depended on by PHPLIB-1033 Improve test coverage for retryable h... Blocked
Related
is related to CDRIVER-4192 Drivers should retry operations if co... Closed
Quarter: FY25Q1, FY25Q2

 Description   

Quoting mongodb/mongo-c-driver#1141 for CDRIVER-4192:

Note, mongoc_cluster_stream_for_server, which is used for other miscellaneous connections to the server, is deliberately not involved in these changes. This function may eventually require similar changes according to DRIVERS-2063.

CDRIVER-4192 implemented retry logic for handshake failure to all mongoc_cluster_stream_for_* methods except mongoc_cluster_stream_for_server. The PHP driver exclusively uses mongoc_cluster_stream_for_server for executing operations, as it performs its own server selection and then specifies a server_id option to libmongoc's execute methods.



 Comments   
Comment by Ezra Chung [ 11/Jan/23 ]

Moving to backlog, pending further updates in DRIVERS-746 or related tickets regarding concrete actions to take regarding this issue.

Comment by Kevin Albertson [ 12/Dec/22 ]

My read of the spec also suggests that handshakes should never be retried for aggregate with $out/$merge, since neither spec considers those to be retryable. If you agree, I think this warrants a separate CDRIVER ticket to change the logic for whether handshakes should be retried to also consider the operation itself (more important for writes IMO, as I think all read ops are retryable).

Given the comment on DRIVERS-746, I assume this is no longer a concern.

Given that, I think it makes sense to change the execution methods I referenced above (and potentially others) to use a stream constructor that conditionally allows handshake retries and then use _stream_for_server (i.e. never retries handshakes) internally when you need to create a new stream for retrying an operation.

The handshake retry may be invalidated by DRIVERS-555 or DRIVERS-1262, so I prefer not to add much additional complexity. But the handshake retry may still be a benefit to users until DRIVERS-555 or DRIVERS-1262 are addressed. This seems reasonable solution in the meantime. I.e. if a serverId option is passed, enable handshake retry inĀ mongoc_cluster_stream_for_server.

Comment by Jeremy Mikola [ 09/Dec/22 ]

ezra.chung@mongodb.com: See my comment in DRIVERS-746. In paritcular:

I think the original libmongoc implementation (CDRIVER-4192) inadvertently introduced logic to retry handshakes for all operations (handshake retry eligibility is determined by the URI options alone and doesn't consider the operation itself).

This would have been caught by spec tests in DRIVERS-2247, which don't yet exist. My read of the spec also suggests that handshakes should never be retried for aggregate with $out/$merge, since neither spec considers those to be retryable. If you agree, I think this warrants a separate CDRIVER ticket to change the logic for whether handshakes should be retried to also consider the operation itself (more important for writes IMO, as I think all read ops are retryable).

Comment by Jeremy Mikola [ 09/Dec/22 ]

I am unable to discern what line you are referring to with the given link, but the link to the original spec change is indeed why the retryableWriteError label is conditionally applied.

Looks like the diff link doesn't work because GitHub defaults to hiding mongoc-cluster.c. I was referring to this portion of _mongoc_cluster_stream_for_optype where you add a label iff the operation type is not MONGOC_SS_READ.

I was not aware of the "only add the label to an error when the client has added a txnNumber to the command" description in the Q&A section.

I'm not sure how relevant this is, and there is evidently no test coverage for this. I would have expected there to be a test using a non-retryable write (e.g. aggregation with $out) and expecting no error label to be added; however, there don't appear to be any handshakeError tests that expect operations to fail – everything is "<operation> succeeds after retryable handshake server error".

Looking back at the original commit for DRIVERS-746, I think the spec contradicts itself. The newly introduced paragraph talks about adding the label based on the retryWrites URI option, but the Q&A section talks about only doing this for retryable writes (with a txnNumber field). I think it's just a coincidence that the commit happened to modify a line in that Q&A entry (just whitespace). I'll follow up on DRIVERS-746 to ask for clarification on this (and potentially open a new ticket if that seems prudent).

As for tests to assert that the handshake's retry attempt prevents the subsequent operation from retrying, I suppose that would not apply in drivers that implement CSOT. Likewise, there'd be no way to assert that the handshake failure adds a label (or does not) since CSOT would ultimately allow the subsequent operation to succeed.


I think I misunderstood the purpose of mongoc_cluster_stream_for_aggr_with_write. This is necessary because it's a write operation that also takes a read preference, correct?

Regarding MONGOC_SS_AGGREGATE_WITH_WRITE: looking back at the logic for adding a RetryableWriteError label, it looks like that might also apply to aggregation writes. That may be a bug, although quite minor as I don't think the label has much impact other than confusing the user.

I'm unsure about mongoc_cluster_stream_for_aggr_with_write consulting the retryWrites URI option for initialization of is_retryable, which is later used to determine whether handshakes can be retried. AFAICT, the specs don't really provide any guidance on what to do there. Aggregations are not discussed at all in the retryable writes spec and pipelines containing $out and $merge are explicitly designated as not retryable in the reads spec.

I think this is something we could clarify when following up on DRIVERS-746, since it's the same question of how do we handle handshake retries for operations that are otherwise not retryable (i.e. writes with no txnNumber).

It seems unfortunate that handshake retries are being conflated with retryable reads/writes (echoing Matt's sentiment in this comment). Talking this through, I now understand why Daria highlighted DRIVERS-2247 (comment) and Patrick referred to DRIVERS-1262 (comment).


my interpretation of the Retryable Reads/Writes specification suggested that only the first connection to a server during server selection via _stream_for_reads or _stream_for_writes was eligible for retryable handshakes according to the rule that "a single retryable handshake error makes any following operations ineligible for retryability". Therefore, _stream_for_server was omitted from retryable handshakes as the retryable handshakes (appeared to) always be handled by the first connection attempt via _stream_for_reads and _stream_for_writes. My interpretation on this may have been incorrect given PHP's use of _stream_for_server. Thoughts?

This sounds like something the spec is ambiguous about, and potentially may never be addressed now that the spec has been modified to consider CSOT and all previous mentions of "retry-once" behavior removed.

That said, I think your understanding is correct given the following:

  • If the initial stream creation encounters a handshake error and retries, then there will be no subsequent attempts to retry for the write operation. I think this is handled by you currently tracking whether a retry occurred on the stream struct.
  • If initial stream creation encountered no error and the operation itself fails, we return to create a new stream. In this case I think it also makes sense to prohibit retrying on a handshake error because the operation failure is consuming our one allowable retry.

Given that, I think it makes sense to change the execution methods I referenced above (and potentially others) to use a stream constructor that conditionally allows handshake retries and then use _stream_for_server (i.e. never retries handshakes) internally when you need to create a new stream for retrying an operation.

Generated at Wed Feb 07 21:21:12 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.