[DRIVERS-942] Consider resuming on aggregate for change streams Created: 21/Feb/20  Updated: 12/Jan/24  Resolved: 13/Dec/23

Status: Closed
Project: Drivers
Component/s: Change Streams, Retryability
Fix Version/s: None

Type: Spec Change
Priority: Major - P3
Reporter: Divjot Arora (Inactive)
Assignee: Kyle Kloberdanz
Resolution: Won't Do
Votes: 0
Labels: jeff+
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Issue split
split to CDRIVER-4750 Consider resuming on aggregate for ch... Closed
split to CSHARP-4823 Consider resuming on aggregate for ch... Closed
split to CXX-2773 Consider resuming on aggregate for ch... Closed
split to GODRIVER-3028 Consider resuming on aggregate for ch... Closed
split to JAVA-5222 Consider resuming on aggregate for ch... Closed
split to MOTOR-1198 Consider resuming on aggregate for ch... Closed
split to NODE-5716 Consider resuming on aggregate for ch... Closed
split to PHPLIB-1293 Consider resuming on aggregate for ch... Closed
split to PYTHON-4013 Consider resuming on aggregate for ch... Closed
split to RUBY-3337 Consider resuming on aggregate for ch... Closed
split to RUST-1784 Consider resuming on aggregate for ch... Closed
Driver Changes: Needed
Quarter: FY24Q4
Engineering Lead: Kevin Albertson
Start date:
Driver Compliance:
Key Status/Resolution FixVersion
CDRIVER-4750 Won't Do
CXX-2773 Won't Do
CSHARP-4823 Won't Do
GODRIVER-3028 Won't Do
JAVA-5222 Won't Do
NODE-5716 Won't Do
MOTOR-1198 Won't Do
PYTHON-4013 Won't Do
PHPLIB-1293 Won't Do
RUBY-3337 Won't Do
RUST-1784 Won't Do

 Description   

SERVER-45505 adds a ResumableChangeStreamError label, which can be included on both aggregate and getMore command responses. The change stream spec says all errors on aggregate are considered fatal. Would it be possible for drivers to instead have a mechanism to resume aggregate attempts in the event of a transient error that has the new error label?
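For illustration, a minimal sketch of how a driver might detect the label on a command error response. It assumes the response is available as a plain dictionary with an `errorLabels` array; the helper name `has_resumable_label` and both sample documents are hypothetical and not taken from any driver.

```python
RESUMABLE_LABEL = "ResumableChangeStreamError"


def has_resumable_label(error_response: dict) -> bool:
    """Return True if a server error document carries the resumable label."""
    # errorLabels is optional, so fall back to an empty list.
    return RESUMABLE_LABEL in error_response.get("errorLabels", [])


# Hypothetical error documents, written by hand for this sketch.
transient = {
    "ok": 0,
    "code": 189,
    "codeName": "PrimarySteppedDown",
    "errorLabels": ["ResumableChangeStreamError"],
}
fatal = {"ok": 0, "code": 13, "codeName": "Unauthorized"}

print(has_resumable_label(transient))
print(has_resumable_label(fatal))
```

The same check would apply to aggregate and getMore responses alike, since the label is attached to the error document rather than to a particular command.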



 Comments   
Comment by Kyle Kloberdanz [ 13/Dec/23 ]

Closing as "Won't Do" for the following reason: Given that aggregate is already retryable, it doesn't seem to add much value to also make it resumable.

A few ideas that we considered are below:

  1. Add the ResumableChangeStreamError label to the criteria for retryable reads
  2. Do not run the aggregate as a retryable read. Instead: resume the aggregate as is done for getMore.
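The first idea could look roughly like the sketch below. It is an assumption-laden illustration, not spec text: `RETRYABLE_READ_CODES` is only a small subset of the codes the retryable reads spec lists, and the function name and parameters are hypothetical.

```python
from typing import List, Optional

RESUMABLE_LABEL = "ResumableChangeStreamError"

# Illustrative subset only; the retryable reads spec lists more codes.
# 6 HostUnreachable, 7 HostNotFound, 89 NetworkTimeout,
# 91 ShutdownInProgress, 189 PrimarySteppedDown
RETRYABLE_READ_CODES = {6, 7, 89, 91, 189}


def is_retryable_read_error(
    code: Optional[int],
    labels: List[str],
    is_network_error: bool = False,
) -> bool:
    """Hypothetical predicate: existing retryable-read checks plus the label."""
    if is_network_error:
        return True
    if code in RETRYABLE_READ_CODES:
        return True
    # The proposed extra criterion from idea 1.
    return RESUMABLE_LABEL in labels


# 43 is CursorNotFound, which is not in the subset above.
print(is_retryable_read_error(43, []))
print(is_retryable_read_error(43, [RESUMABLE_LABEL]))
```

With this shape, an error that is not retryable by code alone becomes retryable once the server attaches the label, which is the only behavioral change idea 1 would introduce.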

Comment by Prashant Mital (Inactive) [ 08/Sep/20 ]

divjot.arora and I propose, in light of the above comments, that we don't do this. Divjot noted that network errors often occur due to things like expired or invalid certificates, in which case we don't really want to retry the initial aggregate any more than retryable reads already does. If jeff.yemin and shane.harvey concur, we are happy to close this as Won't Do.

Comment by Shane Harvey [ 08/Sep/20 ]

To add some more context here: The initial aggregate command issued by watch() is already retryable according to the retryable reads spec. For an example see the "db.coll.watch succeeds on second attempt" test.

Comment by Jeffrey Yemin [ 21/Feb/20 ]

From a chat with bernard.gorman:

I don’t think it’s hugely important, but the reasons that we auto-resume for a getMore are things like shard down, shard in mid-election, etc. All of those are reasons why an initial aggregate might also fail, and are equally transient. It might not make sense for an aggregate that’s establishing an entirely new stream, but the example I gave Divjot was where a customer is attempting to manually resume with an aggregate that has a resume token. In that case, it seems reasonable to auto-retry if we hit a transient exception.
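The manual-resume case described here might be sketched as follows. Everything in it is hypothetical: `TransientError` stands in for a driver error carrying the ResumableChangeStreamError label, `run_aggregate` is a caller-supplied callable, and `flaky_aggregate` simulates one transient failure. Only the `$changeStream` stage and its `resumeAfter` option come from the server.

```python
class TransientError(Exception):
    """Hypothetical stand-in for an error labeled ResumableChangeStreamError."""


def open_stream_with_retry(run_aggregate, resume_token, max_attempts=2):
    """Re-issue the initial aggregate with the same resume token on transient errors."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    pipeline = [{"$changeStream": {"resumeAfter": resume_token}}]
    last_error = None
    for _ in range(max_attempts):
        try:
            return run_aggregate(pipeline)
        except TransientError as exc:
            # Transient failure: keep the token and try again.
            last_error = exc
    raise last_error


calls = []


def flaky_aggregate(pipeline):
    """Simulated aggregate that fails once, then succeeds."""
    calls.append(pipeline)
    if len(calls) == 1:
        raise TransientError("shard in mid-election")
    return "cursor"


result = open_stream_with_retry(flaky_aggregate, {"_data": "hypothetical-token"})
```

Any error other than `TransientError` propagates immediately, so non-transient failures on the aggregate stay fatal.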

Generated at Thu Feb 08 08:22:38 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.