[DRIVERS-926] Consider making ReadConcernMajorityNotAvailableYet a retryable error Created: 09/Mar/20  Updated: 04/Dec/23

Status: Implementing
Project: Drivers
Component/s: Retryability
Fix Version/s: None

Type: Spec Change Priority: Major - P3
Reporter: Pavithra Vetriselvan Assignee: Kyle Kloberdanz
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Issue split
split to JAVA-5224 Make ReadConcernMajorityNotAvailableY... Backlog
split to CXX-2775 Consider making ReadConcernMajorityNo... Backlog
split to GODRIVER-3030 Consider making ReadConcernMajorityNo... Backlog
split to NODE-5718 Make ReadConcernMajorityNotAvailableY... Backlog
split to PHPLIB-1295 Consider making ReadConcernMajorityNo... Backlog
split to RUBY-3339 Make ReadConcernMajorityNotAvailableY... Backlog
split to CDRIVER-4752 Consider making ReadConcernMajorityNo... Closed
split to MOTOR-1200 Consider making ReadConcernMajorityNo... Closed
split to PYTHON-4016 Consider making ReadConcernMajorityNo... Closed
split to RUST-1786 Consider making ReadConcernMajorityNo... Closed
split to CSHARP-4825 Consider making ReadConcernMajorityNo... Scheduled
Related
Driver Changes: Needed
Quarter: FY24Q4
Downstream Changes Summary:

Summary of necessary driver changes

Commits for syncing spec/prose tests
(and/or refer to an existing language POC if needed)

Engineering Lead: Kevin Albertson
Start date:
Driver Compliance:
Key            Status/Resolution  FixVersion
CDRIVER-4752   Done               1.26.0
CXX-2775       Backlog
CSHARP-4825    Scheduled
GODRIVER-3030  Backlog
JAVA-5224      Backlog
NODE-5718      Backlog
MOTOR-1200     Duplicate
PYTHON-4016    Fixed              4.7
PHPLIB-1295    Backlog
RUBY-3339      Backlog
RUST-1786      Fixed              2.8.0

 Description   

This came up during testing for Safe Replica Set Reconfig.

During a safe reconfig, the primary will drop snapshots after writing down a new config document. If a read is issued on this node before it updates its snapshot, the server fails with ReadConcernMajorityNotAvailableYet.

The node should eventually be able to update the committed snapshot through heartbeats (2-second interval), so the read will eventually succeed. It seems like we should treat this as a retryable error.
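
As a rough illustration of the proposal, the sketch below classifies a server error code as a retryable read error. The helper name `is_retryable_read_error` and the table are hypothetical, and the table is only an illustrative subset, not the full list from the retryable reads specification. The numeric values assume the server's usual error codes (134 for ReadConcernMajorityNotAvailableYet, 11602 for InterruptedDueToReplStateChange).

```python
# Illustrative subset of retryable read error codes (assumption: not the
# complete list from the retryable reads specification).
RETRYABLE_READ_ERROR_CODES = {
    134: "ReadConcernMajorityNotAvailableYet",  # the code this ticket proposes adding
    11602: "InterruptedDueToReplStateChange",
}


def is_retryable_read_error(code: int) -> bool:
    """Return True if a read that failed with this code may be retried.

    Hypothetical helper, not part of any driver's public API.
    """
    return code in RETRYABLE_READ_ERROR_CODES
```

With this table, `is_retryable_read_error(134)` is true, while a code outside the table, such as 2 (BadValue), is not retried.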



 Comments   
Comment by Githook User [ 01/Dec/23 ]

Author: Kyle Kloberdanz <kyle.kloberdanz@mongodb.com> (kkloberdanz)

Message: DRIVERS-926 Make ReadConcernMajorityNotAvailableYet a retryable read error (#1479)
Branch: master
https://github.com/mongodb/specifications/commit/a0bac5c874786d49cb6a3647182a9955fd2be94a

Comment by Shane Harvey [ 30/Nov/23 ]

Making ReadConcernMajorityNotAvailableYet a retryable read error makes sense to me.

Comment by Pavithra Vetriselvan [ 10/Mar/20 ]

Oh, got it! Thanks for clarifying. Let me know if you have any more questions about the server's behavior.

Comment by Shane Harvey [ 09/Mar/20 ]

Sorry, I'm not trying to say that we shouldn't retry here. Just trying to point out that our current retry logic may be insufficient to actually address this scenario in practice.

Comment by Pavithra Vetriselvan [ 09/Mar/20 ]

Hmm, I see. The reconfig doesn't cause a state transition, so the node will continue to report itself as primary.

Your comment helps explain why we didn't run into this issue with rollback dropping snapshots. We kill any in-progress reads before transitioning to rollback, and they fail with "InterruptedDueToReplStateChange". We also don't allow any new reads during this state.

I did not realize that the drivers only retry once; that's definitely a good point. The read should eventually succeed if we retry enough times, but it isn't guaranteed to succeed on the first retry.
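
The "retry enough" idea can be sketched as a bounded retry loop with a pause between attempts, which is not what drivers do today. Everything here is hypothetical: `ServerError`, `read_until_snapshot`, and the default attempt count and delay are made up for illustration, and 134 is assumed to be the server's code for ReadConcernMajorityNotAvailableYet.

```python
import time

READ_CONCERN_MAJORITY_NOT_AVAILABLE_YET = 134  # assumed server error code


class ServerError(Exception):
    """Hypothetical stand-in for a driver's server-error exception."""

    def __init__(self, code):
        super().__init__(f"server error code {code}")
        self.code = code


def read_until_snapshot(read, max_attempts=5, delay_seconds=0.5, sleep=time.sleep):
    """Call read() until it succeeds or max_attempts is exhausted.

    Only ReadConcernMajorityNotAvailableYet is retried; any other error,
    or a failure on the last attempt, is re-raised to the caller.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return read()
        except ServerError as exc:
            if exc.code != READ_CONCERN_MAJORITY_NOT_AVAILABLE_YET:
                raise
            if attempt == max_attempts:
                raise
            # Give heartbeats a chance to advance the committed snapshot.
            sleep(delay_seconds)
```

The pause matters because the snapshot is refreshed through heartbeats, so retrying back to back is unlikely to observe a different result.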

I'm curious, what is the user expected to do when receiving a "ReadConcernMajorityNotAvailableYet" error?

Comment by Shane Harvey [ 09/Mar/20 ]

"The node should eventually be able to update the committed snapshot through heartbeats (2 second interval), so the read will eventually succeed."

If a read fails with ReadConcernMajorityNotAvailableYet, does that mean the node is in an unknown SDAM state? (I.e., does the primary stop reporting itself as primary in isMaster responses?)

If the answer is no, then it seems like the retry will most likely immediately proceed (without blocking in server selection) and fail with the same error. If this is the case then it seems like retrying simply delays the error. Note that drivers only retry once.
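
The concern can be made concrete with a toy model, sketched under assumptions: `FakePrimary` and `read_with_single_retry` are hypothetical, no real driver or server is involved, and 134 is assumed to be the server's code for ReadConcernMajorityNotAvailableYet. The fake primary keeps failing for a fixed number of reads, standing in for the window before the committed snapshot is updated.

```python
READ_CONCERN_MAJORITY_NOT_AVAILABLE_YET = 134  # assumed server error code


class FakeOperationFailure(Exception):
    """Hypothetical stand-in for a driver's server-error exception."""

    def __init__(self, code):
        super().__init__(f"server error code {code}")
        self.code = code


class FakePrimary:
    """Toy primary that rejects majority reads until its snapshot is updated."""

    def __init__(self, failures_before_snapshot):
        self.remaining_failures = failures_before_snapshot
        self.attempts = 0

    def read(self):
        self.attempts += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise FakeOperationFailure(READ_CONCERN_MAJORITY_NOT_AVAILABLE_YET)
        return "document"


def read_with_single_retry(server):
    """Retry exactly once, immediately, mirroring the one-retry behavior."""
    try:
        return server.read()
    except FakeOperationFailure as exc:
        if exc.code != READ_CONCERN_MAJORITY_NOT_AVAILABLE_YET:
            raise
        # The node still reports itself as primary, so the retry goes
        # straight back to the same node without blocking.
        return server.read()
```

If the snapshot is updated between the two attempts (`FakePrimary(1)`), the retry succeeds. If it is not (`FakePrimary(2)`), the second attempt raises the same error and the retry has only delayed it.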
