[SERVER-45686] Increase topologyVersion and respond to waiting isMasters on mock State Change Errors from the failCommand failpoint Created: 21/Jan/20 Updated: 27/Oct/23 Resolved: 31/Jan/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Shane Harvey | Assignee: | Jason Chan |
| Resolution: | Gone away | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Sprint: | Repl 2020-02-10 |
| Participants: |
| Description |
|
When the server responds with a State Change Error from the failCommand failpoint, it should also increase topologyVersion and respond to waiting isMasters. The Drivers team uses failCommand extensively in spec tests for retryable writes and reads. Without this change, it takes the client ~10 seconds (maxAwaitTimeMS) to rediscover the server's state. For example:
After this change, the 10-second hang should be removed:
|
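As an illustration only (this is not the example from the original ticket), the following sketch builds the kind of failCommand failpoint document the description refers to. The helper name `fail_command_doc` is hypothetical, and the choice of `insert` and a single failure are assumptions made for the sketch; 10107 is the NotMaster error code.

```python
# Hypothetical helper: it only builds the command document and never contacts a server.
NOT_MASTER = 10107  # error code for NotMaster, one of the State Change Errors


def fail_command_doc(commands, error_code=NOT_MASTER, times=1):
    """Return a configureFailPoint document for the failCommand failpoint."""
    return {
        # The command name must be the first key of the document.
        "configureFailPoint": "failCommand",
        # Fail the next `times` matching commands, then turn the failpoint off.
        "mode": {"times": times},
        "data": {"failCommands": list(commands), "errorCode": error_code},
    }


doc = fail_command_doc(["insert"])
```

A driver test would typically send such a document to the `admin` database of a server started with test commands enabled, then run the retryable operation and time how long the retry takes.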
| Comments |
| Comment by Tess Avitabile (Inactive) [ 30/Jan/20 ] |
|
Ah, thank you for figuring that out! Should this ticket be closed? jason.chan, I'm sorry we didn't catch this before you wrote the code. |
| Comment by Shane Harvey [ 30/Jan/20 ] |
|
After testing against the latest server (v4.3.3-54-gd1fe174) I no longer believe failCommand with a State Change Error is a problem. Sorry for the confusion. At the time I was testing with a server that did not add topologyVersion to State Change Errors. Now that I'm testing with a server that does, this is the behavior I see:
We still have to deal with a similar problem (10-second pauses) when failCommand closes the connection, but I think we will solve it on the client side.
Please ignore this. The current behavior of topologyVersion solves this problem too. |
| Comment by Tess Avitabile (Inactive) [ 30/Jan/20 ] |
|
jason.chan and I discussed this.
Hopefully, in practice the retry will not take 10 seconds. The driver should have an up-to-date status for the primary, so it will retry against the primary. Does that sound right to you? Maybe I'm missing something. |
| Comment by Shane Harvey [ 29/Jan/20 ] |
|
1. Yes, topologyVersion needs to change on mongos any time it returns a State Change Error to the client. Otherwise, a NotMaster error returned from a mongos could cause a driver's retryable write/read to take 10 seconds. Drivers reset a server to Unknown on State Change Errors regardless of whether it is a standalone, mongos, or replica set member, so I think standalones should have the same behavior. Number 1 brings up an interesting case. When directly connected to a secondary, every write will fail with NotMaster and trigger a retry. If, in practice, the server does not increment the topologyVersion, then each retry will take 10 seconds. This seems like an oversight/bug in retryable reads/writes that can be addressed separately. |
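The 10-second figure discussed above comes from an awaitable isMaster that stays blocked until the topologyVersion changes or maxAwaitTimeMS elapses. The toy model below shows that timing difference; it is a sketch under assumptions, not server code. `MockServer`, `bump_topology_version`, `awaitable_is_master`, and `timed_wait` are hypothetical names, and the timeouts are shortened from 10 seconds so the sketch runs quickly.

```python
import threading
import time


class MockServer:
    """Toy model of a topologyVersion counter and an awaitable isMaster."""

    def __init__(self):
        self._cond = threading.Condition()
        self.counter = 0  # stands in for topologyVersion.counter

    def bump_topology_version(self):
        # Models the server reacting to a State Change Error:
        # increment the counter and wake every waiting isMaster.
        with self._cond:
            self.counter += 1
            self._cond.notify_all()

    def awaitable_is_master(self, known_counter, max_await_s):
        # Block until the counter moves past what the client already knows,
        # or until the await timeout elapses, then report the current counter.
        with self._cond:
            self._cond.wait_for(
                lambda: self.counter > known_counter, timeout=max_await_s
            )
            return self.counter


def timed_wait(server, bump_after_s, max_await_s):
    """Return how long one awaitable isMaster call blocked, in seconds."""
    known = server.counter
    if bump_after_s is not None:
        threading.Timer(bump_after_s, server.bump_topology_version).start()
    start = time.monotonic()
    server.awaitable_is_master(known, max_await_s)
    return time.monotonic() - start
```

Calling `timed_wait(MockServer(), 0.05, 1.0)` returns shortly after the bump, while `timed_wait(MockServer(), None, 0.2)` blocks for the full timeout, which mirrors the retry delay described in this comment.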
| Comment by Jason Chan [ 29/Jan/20 ] |
|
shane.harvey, some requirements questions came up during the implementation that we are hoping you can help answer: 2. The design includes the category of ShutdownErrors as a State Change Error. Does the Drivers team still expect the TopologyVersion to be incremented on shutdown errors? This does not simulate the true server behaviour, since the server won't increment TopologyVersion on a real shutdown, but we are wondering if this would be helpful for drivers spec tests. |