[SERVER-50500] Increment topologyVersion for state change errors in failCommand Created: 24/Aug/20 Updated: 10/Sep/20 Resolved: 10/Sep/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Divjot Arora (Inactive) | Assignee: | Tess Avitabile (Inactive) |
| Resolution: | Won't Do | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Sprint: | Repl 2020-09-07, Repl 2020-09-21 |
| Participants: |
| Description |
|
This was originally proposed in EDIT: My comment about the 0.5s lag only apply to a subset of state change errors on 4.4 where the response does not contain a topologyVersion field. The drivers tests addressed by this ticket run on 4.4+, so it'd be helpful to have this functionality backported. I think we'd need to first backport |
| Comments |
| Comment by Tess Avitabile (Inactive) [ 10/Sep/20 ] |
|
Cool, thanks for letting me know! |
| Comment by Divjot Arora (Inactive) [ 10/Sep/20 ] |
|
tess.avitabile After discussing this some more with Shane, we don't think it'll be a good idea to implement this in the server. Adding this would actually introduce a race in drivers between the application and monitoring threads because the application thread would want to mark the server Unknown and the monitoring threads would want to continue marking the server as known based on the streamed isMaster responses. Adding a failpoint like configureMaxAwaitTimeMS would work, but I'd prefer to avoid adding extra failpoints that we'd have to continue to support indefinitely if we can work around this by setting a low heartbeatFrequencyMS in drivers. Given all of that, I think we can close out this ticket. |
| Comment by Tess Avitabile (Inactive) [ 09/Sep/20 ] |
|
divjot.arora, I just wanted to check whether you're still interested in this ticket or if you decided to pursue a different solution, such as canceling in-progress monitor checks for shutdown errors without a topology version. |
| Comment by Tess Avitabile (Inactive) [ 01/Sep/20 ] |
|
Thanks, divjot.arora. Would it be concerning that the server increments its topology version without actually changing its state (and returning something different in the isMaster response)? |
| Comment by Divjot Arora (Inactive) [ 01/Sep/20 ] |
|
shane.harvey A new failpoint could work, but I think it'd be worth considering the option where we cancel in-progress monitor checks for shutdown errors without a topology version. tess.avitabile The driver tests primarily care that the server is marked Unknown when an application thread sees a state change error. Right now, however, the error triggered via failCommand causes the application thread to mark the server Unknown, but the server is not re-discovered for ~10s because the server is actually healthy so the in-progress streaming check does not receive a new response. |
| Comment by Tess Avitabile (Inactive) [ 01/Sep/20 ] |
|
I agree that it would be better to make the errors from failCommand act more like real state change errors. It's a little worrisome that today, we simulate the server responding with a NotMaster error without the topologyVersion changing, since that doesn't match real server behavior. However, making failCommand increment the topologyVersion won't match real server behavior either, since then we would increment the topologyVersion without changing anything in the isMaster response. I'm interested to learn more about what behavior the drivers tests want to simulate in the server, in order to know the right way to solve this. For the purpose of speeding up the tests, adding a failpoint like configureMaxAwaitTimeMS would work for me. |
| Comment by Shane Harvey [ 31/Aug/20 ] |
|
Divjot and I talked about this issue when he created the ticket and I was in favor of the change because it would make the errors from failCommand act more like real state change errors. However given your response it sounds like we're asking to backport a lot of the quiesce mode project to 4.4. The best we can do in some drivers right now is set maxAwaitTimeMS to 500ms. That's as low as it can go due to API restrictions in some drivers. This means that some tests take at least 500ms which adds up quite fast when running 10's/100's of these tests. Perhaps we can add a failpoint like configureMaxAwaitTimeMS to override the maxAwaitTimeMS on all isMaster commands? Then we could speed up these slow tests by setting the maxAwaitTimeMS to 5ms with the failpoint. Another workaround is that when a driver sees a shutdown error without topologyVersion it can cancel the monitor heartbeat (similar to what we do for network errors). This would trigger the next heartbeat sooner than heartbeatFrequencyMS but probably not faster than 500ms. What do you think Divjot? |
| Comment by Tess Avitabile (Inactive) [ 31/Aug/20 ] |
|
Hi divjot.arora. I have some of the same concerns about this work as previously. While topology version now changes on mongos, it still does not change on standalone, and it doesn't change on mongos in 4.4, so we wouldn't want to backport incrementing the topology version on mongos to 4.4. Also, it's important that we not backport Do you think there are other workarounds we could consider? Some things we discussed on |